Databricks announced that it has open-sourced Delta Lake, a storage layer that makes it easier to ensure data integrity as new data flows into an enterprise’s data lake by bringing ACID transactions to these big data repositories. TechCrunch has an article detailing why this is a big deal.

The tool provides the ability to enforce specific schemas (which can be changed as necessary), to create snapshots and to ingest streaming data or backfill the lake as a batch job. Delta Lake also uses the Spark engine to handle the metadata of the data lake (which by itself is often a big data problem). Over time, Databricks also plans to add an audit trail, among other things.
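To make schema enforcement and snapshots a little more concrete, here is a toy, in-memory sketch in plain Python. It is not the Delta Lake API and does not use Spark. The class `ToyDeltaTable` and its methods `append` and `snapshot` are hypothetical names invented for illustration. The sketch assumes a table can be modeled as an append-only log of commits, where each commit is checked against a fixed schema before it becomes visible and each commit number doubles as a snapshot version.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """Toy in-memory table with a commit log. Hypothetical; NOT the Delta Lake API."""
    schema: dict                               # column name -> expected Python type
    log: list = field(default_factory=list)    # one entry (a list of rows) per commit

    def append(self, rows):
        """Validate every row against the schema, then commit all rows or none."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for col, expected in self.schema.items():
                if not isinstance(row[col], expected):
                    raise ValueError(f"column {col!r} expects {expected.__name__}")
        self.log.append(list(rows))   # the batch becomes visible only at this point
        return len(self.log) - 1      # version number of the new snapshot

    def snapshot(self, version=None):
        """Return the rows visible as of `version` (the latest if omitted)."""
        if version is None:
            version = len(self.log) - 1
        return [row for commit in self.log[: version + 1] for row in commit]


table = ToyDeltaTable(schema={"user_id": int, "event": str})
v0 = table.append([{"user_id": 1, "event": "click"}])
v1 = table.append([{"user_id": 2, "event": "view"}, {"user_id": 3, "event": "click"}])

try:
    # One bad row rejects the whole batch, so the table is left unchanged.
    table.append([{"user_id": 4, "event": "view"}, {"user_id": "5", "event": "view"}])
except ValueError as err:
    rejected = str(err)
```

Reading `table.snapshot(v0)` returns only the first commit, while `table.snapshot()` returns everything committed so far. The rejected batch never appears in either view, which is the all-or-nothing behavior that ACID transactions are meant to guarantee.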

Apache Spark is 10 years old. This article in Analytics India Magazine explores what led to Spark’s widespread adoption and what will keep it going into the future.

Spark has been dubbed the official “in-memory replacement for MapReduce”, the disk-based computational engine at the heart of early Hadoop clusters. Spark took off because it reflects a shift in the processing paradigm toward more memory-intensive pipelines: if your cluster has a decent amount of memory, processing in Spark will be faster, and its API is simpler than MapReduce’s. Spark is also faster because most operations (including reads) are distributed, so their processing time shrinks roughly in proportion to the number of machines.

Forbes points out that the term “Big Data” has been eclipsed by “Data Science” in the hype cycle. However, the Great Hype Cycle resembles Game of Thrones and I think we can all agree that “AI” or “Machine Learning” is next to sit on the Iron Throne of Hype.

In a world in which “big data” and “data science” seem to adorn every technology-related news article and social media post, have the terms finally reached public interest saturation? As the use of large amounts of data has become mainstream, is the role of “data science” replacing the hype of “big data?”

It never hurts to practice the fundamentals, and understanding SQL is fundamental to any well-rounded data scientist. Here’s an interesting close-up look at T-SQL, the SQL “dialect” found in SQL Server.

Like any programming language, T-SQL has its share of common bugs and pitfalls, some of which cause incorrect results while others cause performance problems. In many of those cases, there are best practices that can help you avoid getting into trouble. I surveyed fellow Microsoft Data Platform MVPs asking […]