No need for a DeLorean: you can time travel with Databricks.

You can use the features of Delta Lake, but what is actually happening under the covers? We will walk you through the concepts of ACID transactions, the Delta time machine, and the transaction protocol, and explain how Delta Lake brings reliability to data lakes. Organizations can finally standardize on a clean, centralized, versioned big data repository in their own cloud storage for analytics.

Data engineers can simplify their pipelines and roll back bad writes (a short sketch follows this list).
Data scientists can manage their experiments better.
Data analysts can do easy reporting.
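
To make the time travel and rollback ideas above concrete, here is a minimal PySpark sketch. It is illustrative only: the table path is hypothetical, and it assumes a Spark 3.x session with the Delta Lake package available.

```python
from pyspark.sql import SparkSession

# Assumes Spark 3.x with the Delta Lake package, e.g. started with:
#   pyspark --packages io.delta:delta-core_2.12:0.7.0
spark = (SparkSession.builder
         .appName("delta-time-travel")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/events"  # hypothetical Delta table location

# Time travel: read the table as it was at an earlier version ...
v1 = spark.read.format("delta").option("versionAsOf", 1).load(path)

# ... or as it was at a given point in time.
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2020-10-01 00:00:00")
            .load(path))

# Roll back a bad write by overwriting the table with a good snapshot.
v1.write.format("delta").mode("overwrite").save(path)
```

Every write is an atomic commit recorded in the transaction log, which is what makes both the versioned reads and the rollback safe.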

Delta Lake is an open-source storage layer that brings ACID transactions and time travel to Apache Spark and big data workloads.

The latest release, Delta Lake 0.7.0, requires Apache Spark 3.0; among its features is full coverage of SQL DDL and DML commands.
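
As a rough sketch of what that SQL coverage looks like (reusing the `spark` session configured above; the table and column names are made up for illustration):

```python
# Each statement below is a separate ACID commit on the Delta table.
spark.sql("CREATE TABLE IF NOT EXISTS events (id BIGINT, action STRING) USING DELTA")
spark.sql("INSERT INTO events VALUES (1, 'click'), (2, 'view')")
spark.sql("UPDATE events SET action = 'tap' WHERE id = 1")
spark.sql("DELETE FROM events WHERE id = 2")
spark.sql("DESCRIBE HISTORY events").show()  # inspect the table's commit history
```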

On the latest episode of Data Brew, Denny Lee talks to Michael Armbrust about Delta Lake.

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
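
A minimal sketch of that batch/streaming unification, again with hypothetical paths: the same Delta table can serve as a batch source, a streaming source, and a streaming sink.

```python
# Batch read of a Delta table ...
batch_df = spark.read.format("delta").load("/tmp/delta/events")

# ... and a streaming read of the very same table.
stream_df = spark.readStream.format("delta").load("/tmp/delta/events")

# Continuously copy new rows into a second Delta table.
query = (stream_df.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/delta/_checkpoints/events_copy")
         .outputMode("append")
         .start("/tmp/delta/events_copy"))
```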

For our “Demystifying Delta Lake” session, we will interview Michael Armbrust – committer and PMC member of Apache Spark™ and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Delta Lake.

Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational (OLTP) database and replays those changes promptly to external storage, such as Delta Lake or Apache Kudu, for real-time OLAP.

To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether pipelines for a variety of databases can be built with little code. This talk shares practices for simplifying CDC pipelines with Spark Streaming SQL and Delta Lake.
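
A simplified sketch of the core step, applying a batch of change-log records to a Delta table with MERGE; the table, columns, and change_type values here are assumptions for illustration, not the talk's actual code:

```python
from delta.tables import DeltaTable

# Hypothetical batch of CDC records replayed from a binlog.
changes = spark.createDataFrame(
    [(1, "alice@new.com", "UPDATE"),
     (2, None,            "DELETE"),
     (3, "carol@x.com",   "INSERT")],
    ["id", "email", "change_type"],
)

target = DeltaTable.forPath(spark, "/tmp/delta/customers")  # hypothetical target

# Apply deletes, updates, and inserts in one atomic MERGE commit.
(target.alias("t")
 .merge(changes.alias("c"), "t.id = c.id")
 .whenMatchedDelete(condition="c.change_type = 'DELETE'")
 .whenMatchedUpdate(condition="c.change_type = 'UPDATE'",
                    set={"email": "c.email"})
 .whenNotMatchedInsert(condition="c.change_type = 'INSERT'",
                       values={"id": "c.id", "email": "c.email"})
 .execute())
```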

Databricks recently streamed this tech chat on SCD, or Slowly Changing Dimensions.

We will discuss a popular online analytical processing (OLAP) fundamental – slowly changing dimensions (SCD) – specifically Type 2.

As we have discussed in various other Delta Lake tech talks, the reliability that Delta Lake brings to data lakes has sparked a resurgence of many data warehousing fundamentals, such as Change Data Capture, in data lakes.

Type 2 SCD within data warehousing allows you to keep track of both historical and current data over time. We will discuss how to apply these concepts to your data lake, in the context of market segmentation for a climbing eCommerce site.
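
As a hedged illustration of the Type 2 pattern on Delta Lake (the schema and paths are invented, and the talk's actual approach may differ, e.g. by doing both steps in a single MERGE):

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

dim = DeltaTable.forPath(spark, "/tmp/delta/dim_customers")  # hypothetical dimension

# Latest attributes per customer; columns are illustrative.
updates = spark.createDataFrame(
    [(1, "bouldering", "2020-11-01")],
    ["customer_id", "segment", "effective_date"],
)

# Step 1: close out the current rows that are being superseded.
(dim.alias("t")
 .merge(updates.alias("u"),
        "t.customer_id = u.customer_id AND t.is_current = true")
 .whenMatchedUpdate(set={"is_current": "false",
                         "end_date": "u.effective_date"})
 .execute())

# Step 2: append the new versions as the current rows.
(updates
 .withColumn("is_current", F.lit(True))
 .withColumn("end_date", F.lit(None).cast("string"))
 .write.format("delta").mode("append").save("/tmp/delta/dim_customers"))
```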

Databricks recently hosted this online tech talk on how Delta Lake helps address GDPR and CCPA compliance.

The General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) both aim to guarantee strong protection for individuals regarding their personal data, and they apply to businesses that collect, use, or share consumer data, whether the information was obtained online or offline. Compliance remains a top priority, and companies are spending significant time and resources on meeting GDPR and CCPA requirements.

Your organization may manage hundreds of terabytes of personal information in the cloud. Bringing these datasets into GDPR and CCPA compliance is of paramount importance, but this can be a big challenge, especially for larger datasets stored in data lakes.
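
At a high level, Delta Lake makes the "delete this person's data" part tractable: a point delete is a single ACID commit, and VACUUM then physically removes the files that still contain the old rows. A minimal sketch with a hypothetical table:

```python
from delta.tables import DeltaTable

users = DeltaTable.forPath(spark, "/tmp/delta/users")  # hypothetical personal-data table

# Delete a data subject's records in one atomic commit.
users.delete("user_id = 42")

# Later, physically remove the stale files that still hold the deleted rows
# once they fall outside the retention window (168 hours = 7 days).
users.vacuum(retentionHours=168)
```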

Databricks recently hosted an online tech talk on using Delta Lake for predictive maintenance.

Predictive maintenance (PdM) differs from routine, time-based maintenance approaches: it combines various sensor readings with sophisticated analytics on thousands of logged events in near real time, and it promises severalfold improvements in cost savings because tasks are performed only when warranted.

The top industries leading the IoT revolution include manufacturing, transportation, utilities, healthcare, consumer electronics, and automotive. The global market for predictive maintenance is expected to grow at a CAGR of 28%. PdM plays a key role in Industry 4.0, helping corporations not only reduce unplanned downtime but also improve productivity and safety.

The collaborative data and analytics platform from Databricks is a great technology fit for these use cases, providing a single unified platform to ingest the sensor data, perform the necessary transformations and exploration, run machine learning, and generate valuable insights.
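
A rough sketch of the ingest step under those assumptions (the paths and sensor schema are hypothetical): stream raw sensor events into a Delta table, which downstream ML training can then read as an ordinary batch DataFrame.

```python
from pyspark.sql import functions as F

# Continuously ingest JSON sensor events from cloud storage.
raw = (spark.readStream
       .format("json")
       .schema("device_id STRING, temperature DOUBLE, ts TIMESTAMP")
       .load("/mnt/iot/raw"))

ingest = (raw
          .withColumn("date", F.to_date("ts"))
          .writeStream
          .format("delta")
          .option("checkpointLocation", "/mnt/iot/_checkpoints/sensors")
          .partitionBy("date")
          .start("/mnt/iot/sensors"))

# Downstream, model training reads the same table as a batch DataFrame.
training_df = spark.read.format("delta").load("/mnt/iot/sensors")
```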