Change Data Capture (CDC) is a common use case in real-time data warehousing. It tracks the change log (binlog) of a relational (OLTP) database and replays those changes in near real time to an external store, such as Delta Lake or Apache Kudu, for real-time OLAP.

To implement a robust CDC streaming pipeline, many factors must be considered: how to ensure data accuracy, how to handle schema changes in the OLTP source, and how to support a variety of databases with minimal code. This talk shares practical techniques for simplifying CDC pipelines with Spark Streaming SQL and Delta Lake.
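To make the replay step concrete, here is a minimal sketch of applying binlog-derived change events to a Delta table with a MERGE. The `changes` source (with columns `id`, `data`, and `op`, one of 'insert', 'update', or 'delete') and both paths are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical: parsed binlog events with columns (id, data, op).
changes = spark.read.json("/tmp/cdc/changes")

# Hypothetical Delta target kept in sync with the OLTP source.
target = DeltaTable.forPath(spark, "/tmp/cdc/target")

(target.alias("t")
 .merge(changes.alias("c"), "t.id = c.id")
 .whenMatchedDelete(condition="c.op = 'delete'")
 .whenMatchedUpdate(condition="c.op = 'update'", set={"data": "c.data"})
 .whenNotMatchedInsert(condition="c.op != 'delete'",
                       values={"id": "c.id", "data": "c.data"})
 .execute())
```

In practice the change feed usually arrives as a stream, in which case the same MERGE can run inside a `foreachBatch` handler.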

Databricks recently streamed this tech chat on SCD, or Slowly Changing Dimensions.

We will discuss a popular online analytics processing (OLAP) fundamental – slowly changing dimensions (SCD) – specifically Type-2.

As we have discussed in various other Delta Lake tech talks, the reliability that Delta Lake brings to data lakes has sparked a resurgence of data warehousing fundamentals such as Change Data Capture.

Type 2 SCD within data warehousing allows you to keep track of both historical and current data over time. We will discuss how to apply these concepts to your data lake in the context of market segmentation for a climbing eCommerce site.
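As a concrete illustration, here is a minimal sketch of a Type 2 upsert using Delta Lake's MERGE, following the common staged-update pattern. The path and the dimension columns (`customer_id`, `segment`, `current`, `effective_date`, `end_date`) are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical Type 2 dimension: one 'current = true' row per customer.
customers = DeltaTable.forPath(spark, "/mnt/dim/customers")
# Hypothetical batch of updates: customer_id, segment, effective_date.
updates = spark.read.parquet("/mnt/dim/updates")

# Rows whose segment changed need two actions: expire the old version
# and insert the new one. Stage them twice: once with a NULL mergeKey
# (falls through to the insert clause) and once with a real mergeKey.
rows_to_insert = (
    updates.alias("u")
    .join(customers.toDF().alias("c"), "customer_id")
    .where("c.current = true AND u.segment <> c.segment")
    .selectExpr("NULL AS mergeKey", "u.*"))

staged = rows_to_insert.union(
    updates.selectExpr("customer_id AS mergeKey", "*"))

(customers.alias("c")
 .merge(staged.alias("s"), "c.customer_id = s.mergeKey")
 .whenMatchedUpdate(          # expire the old current row
     condition="c.current = true AND c.segment <> s.segment",
     set={"current": "false", "end_date": "s.effective_date"})
 .whenNotMatchedInsert(       # insert the new current version
     values={"customer_id": "s.customer_id",
             "segment": "s.segment",
             "current": "true",
             "effective_date": "s.effective_date",
             "end_date": "null"})
 .execute())
```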


Databricks recently hosted this online tech talk on Delta Lake.

The General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) both aim to guarantee strong protection for individuals regarding their personal data, and both apply to businesses that collect, use, or share consumer data, whether the information was obtained online or offline. Compliance remains a top priority, and companies are spending significant time and resources on meeting GDPR and CCPA requirements.

Your organization may manage hundreds of terabytes of personal information in the cloud. Bringing these datasets into GDPR and CCPA compliance is of paramount importance, but doing so can be a major challenge, especially for large datasets stored in data lakes.
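As a sketch of what a "right to be forgotten" request can look like on a Delta table (the path, key column, and value are hypothetical), a DELETE handles the logical removal and VACUUM later makes it physical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

users = DeltaTable.forPath(spark, "/mnt/lake/users")  # hypothetical path

# Logically delete every row belonging to the data subject.
users.delete("user_id = '12345'")  # hypothetical key column and value

# Earlier versions stay reachable via time travel until VACUUM removes
# data files outside the retention window (default: 7 days), which is
# what makes the deletion physical and permanent.
users.vacuum()
```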

Databricks recently hosted an online tech talk on Delta Lake.

Predictive Maintenance (PdM) differs from routine or time-based maintenance approaches: it combines various sensor readings with sophisticated analytics over thousands of logged events in near real time, and it promises severalfold cost savings because maintenance tasks are performed only when warranted.

The top industries leading the IoT revolution include manufacturing, transportation, utilities, healthcare, consumer electronics, and automotive. The global predictive maintenance market is expected to grow at a CAGR of 28%. PdM plays a key role in Industry 4.0, helping corporations not only reduce unplanned downtime but also improve productivity and safety.

The collaborative data and analytics platform from Databricks is a great technology fit for these use cases, providing a single unified platform to ingest the sensor data, perform the necessary transformations and exploration, run machine learning, and generate valuable insights.
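As one example of the ingestion step, here is a minimal Structured Streaming sketch that lands raw sensor readings in a Delta "bronze" table; the schema and paths are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical: JSON sensor readings arriving in cloud storage.
readings = (
    spark.readStream
    .format("json")
    .schema("device_id STRING, ts TIMESTAMP, temperature DOUBLE")
    .load("/mnt/iot/landing"))

# Continuously append the raw events to a Delta table; downstream jobs
# can then transform, explore, and train models on the same data.
(readings.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/iot/_checkpoints/bronze")
 .outputMode("append")
 .start("/mnt/iot/bronze"))
```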

Databricks hosted this tech talk on Delta Lake.

Data, like our experiences, is always evolving and accumulating. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions – new ways of seeing things we had no conception of before. These mental models are not unlike a table’s schema, defining how we categorize and process new information.

This brings us to schema management. As business problems and requirements evolve over time, so too does the structure of your data. With Delta Lake, incorporating new dimensions as the data changes is easy. Users have access to simple semantics to control the schema of their tables. These tools include schema enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, and schema evolution, which enables them to automatically add new columns of rich data when those columns belong. In this webinar, we'll dive into the use of these tools; a short code sketch follows the list below.

In this webinar you will learn about:

  • Understanding table schemas and schema enforcement
  • How does schema enforcement work?
  • How is schema enforcement useful?
  • Preventing data dilution
  • How does schema evolution work?
  • How is schema evolution useful?
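To make these two behaviors concrete, here is a minimal sketch (the path and columns are hypothetical): schema enforcement rejects an append that carries an unexpected column, while opting in with `mergeSchema` evolves the table to include it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/tmp/delta/loans"  # hypothetical table location

# Create a table with two columns: (loan_id, amount).
spark.createDataFrame([(1, 1000.0)], "loan_id INT, amount DOUBLE") \
    .write.format("delta").save(path)

new_rows = spark.createDataFrame(
    [(2, 2500.0, "CA")], "loan_id INT, amount DOUBLE, state STRING")

# Schema enforcement: this append fails with an AnalysisException
# because `state` is not part of the table's schema.
# new_rows.write.format("delta").mode("append").save(path)

# Schema evolution: explicitly opting in adds the new column, and
# existing rows read back with `state` as NULL.
new_rows.write.format("delta") \
    .option("mergeSchema", "true").mode("append").save(path)
```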


Here's an online tech talk hosted by Denny Lee, Developer Advocate at Databricks, with Burak Yavuz, Software Engineer at Databricks.

Link to Notebook.

The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel, and more. In this session, we'll explore what the Delta Lake transaction log is, how it works at the file level, and how it offers an elegant solution to the problem of multiple concurrent reads and writes; a sketch of the log's file-level layout follows the list below.

In this tech talk you will learn about:

  • What is the Delta Lake transaction log?
  • What is the transaction log used for?
  • How does the transaction log work?
  • Reviewing the Delta Lake transaction log at the file level
  • Dealing with multiple concurrent reads and writes
  • How the Delta Lake transaction log supports other use cases, including time travel, data lineage, and debugging
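At the file level, the log lives in a `_delta_log` directory alongside the data, as a sequence of numbered JSON commit files with one action (`add`, `remove`, `metaData`, `commitInfo`, ...) per line. Here is a minimal sketch that walks the commits of a hypothetical local table:

```python
import json
import os

# Hypothetical local Delta table; commit files are named like
# 00000000000000000000.json, 00000000000000000001.json, ...
log_dir = "/tmp/delta/events/_delta_log"

for name in sorted(os.listdir(log_dir)):
    if not name.endswith(".json"):
        continue  # skip checkpoints (.parquet) and other files
    print(f"--- commit {name} ---")
    with open(os.path.join(log_dir, name)) as f:
        for line in f:
            action = json.loads(line)
            # Each line holds exactly one action keyed by its type.
            print(" ", list(action.keys())[0])
```

Readers reconstruct the table's state by replaying these actions in order, which is what makes features like time travel (replaying up to an earlier version) possible.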