Apache Spark has become the de facto open-source standard for big data processing, thanks to its ease of use and performance.

The open-source Delta Lake project improves data reliability on Spark with capabilities such as ACID transactions, schema enforcement, and time travel.

Watch this webinar to learn how Apache Spark 3.0 and Delta Lake enhance data lake reliability. We will also walk through updates in the Apache Spark 3.0 release as part of our new Databricks Runtime 7.0 Beta.

Databricks hosted this tech talk on Delta Lake.

Data, like our experiences, is always evolving and accumulating. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions – new ways of seeing things we had no conception of before. These mental models are not unlike a table’s schema, defining how we categorize and process new information.

This brings us to schema management. As business problems and requirements evolve over time, so too does the structure of your data. With Delta Lake, as the data changes, incorporating new dimensions is easy. Users have access to simple semantics to control the schema of their tables. These tools include schema enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, as well as schema evolution, which enables them to automatically add new columns of rich data when those columns belong. In this webinar, we’ll dive into the use of these tools.
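To make the two behaviors concrete, here is a minimal pure-Python sketch of the idea (our illustration, not Delta Lake's actual implementation): by default an incoming batch is rejected if it carries columns the table doesn't know about or columns with mismatched types (enforcement), while opting into evolution lets new columns merge into the table schema. The `enforce_schema` helper and its dict-based schema model are hypothetical, for illustration only.

```python
# Conceptual sketch of schema enforcement vs. schema evolution.
# Schemas are modeled as {column_name: type_name} dicts; this is an
# illustration of the semantics, not Delta Lake's internal code.

def enforce_schema(table_schema, incoming_schema, merge_schema=False):
    """Return the schema to write with, or raise if the batch is incompatible."""
    for col, dtype in incoming_schema.items():
        if col in table_schema:
            if table_schema[col] != dtype:
                raise ValueError(
                    f"Type mismatch for column '{col}': "
                    f"table has {table_schema[col]}, batch has {dtype}")
        elif not merge_schema:
            # Schema enforcement: unknown columns are rejected by default,
            # so mistakes and garbage data never pollute the table.
            raise ValueError(f"Column '{col}' not found in table schema")
    if merge_schema:
        # Schema evolution: new columns are appended to the table schema.
        return {**table_schema, **incoming_schema}
    return dict(table_schema)

table = {"id": "long", "name": "string"}
batch = {"id": "long", "name": "string", "signup_date": "date"}

try:
    enforce_schema(table, batch)  # rejected: batch has an extra column
except ValueError as e:
    print(e)

evolved = enforce_schema(table, batch, merge_schema=True)
print(evolved)  # {'id': 'long', 'name': 'string', 'signup_date': 'date'}
```

In Delta Lake itself, the equivalent opt-in is the `mergeSchema` write option (e.g. `.option("mergeSchema", "true")` on a DataFrame write); without it, a mismatched write fails rather than silently altering the table.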

In this webinar you will learn about:

  • Understanding table schemas and schema enforcement
  • How does schema enforcement work?
  • How is schema enforcement useful?
  • Preventing data dilution
  • How does schema evolution work?
  • How is schema evolution useful?

Related Resources:

Here's an online tech talk hosted by Denny Lee, Developer Advocate at Databricks, with Burak Yavuz, Software Engineer, also of Databricks.

Link to Notebook.

The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel, and more. In this session, we’ll explore what the Delta Lake transaction log is, how it works at the file level, and how it offers an elegant solution to the problem of multiple concurrent reads and writes.
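As a rough intuition for how an ordered commit log yields both a current snapshot and time travel, here is a small sketch (our illustration, assuming a simplified log format, not Delta's actual protocol): each commit is an ordered list of "add"/"remove" actions over data files, and replaying commits 0..N reproduces the exact set of files that made up version N.

```python
# Minimal sketch of reconstructing table state from an ordered commit log.
# Each commit (think 000000.json, 000001.json, ...) holds add/remove
# actions over data files; replaying them in order yields a version's state.

def replay(commits):
    """Replay commit action-lists in order; return the live set of data files."""
    live = set()
    for actions in commits:
        for action in actions:
            if action["op"] == "add":
                live.add(action["path"])
            elif action["op"] == "remove":
                live.discard(action["path"])
    return live

commits = [
    [{"op": "add", "path": "part-0000.parquet"}],    # commit 0: first write
    [{"op": "add", "path": "part-0001.parquet"}],    # commit 1: append
    [{"op": "remove", "path": "part-0000.parquet"},  # commit 2: rewrite
     {"op": "add", "path": "part-0002.parquet"}],    # (e.g. an update)
]

print(replay(commits))       # files that make up the latest version
print(replay(commits[:2]))   # "time travel": the table as of version 1
```

Because every version is just a prefix of the log, reading an older version (time travel) and auditing how the table changed (lineage, debugging) both fall out of the same replay mechanism; concurrent writers are serialized by agreeing on which numbered commit file lands next.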

In this tech talk you will learn about:

  • What is the Delta Lake transaction log?
  • What is the transaction log used for?
  • How does the transaction log work?
  • Reviewing the Delta Lake transaction log at the file level
  • Dealing with multiple concurrent reads and writes
  • How the Delta Lake transaction log solves other use cases, including time travel, data lineage, and debugging

Databricks hosted this online tech talk, presented by Denny Lee, Developer Advocate at Databricks, on what data professionals can do to help the world beat the virus.

My name is Denny Lee and I’m a Developer Advocate at Databricks. But before this, I was a biostatistician working on HIV/AIDS research at the Fred Hutchinson Cancer Research Center and University of Washington Virology Lab in the Seattle area. Watching my friends and colleagues working on the front lines of this current pandemic has inspired me to see if we – as the data scientist community – can potentially help with “flattening the curve”. But before we dive into data science, remember – the most important thing you can do is wash your hands and practice social distancing! A great reference is How to Protect Yourself (https://www.cdc.gov/coronavirus/2019-ncov/prepare/prevention.html).

With the current concerns over SARS-CoV-2 and COVID-19, various COVID-19 datasets are now available on Kaggle and GitHub, along with competitions such as the COVID-19 Open Research Dataset Challenge (CORD-19) (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge#). Whether you are a student or a professional data scientist, we thought we could help out by providing a primer session with notebooks on how to start analyzing these datasets.