Building a curated data lake on real-time data with Delta Lake is an emerging data warehousing pattern.

In the real world, however, we often face dynamically changing schemas, which pose a big challenge to incorporate without downtime.

In this presentation we will show how we built a robust streaming ETL pipeline that handles changing schemas and unseen event types with zero downtime. The pipeline can infer changed schemas, adjust the underlying tables, and create new tables and ingestion streams when it detects a new event type. We will show in detail how to infer schemas on the fly and how to track and store these schemas when you don’t have the luxury of a schema registry in the system.
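As a rough sketch of the core idea (not the talk’s actual code), assuming JSON events arrive as a string column named value and that Delta’s mergeSchema option is acceptable for evolving the target table, schema inference can be delegated to Spark’s JSON reader inside a foreachBatch sink:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def infer_schema_from_sample(json_strings):
    # Let Spark's JSON reader infer a schema from a small sample of raw events.
    return spark.read.json(spark.sparkContext.parallelize(json_strings)).schema

def upsert_batch(batch_df, batch_id):
    # foreachBatch sink: parse each micro-batch with a freshly inferred schema
    # and append, letting Delta evolve the table when new columns appear.
    sample = [row.value for row in batch_df.limit(100).collect()]
    if not sample:
        return
    schema = infer_schema_from_sample(sample)
    parsed = (batch_df
              .select(F.from_json(F.col("value"), schema).alias("event"))
              .select("event.*"))
    (parsed.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")   # adjust the underlying table to the new schema
        .saveAsTable("curated_events"))  # placeholder table name

# raw_events: a streaming DataFrame with a string column `value`, e.g. from Kafka
# raw_events.writeStream.foreachBatch(upsert_batch).start()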

With potentially hundreds of streams, it is important to consider how we deploy these streams and keep them operational on Databricks.
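One common deployment pattern, shown here only as a hedged sketch with placeholder topic and path names (not the approach from the talk), is to drive every ingestion stream from a configuration list, giving each query its own name and checkpoint location:

# `spark` is the session a Databricks notebook provides.
stream_configs = [
    {"topic": "orders", "path": "/delta/bronze/orders"},
    {"topic": "clicks", "path": "/delta/bronze/clicks"},
    # ... potentially hundreds more, one entry per event type
]

for cfg in stream_configs:
    (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
        .option("subscribe", cfg["topic"])
        .load()
        .writeStream
        .format("delta")
        .queryName(f"ingest_{cfg['topic']}")
        .option("checkpointLocation", f"/checkpoints/{cfg['topic']}")  # one checkpoint per stream
        .start(cfg["path"]))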

Delta Lake is an open-source storage layer that brings ACID transactions and time travel to Apache Spark and big data workloads.

The latest release, Delta Lake 0.7.0, requires Apache Spark 3.0, and among its features is full coverage of SQL DDL and DML commands.
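As a brief, hedged illustration of what that SQL coverage enables (table and column names are made up, and this assumes a Spark 3.0 session configured with Delta Lake 0.7.0), DDL and DML run directly against Delta tables:

# Assumes Delta Lake 0.7.0 on a Spark 3.0 session with Delta's catalog configured.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        id BIGINT,
        event_type STRING,
        ts TIMESTAMP
    ) USING DELTA
""")

# DML such as DELETE, UPDATE, and MERGE works directly on the Delta table.
spark.sql("DELETE FROM events WHERE event_type = 'heartbeat'")
spark.sql("""
    MERGE INTO events AS t
    USING updates AS s          -- `updates` is a placeholder staging view
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")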

Coming from a data warehousing and BI background, Franco Patano wanted to have a catalogue of the Lakehouse, including schema and profiling statistics.

He created the Lakehouse Data Profiler notebook, using Python and SQL to analyze the data and generate schema and statistics tables. He then uses the new SQL Analytics product from Databricks to build dashboards that visualize the data profiling statistics, and he discusses how to use these dashboards to optimize JOINs and other operations.
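The profiler notebook itself is not reproduced here, but a minimal sketch of the same idea in PySpark (the table and target names below are placeholders) collects a table’s schema and summary statistics and writes them out for dashboarding:

# `spark` is the session a Databricks notebook provides.
table_name = "sales.orders"  # placeholder
df = spark.table(table_name)

# Schema as rows of (column, type), suitable for saving to a catalog table.
schema_df = spark.createDataFrame(
    [(f.name, f.dataType.simpleString()) for f in df.schema.fields],
    ["column_name", "data_type"],
)

# Built-in profiling statistics: count, mean, stddev, min, quartiles, max.
stats_df = df.summary()

schema_df.write.format("delta").mode("overwrite").saveAsTable("profiler.schemas")     # placeholder
stats_df.write.format("delta").mode("overwrite").saveAsTable("profiler.statistics")   # placeholder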

[Lightning talk from Data + AI Summit 2020]

Apache Spark has become the de facto open-source standard for big data processing, thanks to its ease of use and performance.

The open-source Delta Lake project improves data reliability on Spark with capabilities like ACID transactions, schema enforcement, and time travel.
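For a sense of what those capabilities look like in practice (a hedged sketch with placeholder paths, not webinar material): time travel reads an older version of a table, and schema enforcement rejects writes whose schema does not match the table:

# Time travel: read the table as of an earlier version or timestamp.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/events")
as_of = (spark.read.format("delta")
         .option("timestampAsOf", "2020-06-01")
         .load("/delta/events"))

# Schema enforcement: appending a DataFrame with an unexpected column raises
# an AnalysisException instead of silently corrupting the table.
bad_df = spark.createDataFrame([(1, "oops")], ["id", "unexpected_column"])
try:
    bad_df.write.format("delta").mode("append").save("/delta/events")
except Exception as err:
    print(f"Write rejected by schema enforcement: {err}")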

Watch this webinar to learn how Apache Spark 3.0 and Delta Lake enhance data lake reliability. We will also walk through updates in the Apache Spark 3.0 release as part of our new Databricks Runtime 7.0 Beta.

Azure Synapse has many features to help analyze data, and in this episode of Data Exposed, Ginger Grant reviews how to query data stored in a data lake with Azure Synapse and how to visualize that data in Power BI.

The demonstrations show how to run SQL queries against the data lake without provisioning dedicated Synapse compute or manipulating the data. Ginger also walks through the steps to connect to Power BI from within Azure Synapse and visualize the data. To help you get started with Power BI and Azure Synapse, the video walks through creating Power BI data source files to speed up connectivity.

Index:

  • 0:00 Introduction
  • 1:15 What is Azure Synapse
  • 2:27 What you can do with Azure Synapse
  • 3:40 Azure Synapse Studio
  • 5:10 Including Power BI Demo
  • 9:40 When to use Azure Synapse

Databricks recently streamed this tech chat on SCD, or Slowly Changing Dimensions.

We will discuss a popular online analytical processing (OLAP) fundamental – slowly changing dimensions (SCD) – specifically Type 2.

As we have discussed in various other Delta Lake tech talks, the reliability that Delta Lake brings to data lakes has led to a resurgence of many data warehousing fundamentals, such as Change Data Capture, on the data lake.

Type 2 SCD within data warehousing lets you keep track of both historical and current data over time. We will discuss how to apply these concepts to your data lake within the context of the market segmentation of a climbing eCommerce site.
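One common way to implement Type 2 SCD on Delta Lake is a single MERGE that closes out the current row for a changed key and inserts the new version. The sketch below is a hedged illustration, not the talk’s exact code: it assumes a dim_customer target with is_current, effective_date, and end_date columns, an updates staging view, and that address is the tracked attribute:

spark.sql("""
    MERGE INTO dim_customer AS t
    USING (
        -- New and changed customer rows, keyed for matching against the target.
        SELECT u.customer_id AS merge_key, u.* FROM updates u
        UNION ALL
        -- Changed customers again with a NULL key, so they skip the MATCHED clause
        -- and fall through to INSERT as the new current row.
        SELECT NULL AS merge_key, u.*
        FROM updates u
        JOIN dim_customer d
          ON u.customer_id = d.customer_id
        WHERE d.is_current = true AND u.address <> d.address
    ) AS s
    ON t.customer_id = s.merge_key AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN
      UPDATE SET is_current = false, end_date = s.effective_date
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, effective_date, end_date, is_current)
      VALUES (s.customer_id, s.address, s.effective_date, NULL, true)
""")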

 

Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs. All of this leverages our limitless Azure Data Lake Storage service for any type of data.

Microsoft Mechanics explains.

Data Lake Storage Gen2 is the best storage solution for big data analytics in Azure. With its Hadoop-compatible access, it is a perfect fit for existing platforms like Databricks, Cloudera, Hortonworks, Hadoop, HDInsight, and many more. Take advantage of both blob storage and a data lake in one service!

In this video, Azure 4 Everyone introduces what Azure Data Lake Storage is, how it works, and how you can leverage it in your big data workloads. The video also explains the differences between Blob Storage and ADLS.

Sample code from demo: https://pastebin.com/ee7ULpwx
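Separate from the demo sample linked above, here is a minimal hedged sketch of how a Spark workload reads ADLS Gen2 data through its Hadoop-compatible abfss endpoint (the account, container, and path are placeholders, and authentication setup is omitted):

# ADLS Gen2 is addressed with the abfss:// scheme via the Hadoop ABFS driver;
# the same data stays reachable through the Blob endpoint (multi-protocol access).
path = "abfss://mycontainer@myaccount.dfs.core.windows.net/raw/events/"  # placeholder

df = spark.read.format("parquet").load(path)   # `spark` as provided by Databricks/Synapse
df.groupBy("event_type").count().show()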

Next steps for you after watching the video:
1. Azure Data Lake Storage documentation
   https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction
2. Transform data using Databricks and ADLS demo tutorial
   https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse
3. More on multi-protocol access
   https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-multi-protocol-access
4. Read more on ACL
   https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control

HDFS Tiering is Microsoft’s latest contribution to the Apache HDFS open source project.

In this video, learn how to use the HDFS tiering feature in SQL Server Big Data Clusters to seamlessly access your remote HDFS-compatible storage for querying and analysis.

To learn more, check out our documentation: https://docs.microsoft.com/sql/big-data-cluster/hdfs-tiering?view=sql-server-ver15&WT.mc_id=dataexposed-c9-niner.