Databricks recently streamed this tech chat on SCD, or Slowly Changing Dimensions.

We will discuss a popular online analytical processing (OLAP) fundamental – slowly changing dimensions (SCD) – specifically Type 2.

As we have discussed in various other Delta Lake tech talks, the reliability that Delta Lake brings to data lakes has led to a resurgence of many data warehousing fundamentals, such as Change Data Capture, in data lakes.

Type 2 SCD within data warehousing allows you to keep track of both historical and current data over time. We will discuss how to apply these concepts to your data lake within the context of the market segmentation of a climbing eCommerce site.
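To make the Type 2 idea concrete, here is a minimal sketch in plain Python. It is not the Delta Lake approach shown in the talk; the `CustomerRow` fields, the `apply_scd2` helper, and the segment names are all hypothetical, chosen only to illustrate how a changed attribute closes the current row and opens a new one rather than overwriting history.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    # Hypothetical dimension row for a climbing shop's customer segments.
    customer_id: int
    segment: str
    valid_from: date
    valid_to: Optional[date]  # None means the row is still open
    is_current: bool


def apply_scd2(rows: List[CustomerRow], customer_id: int,
               new_segment: str, change_date: date) -> List[CustomerRow]:
    """Return a new dimension table with a Type 2 change applied."""
    result = []
    found_current = False
    changed = False
    for row in rows:
        if row.customer_id == customer_id and row.is_current:
            found_current = True
            if row.segment == new_segment:
                # Nothing changed: keep the current row untouched.
                result.append(row)
            else:
                # Close the old version instead of overwriting it.
                result.append(replace(row, valid_to=change_date, is_current=False))
                changed = True
        else:
            result.append(row)
    if changed or not found_current:
        # Open a new current version (also covers brand-new customers).
        result.append(CustomerRow(customer_id, new_segment, change_date, None, True))
    return result


dim = [CustomerRow(1, "boulderer", date(2020, 1, 1), None, True)]
dim = apply_scd2(dim, 1, "alpinist", date(2020, 6, 1))
for row in dim:
    print(row)
```

After the update the table holds two rows for customer 1: the closed "boulderer" row and the open "alpinist" row, so both the history and the current state can be queried.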

 

ThorogoodBI explores the use of Databricks for data engineering purposes in this webinar.

Whether you’re looking to transform and clean large volumes of data or collaborate with colleagues to build advanced analytics jobs that can be scaled and run automatically, Databricks offers a Unified Analytics Platform that promises to make your life easier.

In the second of two recorded webcasts, Thorogood consultants Jon Ward and Robbie Shaw showcase Databricks’ data transformation and data movement capabilities, show how the tool aligns with cloud computing services, and highlight the security, flexibility and collaboration aspects of Databricks. We’ll also look at Databricks Delta Lake, and how it offers improved storage for both large-scale datasets and real-time streaming data.

In this video, Chris Seferlis continues discussing the Modern Data Platform in Azure with Part 3: Data Processing.

Tools Discussed:

Learn what Spark is and how to use it in Big Data Clusters.

Time index

  • [00:00] Introduction
  • [00:30] One-sentence definition of Spark
  • [00:47] Storing Big Data
  • [01:44] What is Spark?
  • [02:35] Language choice
  • [03:27] Unified compute engine
  • [04:57] Spark with SQL Server
  • [05:47] Learning more
  • [06:10] Wrap-up

Data Lake Storage Gen 2 is the best storage solution for big data analytics in Azure. With its Hadoop-compatible access, it is a perfect fit for existing platforms like Databricks, Cloudera, Hortonworks, Hadoop, HDInsight and many more. Take advantage of both blob storage and data lake in one service!

In this video, Azure 4 Everyone introduces what Azure Data Lake Storage is, how it works, and how you can leverage it in your big data workloads. I will also explain the differences between Blob and ADLS.

Sample code from demo: https://pastebin.com/ee7ULpwx

Next steps for you after watching the video
1. Azure Data Lake Storage documentation
– https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction
2. Transform data using Databricks and ADLS demo tutorial
– https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse
3. More on multi-protocol access
– https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-multi-protocol-access
4. Read more on ACL
– https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control

CloudAcademy has an intro piece on Apache Spark on Azure Databricks.

Apache Spark is an open-source framework for doing big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it does in-memory processing, which can be orders of magnitude faster than the disk-based processing that MapReduce uses. There are plenty of other differences between the two systems, as well, but we don’t need to go into the details here.