One of the new features of Synapse Analytics is Synapse Link – the ability to query a live analytical store within Azure Cosmos DB with very little setup. We've recently seen it rolled out to the SQL on-demand (serverless) endpoint, meaning we can run both Spark and SQL directly over this analytical store!
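As a rough sketch of what that looks like from a Synapse Spark notebook (the linked service and container names below are placeholders, not from the video):

```python
# Minimal sketch: querying the Cosmos DB analytical store from a Synapse Spark
# notebook via the "cosmos.olap" connector. `spark` is the session the notebook
# provides; "CosmosDbLinkedService" and "Orders" are hypothetical names.
df = (spark.read
      .format("cosmos.olap")
      .option("spark.synapse.linkedService", "CosmosDbLinkedService")
      .option("spark.cosmos.container", "Orders")
      .load())

# Register the analytical store data and query it with Spark SQL.
df.createOrReplaceTempView("orders")
spark.sql("SELECT status, COUNT(*) AS order_count FROM orders GROUP BY status").show()
```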

In today’s video, Simon demonstrates how we can use Synapse Link to build up a Lambda Architecture, which enables near real-time querying with relatively little fuss!

More information on Synapse Link can be found here: https://azure.microsoft.com/en-us/updates/azure-synapse-link-for-azure-cosmos-db-sql-serverless-runtime-support-in-preview/

For the OG Lambda Architecture, check out Nathan Marz’s book “Big Data” here – https://www.manning.com/books/big-data

Advancing Analytics explains how to parameterize Spark notebooks in Synapse Analytics, meaning you can plug notebooks into your orchestration pipelines and dynamically pass parameters to change how they behave on each run.

But how does it actually work?

Simon’s digging in to give us a quick peek at the new functionality.

For more details on the new parameters, take a peek here: https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks#orchestrate-notebook
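As a minimal sketch of the mechanism (names and paths are hypothetical, not from the video): you mark one cell in the Synapse notebook as the parameters cell, and the pipeline's Notebook activity passes base parameters that override those defaults at run time.

```python
# --- Parameters cell ---------------------------------------------------
# In Synapse Studio this cell is marked as the notebook's parameters cell.
# Values passed from a pipeline Notebook activity override these defaults;
# the names and paths here are hypothetical.
source_path = "abfss://raw@mylake.dfs.core.windows.net/sales/"
target_path = "abfss://curated@mylake.dfs.core.windows.net/sales/"
run_date = "2021-01-01"

# --- Rest of the notebook ----------------------------------------------
# `spark` is the session the Synapse notebook provides.
df = spark.read.parquet(source_path).filter(f"sale_date = '{run_date}'")
df.write.mode("overwrite").parquet(f"{target_path}run_date={run_date}/")
```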

Azure Synapse workspaces can host a Spark cluster.

In addition to providing the execution environment for certain Synapse features such as notebooks, it also lets you write custom code that runs as a job inside a Synapse-hosted Spark cluster.

This video walks through the process of running a C# custom Spark job in Azure Synapse. It shows how to create the Synapse workspace in the Azure portal, how to add a Spark pool, and how to configure a suitable storage account. It also shows how to write the custom job in C#, how to upload the built output to Azure, and then how to configure Azure Synapse to execute the .NET application as a custom job.

Topics/Time index:

  • Create a new Azure Synapse Analytics workspace (0:17)
  • Configuring security on the storage account (1:29)
  • Exploring the workspace (2:42)
  • Creating an Apache Spark pool (3:01)
  • Creating the C# application (4:05)
  • Adding a namespace directive to use Spark SQL (4:48)
  • Creating the Spark session (5:01)
  • How the job will work (5:22)
  • Defining the work with Spark SQL (6:42)
  • Building the .NET application to upload to Azure Synapse (9:48)
  • Uploading our application to Azure Synapse (11:45)
  • Using the zipped .NET application in a custom Spark job definition (12:39)
  • Testing the custom job (13:36)
  • Monitoring the job (13:56)
  • Inspecting the results (14:25)
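The video builds the job in C# with .NET for Apache Spark; purely as an illustration of the same shape of work (create a session, register a view, express the transformation in Spark SQL, write the result), here is a hypothetical PySpark sketch rather than the video's actual code:

```python
from pyspark.sql import SparkSession

# Hypothetical PySpark equivalent of the job's structure; the video uses C#.
spark = SparkSession.builder.appName("custom-spark-job").getOrCreate()

# Read input data (path and schema are made up for illustration).
orders = (spark.read.option("header", True).option("inferSchema", True)
          .csv("abfss://data@mystorage.dfs.core.windows.net/orders.csv"))
orders.createOrReplaceTempView("orders")

# Define the work with Spark SQL.
summary = spark.sql("""
    SELECT customer, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer
""")

# Write the results back to storage.
summary.write.mode("overwrite").parquet(
    "abfss://data@mystorage.dfs.core.windows.net/output/order-summary")
```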

Petastorm is an open source data access library.

This library enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and datasets that are already loaded as Apache Spark DataFrames.

Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark. For more information about Petastorm, refer to the Petastorm GitHub page and Petastorm API documentation.
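As a brief sketch of the Spark DataFrame path (the cache directory and DataFrame below are made up), Petastorm's Spark converter materialises a DataFrame to Parquet and then serves it to the ML framework:

```python
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.getOrCreate()

# Directory where the converter caches the intermediate Parquet files
# (hypothetical path).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

# Any Spark DataFrame of features and labels will do; this one is synthetic.
df = spark.range(1000).selectExpr("CAST(id AS float) AS feature",
                                  "CAST(id % 2 AS float) AS label")

converter = make_spark_converter(df)

# Hand the data to TensorFlow as a tf.data.Dataset (requires TensorFlow installed).
with converter.make_tf_dataset(batch_size=32) as dataset:
    for batch in dataset.take(1):
        print(batch.feature.shape, batch.label.shape)
```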

Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational (OLTP) database and replays those changes promptly to external storage such as Delta Lake or Kudu for real-time OLAP.

To implement a robust CDC streaming pipeline, many factors need to be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether the pipeline can be built for a variety of databases with little code. This talk shares practices for simplifying CDC pipelines with Spark Streaming SQL and Delta Lake.
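The talk itself uses Spark Streaming SQL; as a rough PySpark sketch of the same merge-based replay idea (the table paths, column names, and change-log layout here are assumptions, not from the talk), each micro-batch of binlog records can be de-duplicated and merged into a Delta table:

```python
from pyspark.sql import SparkSession, Window, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

def apply_changes(batch_df, batch_id):
    # Keep only the latest change per key within the micro-batch.
    latest = (batch_df
              .withColumn("rn", F.row_number().over(
                  Window.partitionBy("id").orderBy(F.col("ts").desc())))
              .filter("rn = 1")
              .drop("rn"))
    target = DeltaTable.forPath(spark, "/delta/orders")
    (target.alias("t")
        .merge(latest.alias("s"), "t.id = s.id")
        .whenMatchedDelete(condition="s.op = 'DELETE'")
        .whenMatchedUpdateAll(condition="s.op <> 'DELETE'")
        .whenNotMatchedInsertAll(condition="s.op <> 'DELETE'")
        .execute())

# Hypothetical landing table holding parsed binlog rows (id, ts, op, payload columns).
changes = spark.readStream.format("delta").load("/delta/raw_binlog_orders")

(changes.writeStream
    .foreachBatch(apply_changes)
    .option("checkpointLocation", "/checkpoints/orders_cdc")
    .start())
```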

Adam Marczak explains Azure Data Factory Mapping Data Flow in this video.

With Azure Data Factory Mapping Data Flow, you can create fast and scalable on-demand transformations using a visual user interface. In just minutes you can leverage the power of Spark without writing a single line of code.

In this episode I give you an introduction to what Mapping Data Flow for Data Factory is and how it can solve your day-to-day ETL challenges. In a short demo I consume data from blob storage, transform movie data, aggregate it, and save multiple outputs back to blob storage.

Sample code and data: https://github.com/MarczakIO/azure4everyone-samples/tree/master/azure-data-factory-mapping-data-flows 

Ayman El-Ghazali recently presented this introduction to Databricks from the perspective of a SQL DBA at the NoVA SQL Users Group.

Come learn about the following topics:

  • Basics of how Spark works
  • Basics of how Databricks works (cluster setup, basic admin)
  • How to design and code an ETL Pipeline using Databricks
  • How to read/write from Azure Data Lake and a database (see the sketch below)
  • Integration of Databricks into Azure Data Factory pipeline

Code available at:  https://github.com/thesqlpro/blog
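Touching on the read/write topic above, a minimal sketch (assuming a Databricks notebook where `spark` and `dbutils` are predefined, and with made-up storage, server, and secret names) might look like this:

```python
# Read raw data from Azure Data Lake Storage Gen2 (hypothetical container/path).
lake_df = (spark.read.format("parquet")
           .load("abfss://raw@mydatalake.dfs.core.windows.net/customers/"))

cleaned = lake_df.dropDuplicates(["customer_id"])

# Write the result to an Azure SQL Database over JDBC, pulling credentials
# from a Databricks secret scope (scope and key names are hypothetical).
(cleaned.write.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.Customers")
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .mode("append")
    .save())
```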

Databricks recently streamed this tech chat on SCD, or Slowly Changing Dimensions.

We will discuss a popular online analytical processing (OLAP) fundamental – slowly changing dimensions (SCD) – specifically Type 2.

As we have discussed in various other Delta Lake tech talks, the reliability brought to data lakes by Delta Lake has brought a resurgence of many of the data warehousing fundamentals such as Change Data Capture in data lakes.

Type 2 SCD within data warehousing allows you to keep track of both the history and current data over time. We will discuss how to apply these concepts to your data lake within the context of the market segmentation of a climbing eCommerce site.
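A simplified, hypothetical sketch of the Type 2 pattern on Delta Lake (the talk shows a single-MERGE variant; the table paths and columns here are made up): expire the current row when a tracked attribute changes, then append the new version as the current record.

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

dim = DeltaTable.forPath(spark, "/delta/dim_customer")                 # hypothetical dimension table
updates = spark.read.format("delta").load("/delta/customer_changes")   # hypothetical staged changes

# Step 1: close out the current row for customers whose address changed.
(dim.alias("t")
    .merge(updates.alias("s"),
           "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(condition="t.address <> s.address",
                       set={"is_current": "false",
                            "end_date": "s.effective_date"})
    .execute())

# Step 2: append the new version of each changed row as the current record
# (a real pipeline would first filter `updates` down to genuinely changed rows).
new_rows = (updates
            .withColumn("is_current", F.lit(True))
            .withColumn("end_date", F.lit(None).cast("date")))
new_rows.write.format("delta").mode("append").save("/delta/dim_customer")
```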

 

ThorogoodBI explores the use of Databricks for data engineering purposes in this webinar.

Whether you’re looking to transform and clean large volumes of data or collaborate with colleagues to build advanced analytics jobs that can be scaled and run automatically, Databricks offers a Unified Analytics Platform that promises to make your life easier.

In the second of two recorded webcasts, Thorogood consultants Jon Ward and Robbie Shaw showcase Databricks' data transformation and data movement capabilities, show how the tool aligns with cloud computing services, and highlight the security, flexibility, and collaboration aspects of Databricks. They also look at Databricks Delta Lake and how it offers improved storage for both large-scale datasets and real-time streaming data.