Advancing Analytics takes a closer look at the two new runtimes available for Databricks.

We have not just one but two new Databricks Runtimes currently in preview – 7.6 brings several new features focused on making Auto Loader more flexible and improving the performance of Optimize and Structured Streaming.

Runtime 8.0 is a much wider change, shifting to Spark 3.1 and introducing new language versions for Python, Scala and R.

This shift brings a large swathe of functionality, performance and feature changes, so take some time to look through the docs.

Simon walks through the high level notes, pulling out some interesting features and improvements.
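
One of the headline items in 7.6 is the Auto Loader work. For anyone who hasn't touched it yet, here's a minimal PySpark sketch of the "cloudFiles" streaming source; the schema and paths below are placeholders, not taken from the video:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Placeholder schema and paths - swap in your own landing zone details.
landing_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

# Auto Loader is the Databricks "cloudFiles" streaming source; it picks up
# new files incrementally as they land. `spark` is the notebook's session.
df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(landing_schema)
        .load("/mnt/landing/sales"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/sales")
   .start("/mnt/delta/sales"))
```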

Simon from Advancing Analytics explores the Atlas API that’s exposed under the covers of the new Azure Purview data governance offering.

There are a couple of different libraries available currently, so don’t be surprised if we see a lot of shifts & changes as the preview matures!

In this video, Simon takes a look at how you can get started with the API in a Databricks Notebook to register a custom lineage between two entities.

For more info on the pyapacheatlas library used, see: https://pypi.org/project/pyapacheatlas/
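
As a rough sketch of the kind of call the video builds towards, registering lineage with pyapacheatlas boils down to uploading two entities plus a process that links them. The credentials, names and GUIDs below are placeholders rather than the ones Simon uses:

```python
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient, AtlasEntity, AtlasProcess

# Placeholder service principal credentials for the Purview account.
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
client = PurviewClient(account_name="<purview-account>", authentication=auth)

# Two placeholder entities; negative GUIDs are temporary and resolved on upload.
source = AtlasEntity(name="raw_sales", typeName="DataSet",
                     qualified_name="pyapacheatlas://raw_sales", guid=-100)
sink = AtlasEntity(name="curated_sales", typeName="DataSet",
                   qualified_name="pyapacheatlas://curated_sales", guid=-101)

# The process entity is what Purview draws as lineage between the two.
transform = AtlasProcess(name="transform_sales", typeName="Process",
                         qualified_name="pyapacheatlas://transform_sales",
                         inputs=[source], outputs=[sink], guid=-102)

client.upload_entities(batch=[source, sink, transform])
```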

NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python.

Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of some of the most recent research in applied deep learning. It’s the most widely used NLP library in the enterprise today.

You’ll edit and extend a set of executable Python notebooks by implementing these common NLP tasks: named entity recognition, sentiment analysis, spell checking and correction, document classification, and multilingual and multi-domain support. The discussion of each NLP task includes the latest advances in deep learning used to tackle it, including the prebuilt use of BERT embeddings within Spark NLP, using tuned embeddings, and “post-BERT” research results like XLNet, ALBERT, and RoBERTa.

Spark NLP builds on the Apache Spark and TensorFlow ecosystems, and as such it’s the only open-source NLP library that can natively scale to use any Spark cluster, as well as take advantage of the latest processors from Intel and Nvidia. You’ll run the notebooks locally on your laptop, but we’ll explain and show a complete case study and benchmarks on how to scale an NLP pipeline for both training and inference.
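
To get a feel for the library before diving into the tutorial, here's a minimal example using one of the stock pretrained Spark NLP pipelines (the pipeline name is illustrative; the notebooks cover many more):

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Starts (or attaches to) a Spark session with the Spark NLP jars loaded.
spark = sparknlp.start()

# A stock pretrained pipeline for English named entity recognition.
pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")

result = pipeline.annotate("Spark NLP was presented at the London meetup.")
print(result["entities"])
```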

On the latest episode of Data Brew, Denny Lee talks to Michael Armbrust about Delta Lake.

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

For our “Demystifying Delta Lake” session, we will interview Michael Armbrust – committer and PMC member of Apache Spark™ and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Delta Lake.
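
As a quick refresher on what that looks like in code, here's a minimal PySpark sketch with placeholder paths: writing a Delta table and then reading an earlier version back with time travel.

```python
# `spark` is the notebook's SparkSession; the table path is a placeholder.
df = spark.range(100).withColumnRenamed("id", "event_id")

# Write the DataFrame as a Delta table.
df.write.format("delta").mode("overwrite").save("/mnt/delta/events")

# Read it back, or time-travel to an earlier version of the table.
latest = spark.read.format("delta").load("/mnt/delta/events")
v0 = (spark.read.format("delta")
        .option("versionAsOf", 0)
        .load("/mnt/delta/events"))
```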

One of the new features of Synapse Analytics is Synapse Link – the ability to query a live analytical store within Cosmos DB with only a tiny amount of setup. We’ve recently seen it rolled out for the SQL on-demand endpoint, meaning we can write both Spark and SQL directly over this analytical store!

In today’s video, Simon demonstrates how we can use Synapse Link to build up a Lambda Architecture, which enables near real-time querying with relatively little fuss!
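
For a sense of what the Spark half of that looks like, here's a minimal sketch of reading the Cosmos DB analytical store from a Synapse notebook; the linked service and container names are placeholders:

```python
# Read the Cosmos DB analytical store via Synapse Link. The linked service
# and container names below are placeholders for your own setup.
df = (spark.read
        .format("cosmos.olap")
        .option("spark.synapse.linkedService", "CosmosDbLinkedService")
        .option("spark.cosmos.container", "sales")
        .load())

df.createOrReplaceTempView("sales_live")
spark.sql("SELECT COUNT(*) AS orders FROM sales_live").show()
```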

More information on Synapse Link can be found here: https://azure.microsoft.com/en-us/updates/azure-synapse-link-for-azure-cosmos-db-sql-serverless-runtime-support-in-preview/

For the OG Lambda Architecture, check out Nathan Marz’s book “Big Data” here – https://www.manning.com/books/big-data

Advancing Analytics explains how to parameterize Spark notebooks in Synapse Analytics, meaning you can plug notebooks into your orchestration pipelines and dynamically pass parameters to change how they work each time.

But how does it actually work?

Simon’s digging in to give us a quick peek at the new functionality.

For more details on the new parameters, take a peek here: https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks#orchestrate-notebook
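
Under the hood it hinges on a parameters cell: toggle "Parameters" on a cell in the Synapse notebook, give the variables sensible defaults, and let the pipeline's Notebook activity override them at run time. A minimal sketch with placeholder names:

```python
# Parameters cell - mark this cell as "Parameters" in the Synapse notebook.
# The pipeline's Notebook activity injects new values for these at run time.
source_container = "landing"
load_date = "2021-01-01"
```

A later cell can then use the injected values like any other Python variables:

```python
# Placeholder storage account and path, built from the parameters above.
path = f"abfss://{source_container}@mystorageaccount.dfs.core.windows.net/sales/{load_date}/"
df = spark.read.parquet(path)
df.show(5)
```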

Azure Synapse workspaces can host a Spark cluster.

In addition to providing the execution environment for certain Synapse features such as Notebooks, you can also write custom code that runs as a job inside a Synapse-hosted Spark cluster.

This video walks through the process of running a C# custom Spark job in Azure Synapse. It shows how to create the Synapse workspace in the Azure portal, how to add a Spark pool, and how to configure a suitable storage account. It also shows how to write the custom job in C#, how to upload the built output to Azure, and then how to configure Azure Synapse to execute the .NET application as a custom job.

Topics/Time index:

  • Create a new Azure Synapse Analytics workspace (0:17)
  • Configuring security on the storage account (1:29)
  • Exploring the workspace (2:42)
  • Creating an Apache Spark pool (3:01)
  • Creating the C# application (4:05)
  • Adding a namespace directive to use Spark SQL (4:48)
  • Creating the Spark session (5:01)
  • How the job will work (5:22)
  • Defining the work with Spark SQL (6:42)
  • Building the .NET application to upload to Azure Synapse (9:48)
  • Uploading our application to Azure Synapse (11:45)
  • Using the zipped .NET application in a custom Spark job definition (12:39)
  • Testing the custom job (13:36)
  • Monitoring the job (13:56)
  • Inspecting the results (14:25)
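
The job itself is written in C# with .NET for Apache Spark, so the code from the video isn't reproduced here. Purely as a hypothetical illustration of the same pattern (create a session, define the work with Spark SQL, write the results out), a PySpark equivalent would look something like this, with placeholder storage paths:

```python
from pyspark.sql import SparkSession

# Rough PySpark analogue of the video's C# job; all paths are placeholders.
spark = SparkSession.builder.appName("custom-synapse-job").getOrCreate()

# Read input from the workspace's storage account.
df = spark.read.csv(
    "abfss://data@mystorageaccount.dfs.core.windows.net/input/",
    header=True)
df.createOrReplaceTempView("input_data")

# Define the work with Spark SQL.
result = spark.sql("SELECT COUNT(*) AS row_count FROM input_data")

# Write the results back to storage.
result.write.mode("overwrite").csv(
    "abfss://data@mystorageaccount.dfs.core.windows.net/output/")

spark.stop()
```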