David Giard recently posted a how-to article on creating an Azure Databricks service. Check it out!

Azure Databricks is a web-based platform built on top of Apache Spark and deployed to Microsoft’s Azure cloud platform. Databricks provides a web-based interface that makes it simple for users to create and scale clusters of Spark servers and deploy jobs and Notebooks to those clusters. Spark provides a […]

CloudAcademy has an intro piece on Apache Spark on Azure Databricks.

Apache Spark is an open-source framework for doing big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it does in-memory processing, which can be orders of magnitude faster than the disk-based processing that MapReduce uses. There are plenty of other differences between the two systems, as well, but we don’t need to go into the details here.
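To make the in-memory point concrete, here is a minimal PySpark sketch; the file path and column names are illustrative assumptions, not anything from the articles above. A dataset is cached once, and subsequent actions reuse the in-memory copy instead of re-reading from disk, which is where Spark gains over MapReduce's disk-based intermediate steps.

```python
# Minimal PySpark sketch of in-memory processing.
# "events.csv" and the column names are placeholders for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Read the dataset once and cache it in cluster memory.
events = spark.read.csv("events.csv", header=True, inferSchema=True).cache()

# Multiple actions reuse the cached data rather than re-reading from disk.
print(events.count())
events.groupBy("event_type").count().show()

spark.stop()
```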

MLflow enables data scientists to track and distribute experiments, package and share models across frameworks, and deploy them, whether the target environment is a personal laptop or a cloud data center. Here's an interesting take from the Register.

MLflow was designed to take some of the pain out of machine learning in organizations that don’t have the coding and engineering muscle of the hyperscalers. It works with every major ML library, algorithm, deployment tool and language.
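As a rough illustration of what that tracking looks like in practice, here is a hedged Python sketch using MLflow's tracking API with a scikit-learn model; the model, parameter values, and metric names are placeholders chosen for the example, not anything prescribed by MLflow.

```python
# Hedged sketch of MLflow experiment tracking with scikit-learn.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    # Log parameters, metrics, and the model itself so the run can be
    # compared, shared, and later deployed from the tracking server.
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")
```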

Databricks announced that it has open-sourced Delta Lake, a storage layer that makes it easier to ensure data integrity as new data flows into an enterprise's data lake by bringing ACID transactions to these big data repositories. TechCrunch has an article detailing why this is a big deal.

The tool provides the ability to enforce specific schemas (which can be changed as necessary), to create snapshots and to ingest streaming data or backfill the lake as a batch job. Delta Lake also uses the Spark engine to handle the metadata of the data lake (which by itself is often a big data problem). Over time, Databricks also plans to add an audit trail, among other things.
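Here is a minimal PySpark sketch of what working with a Delta Lake table looks like, assuming a Databricks runtime where the Delta format is available; the table path and schema are illustrative assumptions.

```python
# Rough sketch of Delta Lake usage; path and schema are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event_type"])

# Writes are ACID transactions, and the schema is enforced on append,
# so incompatible rows are rejected rather than silently corrupting the lake.
df.write.format("delta").mode("append").save("/delta/events")

# Time travel: read an earlier snapshot of the table by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/events")
v0.show()
```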

Data engineering makes up roughly 70% of any data pipeline today, and without the experience to implement a data engineering pipeline well, you won't extract much value from your data.
In this session from Microsoft Ignite, we discuss best practices and demonstrate how a data engineer can develop and orchestrate a big data pipeline, including: data ingestion and orchestration using Azure Data Factory; data curation, cleansing and transformation using Azure Databricks; and data loading into Azure SQL Data Warehouse for serving your BI tools.
Watch and learn how to do the ETL/ELT process effectively, combined with advanced capabilities such as monitoring jobs, getting alerts, retrying jobs, setting permissions, and much more.
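As a rough sketch of the curation and loading steps described in that session, here is a hedged PySpark example that cleans data already landed in the lake and writes it to Azure SQL Data Warehouse via the Databricks SQL DW connector; the storage paths, JDBC URL, column names, and table name are placeholders, not real endpoints.

```python
# Hedged sketch: curate lake data in Databricks, load into Azure SQL DW.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Curate and cleanse raw data that Azure Data Factory landed in the lake.
raw = spark.read.parquet("abfss://raw@<storage-account>.dfs.core.windows.net/sales/")
curated = (
    raw.dropDuplicates(["order_id"])
       .filter(F.col("amount") > 0)
       .withColumn("load_date", F.current_date())
)

# Load the curated data into Azure SQL Data Warehouse for BI serving.
(curated.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=<dw>")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.sales_curated")
    .option("tempDir", "wasbs://staging@<storage-account>.blob.core.windows.net/tmp")
    .save())
```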

Gaurav Malhotra discusses how you can operationalize JARs and Python scripts running on Azure Databricks as an activity step in a Data Factory pipeline.
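For a sense of what that looks like, here is an illustrative sketch of a Data Factory pipeline definition containing a Databricks Python activity, written as a Python dict mirroring the pipeline JSON; the pipeline name, linked service name, script path, and parameters are assumptions for the example.

```python
# Illustrative shape of an ADF pipeline with a Databricks Python activity.
# Names, paths, and parameter values below are placeholders.
pipeline = {
    "name": "DatabricksPythonPipeline",
    "properties": {
        "activities": [
            {
                "name": "RunEtlScript",
                "type": "DatabricksSparkPython",
                "linkedServiceName": {
                    "referenceName": "AzureDatabricksLinkedService",
                    "type": "LinkedServiceReference",
                },
                "typeProperties": {
                    "pythonFile": "dbfs:/scripts/etl_job.py",
                    "parameters": ["--run-date", "2019-05-01"],
                },
            }
        ]
    },
}
```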

For more information: