Learn what Spark is and how to use it in Big Data Clusters.

Time index

  • [00:00] Introduction
  • [00:30] One-sentence definition of Spark
  • [00:47] Storing Big Data
  • [01:44] What is Spark?
  • [02:35] Language choice
  • [03:27] Unified compute engine
  • [04:57] Spark with SQL Server
  • [05:47] Learning more
  • [06:10] Wrap-up

Data Lake Storage Gen2 is the best storage solution for big data analytics in Azure. With its Hadoop-compatible access, it is a perfect fit for existing platforms like Databricks, Cloudera, Hortonworks, Hadoop, HDInsight, and many more. Take advantage of both blob storage and data lake in one service!

In this video, Azure 4 Everyone introduces Azure Data Lake Storage: what it is, how it works, and how you can leverage it in your big data workloads. I will also explain the differences between Blob storage and ADLS.

Sample code from demo: https://pastebin.com/ee7ULpwx

Next steps for you after watching the video
1. Azure Data Lake Storage documentation
2. Transform data using Databricks and ADLS demo tutorial
– https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse
3. More on multi-protocol access
– https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-multi-protocol-access
4. Read more on ACL
– https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control

Cloud Academy has an intro piece on Apache Spark on Azure Databricks.

Apache Spark is an open-source framework for doing big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it does in-memory processing, which can be orders of magnitude faster than the disk-based processing that MapReduce uses. There are plenty of other differences between the two systems, as well, but we don’t need to go into the details here.
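
To make the disk-versus-memory distinction concrete, here is a toy sketch in plain Python (not Spark or Hadoop code). It runs the same two-stage word count twice: once writing the intermediate result to a temporary file between stages, roughly the way MapReduce materializes stage output, and once handing it along in memory. The function names and sample lines are made up for illustration, and the example says nothing about real-world speedups.

```python
import json
import os
import tempfile
from collections import Counter


def tokenize(lines):
    """Stage 1: split each line into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count(words):
    """Stage 2: count how often each word occurs."""
    return dict(Counter(words))


def word_count_via_disk(lines):
    """Write stage 1 output to disk, then read it back for stage 2."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
        json.dump(tokenize(lines), handle)
        path = handle.name
    try:
        with open(path) as handle:
            words = json.load(handle)
    finally:
        os.remove(path)  # clean up the intermediate file
    return count(words)


def word_count_in_memory(lines):
    """Hand stage 1 output straight to stage 2 without touching disk."""
    return count(tokenize(lines))


sample = ["Spark keeps data in memory", "MapReduce writes data to disk"]
disk_result = word_count_via_disk(sample)
memory_result = word_count_in_memory(sample)
print(disk_result == memory_result)  # True: same answer, different route
print(memory_result["data"])         # 2
```

Both paths give the same counts; the only difference is the extra round trip through storage, which is the cost that in-memory processing avoids between stages.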

In this talk, Andrei Varanoch demonstrates a blueprint for a Lambda Architecture implementation in Microsoft Azure, with Azure Databricks, a PaaS Spark offering, as a key component. The term “Lambda Architecture” stands for a generic, scalable, and fault-tolerant data processing architecture. As the hyperscale cloud providers now offer various PaaS services for data ingestion, storage, and processing, the need for a revised, cloud-native implementation of the Lambda Architecture is arising.
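
The talk's own blueprint is not reproduced here. As a rough sketch of the general pattern: a Lambda Architecture is usually described as a batch layer that recomputes views from all historical data, a speed layer that covers data that arrived since the last batch run, and a serving layer that merges the two. The plain-Python example below uses hypothetical page-view events and function names, and it stands in for what would be separate services in a real deployment.

```python
from collections import Counter


def build_batch_view(historical_events):
    """Batch layer: recompute page-view totals from the full historical log."""
    return Counter(event["page"] for event in historical_events)


def build_speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch_view, speed_view):
    """Serving layer: merge both views to answer a query."""
    return batch_view[page] + speed_view[page]


# Hypothetical sample data: three already-batched events, two fresh ones.
historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
recent = [{"page": "/home"}, {"page": "/pricing"}]

batch_view = build_batch_view(historical)
speed_view = build_speed_view(recent)
print(query("/home", batch_view, speed_view))     # 3
print(query("/pricing", batch_view, speed_view))  # 1
```

The merge step is what lets queries see recent data without waiting for the next full batch recomputation.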

Apache Spark is 10 years old, and this article in Analytics India Magazine explores what led to Spark’s widespread adoption and what will keep it going into the future.

Spark has been dubbed the official “in-memory replacement for MapReduce”, the disk-based computational engine at the heart of early Hadoop clusters. Spark took off because it reflects the shift in processing paradigm toward more memory-intensive pipelines: if your cluster has a decent amount of memory, processing in Spark will be faster, and its API is simpler than MapReduce’s. Spark is also fast because the processing time of most operations (including reads) decreases roughly linearly with the number of machines, since the work is distributed.
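
As a back-of-the-envelope illustration of "roughly linearly", the sketch below computes an idealized runtime by dividing a single-machine runtime by the number of machines. The function name and the 600-second figure are assumptions chosen for the example; real jobs fall short of this ideal because of shuffles, coordination overhead, and data skew.

```python
def ideal_runtime(single_machine_seconds, machines):
    """Idealized runtime if work divides perfectly evenly across machines."""
    if machines < 1:
        raise ValueError("machines must be at least 1")
    return single_machine_seconds / machines


# Hypothetical job that takes 600 seconds on one machine.
for machines in (1, 2, 10):
    print(machines, ideal_runtime(600.0, machines))
```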