Big Data Engineering closely examines the Spark Standalone Architecture.

Apache Spark has a well-defined, layered architecture in which all the Spark components and layers are loosely coupled. This architecture is further integrated with various extensions and libraries. Apache Spark's architecture is based on two main abstractions (illustrated in the sketch after the list):
– Resilient Distributed Dataset (RDD)
– Directed Acyclic Graph (DAG)
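To make these abstractions concrete, here is a minimal PySpark sketch (the app name and numbers are arbitrary, assuming a local Spark installation) showing how lazy transformations on an RDD build up a DAG that is only executed when an action runs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dag-demo").getOrCreate()
sc = spark.sparkContext

# Each transformation adds a node to the DAG; nothing runs until an action is called
rdd = sc.parallelize(range(1, 1001))          # RDD of 1..1000
squares = rdd.map(lambda x: x * x)            # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)  # transformation (lazy)

# The action triggers the scheduler to compile the DAG into stages and tasks
print(evens.count())

# toDebugString shows the lineage (the logical DAG) Spark has recorded
print(evens.toDebugString().decode())
```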

This talk on the Databricks YouTube channel presents a web application that calculates real-time health scores at very high speed using Spark on Kubernetes.

A health score represents a machine’s lifetime, and it is commonly used as a benchmark for deciding whether to replace the machine with a new one for high-productivity maintenance. It is therefore important to observe the health scores of a large number of machines in a factory without delay.

To address this, BISTel applied stream processing with Spark and serves a real-time health score application.
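As a rough illustration of the pattern (not BISTel's actual code), the sketch below uses Spark Structured Streaming's built-in rate source as a stand-in for a sensor feed and computes a made-up health score on each micro-batch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("health-score-stream").getOrCreate()

# Built-in "rate" source stands in for a real sensor feed (e.g., Kafka)
sensors = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Hypothetical scoring: derive a pseudo health score from the streamed value
scores = sensors.withColumn("health_score", 100 - F.col("value") % 100)

query = (scores.writeStream
         .format("console")
         .outputMode("append")
         .trigger(processingTime="5 seconds")
         .start())
query.awaitTermination()
```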

Chris Seferlis introduces us to the newly added Apache Spark Pools in Azure Synapse Analytics for Big Data, Machine Learning, and Data Processing needs.

From the description:

In this video I introduce the newly added Apache Spark Pools in Azure Synapse Analytics for Big Data, Machine Learning, and Data Processing needs. I give an overview of what Spark is and where it came from, why the Synapse team added it to the suite of offerings, and some sample workloads for which you might use it.

Community adoption of Kubernetes (instead of YARN) as a scheduler for Apache Spark has been accelerating since the major improvements in the Spark 3.0 release.

Companies choose to run Spark on Kubernetes to use a single cloud-agnostic technology across their entire stack, and to benefit from improved isolation and resource sharing for concurrent workloads.

In this talk, the founders of Data Mechanics, a serverless Spark platform powered by Kubernetes, will show how to easily get started with Spark on Kubernetes.

We will go through an end-to-end example of building, deploying, and maintaining a data pipeline. This will be a code-heavy session with many tips to help beginner and intermediate Spark developers succeed with Spark on Kubernetes, plus live demos running on the Data Mechanics platform.

Included topics (a minimal configuration sketch follows the list):
– Setting up your environment (data access, node pools)
– Sizing your applications (pod sizes, dynamic allocation)
– Boosting your performance through critical disk and I/O optimizations
– Monitoring your application logs and metrics for debugging and reporting
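As a hedged illustration of what such a setup can look like (the API server address, container image, and sizing values below are placeholders, not Data Mechanics specifics), a PySpark session can be pointed at a Kubernetes cluster with a handful of spark.kubernetes.* and dynamic-allocation settings:

```python
from pyspark.sql import SparkSession

# Placeholder values: replace the API server URL and image with your own
spark = (
    SparkSession.builder
    .master("k8s://https://<api-server-host>:443")  # Kubernetes API server (placeholder)
    .appName("spark-on-k8s-demo")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.0.0")  # placeholder image
    .config("spark.executor.instances", "2")    # initial executor pod count
    .config("spark.executor.memory", "4g")      # pod sizing
    .config("spark.executor.cores", "2")
    .config("spark.dynamicAllocation.enabled", "true")  # let Spark scale executors
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # required on k8s (Spark 3.0+)
    .getOrCreate()
)

print(spark.range(1_000_000).count())
```

In cluster mode the same settings are typically passed to spark-submit rather than set in the driver program.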

This talk from the Databricks YouTube channel is about date-time processing in Spark 3.0: its API and the implementation changes made since Spark 2.4.

In particular, it covers the following topics (a short example follows the list):

  1. Definition and internal representation of dates/timestamps in Spark SQL, and comparisons of the Spark 3.0 date-time API with previous versions and other DBMSs.
  2. Date/timestamp functions of Spark SQL: nuances of behavior and details of implementation, plus use cases and corner cases of the date-time API.
  3. Migration from the hybrid calendar (Julian and Gregorian calendars) to Proleptic Gregorian calendar in Spark 3.0.
  4. Parsing of date/timestamp strings, saving and loading date/time data via Spark’s datasources.
  5. Support of Java 8 time API in Spark 3.0.
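As a small, hedged example of the API in question (the column names and values are made up), the snippet below parses strings into TIMESTAMP and DATE values against an explicit session time zone and applies a date function:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("datetime-demo").getOrCreate()

# Timestamps are resolved against the session time zone
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.createDataFrame([("2020-07-01 12:00:00",)], ["ts_str"])

result = df.select(
    F.to_timestamp("ts_str").alias("ts"),       # string -> TIMESTAMP
    F.to_date(F.lit("2020-07-01")).alias("d"),  # string -> DATE (proleptic Gregorian in Spark 3.0)
)
result.select("ts", "d", F.date_add("d", 7).alias("d_plus_week")).show()
```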

XGBoost is one of the most popular machine learning libraries, and its Spark integration enables distributed training on a cluster of servers.

This talk will cover the recent progress on XGBoost and its GPU acceleration via Jupyter notebooks on Databricks. 

Spark XGBoost has been enhanced to train large datasets with GPUs. Training data can now be loaded in chunks, and the XGBoost DMatrix is built up incrementally with compression. The compressed DMatrix data can be stored in GPU memory or external memory/disk. These changes enable us to train models with datasets beyond the GPU memory limit. A gradient-based sampling algorithm with external memory has also been introduced to achieve comparable accuracy and improved training performance on GPUs.
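As a minimal single-node sketch of the GPU code path (assuming an XGBoost 1.x build with CUDA support; the random data is purely illustrative, and the distributed XGBoost4J-Spark integration accepts the same tree_method parameter):

```python
import numpy as np
import xgboost as xgb

# Illustrative random data; real workloads would load a large dataset in chunks
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",  # histogram-based training on the GPU
    "max_depth": 6,
}
model = xgb.train(params, dtrain, num_boost_round=50)
```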

XGBoost has recently added new GPU kernels for learning-to-rank (LTR) tasks. It provides several algorithms: pairwise ranking and LambdaRank with NDCG or MAP. These GPU kernels enable a 5x speedup on LTR model training with the largest public LTR dataset (MSLR-Web). We have integrated Spark XGBoost with the RAPIDS cuDF library to achieve end-to-end GPU acceleration on Spark 2.x and Spark 3.0.
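A hedged sketch of the LTR setup (again single-node Python for brevity; the query groups and relevance labels are invented): the ranking objectives take grouped training data, with one group per query:

```python
import numpy as np
import xgboost as xgb

# Two queries with 3 and 2 candidate documents; labels are graded relevance
X = np.random.rand(5, 10)
relevance = np.array([2, 1, 0, 1, 0])

dtrain = xgb.DMatrix(X, label=relevance)
dtrain.set_group([3, 2])  # documents per query

params = {
    "objective": "rank:ndcg",   # LambdaRank with NDCG; rank:pairwise and rank:map also exist
    "tree_method": "gpu_hist",  # run the LTR kernels on the GPU
}
ranker = xgb.train(params, dtrain, num_boost_round=20)
```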

Databricks explores the power of Horovod and what it means for data scientists and AI engineers.

The newly introduced Horovod Spark Estimator API enables TensorFlow and PyTorch models to be trained directly on Spark DataFrames, leveraging Horovod’s ability to scale to hundreds of GPUs in parallel, without any specialized code for distributed training. With the new accelerator-aware scheduling and columnar processing APIs in Apache Spark 3.0, a production ETL job can hand off data to Horovod running distributed deep learning training on GPUs within the same pipeline.
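Based on Horovod's documented Spark Estimator API, a sketch of what that hand-off can look like (train_df and test_df are assumed Spark DataFrames with "features" and "label" columns; the store path and sizes are placeholders):

```python
import tensorflow as tf
from horovod.spark.common.store import Store
from horovod.spark.keras import KerasEstimator

# Store for checkpoints and intermediate training data (placeholder path)
store = Store.create("/tmp/horovod-store")

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])

estimator = KerasEstimator(
    num_proc=4,                 # number of parallel Horovod workers
    store=store,
    model=model,
    optimizer=tf.keras.optimizers.Adam(),
    loss="mse",
    feature_cols=["features"],  # assumed DataFrame columns
    label_cols=["label"],
    batch_size=64,
    epochs=5,
)

# Training runs directly on the Spark DataFrame; inference is a transform
# keras_model = estimator.fit(train_df)
# predictions = keras_model.transform(test_df)
```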

Databricks live-streamed this interview with Matei Zaharia, an assistant professor of computer science at Stanford and co-founder and Chief Technologist of Databricks, the data and AI platform startup.

During his Ph.D., Matei started the Apache Spark project, which is now one of the most widely used frameworks for distributed data processing. He also co-created other widely used data and AI software, such as MLflow, Apache Mesos, and Spark Streaming.