XGBoost is one of the most popular machine learning libraries, and its Spark integration enables distributed training on a cluster of servers.

This talk will cover the recent progress on XGBoost and its GPU acceleration via Jupyter notebooks on Databricks. 

Spark XGBoost has been enhanced to train on large datasets with GPUs. Training data can now be loaded in chunks, and the XGBoost DMatrix is built up incrementally with compression. The compressed DMatrix data can be stored in GPU memory or in external memory on disk. These changes make it possible to train models on datasets that exceed the GPU memory limit. A gradient-based sampling algorithm with external memory has also been introduced, achieving comparable accuracy and improved training performance on GPUs.
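For context, here is a hedged sketch of what external-memory GPU training with gradient-based sampling looks like in the XGBoost Python API (not the talk's code; the file name is a placeholder, and the exact cache-file syntax varies by XGBoost version):

```python
# A sketch of external-memory GPU training in XGBoost. "train.libsvm" is a
# placeholder; the '#dtrain.cache' suffix asks XGBoost to build the DMatrix
# incrementally, spilling compressed pages to external memory/disk.
import xgboost as xgb

dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")

params = {
    "tree_method": "gpu_hist",            # GPU-accelerated histogram algorithm
    "sampling_method": "gradient_based",  # sample rows by gradient magnitude
    "subsample": 0.2,                     # train on ~20% of rows per round
    "objective": "binary:logistic",
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```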

XGBoost has recently added new GPU kernels for learning-to-rank (LTR) tasks, supporting several algorithms: pairwise ranking and LambdaRank with NDCG or MAP. These GPU kernels enable a 5x speedup in LTR model training on the largest public LTR dataset (MSLR-Web). We have also integrated Spark XGBoost with the RAPIDS cuDF library to achieve end-to-end GPU acceleration on Spark 2.x and Spark 3.0.
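For illustration, a hedged sketch of an LTR training run with the XGBoost Python API (synthetic placeholder data; the Spark pipeline covered in the talk differs):

```python
# A sketch of GPU-accelerated learning to rank with XGBoost.
# The data below is synthetic; real LTR data arrives with per-query groups.
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 20)        # features for 1,000 documents
y = np.random.randint(0, 5, 1000)   # graded relevance labels 0-4
group = [100] * 10                  # 10 queries, 100 documents each

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(group)             # tell XGBoost where query boundaries are

params = {
    "objective": "rank:ndcg",   # alternatives: rank:pairwise, rank:map
    "tree_method": "gpu_hist",  # run training on the GPU
    "eval_metric": "ndcg",
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```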

Databricks explores the power of Horovod and what it means for data scientists and AI engineers.

The newly introduced Horovod Spark Estimator API enables TensorFlow and PyTorch models to be trained directly on Spark DataFrames, leveraging Horovod’s ability to scale to hundreds of GPUs in parallel, without any specialized code for distributed training. With the new accelerator-aware scheduling and columnar processing APIs in Apache Spark 3.0, a production ETL job can hand off data to Horovod running distributed deep learning training on GPUs within the same pipeline.
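A hedged sketch of what the Estimator API looks like with a Keras model (the column names, store path, and model architecture are placeholder assumptions, not the talk's code):

```python
# A minimal sketch of the Horovod Spark Estimator API with a Keras model.
# The "features"/"label" columns and the store path are placeholders.
import tensorflow as tf
import horovod.spark.keras as hvd_keras
from horovod.spark.common.store import Store

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])

store = Store.create("/tmp/hvd_store")  # intermediate data and checkpoints

estimator = hvd_keras.KerasEstimator(
    num_proc=4,                         # parallel training processes (GPUs)
    store=store,
    model=model,
    optimizer=tf.keras.optimizers.Adam(),
    loss="mse",
    feature_cols=["features"],
    label_cols=["label"],
    batch_size=64,
    epochs=5,
)

# fit() trains directly on a Spark DataFrame and returns a Spark ML model:
# keras_model = estimator.fit(train_df)
```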

Databricks live streamed this interview with Matei Zaharia, an assistant professor at Stanford CS and co-founder and Chief Technologist of Databricks, the data and AI platform startup.

During his Ph.D., Matei started the Apache Spark project, which is now one of the most widely used frameworks for distributed data processing. He also co-created other widely used data and AI software such as MLflow, Apache Mesos, and Spark Streaming.

.NET for Apache Spark empowers developers with .NET experience or code bases to participate in the world of big data analytics.

In this episode, Brigit Murtaugh joins Rich to show us how to start processing data with .NET for Apache Spark.

Time index:

  • [01:01] – What is Apache Spark?
  • [02:33] – What are customers using Apache Spark for?
  • [03:50] – Why did we create .NET for Apache Spark?
  • [06:30] – Exploring GitHub data
  • [15:12] – Considering data processing in the real world
  • [18:26] – Analyzing continuous data streams


Databricks recently hosted this online tech talk on Delta Lake.

The General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) both aim to guarantee strong protection for individuals regarding their personal data, and they apply to businesses that collect, use, or share consumer data, whether the information was obtained online or offline. Remaining compliant is a top priority for companies, which are spending significant time and resources on GDPR and CCPA compliance.

Your organization may manage hundreds of terabytes of personal information in the cloud. Bringing these datasets into GDPR and CCPA compliance is of paramount importance, but it can be a big challenge, especially for larger datasets stored in data lakes.
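In Delta Lake terms, a "right to be forgotten" request typically boils down to a delete followed by a vacuum. A hedged sketch (the table path and predicate are placeholders; an active SparkSession named spark is assumed):

```python
# A sketch of deleting a user's personal data from a Delta table.
# The path and user_id predicate are placeholder assumptions, and `spark`
# is assumed to be an existing SparkSession (e.g., a Databricks notebook).
from delta.tables import DeltaTable

users = DeltaTable.forPath(spark, "/mnt/delta/users")

# Delete the requesting user's rows; Delta rewrites only the affected files.
users.delete("user_id = 'user-123'")

# VACUUM removes old data files that still contain the deleted rows.
# Retention 0 is shown for illustration; the default retention is 7 days,
# and shorter values require disabling the retention-duration safety check.
users.vacuum(0)
```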

Databricks hosted this webinar introducing Apache Spark, the platform that Databricks is based upon.

Abstract: scikit-learn is one of the most popular open-source machine learning libraries among data science practitioners.

This workshop will walk through what machine learning is, the different types of machine learning, and how to build a simple machine learning model. This workshop focuses on the techniques of applying and evaluating machine learning methods, rather than the statistical concepts behind them. We will be using data released by the New York Times (https://github.com/nytimes/covid-19-data).
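As a flavor of what the workshop builds, here is a minimal hedged sketch (not the workshop's actual notebook; the us.csv file within the NYT repository's raw URL is an assumption):

```python
# A minimal sketch: fit a simple model to the NYT COVID-19 national data.
# The raw us.csv URL is assumed; this is not the workshop's actual code.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv(
    "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv",
    parse_dates=["date"],
)
df["day"] = (df["date"] - df["date"].min()).dt.days

X = df[["day"]]   # single feature: days since the first record
y = df["cases"]   # target: cumulative case count

model = LinearRegression().fit(X, y)
print(f"R^2 on training data: {model.score(X, y):.3f}")
```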

Basic prior experience with Python and pandas is required.

Previous webinars in the series:

  • Watch Part 1, Intro to Python: https://youtu.be/HBVQAlv8MRQ
  • Watch Part 2, Data Analysis with pandas: https://youtu.be/riSgfbq3jpY
  • Watch Part 3, Machine Learning: https://youtu.be/g103iO-izoI

Databricks recently held a webinar on how they worked with Virgin Hyperloop One engineers.

They discuss the goals, implementation, and outcome of moving from pandas code to Koalas code and using MLflow. Lots of code, notebooks, demos, etc.

Come hear Patryk Oleniuk, Software Engineer at Virgin Hyperloop (VHO), discuss how VHO has dramatically reduced processing time by 95% while changing less than 1% of its previously single-threaded, pandas-based Python code. Attendees of this webinar will learn:

  • How VHO leverages public and private transportation data to optimize Hyperloop design
  • How to ‘Sparkify’ (scale) your pandas code by using ‘Koalas’ with minimal code changes (see the sketch after this list)
  • How to use ‘Koalas’ and MLflow for sweeping machine learning models and experiment results

Featured Speakers:

  • Patryk Oleniuk, Lead Data Engineer, Virgin Hyperloop One
  • Yifan Cao, Senior Product Manager, Databricks
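The core idea behind the migration is that Koalas mirrors the pandas API on top of Spark, so most changes are import-level. A minimal hedged sketch (not the webinar's actual notebook, which is linked under Resources below; the file and column names are placeholders):

```python
# A sketch of "Sparkifying" pandas code with Koalas. The CSV path and the
# "route"/"duration_min" column names are placeholder assumptions.
import databricks.koalas as ks

# pandas equivalent: pd.read_csv("rides.csv")
kdf = ks.read_csv("rides.csv")  # Koalas DataFrame backed by Spark

# Familiar pandas-style operations now run distributed on the cluster.
summary = kdf.groupby("route")["duration_min"].mean().sort_values()
print(summary.head())
```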

Resources:

Slides: https://www.slideshare.net/databricks/from-pandas-to-koalas-reducing-timetoinsight-for-virgin-hyperloops-data

Koalas Notebook: https://pages.databricks.com/rs/094-YMS-629/images/koalas_webinar_code%20-%20Copy.html