In the last several months, MLflow has introduced significant platform enhancements that simplify machine learning lifecycle management.

Expanded autologging capabilities, including a new integration with scikit-learn, have streamlined the instrumentation and experimentation process in MLflow Tracking.
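For instance, enabling scikit-learn autologging is a one-line change. A minimal sketch (the estimator and dataset here are illustrative):

```python
# Minimal sketch of scikit-learn autologging with MLflow Tracking.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

mlflow.sklearn.autolog()  # instrument all subsequent scikit-learn training

X, y = load_diabetes(return_X_y=True)

with mlflow.start_run():
    # Autologging captures the estimator's parameters, training metrics,
    # and the fitted model without any explicit log_* calls.
    RandomForestRegressor(n_estimators=100, max_depth=6).fit(X, y)
```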

Additionally, schema management functionality has been incorporated into MLflow Models, enabling users to seamlessly inspect and control model inference APIs for batch and real-time scoring. 
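To illustrate the schema functionality, here is a hedged sketch of attaching a model signature at logging time (the estimator and data are placeholders):

```python
# Sketch: infer an input/output schema and attach it to a logged model.
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = LinearRegression().fit(X, y)

# The signature records the column names and types expected at inference time.
signature = infer_signature(X, model.predict(X))

with mlflow.start_run():
    mlflow.sklearn.log_model(model, "model", signature=signature)
```

The recorded signature can then be used to validate inputs for both batch and real-time scoring.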

Taking deep learning models to production and doing so reliably is one of the next frontiers of MLOps.

With the advent of Redis modules and the availability of C APIs for the major deep learning frameworks, it is now possible to turn Redis into a reliable runtime for deep learning workloads, providing a simple solution for a model serving microservice.

RedisAI ships with several notable features, including support for multiple frameworks, CPU and GPU backends, auto-batching, and DAG execution, with automatic monitoring capabilities coming soon. In this talk, we’ll explore some of these features of RedisAI and see how easy it is to integrate MLflow and RedisAI to build an efficient productionization pipeline.
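As a taste of what serving through RedisAI looks like, here is a minimal sketch using the redisai Python client (it assumes a Redis server with the RedisAI module loaded and a serialized TorchScript model; method names such as modelstore/modelexecute vary across client versions, with older clients using modelset/modelrun):

```python
# Sketch: serve a TorchScript model from RedisAI.
import numpy as np
import redisai as rai

con = rai.Client(host="localhost", port=6379)

# Load the serialized model into RedisAI (device could also be "GPU").
with open("model.pt", "rb") as f:  # placeholder model file
    con.modelstore("my_model", "TORCH", "CPU", f.read())

# Score a request: set the input tensor, execute the model, read the output.
con.tensorset("in", np.random.rand(1, 4).astype(np.float32))
con.modelexecute("my_model", inputs=["in"], outputs=["out"])
print(con.tensorget("out"))
```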

[Originally aired as part of the Data+AI Online Meetup (https://www.meetup.com/data-ai-online/) and Bay Area MLflow meetup]

Visualizations are a powerful tool for communicating results to end-users and stakeholders. Their development and life-cycle management are no less challenging than the underlying processes producing the results they communicate.

Databricks explains how this approach is helping during the COVID-19 pandemic.

Our team overcomes these challenges by leveraging Vega-Lite to encode visualizations as JSON objects and using the MLflow Model Registry as a visualization registry. During this presentation, we will walk through the process of creating a multi-layered Vega-Lite visualization using COVID-19 data and geodata, then managing it with the MLflow Model Registry.
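As a rough sketch of the pattern (the wrapper class and names below are illustrative, not the team’s actual code), a Vega-Lite spec is just JSON, so it can be wrapped in an MLflow pyfunc model and registered like any other model:

```python
# Sketch: treat a Vega-Lite JSON spec as a registerable MLflow "model".
import json
import mlflow
import mlflow.pyfunc

# A minimal geo-style Vega-Lite spec; fields and data URL are hypothetical.
vega_lite_spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {"url": "covid_cases.json"},
    "mark": "circle",
    "encoding": {
        "longitude": {"field": "lon", "type": "quantitative"},
        "latitude": {"field": "lat", "type": "quantitative"},
        "size": {"field": "cases", "type": "quantitative"},
    },
}

class VegaLiteModel(mlflow.pyfunc.PythonModel):
    """Wraps a visualization spec so the registry can version it."""
    def __init__(self, spec):
        self.spec = spec

    def predict(self, context, model_input):
        return json.dumps(self.spec)  # consumers render the returned spec

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        "visualization",
        python_model=VegaLiteModel(vega_lite_spec),
        registered_model_name="covid-map",  # hypothetical registry name
    )
```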

Chris Seferlis discusses one of the lesser known and newer Data Services in Azure, Data Explorer.

If you’re looking to run extremely fast queries over large sets of log and IoT data, this may be the right tool for you. He also discusses where it’s not a replacement for Azure Synapse or Azure Databricks, but works nicely alongside them in the overall architecture of the Azure Data Platform.
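For a flavor of what querying Data Explorer looks like from Python, here is an illustrative sketch using the azure-kusto-data package (cluster, database, and table names are placeholders, and import paths can vary by package version):

```python
# Sketch: run a KQL query against Azure Data Explorer from Python.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://mycluster.kusto.windows.net"  # placeholder cluster URL
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster)
client = KustoClient(kcsb)

# A typical KQL query over log/IoT-style telemetry.
query = """
Telemetry
| where Timestamp > ago(1h)
| summarize avg(Temperature) by bin(Timestamp, 5m), DeviceId
"""
response = client.execute("mydatabase", query)
for row in response.primary_results[0]:
    print(row)
```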

PyTorch, the popular open-source ML framework, has continued to evolve rapidly since the introduction of PyTorch 1.0, which brought an accelerated workflow from research to production.

In this video, take a deep dive into some of the most important new advances, including model-parallel distributed training, model optimization, and on-device deployment, as well as the latest libraries that support production-scale deployment working in concert with MLflow.
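As one concrete slice of that research-to-production path, here is a hedged sketch of scripting a model with TorchScript and tracking it with MLflow (the model itself is a placeholder):

```python
# Sketch: script a PyTorch model for deployment and log it with MLflow.
import torch
import torch.nn as nn
import mlflow
import mlflow.pytorch

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

# TorchScript produces a serializable, Python-independent module
# suitable for C++ and mobile runtimes.
scripted = torch.jit.script(model)

with mlflow.start_run():
    mlflow.pytorch.log_model(scripted, "model")  # scripted modules are supported
```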

It’s all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.

However, the choice of file format has drastic implications for everything from the ongoing stability to the compute cost of Spark jobs.

These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions.

This session from the Spark + AI Summit introduces and concisely explains the key concepts behind some of the most widely used file formats in the Spark ecosystem – namely Parquet, ORC, and Avro.

From the abstract:

We’ll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then deep dive into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), stripes and footers of ORC, and the schema evolution capabilities of Avro. We’ll continue by describing the specific SparkConf / SQLConf settings that developers can use to tune the behavior of these file formats. We’ll conclude with specific industry examples of the impact of the file format on the performance or stability of a job (with examples around incorrect partition pruning introduced by a Parquet bug), and look forward to emerging technologies (Apache Arrow).
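To make that concrete, here is a small PySpark sketch of the kinds of settings and layout choices involved (the config keys are standard Spark settings; paths and data are placeholders):

```python
# Sketch: tune format-related SQLConf settings and write partitioned Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-formats").getOrCreate()

spark.conf.set("spark.sql.parquet.filterPushdown", "true")  # Parquet predicate pushdown
spark.conf.set("spark.sql.orc.filterPushdown", "true")      # same for ORC
spark.conf.set("spark.sql.parquet.mergeSchema", "false")    # skip costly schema merging

df = spark.read.json("events.json")                   # placeholder input
df.write.partitionBy("date").parquet("/data/events")  # layout enables pruning

# Only partitions matching the filter are scanned; the predicate is also
# pushed down against Parquet row-group statistics.
spark.read.parquet("/data/events").filter("date = '2020-06-01'").show()
```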

After this presentation, attendees should understand the core concepts behind the prevalent file formats, the relevant format-specific settings, and how to select the correct file format for their jobs. This presentation is relevant to the Spark + AI Summit because, as more AI/ML workflows move into the Spark ecosystem (especially IO-intensive deep learning), leveraging the correct file format is paramount to performant model training.

XGBoost is one of the most popular machine learning libraries, and its Spark integration enables distributed training on a cluster of servers.

This talk will cover the recent progress on XGBoost and its GPU acceleration via Jupyter notebooks on Databricks. 

Spark XGBoost has been enhanced to train on large datasets with GPUs. Training data can now be loaded in chunks, and the XGBoost DMatrix is built up incrementally with compression. The compressed DMatrix data can be stored in GPU memory or in external memory/disk. These changes enable us to train models on datasets beyond the GPU memory limit. A gradient-based sampling algorithm with external memory has also been introduced to achieve comparable accuracy and improved training performance on GPUs.
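A minimal sketch of what GPU training with gradient-based sampling looks like in the core XGBoost API (synthetic data; requires a CUDA-enabled XGBoost build):

```python
# Sketch: GPU histogram training with gradient-based sampling.
import numpy as np
import xgboost as xgb

X = np.random.rand(100_000, 50).astype(np.float32)
y = np.random.rand(100_000).astype(np.float32)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "gpu_hist",            # GPU histogram algorithm
    "sampling_method": "gradient_based",  # sample rows by gradient magnitude
    "subsample": 0.2,                     # aggressive sampling stays accurate here
    "objective": "reg:squarederror",
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```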

XGBoost has recently added new GPU kernels for learning-to-rank (LTR) tasks, supporting several algorithms: pairwise rank and LambdaRank with NDCG or MAP. These GPU kernels enable a 5x speedup on LTR model training with the largest public LTR dataset (MSLR-WEB). We have integrated Spark XGBoost with the RAPIDS cuDF library to achieve end-to-end GPU acceleration on Spark 2.x and Spark 3.0.
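For reference, a hedged sketch of an LTR setup with those objectives (synthetic data; groups mark query boundaries):

```python
# Sketch: GPU-accelerated learning-to-rank with XGBoost.
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 20)
y = np.random.randint(0, 5, size=1000)  # graded relevance labels
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([100] * 10)            # ten queries of 100 documents each

params = {
    "objective": "rank:ndcg",   # alternatives: rank:pairwise, rank:map
    "tree_method": "gpu_hist",  # run the LTR kernels on the GPU
    "eval_metric": "ndcg",
}
ranker = xgb.train(params, dtrain, num_boost_round=50)
```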

Solving a data science problem is about more than making a model.

It entails data cleaning, exploration, modeling and tuning, production deployment, and workflows governing each of these steps.

Databricks has a great video on how MLflow fits into the data science process.

In this simple example, we’ll take a look at how health data can be used to predict life expectancy. It starts with data engineering in Apache Spark, then moves to data exploration, and model tuning and logging with Hyperopt and MLflow. It continues with examples of how the Model Registry governs model promotion, and simple deployment to production with MLflow as a job or dashboard.
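A hedged sketch of the tuning-and-logging step in that workflow (the model, search space, and metric are illustrative):

```python
# Sketch: Hyperopt proposes hyperparameters; each trial logs to MLflow.
import mlflow
from hyperopt import STATUS_OK, fmin, hp, tpe
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

def objective(params):
    with mlflow.start_run(nested=True):
        model = GradientBoostingRegressor(
            max_depth=int(params["max_depth"]),
            learning_rate=params["learning_rate"],
        )
        score = cross_val_score(model, X, y, cv=3).mean()
        mlflow.log_params(params)
        mlflow.log_metric("cv_r2", score)
        return {"loss": -score, "status": STATUS_OK}

search_space = {
    "max_depth": hp.quniform("max_depth", 2, 8, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

with mlflow.start_run():
    best = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=20)
```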