MLflow is an MLOps tool that enables data scientists to quickly productionize their machine learning projects. To achieve this, MLflow has four major components: Tracking, Projects, Models, and the Model Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps.

MLflow is designed to work with any machine learning library and requires minimal changes to integrate into an existing codebase.
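As an illustration, here is a minimal sketch of what that integration can look like with the Tracking API; the scikit-learn model and the parameter and metric names are placeholders, not anything prescribed by MLflow:

```python
# Minimal MLflow tracking sketch wrapped around ordinary training code.
# The model, params, and metric names are placeholder assumptions.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    # A few extra lines are all the integration needs: log params, a metric, and the model.
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")
```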

In this video, learn about the common pain points machine learning developers face, such as experiment tracking, reproducibility, deployment tooling, and model versioning.

No need for a DeLorean: you can time travel with Databricks.

While you can use the features of Delta Lake, what is actually happening under the covers? We will walk you through the concepts of ACID transactions, the Delta time machine (time travel), and the transaction protocol, and show how Delta brings reliability to data lakes. Organizations can finally standardize on a clean, centralized, versioned big data repository in their own cloud storage for analytics.

  • Data engineers can simplify their pipelines and roll back bad writes (see the sketch below).
  • Data scientists can manage their experiments better.
  • Data analysts can do easy reporting.
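To make the time machine concrete, here is a hedged sketch of time travel and rollback on a Delta table; the table path and version numbers are illustrative assumptions:

```python
# Hedged sketch of Delta Lake time travel and rollback on Databricks.
# The table path and version numbers are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/mnt/datalake/events"  # hypothetical Delta table location

# Read the table as it looked at an earlier version (time travel).
v5 = spark.read.format("delta").option("versionAsOf", 5).load(path)

# Or as of a timestamp.
as_of_date = (spark.read.format("delta")
              .option("timestampAsOf", "2021-06-01")
              .load(path))

# Roll back a bad write by restoring a previous version.
spark.sql(f"RESTORE TABLE delta.`{path}` TO VERSION AS OF 5")
```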

Here’s an interesting talk by Albert Franziu Cros on a CI/CD setup composed of a Spark Streaming job in K8s consuming from Kafka.

Over the last year, we have been moving from a batch-processing setup with Airflow on EC2 instances to a powerful and scalable setup using Airflow and Spark in K8s.

The need to keep up with technology changes, new community advances, and multidisciplinary teams forced us to design a solution that could run multiple Spark versions at the same time, avoiding duplicated infrastructure and simplifying deployment, maintenance, and development.

In our talk, we will cover our journey towards a CI/CD setup composed of a Spark Streaming job in K8s consuming from Kafka, using the Spark Operator and deploying with ArgoCD.
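For context, the core of such a job is a Structured Streaming query reading from Kafka; the sketch below is a generic example, with the broker, topic, and storage paths as assumptions rather than details from the talk:

```python
# Generic sketch of a Spark Structured Streaming job consuming from Kafka
# and writing to Delta. Broker, topic, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical broker
          .option("subscribe", "events")                     # hypothetical topic
          .option("startingOffsets", "latest")
          .load())

(events.select(col("key").cast("string"), col("value").cast("string"), "timestamp")
 .writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/events")    # hypothetical path
 .start("/mnt/datalake/raw_events"))                         # hypothetical path
```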

ENGEL, founded in 1945, is now the leading manufacturer of injection moulding machines on the global market.

Since then, the amount of data has grown immensely and has also become more and more heterogeneous due to newer generations of machine controls.

Taking a closer look at the conglomeration of each machine’s log files, one can find 13 different types of timestamps, different archive types, and more peculiarities of each control generation. Unsurprisingly, this has led to problems in automatically processing and analysing the data.
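As a hypothetical illustration of what that heterogeneity forces on a pipeline, a parser often has to try a list of known timestamp formats until one matches; the formats below are made up, not ENGEL’s actual thirteen:

```python
# Hypothetical normalisation of heterogeneous timestamps: try each known
# format until one parses. The formats are assumptions for illustration.
from datetime import datetime

KNOWN_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%d.%m.%Y %H:%M:%S",
    "%Y%m%dT%H%M%S",
]

def parse_timestamp(raw: str) -> datetime:
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised timestamp format: {raw!r}")

print(parse_timestamp("07.06.2021 13:45:00"))
```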

In this talk, you will explore how ENGEL managed to centralise this data in one place, how ENGEL set up a data pipeline to ingest batch-oriented data in a streaming fashion, and how ENGEL migrated its pipeline from an on-premise Hadoop setup to the cloud using Databricks.
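On Databricks, one common way to ingest batch-produced files in a streaming fashion is Auto Loader; the sketch below is a generic example, and the paths and file format are assumptions rather than details from the talk:

```python
# Generic Auto Loader sketch: incrementally pick up newly landed files and
# stream them into a Delta table. Paths and format are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

logs = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/schemas/machine_logs")  # hypothetical
        .load("/mnt/landing/machine_logs"))                                # hypothetical

(logs.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/machine_logs")            # hypothetical
 .start("/mnt/datalake/machine_logs"))                                     # hypothetical
```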

Machine learning suffers from a reproducibility crisis.

Deterministic machine learning is not only incredibly important for academia to verify research papers, but also for developers in enterprise scenarios.

Here’s a great video on how to address this shortcoming.

Because there are various causes of non-determinism in ML, especially when GPUs are in play, I conducted several experiments and identified the causes and their corresponding solutions (where available).
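For reference, a typical starting point is to pin every random seed and switch the framework to deterministic algorithms; the PyTorch sketch below shows the usual knobs, though whether they remove all GPU non-determinism depends on the ops and CUDA/cuDNN versions involved:

```python
# Hedged sketch of the usual determinism knobs in PyTorch.
import os
import random
import numpy as np
import torch

SEED = 42
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some deterministic CUDA ops

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True)  # raises an error on non-deterministic ops
```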

Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta.

However, in the real world we often find ourselves facing dynamically changing schemas, which pose a big challenge to incorporate without downtime.

In this presentation, we will show how we built a robust streaming ETL pipeline that can handle changing schemas and unseen event types with zero downtime. The pipeline can infer changed schemas, adjust the underlying tables, and create new tables and ingestion streams when it detects a new event type. We will show in detail how to infer the schemas on the fly, and how to track and store these schemas when you don’t have the luxury of a schema registry in the system.
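As a rough sketch of the idea (not the speakers’ exact pipeline), a schema can be inferred from a sample of raw JSON events and the underlying Delta table evolved on write; the event fields and table path below are assumptions:

```python
# Hedged sketch: infer a schema from raw JSON events on the fly and let Delta
# evolve the table on write. Fields and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sample_events = [
    '{"event_type": "click", "user_id": 1, "url": "/home"}',
    '{"event_type": "click", "user_id": 2, "url": "/pricing", "referrer": "ad"}',
]

# Infer the schema from a sample of raw JSON strings.
inferred = spark.read.json(spark.sparkContext.parallelize(sample_events))
inferred.printSchema()

# Delta can evolve the underlying table to the new schema on write.
(inferred.write
 .format("delta")
 .mode("append")
 .option("mergeSchema", "true")          # adds newly seen columns to the table
 .save("/mnt/datalake/events_click"))    # hypothetical per-event-type table
```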

With potentially hundreds of streams, it’s important how we deploy these streams and make them operational on Databricks.

This video on the Databricks YouTube channel presents a web application that calculates real-time health scores at very high speed using Spark on Kubernetes.

A health score represents a machine’s remaining lifetime and is commonly used as a benchmark when deciding whether to replace the machine with a new one to maintain high productivity. It is therefore very important to observe the health scores of the large number of machines in a factory without delay.

To cope with this issue, BISTel applied stream processing using Spark and now serves a real-time health score application.
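The general pattern here is a windowed streaming aggregation that turns raw sensor readings into a per-machine score; the sketch below is purely illustrative, and the columns, score formula, and Kafka source are assumptions rather than BISTel’s actual model:

```python
# Illustrative sketch: windowed streaming aggregation producing a toy per-machine score.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("machine_id", StringType()),
    StructField("vibration", DoubleType()),
    StructField("event_time", TimestampType()),
])

readings = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")  # hypothetical broker
            .option("subscribe", "sensor-readings")           # hypothetical topic
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# Toy "health score": lower average vibration over the last minute means a healthier machine.
scores = (readings
          .withWatermark("event_time", "2 minutes")
          .groupBy(window("event_time", "1 minute"), "machine_id")
          .agg((100 - avg("vibration")).alias("health_score")))

query = scores.writeStream.outputMode("update").format("console").start()
```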

With the rise of cloud-native, the conversation about infrastructure costs has seeped from R&D directors to every person in R&D:

  • “How much does a VM cost?”
  • “Can we use that managed service? How much will it cost us with our workload?”
  • “I need a stronger machine with more GPUs; how do we make it happen within the budget?”

When deciding on a big data/data lake strategy for a product, one of the main chapters is cost management.

On top of the budget for hiring technical people, we need a strategy for services and infrastructure costs. That includes the provider we want to work with, the different pricing tiers they offer, the system’s needs, the R&D needs, and each service’s pros and cons.