Machine learning suffers from a reproducibility crisis.

Deterministic machine learning is not only essential in academia, where it is needed to verify research papers, but also for developers working in enterprise scenarios.

Here’s a great video on how to address this shortcoming.

Because non-determinism in ML has many causes, especially when GPUs are involved, I conducted several experiments to identify each cause and its corresponding solution (where one exists).
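One cause that is fully under your control is uncontrolled random state. As a minimal, stdlib-only sketch (the `seeded_run` function is hypothetical; real frameworks such as PyTorch or TensorFlow each have their own seeding APIs and additional determinism flags, e.g. for cuDNN), pinning the seed makes two runs bit-identical:

```python
import random

SEED = 42  # pin one seed for everything that draws random numbers

def seeded_run(seed: int):
    """Simulate a 'training run' whose result depends only on the seed."""
    rng = random.Random(seed)
    # stand-ins for weight initialisation and data shuffling
    weights = [rng.random() for _ in range(5)]
    order = list(range(10))
    rng.shuffle(order)
    return weights, order

run_a = seeded_run(SEED)
run_b = seeded_run(SEED)
assert run_a == run_b  # identical seed -> identical "training run"
```

Seeding alone is not sufficient on GPUs, where non-deterministic kernels and floating-point reduction order also come into play, but it is the necessary first step.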

Building a curated data lake on real-time data with Delta Lake is an emerging data warehousing pattern.

In the real world, however, we often face dynamically changing schemas, which are a big challenge to incorporate without downtime.

In this presentation we will show how we built a robust streaming ETL pipeline that can handle changing schemas and unseen event types with zero downtime. The pipeline can infer changed schemas, adjust the underlying tables, and create new tables and ingestion streams when it detects a new event type. We will show in detail how to infer schemas on the fly and how to track and store them when you don’t have the luxury of a schema registry in the system.
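In a Spark pipeline the inference itself would typically lean on Spark's JSON schema inference and Delta's schema-evolution support; the plain-Python sketch below, with a hypothetical `SchemaTracker` class standing in for a persistent schema store, shows the core idea of inferring a schema per event type and recording a new version only when new fields appear:

```python
def infer_schema(event: dict) -> dict:
    """Map each field to its Python type name; nested objects recurse."""
    return {k: infer_schema(v) if isinstance(v, dict) else type(v).__name__
            for k, v in event.items()}

class SchemaTracker:
    """Hypothetical in-memory stand-in for a schema store: keeps a
    versioned schema history per event type and merges newly seen
    fields additively (in the spirit of Delta's schema merging)."""

    def __init__(self):
        self.schemas = {}  # event_type -> list of schema versions

    def observe(self, event_type: str, event: dict) -> bool:
        """Record an event; return True if the schema changed or is new."""
        new = infer_schema(event)
        history = self.schemas.setdefault(event_type, [])
        if not history:
            history.append(new)     # first sighting of this event type
            return True
        merged = {**history[-1], **new}  # additive merge of new fields
        if merged != history[-1]:
            history.append(merged)  # new version with the extra fields
            return True
        return False

tracker = SchemaTracker()
tracker.observe("click", {"user": "u1", "ts": 1})                 # new type
tracker.observe("click", {"user": "u2", "ts": 2, "page": "/"})    # new field
```

In the real pipeline, a `True` return is what would trigger adjusting the underlying table or spinning up a new ingestion stream.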

With potentially hundreds of streams, how we deploy these streams and make them operational on Databricks becomes just as important.