Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta.
In the real world, however, we often face dynamically changing schemas, which are a big challenge to incorporate without downtime.
In this presentation we will show how we built a robust streaming ETL pipeline that handles changing schemas and previously unseen event types with zero downtime. The pipeline infers changed schemas, adjusts the underlying tables, and creates new tables and ingestion streams when it detects a new event type. We will show in detail how to infer schemas on the fly, and how to track and store these schemas when you don't have the luxury of a schema registry in the system.
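As a rough illustration of the idea, the sketch below infers a flat schema from incoming JSON events, merges it into a per-event-type schema store, and registers a new "table" when an unseen event type arrives. The function names, the in-memory `schemas` dict (standing in for a registry), and the fall-back-to-string type widening are our own simplifications, not the talk's actual implementation:

```python
import json

def infer_schema(event: dict) -> dict:
    """Map each field to a simple type name (illustrative; flat events only)."""
    return {key: type(value).__name__ for key, value in event.items()}

def merge_schemas(current: dict, incoming: dict) -> dict:
    """Merge an inferred schema into the tracked one: new fields are added,
    conflicting types are widened to string as a safe fallback."""
    merged = dict(current)
    for field, dtype in incoming.items():
        if field not in merged:
            merged[field] = dtype          # schema evolution: new column
        elif merged[field] != dtype:
            merged[field] = "str"          # type conflict: widen to string
    return merged

# Tracked schemas per event type, standing in for a schema registry.
schemas: dict = {}

def handle_event(raw: str) -> None:
    event = json.loads(raw)
    event_type = event.pop("event_type")
    inferred = infer_schema(event)
    if event_type not in schemas:
        schemas[event_type] = inferred     # new event type: new table/stream
    else:
        schemas[event_type] = merge_schemas(schemas[event_type], inferred)

handle_event('{"event_type": "click", "user": "u1", "ts": 1}')
handle_event('{"event_type": "click", "user": "u2", "ts": 2, "page": "/home"}')
# schemas["click"] now also tracks the newly observed "page" field
```

In a Delta-based pipeline the same merge step would translate into an `ALTER TABLE` or a schema-evolving write, rather than an in-memory dict update.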
With potentially hundreds of streams, it also matters how we deploy these streams and keep them operational on Databricks.