When building streaming workloads on Databricks, it can be difficult to capture and understand the current structure of your source data. For example, what happens when you are ingesting JSON events from a vendor and the keys are sparsely populated or contain dynamic content? Ideally, data engineers want to “lock in” a target schema to minimize complexity and maximize performance for known access patterns. What do you do when your data sources just don’t cooperate with that vision? The first step is to quantify how far your current source data has drifted from your established Delta table. But how?
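
As a concrete starting point, one way to quantify that drift is to compare the keys observed in a sample of raw events against the columns the Delta table already captures. The sketch below is a minimal illustration, not the approach from the talk; the table name `events_bronze` and the landing path are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Columns the established Delta table already captures.
known_cols = set(spark.table("events_bronze").columns)

# Infer the top-level keys present in a recent sample of raw events.
raw = spark.read.json("s3://my-bucket/raw-events/")  # hypothetical path
observed_cols = set(raw.columns)

# Drift = keys arriving from the source that the table does not capture.
drifting = observed_cols - known_cols
print(f"{len(drifting)} unseen keys: {sorted(drifting)}")

# For sparsely populated sources, it also helps to measure how often
# each drifting key is actually non-null before deciding to promote it.
if drifting:
    raw.select(
        [(F.count(c) / F.count("*")).alias(c) for c in sorted(drifting)]
    ).show()
```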

This session will demonstrate a way to capture and visualize drift across all your streaming tables. The next question is, “Now that I see all of the data I’m missing, how do I selectively promote some of these keys into DataFrame columns?” The second half of this session will demonstrate precisely how to perform a schema migration with minimal job downtime.
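
One common low-downtime pattern (again a sketch under assumptions, not necessarily the speaker’s exact method) is to evolve the Delta table additively first, then redeploy the stream against the same checkpoint so it resumes from its last committed offsets. The Kafka broker, topic, table, column, and checkpoint names below are all hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Evolve the table ahead of the code change; adding a nullable
#    column is additive and does not disturb the running job.
spark.sql("ALTER TABLE events_bronze ADD COLUMNS (device_type STRING)")

# 2. Redeploy the stream with the key promoted to a first-class column.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
    .withColumn("json", F.col("value").cast("string"))
    .withColumn("device_type", F.get_json_object("json", "$.device_type"))
    .select("json", "device_type")  # raw payload plus the promoted key
)

query = (
    events.writeStream.format("delta")
    # Reusing the existing checkpoint lets the restarted job pick up
    # where it left off, so downtime is just the redeploy itself.
    .option("checkpointLocation", "/checkpoints/events_bronze")
    .toTable("events_bronze")
)
```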

Talk by: Alexander Vanadio

Connect with us: Website: https://databricks.com
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc
Facebook: https://www.facebook.com/databricksinc
