For a data source with huge data, it’s always challenging to get the column stats, especially the NDV. We plan to calculate NDV using Apache DataSketches Theta sketch and save the serialized compact sketch in the statistics file. The NDV and other column stats will be reported to Spark for query plan optimization.
Talk by: Huaxin Gao and Parth Chandra
Here’s more to explore:
Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq
Lakehouse Fundamentals Training: https://dbricks.co/44ancQs
Connect with us: Website: https://databricks.com
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc
Facebook: https://www.facebook.com/databricksinc
Add comment