When running the TPC-DS benchmark against an external DS v2 data source, we observed that for several of the queries, DS v1 produces better join plans than DS v2. The main reason is that DS v1 uses column stats, especially the number of distinct values (NDV), for query optimization. Currently, Spark DS v2 only has interfaces for data sources to report table statistics such as size in bytes and number of rows. To make column stats usable in DS v2, we have added new interfaces that allow external data sources to report column stats to Spark.
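To illustrate why NDV matters for join planning, here is a minimal sketch of the textbook equi-join cardinality estimate, rows(L) * rows(R) / max(NDV(L.key), NDV(R.key)). The function name and its parameters are hypothetical and are not Spark APIs; this only shows the kind of estimate an optimizer can make once column stats are available, and it returns no estimate when they are missing.

```python
from typing import Optional


def estimate_join_rows(
    left_rows: int,
    right_rows: int,
    left_ndv: Optional[int] = None,
    right_ndv: Optional[int] = None,
) -> Optional[int]:
    """Textbook equi-join estimate: |L| * |R| / max(NDV_L, NDV_R).

    Hypothetical helper for illustration only. Returns None when the NDV
    of either join key is unknown, i.e. when only table-level stats
    (row count, size in bytes) are available.
    """
    if not left_ndv or not right_ndv:
        return None  # no column stats, so no cardinality estimate
    return left_rows * right_rows // max(left_ndv, right_ndv)


# A 1,000,000-row fact table joined to a 1,000-row dimension table on a
# key with 1,000 distinct values on both sides.
with_stats = estimate_join_rows(1_000_000, 1_000, 1_000, 1_000)
without_stats = estimate_join_rows(1_000_000, 1_000)
```

With NDV available, the optimizer can compare estimated output sizes across candidate join orders; with table-level statistics alone, it has much less to go on.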

For a data source holding very large volumes of data, computing column stats, especially NDV, is always challenging. We plan to calculate NDV using an Apache DataSketches Theta sketch and save the serialized compact sketch in the statistics file. The NDV and other column stats will then be reported to Spark for query plan optimization.
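As a rough illustration of how a sketch can approximate NDV in bounded memory, here is a minimal K-Minimum-Values (KMV) sketch, a simplified relative of the Theta sketch. It is not the Apache DataSketches library and does not reproduce its serialization format; the class name, the SHA-256 hashing, and the default `k` are assumptions made for this example.

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps the k smallest hash values seen; estimates NDV from the k-th."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []        # max-heap (negated values) of the k smallest hashes
        self._members = set()  # same hashes, for duplicate detection

    @staticmethod
    def _hash(value) -> float:
        # Map a value to a pseudo-uniform number in [0, 1).
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def _add_hash(self, h: float) -> None:
        if h in self._members:
            return  # duplicates do not change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest retained hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, value) -> None:
        self._add_hash(self._hash(value))

    def merge(self, other: "KMVSketch") -> None:
        # Union: fold in the other sketch's retained hashes, keep the k smallest.
        for h in other._members:
            self._add_hash(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct values: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


# Two partitions with overlapping values, sketched separately and merged.
part_a, part_b = KMVSketch(), KMVSketch()
for i in range(30_000):
    part_a.update(i)
for i in range(20_000, 50_000):
    part_b.update(i)
part_a.merge(part_b)
ndv_estimate = part_a.estimate()  # approximates the 50,000 distinct values
```

Because such sketches can be merged, a stored sketch can in principle be combined with sketches from other files or partitions without rescanning the data, which is one reason to persist the sketch itself rather than only the final NDV number.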

Talk by: Huaxin Gao and Parth Chandra
