Testing Generative AI Models: What You Need to Know
26.07.2023
Generative AI shows incredible promise for enterprise applications. The explosion of generative AI can be attributed to the convergence of several factors. Most significant is that customizable prompts (few-shot learning) have lowered the barrier to entry for AI application developers, enabling even laypeople to generate high-quality content. The flexibility of models like ChatGPT and DALL-E 2 has sparked curiosity and creativity about the new applications they can support, and the number of tools will continue to grow in a manner similar to how AWS fueled app development. But excitement must be tempered by concern about the new risks these models impose on business and society. Increased capability and adoption also increase risk exposure. As organizations explore the creative boundaries of generative models, measures to reduce risk must be put in place. However, the enormous size of the input space and the inherent complexity of these models make this task more challenging than it is for traditional ML models.
In this session, we summarize the new risks introduced by this class of generative foundation models through several examples, and compare them to the risks of mainstream discriminative models. Steps can be taken to reduce operational risk, mitigate bias and fairness issues, and improve the privacy and security of systems that leverage LLMs for automation. We’ll explore model hallucinations, output evaluation, output bias, prompt injection, data leakage, stochasticity, and more, discuss some of the larger issues common to LLMs, and show how to test for them. A comprehensive, test-based approach to generative AI development helps instill model integrity by proactively mitigating failure and the associated business risk.
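For a taste of what a test-based approach can look like, here is a minimal sketch of one property the talk calls out, output stability under stochastic sampling; `generate` is a hypothetical stand-in for your model call, not an API from the talk:

```python
# Property-based check: factual prompts should produce near-identical answers
# across repeated samples; high variance is a cheap signal of hallucination risk.
from difflib import SequenceMatcher

def generate(prompt: str) -> str:
    raise NotImplementedError("replace with your LLM application call")

def stability_score(prompt: str, n_samples: int = 5) -> float:
    """Average pairwise similarity of repeated generations for one prompt."""
    outputs = [generate(prompt) for _ in range(n_samples)]
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def test_factual_prompt_is_stable():
    assert stability_score("What year was the Eiffel Tower completed?") > 0.8
```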
Talk by: Yaron Singer
Here’s more to explore:
LLM Compact Guide: https://dbricks.co/43WuQyb
Big Book of MLOps: https://dbricks.co/3r0Pqiz
Using DMS and DLT for Change Data Capture
25.07.2023
Bringing Amazon Relational Database Service (RDS) data into your data lake is a critical process for enabling downstream use cases. By leveraging AWS Database Migration Service (DMS) and Databricks Delta Live Tables (DLT), we can simplify change data capture from your RDS. In this talk, we break down this complex process by discussing the fundamentals and best practices, followed by a demo that brings it all together.
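As a preview of the pattern, here is a hedged sketch of the DLT side: Auto Loader ingests the change files DMS lands on S3, and APPLY CHANGES turns them into an up-to-date table. Paths, key columns, and the DMS Op/timestamp column names are assumptions to adapt to your setup:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table
def customers_cdc_raw():
    # DMS writes change records (inserts/updates/deletes flagged in an Op
    # column) as files; Auto Loader ingests them incrementally.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .load("s3://your-bucket/dms-output/customers/"))

dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_raw",
    keys=["customer_id"],
    sequence_by=F.col("transact_ts"),      # ordering column from DMS
    apply_as_deletes=F.expr("Op = 'D'"),   # DMS marks deletes with Op = 'D'
)
```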
Talk by: Neil Patel and Ganesh Chand
Sponsored by: Microsoft | Next-Level Analytics with Power BI and Databricks
25.07.2023
The widely adopted combination of Power BI and Databricks has been a game-changer in providing a comprehensive solution for modern data analytics. In this session, you’ll learn how self-service analytics combined with the Databricks Lakehouse Platform can allow users to make better-informed decisions by unlocking insights hidden in complex data. We’ll provide practical examples of how organizations have leveraged these technologies together to drive digital transformation, lower total cost of ownership (TCO), and increase revenue. By the end of the presentation and demo, you’ll understand how Power BI and Databricks can help drive real-time insights at scale for organizations in any industry.
Talk by: Bob Zhang and Mahesh Prakriya
Optimizing Batch and Streaming Aggregations
27.07.2023
A client recently asked us to optimize their batch and streaming workloads: aggregations using the DataFrame.groupBy operation with a custom Scala UDAF over a data stream from Kafka. A single, simple-looking request turned into a months-long hunt for a more performant query execution plan than ObjectHashAggregateExec, which kept falling back to sort-based aggregation (i.e., the worst possible aggregation runtime performance). It quickly taught us that an aggregation using a custom Scala UDAF cannot be planned as anything other than ObjectHashAggregateExec, but at least tasks don't always have to fall back. And that's just batch workloads. When you throw in streaming semantics and consider the different output modes, windowing, and streaming watermarks, optimizing aggregations can take a long time to get right.
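To see the planning behavior for yourself, here is a small PySpark illustration; collect_list stands in for the custom UDAF, since object-typed aggregates plan the same way, and the fallback-threshold setting is an internal Spark SQL config to tune with care:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Raising this (default 128 distinct keys per partition) delays the fallback
# to sort-based aggregation described above, if memory allows.
spark.conf.set("spark.sql.objectHashAggregate.sortBased.fallbackThreshold", "4096")

df = spark.range(1_000_000).withColumn("k", F.col("id") % 100)

# collect_list is object-typed, so the plan shows ObjectHashAggregateExec;
# a plain sum would plan as the faster HashAggregateExec instead.
df.groupBy("k").agg(F.collect_list("id")).explain()
```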
Talk by: Jacek Laskowski
Here’s more to explore:
Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV
The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI
Processing Delta Lake Tables on AWS Using AWS Glue, Amazon Athena, and Amazon Redshift
26.07.2023
Delta Lake is an open source project that helps implement modern data lake architectures commonly built on cloud storage. With Delta Lake, you can achieve ACID transactions, time travel queries, CDC, and other common use cases on the cloud.
There are many use cases for Delta tables on AWS. AWS has invested heavily in this technology, and Delta Lake is now available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. With AWS Glue, you can easily ingest data from sources such as on-prem databases, Amazon RDS, DynamoDB, and MongoDB into Delta Lake on Amazon S3, even without coding expertise.
This session will demonstrate how to get started with processing Delta Lake tables on Amazon S3 using AWS Glue, and how to query them from Amazon Athena and Amazon Redshift. The session also covers recent AWS service updates related to Delta Lake.
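As a rough preview of the Glue side, a sketch under assumptions (a Glue 4.0 job with the --datalake-formats delta job parameter; bucket paths and table names are placeholders):

```python
from pyspark.sql import SparkSession

# The two configs below enable Delta Lake support in the Spark session.
spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.read.format("json").load("s3://your-bucket/raw/events/")

# Write a Delta table to S3; registering it in the Glue Data Catalog makes it
# queryable from Athena (engine v3) and Redshift Spectrum without copies.
(df.write.format("delta")
   .mode("overwrite")
   .save("s3://your-bucket/delta/events/"))
```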
Talk by: Noritaka Sekiyama and Akira Ajisaka
Here’s more to explore:
Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq
Lakehouse Fundamentals Training: https://dbricks.co/44ancQs
Advanced Governance with Collibra on Databricks
26.07.2023
A data lake is only as good as its governance. Understanding what data you have, performing classification, defining and applying security policies, and auditing how data is used make up the data governance lifecycle. Unity Catalog, with its rich ecosystem of supported tools, simplifies all stages of this lifecycle. Learn how metadata can be hydrated into Collibra directly from Unity Catalog. Once the metadata is available in Collibra, we will demonstrate classification, defining security policies on the data, and pushing those policies into Databricks. All access and usage of data is automatically audited, with real-time lineage provided in the data explorer as well as in system tables.
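For context, the "define and apply security policies" stage looks like plain SQL on the Databricks side; a small illustration with placeholder catalog, schema, table, and group names (run where `spark` is predefined, e.g. a notebook):

```python
# Grant a group read access down the catalog/schema/table hierarchy.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Access is audited automatically; audit events land in a queryable system table.
spark.sql("SELECT * FROM system.access.audit LIMIT 10").show()
```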
Talk by: Leon Eller and Antonio Castelo
Unlocking Near Real Time Data Replication with CDC, Apache Spark™ Streaming, and Delta Lake
26.07.2023
Tune into DoorDash's journey migrating from a flaky ETL system with 24-hour data delays to a standardized CDC streaming pattern across more than 150 databases, producing near real-time data in a scalable, configurable, and reliable manner.
During this journey, understand how we use Delta Lake to build a self-serve, read-optimized data lake with data latencies of around 15 minutes while reducing operational overhead. Furthermore, understand how certain tradeoffs, like conceding to a non-real-time system, allow for multiple optimizations while still permitting OLTP query use cases, and the benefits this provides.
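The core of such a pattern, sketched here with made-up table and column names, is a MERGE applied per micro-batch; the relaxed trigger interval is exactly the concession that buys the read-side optimizations:

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Assumes at most one change per key in each batch (dedupe upstream otherwise).
    target = DeltaTable.forName(spark, "db.users")
    (target.alias("t")
       .merge(batch_df.alias("s"), "t.user_id = s.user_id")
       .whenMatchedDelete(condition="s.op = 'd'")   # CDC delete marker
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

(spark.readStream.table("db.users_cdc")
   .writeStream
   .foreachBatch(upsert_batch)
   .trigger(processingTime="15 minutes")   # batching instead of real time
   .start())
```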
Talk by: Ivan Peng and Phani Nalluri
Here’s more to explore:
Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV
The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI
Streaming Data Analytics with Power BI and Databricks
26.07.2023
This session comprises a series of end-to-end technical demos illustrating the synergy between Databricks and Power BI for streaming use cases, with considerations around when to choose which scenario (a sketch of Scenario 2 follows the list):
Scenario 1: DLT + Power BI Direct Query and Auto Refresh
Scenario 2: Structured Streaming + Power BI streaming datasets
Scenario 3: DLT + Power BI composite datasets
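One possible shape for Scenario 2, sketched under assumptions: a Power BI streaming (push) dataset exposes a push-rows URL when you create it, and foreachBatch POSTs each micro-batch to that endpoint. The URL and source table below are placeholders:

```python
import requests

PUSH_URL = "https://api.powerbi.com/beta/<workspace>/datasets/<dataset-id>/rows?key=<key>"

def push_to_powerbi(batch_df, batch_id):
    # Keep batches small; push datasets are meant for low-volume live tiles.
    rows = [r.asDict() for r in batch_df.limit(500).collect()]
    if rows:
        requests.post(PUSH_URL, json=rows, timeout=30)

(spark.readStream.table("live_metrics")
   .writeStream
   .foreachBatch(push_to_powerbi)
   .start())
```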
Talk by: Liping Huang and Marius Panga
What’s New in Databricks Workflows -- With Live Demos
25.07.2023
Databricks Workflows provides unified orchestration for the Lakehouse. Since it was first announced last year, thousands of organizations have been leveraging Workflows for orchestrating lakehouse workloads such as ETL, BI dashboard refresh and ML model training.
In this session, the Workflows product team will cover and demo the latest features and capabilities of Databricks Workflows in the areas of workflow authoring, observability and more. This session will also include an outlook for future innovations you can expect to see in the coming months.
Talk by: Muhammad Bilal Aslam
Cross-Platform Data Lineage with OpenLineage
28.07.2023
There are more data tools available than ever before, and it is easier than ever to build a pipeline. These tools and advancements have created an explosion of innovation, and as a result, data within today's organizations has become so distributed that it can no longer be contained within a single brain, a single team, or a single platform. Data lineage can help by tracing the relationships between datasets and providing a map of your entire data universe.
OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow, Apache Spark™, Flink®, and dbt. This empowers teams to diagnose and address widespread data quality and efficiency issues in real time. In this session, we will show how to trace data lineage across Apache Spark and Apache Airflow. There will be a walk-through of the OpenLineage architecture and a live demo of a running pipeline with real-time data lineage.
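Wiring up the Spark side is mostly configuration; a minimal sketch, with the version pin, backend URL, and namespace as placeholders to adapt:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lineage-demo")
         # Pull the OpenLineage Spark integration onto the classpath.
         .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.30.1")
         .config("spark.extraListeners",
                 "io.openlineage.spark.agent.OpenLineageSparkListener")
         .config("spark.openlineage.transport.type", "http")
         .config("spark.openlineage.transport.url", "http://localhost:5000")
         .config("spark.openlineage.namespace", "demo")
         .getOrCreate())

# Every read and write in this session now emits OpenLineage run events, so a
# backend can stitch Spark lineage together with Airflow's.
```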
Talk by: Julien Le Dem and Willy Lulciuc
Here’s more to explore:
Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU
Advancements in Open Source LLM Tooling, Including MLflow
25.07.2023
MLflow is one of the most widely used open source machine learning frameworks, with over 13 million monthly downloads. With the recent advancements in generative AI, MLflow has been rapidly integrating support for popular AI tools such as Hugging Face, LangChain, and OpenAI. This means it’s becoming easier than ever to build AI pipelines with your data as the foundation, while expanding your capabilities with the incredible advancements of the AI community.
Come to this session to learn how MLflow can help you:
- Easily grab open source models from Hugging Face and use Transformers pipelines in MLflow (see the sketch after this list)
- Integrate LangChain for more advanced services and to add context into your model pipelines
- Bring in OpenAI APIs as part of your pipelines
- Quickly track and deploy models on the lakehouse using MLflow
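A hedged sketch of the first bullet; the model choice and names are illustrative:

```python
import mlflow
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

with mlflow.start_run():
    # Log the Hugging Face pipeline as an MLflow model artifact.
    model_info = mlflow.transformers.log_model(
        transformers_model=summarizer,
        artifact_path="summarizer",
    )

# Reload as a generic pyfunc for batch scoring or serving on the lakehouse.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict(["MLflow now logs transformers pipelines natively."]))
```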
Talk by: Corey Zumar and Ben Wilson
Unity Catalog, Delta Sharing and Data Mesh on Databricks Lakehouse
25.07.2023
In this technical deep dive, we will detail how customers implemented data mesh on Databricks and how standardizing on the Delta format enabled Delta-to-Delta sharing to non-Databricks consumers.
- Current state of the IT landscape
1. Data silos (problems with organizations not having connected data in the ecosystem)
2. A look back at why we moved away from data warehouses and chose the cloud in the first place
3. What caused the data chaos in the cloud (instrumentation and too much stitching together of a periodic-table-sized list of cloud services)
- How to strike the balance between autonomy and centralization
- Why Databricks Unity Catalog puts you in the right path to implementing data mesh strategy
- What processes and features enable an end-to-end implementation of a data strategy
- How customers successfully implemented data mesh with out-of-the-box Unity Catalog and Delta Sharing, without overwhelming their IT tool stack
- Use cases
1. Delta-to-delta data sharing
2. Delta-to-others data sharing
- How do you navigate when data today is available across regions, across clouds, on-prem and external systems
1. Change data feed to share only “data that has changed”
- Data stewardship
1. Why ABAC is important
2. How file-based access policies and governance play an important role
- Future state and its pitfalls
1. Egress costs
2. Data compliance
Talk by: Surya Turaga and Thomas Roach
Photon for Dummies: How Does this New Execution Engine Actually Work?
25.07.2023
Did you finish the Photon whitepaper and think, wait, what? I know I did; it’s my job to understand it, explain it, and then use it. If your role involves using Apache Spark™ on Databricks, then you need to know about Photon and where to use it. Join me, chief dummy, nay "supreme" dummy, as I break down this whitepaper into easy-to-understand explanations that don’t require a computer science degree. Together we will unravel mysteries such as:
- Why is a Java Virtual Machine the current bottleneck for Spark enhancements?
- What does vectorized even mean? And how was it done before?
- Why is the relationship status between Spark and Photon "complicated?"
In this session, we’ll start with the basics of Apache Spark, the details we pretend to know, and where those performance cracks are starting to show through. Only then will we start to look at Photon: how it’s different, where the clever design choices are, and how you can make the most of it in your own workloads. I’ve spent over 50 hours going over the paper in excruciating detail: every reference and, in some instances, the references of the references, so that you don’t have to.
Talk by: Holly Smith
Deep Dive into the New Features of Apache Spark™ 3.4
25.07.2023
Join us for this technical deep dive session. In 2022, Apache Spark™ was awarded the prestigious SIGMOD Systems Award, recognizing it as the de facto standard for data processing.
In this session, we will share the latest progress in the Apache Spark community. With tremendous contributions from the open source community, Spark 3.4 resolved in excess of 2,400 Jira tickets. We will talk about the major features and improvements in Spark 3.4: Spark Connect, numerous PySpark and SQL language features, engine performance enhancements, and operational improvements in the Spark UX and error handling.
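For example, Spark Connect lets a thin client drive a remote cluster over gRPC instead of running the driver in-process; a minimal sketch, assuming a Spark Connect server on the default port:

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server (started e.g. with
# start-connect-server.sh); the address here is a placeholder.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(10).selectExpr("id", "id * id AS squared")
df.show()   # executes remotely; only results travel back to the client
```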
Talk by: Xiao Li and Daniel Tenedorio
Delta-rs, Apache Arrow, Polars, WASM: Is Rust the Future of Analytics?
27.07.2023
Rust is a unique language whose traits make it very appealing for data engineering. In this session, we'll walk through the aspects of the language that make it such a good fit for big data processing: how it improves performance, how it provides greater safety guarantees, and how its compatibility with a wide range of existing tools positions it to become a major building block for the future of analytics.
We will also take a hands-on look through real code examples at a few emerging technologies built on top of Rust that utilize these capabilities, and learn how to apply them to our modern lakehouse architecture.
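You can get a taste of that stack from Python today; a small sketch using delta-rs and Polars, both Rust under the hood (install deltalake, polars, and pandas first; the path is a placeholder):

```python
import pandas as pd
import polars as pl
from deltalake import write_deltalake

# delta-rs writes a Delta table with no JVM involved.
write_deltalake("/tmp/events", pd.DataFrame({"id": [1, 2, 3], "v": ["a", "b", "c"]}))

# Polars' reader also uses delta-rs underneath.
df = pl.read_delta("/tmp/events")
print(df)
```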
Talk by: Oz Katz
Here’s more to explore:
Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq
Lakehouse Fundamentals Training: https://dbricks.co/44ancQs
Data + AI Summit Keynote Thursday
29.06.2023
0:00 Open
6:08 Ali Ghodsi & Marc Andreessen
32:06 Reynold Xin
48:09 Michael Armbrust
1:00:00 Matei Zaharia & Panel
1:27:10 Hannes Mühleisen
1:37:43 Harrison Chase
1:49:15 Lin Qiao
2:05:03 Jitendra Malik
2:21:15 Arsalan & Eric Schmidt
What’s New in Unity Catalog -- With Live Demos
25.07.2023
Join the Unity Catalog product team and dive into the cutting-edge world of data, analytics, and AI governance. With Unity Catalog’s unified governance solution for data, analytics, and AI on any cloud, you’ll discover the latest and greatest enhancements we’re shipping, including fine-grained governance with row/column filtering, new enhancements to automated data lineage, and governance for ML assets.
In this demo-packed session, you’ll learn how new capabilities in Unity Catalog can further simplify your data governance and accelerate your analytics and AI initiatives. Plus, get an exclusive sneak peek at our upcoming roadmap. And don’t forget, you’ll have the chance to ask the product teams any burning questions about the best governance solution for the lakehouse. Don’t miss out on this exciting opportunity to level up your data game with Unity Catalog.
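As a flavor of the row-filtering enhancement, a sketch using the documented SQL pattern, with placeholder catalog, table, and group names:

```python
# A SQL UDF decides row visibility: admins see everything, others only US rows.
spark.sql("""
  CREATE OR REPLACE FUNCTION main.default.us_only(region STRING)
  RETURN IF(IS_ACCOUNT_GROUP_MEMBER('admins'), TRUE, region = 'US')
""")

# Attach the function to a table column as its row filter.
spark.sql("ALTER TABLE main.sales.orders SET ROW FILTER main.default.us_only ON (region)")
```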
Talk by: Paul Roome
Databricks Cost Management: Tips and Tools to Stay Under Budget
26.07.2023
How do you prevent surprise bills at the end of the month? Join us as we discuss best practices for cost management. You'll learn how to analyze and break down costs and hear best practices for keeping your budget in check. This session will:
- Walk through cost reporting across various surfaces
- Discuss best practices for cost optimization on Databricks
- Highlight how tagging and budgets can give you the confidence you seek (see the sketch after this list)
- Share news about upcoming features related to cost management
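A sketch of the tagging practice, with placeholder names and values: custom tags on a job cluster flow through to usage and billing records (e.g. the custom_tags column of the billing usage system table), so spend can be grouped by team or project:

```python
# Pass this spec as the job cluster definition when creating a job via the
# Jobs API or your IaC tool; the tags then show up in usage/billing reports.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "custom_tags": {
        "team": "growth-analytics",
        "cost_center": "cc-1234",
        "project": "daily-etl",
    },
}
```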
Talk by: Greg Kroleski and Thorsten Jacobs
Learn How to Reliably Monitor Your Data and Model Quality in the Lakehouse
25.07.2023
Developing and maintaining production data engineering and machine learning pipelines is challenging for many data teams. Even more challenging is monitoring the quality of your data and models once they go into production. Building on untrustworthy data can cause many complications for data teams. Without a monitoring service, it is hard to proactively discover when your ML models degrade over time, and the root causes behind the degradation. Furthermore, without lineage tracking, it is even more painful to debug errors in your models and data. Databricks Lakehouse Monitoring offers a unified service to monitor the quality of all your data and ML assets.
In this session, you’ll learn how to:
- Use one unified tool to monitor the quality of any data product: data or AI
- Quickly diagnose errors in your data products with root cause analysis
- Set up a monitor with low friction, requiring only a button click or a single API call to start and automatically generate out-of-the-box metrics
- Enable self-serve experiences for data analysts by providing reliability status for every data asset
Talk by: Kasey Uhlenhuth and Alkis Polyzotis
How to Build Metadata-Driven Data Pipelines with Delta Live Tables
25.07.2023
In this session, you will learn how you can use metaprogramming to automate the creation and management of Delta Live Tables pipelines at scale. The goal is to make it easy to use DLT for large-scale migrations, and other use cases that require ingesting and managing hundreds or thousands of tables, using generic code components and configuration-driven pipelines that can be dynamically reused across different projects or datasets.
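The essence of the pattern is a table factory driven by configuration; a compact, hedged sketch (the config shape, names, and paths are illustrative, and `spark` is provided by the pipeline runtime):

```python
import dlt

tables = [
    {"name": "orders_bronze",    "path": "s3://your-bucket/raw/orders/"},
    {"name": "customers_bronze", "path": "s3://your-bucket/raw/customers/"},
]

def make_table(cfg):
    # The factory captures cfg, so each generated table keeps its own config.
    @dlt.table(name=cfg["name"])
    def ingest():
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load(cfg["path"]))
    return ingest

for cfg in tables:
    make_table(cfg)   # each call registers another table with the pipeline
```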
Talk by: Mojgan Mazouchi and Ravi Gawai
LLMOps: Everything You Need to Know to Manage LLMs
25.07.2023
With the recent surge in popularity of ChatGPT and other LLMs such as Dolly, many people are going to start training, tuning, and deploying their own custom models to solve their domain-specific challenges. When training and tuning these models, there are certain considerations that need to be accounted for in the MLOps process that differ from traditional machine learning. Come watch this session where you’ll gain a better understanding of what to look out for when starting to enter the world of applying LLMs in your domain.
In this session, you’ll learn about:
- Grabbing foundation models and fine-tuning them (see the sketch after this list)
- Optimizing resource management such as GPUs
- Integrating human feedback and reinforcement learning to improve model performance
- Different evaluation methods for LLMs
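To make the first bullet concrete, a heavily simplified fine-tuning sketch using the Hugging Face Trainer; the tiny model, data slice, and hyperparameters are placeholders chosen to keep the example cheap, not a recipe from the talk:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/pythia-70m"   # small model so the sketch runs cheaply
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# 200 instruction/response pairs from the open dolly-15k dataset, tokenized.
ds = load_dataset("databricks/databricks-dolly-15k", split="train[:200]")
ds = ds.map(lambda ex: tok(ex["instruction"] + "\n" + ex["response"],
                           truncation=True, max_length=256),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="/tmp/tuned", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```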
Talk by: Joseph Bradley and Eric Peter
Databricks Asset Bundles: A Standard, Unified Approach to Deploying Data Products on Databricks
25.07.2023
In this session, we will introduce Databricks Asset Bundles, demonstrate how they work for a variety of data products, and show how to fit them into an overall CI/CD strategy for the well-architected Lakehouse.
Data teams produce a variety of assets: datasets, reports and dashboards, ML models, and business applications. These assets depend on code (notebooks, repos, queries, pipelines), infrastructure (clusters, SQL warehouses, serverless endpoints), and supporting services and resources like Unity Catalog, Databricks Workflows, and DBSQL dashboards. Today, each organization must figure out its own deployment strategy for the data products it builds on Databricks, as there is no consistent way to describe the infrastructure and services associated with project code.
Databricks Asset Bundles is a new capability on Databricks that standardizes and unifies the deployment strategy for all data products developed on the platform. It allows developers to describe the infrastructure and resources of their project through a YAML configuration file, regardless of whether they are producing a report, dashboard, online ML model, or Delta Live Tables pipeline. Behind the scenes, these configuration files use Terraform to manage resources in a Databricks workspace, but knowledge of Terraform is not required to use Databricks Asset Bundles.
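A minimal databricks.yml sketch, with names, paths, and the workspace host as placeholders:

```yaml
bundle:
  name: my_data_product

resources:
  jobs:
    nightly_refresh:
      name: nightly-refresh
      tasks:
        - task_key: etl
          notebook_task:
            notebook_path: ./src/etl_notebook.py

targets:
  dev:
    workspace:
      host: https://your-workspace.cloud.databricks.com
```

From there, databricks bundle validate and databricks bundle deploy -t dev in the Databricks CLI translate the file into workspace resources.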
Talk by: Rafi Kurlansik and Pieter Noordhuis
Evaluating LLM-based Applications
26.07.2023
Evaluating LLM-based applications can feel like more of an art than a science. In this workshop, we'll give a hands-on introduction to evaluating language models. You'll come away with knowledge and tools you can use to evaluate your own applications, and answers to questions like:
- Where do I get evaluation data from, anyway?
- Is it possible to evaluate generative models in an automated way? (see the sketch after this list)
- What metrics can I use?
- What's the role of human evaluation?
Talk by: Josh Tobin
Here’s more to explore:
LLM Compact Guide: https://dbricks.co/43WuQyb
Big Book of MLOps: https://dbricks.co/3r0Pqiz
A Technical Deep Dive into Unity Catalog's Practitioner Playbook
25.07.2023
Get ready to take a deep dive into Unity Catalog and explore how it can simplify data, analytics, and AI governance across multiple clouds. In this session, the expert Databricks team will guide you through a hands-on demo showcasing the latest features and best practices for data governance. You'll learn how to master Unity Catalog and gain a practical understanding of how it can streamline your analytics and AI initiatives. Whether you're migrating from Hive Metastore or just looking to expand your knowledge of Unity Catalog, this session is for you. Join us and learn how to achieve seamless governance for data, analytics, and AI.
Talk by: Zeashan Pappa and Ifigeneia Derekli
How to Build LLMs on Your Company’s Data While on a Budget
25.07.2023
Large Language Models (LLMs) are taking AI mainstream across companies and individuals. However, public LLMs are trained on general-purpose data; they do not include your own corporate data, and they are black boxes with respect to how they are trained. Because terminology differs across healthcare, financial, retail, digital-native, and other industries, companies today are looking for industry-specific LLMs that better understand the terminology, context, and knowledge that suit their needs. In contrast to closed LLMs, open source-based models can be used commercially or customized to suit an enterprise's needs on its own data. Learn how Databricks makes it easy for you to build, tune, and use custom models, including a deep dive into Dolly, the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
In this session, you will:
- See a real-life demo of creating your own LLMs specific to your industry
- Learn how to securely train on your own documents if needed
- Learn how Databricks makes it quick, scalable and inexpensive
- Deep dive into Dolly and its applications (see the sketch after this list)
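Dolly itself is a pipeline call away; a sketch following the model's published usage (the 3B variant is the smallest; larger ones need more GPU memory):

```python
import torch
from transformers import pipeline

generate = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # Dolly ships a custom instruction-following pipeline
    device_map="auto",
)

res = generate("Explain what a data lakehouse is in two sentences.")
print(res[0]["generated_text"])
```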
Talk by: Sean Owen
How to Train Your Own Large Language Models
26.07.2023
Given the success of OpenAI’s GPT-4 and Google’s PaLM, every company is now assessing its own use cases for Large Language Models (LLMs). Many companies will ultimately decide to train their own LLMs for a variety of reasons, ranging from data privacy to increased control over updates and improvements. One of the most common reasons will be to make use of proprietary internal data.
In this session, we’ll go over how to train your own LLMs, from raw data to deployment in a user-facing production environment. We’ll discuss the engineering challenges, and the vendors that make up the modern LLM stack: Databricks, Hugging Face, and MosaicML. We’ll also break down what it means to train an LLM using your own data, including the various approaches and their associated tradeoffs.
Topics covered in this session:
- How Replit trained a state-of-the-art LLM from scratch
- The different approaches to using LLMs with your internal data
- The differences between fine-tuning, instruction tuning, and RLHF
Talk by: Reza Shabani
Here’s more to explore:
LLM Compact Guide: https://dbricks.co/43WuQyb
Big Book of MLOps: https://dbricks.co/3r0Pqiz
Data + AI Summit Keynote Wednesday
29.06.2023
0:00 Opener
1:18 Ali Ghodsi, Databricks
6:53 Satya Nadella, Microsoft
15:50 Ali Ghodsi, Databricks
20:40 Larry Feinsmith, JPMorgan Chase
41:09 Ali Ghodsi, Databricks
45:07 Matei Zaharia, Databricks
52:31 Weston Hutchins, Databricks
58:36 Ali Ghodsi, Databricks
1:02:05 Naveen Rao, MosaicML
1:12:15 Patrick Wendell, Databricks
1:27:57 Kasey Uhlenhuth, Databricks
1:39:18 Sai Pradhan Ravuru, JetBlue
1:47 Ali Ghodsi, Databricks
1:49:20 Reynold Xin, Databricks
2:05:07 Ali Ghodsi, Databricks
2:09:26 Matei Zaharia, Databricks
2:17:24 Caryl Yuhas, Databricks
2:24:12 Zaheera Valani, Databricks
2:39:55 Wassym Bensaid, Rivian
Delta Live Tables A to Z: Best Practices for Modern Data Pipelines
25.07.2023
Join Databricks' Distinguished Principal Engineer Michael Armbrust for a technical deep dive into how Delta Live Tables (DLT) reduces the complexity of data transformation and ETL. Learn what's new, what's coming, and how to easily master the ins and outs of DLT.
Michael will describe and demonstrate:
- What’s new in Delta Live Tables (DLT) - Enzyme, Enhanced Autoscaling, and more
- How to easily create and maintain your DLT pipelines (see the sketch after this list)
- How to monitor pipeline operations
- How to optimize data for analytics and ML
- Sneak Peek into the DLT roadmap
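For a flavor of the basics, a minimal two-table pipeline with a data-quality expectation (the source path and column names are placeholders; `spark` is provided by the pipeline runtime):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw clickstream ingested with Auto Loader")
def clicks_bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("s3://your-bucket/raw/clicks/"))

@dlt.table(comment="Cleaned clicks ready for analytics")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")  # failing rows are dropped and counted
def clicks_silver():
    return dlt.read_stream("clicks_bronze").withColumn("day", F.to_date("ts"))
```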
Talk by: Michael Armbrust