Andrej Karpathy

17 videos 17935767 views 825000 subscribers
Uploads
  • LOL
  • General Audience
  • Neural Networks: Zero to Hero
  • stable diffusion dreams

Building makemore Part 5: Building a WaveNet

21.11.2022
We take the 2-layer MLP from the previous video and make it deeper with a tree-like structure, arriving at a convolutional neural network architecture similar to the WaveNet (2016) from DeepMind. In the WaveNet paper, the same hierarchical architecture is implemented more efficiently using causal dilated convolutions (not yet covered). Along the way we get a better sense of torch.nn and what it is and how it works under the hood, and what a typical deep learning development process looks like (a lot of reading of documentation, keeping track of multidimensional tensor shapes, moving between jupyter notebooks and repository code, ...).
Links:
- makemore on github: https://github.com/karpathy/makemor...
- jupyter notebook I built in this video: https://github.com/karpathy/nn-zero...
- colab notebook: https://colab.research.google.com/d...
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- our Discord channel: https://discord.gg/3zy8kqD9Cp
Supplementary links:
- WaveNet 2016 from DeepMind https://arxiv.org/abs/1609.03499
- Bengio et al. 2003 MLP LM https://www.jmlr.org/papers/volume3...
Chapters:
intro
00:00:00 intro
00:01:40 starter code walkthrough
00:06:56 let’s fix the learning rate plot
00:09:16 pytorchifying our code: layers, containers, torch.nn, fun bugs
implementing wavenet
00:17:11 overview: WaveNet
00:19:33 dataset bump the context size to 8
00:19:55 re-running baseline code on block_size 8
00:21:36 implementing WaveNet
00:37:41 training the WaveNet: first pass
00:38:50 fixing batchnorm1d bug
00:45:21 re-training WaveNet with bug fix
00:46:07 scaling up our WaveNet
conclusions
00:46:58 experimental harness
00:47:44 WaveNet but with “dilated causal convolutions”
00:51:34 torch.nn
00:52:28 the development process of building deep neural nets
00:54:17 going forward
00:55:26 improve on my loss! how far can we improve a WaveNet on this data?
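The tree-like structure mentioned in the description can be illustrated with a small sketch. It assumes a context of 8 positions (the block_size named in the chapters) and fuses neighbouring positions two at a time, so the sequence shrinks 8 → 4 → 2 → 1. The helper name `flatten_consecutive` is hypothetical, and the linear and batchnorm layers that would sit between the fusing steps in a real network are left out, so this shows only the shape bookkeeping, not the model from the video.

```python
def flatten_consecutive(x, n):
    """Group every n neighbouring feature vectors into one longer vector.

    x is a list of T feature vectors (plain Python lists).
    Returns T // n vectors, each the concatenation of n neighbours.
    """
    assert len(x) % n == 0, "sequence length must be divisible by n"
    return [sum(x[i:i + n], []) for i in range(0, len(x), n)]


# Hypothetical input: 8 context positions, 1 feature per position.
context = [[float(t)] for t in range(8)]

level = context
shapes = [(len(level), len(level[0]))]
while len(level) > 1:
    # Fuse pairs of neighbours; a real layer would apply a linear map here.
    level = flatten_consecutive(level, 2)
    shapes.append((len(level), len(level[0])))

print(shapes)
```

Each step halves the number of positions and doubles the features per position, which is why information from the full context is merged gradually rather than all at once in the first layer.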
216338
3846
220

Let's build the GPT Tokenizer

20.02.2024
The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets, training algorithms (Byte Pair Encoding), and after training implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
Chapters:
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
01:14:59 GPT-2 encoder.py released by OpenAI walkthrough
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:43:27 how to set vocabulary set? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? 🙂
Exercises:
- Advised flow: reference this document and try to implement the steps before I give away the partial solutions in the video. The full solutions if you're getting stuck are in the minbpe code https://github.com/karpathy/minbpe/...
Links:
- Google colab for the video: https://colab.research.google.com/d...
- GitHub repo for the video: minBPE https://github.com/karpathy/minbpe
- Playlist of the whole Zero to Hero series so far: https://www.youtube.com/watch?v=VMj...
- our Discord channel: https://discord.gg/3zy8kqD9Cp
- my Twitter: https://twitter.com/karpathy
Supplementary links:
- tiktokenizer https://tiktokenizer.vercel.app
- tiktoken from OpenAI: https://github.com/openai/tiktoken
- sentencepiece from Google https://github.com/google/sentencep...
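The Byte Pair Encoding steps listed in the chapters (counting consecutive pairs, merging the most common pair, then encode and decode) can be sketched roughly as below. This is a minimal byte-level illustration under stated assumptions: the function names are chosen for this sketch and are not confirmed to match the video or the minbpe code, and the regex splitting and special tokens covered later in the lecture are left out entirely.

```python
def get_stats(ids):
    """Count how often each consecutive pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merges on the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, kept in the order the merges were learned
    for k in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)  # most common pair
        idx = 256 + k                     # new ids start after the 256 byte values
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return merges


def encode(text, merges):
    """Strings to tokens: apply the learned merges, earliest-learned first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left to merge
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, merges):
    """Tokens back to strings: expand each token to its bytes, then decode UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), idx in merges.items():  # dict order equals merge order
        vocab[idx] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


text = "aaabdaaabac"
merges = train(text, 3)
ids = encode(text, merges)
print(ids, decode(ids, merges))
```

The compression ratio mentioned in the chapters would be the number of input bytes divided by the number of tokens after merging.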
762308
22721
1090

Let's reproduce GPT-2 (124M)

09.06.2024
We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 papers and their hyperparameters, then we hit run, and come back the next morning to see our results, and enjoy some amusing model generations. Keep in mind that in some places this video builds on the knowledge from earlier videos in the Zero to Hero Playlist (see my channel). You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.
Links:
- build-nanogpt GitHub repo, with all the changes in this video as individual commits: https://github.com/karpathy/build-n...
- nanoGPT repo: https://github.com/karpathy/nanoGPT
- llm.c repo: https://github.com/karpathy/llm.c
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- our Discord channel: https://discord.gg/3zy8kqD9Cp
Supplementary links:
- Attention is All You Need paper: https://arxiv.org/abs/1706.03762
- OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
- OpenAI GPT-2 paper: https://d4mucfpksywv.cloudfront.net...
The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com
Chapters:
00:00:00 intro: Let’s reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let’s train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 parameter sharing wte and lm_head
01:13:47 model initialization: std 0.02, residual init
01:22:18 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
01:28:14 Tensor Cores, timing the code, TF32 precision, 333ms
01:39:38 float16, gradient scalers, bfloat16, 300ms
01:48:15 torch.compile, Python overhead, kernel fusion, 130ms
02:00:18 flash attention, 96ms
02:06:54 nice/ugly numbers. vocab size 50257 → 50304, 93ms
02:14:55 SECTION 3: hyperparameters, AdamW, gradient clipping
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run
03:43:05 SECTION 4: results in the morning! GPT-2, GPT-3 repro
03:56:21 shoutout to llm.c, equivalent but faster code in raw C/CUDA
03:59:39 summary, phew, build-nanogpt github repo
Corrections: I will post all errata and followups to the build-nanogpt GitHub repo (link above)
SuperThanks: I experimentally enabled them on my channel yesterday. Totally optional and only use if rich. All revenue goes to supporting my work in AI + Education.
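The "learning rate scheduler: warmup + cosine decay" chapter names a common schedule shape, which can be sketched as follows. The function name and every constant here (`max_lr`, `min_lr`, `warmup_steps`, `max_steps`) are illustrative assumptions for this sketch, not values confirmed by the description.

```python
import math


def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=10, max_steps=50):
    """Learning rate for a given step: linear warmup, then cosine decay.

    All default values are hypothetical placeholders.
    """
    # 1) linear warmup from max_lr / warmup_steps up to max_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) after the decay window, stay at the floor
    if step > max_steps:
        return min_lr
    # 3) in between, cosine decay from max_lr down to min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


for step in (0, 5, 10, 30, 50, 60):
    print(step, get_lr(step))
```

In a training loop the returned value would typically be written into each optimizer parameter group before the optimizer step.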
780329
25755
1083

How I use LLMs

27.02.2025
The example-driven, practical walkthrough of Large Language Models and their growing list of related features, as a new entry to my general audience series on LLMs. In this more practical followup, I take you through the many ways I use LLMs in my own life.
Chapters
00:00:00 Intro into the growing LLM ecosystem
00:02:54 ChatGPT interaction under the hood
00:13:12 Basic LLM interactions examples
00:18:03 Be aware of the model you're using, pricing tiers
00:22:54 Thinking models and when to use them
00:31:00 Tool use: internet search
00:42:04 Tool use: deep research
00:50:57 File uploads, adding documents to context
00:59:00 Tool use: python interpreter, messiness of the ecosystem
01:04:35 ChatGPT Advanced Data Analysis, figures, plots
01:09:00 Claude Artifacts, apps, diagrams
01:14:02 Cursor: Composer, writing code
01:22:28 Audio (Speech) Input/Output
01:27:37 Advanced Voice Mode aka true audio inside the model
01:37:09 NotebookLM, podcast generation
01:40:20 Image input, OCR
01:47:02 Image output, DALL-E, Ideogram, etc.
01:49:14 Video input, point and talk on app
01:52:23 Video output, Sora, Veo 2, etc etc.
01:53:29 ChatGPT memory, custom instructions
01:58:38 Custom GPTs
02:06:30 Summary
Links
- Tiktokenizer https://tiktokenizer.vercel.app/
- OpenAI's ChatGPT https://chatgpt.com/
- Anthropic's Claude https://claude.ai/
- Google's Gemini https://gemini.google.com/
- xAI's Grok https://grok.com/
- Perplexity https://www.perplexity.ai/
- Google's NotebookLM https://notebooklm.google.com/
- Cursor https://www.cursor.com/
- Histories of Mysteries AI podcast on Spotify https://open.spotify.com/show/3K4LR...
- The visualization UI I was using in the video: https://excalidraw.com/
- The specific file of Excalidraw we built up: https://drive.google.com/file/d/1DN...
- Discord channel for Eureka Labs and this video: https://discord.gg/3zy8kqD9Cp
Educational Use Licensing
This video is freely available for educational and internal training purposes. Educators, students, schools, universities, nonprofit institutions, businesses, and individual learners may use this content freely for lessons, courses, internal training, and learning activities, provided they do not engage in commercial resale, redistribution, external commercial use, or modify content to misrepresent its intent.
1189094
43485
1749

Deep Dive into LLMs like ChatGPT

05.02.2025
This is a general audience deep dive into the Large Language Model (LLM) AI technology that powers ChatGPT and related products. It covers the full training stack of how the models are developed, along with mental models of how to think about their "psychology", and how to get the best use of them in practical applications. I have one "Intro to LLMs" video already from ~a year ago, but that is just a re-recording of a random talk, so I wanted to loop around and do a lot more comprehensive version.
Instructor
Andrej was a founding member at OpenAI (2015) and then Sr. Director of AI at Tesla (2017-2022), and is now a founder at Eureka Labs, which is building an AI-native school. His goal in this video is to raise knowledge and understanding of the state of the art in AI, and empower people to effectively use the latest and greatest in their work. Find more at https://karpathy.ai/ and https://x.com/karpathy
Chapters
00:00:00 introduction
00:01:00 pretraining data (internet)
00:07:47 tokenization
00:14:27 neural network I/O
00:20:11 neural network internals
00:26:01 inference
00:31:09 GPT-2: training and inference
00:42:52 Llama 3.1 base model inference
00:59:23 pretraining to post-training
01:01:06 post-training data (conversations)
01:20:32 hallucinations, tool use, knowledge/working memory
01:41:46 knowledge of self
01:46:56 models need tokens to think
02:01:11 tokenization revisited: models struggle with spelling
02:04:53 jagged intelligence
02:07:28 supervised finetuning to reinforcement learning
02:14:42 reinforcement learning
02:27:47 DeepSeek-R1
02:42:07 AlphaGo
02:48:26 reinforcement learning from human feedback (RLHF)
03:09:39 preview of things to come
03:15:15 keeping track of LLMs
03:18:34 where to find LLMs
03:21:46 grand summary
Links
- ChatGPT https://chatgpt.com/
- FineWeb (pretraining dataset): https://huggingface.co/spaces/Huggi...
- Tiktokenizer: https://tiktokenizer.vercel.app/
- Transformer Neural Net 3D visualizer: https://bbycroft.net/llm
- llm.c Let's Reproduce GPT-2 https://github.com/karpathy/llm.c/d...
- Llama 3 paper from Meta: https://arxiv.org/abs/2407.21783
- Hyperbolic, for inference of base model: https://app.hyperbolic.xyz/
- InstructGPT paper on SFT: https://arxiv.org/abs/2203.02155
- HuggingFace inference playground: https://huggingface.co/spaces/huggi...
- DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948
- TogetherAI Playground for open model inference: https://api.together.xyz/playground
- AlphaGo paper (PDF): https://discovery.ucl.ac.uk/id/epri...
- AlphaGo Move 37 video: https://www.youtube.com/watch?v=HT-...
- LM Arena for model rankings: https://lmarena.ai/
- AI News Newsletter: https://buttondown.com/ainews
- LMStudio for local inference https://lmstudio.ai/
- The visualization UI I was using in the video: https://excalidraw.com/
- The specific file of Excalidraw we built up: https://drive.google.com/file/d/1EZ...
- Discord channel for Eureka Labs and this video: https://discord.gg/3zy8kqD9Cp
Educational Use Licensing
This video is freely available for educational and internal training purposes. Educators, students, schools, universities, nonprofit institutions, businesses, and individual learners may use this content freely for lessons, courses, internal training, and learning activities, provided they do not engage in commercial resale, redistribution, external commercial use, or modify content to misrepresent its intent.
2261760
65592
2727

[1hr Talk] Intro to Large Language Models

23.11.2023
This is a 1 hour general-audience introduction to Large Language Models: the core technical component behind systems like ChatGPT, Claude, and Bard. What they are, where they are headed, comparisons and analogies to present-day operating systems, and some of the security-related challenges of this new computing paradigm. As of November 2023 (this field moves fast!).
Context: This video is based on the slides of a talk I gave recently at the AI Security Summit. The talk was not recorded but a lot of people came to me after and told me they liked it. Seeing as I had already put in one long weekend of work to make the slides, I decided to just tune them a bit, record this round 2 of the talk and upload it here on YouTube. Pardon the random background, that's my hotel room during the thanksgiving break.
- Slides as PDF: https://drive.google.com/file/d/1px... (42MB)
- Slides as Keynote: https://drive.google.com/file/d/1FP... (140MB)
Few things I wish I said (I'll add items here as they come up):
- The dreams and hallucinations do not get fixed with finetuning. Finetuning just "directs" the dreams into "helpful assistant dreams". Always be careful with what LLMs tell you, especially if they are telling you something from memory alone. That said, similar to a human, if the LLM used browsing or retrieval and the answer made its way into the "working memory" of its context window, you can trust the LLM a bit more to process that information into the final answer. But TLDR right now, do not trust what LLMs say or do. For example, in the tools section, I'd always recommend double-checking the math/code the LLM did.
- How does the LLM use a tool like the browser? It emits special words, e.g. |BROWSER|. When the code "above" that is inferencing the LLM detects these words, it captures the output that follows, sends it off to a tool, comes back with the result and continues the generation. How does the LLM know to emit these special words? Finetuning datasets teach it how and when to browse, by example. And/or the instructions for tool use can also be automatically placed in the context window (in the “system message”).
- You might also enjoy my 2015 blog post "Unreasonable Effectiveness of Recurrent Neural Networks". The way we obtain base models today is pretty much identical on a high level, except the RNN is swapped for a Transformer. http://karpathy.github.io/2015/05/2...
- What is in the run.c file? A bit more full-featured 1000-line version here: https://github.com/karpathy/llama2....
Chapters:
Part 1: LLMs
00:00:00 Intro: Large Language Model (LLM) talk
00:00:20 LLM Inference
00:04:17 LLM Training
00:08:58 LLM dreams
00:11:22 How do they work?
00:14:14 Finetuning into an Assistant
00:17:52 Summary so far
00:21:05 Appendix: Comparisons, Labeling docs, RLHF, Synthetic data, Leaderboard
Part 2: Future of LLMs
00:25:43 LLM Scaling Laws
00:27:43 Tool Use (Browser, Calculator, Interpreter, DALL-E)
00:33:32 Multimodality (Vision, Audio)
00:35:00 Thinking, System 1/2
00:38:02 Self-improvement, LLM AlphaGo
00:40:45 LLM Customization, GPTs store
00:42:15 LLM OS
Part 3: LLM Security
00:45:43 LLM Security Intro
00:46:14 Jailbreaks
00:51:30 Prompt Injection
00:56:23 Data poisoning
00:58:37 LLM Security conclusions
End
00:59:23 Outro
Educational Use Licensing
This video is freely available for educational and internal training purposes. Educators, students, schools, universities, nonprofit institutions, businesses, and individual learners may use this content freely for lessons, courses, internal training, and learning activities, provided they do not engage in commercial resale, redistribution, external commercial use, or modify content to misrepresent its intent.
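The tool-use loop described in the notes above (the model emits a special word such as |BROWSER|, the surrounding code captures what follows, calls the tool, and resumes generation with the result in context) can be sketched with stand-ins. Everything except the |BROWSER| marker is an assumption made for illustration: the |END| and |RESULT| markers, the `fake_model` and `fake_browser` functions, and the canned lookup are hypothetical and do not describe any real product's protocol.

```python
TOOL_MARKER = "|BROWSER|"    # special word named in the description
END_MARKER = "|END|"         # hypothetical: closes a tool query or result
RESULT_MARKER = "|RESULT|"   # hypothetical: introduces the tool output


def fake_model(context):
    """Stand-in for an LLM: returns the next chunk of text for a context."""
    if RESULT_MARKER not in context:
        # No tool output yet, so "decide" to browse.
        return f"Let me check. {TOOL_MARKER} capital of France {END_MARKER}"
    # Tool output is in the context window, so read the answer from it.
    result = context.split(RESULT_MARKER, 1)[1].split(END_MARKER, 1)[0].strip()
    return f" The answer is {result}."


def fake_browser(query):
    """Stand-in for a search tool with one canned answer."""
    return {"capital of France": "Paris"}.get(query.strip(), "no result")


def run(prompt, max_turns=4):
    """Driver code 'above' the model: watch for the marker, call the tool."""
    context = prompt
    for _ in range(max_turns):
        chunk = fake_model(context)
        context += chunk
        if TOOL_MARKER not in chunk:
            break  # ordinary text, generation is finished
        # Capture the text after the marker and send it to the tool.
        query = chunk.split(TOOL_MARKER, 1)[1].split(END_MARKER, 1)[0]
        context += f" {RESULT_MARKER} {fake_browser(query)} {END_MARKER}"
    return context


transcript = run("Q: What is the capital of France? ")
print(transcript)
```

The point of the sketch is that the tool result lands in the context window, which matches the note that answers processed from "working memory" deserve a bit more trust than answers recalled from memory alone.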
2701894
79022
4768

Let's build GPT: from scratch, in code, spelled out.

17.01.2023
We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!). I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and basics of tensors and PyTorch nn, which we take for granted in this video.
Links:
- Google colab for the video: https://colab.research.google.com/d...
- GitHub repo for the video: https://github.com/karpathy/ng-vide...
- Playlist of the whole Zero to Hero series so far: https://www.youtube.com/watch?v=VMj...
- nanoGPT repo: https://github.com/karpathy/nanoGPT
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- our Discord channel: https://discord.gg/3zy8kqD9Cp
Supplementary links:
- Attention is All You Need paper: https://arxiv.org/abs/1706.03762
- OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
- OpenAI ChatGPT blog post: https://openai.com/blog/chatgpt/
- The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com . If you prefer to work in notebooks, I think the easiest path today is Google Colab.
Suggested exercises:
- EX1: The n-dimensional tensor mastery challenge: Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT).
- EX2: Train the GPT on your own dataset of choice! What other data could be fun to blabber on about? (A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem using y=-1 in the targets (see CrossEntropyLoss ignore_index). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.)
- EX3: Find a dataset that is very large, so large that you can't see a gap between train and val loss. Pretrain the transformer on this data, then initialize with that model and finetune it on tiny shakespeare with a smaller number of steps and lower learning rate. Can you obtain a lower validation loss by the use of pretraining?
- EX4: Read some transformer papers and implement one additional feature or change that people seem to use. Does it improve the performance of your GPT?
Chapters:
00:00:00 intro: ChatGPT, Transformers, nanoGPT, Shakespeare
baseline language modeling, code setup
00:07:52 reading and exploring the data
00:09:28 tokenization, train/val split
00:14:27 data loader: batches of chunks of data
00:22:11 simplest baseline: bigram language model, loss, generation
00:34:53 training the bigram model
00:38:00 port our code to a script
Building the "self-attention"
00:42:13 version 1: averaging past context with for loops, the weakest form of aggregation
00:47:11 the trick in self-attention: matrix multiply as weighted aggregation
00:51:54 version 2: using matrix multiply
00:54:42 version 3: adding softmax
00:58:26 minor code cleanup
01:00:18 positional encoding
01:02:00 THE CRUX OF THE VIDEO: version 4: self-attention
01:11:38 note 1: attention as communication
01:12:46 note 2: attention has no notion of space, operates over sets
01:13:40 note 3: there is no communication across batch dimension
01:14:14 note 4: encoder blocks vs. decoder blocks
01:15:39 note 5: attention vs. self-attention vs. cross-attention
01:16:56 note 6: "scaled" self-attention. why divide by sqrt(head_size)
Building the Transformer
01:19:11 inserting a single self-attention block to our network
01:21:59 multi-headed self-attention
01:24:25 feedforward layers of transformer block
01:26:48 residual connections
01:32:51 layernorm (and its relationship to our previous batchnorm)
01:37:49 scaling up the model! creating a few variables. adding dropout
Notes on Transformer
01:42:39 encoder vs. decoder vs. both (?) Transformers
01:46:22 super quick walkthrough of nanoGPT, batched multi-headed self-attention
01:48:53 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
01:54:32 conclusions
Corrections:
00:57:00 Oops "tokens from the _future_ cannot communicate", not "past". Sorry! 🙂
01:20:05 Oops I should be using the head_size for the normalization, not C
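Several chapters above (averaging past context, weighted aggregation, adding softmax, dividing by sqrt(head_size), and the correction that tokens from the future cannot communicate) describe one idea that can be sketched in plain Python. This is a single-head illustration on nested lists, with no batch dimension and no learned projections; the function names are hypothetical, and the video itself works with PyTorch tensors rather than lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head causal attention on lists of T vectors.

    q and k hold head_size floats per position; v may have any width.
    Position t attends only to positions 0..t, never to the future.
    """
    T, head_size = len(q), len(q[0])
    out = []
    for t in range(T):
        # Scaled dot-product scores against the past and present only.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(head_size)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted aggregation of the value vectors.
        out.append([
            sum(w * v[s][d] for s, w in enumerate(weights))
            for d in range(len(v[0]))
        ])
    return out


# With all-zero queries and keys every allowed position gets equal weight,
# so the output reduces to a running average of the values ("version 1").
zeros = [[0.0, 0.0] for _ in range(3)]
values = [[1.0], [3.0], [5.0]]
averaged = causal_self_attention(zeros, zeros, values)
print(averaged)
```

Dividing by sqrt(head_size) keeps the scores from growing with the vector width, which would otherwise push softmax toward a one-hot distribution.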
5441881
126275
2769