The KV Cache: Memory Usage in Transformers
Published 2023-07-21