# LLM Inference from Principles to Production (LLM 从原理到生产级推理)
🌐 Read on GitHub Pages (English Version) | 🌐 Read the Web Version (Chinese Version)
## Foreword
This book is a systematic survey of Large Language Model (LLM) inference technology. It grew out of the author's study notes and reflections during paternity leave (welcoming my second baby, Emerson! 👶🍼🧸). With daily work focused on internal clusters serving inference and other workloads, the author rarely had time to follow open-source progress. The core objectives of this book are therefore:
- Build Mental Models (End-to-End): Connect scattered knowledge points into an end-to-end understanding of core LLM inference principles and serving frameworks, establishing a global picture.
- Track Open-Source Progress: Follow frontier progress in the open-source community to look beyond daily work, focusing especially on Kubernetes evolution and how it can better adapt to LLM inference.
- Establish a Sustainable Framework: Provide a foundation that is easy to keep updating, so these notes stay current over time.
Disclaimer and Positioning: This book is neither a tome of mathematical derivations nor a frontier tracker chasing each day's newest papers. We will not get entangled in complex formulas or trivial code details; our focus is on revealing the essential logic behind the technology.
Additionally, this book was written with deep assistance from Gemini and Claude; without AI, the author could not have learned and absorbed such a broad field in a single month. This is cause to thank AI, and it has also reinforced my determination to "do inference well": only by building efficient inference infrastructure can we make AI benefit more people.
Target Audience: The primary audience for this book is the author. It is open-sourced on GitHub for convenient version control, easy switching between machines, and in the hope of benefiting peers interested in this field (system architects, backend engineers, AI product managers, and developers curious about the underlying mechanisms). If you too hope to build a global understanding of large-model inference, from principles to production serving, I hope these notes offer some inspiration. Feel free to point out any errors.
## Table of Contents
- Foreword
- Part 1: Principles — Transformer and LLM Foundations
    - Chapter 1: Transformer Architecture Analysis: The Mechanism of Q, K, and V
        - Section 1: Bird's-Eye View: The Macro Division of Labor in the Classic Transformer Architecture
        - Section 2: Evolution: The Modern LLM Decoder-Only Architecture
        - Section 3: The Library Analogy: Understanding the Logical Meaning of Q, K, and V in Self-Attention
        - Section 4: Mathematical Principles: Self-Attention Weight Matrices and Dynamic Vector Generation
        - Section 5: Feed-Forward Network (FFN): The Model's Knowledge Base
        - Section 6: Expanding Self-Attention: Multi-Head Attention
        - Section 7: Expanding the FFN: Mixture of Experts (MoE)
    - Chapter 2: Multi-Layer Stacking and Data Flow Mechanisms
        - Section 1: The Skyscraper Entrance: Word Embedding and Positional Encoding
        - Section 2: The Mechanism of Stacking: Hierarchical Feature Extraction
        - Section 3: The Translator: The LM Head
        - Section 4: Logits and Softmax: Converting Raw Scores into Probability Distributions
        - Section 5: Sidebar: Which Weights Make Up the 8B/70B Parameter Counts We Often Quote?
        - Section 6: Data Flow: An End-to-End Panoramic View from Bottom to Top
    - Chapter 3: Autoregressive Decoding and Text Generation Mechanisms
- Part 3: Single Node — High-Performance Engines with Optimized VRAM
    - Chapter 9: Model Architecture VRAM Optimization: GQA
    - Chapter 10: Precision Reduction: KV Cache Quantization (FP8/INT8)
    - Chapter 11: VRAM Management at the Engine Level: PagedAttention
    - Chapter 12: Prefix Caching Based on Radix Trees (RadixAttention)
    - Chapter 13: Dynamic Scheduling Mechanisms: Continuous Batching and Chunked Prefill
    - Chapter 14: Preemption and Scheduling Under VRAM Saturation
    - Chapter 15: Trading "Idle Compute" for "Ultimate Latency": Speculative Decoding
    - Chapter 16: Distributed Slicing of Large Models: Tensor, Pipeline, and Context Parallelism
        - Section 1: The Necessity of Multiple Machines: VRAM Capacity Limitations
        - Section 2: TP and PP: Vertical and Horizontal Slicing
        - Section 3: Automatic Distribution: Distributed Decoupling of Compute and Memory
        - Section 4: The Impact of TP and PP on Core Metrics
        - Section 5: Breaking the Sequence Wall: Context Parallelism
        - Section 6: Hybrid Parallelism: The 3D Topology of TP, PP, and CP
    - Chapter 17: Expert Parallelism (EP)
    - Chapter 18: Disaggregated Serving Architecture
    - Chapter 19: Content-Aware Routing: The Traffic Police
    - Chapter 20: Network Communication and High-Speed Interconnects in Large Model Inference
- Part 5: Orchestration — Taming the Supercomputer: Harnessing AI Compute with Kubernetes
    - Chapter 21: When "Loose Coupling" Meets "Tight Coupling": The Collision of K8s and LLM Lifecycles
    - Chapter 22: Racing Against Time: Model Distribution and Cold-Start Optimization
    - Chapter 23: Tentacles Reaching into the Motherboard: DRA and Hardware-Topology-Aware Scheduling
        - Section 1: The Topology Black Hole: Why Scalar Counting Fails in Distributed Inference
        - Section 2: Evolution: DRA (Dynamic Resource Allocation) and the Resource Management Paradigm Revolution
        - Section 3: The Single-Node Battle: Facing Hardware Locality
        - Section 4: Beyond a Single Node: Cluster-Level Network Topology and Multi-Machine Synergy
    - Chapter 24: Breaking Silos: LeaderWorkerSet and Distributed LLM Orchestration
    - Chapter 25: All-or-Nothing: Gang Scheduling and Resource Deadlocks
    - Chapter 26: The Breathing of the Pool: Pod and Node Autoscaling in LLM Inference
    - Chapter 27: Operations and Upgrades: Continuity vs. Heavy Assets
    - Part 5 Summary: Core Contradictions and Breakthroughs in LLM Orchestration