High-Level Conceptual Notes
1. Model Serving (Local & Global)
- Local Serving
- Single data center or a single node handling inference.
- Lower latency (no cross-region hops), simpler infrastructure.
- Works best if all your users/traffic are geographically close.
- Global Serving
- Multiple data centers or cloud regions, each with replicas of the model.
- Load balancer directs requests based on proximity or resource availability.
- Minimizes latency for globally distributed users, but cost and complexity go up.
- Must consider cross-region load balancing, network overhead, and possible replication of large model artifacts.
- Key Considerations
- Latency: Real-time user experience requires compute close to the user.
- Concurrency: More replicas → handle more requests but higher overall cost.
- Resilience: Multi-region can handle outages, but orchestration gets complex.
- Cost: GPU hours in multiple regions can be pricey; you might scale down in regions with low usage.
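A minimal sketch of the "proximity or resource availability" routing rule described above. The `Region` fields, the region names, and the `max_rtt_ms` cutoff are all hypothetical; a real load balancer would use measured latency and live capacity signals.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    rtt_ms: float      # assumed round-trip time from the user to this region
    free_slots: int    # assumed remaining request capacity in this region

def pick_region(regions, max_rtt_ms=250.0):
    """Return the closest region that still has capacity, or None."""
    candidates = [r for r in regions
                  if r.free_slots > 0 and r.rtt_ms <= max_rtt_ms]
    if not candidates:
        return None
    # Proximity first; on a tie, prefer the region with more headroom.
    return min(candidates, key=lambda r: (r.rtt_ms, -r.free_slots))

regions = [
    Region("us-east", rtt_ms=20.0, free_slots=0),   # closest, but saturated
    Region("us-west", rtt_ms=70.0, free_slots=12),
    Region("eu-west", rtt_ms=95.0, free_slots=40),
]
chosen = pick_region(regions)
print(chosen.name)  # us-west
```

The saturated nearby region is skipped, which is the latency-versus-availability trade-off in miniature.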
2. Continuous Batching
- Core Idea
- At each decode iteration, the scheduler merges the next-token computation of all in-flight requests into one forward pass; new requests join and finished ones leave between iterations.
- Greatly boosts GPU utilization by combining many small workloads into fewer, larger kernel launches.
- Benefits
- Throughput: Fewer, larger GPU kernels instead of many small ones.
- Latency: Small added “batch wait” (a few ms) can be worth the big throughput gain.
- Easy Integration: Serving frameworks such as vLLM and Hugging Face TGI do this automatically.
- Mechanics
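- At every iteration, the orchestrator collects all sessions that still need a token (each may be at a different position in its own sequence), runs one forward pass that yields one token per session, then rebuilds the batch for the next iteration.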
- Works even if requests arrive asynchronously or have different prompt lengths.
- Trade-Off
- If concurrency is very low, the batch size may stay at 1 and the gains shrink.
- Cap the batching wait so users are not stalled while a batch fills.
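The mechanics above can be sketched as a toy scheduler simulation. This is not how vLLM or TGI implement it; `continuous_batching`, the request tuple layout, and `max_batch_size` are illustrative names, and a "forward pass" is reduced to decrementing a counter.

```python
from collections import deque

def continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, arrival_step, tokens_to_generate).
    Returns, for each decode step, the ids that shared the forward pass.
    """
    pending = deque(sorted(requests, key=lambda r: r[1]))
    running = {}      # request_id -> tokens still to generate
    schedule = []
    step = 0
    while pending or running:
        # Admit requests that have arrived while batch slots are free.
        while (pending and pending[0][1] <= step
               and len(running) < max_batch_size):
            req_id, _, n_tokens = pending.popleft()
            running[req_id] = n_tokens
        # One forward pass produces one token for every running request.
        schedule.append(sorted(running))
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] <= 0:
                del running[req_id]   # slot is freed for the next step
        step += 1
    return schedule

schedule = continuous_batching(
    [("a", 0, 3), ("b", 0, 1), ("c", 1, 2)], max_batch_size=2)
print(schedule)  # [['a', 'b'], ['a', 'c'], ['a', 'c']]
```

Request "c" takes over the slot freed by "b" without waiting for "a" to finish, which is the difference from static batching.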
3. Parallelism Strategies (Pros, Cons, Uniqueness)
A. Data Parallelism
- Concept: Replicate the entire model on each GPU (or node).
- Pros
- Simple: Each GPU runs a full copy of the model, no cross-GPU sync for forward passes.
- Scales throughput roughly linearly with replica count when there are many concurrent requests.
- Cons
- Doesn’t reduce per-request latency. One large inference still runs on one GPU.
- Memory duplication for all replicas → might be expensive if model is huge.
- Use Case
- High concurrency (many user requests), and the model fits on one GPU.
B. Tensor (Model) Parallelism
- Concept: Split each weight matrix (or attention heads) across multiple GPUs.
- Pros
- Enables serving models larger than a single GPU’s memory.
- Can reduce latency if compute is heavy and interconnect is fast.
- Cons
- Requires high-speed GPU interconnects (NVLink/InfiniBand). Communication overhead can bottleneck.
- Complex to implement by hand (frameworks such as Megatron-LM help). Diminishing returns with too many GPUs.
- Use Case
- Ultra-large model that doesn’t fit on one GPU. HPC or specialized clusters.
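To make "split each weight matrix across GPUs" concrete, here is a pure-Python sketch of a column-wise split. Each "GPU" is just a list slice, and the final concatenation stands in for the all-gather communication step. The function names are illustrative and do not come from Megatron-LM.

```python
def matmul(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out cols)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def split_columns(w, n_shards):
    """Give each simulated GPU a contiguous slice of the output columns."""
    d_out = len(w[0])
    assert d_out % n_shards == 0, "columns must divide evenly in this sketch"
    width = d_out // n_shards
    return [[row[s * width:(s + 1) * width] for row in w]
            for s in range(n_shards)]

def tensor_parallel_matmul(x, w, n_shards):
    shards = split_columns(w, n_shards)
    # Each partial product would run on a separate GPU.
    partials = [matmul(x, shard) for shard in shards]
    # All-gather: concatenating slices is the cross-GPU communication.
    return [value for part in partials for value in part]

x = [1.0, 2.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]
print(tensor_parallel_matmul(x, w, n_shards=2))  # [1.0, 2.0, 4.0, 7.0]
```

Each shard holds only half of the weights, which is why the model can exceed one GPU's memory, and the gather at the end is where interconnect speed starts to matter.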
C. Pipeline Parallelism
- Concept: Each GPU holds a consecutive chunk of layers; pass activations from one stage to the next in sequence.
- Pros
- Also allows serving bigger models than one GPU can hold.
- Increases throughput if you have many micro-batches in flight (assembly-line).
- Cons
- Single-request latency can be higher (sequential pipeline stages + inter-stage comm).
- Requires careful load balancing among pipeline stages.
- Use Case
- Very deep models (lots of layers). Good for batch or multi-request concurrency.
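The assembly-line claim can be checked with simple arithmetic. The sketch below assumes an idealized pipeline in which every stage takes exactly one time step and communication is free. Both are simplifications, and the function names are illustrative.

```python
def pipeline_steps(n_stages, n_microbatches):
    """Time steps to push all micro-batches through an ideal pipeline."""
    return n_stages + n_microbatches - 1

def pipeline_utilization(n_stages, n_microbatches):
    """Fraction of stage-steps doing useful work (1.0 means no bubbles)."""
    busy = n_stages * n_microbatches
    total = n_stages * pipeline_steps(n_stages, n_microbatches)
    return busy / total

print(pipeline_utilization(4, 1))   # 0.25: one request keeps 1 of 4 stages busy
print(round(pipeline_utilization(4, 16), 3))  # 0.842
```

With a single request, three of the four stages sit idle at any moment, so latency gains nothing. With many micro-batches in flight, the idle "bubble" shrinks and throughput approaches the ideal.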
D. Expert Parallelism (MoE)
- Concept: Many “expert” sub-models distributed across GPUs; gating network routes tokens to specific experts.
- Pros
- Sparse activation → can hold massive total parameters with limited per-token compute.
- Scales model capacity almost arbitrarily (more experts = bigger model).
- Cons
- Complex routing, possible load imbalance if many tokens pick the same expert.
- High communication overhead when scattering tokens to different experts.
- Use Case
- Extremely large models with specialized “experts” (multi-language, multi-domain).
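A small sketch of top-k gating and the load imbalance it can produce. The gate logits are made-up numbers and `route_top_k` / `expert_load` are illustrative helpers; production MoE layers add capacity limits and load-balancing losses that are omitted here.

```python
import math
from collections import Counter

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k=2):
    """Return [(expert_index, weight), ...] for the k best-scoring experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalize so the chosen experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in top]

def expert_load(batch_gate_logits, k=2):
    """Count how many tokens each expert receives."""
    load = Counter()
    for logits in batch_gate_logits:
        for expert, _ in route_top_k(logits, k):
            load[expert] += 1
    return load

batch = [                      # one row of gate logits per token (made up)
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 2.5, 0.0, -0.5],
    [3.0, 0.2, 0.1, 2.0],
]
load = expert_load(batch, k=2)
print(dict(load))
```

Here expert 0 receives all three tokens while expert 2 receives none, which is the imbalance the Cons list warns about.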
4. KV Caching & APC (Automatic Prefix Caching)
- KV Cache
- Stores the attention Key and Value tensors of past tokens (per layer and head) so that each new token attends to them without recomputing them.
- Essential for autoregressive LLMs (GPT-like).
- Benefits
- Avoids recomputing the entire sequence for every new token: a big latency and throughput boost.
- Cost: KV memory grows with sequence length and with the number of concurrent requests, so it must be managed.
- APC (Automatic Prefix Caching)
- If multiple prompts share the same prefix, reuse the same computed KV chunk.
- Greatly speeds up repeated patterns (e.g., same system prompts or repeated instructions).
- Implementation Details
- PagedAttention (vLLM): KV is stored in fixed-size blocks ("pages") that need not be contiguous, which allows dynamic allocation, sharing between requests, and offload.
- Some frameworks automatically detect prefix overlaps.
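One way prefix overlap can be detected is by hashing fixed-size token blocks, with each hash covering everything before it. The sketch below illustrates that idea only; it does not reproduce any framework's actual scheme. `BLOCK_SIZE`, `PrefixCache`, and the string placeholder for stored KV are assumptions, and eviction is ignored.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)

def block_keys(token_ids, block_size=BLOCK_SIZE):
    """One key per full block; each key depends on all preceding tokens."""
    keys = []
    h = hashlib.sha256()
    for b in range(len(token_ids) // block_size):
        block = token_ids[b * block_size:(b + 1) * block_size]
        h.update(repr(block).encode())   # running hash chains the blocks
        keys.append(h.hexdigest())
    return keys

class PrefixCache:
    def __init__(self):
        self.blocks = {}   # key -> placeholder for a stored KV block

    def lookup_and_fill(self, token_ids):
        """Return (reused_tokens, computed_tokens) for one prompt."""
        reused = 0
        hit_prefix = True
        for key in block_keys(token_ids):
            if hit_prefix and key in self.blocks:
                reused += BLOCK_SIZE
            else:
                hit_prefix = False
                self.blocks[key] = "kv"  # stand-in for computing + storing KV
        return reused, len(token_ids) - reused

system_prompt = list(range(1, 9))            # 8 shared tokens
cache = PrefixCache()
print(cache.lookup_and_fill(system_prompt + [9, 10, 11, 12, 13]))  # (0, 13)
print(cache.lookup_and_fill(system_prompt + [20, 21, 22, 23]))     # (8, 4)
```

The second prompt reuses the two blocks covering the shared system prompt and only computes its own four tokens.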
5. Eviction Policies & Offloading Strategies
- Why Evict?
- GPU memory is precious. Large or idle sessions can hog KV space.
- Eviction frees memory for new or active sessions.
- Common Policies
- LRU (Least Recently Used): Discard the KV of the session that was accessed longest ago.
- Time-Based: If a session is idle for X seconds, remove or move it to CPU.
- Priority: Premium sessions never evict; low-priority sessions evict first.
- Offloading Approaches
- KV Offload: Move old tokens’ KV to CPU pinned memory or disk. Reload if needed.
- Partial Summarization: Summarize older context, reduce token count (soft eviction).
- FlexGen (advanced): Offloads model weights as well, loading layers on-demand.
- Trade-Off
- Reloading from CPU or disk can spike latency for revived sessions.
- Summarization saves GPU memory but might lose detailed context.
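LRU eviction combined with KV offload can be sketched with an `OrderedDict`. Sizes are counted in tokens rather than bytes, `KVCacheManager` is a hypothetical name, and "offloading" only moves an entry between two dictionaries; the reload cost from the Trade-Off list is not modeled.

```python
from collections import OrderedDict

class KVCacheManager:
    def __init__(self, gpu_capacity_tokens):
        self.capacity = gpu_capacity_tokens
        self.gpu = OrderedDict()  # session_id -> tokens, least recent first
        self.cpu = {}             # offloaded sessions

    def touch(self, session_id, n_tokens):
        """Mark a session active with n_tokens of KV on the GPU.

        Returns the session ids offloaded to CPU to make room.
        """
        if n_tokens > self.capacity:
            raise ValueError("session does not fit on the GPU at all")
        self.cpu.pop(session_id, None)   # reload if it was offloaded
        self.gpu.pop(session_id, None)   # re-inserted below as most recent
        evicted = []
        while self.gpu and sum(self.gpu.values()) + n_tokens > self.capacity:
            victim, size = self.gpu.popitem(last=False)  # least recently used
            self.cpu[victim] = size
            evicted.append(victim)
        self.gpu[session_id] = n_tokens
        return evicted

mgr = KVCacheManager(gpu_capacity_tokens=100)
mgr.touch("a", 40)
mgr.touch("b", 40)
mgr.touch("a", 45)          # "a" is now more recent than "b"
print(mgr.touch("c", 30))   # ['b']
print(mgr.touch("b", 40))   # ['a']
```

A time-based or priority policy would change only how the victim is chosen inside the loop.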
6. Scaling Inference Algorithms
Here are some acronyms and features to keep in mind:
- GQA (Grouped Query Attention) / MQA (Multi-Query Attention)
- Variants of multi-head attention in which query heads share key/value heads (MQA: one shared KV head; GQA: a few KV head groups), which shrinks the KV cache.
- Can help scale to longer contexts or reduce overhead.
- MLA (Multi-head Latent Attention)
- Attention variant introduced with DeepSeek-V2 that compresses keys and values into a small latent vector per token.
- Key idea: cache the compact latent instead of full per-head K/V, which shrinks the KV cache.
- FlashAttention
- Fused, tiled kernel that computes exact attention block by block in on-chip GPU shared memory, without materializing the full attention matrix in global memory.
- Dramatically reduces memory reads/writes, lowering latency for large sequence lengths.
- Speculative Decoding
- Use a smaller “draft model” to propose several tokens ahead, then verify them together in one forward pass of the large model.
- Speedups of roughly 2–3x are commonly reported when the draft model’s proposals are usually accepted.
- Quantization
- 8-bit or 4-bit weights (e.g., INT8, FP8, INT4) to reduce memory footprint and speed up matmul.
- Usually a slight hit to model accuracy in exchange for large gains in throughput.
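The draft-then-verify loop of speculative decoding can be sketched in its greedy form. The two "models" are toy functions over integers (hypothetical stand-ins), and verification is written as a loop for clarity, where a real system scores all drafted positions in a single forward pass of the large model.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round of greedy speculative decoding; returns accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))
    # 2. The target model checks each proposed position in order.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)   # fix the first mismatch, then stop
            return accepted
        accepted.append(tok)
    # 3. Everything matched: the target also contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted

def generate(target_next, draft_next, prompt, n_tokens, k=4):
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds

def target_next(ctx):             # toy "large model": count up modulo 10
    return (ctx[-1] + 1) % 10

def draft_next(ctx):              # toy "draft model": wrong after a 5
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10

tokens, rounds = generate(target_next, draft_next, [0], n_tokens=12)
print(tokens, rounds)
```

The output is identical to decoding with the target alone, but it takes 4 target rounds instead of 12; the one round where the draft guessed wrong yields only a single token.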
7. Ways to Optimize for Latency
Think of three angles: System-Level, Model-Level, Hardware-Level.
System-Level
- Low-Latency Batching Windows
- Keep batch windows (waiting time) short so tokens appear quickly.
- Local Serving
- Deploy replicas close to user region to cut network RTT.
- High-Speed Interconnect
- Use NVLink or InfiniBand for multi-GPU setups to reduce communication overhead.
- Pipeline Stage Minimization
- Too many pipeline stages across nodes → high hop latency.
Model-Level
- Reduced Sequence Length
- Summarize or chunk context if possible to shorten sequence.
- FlashAttention
- Minimizes attention overhead at large sequence lengths.
- Quantization
- Fewer bits → less memory traffic and faster matmul → lower latency (with caution on accuracy).
- Speculative Decoding
- Big latency win if the small draft model’s guesses are good.
Hardware-Level
- GPU Generation
- Modern GPUs (A100, H100) have better memory bandwidth + tensor cores.
- Sufficient GPU Memory
- Avoid constant offloading to CPU/disk, which kills latency.
- Efficient CUDA Kernels
- Fused ops reduce overhead (FlashAttention, fused MLP, etc.).
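A rough back-of-the-envelope estimate ties the quantization and GPU-generation bullets together. It assumes single-request decoding is limited by reading every weight from GPU memory once per token, and it ignores KV cache reads, compute time, and kernel overhead. The 7-billion-parameter model and the 2000 GB/s bandwidth are assumed round numbers, not vendor specifications.

```python
def decode_latency_floor_ms(n_params, bytes_per_param, mem_bandwidth_gb_s):
    """Lower bound on milliseconds per generated token at batch size 1."""
    weight_bytes = n_params * bytes_per_param
    seconds = weight_bytes / (mem_bandwidth_gb_s * 1e9)
    return seconds * 1e3

N_PARAMS = 7e9          # hypothetical 7B-parameter model
BANDWIDTH = 2000.0      # assumed GPU memory bandwidth in GB/s

for label, nbytes in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    ms = decode_latency_floor_ms(N_PARAMS, nbytes, BANDWIDTH)
    print(f"{label}: about {ms:.2f} ms per token")
```

Under these assumptions, halving the bits per weight halves the floor on per-token latency, and so does doubling memory bandwidth.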
8. Ways to Optimize for Throughput
Again, consider System, Model, and Hardware.
System-Level
- Continuous Batching
- Merge multiple requests per decode step → higher total tokens/sec.
- Autoscaling / Data Parallel
- Multiple replicas handle more requests in parallel.
- Eviction Policies
- Free up memory from idle sessions to serve more active requests concurrently.
- Load Balancer
- Distribute requests so no single node is overloaded while others idle.
Model-Level
- Tensor Parallel
- Split big layers across GPUs → handle bigger batches concurrently (if interconnect is fast).
- Pipeline Parallel
- Keep multiple micro-batches in flight like an assembly line.
- Quantization
- Smaller data → bigger batch fits in GPU memory → more tokens per second.
- MoE (Expert Parallel)
- Sparse activation: can handle large batch if routing is balanced.
Hardware-Level
- Scaling Up GPU Count
- More GPUs (with enough bandwidth) → more total throughput.
- High-Bandwidth Networking
- Critical if your model is sharded (tensor or pipeline).
- Faster Disks / Storage
- If offloading to disk (FlexGen), faster NVMe or SSD read speeds matter.
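How many sequences fit in a batch is often bounded by KV cache memory, which is why quantization and fewer KV heads raise throughput. The sketch below estimates that bound for a hypothetical model (32 layers, head dimension 128, 16-bit KV) on a hypothetical 80 GB GPU. It ignores activation memory and fragmentation and treats "GB" as GiB.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def max_concurrent_sequences(gpu_mem_gb, weight_gb, tokens_per_seq, **model):
    """Sequences whose KV cache fits in the memory left after the weights."""
    free_bytes = (gpu_mem_gb - weight_gb) * 1024**3
    per_seq = kv_bytes_per_token(**model) * tokens_per_seq
    return int(free_bytes // per_seq)

mha = dict(n_layers=32, n_kv_heads=32, head_dim=128)   # full multi-head
gqa = dict(n_layers=32, n_kv_heads=8, head_dim=128)    # grouped-query

print(kv_bytes_per_token(**mha))                        # 524288
print(max_concurrent_sequences(80, 14, 2048, **mha))    # 66
print(max_concurrent_sequences(80, 7, 2048, **mha))     # 73
print(max_concurrent_sequences(80, 14, 2048, **gqa))    # 264
```

In this example, halving the weight footprint adds a few sequences, while cutting KV heads from 32 to 8 quadruples the achievable batch size.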
One-Liner Reminder
- Local vs. Global → Where you serve from. Latency and resilience vs. cost and complexity trade-offs.
- Continuous Batching → Merge decode steps for throughput with minimal latency penalty.
- Parallelism → Data (throughput), Tensor (big model), Pipeline (layers), MoE (experts).
- KV & APC → Cache previous tokens and share repeated prefixes = speed.
- Eviction & Offload → LRU/time-based to manage GPU memory. Summarize or store old KV.
- Scaling Algos → FlashAttention, Quantization, Speculative Decoding = big speedups.
- Latency → Optimize system, model, hardware. Batching windows, local replicas, high-end GPUs.
- Throughput → Data parallel replicas, continuous batching, quantization, and HPC interconnect.
Keep these bullet points in mind, and you’ll have a strong mental map of how to handle high-performance, scalable LLM inference. Good luck!