High-Level Conceptual Notes

1. Model Serving (Local & Global)

  • Local Serving
    • Single data center or a single node handling inference.
    • Lower latency (no cross-region hops), simpler infrastructure.
    • Works best if all your users/traffic are geographically close.
  • Global Serving
    • Multiple data centers or cloud regions, each with replicas of the model.
    • Load balancer directs requests based on proximity or resource availability.
    • Minimizes latency for globally distributed users, but cost and complexity go up.
    • Must consider cross-region load balancing, network overhead, and possible replication of large model artifacts.
  • Key Considerations
    • Latency: Real-time user experience requires local-ish compute.
    • Concurrency: More replicas → handle more requests but higher overall cost.
    • Resilience: Multi-region can handle outages, but orchestration gets complex.
    • Cost: GPU hours in multiple regions can be pricey; you might scale down in regions with low usage.
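The proximity-or-availability routing described above can be sketched as a small selection function. This is a minimal illustration, not any particular load balancer's logic: the `Region` fields, the region names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    rtt_ms: float   # measured round-trip time from the user to this region
    active: int     # requests currently in flight
    capacity: int   # max concurrent requests the region's replicas can take


def pick_region(regions):
    """Return the lowest-RTT region with spare capacity, or None if all are full."""
    available = [r for r in regions if r.active < r.capacity]
    if not available:
        return None
    return min(available, key=lambda r: r.rtt_ms)


regions = [
    Region("us-east", rtt_ms=20.0, active=100, capacity=100),  # closest, but full
    Region("eu-west", rtt_ms=95.0, active=40, capacity=100),
    Region("ap-south", rtt_ms=210.0, active=5, capacity=100),
]
chosen = pick_region(regions)
print(chosen.name)  # eu-west
```

The nearest region wins only while it has headroom; once it is saturated, the request pays extra network latency rather than queueing, which is the latency/cost/resilience trade-off in miniature.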

2. Continuous Batching

  • Core Idea
    • At each decode iteration, the scheduler merges the next-token steps of all in-flight requests into one forward pass (iteration-level scheduling).
    • Greatly boosts GPU utilization by combining many small workloads into fewer, larger batched kernels.
  • Benefits
    • Throughput: Fewer, larger GPU kernels instead of many small ones.
    • Latency: Small added “batch wait” (a few ms) can be worth the big throughput gain.
    • Easy Integration: Tools like vLLM, Hugging Face TGI do this automatically.
  • Mechanics
    • Scheduler re-forms the batch at every decode iteration: each active sequence advances one token in a shared forward pass, finished sequences leave immediately, and waiting requests join without waiting for the batch to drain.
    • Works even if requests arrive asynchronously or have different prompt lengths.
  • Trade-Off
    • If concurrency is very low (fewer requests), your batch size might be 1. Gains may be smaller.
    • Must carefully handle user experience so we don’t stall them too long while batching.
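The mechanics above can be simulated in a few lines. This is a toy sketch of iteration-level scheduling, not how vLLM or TGI implement it: the function name, the request tuples, and the `max_batch` limit are assumptions, and one "step" stands in for one forward pass.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, arrival_step, tokens_to_generate).
    Returns the batch (list of request ids) run at each decode step.
    An empty list means the GPU was idle at that step.
    """
    pending = deque(sorted(requests, key=lambda r: r[1]))
    running = {}   # request_id -> tokens still to generate
    schedule = []
    step = 0
    while pending or running:
        # Admit requests that have arrived, as long as a batch slot is free.
        while pending and pending[0][1] <= step and len(running) < max_batch:
            rid, _, n_tokens = pending.popleft()
            running[rid] = n_tokens
        # One forward pass produces one token for every running sequence.
        schedule.append(sorted(running))
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # the slot is freed immediately
        step += 1
    return schedule


schedule = continuous_batching(
    [("a", 0, 3), ("b", 0, 1), ("c", 1, 2)], max_batch=2
)
print(schedule)  # [['a', 'b'], ['a', 'c'], ['a', 'c']]
```

Request "c" takes over the slot that "b" frees after one token, so "a" never waits for a batch boundary. With static batching, "c" would have had to wait until "a" finished.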

3. Parallelism Strategies (Pros, Cons, Uniqueness)

A. Data Parallelism

  • Concept: Replicate the entire model on each GPU (or node).
  • Pros
    • Simple: Each GPU runs a full copy of the model, no cross-GPU sync for forward passes.
    • Scales throughput near-linearly with replica count when there are many concurrent requests.
  • Cons
    • Doesn’t reduce per-request latency. One large inference still runs on one GPU.
    • Memory duplication for all replicas → might be expensive if model is huge.
  • Use Case
    • High concurrency (many user requests), and the model fits on one GPU.
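A quick back-of-the-envelope helper shows the throughput/memory trade-off of replication. The traffic figure, per-replica rate, and model size below are made-up inputs for illustration; the only assumption baked into the arithmetic is that each replica holds a full copy of the weights.

```python
import math


def replicas_needed(target_rps, per_replica_rps):
    """Replicas required to sustain target_rps, assuming near-linear scaling."""
    return math.ceil(target_rps / per_replica_rps)


def total_weight_memory_gb(n_replicas, params_billion, bytes_per_param=2):
    """Total weight memory across replicas (1e9 params * bytes = GB)."""
    return n_replicas * params_billion * bytes_per_param


n = replicas_needed(target_rps=90, per_replica_rps=12)
print(n)                             # 8
print(total_weight_memory_gb(n, 7))  # 112 (GB of FP16 weights for a 7B model)
```

Throughput grows with the replica count, but so does duplicated weight memory, and no single request gets any faster.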

B. Tensor (Model) Parallelism

  • Concept: Split each weight matrix (or attention heads) across multiple GPUs.
  • Pros
    • Enables serving models larger than a single GPU’s memory.
    • Can reduce latency if compute is heavy and interconnect is fast.
  • Cons
    • Requires high-speed GPU interconnects (NVLink/InfiniBand). Communication overhead can bottleneck.
    • Complex to implement (Megatron-LM, etc.). Diminishing returns with too many GPUs.
  • Use Case
    • Ultra-large model that doesn’t fit on one GPU. HPC or specialized clusters.
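The weight-matrix split can be shown with plain Python lists standing in for GPUs. This sketch covers only the column-parallel case (each shard owns a slice of output columns, followed by a concatenation that plays the role of an all-gather); real implementations such as Megatron-LM also use row-parallel layers and collective communication, which are not modeled here.

```python
def matmul(x, w):
    """Multiply x (m x k) by w (k x n); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, n_shards):
    """Split w column-wise into n_shards equal shards, one per simulated GPU."""
    n_cols = len(w[0])
    assert n_cols % n_shards == 0, "columns must divide evenly across shards"
    width = n_cols // n_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(n_shards)]


def tensor_parallel_matmul(x, w, n_shards):
    # Each "GPU" multiplies the full input by its own column shard.
    partials = [matmul(x, shard) for shard in split_columns(w, n_shards)]
    # "All-gather": concatenate the partial outputs along the column axis.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w, n_shards=2))  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

Each shard stores only half of `w`, which is what lets an oversized layer fit; the concatenation step is where the interconnect cost appears on real hardware.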

C. Pipeline Parallelism

  • Concept: Each GPU holds a consecutive chunk of layers; pass activations from one stage to the next in sequence.
  • Pros
    • Also allows serving bigger models than one GPU can hold.
    • Increases throughput if you have many micro-batches in flight (assembly-line).
  • Cons
    • Single-request latency can be higher (sequential pipeline stages + inter-stage comm).
    • Requires careful load balancing among pipeline stages.
  • Use Case
    • Very deep models (lots of layers). Good for batch or multi-request concurrency.
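The assembly-line behavior can be quantified with the usual idealized timing model: with S equal stages and M micro-batches, the pipeline finishes in (S + M - 1) stage-times, and the fraction of idle "bubble" slots is (S - 1) / (S + M - 1). The sketch below assumes perfectly balanced stages and ignores inter-stage communication; the stage time of 10 ms is a hypothetical value.

```python
def pipeline_time(n_stages, n_microbatches, stage_ms):
    """Total time to push all micro-batches through equally sized stages."""
    return (n_stages + n_microbatches - 1) * stage_ms


def bubble_fraction(n_stages, n_microbatches):
    """Share of stage-time slots that sit idle while the pipeline fills and drains."""
    return (n_stages - 1) / (n_stages + n_microbatches - 1)


# A single request through 4 stages: every stage waits for the previous one.
print(pipeline_time(4, 1, 10), bubble_fraction(4, 1))    # 40 0.75
# 13 micro-batches in flight: most slots are busy.
print(pipeline_time(4, 13, 10), bubble_fraction(4, 13))  # 160 0.1875
```

With one request in flight, three quarters of the GPU time is idle, which is why pipeline parallelism helps throughput under load but not single-request latency.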

D. Expert Parallelism (MoE)

  • Concept: Many “expert” sub-models distributed across GPUs; gating network routes tokens to specific experts.
  • Pros
    • Sparse activation → can hold massive total parameters with limited per-token compute.
    • Scales model capacity almost arbitrarily (more experts = bigger model).
  • Cons
    • Complex routing, possible load imbalance if many tokens pick the same expert.
    • High communication overhead when scattering tokens to different experts.
  • Use Case
    • Extremely large models with specialized “experts” (multi-language, multi-domain).
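Routing and load imbalance are easy to see in a toy top-k gate. This is a simplified sketch: the gate logits are hard-coded rather than produced by a learned gating network, and production MoE layers typically add capacity limits and load-balancing losses that are not shown.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)   # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def expert_load(all_logits, k, n_experts):
    """Count how many tokens each expert receives."""
    load = [0] * n_experts
    for logits in all_logits:
        for expert, _ in route_top_k(logits, k):
            load[expert] += 1
    return load


# Hypothetical gate logits for 3 tokens over 4 experts.
tokens = [[2.0, 1.0, 0.1, 0.0], [3.0, 0.2, 0.1, 2.5], [1.5, 0.0, 0.3, 0.2]]
load = expert_load(tokens, k=2, n_experts=4)
print(load)  # [3, 1, 1, 1]
```

Every token picks expert 0, so the GPU hosting it does three times the work of the others. That hot-spot is the load-imbalance problem listed under Cons.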

4. KV Caching & APC (Automatic Prefix Caching)

  • KV Cache
    • Stores each layer's attention Key and Value tensors for past tokens, so each new token only computes its own K/V and attends over the cached ones.
    • Essential for autoregressive LLMs (GPT-like).
  • Benefits
    • Avoids recomputing entire sequence for every new token – big latency and throughput boost.
    • Caveat: KV memory grows with sequence length and with the number of concurrent requests, so it has to be actively managed.
  • APC (Automatic Prefix Caching)
    • If multiple prompts share the same prefix, reuse the same computed KV chunk.
    • Greatly speeds up repeated patterns (e.g., same system prompts or repeated instructions).
  • Implementation Details
    • PagedAttention: KV blocks stored in “pages” for dynamic allocation and offload.
    • Some frameworks automatically detect prefix overlaps.
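One common way to detect prefix overlap is to hash the prompt block by block, with each block's key depending on everything before it. The sketch below illustrates that idea only; the block size, class name, and the `"kv"` placeholder are assumptions, and it is not the actual data structure of any specific framework.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; hypothetical value for illustration


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """One key per full block; each key covers the block plus its whole prefix."""
    keys = []
    h = hashlib.sha256()
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        h.update(",".join(map(str, block)).encode() + b";")
        keys.append(h.hexdigest())   # running hash, so the prefix is included
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block key -> placeholder for that block's KV tensors

    def lookup_and_fill(self, token_ids):
        """Return (tokens served from cache, tokens that must be computed)."""
        keys = block_keys(token_ids)
        hits = 0
        for key in keys:
            if key not in self.blocks:
                break
            hits += 1
        for key in keys[hits:]:
            self.blocks[key] = "kv"   # stand-in for computing and storing KV
        cached = hits * BLOCK_SIZE
        return cached, len(token_ids) - cached


system_prompt = list(range(8))   # shared 8-token prefix
cache = PrefixCache()
print(cache.lookup_and_fill(system_prompt + [100, 101, 102]))            # (0, 11)
print(cache.lookup_and_fill(system_prompt + [200, 201, 202, 203, 204]))  # (8, 5)
```

The second prompt reuses both blocks of the shared system prompt and only computes its own 5 tokens. Trailing tokens that do not fill a whole block are never cached in this sketch.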

5. Eviction Policies & Offloading Strategies

  • Why Evict?
    • GPU memory is precious. Large or idle sessions can hog KV space.
    • Eviction frees memory for new or active sessions.
  • Common Policies
    • LRU (Least Recently Used): Evict the KV of the session that has gone longest without being accessed.
    • Time-Based: If a session is idle for X seconds, remove or move it to CPU.
    • Priority: Premium sessions never evict; low-priority sessions evict first.
  • Offloading Approaches
    • KV Offload: Move old tokens’ KV to CPU pinned memory or disk. Reload if needed.
    • Partial Summarization: Summarize older context, reduce token count (soft eviction).
    • FlexGen (advanced): Offloads model weights as well, loading layers on-demand.
  • Trade-Off
    • Reloading from CPU or disk can spike latency for revived sessions.
    • Summarization saves GPU memory but might lose detailed context.
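LRU eviction combined with KV offload can be sketched with an `OrderedDict`. This is a minimal model under stated assumptions: capacity is counted in sessions rather than bytes, the "CPU" tier is just a second dictionary, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class KVStore:
    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()   # session_id -> KV, least recently used first
        self.cpu = {}              # offloaded sessions

    def access(self, session_id, kv=None):
        """Touch a session; returns 'hit', 'reload', or 'new'."""
        if session_id in self.gpu:
            self.gpu.move_to_end(session_id)   # mark as most recently used
            return "hit"
        if session_id in self.cpu:
            value, status = self.cpu.pop(session_id), "reload"
        else:
            value, status = kv, "new"
        self._make_room()
        self.gpu[session_id] = value
        return status

    def _make_room(self):
        while len(self.gpu) >= self.gpu_capacity:
            victim, value = self.gpu.popitem(last=False)  # least recently used
            self.cpu[victim] = value                      # offload, not discard


store = KVStore(gpu_capacity=2)
statuses = [
    store.access("a", "kvA"),
    store.access("b", "kvB"),
    store.access("a"),          # refreshes "a", so "b" becomes the LRU session
    store.access("c", "kvC"),   # evicts "b" to CPU
    store.access("b"),          # reloads "b", evicting "a"
]
print(statuses)                         # ['new', 'new', 'hit', 'new', 'reload']
print(list(store.gpu), list(store.cpu))  # ['c', 'b'] ['a']
```

The `reload` path is where the latency spike in the Trade-Off list comes from: the revived session must wait for its KV to be copied back before decoding resumes.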

6. Scaling Inference Algorithms

Here are some acronyms and features to keep in mind:

  • GQA (Grouped Query Attention) / MQA (Multi-Query Attention)
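    • Variants of multi-head attention with fewer key/value heads: MQA shares a single K/V head across all query heads, GQA shares one K/V head per group of query heads. Both shrink the KV cache.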
    • Can help scale to longer contexts or reduce overhead.
  • MLA (Multi-head Latent Attention)
    • Introduced with DeepSeek-V2; compresses keys and values into a low-rank latent vector that is cached instead of full per-head K/V.
    • Key idea: shrink the KV cache well below standard multi-head attention while aiming to preserve quality.
  • FlashAttention
    • Fused, tiled kernel that computes exact attention in on-chip SRAM (GPU shared memory) without materializing the full attention matrix in HBM.
    • Dramatically reduces memory reads/writes, lowering latency for large seq lengths.
  • Speculative Decoding
    • Use a smaller “draft model” to propose several tokens ahead, then verify them with the large model in a single forward pass.
    • Speedups of roughly 2–3x are commonly reported when the draft model’s predictions are usually accepted.
  • Quantization
    • 8-bit (INT8/FP8) or 4-bit (INT4) weights to reduce memory footprint and speed up matmul.
    • Usually a small hit to model accuracy in exchange for large memory and throughput gains.
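The quantization bullet can be made concrete with symmetric absmax INT8 quantization of a weight vector. This is one simple scheme chosen for illustration (per-tensor scale, round-to-nearest); production methods usually use per-channel or per-group scales and calibration, which are not shown.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    return [scale * v for v in q]


weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)  # [50, -127, 3, 100]
print(max_err <= scale / 2 + 1e-12)  # True: error is at most half a quantization step
```

Each weight now needs 1 byte instead of 2 (FP16) or 4 (FP32), and the reconstruction error is bounded by half a step, which is the "small accuracy hit for large memory savings" trade in numeric form.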

7. Ways to Optimize for Latency

Think of three angles: System-Level, Model-Level, Hardware-Level.

System-Level

  • Low-Latency Batching Windows
    • Keep batch windows (waiting time) short so tokens appear quickly.
  • Local Serving
    • Deploy replicas close to user region to cut network RTT.
  • High-Speed Interconnect
    • Use NVLink or InfiniBand for multi-GPU setups to reduce communication overhead.
  • Pipeline Stage Minimization
    • Too many pipeline stages across nodes → high hop latency.

Model-Level

  • Reduced Sequence Length
    • Summarize or chunk context if possible to shorten sequence.
  • FlashAttention
    • Minimizes attention overhead at large sequence lengths.
  • Quantization
    • Fewer bits → less memory traffic and faster matmuls → lower latency (with caution on accuracy).
  • Speculative Decoding
    • Big latency win if the small draft model’s guesses are good.
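The draft-then-verify loop behind speculative decoding can be sketched with two toy "models". This is the greedy variant only (accept a drafted token when it equals the large model's own greedy choice); the sampling variant uses a rejection-sampling rule that is not shown. Both model functions are made-up stand-ins, and the draft is deliberately wrong at every fourth position.

```python
def target_next(context):
    """Toy 'large model': next token is (sum of context) mod 7."""
    return sum(context) % 7


def draft_next(context):
    """Toy 'draft model': agrees with the target unless len(context) % 4 == 0."""
    guess = sum(context) % 7
    return (guess + 1) % 7 if len(context) % 4 == 0 else guess


def speculative_decode(prompt, n_new, k):
    out = list(prompt)
    target_calls = 0
    while len(out) - len(prompt) < n_new:
        # Draft proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # Target checks all proposed positions; on real hardware this is
        # one batched forward pass, so it is counted as a single call.
        target_calls += 1
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)   # fix the first wrong guess, then stop
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(out + accepted))  # bonus token if all match
        out.extend(accepted)
    return out[:len(prompt) + n_new], target_calls


def greedy_decode(prompt, n_new):
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


spec_out, calls = speculative_decode([1, 2, 3], n_new=12, k=4)
print(spec_out == greedy_decode([1, 2, 3], 12), calls)
```

The output is identical to plain greedy decoding with the large model, but the large model is invoked far fewer than 12 times. The better the draft's guesses, the more tokens each verification pass yields.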

Hardware-Level

  • GPU Generation
    • Modern GPUs (A100, H100) have better memory bandwidth + tensor cores.
  • Sufficient GPU Memory
    • Avoid constant offloading to CPU/disk, which kills latency.
  • Efficient CUDA Kernels
    • Fused ops reduce overhead (FlashAttention, fused MLP, etc.).

8. Ways to Optimize for Throughput

Again, consider System, Model, and Hardware.

System-Level

  • Continuous Batching
    • Merge multiple requests per decode step → higher total tokens/sec.
  • Autoscaling / Data Parallel
    • Multiple replicas handle more requests in parallel.
  • Eviction Policies
    • Free up memory from idle sessions to serve more active requests concurrently.
  • Load Balancer
    • Distribute requests so no single node is overloaded while others idle.

Model-Level

  • Tensor Parallel
    • Split big layers across GPUs → handle bigger batches concurrently (if interconnect is fast).
  • Pipeline Parallel
    • Keep multiple micro-batches in flight like an assembly line.
  • Quantization
    • Smaller data → bigger batch fits in GPU memory → more tokens per second.
  • MoE (Expert Parallel)
    • Sparse activation: can handle large batch if routing is balanced.
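The "smaller data → bigger batch" link can be checked with simple arithmetic: KV cache per sequence is 2 (K and V) × layers × KV heads × head dimension × sequence length × bytes per element, and whatever GPU memory the weights leave free bounds how many sequences fit. The model shape, the 24 GiB GPU, and the weight sizes below are hypothetical, and activation memory and fragmentation are ignored.

```python
import math


def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for one sequence, in GiB."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / 2**30


def max_sequences(gpu_gib, weights_gib, kv_gib_per_seq):
    """How many full-length sequences fit beside the weights."""
    return math.floor((gpu_gib - weights_gib) / kv_gib_per_seq)


mha = kv_cache_gib(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(mha, gqa)  # 2.0 0.5

print(max_sequences(24, 14, mha))  # 5   FP16 weights, multi-head attention
print(max_sequences(24, 14, gqa))  # 20  FP16 weights, grouped-query attention
print(max_sequences(24, 7, gqa))   # 34  INT8 weights, grouped-query attention
```

Cutting KV heads and halving weight precision both raise the achievable batch size, which is where the throughput gain comes from.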

Hardware-Level

  • Scaling Up GPU Count
    • More GPUs (with enough bandwidth) → more total throughput.
  • High-Bandwidth Networking
    • Critical if your model is sharded (tensor or pipeline).
  • Faster Disks / Storage
    • If offloading to disk (FlexGen), faster NVMe or SSD read speeds matter.

One-Liner Reminder

  1. Local vs. Global → Where you serve from. Concurrency vs. region & cost trade-offs.
  2. Continuous Batching → Merge decode steps for throughput with minimal latency penalty.
  3. Parallelism → Data (throughput), Tensor (big model), Pipeline (layers), MoE (experts).
  4. KV & APC → Cache previous tokens and share repeated prefixes = speed.
  5. Eviction & Offload → LRU/time-based to manage GPU memory. Summarize or store old KV.
  6. Scaling Algos → FlashAttention, Quantization, Speculative Decoding = big speedups.
  7. Latency → Optimize system, model, hardware. Batching windows, local replicas, high-end GPUs.
  8. Throughput → Data parallel replicas, continuous batching, quantization, and HPC interconnect.

Keep these bullet points in mind, and you’ll have a strong mental map of how to handle high-performance, scalable LLM inference. Good luck!