High-Level Conceptual Notes

1. Model Serving (Local & Global)

Local Serving
- Single data center or a single node handling inference.
- Lower latency (no cross-region hops), simpler infrastructure.
- Works best if all your users/traffic are geographically close.

Global Serving
- Multiple data centers or cloud regions, each with replicas of the model.
- Load balancer directs requests based on proximity or resource availability.
- Minimizes latency for globally distributed users, but cost and complexity go up.
- Must consider cross-region load balancing, network overhead, and possible replication of large model artifacts.
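
As a rough illustration of proximity-plus-availability routing, the sketch below picks the lowest-latency region that still has spare capacity. The Region fields, the region names, and the pick_region helper are hypothetical and not taken from any particular load balancer.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str          # hypothetical region label
    rtt_ms: float      # measured round-trip time from the client
    free_slots: int    # spare inference capacity (illustrative metric)

def pick_region(regions):
    """Pick the lowest-latency region that has spare capacity."""
    candidates = [r for r in regions if r.free_slots > 0]
    if not candidates:
        raise RuntimeError("no region has spare capacity")
    return min(candidates, key=lambda r: r.rtt_ms)

regions = [
    Region("us-east", rtt_ms=20, free_slots=0),   # closest, but full
    Region("eu-west", rtt_ms=95, free_slots=4),
    Region("ap-south", rtt_ms=210, free_slots=9),
]
print(pick_region(regions).name)
```

A production balancer would also weigh cost, health checks, and data residency; this only shows the basic decision.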

Key Considerations
- Latency: Real-time user experience requires compute close to the user.
- Concurrency: More replicas → handle more requests but higher overall cost.
- Resilience: Multi-region can handle outages, but orchestration gets complex.
- Cost: GPU hours in multiple regions can be pricey; you might scale down in regions with low usage.

2. Continuous Batching

Core Idea
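- At every decode step, the scheduler merges all in-flight requests into one forward pass; finished requests leave and new ones join between steps, without waiting for the whole batch to finish.
- Greatly boosts GPU utilization by combining many small workloads into a few large kernel launches.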

Benefits
- Throughput: Fewer, larger GPU kernels instead of many small ones.
- Latency: Small added “batch wait” (a few ms) can be worth the big throughput gain.
- Easy Integration: Serving frameworks such as vLLM and Hugging Face TGI do this automatically.

Mechanics
- On each iteration, the scheduler collects every active session (whatever its current position), runs one forward pass that produces the next token for each, then rebuilds the batch for the next iteration.
- Works even if requests arrive asynchronously or have different prompt lengths.
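
The scheduling loop can be sketched as a small simulation. This is a toy model under stated assumptions: prefill is ignored, every forward pass emits exactly one token per active request, and the function name and tuple layout are made up for illustration.

```python
from collections import deque

def continuous_batching(arrivals, max_batch=4):
    """Simulate iteration-level scheduling.

    arrivals: list of (arrival_step, request_id, tokens_to_generate),
              with tokens_to_generate >= 1.
    Returns {request_id: step at which its last token was produced}.
    """
    pending = deque(sorted(arrivals))
    active = {}      # request_id -> tokens still to generate
    finished = {}
    step = 0
    while pending or active:
        # Admit requests that have arrived, as long as a batch slot is free.
        while pending and pending[0][0] <= step and len(active) < max_batch:
            _, rid, n = pending.popleft()
            active[rid] = n
        # One "forward pass": every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                finished[rid] = step
                del active[rid]   # the slot is free for the next iteration
        step += 1
    return finished

done = continuous_batching([(0, "a", 3), (0, "b", 1), (1, "c", 2)], max_batch=2)
print(done)
```

Request "c" takes over the slot freed by "b" while "a" is still decoding, which is the difference from static batching.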

Trade-Off
- If concurrency is very low, the batch size may be 1 and there is little to gain.
- Cap the batching wait so that individual users are not stalled for too long.

3. Parallelism Strategies (Pros, Cons, Uniqueness)

A. Data Parallelism

Concept: Replicate the entire model on each GPU (or node).

Pros
- Simple: Each GPU runs a full copy of the model, no cross-GPU sync for forward passes.
- Scales throughput roughly linearly with the number of replicas when there are many concurrent requests.

Cons
- Doesn’t reduce per-request latency. One large inference still runs on one GPU.
- Memory duplication for all replicas → might be expensive if model is huge.

Use Case
- High concurrency (many user requests), and the model fits on one GPU.

B. Tensor (Model) Parallelism

Concept: Split each weight matrix (or attention heads) across multiple GPUs.

Pros
- Enables serving models larger than a single GPU’s memory.
- Can reduce latency if compute is heavy and interconnect is fast.

Cons
- Requires high-speed GPU interconnects (NVLink/InfiniBand). Communication overhead can bottleneck.
- Complex to implement (Megatron-LM, etc.). Diminishing returns with too many GPUs.

Use Case
- Ultra-large model that doesn’t fit on one GPU. HPC or specialized clusters.
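
To make the weight-splitting idea concrete, here is a minimal pure-Python sketch of a column-parallel matrix multiply: each "GPU" holds a slice of the weight columns, computes its partial output, and the slices are concatenated (the all-gather step). Real systems do this with GPU collectives; the helper names below are illustrative only.

```python
def matmul(x, w):
    """Plain matrix multiply on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split weight matrix w into `parts` column shards of equal width."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly"
    width = cols // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]

def column_parallel_matmul(x, w, parts):
    shards = split_columns(w, parts)
    partials = [matmul(x, shard) for shard in shards]   # one per simulated GPU
    # "All-gather": concatenate each device's output columns row by row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, parts=2))
```

The result matches the unsharded product; the concatenation step is where interconnect bandwidth matters in practice.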

C. Pipeline Parallelism

Concept: Each GPU holds a consecutive chunk of layers; pass activations from one stage to the next in sequence.

Pros
- Also allows serving bigger models than one GPU can hold.
- Increases throughput if you have many micro-batches in flight (assembly-line).

Cons
- Single-request latency can be higher (sequential pipeline stages + inter-stage comm).
- Requires careful load balancing among pipeline stages.

Use Case
- Very deep models (lots of layers). Good for batch or multi-request concurrency.
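
The assembly-line effect can be checked with simple arithmetic. The sketch assumes a forward-only pipeline in which every stage takes one equal time step and communication is free; both function names are hypothetical.

```python
def pipeline_steps(num_stages, num_microbatches):
    """Time steps until the last micro-batch leaves the last stage."""
    return num_stages + num_microbatches - 1

def utilization(num_stages, num_microbatches):
    """Fraction of stage-time slots that do useful work."""
    busy = num_stages * num_microbatches
    total = num_stages * pipeline_steps(num_stages, num_microbatches)
    return busy / total

print(utilization(4, 1))    # a single request leaves most stages idle
print(utilization(4, 13))   # many micro-batches keep the stages busy
```

With one request in flight only one of the four stages works at a time, which is why pipeline parallelism helps throughput more than single-request latency.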

D. Expert Parallelism (MoE)

Concept: Many “expert” sub-models distributed across GPUs; gating network routes tokens to specific experts.

Pros
- Sparse activation → can hold massive total parameters with limited per-token compute.
- Scales model capacity almost arbitrarily (more experts = bigger model).

Cons
- Complex routing, possible load imbalance if many tokens pick the same expert.
- High communication overhead when scattering tokens to different experts.

Use Case
- Extremely large models with specialized “experts” (multi-language, multi-domain).
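
A minimal sketch of top-k gating, assuming the gating network has already produced one logit per expert for each token. It also counts tokens per expert to show how load imbalance appears. The function names and logit values are made up for illustration.

```python
import math
from collections import Counter

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_logits, top_k=2):
    """Return [(expert_index, renormalized_weight), ...] for one token."""
    probs = softmax(token_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def expert_load(batch_logits, top_k=2):
    """Count how many tokens each expert receives."""
    load = Counter()
    for logits in batch_logits:
        for expert, _ in route(logits, top_k):
            load[expert] += 1
    return load

# Three tokens, four experts (illustrative gate logits).
batch = [[2.0, 1.0, 0.1, 0.0], [1.9, 0.2, 1.5, 0.0], [2.2, 0.1, 0.0, 1.0]]
print(expert_load(batch))
```

Here expert 0 is chosen by every token, so the GPU hosting it becomes the bottleneck; training-time balancing losses and capacity limits exist to counter this.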

4. KV Caching & APC (Automatic Prefix Caching)

KV Cache
- Stores the Key and Value tensors computed for past tokens so that later attention steps can reuse them instead of recomputing them.
- Essential for autoregressive LLMs (GPT-like).

Benefits and Costs
- Avoids recomputing the entire sequence for every new token – big latency and throughput boost.
- Memory must be managed: the KV cache grows with the number of tokens and the number of concurrent requests.
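
A back-of-the-envelope size estimate shows why KV memory needs managing. The model shape below (32 layers, 32 KV heads of dimension 128, FP16) is a hypothetical configuration chosen for round numbers, not a specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    # Factor 2: one Key tensor and one Value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
total = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)
print(per_token / 2**20, "MiB per token")
print(total / 2**30, "GiB for 8 sequences of 4096 tokens")
```

Under these assumptions each token costs 0.5 MiB, so eight 4096-token sequences already need 16 GiB of cache on top of the weights.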

APC (Automatic Prefix Caching)
- If multiple prompts share the same prefix, reuse the same computed KV chunk.
- Greatly speeds up repeated patterns (e.g., same system prompts or repeated instructions).

Implementation Details
- PagedAttention: the KV cache is stored in fixed-size blocks (“pages”) that are allocated on demand, which reduces fragmentation and makes it easier to share or offload blocks.
- Some frameworks automatically detect prefix overlaps.
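
One way prefix overlap can be detected is by hashing fixed-size token blocks together with everything that precedes them. The sketch below illustrates that idea only; the block size, class name, and placeholder values are assumptions, and it is not the implementation of any specific framework.

```python
import hashlib

BLOCK = 4  # tokens per KV block (illustrative; real block sizes differ)

def block_keys(tokens):
    """Hash each full block chained with all earlier blocks."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        keys.append(h.copy().hexdigest())
    return keys

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # chained hash -> placeholder for a stored KV block

    def lookup_and_fill(self, tokens):
        """Return how many leading tokens were served from the cache."""
        hit_tokens, matching = 0, True
        for key in block_keys(tokens):
            if matching and key in self.blocks:
                hit_tokens += BLOCK
            else:
                matching = False
                self.blocks[key] = "kv-block"  # stand-in for computed KV
        return hit_tokens

cache = PrefixCache()
system_prompt = list(range(8))                        # shared 8-token prefix
first = cache.lookup_and_fill(system_prompt + [100, 101, 102, 103])
second = cache.lookup_and_fill(system_prompt + [200, 201, 202, 203, 204])
print(first, second)
```

The second request reuses the two blocks covering the shared prefix and only computes KV for its own suffix.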

5. Eviction Policies & Offloading Strategies

Why Evict?
- GPU memory is precious. Large or idle sessions can hog KV space.
- Eviction frees memory for new or active sessions.

Common Policies
- LRU (Least Recently Used): Evict the KV of the session that has gone longest without being accessed.
- Time-Based: If a session is idle for X seconds, remove or move it to CPU.
- Priority: Premium sessions never evict; low-priority sessions evict first.
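
The LRU policy can be sketched with an ordered dictionary that tracks how many KV blocks each session holds. The class and method names are hypothetical, and a real server would offload or free actual GPU blocks where this sketch only updates counters.

```python
from collections import OrderedDict

class KVSessionCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.used = 0
        self.sessions = OrderedDict()  # session_id -> blocks, least recent first

    def touch(self, session_id, blocks):
        """Mark a session as just used, now holding `blocks` KV blocks.

        Returns the ids of sessions evicted to make room.
        """
        self.used -= self.sessions.pop(session_id, 0)
        self.sessions[session_id] = blocks      # re-insert as most recent
        self.used += blocks
        evicted = []
        # Never evict the session that was just touched (it is last in order).
        while self.used > self.capacity and len(self.sessions) > 1:
            victim, freed = self.sessions.popitem(last=False)
            self.used -= freed
            evicted.append(victim)
        return evicted

cache = KVSessionCache(capacity_blocks=10)
cache.touch("a", 4)
cache.touch("b", 4)
cache.touch("a", 4)            # "a" becomes the most recently used
print(cache.touch("c", 4))     # over capacity: the LRU session is evicted
```

A time-based or priority policy would change only how the victim is chosen inside the loop.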

Offloading Approaches
- KV Offload: Move old tokens’ KV to CPU pinned memory or disk. Reload if needed.
- Partial Summarization: Summarize older context, reduce token count (soft eviction).
- FlexGen (advanced): Offloads model weights as well, loading layers on-demand.

Trade-Off
- Reloading from CPU or disk can spike latency for revived sessions.
- Summarization saves GPU memory but might lose detailed context.

6. Scaling Inference Algorithms

Here are some acronyms and features to keep in mind:

GQA (Grouped Query Attention) / MQA (Multi-Query Attention)
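- Variants of multi-head attention with fewer key/value heads: MQA shares a single K/V head across all query heads, while GQA shares one K/V head per group of query heads.
- The smaller KV cache helps scale to longer contexts and more concurrent requests.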
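
The head-sharing scheme comes down to an index mapping. The sketch assumes contiguous grouping of query heads; the function names are illustrative.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Map a query head to the K/V head it shares (contiguous groups)."""
    assert num_query_heads % num_kv_heads == 0
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def kv_cache_reduction(num_query_heads, num_kv_heads):
    """How many times smaller the KV cache is than with full multi-head attention."""
    return num_query_heads // num_kv_heads

gqa = [kv_head_for(q, 8, 2) for q in range(8)]   # GQA: 8 query heads, 2 K/V heads
mqa = [kv_head_for(q, 8, 1) for q in range(8)]   # MQA: a single shared K/V head
print(gqa, mqa, kv_cache_reduction(8, 2))
```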

MLA (Multi-head Latent Attention)
- Attention variant introduced with DeepSeek-V2 that compresses keys and values into a low-rank latent vector.
- Key idea: cache the small latent instead of full per-head K/V, which can shrink the KV cache substantially.

FlashAttention
- Fused, tiled kernel that computes exact attention block by block in on-chip SRAM, without writing the full attention matrix to GPU main memory.
- Dramatically reduces memory reads/writes, lowering latency and memory use at long sequence lengths.
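
The numerical trick that lets attention be computed tile by tile is an online softmax: keep a running maximum and running sum, and rescale earlier partial results whenever the maximum changes. The pure-Python sketch below shows this for a single query. It illustrates the math only, not the GPU kernel, and all names are made up for the example.

```python
import math

def scores_for(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def attention_reference(q, keys, values):
    """Standard softmax attention that materializes all scores at once."""
    s = scores_for(q, keys)
    m = max(s)
    w = [math.exp(x - m) for x in s]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z
            for d in range(len(values[0]))]

def attention_tiled(q, keys, values, tile=2):
    """Same result, visiting K/V in tiles with an online softmax."""
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        s = scores_for(q, keys[start:start + tile])
        new_max = max(running_max, max(s))
        rescale = math.exp(running_max - new_max)   # fix up earlier partials
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for score, v in zip(s, values[start:start + tile]):
            w = math.exp(score - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1, 0], [0, 1], [2, 1], [0.5, -1], [3, 0.2]]
values = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 3]]
print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values))
```

Both functions return the same vector up to floating-point rounding, but the tiled version never holds more than one tile of scores at a time.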

Speculative Decoding
- A smaller “draft model” proposes several tokens ahead; the large model then verifies them in a single forward pass and keeps the ones it agrees with.
- Speedups of roughly 2–3x are commonly reported when the draft model’s predictions are usually accepted.
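
A minimal sketch of one draft-then-verify round, assuming greedy decoding (the next token is a deterministic function of the context). Sampling-based variants use a probabilistic accept/reject rule instead. The two toy "models" are hypothetical stand-ins.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding.

    draft_next / target_next: functions mapping a token list to the next token.
    Returns the tokens accepted this round (always at least one).
    """
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. The target model checks every position. A real system scores all
    #    positions in one batched forward pass rather than a Python loop.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # the target's correction ends the round
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))   # bonus token when every draft matched
    return accepted

# Toy models: the target counts upward mod 10; the draft errs after a 3.
target_next = lambda ctx: (ctx[-1] + 1) % 10
draft_next = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(speculative_step([0], draft_next, target_next, k=4))
```

The output is always what the target model alone would have produced; the draft model only changes how many target passes are needed.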

Quantization
- Store weights (and sometimes activations or the KV cache) in 8-bit or 4-bit formats (INT8/FP8, INT4) to reduce memory footprint and speed up matmul.
- Usually a small hit to model accuracy in exchange for large throughput gains.
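
To show where the accuracy loss comes from, here is a sketch of symmetric per-tensor INT8 quantization on a plain list of weights. Production schemes are typically per-channel or per-group and run on packed tensors; the function names and sample weights are illustrative.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(q, scale, max(errors))
```

Each weight is off by at most half a quantization step (scale / 2), and small values such as 0.003 can collapse to zero, which is the source of the accuracy hit.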

7. Ways to Optimize for Latency

Think of three angles: System-Level, Model-Level, Hardware-Level.

System-Level

Low-Latency Batching Windows
- Keep batch windows (waiting time) short so tokens appear quickly.

Local Serving
- Deploy replicas close to user region to cut network RTT.

High-Speed Interconnect
- Use NVLink or InfiniBand for multi-GPU setups to reduce communication overhead.

Pipeline Stage Minimization
- Too many pipeline stages across nodes → high hop latency.

Model-Level

Reduced Sequence Length
- Summarize or chunk context if possible to shorten sequence.

FlashAttention
- Minimizes attention overhead at large sequence lengths.

Quantization
- Fewer bits → less memory traffic per token and faster matmul → lower latency (with caution on accuracy).

Speculative Decoding
- Big latency win if the small draft model’s guesses are good.

Hardware-Level

GPU Generation
- Newer GPUs (e.g., A100, H100) offer higher memory bandwidth and faster tensor cores.

Sufficient GPU Memory
- Avoid constant offloading to CPU/disk, which kills latency.

Efficient CUDA Kernels
- Fused ops reduce overhead (FlashAttention, fused MLP, etc.).

8. Ways to Optimize for Throughput

Again, consider System, Model, and Hardware.

System-Level

Continuous Batching
- Merge multiple requests per decode step → higher total tokens/sec.

Autoscaling / Data Parallel
- Multiple replicas handle more requests in parallel.

Eviction Policies
- Free up memory from idle sessions to serve more active requests concurrently.

Load Balancer
- Distribute requests so no single node is overloaded while others idle.

Model-Level

Tensor Parallel
- Split big layers across GPUs → handle bigger batches concurrently (if interconnect is fast).

Pipeline Parallel
- Keep multiple micro-batches in flight like an assembly line.

Quantization
- Smaller data → bigger batch fits in GPU memory → more tokens per second.

MoE (Expert Parallel)
- Sparse activation: can handle large batch if routing is balanced.

Hardware-Level

Scaling Up GPU Count
- More GPUs (with enough bandwidth) → more total throughput.

High-Bandwidth Networking
- Critical if your model is sharded (tensor or pipeline).

Faster Disks / Storage
- If offloading to disk (FlexGen), faster NVMe or SSD read speeds matter.

One-Liner Reminder

Local vs. Global → Where you serve from. Latency vs. cost & complexity trade-offs.

Continuous Batching → Merge decode steps for throughput with minimal latency penalty.

Parallelism → Data (throughput), Tensor (big model), Pipeline (layers), MoE (experts).

KV & APC → Cache previous tokens and share repeated prefixes = speed.

Eviction & Offload → LRU/time-based to manage GPU memory. Summarize or store old KV.

Scaling Algos → FlashAttention, Quantization, Speculative Decoding = big speedups.

Latency → Optimize system, model, hardware. Batching windows, local replicas, high-end GPUs.

Throughput → Data parallel replicas, continuous batching, quantization, and HPC interconnect.

Keep these bullet points in mind, and you’ll have a strong mental map of how to handle high-performance, scalable LLM inference. Good luck!