LLM Fundamentals

High-Level Overview

Training Optimizations

Mixed Precision Training: Reduces memory usage and speeds computation.

Gradient Accumulation: Enables large batch sizes on limited GPU memory.

Data Parallelism: Splits data across GPUs for faster training.

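Model (Tensor) Parallelism: Splits the model's weights, such as the matrices inside individual layers, across GPUs.

Pipeline Parallelism: Assigns consecutive groups of layers to different GPUs and streams micro-batches through the stages.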

Gradient Checkpointing: Trades computation for memory savings.

Optimizer Enhancements (e.g., AdamW, LAMB): Accelerate convergence.

FlashAttention: Optimizes attention computation for speed and memory.

CocktailSGD: Combines gradient compression techniques to reduce network overhead in distributed training.

Sub-Quadratic Architectures (e.g., StripedHyena): Lower computational complexity in sequence length relative to standard attention.

LoRA Optimization: Efficient fine-tuning for large models.
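
To make the gradient accumulation entry above concrete, here is a minimal sketch. It assumes a hypothetical one-parameter linear model with mean-squared-error loss; the names `grad_mse` and `accumulated_grad` are illustrative only. It suggests that, with equal-sized micro-batches, averaging the micro-batch gradients reproduces the full-batch gradient, so only one micro-batch has to be held in memory at a time.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches, then average."""
    assert len(xs) % micro_batch_size == 0
    num_steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(num_steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Scaling each micro-batch gradient by 1/num_steps makes the running
        # sum equal to the mean gradient over the whole batch.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / num_steps
    return total


# Toy data (assumed for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
```

In a real training loop the optimizer step would run once after the last micro-batch rather than after each one.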

Inference Optimizations

Quantization: Reduces numeric precision (e.g., FP16, INT8) to cut memory use and speed up inference.

Layer Fusion: Combines operations into single kernels.

Dynamic Batching: Groups requests for GPU efficiency.

Speculative Decoding: Uses a small draft model to propose several tokens that the target model then verifies in a single pass.

Continuous Batching: Processes requests as they arrive.

Caching (e.g., KV Cache): Reuses computed values.

Knowledge Distillation: Trains smaller, faster models.

Custom CUDA Kernels: Optimize specific ops (e.g., softmax).

SplitRPC: Splits control/data paths for reduced latency.

FlashAttention-3: Enhances inference speed for long sequences.

1. Core Neural Network & Deep Learning Fundamentals

Forward Pass vs. Backward Pass (Backpropagation)
- Understanding how gradients are computed (chain rule, layer-by-layer propagation).
- Difference between training (forward + backward) and inference (forward-only).
- Relevance for inference: parameters are fixed and no gradients are computed, but understanding backprop explains how the weights being served were learned.
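
The forward/backward distinction above can be sketched on a tiny example. This assumes a hypothetical two-weight scalar network, `y = w2 * relu(w1 * x)`, with squared-error loss; the function names are illustrative. The backward pass applies the chain rule layer by layer, and a finite-difference check is included as a sanity test.

```python
def forward(w1, w2, x):
    """Forward pass: returns the prediction and the intermediates backprop needs."""
    z = w1 * x          # pre-activation
    h = max(0.0, z)     # ReLU
    y = w2 * h          # output
    return y, (z, h)


def loss(w1, w2, x, target):
    y, _ = forward(w1, w2, x)
    return (y - target) ** 2


def backward(w1, w2, x, target):
    """Backward pass: chain rule from the loss back to each weight."""
    y, (z, h) = forward(w1, w2, x)
    dloss_dy = 2.0 * (y - target)                    # loss = (y - target)^2
    dloss_dw2 = dloss_dy * h                         # y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dz = dloss_dh * (1.0 if z > 0 else 0.0)    # ReLU derivative
    dloss_dw1 = dloss_dz * x                         # z = w1 * x
    return dloss_dw1, dloss_dw2


w1, w2, x, target = 0.8, -1.5, 2.0, 1.0
g1, g2 = backward(w1, w2, x, target)

# Numerical check with central differences.
eps = 1e-6
num_g1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
num_g2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
```

Inference would call only `forward`; the intermediates `(z, h)` are kept here solely because the backward pass needs them.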

Common Activation & Loss Functions
- ReLU, Sigmoid, Softmax, etc.
- Role of loss functions in model training (though not directly used in inference).
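
As a minimal sketch of the functions named above, using only the Python standard library: the softmax subtracts the maximum logit before exponentiating, a common trick to avoid overflow. The function names are illustrative.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax: shift by the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability of the correct class (used in training only)."""
    return -math.log(softmax(logits)[target_index])
```

At inference time only the activations and the final softmax are evaluated; the loss is needed only while training.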

2. GPU Programming & CUDA Foundations

GPU Architecture (NVIDIA Focus)
- Organization of threads into warps, blocks, and grids.
- How warps operate in lock-step and the impact of warp divergence on performance.

CUDA Memory Hierarchy
- Global memory vs. shared memory vs. registers vs. caches.
- Techniques for memory coalescing and shared-memory tiling to reduce global memory bandwidth usage.
- Asynchronous data transfers and concurrent kernel execution.
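
Memory coalescing can be illustrated with a deliberately simplified model in Python (not CUDA). It assumes a 32-thread warp, 4-byte elements, and 128-byte memory segments; real GPUs work with finer-grained sectors and caches, so the numbers are indicative only. The function counts how many distinct segments a warp touches when thread `t` reads element `t * stride`.

```python
def transactions_per_warp(stride_elems, elem_bytes=4, warp_size=32, segment_bytes=128):
    """Count distinct memory segments touched when thread t reads element t * stride."""
    segments = set()
    for t in range(warp_size):
        address = t * stride_elems * elem_bytes
        segments.add(address // segment_bytes)
    return len(segments)


coalesced = transactions_per_warp(stride_elems=1)   # neighbours read neighbours
strided = transactions_per_warp(stride_elems=32)    # e.g. walking a column of a row-major matrix
```

Under these assumptions the contiguous pattern needs a single segment while the strided pattern needs one per thread, which is the gap that shared-memory tiling is meant to close.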

CUDA Kernel Optimization for Inference
- Kernel fusion to reduce memory round-trips.
- Minimizing memory transfers by carefully orchestrating host–device communication.
- Keeping GPUs saturated (maximizing occupancy, avoiding warp divergence, etc.).

3. Transformer & Large Language Model (LLM) Architecture

Transformer Basics
- Encoder vs. decoder vs. decoder-only structures.
- Self-attention mechanism and scaled dot-product attention.
- Multi-head attention rationale (parallel heads, capturing different representations).

Attention Mechanism Details
- Queries, keys, values: how they are computed and combined.
- O(n²) time and memory in sequence length n for naive self-attention, and how this impacts inference speed.
- Key-value caching in autoregressive decoding to avoid recomputing past tokens’ attention.
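
The points above can be sketched with a single attention head in plain Python. This is a toy under stated assumptions: the query, key, and value vectors are given directly (the learned projections are skipped), and `KVCache` is a hypothetical helper, not any framework's API. It compares recomputing causal attention for every position with incremental decoding that reuses cached keys and values.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def causal_attention_full(qs, ks, vs):
    """Recompute attention for every position: O(n^2) score computations."""
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]


class KVCache:
    """Stores past keys and values so each decode step only handles the new token."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


# Toy vectors (assumed for illustration).
qs = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
ks = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.7]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

full = causal_attention_full(qs, ks, vs)
cache = KVCache()
incremental = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
```

Both paths give the same outputs; the cached path trades memory (stored keys and values) for not recomputing them at every step.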

LLM Training Pipeline & RLHF
- High-level overview of pre-training, fine-tuning, and RLHF.
- Why LLMs are typically pretrained on massive corpora and then specialized or aligned.

4. Model Compression & Parameter-Efficient Techniques

Quantization
- Floating-point (FP16, BF16) vs. integer (INT8/INT4) representations.
- Post-training quantization (PTQ) vs. quantization-aware training (QAT).
- Trade-offs between accuracy, memory savings, and speed.
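
A minimal sketch of post-training quantization, assuming the simplest scheme: symmetric, per-tensor INT8 with a single scale factor. Production toolchains typically use per-channel or per-group scales and calibration data, which this omits. The function names are illustrative.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [qi * scale for qi in q]


# Toy weights (assumed for illustration).
weights = [0.02, -1.27, 0.635, 0.0, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The round-trip error is bounded by half a quantization step, which makes the accuracy versus memory trade-off explicit: fewer bits mean a larger step.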

Knowledge Distillation
- Concept: training a smaller “student” model to mimic a larger “teacher” model.
- How this can drastically reduce model size and inference cost.
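
One common way to make the student mimic the teacher is a loss on temperature-softened output distributions. The sketch below assumes that formulation: KL divergence from teacher to student, scaled by the squared temperature. The function names and the default temperature are illustrative.

```python
import math


def softmax_t(logits, temperature):
    """Softmax over logits divided by a temperature (higher T gives a softer distribution)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [2.0, 1.0, 0.1]
matching = distillation_loss(teacher, teacher)          # student agrees with teacher
mismatch = distillation_loss([0.1, 1.0, 2.0], teacher)  # student disagrees
```

In practice this term is usually combined with the ordinary cross-entropy loss on ground-truth labels.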

Pruning (Structured & Unstructured)
- Removing redundant weights or neurons for sparser models.
- Relevance for LLMs (e.g., SparseGPT).
- Hardware considerations: sparse kernels are not always fully efficient on GPUs.
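
As a minimal sketch of unstructured pruning, the following uses plain magnitude pruning: zero the fraction of weights with the smallest absolute values. This is a simpler baseline than SparseGPT, which is not reproduced here; the function name is illustrative.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    num_to_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:num_to_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


# Toy weights (assumed for illustration).
weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2]
pruned = magnitude_prune(weights, sparsity=0.5)
```

The zeros land at arbitrary positions, which is why unstructured sparsity tends to need specialized kernels before it yields a speedup on GPUs.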

LoRA (Low-Rank Adaptation)
- Freezing the original model and training low-rank updates for fine-tuning.
- Why LoRA adds negligible overhead in inference (low-rank matrices can be merged or applied with minimal extra cost).
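
The merging argument above can be checked on a toy example. The sketch assumes a frozen 3×3 weight `W`, a rank-1 update `B @ A`, and a scaling factor `scale` (standing in for alpha / r); all values and helper names are illustrative. Applying the adapter separately and merging it into `W` give the same output.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.5, 0.5, 0.0]]   # frozen base weight
B = [[0.1], [-0.2], [0.3]]                                  # d_out x r, trained
A = [[1.0, 2.0, -1.0]]                                      # r x d_in, trained
scale = 2.0
x = [1.0, -1.0, 0.5]

# Unmerged: base output plus the low-rank path, computed as B @ (A @ x).
low_rank = matvec(B, matvec(A, x))
y_unmerged = [b + scale * d for b, d in zip(matvec(W, x), low_rank)]

# Merged: fold scale * B @ A into W once, then run a single matvec at inference.
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

trainable_params = len(B) * len(B[0]) + len(A) * len(A[0])   # r * (d_out + d_in)
full_params = len(W) * len(W[0])                             # d_out * d_in
```

The parameter saving is modest at this toy size and grows with the layer dimensions, since the adapter scales with r * (d_out + d_in) rather than d_out * d_in.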

5. LLM Inference Frameworks & Optimization Techniques

High-Throughput Inference Strategies
- Challenges: large memory footprint, sequential token generation, concurrent requests.
- Batching approaches: static vs. continuous/dynamic batching.
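
The difference between static and continuous batching can be sketched with a small simulation. It assumes all requests are queued at time zero, each decode step produces one token per active request, and `lengths` gives each request's output length; these are simplifications, and the function names are illustrative.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; short requests leave slots idle."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    running = []    # remaining tokens for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests drop out.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lengths = [2, 8, 3, 1, 4, 2]    # tokens to generate per request (assumed)
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With these inputs the static schedule takes 15 decode steps and the continuous one takes 10, because no slot waits for the longest request in its batch.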

vLLM (PagedAttention & Continuous Batching)
- PagedAttention for more efficient key/value caching.
- Continuous batching to merge incoming requests on the fly for better GPU utilization.
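
The idea behind paged key/value caching can be sketched as a block table that maps each sequence's logical token positions to fixed-size physical blocks. The class below is a hypothetical toy allocator written for illustration; it is not vLLM's API and tracks only block bookkeeping, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy allocator: sequences grow one fixed-size block at a time."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:    # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=4)
for _ in range(5):
    alloc.append_token("a")      # 5 tokens need 2 blocks
for _ in range(2):
    alloc.append_token("b")      # 2 tokens need 1 block
table_a = list(alloc.block_tables["a"])
table_b = list(alloc.block_tables["b"])
alloc.free("a")
alloc.append_token("c")
table_c = list(alloc.block_tables["c"])
```

Because memory is reserved a block at a time rather than for the maximum sequence length up front, at most one partially filled block per sequence is wasted.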

NVIDIA TensorRT & TensorRT-LLM
- Overview of TensorRT graph optimizations, kernel fusion, quantization support.
- TensorRT-LLM specifics: support for 8-bit & 4-bit quantization, multi-GPU scaling, fast attention kernels (FlashAttention-style).
- Building optimized inference “engines” from trained LLMs.

Other Inference Frameworks (awareness)
- Hugging Face Text Generation Inference (TGI), DeepSpeed-Inference, FasterTransformer, Triton Inference Server.
- High-level idea that all aim for minimized latency and maximized throughput via specialized optimizations.

6. Decoding & Speculative Decoding

Standard Decoding Methods
- Greedy decoding, beam search, top-k sampling, and nucleus (top-p) sampling: trade-offs among speed, diversity, and quality.
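
A minimal sketch of three of these methods, operating on an already-computed next-token probability list. Beam search is omitted for brevity, and the function names are illustrative.

```python
import random


def greedy(probs):
    """Pick the single most likely token: fast and deterministic."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def sample(probs, rng):
    """Draw one token index from a (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]    # toy next-token distribution (assumed)
token = sample(top_k_filter(probs, k=2), random.Random(0))
```

Top-k keeps a fixed number of candidates, while top-p adapts the candidate set to how peaked the distribution is.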

Speculative Decoding
- Core concept: using a smaller “draft” model to predict multiple tokens at once, then verifying with the larger “target” model.
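- Reported decoding speedups of roughly 2–3×, which depend on how often the target model accepts the draft's tokens; output quality is unchanged because the accept/reject verification preserves the target model's distribution exactly.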
- Implementation details in NVIDIA TensorRT-LLM and Google’s research (2022).
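
The accept/reject rule behind speculative decoding can be sketched for the single-token case. Real systems draft several tokens and verify them in one target-model pass; this toy assumes explicit probability lists `p_target` and `q_draft` in place of real models, and the function names are illustrative. `output_distribution` computes analytically what the procedure emits, which for this example matches the target distribution.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """Draft proposes one token; target accepts it or resamples from the residual."""
    n = len(p_target)
    x = rng.choices(range(n), weights=q_draft, k=1)[0]        # draft proposal
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):     # accept test
        return x
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(range(n), weights=residual, k=1)[0]


def output_distribution(p_target, q_draft):
    """Exact distribution of speculative_step's output, computed analytically."""
    accept = [min(p, q) for p, q in zip(p_target, q_draft)]   # q * min(1, p / q)
    reject_prob = 1.0 - sum(accept)
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    norm = sum(residual)
    if norm == 0.0:    # p equals q, so the draft is never rejected
        return accept
    return [a + reject_prob * r / norm for a, r in zip(accept, residual)]


p_target = [0.6, 0.3, 0.1]    # toy target distribution (assumed)
q_draft = [0.3, 0.5, 0.2]     # toy draft distribution (assumed)
exact = output_distribution(p_target, q_draft)

rng = random.Random(0)
samples = [speculative_step(p_target, q_draft, rng) for _ in range(20000)]
freq_token0 = samples.count(0) / len(samples)
```

A poorly matched draft lowers the acceptance rate, and with it the speedup, without changing what is ultimately sampled.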

7. Systems Integration & End-to-End Inference Knowledge

Putting It All Together
- Memory management, scheduling (continuous batching), model optimization (quantization, LoRA, etc.), and decoding strategies (speculative decoding) to achieve high-performing LLM inference.
- Trade-offs: speed vs. accuracy, throughput vs. latency, memory footprint vs. user concurrency.

Interview Q&A Readiness
- Be prepared to explain the “why” behind each optimization (e.g., “Why quantize?”, “Why use LoRA instead of full fine-tuning?”, “Why does GPU shared memory speed things up?”, “How does speculative decoding preserve exact distribution?”).
- Cite relevant research or frameworks to demonstrate cutting-edge familiarity.

Final Note

Mastering these topics—from the low-level (CUDA, memory hierarchy) to the high-level (Transformer architecture, inference frameworks, decoding strategies)—will position you to speak confidently about modern LLM inference pipelines. These skills collectively show that you can optimize and serve large models effectively in a production environment.

Video references:

https://youtu.be/9tvJ_GYJA-o?si=dE2gwQm2bCfmGn7R

https://youtu.be/wjZofJX0v4M?si=ZS7RHTt0Q8pihsJc

https://youtu.be/KuXjwB4LzSA?si=9KPrv2GFHJ1d3UYo

https://youtu.be/eMlx5fFNoYc?si=UrCt9Xri3YuIfHUW

https://youtu.be/9-Jl0dxWQs8?si=nbg8_RbTRfIBY1AE

https://youtu.be/q8SA3rM6ckI?si=XVA7dlNItqc5Q8MH

https://youtu.be/7xTGNNLPyMI?si=oBpLaDegRkAOjZIu

https://youtu.be/UcwDgsMgTu4?si=P1uBXLf7SLsTrOYT

https://youtu.be/cXpTDKjjKZE?si=-JetWKIvWH1A9iQS