Practice 2
50 Essential PyTorch Coding Interview Questions (LLM Inference & Optimization)
Easy Questions (Fundamentals & Basics)
- (Easy): Set up a PyTorch model for inference properly: put the model in evaluation mode and disable gradient calculations. Write a code snippet that calls model.eval() and wraps the forward pass in torch.no_grad() to ensure no gradients are tracked.
- (Easy): Perform device management for inference: given a PyTorch model and input tensor, write code to move them to a CUDA GPU for faster inference, then transfer the result back to CPU (e.g., using model.to('cuda') and tensor.to('cuda'), then .cpu() on the output).
- (Easy): Implement the softmax function from scratch for a given logits tensor and use it to get a prediction. For example, compute probabilities with exponentiation and normalization (without using torch.softmax), then use torch.argmax to find the index of the highest probability.
- (Easy): Define a simple neural network module in PyTorch and run a forward pass. For instance, implement an nn.Module with one nn.Linear layer followed by a ReLU activation. Show how to instantiate this model and feed a sample input through it.
- (Easy): Use an embedding layer to map token IDs to vectors. For example, given a batch of token indices, create an nn.Embedding of appropriate size and show how to retrieve the embedding tensor for the batch (by calling the embedding layer on the input indices).
- (Easy): Combine embedding vectors with positional encodings. Suppose you have a tensor of word embeddings and a tensor of positional encodings of the same shape; write code to add them together elementwise to form the final input for a transformer model.
- (Easy): Pad sequences for batching: write a function that takes a list of sequences (lists of token IDs of varying lengths) and pads them with a PAD token (e.g., 0) to the same length. Also return an attention mask indicating which positions are real tokens (1) and which are padding (0).
- (Easy): Implement a basic greedy decoding loop for text generation. Starting from an initial prompt (sequence of input IDs), iteratively feed it into the model to get next-token logits, pick the token with the highest probability (argmax), append it to the sequence, and repeat until an end-of-sequence token is produced.
- (Easy): Calculate model size: write code to compute the total number of parameters in a given PyTorch model and estimate its memory footprint. (Hint: sum up param.numel() * param.element_size() for each parameter to get total bytes, and convert to MB or GB.)
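As a starting point for a few of the easy questions above, here is a minimal sketch covering padding with an attention mask, a from-scratch softmax, and a parameter-size estimate. The function names (pad_sequences, manual_softmax, model_size_bytes) and the tiny nn.Sequential model are illustrative assumptions, not part of any library, and other designs are equally valid.

```python
import torch
import torch.nn as nn


def pad_sequences(seqs, pad_id=0):
    """Pad variable-length token-ID lists to a common length.

    Returns (input_ids, attention_mask), both LongTensors of shape
    (batch, max_len); the mask is 1 for real tokens and 0 for padding.
    """
    max_len = max(len(s) for s in seqs)
    input_ids = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(seqs), max_len), dtype=torch.long)
    for row, seq in enumerate(seqs):
        input_ids[row, : len(seq)] = torch.tensor(seq, dtype=torch.long)
        attention_mask[row, : len(seq)] = 1
    return input_ids, attention_mask


def manual_softmax(logits, dim=-1):
    """Softmax via exponentiation and normalization (no torch.softmax)."""
    # Subtracting the max does not change the result but avoids overflow.
    shifted = logits - logits.max(dim=dim, keepdim=True).values
    exp = shifted.exp()
    return exp / exp.sum(dim=dim, keepdim=True)


def model_size_bytes(model):
    """Total bytes occupied by the model's parameters."""
    return sum(p.numel() * p.element_size() for p in model.parameters())


# Hypothetical toy model, used only to exercise the helpers above.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
model.eval()                      # evaluation mode (affects dropout, batchnorm)
with torch.no_grad():             # no autograd graph is built
    logits = model(torch.randn(2, 4))
probs = manual_softmax(logits)    # each row sums to 1
pred = torch.argmax(probs, dim=-1)

ids, mask = pad_sequences([[5, 6, 7], [8]])
# ids  -> [[5, 6, 7], [8, 0, 0]]
# mask -> [[1, 1, 1], [1, 0, 0]]
print(f"{model_size_bytes(model) / 1e6:.6f} MB of parameters")
```

The padding helper uses a plain Python loop for readability; torch.nn.utils.rnn.pad_sequence is an existing alternative when the inputs are already tensors.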
Intermediate Questions (Moderate Difficulty)
- (Medium): Implement scaled dot-product attention. Given query, key, and value tensors (Q, K, V) of shape (batch, seq_len, dim), compute the attention output $\mathrm{softmax}(QK^T / \sqrt{d})\,V$. Include support for an attention mask (e.g., ignore certain positions by adding -inf to the logits of masked positions before the softmax).
- (Medium): Implement the Transformer's feed-forward network block. Given an input tensor of shape (batch, seq_len, dim), pass it through a two-layer MLP: first Linear(dim → hidden_dim), apply an activation (e.g., GELU), then Linear(hidden_dim → dim). Show this in PyTorch code (you can assume some hidden_dim value).
- (Medium): Implement top-k sampling for one step of language model decoding. Given a tensor of logits for the next token, filter it to the top k highest values (use torch.topk), then sample a token from those top-k probabilities (e.g., with torch.multinomial). The code should output an index for the sampled token.
- (Medium): Implement nucleus (top-p) sampling for one decoding step. Given logits and a probability threshold p, sort the token probabilities, compute their cumulative sum, and select the smallest set of tokens whose cumulative probability ≥ p. Then sample the next token from that set. Provide code to perform this selection and sampling.
- (Medium): Add caching to an autoregressive transformer decoding loop. Modify a naive generation function so that it passes a “past key-values” cache to the model. Show how you would store the K and V from each timestep (e.g., in lists or a preallocated tensor) and reuse them in subsequent model calls to avoid recomputing attention on previous tokens.
- (Medium): Batch by sequence length for efficiency. Given a list of input sequences of different lengths, write code to sort them by length, batch those of similar lengths together, pad within each batch, and then run the model on each batch. (This minimizes padding and idle compute, improving throughput on variable-length inputs.)
- (Medium): Implement micro-batching for inference. If a batch of N inputs is too large to process at once on the GPU, show how to split it into smaller sub-batches, run the model on each sub-batch sequentially (accumulating outputs), and then concatenate the results. Ensure the final outputs preserve the original input order.
- (Medium): Optimize inference with TorchScript. Take a simple PyTorch model (or function) and demonstrate how to convert it to TorchScript using torch.jit.trace or torch.jit.script. Show the code for scripting/tracing the model and then using the compiled scripted_model to perform a forward pass.
- (Medium): Use PyTorch 2.x compile (TorchDynamo + TorchInductor) to speed up inference. Write an example of wrapping a model with torch.compile and running inference on some input. (For instance: optimized_model = torch.compile(model), then use optimized_model(x) to execute the compiled graph for faster execution.)
- (Medium): Profile a model's inference to find bottlenecks. Use torch.profiler (or torch.autograd.profiler) as a context manager to record the time taken by each operation during a forward pass. Provide code that runs the model inside a with torch.profiler.profile(...) as prof: block and then prints a report of the most time-consuming ops or layers (e.g., via prof.key_averages().table(...)).
- (Medium): Use mixed precision during inference. Show how to wrap model inference code in a torch.autocast(device_type='cuda', dtype=torch.float16) context (the older torch.cuda.amp.autocast() spelling is deprecated in recent PyTorch releases) to run it in float16 where possible. For example, demonstrate loading a model and input, then calling with torch.autocast(device_type='cuda', dtype=torch.float16): output = model(input) to leverage tensor cores, and mention the speed/memory benefits of FP16.
- (Medium): Apply dynamic quantization for faster CPU inference. Provide code to quantize a trained model (for example, using torch.quantization.quantize_dynamic on a model with linear layers or LSTMs) so that it uses int8/FP16 weights. Then show how to run inference with the quantized model and mention the potential speedup on CPU (with minimal accuracy drop).
- (Medium): Prune insignificant weights in a model. For instance, take a fully connected layer's weight matrix and zero out all entries with magnitude below a threshold. Show code that identifies these small weights (e.g., mask = (weight.abs() < threshold)) and sets them to zero in-place. Comment on how this sparsity might affect model size or speed (noting that unstructured sparsity may need specialized kernels to see speedup).
- (Medium): Use the HuggingFace Transformers library to run a model inference manually. For example, load a pretrained GPT-2 model and tokenizer, encode a prompt into input IDs, feed the input IDs to the model (e.g., outputs = model(input_ids)), get the logits, and then decode the model's output IDs back to text. (This tests using HF models without the high-level .generate() convenience function.)
- (Medium): Use pinned memory to accelerate data transfer. Demonstrate creating an input tensor with pin_memory=True (or using a DataLoader with pin_memory=True), then transferring it to the GPU with tensor.to('cuda', non_blocking=True). In code comments, explain how pinned (page-locked) host memory can improve throughput for CPU-to-GPU data copy operations by allowing DMA transfers.
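For the attention and sampling questions in this section, the following is one possible sketch rather than a reference solution. The function names and the mask convention (a boolean mask where True means "may attend") are assumptions chosen for illustration; the sampling helpers operate on a single 1-D logits vector.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, dim).

    mask: optional bool tensor broadcastable to (batch, seq_len, seq_len);
    True marks positions that may be attended to (an assumed convention).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    if mask is not None:
        # Masked positions get -inf so softmax assigns them zero weight.
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v


def sample_top_k(logits, k):
    """Sample one token id from the k highest-scoring entries of a 1-D logits tensor."""
    top_vals, top_idx = torch.topk(logits, k)
    probs = torch.softmax(top_vals, dim=-1)       # renormalize over the top k
    choice = torch.multinomial(probs, num_samples=1)
    return top_idx[choice].squeeze(0)             # 0-dim LongTensor


def sample_top_p(logits, p):
    """Nucleus sampling: smallest set of tokens whose cumulative probability >= p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the mass accumulated *before* it is still below p;
    # this always keeps the most likely token and stops once p is reached.
    keep = (cumulative - sorted_probs) < p
    kept = sorted_probs * keep
    kept = kept / kept.sum()
    choice = torch.multinomial(kept, num_samples=1)
    return sorted_idx[choice].squeeze(0)


# Usage with a causal (lower-triangular) mask on random data.
x = torch.randn(2, 4, 8)
causal = torch.tril(torch.ones(4, 4, dtype=torch.bool))
out = scaled_dot_product_attention(x, x, x, mask=causal)   # (2, 4, 8)

logits = torch.tensor([0.0, 5.0, 1.0])
token_k = sample_top_k(logits, k=2)     # index 1 or 2
token_p = sample_top_p(logits, p=0.9)
```

Temperature scaling fits the same pattern: divide the logits by a temperature before the softmax in either sampling helper.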
Advanced Questions (High Difficulty)
- (Hard): Implement multi-head self-attention from scratch. Given an input tensor of shape (batch, seq_len, model_dim) and weight matrices for W_q, W_k, W_v (to project inputs to each head) and W_o (to project concatenated heads to output), write code to compute multi-head attention. Split the input into multiple heads, compute scaled dot-product attention for each head (without using nn.MultiheadAttention), then concatenate the head outputs and apply the output projection. (Ensure tensor shapes line up for the matrix multiplies.)
- (Hard): Implement a full Transformer decoder block in PyTorch. The block should include self-attention followed by a feed-forward network, with residual connections and layer normalization around each sub-layer. Write the forward method assuming you have functions or modules for the attention and feed-forward parts. Show how you take the input x, compute attn_out = SelfAttention(x), then x = x + attn_out followed by LayerNorm, then pass that through the feed-forward network, add the residual and normalize again.
- (Hard): Implement beam search decoding for a language model. Write a function that given a model and an input prompt, performs beam search with a specified beam width B. It should keep track of multiple hypotheses (sequences and their cumulative log-probabilities), expand each hypothesis with new tokens at each step (using the model's output probabilities), and prune down to the top B candidates. Continue until a stopping condition (e.g., all beams have produced an end-of-sequence token or a max length is reached), then return the highest scoring completed sequence. Include code for managing the beam candidate lists at each step.
- (Hard): Implement speculative decoding to accelerate generation. Suppose you have a large language model and a smaller "draft" model. Write a procedure where the draft model proposes the next k tokens, and the larger model then verifies them, ideally scoring all k positions in a single forward pass and checking the tokens in order. If the large model's output matches the draft for a token, you accept it and move on to the next token; as soon as it diverges, you discard the draft's remaining suggestions, take the large model's token at that position, and resume from there (drafting a new continuation). Provide a code outline showing how you'd interleave calls to the draft model and the main model, managing two sets of tokens (the proposed tokens and the confirmed tokens).
- (Hard): Outline a continuous batching strategy for an LLM inference server (similar to what vLLM does). Write pseudo-code for a loop that continuously collects incoming requests and groups them into batches on the fly. For example, maintain a queue of incoming requests; at each iteration, take as many as available (up to some max batch size) to form a batch and run the model. If new requests arrive while a batch is running, they wait and then get batched in the next iteration. Ensure your outline handles a timeout or maximum delay so that no request waits indefinitely. (This tests understanding of dynamic batching in a live setting.)
- (Hard): Use CUDA streams to overlap computation and data transfer. Provide an example where you create at least two torch.cuda.Stream objects: one stream that preloads or preprocesses data on the GPU while another stream runs the model inference on already-loaded data. Show how to use with torch.cuda.stream(stream): ... to assign operations to a stream, and ensure that you launch GPU-to-GPU copies or CPU-to-GPU transfers with non_blocking=True. The goal is to overlap the data copy of batch n+1 with the compute of batch n.
- (Hard): Write a custom PyTorch autograd Function for a new operation. For example, implement a custom ReLU. Define a subclass of torch.autograd.Function with a forward(ctx, input) that saves the input with ctx.save_for_backward(input) and returns input.clamp(min=0), and a backward(ctx, grad_output) that retrieves the saved input from ctx.saved_tensors and returns grad_output * (input > 0).float(). Provide the code for this class and then show how to use it in a model (e.g., CustomReLU.apply(tensor)) to verify it computes the same result as the built-in ReLU. (This tests low-level autograd understanding.)
- (Hard): Write a GPU kernel using OpenAI Triton to perform an elementwise operation (for instance, add 1 to each element of an input tensor), and show how to launch it from PyTorch. Include the Triton kernel definition with @triton.jit, using tl.load to read from memory and tl.store to write results. Also demonstrate how to configure the launch grid/block size (e.g., using a grid lambda or specifying BLOCK_SIZE) and then call the kernel like kernel[grid](..., BLOCK_SIZE=...) on a sample tensor.
- (Hard): Convert a PyTorch model to a TensorRT engine for deployment. Outline the steps in code: for example, export the model to ONNX format using torch.onnx.export, then use NVIDIA’s TensorRT Python API (or Torch-TensorRT) to load that ONNX model and build an optimized TensorRT engine. Finally, show how you would run inference using that TensorRT engine (e.g., by binding input/output and executing the context). Pseudo-code for the TensorRT part is fine; focus on the sequence of steps and any important parameters (like enabling FP16 or setting max workspace size).
- (Hard): Parallelize preprocessing using CPU threads or processes to feed the model faster. For instance, if tokenization on CPU is a bottleneck, demonstrate how you could use Python’s concurrent.futures.ThreadPoolExecutor (for I/O-bound tasks) or multiprocessing.Pool (for CPU-bound tasks) to tokenize multiple inputs in parallel before batching them. Show a code snippet that takes a list of text strings, splits the work across threads or processes to produce token ID tensors for each, and then stacks them into a batch for the model.
- (Hard): Integrate LoRA (Low-Rank Adaptation) into a model’s layer. Take a pre-trained weight matrix $W$ (for example, the weight of a transformer’s dense layer) and incorporate LoRA matrices into it. Show how you’d create two small trainable matrices $A$ and $B$ (with shapes like [out_dim, r] and [r, in_dim] for some small rank $r$) and modify the layer’s forward pass to use $W + \alpha \cdot A B$ (with $W$ kept frozen and only $A, B$ learned). Provide code snippets for modifying the model’s __init__ to add the new parameters and the forward to add the LoRA contribution to $W x$.
- (Hard): Demonstrate a case where torch.jit.trace is insufficient and torch.jit.script is needed. For example, write a small PyTorch function that has an if/else branch or a loop that depends on the input data (not just on tensor sizes). Show that tracing this function with a sample input will capture only one branch (thus not generalizing to the other case). Then show how using torch.jit.script instead can handle the dynamic control flow. Provide the code for both the traced and scripted versions, highlighting why the traced one is incorrect.
- (Hard): Optimize a generation loop by avoiding CPU synchronization. For example, explain that calling .item() on a CUDA tensor forces a GPU-to-CPU sync. Show code where instead of doing next_token_id = logits.argmax().item() each iteration (which brings the value to CPU), you keep the computation on the GPU by using tensor operations. For instance, you can obtain the index of the max logit as a 0-dim tensor and use it directly to index into the embedding for the next step. By not calling .item() in the loop, you allow asynchronous GPU execution to proceed without stalls, which can substantially improve throughput in autoregressive generation.
- (Hard): Perform model-parallel inference across multiple GPUs. For a very large model that cannot fit into one GPU’s memory, illustrate how to split the model’s layers between two GPUs. Provide a code sketch: for example, move model.encoder to cuda:0 and model.decoder to cuda:1. In the forward pass, send the input to the encoder on GPU0, get its output, then transfer that output tensor to GPU1 to feed into the decoder. Show how you would coordinate the device placements and .to() operations so that each part of the model runs on the intended device.
- (Hard): Use HuggingFace Optimum to accelerate inference. For example, show how to export a Transformers model to ONNX and run it with ONNX Runtime for faster CPU/GPU inference using Optimum. You might demonstrate code that uses Optimum to export a model to ONNX (ORTModel.from_pretrained(...)) or wrap a model with BetterTransformer. Include the steps to load the optimized model and perform a sample inference. (This tests familiarity with external optimization tools for PyTorch models.)
- (Hard): Perform post-optimization validation of a model. After applying an optimization (quantization, compilation, etc.), you need to ensure the model’s outputs are still correct. Write code to run the same input through both the original and the optimized model and compare the outputs. For example, compute the mean absolute difference between the two outputs, or if it's a classification/generation model, check that they produce the same top-1 prediction or sequence. This verification code helps confirm that the optimization didn’t degrade the model’s accuracy beyond an acceptable range.
- (Hard): Use HuggingFace Text Generation Inference (TGI) server for serving an LLM. Assume a model is already loaded on a TGI server endpoint; write a small Python client snippet that sends a generation request to the server. For example, use the requests library to POST a JSON payload with a prompt and decoding parameters to the TGI HTTP endpoint, then parse the JSON response to get the generated text. (This tests understanding of integrating with an inference server via its API.)
- (Hard): Configure NVIDIA Triton Inference Server for dynamic batching. Describe (or provide) what you would put in the model’s config.pbtxt to enable this. For example, specify a max_batch_size that the model can handle (e.g., 8 or 16) and add a dynamic_batching block with parameters like max_queue_delay_microseconds (to set the max wait time for forming a batch). Provide a short snippet of a config showing these settings, and explain that this will allow Triton to automatically batch incoming requests up to the max batch size or timeout.
- (Hard): Implement sinusoidal positional encoding as used in the original Transformer. Write a function that takes a sequence length L and model dimension d_model and returns a tensor of shape (L, d_model) where each position $i$ (0-indexed) has a sinusoidal encoding. Use the formula: for each dimension index $j$ with $0 \le j < d_{\text{model}}/2$:
$PE(i, 2j) = \sin\left(i / 10000^{2j/d_{\text{model}}}\right)$, $PE(i, 2j+1) = \cos\left(i / 10000^{2j/d_{\text{model}}}\right)$
Implement this calculation in PyTorch (avoiding explicit Python loops if possible).
- (Hard): Refactor a computational Python loop into a vectorized PyTorch operation for speed. For example, suppose you have code that iterates over each element in a tensor to apply a function (which is very slow in Python). Show an example of such a loop and then show how to replace it with equivalent PyTorch tensor operations (which leverage parallelism). Explain in comments how removing Python-loop overhead and using vectorized operations speeds up inference, especially for large tensor computations.
- (Hard): Diagnose and fix a memory leak in an inference loop. For instance, consider a scenario where you append each model output tensor to a Python list for logging or further processing. Write code that simulates this (e.g., a loop doing outputs.append(model(x)) each time) and explain why this causes GPU memory to balloon (hint: the computation graph is being retained for each output tensor). Then show how to fix it by disabling grad tracking (wrapping inference in with torch.no_grad():) or detaching/cloning tensors before storing them (so that no reference to the computation graph remains), and by deleting or reusing tensors appropriately.
- (Hard): Work around a part of the model that can’t be compiled by TorchDynamo. If torch.compile is failing or falling back for a certain section of your model (for example, a part with unsupported operations or dynamic Python logic), demonstrate how you can use torch._dynamo.disable (also exposed as torch.compiler.disable) to exclude that code from compilation. Its documented usage is as a decorator or function wrapper rather than a with block, so move the problematic section into its own function and decorate it with @torch._dynamo.disable. Provide a code example showing that the rest of the model runs under TorchDynamo optimization while the decorated function runs eagerly (ensuring the whole model can execute without errors).
- (Hard): Use multiple GPUs to increase throughput via data parallelism. Show how you could replicate a model on two GPUs and split a batch between them for inference. For example, you might use torch.nn.DataParallel to automatically split input across GPUs, or do it manually: send half of the input batch to a model on cuda:0 and the other half to a clone of the model on cuda:1, then concatenate the outputs. Provide code that demonstrates this parallel inference across two GPUs and mention how it can almost halve the per-batch latency (ignoring some overhead).
- (Medium): Implement random sampling for text generation with a temperature parameter. At each decoding step, apply a softmax to the model’s logits (you can divide the logits by a temperature τ > 0 to control randomness), then use torch.multinomial to sample one token from the resulting probability distribution. Write a code snippet for one step of this process, given logits for the current step, and show how changing the temperature affects the sampling outcome (more random vs. more greedy).
- (Medium): Apply weight tying in a language model to reduce parameters. For example, if a model uses an nn.Embedding for input tokens and an nn.Linear as the output projection for predicting next-token logits, set the linear layer’s weight equal to the embedding matrix so they share weights. Show in code how you would do this (e.g., model.output_layer.weight = model.input_embed.weight) and explain that this way the input and output embeddings remain the same, saving memory and typically improving model consistency.
- (Medium): Tune PyTorch’s thread settings for CPU inference. For a model running on CPU (multi-core), show how to configure the number of threads used by PyTorch. For example, use torch.set_num_threads(n) and torch.set_num_interop_threads(m) to limit or set the parallelism. Provide code setting these values (e.g., to 4 threads), and explain in comments when you might adjust these, such as to prevent thread oversubscription in a multi-model deployment or to optimize throughput on CPU-bound inference workloads.
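Three of the advanced questions above (sinusoidal positional encoding, a custom autograd ReLU, and LoRA) can be sketched compactly. This is a minimal illustration under stated assumptions: d_model is even, the positional encoding follows the sine/cosine formulation of the original Transformer paper, and LoRALinear, r, and alpha are hypothetical names and defaults chosen for the example. Initializing A to zero is one possible choice that makes the wrapped layer start out identical to the frozen base layer.

```python
import math
import torch
import torch.nn as nn


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) tensor; assumes d_model is even."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    # 10000^(-2j/d_model) for j = 0 .. d_model/2 - 1, computed in log space.
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe


class CustomReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)       # needed to build the mask in backward
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        # Gradient flows only where the input was strictly positive.
        return grad_output * (input > 0).to(grad_output.dtype)


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update alpha * A @ B."""

    def __init__(self, base, r=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)        # W (and bias) stay frozen
        out_dim, in_dim = base.weight.shape
        self.A = nn.Parameter(torch.zeros(out_dim, r))          # zero init: no change at start
        self.B = nn.Parameter(torch.randn(r, in_dim) * 0.01)
        self.alpha = alpha

    def forward(self, x):
        # (W + alpha * A B) x  ==  W x + alpha * A (B x), without forming A B.
        return self.base(x) + self.alpha * (x @ self.B.t() @ self.A.t())


pe = sinusoidal_positional_encoding(10, 16)            # shape (10, 16)
x = torch.randn(6, requires_grad=True)
assert torch.equal(CustomReLU.apply(x), torch.relu(x))
layer = LoRALinear(nn.Linear(8, 4), r=2, alpha=0.5)
y = layer(torch.randn(3, 8))                           # shape (3, 4)
```

Computing the low-rank path as two thin matrix multiplies keeps the extra cost proportional to r; for deployment, the update can instead be merged into the base weight once so inference has no added latency.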
Each question above is designed to be hands-on and coding-focused, reflecting scenarios a Machine Learning Engineer (Inference) might encounter when optimizing LLM inference in production. Difficulty is labeled from easy fundamentals to challenging performance engineering tasks. The questions emphasize practical reasoning, awareness of inference speed and memory bottlenecks, and the ability to write clean, optimized PyTorch code under real-world constraints.