Practice 2

50 Essential PyTorch Coding Interview Questions (LLM Inference & Optimization)

Easy Questions (Fundamentals & Basics)

(Easy): Use PyTorch's inference mode properly: put a model in evaluation mode and disable gradient calculations. Write a code snippet that wraps a forward pass in model.eval() and torch.no_grad() to ensure no gradients are tracked.

(Easy): Perform device management for inference: given a PyTorch model and input tensor, write code to move them to a CUDA GPU for faster inference, then transfer the result back to the CPU (e.g., using model.to('cuda') and tensor.to('cuda'), then .cpu() on the output).

(Easy): Implement the softmax function from scratch for a given logits tensor and use it to get a prediction. For example, compute probabilities with exponentiation and normalization (without using torch.softmax), then use torch.argmax to find the index of the highest probability.
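
One possible answer sketch for the softmax question, assuming 2-D logits of shape (batch, classes). The function name softmax_from_scratch and the sample logits are made up for illustration; the max-subtraction step is a common stability trick, not something the question requires.

```python
import torch

def softmax_from_scratch(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtract the max along `dim` before exponentiating to avoid overflow;
    # softmax is shift-invariant, so the result is unchanged.
    shifted = logits - logits.max(dim=dim, keepdim=True).values
    exp = shifted.exp()
    return exp / exp.sum(dim=dim, keepdim=True)

# Illustrative logits for a batch of two examples and three classes.
logits = torch.tensor([[2.0, 1.0, 0.1], [0.5, 3.0, -1.0]])
probs = softmax_from_scratch(logits)
prediction = torch.argmax(probs, dim=-1)  # index of the most likely class per row
```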

(Easy): Define a simple neural network module in PyTorch and run a forward pass. For instance, implement an nn.Module with one nn.Linear layer followed by a ReLU activation. Show how to instantiate this model and feed a sample input through it.

(Easy): Use an embedding layer to map token IDs to vectors. For example, given a batch of token indices, create an nn.Embedding of appropriate size and show how to retrieve the embedding tensor for the batch (by calling the embedding layer on the input indices).

(Easy): Combine embedding vectors with positional encodings. Suppose you have a tensor of word embeddings and a tensor of positional encodings of the same shape; write code to add them together elementwise to form the final input for a transformer model.

(Easy): Pad sequences for batching: write a function that takes a list of sequences (lists of token IDs of varying lengths) and pads them with a PAD token (e.g., 0) to the same length. Also return an attention mask indicating which positions are real tokens (1) and which are padding (0).
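
A minimal sketch of the padding step, assuming right-padding and PAD id 0 as in the question. The helper name pad_batch and the example sequences are hypothetical.

```python
import torch

def pad_batch(sequences, pad_id=0):
    """Right-pad a list of token-ID lists and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = torch.full((len(sequences), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(sequences), max_len), dtype=torch.long)
    for row, seq in enumerate(sequences):
        input_ids[row, : len(seq)] = torch.tensor(seq, dtype=torch.long)
        attention_mask[row, : len(seq)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8], [9, 10]])
```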

(Easy): Implement a basic greedy decoding loop for text generation. Starting from an initial prompt (sequence of input IDs), iteratively feed it into the model to get next-token logits, pick the token with the highest probability (argmax), append it to the sequence, and repeat until an end-of-sequence token is produced.
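
A hedged sketch of the greedy loop. It assumes a batch size of 1 and a model that returns logits of shape (batch, seq, vocab); CountingModel is a made-up toy stand-in (it always predicts the previous token plus one) so the loop can run without a real language model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CountingModel(nn.Module):
    """Hypothetical toy LM: always predicts (previous token + 1) % vocab_size."""
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, input_ids):
        next_ids = (input_ids + 1) % self.vocab_size
        return F.one_hot(next_ids, self.vocab_size).float()  # (batch, seq, vocab)

@torch.no_grad()
def greedy_decode(model, input_ids, eos_id, max_new_tokens=20):
    model.eval()
    ids = input_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(ids)                                       # full re-run, no KV cache
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # (1, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:                              # assumes batch size 1
            break
    return ids

generated = greedy_decode(CountingModel(6), torch.tensor([[1, 2]]), eos_id=5)
```

The max_new_tokens cap is an addition to the question's description so the loop terminates even if the end-of-sequence token never appears.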

(Easy): Calculate model size: write code to compute the total number of parameters in a given PyTorch model and estimate its memory footprint. (Hint: sum up param.numel() * param.element_size() for each parameter to get total bytes, and convert to MB or GB.)
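
The hint can be sketched as follows. The helper name model_size is hypothetical, and the count covers parameters only (registered buffers such as BatchNorm running statistics are not included).

```python
import torch.nn as nn

def model_size(model: nn.Module):
    """Return (parameter count, parameter bytes) for a model."""
    n_params = sum(p.numel() for p in model.parameters())
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return n_params, n_bytes

model = nn.Linear(1024, 1024)          # float32 by default: 4 bytes per element
n_params, n_bytes = model_size(model)
size_mb = n_bytes / 1024**2            # bytes -> MiB
```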

Intermediate Questions (Moderate Difficulty)

(Medium): Implement scaled dot-product attention. Given query, key, and value tensors (Q, K, V) of shape (batch, seq_len, dim), compute the attention output $\text{softmax}(QK^T / \sqrt{d}) \cdot V$. Include support for an attention mask (e.g., ignore certain positions by adding -inf to the logits of masked positions before the softmax).
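
A minimal sketch of scaled dot-product attention with masking. It assumes a boolean mask where True means "may attend"; masked positions receive -inf before the softmax so their weights become exactly zero. The causal mask and tensor sizes are illustrative choices.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, dim). mask: bool, True = position may be attended."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)       # (batch, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # masked logits -> -inf
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 8) for _ in range(3))
causal = torch.tril(torch.ones(4, 4, dtype=torch.bool))    # each token sees itself and the past
out, weights = scaled_dot_product_attention(q, k, v, causal)
```

Note that a row whose positions are all masked would produce NaN after the softmax; callers need to make sure every query can attend to at least one key.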

(Medium): Implement the Transformer's feed-forward network block. Given an input tensor of shape (batch, seq_len, dim), pass it through a two-layer MLP: first Linear(dim → hidden_dim), apply an activation (e.g., GELU), then Linear(hidden_dim → dim). Show this in PyTorch code (you can assume some hidden_dim value).

(Medium): Implement top-k sampling for one step of language model decoding. Given a tensor of logits for the next token, filter it to the top k highest values (use torch.topk), then sample a token from those top-k probabilities (e.g., with torch.multinomial). The code should output an index for the sampled token.
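
One way to sketch the top-k step. The function name sample_top_k and the optional generator argument (for reproducible sampling) are assumptions added for illustration.

```python
import torch

def sample_top_k(logits, k, generator=None):
    """Sample one token per row from the k highest-logit tokens."""
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)
    probs = torch.softmax(topk_vals, dim=-1)               # renormalize over the top k only
    choice = torch.multinomial(probs, num_samples=1, generator=generator)
    return topk_idx.gather(-1, choice).squeeze(-1)         # map back to vocabulary ids

logits = torch.tensor([0.1, 5.0, 0.2, 4.0, -1.0])
g = torch.Generator().manual_seed(0)
samples = [sample_top_k(logits, k=2, generator=g).item() for _ in range(200)]
```

With k=2 only tokens 1 and 3 (the two largest logits) can ever be drawn.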

(Medium): Implement nucleus (top-p) sampling for one decoding step. Given logits and a probability threshold p, sort the token probabilities, compute their cumulative sum, and select the smallest set of tokens whose cumulative probability ≥ p. Then sample the next token from that set. Provide code to perform this selection and sampling.
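
A hedged sketch of nucleus sampling. A token is kept when the probability mass ranked before it is still below p, which yields the smallest prefix whose cumulative probability reaches p. The name sample_top_p and the example distribution are illustrative.

```python
import torch

def sample_top_p(logits, p, generator=None):
    """Sample one token per row from the smallest set with cumulative probability >= p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Mass accumulated *before* each token; once it reaches p the token is not needed.
    remove = (cumulative - sorted_probs) >= p
    kept = sorted_probs.masked_fill(remove, 0.0)
    kept = kept / kept.sum(dim=-1, keepdim=True)           # renormalize inside the nucleus
    choice = torch.multinomial(kept, num_samples=1, generator=generator)
    return sorted_idx.gather(-1, choice).squeeze(-1)

# Probabilities 0.05, 0.5, 0.15, 0.3: with p=0.7 the nucleus is tokens {1, 3}.
logits = torch.log(torch.tensor([0.05, 0.5, 0.15, 0.3]))
g = torch.Generator().manual_seed(0)
samples = [sample_top_p(logits, p=0.7, generator=g).item() for _ in range(200)]
```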

(Medium): Add caching to an autoregressive transformer decoding loop. Modify a naive generation function so that it passes a “past key-values” cache to the model. Show how you would store the K and V from each timestep (e.g., in lists or a preallocated tensor) and reuse them in subsequent model calls to avoid recomputing attention on previous tokens.

(Medium): Batch by sequence length for efficiency. Given a list of input sequences of different lengths, write code to sort them by length, batch those of similar lengths together, pad within each batch, and then run the model on each batch. (This minimizes padding and idle compute, improving throughput on variable-length inputs.)

(Medium): Implement micro-batching for inference. If a batch of N inputs is too large to process at once on the GPU, show how to split it into smaller sub-batches, run the model on each sub-batch sequentially (accumulating outputs), and then concatenate the results. Ensure the final outputs preserve the original input order.
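
A short sketch of micro-batching, assuming the model maps a (N, ...) tensor to a (N, ...) tensor. Because torch.split yields chunks in their original order, concatenating the outputs preserves the input order. The model and sizes are placeholders.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def run_in_micro_batches(model, inputs, micro_batch_size):
    outputs = []
    for chunk in torch.split(inputs, micro_batch_size, dim=0):  # last chunk may be smaller
        outputs.append(model(chunk))
    return torch.cat(outputs, dim=0)

torch.manual_seed(0)
model = nn.Linear(16, 4).eval()
x = torch.randn(10, 16)
y = run_in_micro_batches(model, x, micro_batch_size=3)
```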

(Medium): Optimize inference with TorchScript. Take a simple PyTorch model (or function) and demonstrate how to convert it to TorchScript using torch.jit.trace or torch.jit.script. Show the code for scripting/tracing the model and then using the compiled scripted_model to perform a forward pass.

(Medium): Use PyTorch 2.x compile (TorchDynamo + TorchInductor) to speed up inference. Write an example of wrapping a model with torch.compile and running inference on some input. (For instance: optimized_model = torch.compile(model) then use optimized_model(x) to execute the compiled graph for faster execution.)

(Medium): Profile a model's inference to find bottlenecks. Use torch.profiler (or the older torch.autograd.profiler) as a context manager to record the time taken by each operation during a forward pass. Provide code that runs the model inside a with torch.profiler.profile(...) as prof: block and then prints a report of the most time-consuming ops or layers.

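(Medium): Use mixed precision during inference. Show how to wrap model inference code in a torch.autocast(device_type='cuda', dtype=torch.float16) context (the older torch.cuda.amp.autocast() spelling is deprecated in recent PyTorch releases) to run it in float16 where possible. For example, demonstrate loading a model and input, then calling with torch.autocast(device_type='cuda', dtype=torch.float16): output = model(input) to leverage tensor cores, and mention the speed/memory benefits of FP16. (Under autocast, the input does not need to be cast to half manually.)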

(Medium): Apply dynamic quantization for faster CPU inference. Provide code to quantize a trained model (for example, using torch.quantization.quantize_dynamic on a model with linear layers or LSTMs) so that it uses int8/FP16 weights. Then show how to run inference with the quantized model and mention the potential speedup on CPU (with minimal accuracy drop).

(Medium): Prune insignificant weights in a model. For instance, take a fully connected layer's weight matrix and zero out all entries with magnitude below a threshold. Show code that identifies these small weights (e.g., mask = (weight.abs() < threshold)) and sets them to zero in-place. Comment on how this sparsity might affect model size or speed (noting that unstructured sparsity may need specialized kernels to see speedup).
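
A minimal sketch of magnitude pruning on one linear layer. The layer size and threshold are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(64, 64)
threshold = 0.05  # illustrative cutoff

with torch.no_grad():
    mask = layer.weight.abs() < threshold     # True where the weight is "insignificant"
    layer.weight.masked_fill_(mask, 0.0)      # zero those entries in place

sparsity = mask.float().mean().item()         # fraction of weights set to zero
# The tensor is still stored densely, so memory use and dense-matmul speed do not
# change by themselves; gains need sparse storage or sparsity-aware kernels.
```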

(Medium): Use the HuggingFace Transformers library to run a model inference manually. For example, load a pretrained GPT-2 model and tokenizer, encode a prompt into input IDs, feed the input IDs to the model (e.g., outputs = model(input_ids)), get the logits, and then decode the model's output IDs back to text. (This tests using HF models without the high-level .generate() convenience function.)

(Medium): Use pinned memory to accelerate data transfer. Demonstrate creating an input tensor with pin_memory=True (or using a DataLoader with pin_memory=True), then transferring it to the GPU with tensor.to('cuda', non_blocking=True). In code comments, explain how pinned (page-locked) host memory can improve throughput for CPU-to-GPU data copy operations by allowing DMA transfers.

Advanced Questions (High Difficulty)

(Hard): Implement multi-head self-attention from scratch. Given an input tensor of shape (batch, seq_len, model_dim) and weight matrices for W_q, W_k, W_v (to project inputs to each head) and W_o (to project concatenated heads to output), write code to compute multi-head attention. Split the input into multiple heads, compute scaled dot-product attention for each head (without using nn.MultiheadAttention), then concatenate the head outputs and apply the output projection. (Ensure tensor shapes line up for the matrix multiplies.)
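
A possible sketch of multi-head self-attention without nn.MultiheadAttention. It assumes square (model_dim, model_dim) weight matrices applied as x @ W, no biases, no mask, and model_dim divisible by the number of heads.

```python
import math
import torch

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (batch, seq_len, model_dim); all weights: (model_dim, model_dim)."""
    b, t, d = x.shape
    head_dim = d // num_heads

    def split_heads(proj):  # (b, t, d) -> (b, heads, t, head_dim)
        return proj.view(b, t, num_heads, head_dim).transpose(1, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)   # (b, heads, t, t)
    weights = torch.softmax(scores, dim=-1)
    heads = weights @ v                                       # (b, heads, t, head_dim)
    concat = heads.transpose(1, 2).reshape(b, t, d)           # merge heads back
    return concat @ w_o

torch.manual_seed(0)
d_model, num_heads = 16, 4
x = torch.randn(2, 5, d_model)
w_q, w_k, w_v, w_o = (torch.randn(d_model, d_model) / math.sqrt(d_model) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
```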

(Hard): Implement a full Transformer decoder block in PyTorch. The block should include self-attention followed by a feed-forward network, with residual connections and layer normalization around each sub-layer. Write the forward method assuming you have functions or modules for the attention and feed-forward parts. Show how you take the input x, compute attn_out = SelfAttention(x), then x = x + attn_out followed by LayerNorm, then pass that through the feed-forward network, add the residual and normalize again.

(Hard): Implement beam search decoding for a language model. Write a function that given a model and an input prompt, performs beam search with a specified beam width B. It should keep track of multiple hypotheses (sequences and their cumulative log-probabilities), expand each hypothesis with new tokens at each step (using the model's output probabilities), and prune down to the top B candidates. Continue until a stopping condition (e.g., all beams have produced an end-of-sequence token or a max length is reached), then return the highest scoring completed sequence. Include code for managing the beam candidate lists at each step.
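
A compact, unbatched sketch of beam search. It assumes the model returns logits of shape (batch, seq, vocab) and scores hypotheses by raw cumulative log-probability with no length normalization. TableModel is a hypothetical toy model (next-token distribution depends only on the last token), built so that the greedy path is not the best one.

```python
import torch
import torch.nn as nn

class TableModel(nn.Module):
    """Hypothetical toy LM driven by a fixed transition-probability table."""
    def __init__(self, transition_probs):
        super().__init__()
        self.register_buffer("logits_table", torch.log(transition_probs.clamp_min(1e-9)))

    def forward(self, input_ids):
        return self.logits_table[input_ids]  # (batch, seq, vocab)

@torch.no_grad()
def beam_search(model, prompt, beam_width, eos_id, max_new_tokens=10):
    beams = [(prompt, 0.0)]          # (1-D token tensor, cumulative log-prob)
    finished = []
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            log_probs = torch.log_softmax(model(seq.unsqueeze(0))[0, -1], dim=-1)
            top_lp, top_id = log_probs.topk(min(beam_width, log_probs.numel()))
            for lp, tok in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((torch.cat([seq, torch.tensor([tok])]), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:   # prune to the top B hypotheses
            (finished if seq[-1].item() == eos_id else beams).append((seq, score))
        if not beams:                                # every surviving beam hit EOS
            break
    finished.extend(beams)                           # beams cut off by max_new_tokens
    return max(finished, key=lambda c: c[1])

probs = torch.tensor([[0.00, 0.60, 0.40, 0.00],     # from token 0
                      [0.00, 0.33, 0.33, 0.34],     # from token 1
                      [0.00, 0.05, 0.05, 0.90],     # from token 2
                      [0.25, 0.25, 0.25, 0.25]])    # from EOS (never expanded)
toy = TableModel(probs)
greedy_seq, greedy_score = beam_search(toy, torch.tensor([0]), beam_width=1, eos_id=3)
beam_seq, beam_score = beam_search(toy, torch.tensor([0]), beam_width=2, eos_id=3)
```

In this toy setup, beam width 1 follows 0 → 1 → EOS (probability about 0.204), while beam width 2 recovers 0 → 2 → EOS (probability about 0.36).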

(Hard): Implement speculative decoding to accelerate generation. Suppose you have a large language model and a smaller "draft" model. Write a procedure where the draft model predicts the next k tokens in one go, and the larger model then verifies these tokens, ideally by scoring all k positions in a single forward pass rather than with one call per token. If the large model's output matches the draft for a token, you accept it and move on to checking the next token; as soon as it diverges, you discard the draft's remaining suggestions and resume generation from that point with the large model (perhaps requesting a new draft continuation). Provide a code outline showing how you'd interleave calls to the draft model and the main model, managing two sets of tokens (the proposed tokens and the confirmed tokens).

(Hard): Outline a continuous batching strategy for an LLM inference server (similar to what vLLM does). Write pseudo-code for a loop that continuously collects incoming requests and groups them into batches on the fly. For example, maintain a queue of incoming requests; at each iteration, take as many as available (up to some max batch size) to form a batch and run the model. If new requests arrive while a batch is running, they wait and then get batched in the next iteration. Ensure your outline handles a timeout or maximum delay so that no request waits indefinitely. (This tests understanding of dynamic batching in a live setting.)

(Hard): Use CUDA streams to overlap computation and data transfer. Provide an example where you create at least two torch.cuda.Stream objects: one stream that preloads or preprocesses data on the GPU while another stream runs the model inference on already-loaded data. Show how to use with torch.cuda.stream(stream): ... to assign operations to a stream, and ensure that you launch GPU-to-GPU copies or CPU-to-GPU transfers with non_blocking=True. The goal is to overlap the data copy of batch n+1 with the compute of batch n.

(Hard): Write a custom PyTorch autograd Function for a new operation. For example, implement a custom ReLU. Define a subclass of torch.autograd.Function with a static forward(ctx, input) that saves input via ctx.save_for_backward and returns input.clamp(min=0), and a static backward(ctx, grad_output) that retrieves the saved input and returns grad_output * (input > 0).float(). Provide the code for this class and then show how to use it in a model (e.g., CustomReLU.apply(tensor)) to verify it computes the same result as the built-in ReLU. (This tests low-level autograd understanding.)
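
A minimal sketch of the custom ReLU. The input has to be stored with ctx.save_for_backward in forward so that backward can rebuild the mask; the sample tensor is illustrative.

```python
import torch

class CustomReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)          # keep the input for the backward pass
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        # Gradient flows only where the input was strictly positive.
        return grad_output * (input > 0).to(grad_output.dtype)

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
y = CustomReLU.apply(x)
y.sum().backward()                             # populates x.grad
```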

(Hard): Write a GPU kernel using OpenAI Triton to perform an elementwise operation (for instance, add 1 to each element of an input tensor), and show how to launch it from PyTorch. Include the Triton kernel definition with @triton.jit, using tl.load to read from memory and tl.store to write results. Also demonstrate how to configure the launch grid/block size (e.g., using a grid lambda or specifying BLOCK_SIZE) and then call the kernel like kernel[grid](..., BLOCK_SIZE=...) on a sample tensor.

(Hard): Convert a PyTorch model to a TensorRT engine for deployment. Outline the steps in code: for example, export the model to ONNX format using torch.onnx.export, then use NVIDIA’s TensorRT Python API (or Torch-TensorRT) to load that ONNX model and build an optimized TensorRT engine. Finally, show how you would run inference using that TensorRT engine (e.g., by binding input/output and executing the context). Pseudo-code for the TensorRT part is fine — focus on the sequence of steps and any important parameters (like enabling FP16 or setting max workspace size).

(Hard): Parallelize preprocessing using CPU threads or processes to feed the model faster. For instance, if tokenization on CPU is a bottleneck, demonstrate how you could use Python’s concurrent.futures.ThreadPoolExecutor (for I/O-bound tasks) or multiprocessing.Pool (for CPU-bound tasks) to tokenize multiple inputs in parallel before batching them. Show a code snippet that takes a list of text strings, splits the work across threads or processes to produce token ID tensors for each, and then stacks them into a batch for the model.

(Hard): Integrate LoRA (Low-Rank Adaptation) into a model’s layer. Take a pre-trained weight matrix $W$ (for example, the weight of a transformer’s dense layer) and incorporate LoRA matrices into it. Show how you’d create two small trainable matrices $A$ and $B$ (with shapes like [out_dim, r] and [r, in_dim] for some small rank $r$) and modify the layer’s forward pass to use $W + \alpha \cdot A B$ (with $W$ kept frozen and only $A, B$ learned). Provide code snippets for modifying the model’s __init__ to add the new parameters and the forward to add the LoRA contribution to $W x$.
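
A hedged sketch of a LoRA-wrapped linear layer using the shapes from the question (A: [out_dim, r], B: [r, in_dim]). The class name LoRALinear, the initialization scale, and the choice to zero-initialize A so the wrapped layer starts out identical to the frozen one are assumptions; the original LoRA paper names its two matrices the other way around.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # W (and bias) stay frozen
        out_dim, in_dim = base.weight.shape
        self.A = nn.Parameter(torch.zeros(out_dim, r))   # zero init: delta starts at 0
        self.B = nn.Parameter(torch.randn(r, in_dim) * 0.01)
        self.alpha = alpha

    def forward(self, x):
        # (W + alpha * A B) x, computed as W x + alpha * (x B^T) A^T so the
        # full-size product A B is never materialized.
        return self.base(x) + self.alpha * (x @ self.B.T) @ self.A.T

torch.manual_seed(0)
base = nn.Linear(16, 8)
lora = LoRALinear(base, r=4, alpha=0.5)
x = torch.randn(3, 16)
out = lora(x)
```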

(Hard): Demonstrate a case where torch.jit.trace is insufficient and torch.jit.script is needed. For example, write a small PyTorch function that has an if/else branch or a loop that depends on the input data (not just on tensor sizes). Show that tracing this function with a sample input will capture only one branch (thus not generalizing to the other case). Then show how using torch.jit.script instead can handle the dynamic control flow. Provide the code for both the traced and scripted versions, highlighting why the traced one is incorrect.
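
One way to sketch the trace-versus-script difference. The function data_dependent is a made-up example whose branch depends on the tensor's values; the example is meant to be run from a file, since torch.jit.script needs access to the function's source.

```python
import warnings
import torch

def data_dependent(x):
    if x.sum() > 0:        # branch depends on the data, not just the shape
        return x * 2
    return x - 1

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # tracing emits a TracerWarning about this branch
    traced = torch.jit.trace(data_dependent, torch.ones(3))   # records only the x * 2 path
scripted = torch.jit.script(data_dependent)                   # compiles both branches

negative = torch.tensor([-1.0, -2.0, -3.0])
eager_out = data_dependent(negative)      # takes the x - 1 branch
traced_out = traced(negative)             # silently replays x * 2
scripted_out = scripted(negative)         # follows the correct branch
```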

(Hard): Optimize a generation loop by avoiding CPU synchronization. For example, explain that calling .item() on a CUDA tensor forces a GPU-to-CPU sync. Show code where instead of doing next_token_id = logits.argmax().item() each iteration (which brings the value to CPU), you keep the computation on the GPU by using tensor operations. For instance, you can obtain the index of the max logit as a 0-dim tensor and use it directly to index into embedding for the next step. By not calling .item() in the loop, you allow asynchronous GPU execution to proceed without stall, greatly improving throughput in autoregressive generation.

(Hard): Perform model-parallel inference across multiple GPUs. For a very large model that cannot fit into one GPU’s memory, illustrate how to split the model’s layers between two GPUs. Provide a code sketch: for example, move model.encoder to cuda:0 and model.decoder to cuda:1. In the forward pass, send the input to the encoder on GPU0, get its output, then transfer that output tensor to GPU1 to feed into the decoder. Show how you would coordinate the device placements and .to() operations so that each part of the model runs on the intended device.

(Hard): Use HuggingFace Optimum to accelerate inference. For example, show how to export a Transformers model to ONNX and run it with ONNX Runtime for faster CPU/GPU inference using Optimum. You might demonstrate code that uses Optimum to export a model to ONNX (ORTModel.from_pretrained(...)) or wrap a model with BetterTransformer. Include the steps to load the optimized model and perform a sample inference. (This tests familiarity with external optimization tools for PyTorch models.)

(Hard): Perform post-optimization validation of a model. After applying an optimization (quantization, compilation, etc.), you need to ensure the model’s outputs are still correct. Write code to run the same input through both the original and the optimized model and compare the outputs. For example, compute the mean absolute difference between the two outputs, or if it's a classification/generation model, check that they produce the same top-1 prediction or sequence. This verification code helps confirm that the optimization didn’t degrade the model’s accuracy beyond an acceptable range.
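
A small sketch of such a comparison. The helper compare_outputs and its report keys are hypothetical, and the "optimized" model here is only a stand-in: a copy whose weights were rounded through bfloat16, used so the example runs on CPU without any specific optimization backend.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def compare_outputs(reference, candidate, x):
    ref_out = reference(x).float()
    cand_out = candidate(x).float()
    diff = (ref_out - cand_out).abs()
    agree = (ref_out.argmax(dim=-1) == cand_out.argmax(dim=-1)).float().mean()
    return {
        "max_abs_diff": diff.max().item(),
        "mean_abs_diff": diff.mean().item(),
        "top1_agreement": agree.item(),      # fraction of rows with the same argmax
    }

torch.manual_seed(0)
reference = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
candidate = copy.deepcopy(reference)         # stand-in for a quantized/compiled model
with torch.no_grad():
    for p in candidate.parameters():
        p.copy_(p.to(torch.bfloat16).to(torch.float32))   # simulate precision loss

x = torch.randn(128, 32)
report = compare_outputs(reference, candidate, x)
```

What counts as an acceptable difference is task-specific; the thresholds are left to the caller.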

(Hard): Use HuggingFace Text Generation Inference (TGI) server for serving an LLM. Assume a model is already loaded on a TGI server endpoint — write a small Python client snippet that sends a generation request to the server. For example, use the requests library to POST a JSON payload with a prompt and decoding parameters to the TGI HTTP endpoint, then parse the JSON response to get the generated text. (This tests understanding of integrating with an inference server via its API.)

(Hard): Configure NVIDIA Triton Inference Server for dynamic batching. Describe (or provide) what you would put in the model’s config.pbtxt to enable this. For example, specify a max_batch_size that the model can handle (e.g., 8 or 16) and add a dynamic_batching block with parameters like max_queue_delay_microseconds (to set the max wait time for forming a batch). Provide a short snippet of a config showing these settings, and explain that this will allow Triton to automatically batch incoming requests up to the max batch size or timeout.

(Hard): Implement sinusoidal positional encoding as used in the original Transformer. Write a function that takes a sequence length L and model dimension d_model and returns a tensor of shape (L, d_model) where each position $i$ (0-indexed) has a sinusoidal encoding. Use the formula: for each dimension $j$:

$$\text{PE}[i, 2j] = \sin\left(\frac{i}{10000^{2j/d_{\text{model}}}}\right), \qquad \text{PE}[i, 2j+1] = \cos\left(\frac{i}{10000^{2j/d_{\text{model}}}}\right)$$

Implement this calculation in PyTorch (avoiding explicit Python loops if possible).
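
A loop-free sketch of the formula above, assuming d_model is even. The function name is illustrative.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) tensor; assumes d_model is even."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # i, shape (L, 1)
    two_j = torch.arange(0, d_model, 2, dtype=torch.float32)             # the values 2j
    inv_freq = torch.exp(-math.log(10000.0) * two_j / d_model)           # 1 / 10000^(2j/d_model)
    angles = position * inv_freq                                         # (L, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even columns: PE[i, 2j]
    pe[:, 1::2] = torch.cos(angles)   # odd columns:  PE[i, 2j+1]
    return pe

pe = sinusoidal_positional_encoding(10, 8)
```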

(Hard): Refactor a computational Python loop into a vectorized PyTorch operation for speed. For example, suppose you have code that iterates over each element in a tensor to apply a function (which is very slow in Python). Show an example of such a loop and then show how to replace it with equivalent PyTorch tensor operations (which leverage parallelism). Explain in comments how removing Python-loop overhead and using vectorized operations speeds up inference, especially for large tensor computations.

(Hard): Diagnose and fix a memory leak in an inference loop. For instance, consider a scenario where you append each model output tensor to a Python list for logging or further processing. Write code that simulates this (e.g., a loop doing outputs.append(model(x)) each time) and explain why this causes GPU memory to balloon (hint: the computation graph is being retained for each output tensor). Then show how to fix it by disabling grad tracking (with torch.no_grad(): around inference) or detaching/cloning tensors before storing them (so that no reference to the computation graph remains), and by deleting or reusing tensors appropriately.

(Hard): Work around a part of the model that can’t be compiled by TorchDynamo. If torch.compile is failing or falling back for a certain section of your model (for example, a part with unsupported operations or dynamic Python logic), demonstrate how you can use torch._dynamo.disable (also exposed as torch.compiler.disable in recent releases) to exclude that code from compilation. The most commonly documented form is a decorator on the problematic function. Provide a code example where you move the problematic section into its own function and mark it with torch._dynamo.disable, so that the rest of the model runs under TorchDynamo optimization while that section runs eagerly (ensuring the whole model can execute without errors).

(Hard): Use multiple GPUs to increase throughput via data parallelism. Show how you could replicate a model on two GPUs and split a batch between them for inference. For example, you might use torch.nn.DataParallel to automatically split input across GPUs, or do it manually: send half of the input batch to a model on cuda:0 and the other half to a clone of the model on cuda:1, then concatenate the outputs. Provide code that demonstrates this parallel inference across two GPUs and mention how it can almost halve the per-batch latency (ignoring some overhead).

(Medium): Implement random sampling for text generation with a temperature parameter. At each decoding step, apply a softmax to the model’s logits (you can divide the logits by a temperature τ > 0 to control randomness), then use torch.multinomial to sample one token from the resulting probability distribution. Write a code snippet for one step of this process, given logits for the current step, and show how changing the temperature affects the sampling outcome (more random vs. more greedy).
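
A minimal sketch of one temperature-sampling step. The function also returns the probability vector so the effect of the temperature can be inspected; the name, the generator argument, and the sample logits are illustrative.

```python
import torch

def sample_with_temperature(logits, temperature=1.0, generator=None):
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    probs = torch.softmax(logits / temperature, dim=-1)
    token = torch.multinomial(probs, num_samples=1, generator=generator).squeeze(-1)
    return token, probs

logits = torch.tensor([2.0, 1.0, 0.5])
_, sharp = sample_with_temperature(logits, temperature=0.1)    # close to greedy
_, neutral = sample_with_temperature(logits, temperature=1.0)  # plain softmax
_, flat = sample_with_temperature(logits, temperature=10.0)    # close to uniform
```

Lower temperatures concentrate probability on the largest logit, while higher temperatures flatten the distribution toward uniform.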

(Medium): Apply weight tying in a language model to reduce parameters. For example, if a model uses an nn.Embedding for input tokens and an nn.Linear as the output projection for predicting next-token logits, set the linear layer’s weight equal to the embedding matrix so they share weights. Show in code how you would do this (e.g., model.output_layer.weight = model.input_embed.weight) and explain that this way the input and output embeddings remain the same, saving memory and typically improving model consistency.
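
A short sketch of weight tying, reusing the attribute names from the question (input_embed, output_layer). TinyLM is a hypothetical minimal model with no transformer layers in between; the output projection is created without a bias so its weight has the same (vocab, dim) shape as the embedding matrix.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.input_embed = nn.Embedding(vocab_size, dim)            # weight: (vocab, dim)
        self.output_layer = nn.Linear(dim, vocab_size, bias=False)  # weight: (vocab, dim)
        self.output_layer.weight = self.input_embed.weight          # share one Parameter

    def forward(self, input_ids):
        hidden = self.input_embed(input_ids)
        return self.output_layer(hidden)                            # (batch, seq, vocab)

vocab_size, dim = 100, 16
model = TinyLM(vocab_size, dim)
logits = model(torch.randint(0, vocab_size, (2, 5)))
n_params = sum(p.numel() for p in model.parameters())   # shared weight is counted once
```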

(Medium): Tune PyTorch’s thread settings for CPU inference. For a model running on CPU (multi-core), show how to configure the number of threads used by PyTorch. For example, use torch.set_num_threads(n) and torch.set_num_interop_threads(m) to limit or set the parallelism. Provide code setting these values (e.g., to 4 threads), and explain in comments when you might adjust these — such as to prevent thread oversubscription in a multi-model deployment or to optimize throughput on CPU-bound inference workloads.

Each question above is designed to be hands-on and coding-focused, reflecting scenarios a Machine Learning Engineer (Inference) might encounter when optimizing LLM inference in production. Difficulty is labeled from easy fundamentals to challenging performance engineering tasks. The questions emphasize practical reasoning, awareness of inference speed and memory bottlenecks, and the ability to write clean, optimized PyTorch code under real-world constraints.