LLM-Specific Questions
Below are 25 basic (yet thorough) coding-focused questions that test fundamental PyTorch skills relevant to building and running LLMs. They range from creating and manipulating tensors, to implementing small transformer components, to applying sampling methods. Each question should prompt you to write working code (in a live environment or whiteboard style), ensuring you can demonstrate good coding practices in PyTorch for LLM use cases.
- Tensor Creation & Manipulation
- Question: Write a small function to:
- Create a 2D PyTorch tensor (e.g. shape `[3,4]`) of random floats.
- Print its shape and data type.
- Move it to a GPU if available.
- Reshape it to `[2,6]`.
- Tests ability to create, reshape, and manage devices for tensors.
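One possible answer to the tensor question, as a minimal sketch. The function name `make_and_reshape` is made up for illustration, and `torch.rand` (uniform floats) is one choice among several for "random floats".

```python
import torch

def make_and_reshape(rows: int = 3, cols: int = 4) -> torch.Tensor:
    """Create a random float tensor, report it, move it, and reshape it."""
    x = torch.rand(rows, cols)                 # uniform floats in [0, 1), float32 by default
    print(x.shape, x.dtype)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = x.to(device)                           # no-op when already on the target device
    return x.reshape(2, 6)                     # 3 * 4 == 2 * 6, so the element count matches

y = make_and_reshape()
```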
- Embedding Lookup
- Question: Suppose you have a vocabulary size of 10,000 and an embedding dimension of 768. Create an `nn.Embedding` for this vocabulary. Then:
- Generate a batch of token indices (e.g., shape `[batch_size=4, seq_len=5]`).
- Pass these indices through the embedding to get the corresponding embeddings.
- Tests creation and usage of embedding layers, along with batch dimension handling.
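A short sketch of the embedding lookup, using random token indices as a stand-in for real tokenizer output:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 768
embedding = nn.Embedding(vocab_size, embed_dim)

# Indices must be integer typed (int64 here) and lie in [0, vocab_size).
token_ids = torch.randint(0, vocab_size, (4, 5))   # (batch_size, seq_len)
embedded = embedding(token_ids)                    # (batch_size, seq_len, embed_dim)
print(embedded.shape)
```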
- Forward Pass Through a Simple Network
- Question: Define a small `nn.Module` that includes:
- An `nn.Embedding` layer.
- A single `nn.Linear` layer mapping from embedding dimension to a “hidden” dimension of your choice.
- A forward method that takes token indices, embeds them, and produces a final output tensor.
- Tests understanding of custom modules, forward methods, and dimension handling.
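A minimal sketch of such a module. The class name `TinyEncoder` and the default sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, vocab_size: int = 10_000, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, hidden_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) -> (batch, seq_len, embed_dim) -> (batch, seq_len, hidden_dim)
        return self.proj(self.embed(token_ids))

encoder = TinyEncoder()
hidden = encoder(torch.randint(0, 10_000, (2, 7)))
```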
- Positional Encoding
- Question: Write a function that takes input embeddings `(batch, seq_len, embedding_dim)` and adds sinusoidal positional encodings of shape `(seq_len, embedding_dim)` to them. Show how you would:
- Generate the sinusoidal encodings (using `sin` and `cos`).
- Broadcast-add them to the batch of embeddings.
- Tests how to handle shape broadcasting and incorporate positional information for LLMs.
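A sketch of the sinusoidal encoding, following the common convention of sines on even channels and cosines on odd channels with a base of 10000. It assumes `embedding_dim` is even; the function names are illustrative.

```python
import math
import torch

def sinusoidal_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Return a (seq_len, dim) table of encodings. Assumes dim is even."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )                                                                         # (dim / 2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even channels
    pe[:, 1::2] = torch.cos(position * div_term)   # odd channels
    return pe

def add_positional_encoding(x: torch.Tensor) -> torch.Tensor:
    _, seq_len, dim = x.shape
    pe = sinusoidal_encoding(seq_len, dim).to(device=x.device, dtype=x.dtype)
    return x + pe   # (seq_len, dim) broadcasts over the leading batch dimension

out = add_positional_encoding(torch.zeros(2, 5, 8))
```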
- Basic Autoregressive Decoding Loop (Greedy)
- Question: Assume you have a function `model.forward(input_ids)` that returns logits over your vocabulary. Write a greedy decoding loop that:
- Starts with a prompt (list of token IDs).
- Iteratively obtains the next token by taking `argmax` of the logits at each step.
- Continues until you reach a special `<EOS>` token or a maximum length.
- Tests ability to implement the simplest decode strategy by coding a loop with a model call each iteration.
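A sketch of the greedy loop. It assumes `model.forward` returns logits of shape `(1, seq_len, vocab)`; `CountingModel` is a hypothetical stand-in that always predicts "previous token + 1", so the loop can be exercised without a trained model.

```python
import torch

@torch.no_grad()
def greedy_decode(model, prompt_ids, eos_id, max_len=20):
    ids = torch.tensor([prompt_ids], dtype=torch.long)    # (1, prompt_len)
    while ids.shape[1] < max_len:                          # max_len counts prompt + generated tokens
        logits = model.forward(ids)                        # assumed shape: (1, seq_len, vocab)
        next_id = logits[0, -1].argmax().item()            # most likely token at the last position
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:
            break
    return ids[0].tolist()

class CountingModel:
    """Hypothetical toy model: puts all probability mass on (token + 1) % vocab_size."""
    vocab_size = 10

    def forward(self, input_ids):
        nxt = (input_ids + 1) % self.vocab_size
        return torch.nn.functional.one_hot(nxt, self.vocab_size).float()

generated = greedy_decode(CountingModel(), [3, 4], eos_id=7)
```

This version re-runs the model on the full sequence at every step, which is the simplest form; a KV cache avoids that recomputation.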
- Top-k Sampling Decoding
- Question: Modify the above decoding loop to implement top-k sampling instead of greedy:
- Use `torch.topk` on the logits to keep only the top-k tokens.
- Sample from the resulting distribution using `torch.multinomial`.
- Tests handling probability distributions and random sampling in PyTorch for more varied text generation.
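A sketch of one top-k sampling step for a single position. The helper name `sample_top_k` is illustrative; in a decoding loop it would replace the `argmax` call.

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int, generator=None) -> int:
    """logits: (vocab,) for one position. Returns a sampled token id."""
    top_vals, top_idx = torch.topk(logits, k)          # k largest logits and their token ids
    probs = torch.softmax(top_vals, dim=-1)            # renormalise over the kept tokens only
    choice = torch.multinomial(probs, num_samples=1, generator=generator)
    return top_idx[choice].item()                      # map back to a vocabulary id

logits = torch.tensor([0.0, 5.0, 1.0, 4.0, -2.0])
gen = torch.Generator().manual_seed(0)
samples = {sample_top_k(logits, k=2, generator=gen) for _ in range(50)}
```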
- Nucleus (Top-p) Sampling
- Question: Write a function to implement top-p (nucleus) sampling in one step of decoding:
- Sort token probabilities by descending order.
- Select tokens until their cumulative probability ≥ p.
- Sample from the truncated distribution.
- Tests dynamic selection of a token set based on cumulative probability.
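A sketch of one nucleus sampling step. It keeps the smallest prefix of the sorted tokens whose cumulative probability reaches `p`, which means the token that crosses the threshold is included. The name `sample_top_p` is illustrative.

```python
import torch

def sample_top_p(logits: torch.Tensor, p: float, generator=None) -> int:
    """logits: (vocab,) for one position. Returns a sampled token id."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the mass *before* it is still below p.
    keep = (cumulative - sorted_probs) < p
    kept = sorted_probs * keep
    kept = kept / kept.sum()                            # renormalise the truncated distribution
    choice = torch.multinomial(kept, num_samples=1, generator=generator)
    return sorted_idx[choice].item()

logits = torch.log(torch.tensor([0.05, 0.5, 0.15, 0.3]))
gen = torch.Generator().manual_seed(0)
samples = {sample_top_p(logits, p=0.7, generator=gen) for _ in range(50)}
```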
- Mini “Attention” Mechanism
- Question: Implement a simple scaled dot-product attention from scratch. Given `Q, K, V` of shape `(batch, seq_len, dim)`, compute $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^\top / \sqrt{\mathrm{dim}}\right)V$.
- Tests basic matrix multiplications, shape alignment, and understanding of attention in LLMs.
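A single-head sketch of scaled dot-product attention. Returning the weights alongside the output is a convenience for inspection, not a requirement of the question.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, dim). Returns (output, attention weights)."""
    dim = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(dim)   # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1 over the keys
    return weights @ v, weights                         # output: (batch, seq_len, dim)

torch.manual_seed(0)
q, k, v = torch.randn(2, 5, 8), torch.randn(2, 5, 8), torch.randn(2, 5, 8)
out, weights = scaled_dot_product_attention(q, k, v)
```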
- Layer Normalization
- Question: Write your own PyTorch module that implements `LayerNorm` manually (i.e., do not use `nn.LayerNorm`). Show how you’d:
- Compute mean and variance across the last dimension.
- Subtract mean, divide by std, multiply by a learnable `gamma`, and add a learnable `beta`.
- Tests knowledge of normalization steps and custom parameter usage.
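A sketch of a manual layer norm. It uses the biased variance and adds a small `eps` inside the square root for numerical stability, which matches the defaults of `nn.LayerNorm`; the class name is illustrative.

```python
import torch
import torch.nn as nn

class ManualLayerNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))    # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))    # learnable shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased (population) variance
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

torch.manual_seed(0)
x = torch.randn(2, 5, 16)
manual_out = ManualLayerNorm(16)(x)
reference_out = nn.LayerNorm(16)(x)   # used only to check the manual version
```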
- Masking in Attention
- Question: Extend your scaled dot-product attention code to support an attention mask (e.g., a boolean mask of shape `(batch, seq_len, seq_len)`). Any position where the mask is `False` should have its score set to a very negative value (like `-1e9`) before the softmax.
- Tests the typical approach for ignoring future tokens or padded tokens in attention calculation.
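A sketch of masked attention, assuming the convention from the question that `True` means "may attend". The causal mask built with `torch.tril` is one example of such a mask.

```python
import math
import torch

def masked_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, dim); mask: bool, broadcastable to (batch, seq_len, seq_len)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(~mask, -1e9)    # blocked positions get ~zero weight
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
batch, seq_len, dim = 2, 4, 8
q, k, v = (torch.randn(batch, seq_len, dim) for _ in range(3))
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal = causal.unsqueeze(0).expand(batch, -1, -1)  # (batch, seq_len, seq_len)
out, weights = masked_attention(q, k, v, causal)
```

A large finite value such as `-1e9` keeps a fully masked row finite (it becomes uniform), whereas `float("-inf")` would turn that row into NaNs.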
- KV Caching
- Question: Show how to implement a basic “past key-value” cache for an autoregressive model. Suppose your model returns `(logits, new_k, new_v)`, and you want to store `(k, v)` from all previous timesteps to avoid recomputing them. Demonstrate how you’d:
- Initialize empty lists or tensors for the cache.
- Append `new_k, new_v` at each time step.
- Pass the entire cached `(k, v)` to the attention mechanism.
- Tests your understanding of how LLMs speed up inference by caching previous computations.
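A sketch of a tensor-based cache for a single attention head. The class `KVCache` and the helper `attend` are illustrative names, and the random `k_all` / `v_all` slices stand in for the `new_k, new_v` a real model would return at each step.

```python
import math
import torch

class KVCache:
    """Grows along the sequence dimension as new keys and values arrive."""
    def __init__(self):
        self.k = None
        self.v = None

    def append(self, new_k, new_v):
        # new_k, new_v: (batch, 1, dim) for the newest token only
        self.k = new_k if self.k is None else torch.cat([self.k, new_k], dim=1)
        self.v = new_v if self.v is None else torch.cat([self.v, new_v], dim=1)
        return self.k, self.v

def attend(q, k, v):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
batch, steps, dim = 2, 4, 8
q_all = torch.randn(batch, steps, dim)
k_all = torch.randn(batch, steps, dim)
v_all = torch.randn(batch, steps, dim)

cache = KVCache()
outputs = []
for t in range(steps):
    k, v = cache.append(k_all[:, t:t + 1], v_all[:, t:t + 1])   # cache now holds steps 0..t
    outputs.append(attend(q_all[:, t:t + 1], k, v))             # only the newest query is computed
incremental = torch.cat(outputs, dim=1)                         # (batch, steps, dim)
```

Because the newest query only ever sees keys from steps `0..t`, no explicit causal mask is needed during cached decoding. Repeated `torch.cat` reallocates the cache each step; preallocating a buffer of the maximum length is a common refinement.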
- Dynamic Padding / Batching
- Question: Suppose you have a list of sequences (lists of token IDs) of varying lengths. Write a function `collate_fn(batch)` that:
- Finds the longest sequence in the batch.
- Pads all sequences to that length.
- Stacks them into a single PyTorch tensor `(batch_size, max_len)`.
- Tests handling variable-length inputs in an LLM setting.
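A sketch of such a `collate_fn`. The `pad_id` parameter and the returned boolean mask go beyond what the question asks for and are added here as assumptions, since downstream attention usually needs to know which positions are padding.

```python
import torch

def collate_fn(batch, pad_id: int = 0):
    """batch: list of lists of token ids. Returns (padded ids, mask of real tokens)."""
    max_len = max(len(seq) for seq in batch)
    padded = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros(len(batch), max_len, dtype=torch.bool)
    for i, seq in enumerate(batch):
        padded[i, :len(seq)] = torch.tensor(seq, dtype=torch.long)   # right-padding
        mask[i, :len(seq)] = True
    return padded, mask

padded, mask = collate_fn([[1, 2, 3], [4], [5, 6]])
```

The same function can be passed to `torch.utils.data.DataLoader` through its `collate_fn` argument.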
- Mixed Precision Inference
- Question: Write a code snippet using `torch.cuda.amp.autocast()` that performs a forward pass in half precision:
- Create a small model (or load an existing one).
- Run the forward pass in the `amp` context manager.
- Print the dtype of the output to confirm it’s half precision.
- Tests knowledge of half precision usage for faster GPU-based LLM inference.
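A sketch using `torch.autocast`, the device-generic form of the context manager named in the question. As an assumption made so the snippet also runs without a GPU, it falls back to CPU autocast with `bfloat16`; on CUDA it uses `float16`.

```python
import torch
import torch.nn as nn

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# float16 is the usual choice on GPU; CPU autocast is exercised here with bfloat16.
amp_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

model = nn.Linear(16, 4).to(device_type).eval()
x = torch.randn(2, 16, device=device_type)

with torch.no_grad(), torch.autocast(device_type=device_type, dtype=amp_dtype):
    out = model(x)          # the matmul inside nn.Linear runs in the lower precision

print(out.dtype)            # weights themselves stay in float32
```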
- Basic TorchScript Scripting
- Question: Given a small PyTorch module for text classification, show how to script it using `torch.jit.script(model)`. Then do a forward pass with a sample input on the scripted module.
- Tests ability to compile a model with TorchScript for potential inference optimizations.
- Beam Search
- Question: Implement beam search for an LLM. At each step:
- Keep track of the top beam_size partial sequences.
- Expand each partial sequence by possible next tokens.
- Prune down to the top beam_size expansions based on cumulative log probability.
- Tests more advanced decoding strategy relevant to many LLM tasks.
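A compact beam search sketch. It assumes a hypothetical `log_prob_fn(seq)` that returns next-token log-probabilities of shape `(vocab,)`; the small transition table below stands in for a model and is constructed so that greedy decoding misses the best sequence.

```python
import torch

def beam_search(log_prob_fn, prompt, beam_size, max_new_tokens, eos_id):
    beams = [(list(prompt), 0.0)]                       # (token ids, cumulative log-prob)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                       # finished beams are carried over unchanged
                candidates.append((seq, score))
                continue
            log_probs = log_prob_fn(seq)                # (vocab,)
            top_lp, top_id = torch.topk(log_probs, beam_size)
            for lp, tok in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((seq + [tok], score + lp))
        # Prune to the best beam_size expansions by cumulative log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams

# Toy next-token probabilities indexed by the previous token (token 3 plays the role of EOS).
probs = torch.tensor([
    [0.0, 0.60, 0.40, 0.0],
    [0.0, 0.55, 0.45, 0.0],
    [0.0, 0.90, 0.10, 0.0],
    [0.0, 0.00, 0.00, 1.0],
])
toy_log_probs = lambda seq: torch.log(probs[seq[-1]])

beams = beam_search(toy_log_probs, [0], beam_size=2, max_new_tokens=2, eos_id=3)
greedy = beam_search(toy_log_probs, [0], beam_size=1, max_new_tokens=2, eos_id=3)
```

With `beam_size=1` the search reduces to greedy decoding. Real implementations usually add length normalisation so that longer sequences are not penalised; that is left out here.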
- Temperature Scaling
- Question: Write a decode function that, at each step, applies a temperature factor τ to the logits: `logits = logits / tau`, and then samples from the softmax distribution. Demonstrate that setting `tau < 1.0` makes generation more deterministic (less random), and setting `tau > 1.0` yields more varied text.
- Tests ability to handle sampling diversity through temperature scaling.
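A sketch of one temperature-scaled sampling step. Rather than comparing generated text, it inspects the probability of the most likely token, which grows as `tau` shrinks; the helper names are illustrative.

```python
import torch

def tempered_probs(logits: torch.Tensor, tau: float) -> torch.Tensor:
    return torch.softmax(logits / tau, dim=-1)

def sample_with_temperature(logits: torch.Tensor, tau: float, generator=None) -> int:
    probs = tempered_probs(logits, tau)
    return torch.multinomial(probs, num_samples=1, generator=generator).item()

logits = torch.tensor([2.0, 1.0, 0.0])
for tau in (0.5, 1.0, 2.0):
    # Lower tau sharpens the distribution; higher tau flattens it toward uniform.
    print(tau, tempered_probs(logits, tau).tolist())
```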
- Check Model Parameter Counts
- Question: Write code to:
- Count the total parameters in a model (sum of `p.numel()` for all parameters).
- Separate them into trainable vs. non-trainable parameters.
- Print the result in a user-friendly format (e.g., “Total parameters: X, Trainable: Y, Frozen: Z”).
- Tests basic parameter inspection and that you can confirm correct freezing or updating of certain layers in an LLM scenario.
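A sketch of the parameter count, using a two-layer toy model with its first layer frozen so that all three numbers are non-trivial:

```python
import torch.nn as nn

def count_parameters(model: nn.Module):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable, total - trainable

model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 5))
for p in model[0].parameters():        # freeze the first layer
    p.requires_grad = False

total, trainable, frozen = count_parameters(model)
print(f"Total parameters: {total:,}, Trainable: {trainable:,}, Frozen: {frozen:,}")
```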
- Implement a Simple GPT-like Block
- Question: Build an `nn.Module` named `GPTBlock` that contains:
- A self-attention sublayer (multi-head or single-head).
- A feed-forward sublayer.
- Residual connections & layer norms.
- A `forward` method that expects `(x, mask=None)`.
- Tests modular design of LLM building blocks, reflecting GPT-like architecture basics.
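A sketch of a pre-norm `GPTBlock` built on `nn.MultiheadAttention`. Two assumptions: the mask is a boolean `(seq_len, seq_len)` tensor where `True` means "may attend", and layer norm is applied before each sublayer (post-norm is an equally valid answer). Note that `nn.MultiheadAttention` uses the opposite boolean convention, so the mask is inverted before it is passed in.

```python
import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4, ff_mult: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, ff_mult * dim), nn.GELU(), nn.Linear(ff_mult * dim, dim)
        )

    def forward(self, x, mask=None):
        # nn.MultiheadAttention treats True as "blocked", hence the inversion.
        attn_mask = None if mask is None else ~mask
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out                      # residual around attention
        x = x + self.ff(self.ln2(x))          # residual around feed-forward
        return x

torch.manual_seed(0)
block = GPTBlock()
x = torch.randn(2, 5, 64)
causal = torch.tril(torch.ones(5, 5, dtype=torch.bool))
out = block(x, causal)
```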
- Training Loop vs. Inference Loop
- Question: Write code that sets up a training loop for a toy language model (e.g., next-token prediction on random data). Then create a separate function or block for inference that:
- Disables gradient tracking (`no_grad`).
- Uses the model in eval mode.
- Performs next-token prediction.
- Tests clarity about switching between training and inference in real code, which is crucial for LLM usage.
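A sketch contrasting the two loops. The "language model" here is deliberately tiny (an embedding followed by a linear layer, so each prediction depends only on the current token) and is trained on one fixed batch of random tokens; both are assumptions made to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 50, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

data = torch.randint(0, vocab_size, (8, 17))
inputs, targets = data[:, :-1], data[:, 1:]      # targets are the inputs shifted by one

def train(steps: int = 20):
    model.train()                                # enables dropout etc. (none here, but good habit)
    losses = []
    for _ in range(steps):
        logits = model(inputs)                   # (batch, seq_len, vocab)
        loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

@torch.no_grad()                                 # no graph is built during inference
def predict_next(token_ids: torch.Tensor) -> torch.Tensor:
    model.eval()
    logits = model(token_ids)
    return logits[:, -1].argmax(dim=-1)          # (batch,) next-token ids

losses = train()
next_tokens = predict_next(inputs)
```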
- Use Hugging Face Transformers for a Basic LLM
- Question: Demonstrate loading a small GPT-2 model (`'gpt2'`) via `AutoModelForCausalLM.from_pretrained('gpt2')`. Then manually:
- Tokenize an input prompt with `AutoTokenizer`.
- Convert the tokenizer outputs to tensors.
- Run the model in a loop to generate tokens (greedy or top-k).
- Tests direct coding with HF Transformers library for an LLM scenario, ensuring familiarity with model & tokenizer usage.
- Manual Gradient-Freezing
- Question: Suppose you only want to fine-tune the last two layers of a model. Write code to:
- Freeze all model parameters by default.
- Unfreeze only the final two layers.
- Confirm that only those last two layers’ parameters have `requires_grad=True`.
- Tests knowledge of partial fine-tuning, which is common in LLM training/inference setups (LoRA, etc.).
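A sketch of freezing on a toy stack of four linear layers. For a real LLM the "last two layers" would be selected from the model's own block list, whose attribute name depends on the architecture.

```python
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])

for p in model.parameters():                     # 1. freeze everything
    p.requires_grad = False

for layer in list(model.children())[-2:]:        # 2. unfreeze the final two layers
    for p in layer.parameters():
        p.requires_grad = True

# 3. confirm which parameters will receive gradients
trainable_names = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable_names)
```

When building the optimizer afterwards, passing only the parameters with `requires_grad=True` avoids allocating optimizer state for frozen weights.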
- Storing Intermediate Outputs
- Question: In a forward method, suppose you want to store the attention maps (the `softmax` results) for analysis. Write code that:
- Saves each head’s attention map to a Python dictionary or list.
- Ensures you aren’t inadvertently storing a full computation graph. (Hint: use `.detach()` or `with torch.no_grad()` appropriately.)
- Tests ability to debug or visualize attention while ensuring no memory leaks or graph references remain.
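A sketch of a multi-head attention layer that records its attention maps. The class name `RecordingAttention` and the dictionary layout (head index to map) are illustrative choices. The stored tensors are detached and moved to the CPU, so they hold no reference to the autograd graph or to GPU memory.

```python
import math
import torch
import torch.nn as nn

class RecordingAttention(nn.Module):
    def __init__(self, dim: int = 32, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.attention_maps = {}                  # head index -> (batch, seq_len, seq_len)

    def forward(self, x):
        b, t, d = x.shape
        head_dim = d // self.num_heads
        qkv = self.qkv(x).view(b, t, 3, self.num_heads, head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)      # each: (batch, heads, seq_len, head_dim)
        weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(head_dim), dim=-1)
        for h in range(self.num_heads):
            # detach() drops the graph reference; the forward computation is unaffected.
            self.attention_maps[h] = weights[:, h].detach().cpu()
        context = (weights @ v).transpose(1, 2).reshape(b, t, d)
        return self.out(context)

layer = RecordingAttention()
out = layer(torch.randn(2, 5, 32))
```

The dictionary is overwritten on each call; appending to a list on every forward pass would instead grow without bound during long runs.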
- Continuous Batching Concept
- Question: Sketch out code for a simple server loop that collects incoming requests (prompts), forms a batch from whichever requests arrived in the last X milliseconds or up to a max batch size, runs them all at once, and then returns the results. You don’t have to implement a full server; just show the pseudo-code for forming the batch and calling `model(batch)`.
- Tests dynamic batching approach used in real LLM-serving frameworks (vLLM / TGI style).
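A sketch of the batch-forming step using only the standard library. `collect_batch`, `serve_once`, and `model_fn` are hypothetical names; `model_fn` stands in for the `model(batch)` call and is expected to return one result per prompt.

```python
import queue
import time

def collect_batch(request_queue, max_batch_size: int, max_wait_ms: float):
    """Block for the first request, then gather more until the window or batch size is hit."""
    batch = [request_queue.get()]
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:                      # window closed with no further requests
            break
    return batch

def serve_once(request_queue, model_fn, max_batch_size: int = 8, max_wait_ms: float = 10):
    batch = collect_batch(request_queue, max_batch_size, max_wait_ms)
    results = model_fn(batch)                    # one forward pass for the whole batch
    return list(zip(batch, results))

requests = queue.Queue()
for prompt in ["a", "b", "c", "d", "e"]:
    requests.put(prompt)

first = collect_batch(requests, max_batch_size=3, max_wait_ms=10)
second = serve_once(requests, lambda prompts: [p.upper() for p in prompts], max_batch_size=3)
```

This forms batches at request granularity. Serving frameworks that schedule at the level of individual decoding steps go further by admitting new requests into a batch while others are still generating.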
- Profiling Inference
- Question: Using `torch.profiler`, show how to:
- Wrap a model’s forward pass in `torch.profiler.profile(...)`.
- Print or view the events, focusing on CPU vs. GPU time per operation.
- Identify which layer is taking the most time.
- Tests performance analysis, essential in real LLM inference to find bottlenecks.
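A sketch of profiling one forward pass. It records CPU activity always and CUDA activity only when a GPU is present; the small MLP and the `"forward_pass"` label are placeholders.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function

model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 256)).eval()
x = torch.randn(32, 256)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad(), profile(activities=activities, record_shapes=True) as prof:
    with record_function("forward_pass"):        # custom label that shows up in the table
        model(x)

# Aggregate events by operator name and list the most expensive ones first.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```

The table is per operator, not per layer. Wrapping each layer's call in its own `record_function` label is one way to attribute time to specific layers.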
- Memory Footprint & GPU Cleanup
- Question: Suppose you have run some large batches and GPU memory is near capacity. Show code that:
- Deletes any intermediate tensors you no longer need.
- Calls `torch.cuda.empty_cache()` if necessary (with disclaimers about what it does).
- Verifies GPU memory usage with something like `torch.cuda.memory_allocated()` or `nvidia-smi` checks in Python.
- Tests awareness of GPU memory management, which is critical in LLM inference serving for large contexts.
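A sketch of the cleanup steps. The function name `cleanup_and_report` is illustrative, and the snippet returns zeros on machines without CUDA so it can still be run there.

```python
import gc
import torch

def cleanup_and_report():
    """Return (allocated_bytes, reserved_bytes) after releasing unused cached memory."""
    gc.collect()                                  # drop unreachable Python objects first
    if not torch.cuda.is_available():
        return 0, 0
    # empty_cache() returns *unused* cached blocks to the driver. It cannot free
    # memory that live tensors still reference, and it does not give PyTorch
    # itself more usable memory; it mainly helps other processes and nvidia-smi readings.
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()

if torch.cuda.is_available():
    scratch = torch.empty(1024, 1024, 16, device="cuda")   # 64 MiB of float32
    del scratch                                   # remove the last reference to the tensor

allocated, reserved = cleanup_and_report()
print(f"allocated: {allocated / 2**20:.1f} MiB, reserved: {reserved / 2**20:.1f} MiB")
```

`memory_allocated()` counts bytes held by live tensors, while `memory_reserved()` counts what the caching allocator holds in total, so the second is never smaller than the first.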
How to Use These Questions
Each prompt is meant to be solved live—ideally with actual PyTorch code that you can execute or walk through. These exercises collectively cover the foundational coding concepts for building, fine-tuning, and serving LLMs in PyTorch. By practicing them, you’ll gain familiarity with everything from basic tensor operations and embedding layers to advanced tasks like caching key-value pairs for faster generation and implementing custom attention.