LLM-Specific Questions

Below are 25 basic (yet thorough) coding-focused questions that test fundamental PyTorch skills relevant to building and running LLMs. They range from creating and manipulating tensors, to implementing small transformer components, to applying sampling methods. Each question should prompt you to write working code (in a live environment or whiteboard style), ensuring you can demonstrate good coding practices in PyTorch for LLM use cases.

Tensor Creation & Manipulation
- Question: Write a small function to:
  1. Create a 2D PyTorch tensor (e.g. shape [3,4]) of random floats.
  1. Print its shape and data type.
  1. Move it to a GPU if available.
  1. Reshape it to [2,6].
- Tests ability to create, reshape, and manage devices for tensors.
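
One possible answer is sketched below. The function name `make_and_reshape` is hypothetical, and the sketch falls back to the CPU when no GPU is present.

```python
import torch

def make_and_reshape() -> torch.Tensor:
    """Create a random [3, 4] float tensor, report it, move it, and reshape it."""
    x = torch.rand(3, 4)                      # uniform floats in [0, 1), float32 by default
    print(x.shape, x.dtype)
    # Pick the GPU only if one is actually available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = x.to(device)
    return x.reshape(2, 6)                    # valid because 3 * 4 == 2 * 6

out = make_and_reshape()
print(out.shape, out.device)
```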

Embedding Lookup
- Question: Suppose you have a vocabulary size of 10,000 and an embedding dimension of 768. Create an nn.Embedding for this vocabulary. Then:
  1. Generate a batch of token indices (e.g., shape [batch_size=4, seq_len=5]).
  1. Pass these indices through the embedding to get the corresponding embeddings.
- Tests creation and usage of embedding layers, along with batch dimension handling.

Forward Pass Through a Simple Network
- Question: Define a small nn.Module that includes:
  1. An nn.Embedding layer.
  1. A single nn.Linear layer mapping from embedding dimension to a “hidden” dimension of your choice.
  1. A forward method that takes token indices, embeds them, and produces a final output tensor.
- Tests understanding of custom modules, forward methods, and dimension handling.
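
A minimal sketch of such a module follows. The class name `TinyEncoder` and the hidden size of 256 are illustrative choices; the vocabulary size and embedding dimension are borrowed from the embedding-lookup question.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Embedding followed by one linear projection (hypothetical example module)."""

    def __init__(self, vocab_size: int = 10_000, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, hidden_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) of int64 indices
        x = self.embed(token_ids)       # (batch, seq_len, embed_dim)
        return self.proj(x)             # (batch, seq_len, hidden_dim)

token_ids = torch.randint(0, 10_000, (4, 5))   # batch_size=4, seq_len=5
out = TinyEncoder()(token_ids)
print(out.shape)  # torch.Size([4, 5, 256])
```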

Positional Encoding
- Question: Write a function that takes input embeddings (batch, seq_len, embedding_dim) and adds sinusoidal positional encodings of shape (seq_len, embedding_dim) to them. Show how you would:
  1. Generate the sinusoidal encodings (using sin and cos).
  1. Broadcast-add them to the batch of embeddings.
- Tests how to handle shape broadcasting and incorporate positional information for LLMs.
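
The sketch below is one way to do this. It assumes an even `embedding_dim` and uses the base of 10000 from the original Transformer formulation; the function names are hypothetical.

```python
import math
import torch

def sinusoidal_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Return a (seq_len, dim) table of sin/cos encodings; assumes dim is even."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    # One frequency per pair of channels: 1 / 10000^(2i / dim)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )                                                                    # (dim / 2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even channels
    pe[:, 1::2] = torch.cos(position * div_term)   # odd channels
    return pe

def add_positional_encoding(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, dim). The (seq_len, dim) table broadcasts over batch."""
    _, seq_len, dim = x.shape
    pe = sinusoidal_encoding(seq_len, dim).to(device=x.device, dtype=x.dtype)
    return x + pe
```

Because the table has no batch dimension, PyTorch aligns it with the trailing two dimensions of `x` and repeats it across the batch without copying.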

Basic Autoregressive Decoding Loop (Greedy)
- Question: Assume you have a function model.forward(input_ids) that returns logits over your vocabulary. Write a greedy decoding loop that:
  1. Starts with a prompt (list of token IDs).
  1. Iteratively obtains the next token by taking argmax of the logits at each step.
  1. Continues until you reach a special <EOS> token or a maximum length.
- Tests ability to implement the simplest decode strategy by coding a loop with a model call each iteration.
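
A minimal sketch of the loop is shown below. It assumes `model` is any callable that maps a `(1, cur_len)` tensor of token IDs to logits of shape `(1, cur_len, vocab)`; the function name and the `eos_id` and `max_len` parameters are illustrative.

```python
import torch

@torch.no_grad()
def greedy_decode(model, prompt_ids, eos_id: int, max_len: int = 20):
    """Greedy decoding; max_len counts the prompt tokens too."""
    ids = torch.tensor([prompt_ids], dtype=torch.long)        # (1, prompt_len)
    while ids.size(1) < max_len:
        logits = model(ids)                                    # assumed (1, cur_len, vocab)
        next_id = logits[0, -1].argmax().item()                # most likely next token
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:
            break
    return ids[0].tolist()
```

This version re-runs the model on the full sequence at every step, which is the simplest form; a KV cache removes that redundancy.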

Top-k Sampling Decoding
- Question: Modify the above decoding loop to implement top-k sampling instead of greedy:
  1. Use torch.topk on the logits to keep only the top-k tokens.
  1. Sample from the resulting distribution using torch.multinomial.
- Tests handling probability distributions and random sampling in PyTorch for more varied text generation.
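
One step of top-k sampling might look like the sketch below, which would replace the argmax inside a decoding loop. The function name `sample_top_k` is hypothetical, and the logits are assumed to have shape `(batch, vocab)`.

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int) -> torch.Tensor:
    """logits: (batch, vocab). Returns sampled token ids of shape (batch,)."""
    top_vals, top_idx = torch.topk(logits, k, dim=-1)     # both (batch, k)
    probs = torch.softmax(top_vals, dim=-1)               # renormalize over the k survivors
    choice = torch.multinomial(probs, num_samples=1)      # (batch, 1), index into the top-k
    return top_idx.gather(-1, choice).squeeze(-1)         # map back to vocabulary ids
```

Note that `torch.multinomial` returns positions within the truncated distribution, so the `gather` back into `top_idx` is needed to recover real token IDs.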

Nucleus (Top-p) Sampling
- Question: Write a function to implement top-p (nucleus) sampling in one step of decoding:
  1. Sort token probabilities by descending order.
  1. Keep the smallest set of top tokens whose cumulative probability reaches p.
  1. Sample from the truncated distribution.
- Tests dynamic selection of a token set based on cumulative probability.
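
The sketch below shows one way to write the step. It assumes the input is already a probability distribution of shape `(batch, vocab)`; the function name `sample_top_p` is hypothetical.

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    """probs: (batch, vocab) with rows summing to 1. Returns token ids (batch,)."""
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token when the mass *before* it already reaches p,
    # so the token that crosses the threshold is kept and the top token always survives.
    remove = (cumulative - sorted_probs) >= p
    sorted_probs = sorted_probs.masked_fill(remove, 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)   # index into the sorted order
    return sorted_idx.gather(-1, choice).squeeze(-1)          # map back to vocabulary ids
```

Unlike top-k, the number of surviving tokens varies per row: a peaked distribution keeps very few tokens, a flat one keeps many.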

Mini “Attention” Mechanism
- Question: Implement a simple scaled dot-product attention from scratch. Given Q, K, V of shape (batch, seq_len, dim), compute:
  $\text{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{Q K^\top}{\sqrt{d}}\Big) V$
- Tests basic matrix multiplications, shape alignment, and understanding of attention in LLMs.
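
A direct translation of the formula, as a minimal single-head sketch with no mask and no dropout (the function name is illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """q, k, v: (batch, seq_len, dim). Returns (output, attention weights)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)           # each row sums to 1 over the keys
    return weights @ v, weights                       # output: (batch, seq_len, dim)
```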

Layer Normalization
- Question: Write your own PyTorch module that implements LayerNorm manually (i.e., do not use nn.LayerNorm). Show how you’d:
  1. Compute mean and variance across the last dimension.
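  1. Subtract the mean, divide by the standard deviation (the square root of the variance plus a small epsilon for numerical stability), multiply by a learnable gamma, and add a learnable beta.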
- Tests knowledge of normalization steps and custom parameter usage.
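
A minimal sketch follows. The class name `ManualLayerNorm` and the default `eps` of 1e-5 are illustrative choices; the biased variance is used so that the result lines up with `nn.LayerNorm`.

```python
import torch
import torch.nn as nn

class ManualLayerNorm(nn.Module):
    """Layer normalization over the last dimension, written out by hand."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))    # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))    # learnable shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased (population) variance
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

A quick sanity check is to compare the output against `nn.LayerNorm(dim)` on random input with `torch.allclose`.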

Masking in Attention

Question: Extend your scaled dot-product attention code to support an attention mask (e.g., a boolean mask of shape (batch, seq_len, seq_len)). Any position where the mask is False should have its score set to a very large negative value (such as -1e9) before the softmax, so that it receives effectively zero attention weight.

Tests the typical approach for ignoring future tokens or padded tokens in attention calculation.
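
A minimal sketch of masked attention is shown below, with a causal mask as the usage example. The function name `masked_attention` is hypothetical; the mask convention (True means "may attend") follows the question.

```python
import math
import torch

def masked_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, dim); mask: bool, broadcastable to (batch, seq_len, seq_len)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # Positions where mask is False get a large negative score, so softmax sends them to ~0
        scores = scores.masked_fill(~mask, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

seq_len = 4
# Lower-triangular mask: token i may attend to tokens 0..i only
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool)).unsqueeze(0)
q = k = v = torch.randn(2, seq_len, 8)
out, weights = masked_attention(q, k, v, causal)
print(weights[0])   # entries above the diagonal are zero
```

Using `float("-inf")` instead of -1e9 is also common, but it yields NaNs for any row that is masked everywhere, which can happen with padding masks.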

KV Caching

Question: Show how to implement a basic “past key-value” cache for an autoregressive model. Suppose your model returns (logits, new_k, new_v), and you want to store (k, v) from all previous timesteps to avoid recomputing them. Demonstrate how you’d:
1. Initialize empty lists or tensors for the cache.
1. Append new_k, new_v at each time step.
1. Pass the entire cached (k, v) to the attention mechanism.

Tests your understanding of how LLMs speed up inference by caching previous computations.
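
The attention side of this can be sketched as below. The `KVCache` class and `attend_with_cache` helper are hypothetical names, and the sketch covers a single head with tensors of shape `(batch, steps, dim)`; how the model produces `new_k` and `new_v` is left out.

```python
import torch

class KVCache:
    """Grows along the sequence dimension as tokens are generated."""

    def __init__(self):
        self.k = None   # (batch, cached_len, dim) once populated
        self.v = None

    def append(self, new_k: torch.Tensor, new_v: torch.Tensor):
        if self.k is None:
            self.k, self.v = new_k, new_v
        else:
            self.k = torch.cat([self.k, new_k], dim=1)
            self.v = torch.cat([self.v, new_v], dim=1)
        return self.k, self.v

def attend_with_cache(q_new, cache, new_k, new_v):
    """q_new, new_k, new_v: (batch, 1, dim) for the newest token only."""
    k, v = cache.append(new_k, new_v)
    scores = q_new @ k.transpose(-2, -1) / (q_new.size(-1) ** 0.5)   # (batch, 1, cached_len)
    return torch.softmax(scores, dim=-1) @ v                          # (batch, 1, dim)
```

No causal mask is needed here because the newest query only ever sees keys from the current and earlier steps. Repeated `torch.cat` reallocates on every step; preallocating a buffer of the maximum length is a common refinement.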

Dynamic Padding / Batching

Question: Suppose you have a list of sequences (lists of token IDs) of varying lengths. Write a function collate_fn(batch) that:
1. Finds the longest sequence in the batch.
1. Pads all sequences to that length.
1. Stacks them into a single PyTorch tensor (batch_size, max_len).

Tests handling variable-length inputs in an LLM setting.
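
A minimal sketch is shown below. The `pad_id` parameter and the returned boolean mask are additions not required by the question; the mask is included because attention usually needs to know which positions are padding.

```python
import torch

def collate_fn(batch, pad_id: int = 0):
    """batch: list of lists of token ids. Returns (padded ids, bool mask of real tokens)."""
    max_len = max(len(seq) for seq in batch)
    padded = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros(len(batch), max_len, dtype=torch.bool)
    for i, seq in enumerate(batch):
        padded[i, :len(seq)] = torch.tensor(seq, dtype=torch.long)
        mask[i, :len(seq)] = True      # True marks real (non-padding) positions
    return padded, mask

ids, mask = collate_fn([[1, 2, 3], [4], [5, 6]])
print(ids)
# tensor([[1, 2, 3],
#         [4, 0, 0],
#         [5, 6, 0]])
```

`torch.nn.utils.rnn.pad_sequence` with `batch_first=True` does the padding step in one call if the sequences are already tensors.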

Mixed Precision Inference

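Question: Write a code snippet using torch.autocast(device_type="cuda", dtype=torch.float16) (the older torch.cuda.amp.autocast() spelling is deprecated in recent PyTorch releases) that performs a forward pass in half precision: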
1. Create a small model (or load an existing one).
1. Run the forward pass in the amp context manager.
1. Print the dtype of the output to confirm it’s half precision.

Tests knowledge of half precision usage for faster GPU-based LLM inference.

Basic TorchScript Scripting

Question: Given a small PyTorch module for text classification, show how to script it using torch.jit.script(model). Then do a forward pass with a sample input on the scripted module.

Tests ability to compile a model with TorchScript for potential inference optimizations.

Beam Search

Question: Implement beam search for an LLM. At each step:
1. Keep track of the top beam_size partial sequences.
1. Expand each partial sequence by possible next tokens.
1. Prune down to the top beam_size expansions based on cumulative log probability.

Tests more advanced decoding strategy relevant to many LLM tasks.
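
A compact, unbatched sketch of the three steps follows. It assumes `model` maps a `(1, cur_len)` tensor of token IDs to logits of shape `(1, cur_len, vocab)` and that `beam_size` does not exceed the vocabulary size; all names are illustrative. It applies no length normalization, so shorter finished hypotheses are slightly favored.

```python
import torch

@torch.no_grad()
def beam_search(model, prompt_ids, beam_size: int, eos_id: int, max_len: int = 20):
    """Returns (best token list, its cumulative log-probability)."""
    beams = [(list(prompt_ids), 0.0)]     # live hypotheses: (tokens, cumulative log-prob)
    finished = []                         # hypotheses that ended with eos_id
    while beams and len(beams[0][0]) < max_len:
        candidates = []
        for tokens, score in beams:
            logits = model(torch.tensor([tokens]))[0, -1]        # (vocab,)
            log_probs = torch.log_softmax(logits, dim=-1)
            # Only the best beam_size continuations of a beam can survive pruning
            top_lp, top_id = torch.topk(log_probs, beam_size)
            for lp, tid in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((tokens + [tid], score + lp))
        # Prune to the top beam_size expansions by cumulative log-probability
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
    finished.extend(beams)                # hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])
```

Production implementations batch all beams into one model call per step and usually divide the score by a length penalty before the final comparison.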

Temperature Scaling

Question: Write a decode function that, at each step, applies a temperature factor τ (tau in code) to the logits, logits = logits / tau, and then samples from the resulting softmax distribution. Demonstrate that setting tau < 1.0 makes generation more deterministic (less random), and setting tau > 1.0 yields more varied text.

Tests ability to handle sampling diversity through temperature scaling.
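
The effect can be shown on a single step, as in the sketch below. The function name `sample_with_temperature` is hypothetical, and the three-token logits are made up for illustration.

```python
import torch

def sample_with_temperature(logits: torch.Tensor, tau: float = 1.0):
    """Returns (sampled token id, the probabilities it was drawn from)."""
    probs = torch.softmax(logits / tau, dim=-1)    # tau < 1 sharpens, tau > 1 flattens
    token = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return token, probs

logits = torch.tensor([2.0, 1.0, 0.0])
for tau in (0.5, 1.0, 2.0):
    _, probs = sample_with_temperature(logits, tau)
    # The probability of the top token shrinks as tau grows
    print(f"tau={tau}: top-token probability = {probs.max().item():.3f}")
```

As tau approaches 0 the distribution collapses onto the argmax, so very low temperatures behave like greedy decoding.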

Check Model Parameter Counts

Question: Write code to:
1. Count the total parameters in a model (sum of p.numel() for all parameters).
1. Separate them into trainable vs. non-trainable parameters.
1. Print the result in a user-friendly format (e.g., “Total parameters: X, Trainable: Y, Frozen: Z”).

Tests basic parameter inspection and that you can confirm correct freezing or updating of certain layers in an LLM scenario.
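
A short sketch, using a made-up two-layer model with its first layer frozen so that all three numbers are non-trivial:

```python
import torch.nn as nn

def count_parameters(model: nn.Module):
    """Returns (total, trainable, frozen) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable, total - trainable

model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 5))
for p in model[0].parameters():       # freeze the first layer for the demonstration
    p.requires_grad = False

total, trainable, frozen = count_parameters(model)
print(f"Total parameters: {total:,}, Trainable: {trainable:,}, Frozen: {frozen:,}")
# Total parameters: 325, Trainable: 105, Frozen: 220
```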

Implement a Simple GPT-like Block

Question: Build an nn.Module named GPTBlock that contains:
1. A self-attention sublayer (multi-head or single-head).
1. A feed-forward sublayer.
1. Residual connections & layer norms.
1. A forward method that expects (x, mask=None).

Tests modular design of LLM building blocks, reflecting GPT-like architecture basics.

Training Loop vs. Inference Loop

Question: Write code that sets up a training loop for a toy language model (e.g., next-token prediction on random data). Then create a separate function or block for inference that:
1. Disables gradient tracking (no_grad).
1. Uses the model in eval mode.
1. Performs next-token prediction.

Tests clarity about switching between training and inference in real code, which is crucial for LLM usage.

Use Hugging Face Transformers for a Basic LLM

Question: Demonstrate loading a small GPT-2 model ('gpt2') via AutoModelForCausalLM.from_pretrained('gpt2'). Then manually:
1. Tokenize an input prompt with AutoTokenizer.
1. Convert the tokenizer outputs to tensors.
1. Run the model in a loop to generate tokens (greedy or top-k).

Tests direct coding with HF Transformers library for an LLM scenario, ensuring familiarity with model & tokenizer usage.

Manual Gradient-Freezing

Question: Suppose you only want to fine-tune the last two layers of a model. Write code to:
1. Freeze all model parameters by default.
1. Unfreeze only the final two layers.
1. Confirm that only those last two layers’ parameters have requires_grad=True.

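Tests knowledge of partial fine-tuning, which is common when adapting LLMs. Parameter-efficient methods such as LoRA rely on the same freeze-by-default pattern, training only small added adapter weights.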
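
A minimal sketch on a stand-in model is shown below. The four-layer `nn.Sequential` is hypothetical; with a real LLM the "final two layers" would be picked from the model's own layer list rather than from `children()`.

```python
import torch.nn as nn

# Stand-in for a deeper model: four identical linear layers
model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])

# 1. Freeze everything by default
for p in model.parameters():
    p.requires_grad = False

# 2. Unfreeze only the final two layers
for layer in list(model.children())[-2:]:
    for p in layer.parameters():
        p.requires_grad = True

# 3. Confirm which parameters will receive gradients
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['2.weight', '2.bias', '3.weight', '3.bias']
```

When building the optimizer, passing only the parameters with `requires_grad=True` avoids allocating optimizer state for the frozen weights.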

Storing Intermediate Outputs

Question: In a forward method, suppose you want to store the attention maps (the softmax results) for analysis. Write code that:
1. Saves each head’s attention map to a Python dictionary or list.
1. Ensures you aren’t inadvertently storing a full computation graph. (Hint: use .detach() or with torch.no_grad() appropriately.)

Tests ability to debug or visualize attention while ensuring no memory leaks or graph references remain.

Continuous Batching Concept

Question: Sketch out code for a simple server loop that collects incoming requests (prompts), forms a batch from whichever requests arrived in the last X milliseconds or up to a max batch size, runs them all at once, and then returns the results. You don’t have to implement a full server—just show the pseudo-code for forming the batch and calling model(batch).

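Tests the dynamic batching approach (a time window plus a maximum batch size) that real LLM-serving frameworks such as vLLM and TGI build on. Those frameworks go further and admit or retire requests at individual decoding steps, which is what "continuous batching" usually refers to.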

Profiling Inference

Question: Using torch.profiler, show how to:
1. Wrap a model’s forward pass in torch.profiler.profile(...).
1. Print or view the events, focusing on CPU vs. GPU time per operation.
1. Identify which layer is taking the most time.

Tests performance analysis, essential in real LLM inference to find bottlenecks.

Memory Footprint & GPU Cleanup

Question: Suppose you have run some large batches and GPU memory is near capacity. Show code that:
1. Deletes any intermediate tensors you no longer need.
1. Calls torch.cuda.empty_cache() if necessary (with disclaimers about what it does).
1. Verifies GPU memory usage with something like torch.cuda.memory_allocated() or nvidia-smi checks in Python.

Tests awareness of GPU memory management, which is critical in LLM inference serving for large contexts.

How to Use These Questions

Each prompt is meant to be solved live—ideally with actual PyTorch code that you can execute or walk through. These exercises collectively cover the foundational coding concepts for building, fine-tuning, and serving LLMs in PyTorch. By practicing them, you’ll gain familiarity with everything from basic tensor operations and embedding layers to advanced tasks like caching key-value pairs for faster generation and implementing custom attention.