# Lab 004: How LLMs Work
## What You'll Learn
- What a Large Language Model (LLM) really is under the hood
- How training works: pre-training, fine-tuning, RLHF
- What tokens, context windows, and temperature mean in practice
- Why LLMs hallucinate, and how to mitigate it
- The difference between models: GPT-4o, Phi-4, Llama, Claude
## Introduction
You've probably used ChatGPT or GitHub Copilot. But what's actually happening when you type a message and get a response? Understanding the mechanics of LLMs makes you a dramatically better agent builder: you'll know why certain prompts work, why agents make mistakes, and how to design around their limitations.
## Part 1: What is a Large Language Model?
An LLM is a neural network trained to predict the next token given a sequence of tokens.
That's it. Everything else (reasoning, code generation, summarization, chat) is an emergent capability that arises from doing this at massive scale on enormous amounts of text.
### Tokens
A token is the basic unit an LLM processes. It's roughly ¾ of a word (about 4 characters).
**Why tokens matter for agents:**
- Context windows are measured in tokens, not words
- API costs are billed per token
- Long documents must be chunked to fit in the context window
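The ~4-characters-per-token rule of thumb is easy to turn into a quick budgeting helper. This is a rough heuristic sketch, not a real tokenizer; production code would use a tokenizer library matched to the model:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb."""
    return max(1, round(len(text) / 4))

# A 1,000-word document at ~5 chars per word (including spaces)
# lands in the same ballpark as the "~1,300 tokens" rule of thumb.
doc = "word " * 1000
print(estimate_tokens(doc))  # 1250
```

Estimates like this are good enough for pre-flight checks ("will this document fit the context window?"), but billing and hard limits are always computed with the model's real tokenizer.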
### The prediction loop
When you send a message, the LLM:
1. Converts your text to a sequence of token IDs
2. Passes them through billions of mathematical operations (transformer layers)
3. Outputs a probability distribution over the entire vocabulary (~100,000 tokens)
4. Samples the next token based on that distribution
5. Appends it to the sequence and repeats from step 2
The LLM doesn't "know" facts; it has learned statistical patterns from text. When it says "Paris," it's because "Paris" almost always follows that phrase in its training data.
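The loop above can be sketched with a toy stand-in for the model: here a hard-coded bigram table plays the role of the transformer, and greedy selection plays the role of sampling. This is purely illustrative; a real model produces a distribution over ~100,000 tokens at every step:

```python
# Toy next-token predictor: a bigram table stands in for the transformer.
# Each entry maps a token to a probability distribution over next tokens.
BIGRAMS = {
    "The": {"capital": 0.6, "city": 0.4},
    "capital": {"of": 0.9, "city": 0.1},
    "of": {"France": 0.7, "Spain": 0.3},
    "France": {"is": 0.95, "was": 0.05},
    "is": {"Paris": 0.8, "large": 0.2},
}

def generate(prompt: list[str], max_new_tokens: int = 5) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = BIGRAMS.get(tokens[-1])        # steps 2-3: get a distribution
        if dist is None:                      # no known continuation: stop
            break
        next_token = max(dist, key=dist.get)  # step 4: greedy "sampling"
        tokens.append(next_token)             # step 5: append and repeat
    return tokens

print(" ".join(generate(["The", "capital"])))  # The capital of France is Paris
```

The toy generates "Paris" for exactly the reason the text describes: it is the most probable continuation in its (tiny) "training data", not a checked fact.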
🤔 Check Your Understanding
An LLM correctly answers "The capital of France is Paris." Does the model know this fact the way a human does?
Answer
No. The LLM has learned statistical patterns from its training data: "Paris" almost always follows "The capital of France is" in the text it was trained on. It predicts the most likely next token, not verified facts. This is why LLMs can also confidently produce wrong answers (hallucinations).
## Part 2: Training an LLM
### Stage 1: Pre-training
The model reads trillions of tokens from the internet, books, code, and scientific papers. It learns language structure, facts, reasoning patterns, and common knowledge purely by predicting the next token.
- **Training data:** Wikipedia + books + GitHub + web pages + ...
- **Goal:** minimize prediction error across all that text
- **Result:** a "base model" that can complete text
GPT-4o, Llama 3, Phi-4 all start as base models.
Stage 2 β Instruction Fine-tuning (SFT)ΒΆ
The base model is trained on examples of conversations: (prompt, ideal response) pairs. This teaches it to be helpful, follow instructions, and respond in a conversational way.
### Stage 3: RLHF (Reinforcement Learning from Human Feedback)
Human raters compare pairs of responses and pick the better one. A reward model is trained on these preferences. The LLM is then fine-tuned to maximize the reward model's score.
This is why ChatGPT feels more polished and aligned than a raw base model.
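The reward-model step above can be made concrete with the standard pairwise preference loss (the Bradley-Terry formulation commonly used for reward models): the reward model should score the human-preferred response higher, and the loss is `-log sigmoid(r_chosen - r_rejected)`. A minimal numeric sketch, not any particular lab's training code:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model already scores the preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Reward model agrees with the human rater -> low loss
print(round(preference_loss(2.0, -1.0), 3))  # 0.049
# Reward model disagrees -> high loss, pushing the scores apart during training
print(round(preference_loss(-1.0, 2.0), 3))  # 3.049
```

Once trained on many such comparisons, the reward model scores any candidate response, and the LLM is fine-tuned to produce responses that score highly.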
🤔 Check Your Understanding
What is the purpose of RLHF (Reinforcement Learning from Human Feedback) in LLM training, and why can't pre-training alone achieve the same result?
Answer
RLHF aligns the model with human preferences, making responses more helpful, safe, and conversational. Pre-training only teaches the model to predict the next token from text patterns. Without RLHF, the model might produce technically correct but unhelpful, unsafe, or oddly formatted responses.
## Part 3: Key Parameters
### Context Window
The context window is how much text the model can "see" at once: its working memory.
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens (~96,000 words) |
| GPT-4o-mini | 128,000 tokens |
| Phi-4 | 16,000 tokens |
| Llama 3.3 70B | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
**Context window ≠ unlimited memory.**
The model reads the entire context window on every request. Longer context = slower + more expensive. Agents use RAG and summarization to manage long conversations.
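A common way agents manage this budget is to keep the system prompt and drop the oldest conversation turns until the estimated token count fits. A sketch, using the ~4-chars-per-token heuristic as a stand-in for a real tokenizer (message shapes follow the usual chat format):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic; production code would use the model's real tokenizer.
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system message; evict oldest user/assistant turns until under budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and sum(estimate_tokens(m["content"]) for m in system + turns) > budget:
        turns.pop(0)  # oldest turn goes first
    return system + turns

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "x" * 400},       # ~100 tokens
    {"role": "assistant", "content": "y" * 400},  # ~100 tokens
    {"role": "user", "content": "What's next?"},
]
trimmed = trim_to_budget(history, budget=120)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

Real agents often summarize the evicted turns instead of discarding them outright, trading a few tokens of summary for the lost detail.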
### Temperature
Temperature controls how random the output is.
```python
# Deterministic (good for structured data extraction)
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,
    messages=[...]
)

# Creative (good for ideas/drafts)
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.8,
    messages=[...]
)
```
### Top-p (nucleus sampling)
An alternative to temperature. Only sample from the smallest set of tokens whose cumulative probability exceeds top_p.
- `top_p=0.1` → very conservative
- `top_p=0.9` → allows diverse outputs
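The nucleus selection rule can be sketched directly: sort tokens by probability, keep the smallest prefix whose cumulative probability reaches `top_p`, and renormalize. An illustration of the rule, not any particular library's implementation:

```python
def nucleus(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize so the kept set sums to 1."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

dist = {"Paris": 0.5, "Lyon": 0.3, "Nice": 0.15, "Oslo": 0.05}
print(nucleus(dist, top_p=0.8))  # keeps only Paris and Lyon
print(nucleus(dist, top_p=0.1))  # keeps only the single most likely token
```

Unlike temperature, which reshapes the whole distribution, top-p simply cuts off the unlikely tail before sampling.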
🤔 Check Your Understanding
You're building an agent that generates SQL queries from natural language. Should you use a high or low temperature, and why?
Answer
Use low temperature (0.0). SQL queries need to be deterministic and syntactically correct. High temperature introduces randomness that could produce invalid or inconsistent SQL. For structured output tasks like code generation, data extraction, and SQL, always prefer temperature=0.
## Part 4: Why LLMs Hallucinate
Hallucination (generating confident-sounding false information) happens because:
- The model predicts likely text, not true text. A plausible-sounding answer can score higher than "I don't know."
- Training data has gaps and noise. If the web says something wrong often enough, the model learned it.
- No external memory. The model doesn't "check" facts; it generates from patterns.
### How agents mitigate hallucination
| Technique | How it helps |
|---|---|
| RAG | Give the model real documents to cite instead of relying on training data |
| Tool calling | Let the model call APIs/databases for real-time data |
| Low temperature | Reduce creativity when accuracy matters |
| System prompt rules | "Never invent data; only use tool outputs" |
| Structured output | Force the model to produce JSON matching a schema, which is easier to validate |
| Evaluation | Measure groundedness, coherence, and factuality automatically |
## Part 5: Choosing a Model
Not every task needs GPT-4o. Choosing the right model saves money and latency.
| Model | Best for | Speed | Cost |
|---|---|---|---|
| GPT-4o | Complex reasoning, long context, multimodal | Medium | $$$ |
| GPT-4o-mini | Most everyday tasks | Fast | $ |
| Phi-4 (Microsoft) | On-device, low cost, surprisingly capable | Very fast | Free (local) |
| Llama 3.3 70B | Open-source, self-host, large tasks | Medium | Free (self-host) |
| o1 / o3 | Math, code, deep multi-step reasoning | Slow | $$$$ |
**Start cheap, upgrade when needed.**
Begin with gpt-4o-mini or Phi-4. Only upgrade to gpt-4o or o1 if the task clearly requires it.
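The "start cheap" advice is ultimately arithmetic: cost = tokens × per-token price. A sketch with illustrative prices per 1M tokens; these numbers are assumptions for the example, so check your provider's current pricing page before budgeting:

```python
# Illustrative per-1M-token prices (hypothetical; real prices change often).
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a workload, given per-1M-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 10,000 requests at 1,000 input + 500 output tokens each:
big = estimate_cost("gpt-4o", 10_000 * 1_000, 10_000 * 500)
small = estimate_cost("gpt-4o-mini", 10_000 * 1_000, 10_000 * 500)
print(f"gpt-4o: ${big:.2f}  gpt-4o-mini: ${small:.2f}")
```

With these example prices the same workload differs by more than an order of magnitude, which is why routing easy tasks to a small model is usually the first optimization an agent builder makes.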
Part 6: The Transformer Architecture (simplified)ΒΆ
You don't need to understand the math, but knowing the key insight helps:
Self-attention is the magic. For each token, the model computes how much "attention" to pay to every other token in the context.
This is why LLMs understand context so well: every word is interpreted in relation to every other word.
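A minimal numeric sketch of (single-head, unmasked) scaled dot-product attention for one query token, over tiny hand-made vectors. Real models learn the query/key/value projections and run many heads in parallel; both are skipped here:

```python
import math

def attention(query: list[float], keys: list[list[float]],
              values: list[list[float]]) -> list[float]:
    """Scaled dot-product attention for one query token:
    softmax(q . k_i / sqrt(d)) gives each context token's attention weight,
    and the output is the weighted average of the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Hand-made token vectors: "bank" as the query; the "river" key is most
# similar, so it gets the larger attention weight and pulls the output
# toward its value vector (the riverbank sense).
out = attention(query=[1.0, 0.0],
                keys=[[0.9, 0.1], [0.1, 0.9]],     # "river", "money"
                values=[[1.0, 0.0], [0.0, 1.0]])
print([round(x, 3) for x in out])  # [0.638, 0.362]
```

The weights always sum to 1, so attention is a context-dependent blend: which tokens dominate the blend is exactly what the model learns during training.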
🤔 Check Your Understanding
In the sentence "The bank by the river was steep," how does the self-attention mechanism help the model understand that "bank" means a riverbank and not a financial institution?
Answer
Self-attention computes how much "attention" each token should pay to every other token. When processing "bank," the model attends strongly to "river"; the contextual relationship between these words shifts the interpretation toward riverbank rather than financial institution. Every word is interpreted in relation to every other word in the context.
## 🧠 Knowledge Check
Q1 (Multiple Choice): Approximately how many tokens is the sentence 'Hello world'?
- A) 1 token
- B) 2 tokens
- C) 6 tokens
- D) 10 tokens
▶ Reveal Answer
Correct: B (2 tokens)
"Hello" is 1 token and "world" is 1 token. As a rule of thumb, 1 token ≈ 4 characters ≈ ¾ of a word. A 1,000-word document is approximately 1,300 tokens. This matters for both cost (APIs charge per token) and context window limits (GPT-4o has a 128K token context window).
Q2 (Multiple Choice): You are calling an LLM for structured data extraction (e.g., parsing JSON from a customer email). Which temperature setting is most appropriate?
- A) temperature = 1.5 (high creativity)
- B) temperature = 0.8 (moderate creativity)
- C) temperature = 0.0 (deterministic)
- D) temperature = 2.0 (maximum randomness)
▶ Reveal Answer
Correct: C (temperature = 0.0)
When accuracy and reproducibility matter more than creativity, use temperature=0. This makes the model always pick the most probable next token, so the same input always produces the same output. For creative writing: use 0.7–1.0. For data extraction, SQL generation, or tool argument formatting: use 0.
Q3 (Multiple Choice): An LLM confidently states that a fictional city in Brazil has a population of 2.3 million. This city doesn't exist. What is the primary cause?
- A) The model's context window was too small
- B) The temperature was set too high
- C) The model predicts likely-sounding text rather than verified facts; it pattern-matched to similar real cities
- D) The system prompt was missing
▶ Reveal Answer
Correct: C (LLMs predict likely text, not factual text)
LLMs are trained to predict the next token that is statistically likely given the context. A made-up city that resembles real cities in pattern ("São Paulo has 12M, Rio has 6M...") leads the model to generate a plausible-sounding but fabricated answer. This is hallucination. The fix is RAG or tool calling: force the model to look up facts rather than predict them.
## Summary
| Concept | Key takeaway |
|---|---|
| Tokens | ~4 chars each; context windows and costs are measured in tokens |
| Prediction | LLMs predict the next token; reasoning is emergent, not programmed |
| Training | Pre-training → fine-tuning → RLHF produces helpful assistants |
| Temperature | 0 = deterministic; higher = more creative |
| Context window | The model's working memory; doesn't persist between requests |
| Hallucination | Caused by pattern-matching, not fact-checking; mitigated with tools + RAG |
## Next Steps
→ **Lab 005: Prompt Engineering.** Now that you know how LLMs work, learn to write prompts that reliably get the output you want.