# Lab 004: How LLMs Work
## What You'll Learn
- What a Large Language Model (LLM) really is under the hood
- How training works: pre-training, fine-tuning, RLHF
- What tokens, context windows, and temperature mean in practice
- Why LLMs hallucinate, and how to mitigate it
- The difference between models: GPT-4o, Phi-4, Llama, Claude
## Introduction
You've probably used ChatGPT or GitHub Copilot. But what's actually happening when you type a message and get a response? Understanding the mechanics of LLMs makes you a dramatically better agent builder: you'll know why certain prompts work, why agents make mistakes, and how to design around their limitations.
## Part 1: What is a Large Language Model?
An LLM is a neural network trained to predict the next token given a sequence of tokens.
That's it. Everything else (reasoning, code generation, summarization, chat) is an emergent capability that arises from doing this at massive scale on enormous amounts of text.
### Tokens
A token is the basic unit an LLM processes. It's roughly ¾ of a word (about 4 characters).
**Why tokens matter for agents:**
- Context windows are measured in tokens, not words
- API costs are billed per token
- Long documents must be chunked to fit in the context window
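The ~4-characters-per-token rule of thumb is easy to turn into a quick budgeting helper. This is a rough heuristic sketch, not a real tokenizer; production code would use a tokenizer library matched to the model:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb."""
    return max(1, round(len(text) / 4))

# A 1,000-word document at ~5 chars per word (including spaces)
# lands in the same ballpark as the "~1,300 tokens" rule of thumb.
doc = "word " * 1000
print(estimate_tokens(doc))  # 1250
```

Estimates like this are good enough for pre-flight checks ("will this document fit the context window?"), but billing and hard limits are always computed with the model's real tokenizer.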
### The prediction loop
When you send a message, the LLM:
1. Converts your text to a sequence of token IDs
2. Passes them through billions of mathematical operations (transformer layers)
3. Outputs a probability distribution over the entire vocabulary (~100,000 tokens)
4. Samples the next token based on that distribution
5. Appends it to the sequence and repeats from step 2
The LLM doesn't "know" facts; it has learned statistical patterns from text. When it says "Paris," it's because "Paris" almost always follows that phrase in its training data.
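The loop above can be sketched with a toy stand-in for the model: here a hard-coded bigram table plays the role of the transformer, and greedy selection plays the role of sampling. This is purely illustrative; a real model produces a distribution over ~100,000 tokens at every step:

```python
# Toy next-token predictor: a bigram table stands in for the transformer.
# Each entry maps a token to a probability distribution over next tokens.
BIGRAMS = {
    "The": {"capital": 0.6, "city": 0.4},
    "capital": {"of": 0.9, "city": 0.1},
    "of": {"France": 0.7, "Spain": 0.3},
    "France": {"is": 0.95, "was": 0.05},
    "is": {"Paris": 0.8, "large": 0.2},
}

def generate(prompt: list[str], max_new_tokens: int = 5) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = BIGRAMS.get(tokens[-1])        # steps 2-3: get a distribution
        if dist is None:                      # no known continuation: stop
            break
        next_token = max(dist, key=dist.get)  # step 4: greedy "sampling"
        tokens.append(next_token)             # step 5: append and repeat
    return tokens

print(" ".join(generate(["The", "capital"])))  # The capital of France is Paris
```

The toy generates "Paris" for exactly the reason the text describes: it is the most probable continuation in its (tiny) "training data", not a checked fact.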
🤔 Check Your Understanding
An LLM correctly answers "The capital of France is Paris." Does the model know this fact the way a human does?
Answer
No. The LLM has learned statistical patterns from its training data: "Paris" almost always follows "The capital of France is" in the text it was trained on. It predicts the most likely next token, not verified facts. This is why LLMs can also confidently produce wrong answers (hallucinations).
## Part 2: Training an LLM
### Stage 1: Pre-training
The model reads trillions of tokens from the internet, books, code, and scientific papers. It learns language structure, facts, reasoning patterns, and common knowledge purely by predicting the next token.
- **Training data:** Wikipedia + books + GitHub + web pages + ...
- **Goal:** minimize prediction error across all that text
- **Result:** a "base model" that can complete text
GPT-4o, Llama 3, Phi-4 all start as base models.
Stage 2 β Instruction Fine-tuning (SFT)ΒΆ
The base model is trained on examples of conversations: (prompt, ideal response) pairs. This teaches it to be helpful, follow instructions, and respond in a conversational way.
### Stage 3: RLHF (Reinforcement Learning from Human Feedback)
Human raters compare pairs of responses and pick the better one. A reward model is trained on these preferences. The LLM is then fine-tuned to maximize the reward model's score.
This is why ChatGPT feels more polished and aligned than a raw base model.
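The reward-model step above can be made concrete with the standard pairwise preference loss (the Bradley-Terry formulation commonly used for reward models): the reward model should score the human-preferred response higher, and the loss is `-log sigmoid(r_chosen - r_rejected)`. A minimal numeric sketch, not any particular lab's training code:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model already scores the preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Reward model agrees with the human rater -> low loss
print(round(preference_loss(2.0, -1.0), 3))  # 0.049
# Reward model disagrees -> high loss, pushing the scores apart during training
print(round(preference_loss(-1.0, 2.0), 3))  # 3.049
```

Once trained on many such comparisons, the reward model scores any candidate response, and the LLM is fine-tuned to produce responses that score highly.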
🤔 Check Your Understanding
What is the purpose of RLHF (Reinforcement Learning from Human Feedback) in LLM training, and why can't pre-training alone achieve the same result?
Answer
RLHF aligns the model with human preferences, making responses more helpful, safe, and conversational. Pre-training only teaches the model to predict the next token from text patterns. Without RLHF, the model might produce technically correct but unhelpful, unsafe, or oddly formatted responses.
## Part 3: Key Parameters
### Context Window
The context window is how much text the model can "see" at once: its working memory.
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens (~96,000 words) |
| GPT-4o-mini | 128,000 tokens |
| Phi-4 | 16,000 tokens |
| Llama 3.3 70B | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
**Context window ≠ unlimited memory.**
The model reads the entire context window on every request. Longer context = slower + more expensive. Agents use RAG and summarization to manage long conversations.
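A common way agents manage this budget is to keep the system prompt and drop the oldest conversation turns until the estimated token count fits. A sketch, using the ~4-chars-per-token heuristic as a stand-in for a real tokenizer (message shapes follow the usual chat format):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic; production code would use the model's real tokenizer.
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system message; evict oldest user/assistant turns until under budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and sum(estimate_tokens(m["content"]) for m in system + turns) > budget:
        turns.pop(0)  # oldest turn goes first
    return system + turns

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "x" * 400},       # ~100 tokens
    {"role": "assistant", "content": "y" * 400},  # ~100 tokens
    {"role": "user", "content": "What's next?"},
]
trimmed = trim_to_budget(history, budget=120)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

Real agents often summarize the evicted turns instead of discarding them outright, trading a few tokens of summary for the lost detail.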
### Temperature
Temperature controls how random the output is.
```python
# Deterministic (good for structured data extraction)
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,
    messages=[...]
)

# Creative (good for ideas/drafts)
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.8,
    messages=[...]
)
```
### Top-p (nucleus sampling)
An alternative to temperature. Only sample from the smallest set of tokens whose cumulative probability exceeds top_p.
- `top_p=0.1` → very conservative
- `top_p=0.9` → allows diverse outputs
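The nucleus selection rule can be sketched directly: sort tokens by probability, keep the smallest prefix whose cumulative probability reaches `top_p`, and renormalize. An illustration of the rule, not any particular library's implementation:

```python
def nucleus(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize so the kept set sums to 1."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

dist = {"Paris": 0.5, "Lyon": 0.3, "Nice": 0.15, "Oslo": 0.05}
print(nucleus(dist, top_p=0.8))  # keeps only Paris and Lyon
print(nucleus(dist, top_p=0.1))  # keeps only the single most likely token
```

Unlike temperature, which reshapes the whole distribution, top-p simply cuts off the unlikely tail before sampling.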
🤔 Check Your Understanding
You're building an agent that generates SQL queries from natural language. Should you use a high or low temperature, and why?
Answer
Use low temperature (0.0). SQL queries need to be deterministic and syntactically correct. High temperature introduces randomness that could produce invalid or inconsistent SQL. For structured output tasks like code generation, data extraction, and SQL, always prefer temperature=0.
## Part 4: Why LLMs Hallucinate
Hallucination (generating confident-sounding false information) happens because:
- The model predicts likely text, not true text. A plausible-sounding answer can score higher than "I don't know."
- Training data has gaps and noise. If the web says something wrong often enough, the model learned it.
- No external memory. The model doesn't "check" facts; it generates from patterns.
### How agents mitigate hallucination
| Technique | How it helps |
|---|---|
| RAG | Give the model real documents to cite instead of relying on training data |
| Tool calling | Let the model call APIs/databases for real-time data |
| Low temperature | Reduce creativity when accuracy matters |
| System prompt rules | "Never invent data; only use tool outputs" |
| Structured output | Force the model to produce JSON matching a schema, which is easier to validate |
| Evaluation | Measure groundedness, coherence, and factuality automatically |
## Part 5: Choosing a Model
Not every task needs GPT-4o. Choosing the right model saves money and latency.
| Model | Best for | Speed | Cost |
|---|---|---|---|
| GPT-4o | Complex reasoning, long context, multimodal | Medium | $$$ |
| GPT-4o-mini | Most everyday tasks | Fast | $ |
| Phi-4 (Microsoft) | On-device, low cost, surprisingly capable | Very fast | Free (local) |
| Llama 3.3 70B | Open-source, self-host, large tasks | Medium | Free (self-host) |
| o1 / o3 | Math, code, deep multi-step reasoning | Slow | $$$$ |
**Start cheap, upgrade when needed.**
Begin with gpt-4o-mini or Phi-4. Only upgrade to gpt-4o or o1 if the task clearly requires it.
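The "start cheap" advice is ultimately arithmetic: cost = tokens × per-token price. A sketch with illustrative prices per 1M tokens; these numbers are assumptions for the example, so check your provider's current pricing page before budgeting:

```python
# Illustrative per-1M-token prices (hypothetical; real prices change often).
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a workload, given per-1M-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 10,000 requests at 1,000 input + 500 output tokens each:
big = estimate_cost("gpt-4o", 10_000 * 1_000, 10_000 * 500)
small = estimate_cost("gpt-4o-mini", 10_000 * 1_000, 10_000 * 500)
print(f"gpt-4o: ${big:.2f}  gpt-4o-mini: ${small:.2f}")
```

With these example prices the same workload differs by more than an order of magnitude, which is why routing easy tasks to a small model is usually the first optimization an agent builder makes.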
Part 6: The Transformer Architecture (simplified)ΒΆ
You don't need to understand the math, but knowing the key insight helps:
Self-attention is the magic. For each token, the model computes how much "attention" to pay to every other token in the context.
This is why LLMs understand context so well: every word is interpreted in relation to every other word.
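A minimal numeric sketch of (single-head, unmasked) scaled dot-product attention for one query token, over tiny hand-made vectors. Real models learn the query/key/value projections and run many heads in parallel; both are skipped here:

```python
import math

def attention(query: list[float], keys: list[list[float]],
              values: list[list[float]]) -> list[float]:
    """Scaled dot-product attention for one query token:
    softmax(q . k_i / sqrt(d)) gives each context token's attention weight,
    and the output is the weighted average of the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Hand-made token vectors: "bank" as the query; the "river" key is most
# similar, so it gets the larger attention weight and pulls the output
# toward its value vector (the riverbank sense).
out = attention(query=[1.0, 0.0],
                keys=[[0.9, 0.1], [0.1, 0.9]],     # "river", "money"
                values=[[1.0, 0.0], [0.0, 1.0]])
print([round(x, 3) for x in out])  # [0.638, 0.362]
```

The weights always sum to 1, so attention is a context-dependent blend: which tokens dominate the blend is exactly what the model learns during training.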
🤔 Check Your Understanding
In the sentence "The bank by the river was steep," how does the self-attention mechanism help the model understand that "bank" means a riverbank and not a financial institution?
Answer
Self-attention computes how much "attention" each token should pay to every other token. When processing "bank," the model attends strongly to "river"; the contextual relationship between these words shifts the interpretation toward riverbank rather than financial institution. Every word is interpreted in relation to every other word in the context.
## 🧠 Knowledge Check
Q1 (Multiple Choice): Approximately how many tokens is the sentence 'Hello world'?
- A) 1 token
- B) 2 tokens
- C) 6 tokens
- D) 10 tokens
▶ Reveal Answer
Correct: B (2 tokens)
"Hello" is 1 token and "world" is 1 token. As a rule of thumb, 1 token ≈ 4 characters ≈ ¾ of a word. A 1,000-word document is approximately 1,300 tokens. This matters for both cost (APIs charge per token) and context window limits (GPT-4o has a 128K token context window).
Q2 (Multiple Choice): You are calling an LLM for structured data extraction (e.g., parsing JSON from a customer email). Which temperature setting is most appropriate?
- A) temperature = 1.5 (high creativity)
- B) temperature = 0.8 (moderate creativity)
- C) temperature = 0.0 (deterministic)
- D) temperature = 2.0 (maximum randomness)
▶ Reveal Answer
Correct: C (temperature = 0.0)
When accuracy and reproducibility matter more than creativity, use temperature=0. This makes the model always pick the most probable next token, so the same input always produces the same output. For creative writing: use 0.7–1.0. For data extraction, SQL generation, or tool argument formatting: use 0.
Q3 (Multiple Choice): An LLM confidently states that a fictional city in Brazil has a population of 2.3 million. This city doesn't exist. What is the primary cause?
- A) The model's context window was too small
- B) The temperature was set too high
- C) The model predicts likely-sounding text rather than verified facts; it pattern-matched to similar real cities
- D) The system prompt was missing
▶ Reveal Answer
Correct: C (LLMs predict likely text, not factual text)
LLMs are trained to predict the next token that is statistically likely given the context. A made-up city that resembles real cities in pattern ("São Paulo has 12M, Rio has 6M...") leads the model to generate a plausible-sounding but fabricated answer. This is hallucination. The fix is RAG or tool calling: force the model to look up facts rather than predict them.
## Summary
| Concept | Key takeaway |
|---|---|
| Tokens | ~4 chars each; context windows and costs are measured in tokens |
| Prediction | LLMs predict the next token; reasoning is emergent, not programmed |
| Training | Pre-training → fine-tuning → RLHF produces helpful assistants |
| Temperature | 0 = deterministic; higher = more creative |
| Context window | The model's working memory; doesn't persist between requests |
| Hallucination | Caused by pattern-matching, not fact-checking; mitigated with tools + RAG |
## Next Steps
→ **Lab 005: Prompt Engineering.** Now that you know how LLMs work, learn to write prompts that reliably get the output you want.