
Lab 004: How LLMs Work

Level: L50 | Path: All paths | Time: ~20 min | 💰 Cost: Free (no account needed)

What You'll Learn

  • What a Large Language Model (LLM) really is under the hood
  • How training works: pre-training, fine-tuning, RLHF
  • What tokens, context windows, and temperature mean in practice
  • Why LLMs hallucinate, and how to mitigate it
  • The difference between models: GPT-4o, Phi-4, Llama, Claude

Introduction

You've probably used ChatGPT or GitHub Copilot. But what's actually happening when you type a message and get a response? Understanding the mechanics of LLMs makes you a dramatically better agent builder: you'll know why certain prompts work, why agents make mistakes, and how to design around their limitations.


Part 1: What is a Large Language Model?

An LLM is a neural network trained to predict the next token given a sequence of tokens.

That's it. Everything else (reasoning, code generation, summarization, chat) is an emergent capability that arises from doing this at massive scale on enormous amounts of text.

Tokens

A token is the basic unit an LLM processes. It's roughly ¾ of a word (about 4 characters).

[Figure: Tokenization]

Why tokens matter for agents

  • Context windows are measured in tokens, not words
  • API costs are billed per token
  • Long documents must be chunked to fit in the context window
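The rule of thumb above can be turned into a quick budgeting helper. This is only an approximation for planning purposes; a real tokenizer (such as OpenAI's tiktoken) gives exact counts, and the price used below is a placeholder, not a real rate.

```python
# Rough token estimator using the ~4 characters-per-token rule of thumb.
# Only an approximation: real tokenizers give exact counts.

def estimate_tokens(text: str) -> int:
    """Estimate token count as len(text) / 4, rounded up (minimum 1)."""
    return max(1, -(-len(text) // 4))  # ceiling division

def estimate_cost(text: str, usd_per_1k_tokens: float) -> float:
    """Estimate API cost for the input text (price is a placeholder)."""
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens

prompt = "Summarize the following customer email in one sentence."
print(estimate_tokens(prompt))  # 14 (the string is 55 characters)
```

Estimates like this are enough to decide whether a document fits a context window or needs chunking.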

The prediction loop

When you send a message, the LLM:

  1. Converts your text to a sequence of token IDs
  2. Passes them through billions of mathematical operations (transformer layers)
  3. Outputs a probability distribution over the entire vocabulary (~100,000 tokens)
  4. Samples the next token based on that distribution
  5. Appends it to the sequence and repeats from step 2
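The loop above can be sketched with a toy hand-written bigram table standing in for the transformer. The vocabulary and probabilities below are invented for illustration; a real LLM computes the distribution with billions of parameters.

```python
import random

# Toy next-token table: maps the last token to a probability
# distribution over possible continuations (invented numbers).
NEXT_TOKEN_PROBS = {
    "The":     {"capital": 0.6, "cat": 0.4},
    "capital": {"of": 1.0},
    "of":      {"France": 0.7, "Spain": 0.3},
    "France":  {"is": 1.0},
    "is":      {"Paris": 0.9, "Lyon": 0.1},
}

def generate(prompt_tokens, max_new_tokens=5, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = NEXT_TOKEN_PROBS.get(tokens[-1])
        if dist is None:          # no known continuation: stop
            break
        # Step 4: sample the next token from the distribution,
        # then (step 5) append it and loop again.
        choices, weights = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return tokens

print(generate(["The", "capital"]))
```

Note that with a fixed seed the output is reproducible, which mirrors why temperature=0 (always pick the most likely token) gives deterministic results.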

[Figure: LLM Prediction Loop]

The LLM doesn't "know" facts; it has learned statistical patterns from text. When it completes "The capital of France is" with "Paris," that's because "Paris" almost always follows that phrase in its training data.

🤔 Check Your Understanding

An LLM correctly answers "The capital of France is Paris." Does the model know this fact the way a human does?

Answer

No. The LLM has learned statistical patterns from its training data: "Paris" almost always follows "The capital of France is" in the text it was trained on. It predicts the most likely next token, not verified facts. This is why LLMs can also confidently produce wrong answers (hallucinations).


Part 2: Training an LLM

Stage 1: Pre-training

The model reads trillions of tokens from the internet, books, code, and scientific papers. It learns language structure, facts, reasoning patterns, and common knowledge purely by predicting the next token.

Training data: Wikipedia + books + GitHub + web pages + ...
Goal: minimize prediction error across all that text
Result: a "base model" that can complete text

GPT-4o, Llama 3, and Phi-4 all start as base models.

Stage 2: Instruction Fine-tuning (SFT)

The base model is trained on example conversations: (prompt, ideal response) pairs. This teaches it to be helpful, follow instructions, and respond in a conversational way.

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

Human raters compare pairs of responses and pick the better one. A reward model is trained on these preferences. The LLM is then fine-tuned to maximize the reward model's score.

This is why ChatGPT feels more polished and aligned than a raw base model.
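The heart of the reward-model step can be illustrated with the pairwise preference loss -log(sigmoid(r_chosen - r_rejected)): minimizing it pushes the reward model to score the human-preferred response higher. This is a simplified sketch; real RLHF then optimizes the LLM against the trained reward model (e.g. with PPO), which is omitted here.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: low when the reward model ranks the
    human-preferred ("chosen") response above the "rejected" one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# The loss shrinks as the reward model agrees with the human raters:
print(preference_loss(2.0, 0.0))  # small: chosen response scored higher
print(preference_loss(0.0, 2.0))  # large: ranking disagrees with humans
```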

[Figure: LLM Training Pipeline]

🤔 Check Your Understanding

What is the purpose of RLHF (Reinforcement Learning from Human Feedback) in LLM training, and why can't pre-training alone achieve the same result?

Answer

RLHF aligns the model with human preferences, making responses more helpful, safe, and conversational. Pre-training only teaches the model to predict the next token from text patterns. Without RLHF, the model might produce technically correct but unhelpful, unsafe, or oddly formatted responses.


Part 3: Key Parameters

Context Window

The context window is how much text the model can "see" at once β€” its working memory.

Typical context windows by model:

  • GPT-4o: 128,000 tokens (~96,000 words)
  • GPT-4o-mini: 128,000 tokens
  • Phi-4: 16,000 tokens
  • Llama 3.3 70B: 128,000 tokens
  • Claude 3.5 Sonnet: 200,000 tokens

[Figure: Context Window Comparison]

Context window ≠ unlimited memory

The model reads the entire context window on every request. Longer context = slower + more expensive. Agents use RAG and summarization to manage long conversations.
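A minimal chunker using the ~4-characters-per-token approximation might look like the sketch below. Production systems usually split on sentence or paragraph boundaries and count tokens exactly; this version just cuts on character offsets.

```python
# Split a long document so each piece fits a token budget, using the
# ~4 characters-per-token approximation (not an exact tokenizer).

CHARS_PER_TOKEN = 4

def chunk_text(text: str, max_tokens: int) -> list:
    """Cut text into pieces of at most max_tokens (estimated)."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

doc = "word " * 10_000                 # 50,000 characters, ~12,500 tokens
chunks = chunk_text(doc, max_tokens=4000)
print(len(chunks))                     # 4 chunks of at most 16,000 chars
```

Each chunk can then be sent in its own request, or summarized and stitched together.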

Temperature

Temperature controls how random the output is.

[Figure: Temperature Comparison]

# Assumes the OpenAI Python SDK (pip install openai) and an
# OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Deterministic (good for structured data extraction)
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,
    messages=[...]
)

# Creative (good for ideas/drafts)
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.8,
    messages=[...]
)
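Under the hood, temperature divides the model's logits before the softmax: low temperature sharpens the distribution toward the top token, high temperature flattens it. The logits below are made up for illustration (note that temperature=0 can't be plugged into this formula directly; APIs treat it as "always pick the most probable token").

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]                   # e.g. "Paris", "Lyon", "Nice"
print(softmax_with_temperature(logits, 0.2))  # near one-hot: top token wins
print(softmax_with_temperature(logits, 2.0))  # flatter: more randomness
```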

Top-p (nucleus sampling)

An alternative to temperature. Only sample from the smallest set of tokens whose cumulative probability exceeds top_p.

  • top_p=0.1 → very conservative
  • top_p=0.9 → allows diverse outputs
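The selection step can be sketched as follows; the token probabilities are invented for illustration.

```python
# Nucleus (top-p) selection: keep the smallest set of tokens whose
# cumulative probability reaches top_p, then renormalize.

def nucleus(probs, top_p):
    """Return the renormalized top-p subset of a token distribution."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

probs = {"Paris": 0.70, "Lyon": 0.15, "Nice": 0.10, "Berlin": 0.05}
print(nucleus(probs, 0.1))        # {'Paris': 1.0}: only the top token
print(list(nucleus(probs, 0.9)))  # ['Paris', 'Lyon', 'Nice']
```

The model then samples from this trimmed distribution instead of the full vocabulary.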

🤔 Check Your Understanding

You're building an agent that generates SQL queries from natural language. Should you use a high or low temperature, and why?

Answer

Use low temperature (0.0). SQL queries need to be deterministic and syntactically correct. High temperature introduces randomness that could produce invalid or inconsistent SQL. For structured output tasks like code generation, data extraction, and SQL, always prefer temperature=0.


Part 4: Why LLMs Hallucinate

[Figure: Hallucination Causes and Solutions]

Hallucination (generating confident-sounding false information) happens because:

  1. The model predicts likely text, not true text. A plausible-sounding answer can score higher than "I don't know."
  2. Training data has gaps and noise. If the web says something wrong often enough, the model learned it.
  3. No external memory. The model doesn't "check" facts; it generates from patterns.

How agents mitigate hallucination

  • RAG: give the model real documents to cite instead of relying on training data
  • Tool calling: let the model call APIs/databases for real-time data
  • Low temperature: reduce creativity when accuracy matters
  • System prompt rules: "Never invent data; only use tool outputs"
  • Structured output: force the model to produce JSON matching a schema, which is easier to validate
  • Evaluation: measure groundedness, coherence, and factuality automatically
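As one concrete example of structured output, here is a sketch of validating a model's JSON reply before acting on it. The schema and the example replies are invented for illustration.

```python
import json

# Validate a model reply against a minimal schema before trusting it.
# (Invented schema: a city lookup that must return name + population.)
REQUIRED_FIELDS = {"city": str, "population": int}

def validate_reply(raw):
    """Return the parsed object if it matches the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data

print(validate_reply('{"city": "Paris", "population": 2100000}'))  # passes
print(validate_reply('{"city": "Paris", "population": "lots"}'))   # None
```

A failed validation can trigger a retry or a fallback instead of letting a hallucinated value flow downstream.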

Part 5: Choosing a Model

Not every task needs GPT-4o. Choosing the right model saves money and latency.

  • GPT-4o: complex reasoning, long context, multimodal. Speed: medium. Cost: $$$
  • GPT-4o-mini: most everyday tasks. Speed: fast. Cost: $
  • Phi-4 (Microsoft): on-device, low cost, surprisingly capable. Speed: very fast. Cost: free (local)
  • Llama 3.3 70B: open-source, self-host, large tasks. Speed: medium. Cost: free (self-host)
  • o1 / o3: math, code, deep multi-step reasoning. Speed: slow. Cost: $$$$

Start cheap, upgrade when needed

Begin with gpt-4o-mini or Phi-4. Only upgrade to gpt-4o or o1 if the task clearly requires it.
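One way to encode that advice is a small task-to-model router: default to the cheap model and upgrade only for task classes that need it. The task labels and the mapping here are hypothetical, not part of any SDK.

```python
# Hypothetical "start cheap, upgrade when needed" router.
# Task labels and the mapping are assumptions for this sketch.
MODEL_FOR_TASK = {
    "extraction": "gpt-4o-mini",   # everyday structured tasks
    "chat":       "gpt-4o-mini",
    "multimodal": "gpt-4o",        # images, long context
    "math":       "o1",            # deep multi-step reasoning
}

def pick_model(task):
    """Return the cheapest model believed adequate for the task."""
    return MODEL_FOR_TASK.get(task, "gpt-4o-mini")  # default to cheap

print(pick_model("extraction"))  # gpt-4o-mini
print(pick_model("math"))        # o1
```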


Part 6: The Transformer Architecture (simplified)

You don't need to understand the math, but knowing the key insight helps:

Self-attention is the magic. For each token, the model computes how much "attention" to pay to every other token in the context.

[Figure: Self-Attention Mechanism]

This is why LLMs understand context so well β€” every word is interpreted in relation to every other word.
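The mechanism can be sketched as scaled dot-product attention over tiny made-up vectors: scores are q·k / sqrt(d), the softmax of the scores gives the attention weights, and the output is the weighted sum of value vectors. Real transformers do this with learned projections over hundreds of dimensions; the 2-dimensional vectors here are purely illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over a small context."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

# A key similar to the query gets the larger weight, pulling the output
# toward that token's value vector (toy numbers).
query  = [1.0, 0.0]                  # current token
keys   = [[1.0, 0.0], [0.9, 0.1]]    # context tokens
values = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attend(query, keys, values)
print(weights)   # first weight is slightly larger
```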

🤔 Check Your Understanding

In the sentence "The bank by the river was steep," how does the self-attention mechanism help the model understand that "bank" means a riverbank and not a financial institution?

Answer

Self-attention computes how much "attention" each token should pay to every other token. When processing "bank," the model attends strongly to "river"; this contextual relationship shifts the interpretation toward riverbank rather than financial institution. Every word is interpreted in relation to every other word in the context.


🧠 Knowledge Check

Q1 (Multiple Choice): Approximately how many tokens is the sentence 'Hello world'?
  • A) 1 token
  • B) 2 tokens
  • C) 6 tokens
  • D) 10 tokens
✅ Reveal Answer

Correct: B (2 tokens)

"Hello" is 1 token and "world" is 1 token. As a rule of thumb, 1 token ≈ 4 characters ≈ ¾ of a word. A 1,000-word document is approximately 1,300 tokens. This matters for both cost (APIs charge per token) and context window limits (GPT-4o has a 128K token context window).

Q2 (Multiple Choice): You are calling an LLM for structured data extraction (e.g., parsing JSON from a customer email). Which temperature setting is most appropriate?
  • A) temperature = 1.5 (high creativity)
  • B) temperature = 0.8 (moderate creativity)
  • C) temperature = 0.0 (deterministic)
  • D) temperature = 2.0 (maximum randomness)
✅ Reveal Answer

Correct: C (temperature = 0.0)

When accuracy and reproducibility matter more than creativity, use temperature=0. This makes the model always pick the most probable next token, so the same input always produces the same output. For creative writing, use 0.7–1.0. For data extraction, SQL generation, or tool argument formatting, use 0.

Q3 (Multiple Choice): An LLM confidently states that a fictional city in Brazil has a population of 2.3 million. This city doesn't exist. What is the primary cause?
  • A) The model's context window was too small
  • B) The temperature was set too high
  • C) The model predicts likely-sounding text rather than verified facts β€” it pattern-matched to similar real cities
  • D) The system prompt was missing
✅ Reveal Answer

Correct: C (LLMs predict likely text, not factual text)

LLMs are trained to predict the next token that is statistically likely given the context. A made-up city that resembles real cities in pattern ("São Paulo has 12M, Rio has 6M...") leads the model to generate a plausible-sounding but fabricated answer. This is hallucination. The fix is RAG or tool calling: force the model to look up facts rather than predict them.


Summary

  • Tokens: ~4 chars each; context windows and costs are measured in tokens
  • Prediction: LLMs predict the next token; reasoning is emergent, not programmed
  • Training: pre-training → fine-tuning → RLHF produces helpful assistants
  • Temperature: 0 = deterministic; higher = more creative
  • Context window: the model's working memory; doesn't persist between requests
  • Hallucination: caused by pattern-matching, not fact-checking; mitigated with tools + RAG

Next Steps

→ Lab 005: Prompt Engineering. Now that you know how LLMs work, learn to write prompts that reliably get the output you want.