How LLMs behave

Understand what a large language model actually does: it splits text into tokens, predicts the next token from a context window, and samples from those predictions using temperature and top-p.

What you'll be able to do

Explain how text is split into tokens and why token count matters
Describe the context window and what happens when you exceed it
Explain next-token prediction as the core mechanism behind every response
Adjust temperature and top-p to control randomness in a deliberate way
Predict how a given setting will make outputs more focused or more varied
Connect token budgets to cost and latency in a production system

A model that only predicts the next token

A large language model does one thing astonishingly well: given some text, it predicts what token comes next. Everything else — answering questions, writing code, summarising a document — is that single trick repeated thousands of times. There is no database lookup, no hidden search engine. The model holds a learned sense of “what usually follows what” and uses it one step at a time.

This matters because it reframes how you think about prompting. You are not issuing a command to a program; you are giving the model a starting context and letting it continue in the most probable way. Good prompts shape that probability.
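
The "one step at a time" loop can be sketched with a toy stand-in for the model. The probability table below is invented for illustration and conditions only on the previous token, whereas a real LLM conditions on the whole context; the loop shape (predict, pick, append, repeat) is the part that carries over.

```python
# Toy "model": for each token, the probability of each possible next token.
# These numbers are invented for illustration; a real LLM computes them
# with a neural network over the entire context, not just the last token.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def generate(prompt_tokens, max_new_tokens=10):
    """Repeatedly append the most probable next token (greedy decoding)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # the toy model has no prediction for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break  # the model predicts that the text stops here
        tokens.append(next_token)
    return tokens


print(generate(["the"]))  # ['the', 'cat', 'sat']
```

Changing the starting tokens changes which continuation is most probable, which is the sense in which a prompt "shapes the probability".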

Tokens: the unit the model actually sees

Models do not read characters or words directly. They read tokens — chunks of text mapped to integers. A token is often a whole common word, but longer or rarer words get split into word-pieces. As a rough rule for English, one token is about 3 to 4 characters, or roughly three-quarters of a word.

This is why “cat” might be one token but “unbelievability” is several, and why code or non-English text can tokenise very differently. You will count tokens often, because they drive cost and fit.
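
To make the word-piece idea concrete, here is a minimal sketch of a greedy longest-match tokeniser. The vocabulary and its integer IDs are hypothetical, and production tokenisers learn their vocabularies from data and use more elaborate algorithms, so treat this as an illustration of the splitting behaviour rather than how any particular model tokenises.

```python
# Hypothetical vocabulary: text piece -> integer ID. Real vocabularies are
# learned from data and contain many thousands of entries.
VOCAB = {"cat": 0, "un": 1, "believ": 2, "abil": 3, "ity": 4}


def tokenize(word):
    """Greedy longest-match split of one word into known pieces."""
    ids = []
    pos = 0
    while pos < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[pos:]!r}")
    return ids


print(tokenize("cat"))              # [0]
print(tokenize("unbelievability"))  # [1, 2, 3, 4]
```

The common word maps to a single integer, while the rarer word is assembled from four smaller pieces.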

The context window

The context window is the total number of tokens the model can attend to in a single request: your prompt and the response it generates, combined. Current Claude models such as Opus (claude-opus-4-8) and Sonnet (claude-sonnet-4-6) offer very large windows, but every model has a limit.

If you push past it, the model cannot see the overflow: depending on the API or application, older content gets truncated or the request is rejected. Treat the window as a budget. Long documents, long chat histories, and large tool outputs all compete for the same space.
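
Treating the window as a budget might look like the sketch below. It assumes the rough English rule of about 4 characters per token, and the helper names and the tiny window size are hypothetical; a real system would count tokens with the provider's tokeniser and use the model's documented limit.

```python
def estimate_tokens(text):
    """Rough English-only estimate: about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages, context_window, max_output_tokens):
    """Drop the oldest messages until prompt + reserved output fits."""
    budget = context_window - max_output_tokens  # space left for the prompt
    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > budget:
        kept.pop(0)  # the oldest message is dropped first
    return kept


history = ["a" * 400, "b" * 400, "c" * 400]  # about 100 tokens each
trimmed = trim_history(history, context_window=300, max_output_tokens=100)
print(len(trimmed))  # 2
```

Reserving space for the response up front matters because the prompt and the generated output share the same window.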

Next-token sampling: temperature and top-p

At each step the model produces a probability distribution over every possible next token. How you pick from that distribution decides how the text feels.

Temperature reshapes the distribution. Low temperature sharpens it toward the single most likely token, giving focused, repeatable output. High temperature flattens it, so less likely tokens get a chance and the text becomes more varied — and more surprising.
Top-p (nucleus sampling) trims the candidate pool. It keeps only the smallest set of tokens whose probabilities sum to p (say 0.9) and samples from those, discarding the long tail of implausible options.

Use low temperature for extraction, classification, and anything you need to be deterministic-ish. Use higher temperature for brainstorming or creative drafts.

# Conceptually, per request you choose one value for each setting:
focused_temperature  = 0.0   # focused, near-deterministic
creative_temperature = 1.0   # varied, creative
top_p                = 0.9   # ignore the unlikely tail
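
For readers who want to see the mechanics, the following is a minimal sketch of temperature scaling and top-p filtering over a list of raw scores (logits). The function names are hypothetical, and treating a temperature of 0 as "always pick the top token" is a common convention assumed here, not a statement about any specific provider's implementation.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature (> 0)."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose probabilities reach top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalised to sum to 1


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token index; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    pool = top_p_filter(apply_temperature(logits, temperature), top_p)
    indices = list(pool)
    return rng.choices(indices, weights=[pool[i] for i in indices])[0]


logits = [2.0, 1.0, 0.0]
print(apply_temperature(logits, 0.5))  # sharper: more mass on the first token
print(apply_temperature(logits, 2.0))  # flatter: mass spread more evenly
```

Dividing the scores by the temperature before normalising is what sharpens or flattens the distribution, and the top-p step then discards whatever falls outside the cumulative cutoff.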

A note for later: some current Claude models remove temperature and top_p entirely and steer behaviour through prompting and effort instead. The concept still explains how any LLM turns predictions into words — and that intuition transfers to every provider you will meet, including the models you can run on Amazon Bedrock or Azure.

Why this is foundational

Once you internalise “tokenise, predict, sample, repeat,” the rest of the course clicks into place. Hallucination, cost, latency, and the value of structure all follow from this one loop.

Your task

Take a short paragraph of your own writing and estimate its token count (about words divided by 0.75). Then write one sentence describing what you would expect a low-temperature versus a high-temperature continuation of that paragraph to look like.
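
If you would rather not count by hand, the rule of thumb can be expressed as a small helper. The function name and sample paragraph are made up for illustration, and the result is only an estimate; a real tokeniser will give a different number.

```python
def estimate_tokens_from_words(text):
    """Apply the rule of thumb: tokens are about words divided by 0.75."""
    word_count = len(text.split())
    return round(word_count / 0.75)


paragraph = "Tokens drive cost and fit, so you will count them often."
print(estimate_tokens_from_words(paragraph))  # 15 (11 words / 0.75)
```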

Check your understanding

6 questions, each followed by a short explanation.

Q1. What is a token, as an LLM sees it?

Models operate on tokens, which are sub-word chunks (roughly 3-4 characters of English on average), each mapped to an integer the model can process.

Q2. What does the context window define?

The context window is the total token budget the model can see in one request — prompt plus the response it generates — and exceeding it forces truncation or an error.

Q3. How does an LLM actually produce text?

Generation is autoregressive: the model repeatedly predicts a probability distribution over the next token, picks one, appends it, and repeats.

Q4. What does raising the temperature do?

Higher temperature flattens the next-token distribution, making sampling more random and outputs more varied; lower temperature sharpens it toward the most likely tokens.

Q5. In one sentence, what does top-p (nucleus) sampling do?

Answer: It keeps only the smallest set of most-likely tokens whose probabilities add up to p, then samples from that set, trimming the long tail of unlikely tokens.
Top-p caps the candidate pool by cumulative probability rather than by a fixed count, so the model never picks from the implausible tail.

Q6. Why should a DevOps engineer care about token counts in production?

Answer: Because tokens drive both cost (you pay per input and output token) and latency (longer context and longer output take more time), so token budgets are a real operational constraint.
Token count maps directly to spend and response time, which makes it a capacity-planning concern just like CPU or memory.
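
As a rough illustration of the cost side, the sketch below multiplies token counts by per-million-token prices. The prices are hypothetical placeholders, not any provider's real rates, and the function name is an assumption made for this example.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Return the estimated spend for one request, in the price's currency."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Hypothetical prices, not any provider's real rates.
cost = estimate_cost(input_tokens=2_000, output_tokens=500,
                     input_price_per_million=3.0,
                     output_price_per_million=15.0)
print(f"{cost:.4f}")  # 0.0135
```

Multiplying the per-request figure by expected request volume gives a first-pass capacity estimate, in the same way you would size CPU or memory.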
