AI & Language Models
Tokens, temperature, context windows, and how LLMs actually work. Autocomplete on steroids.
What an LLM Actually Is: Autocomplete on Steroids
A Large Language Model (LLM) is a program that predicts the next word. Nothing more. You give it some text, and it figures out what word is most likely to come next, then the word after that, then the word after that, until it's written a full response.
It's not thinking. It's not conscious. It's not "understanding" your question the way you understand a question. It's incredibly sophisticated pattern matching. During training, the model read billions of pages of text (books, websites, code, conversations) and learned the statistical patterns of how words follow other words.
Think of it like the autocomplete on your phone, but scaled up by a factor of a million. Your phone's autocomplete predicts one word based on the previous few. An LLM predicts entire paragraphs based on everything you've written, with an understanding of grammar, logic, style, and context that makes it feel like a conversation.
The "Large" in Large Language Model refers to scale: these models have hundreds of billions of parameters (the learned patterns). That scale is what makes them eerily good at predicting what comes next.
When Claude writes a follow-up email for a sales team, it's not "thinking about the deal." It's predicting what a helpful, well-written follow-up email would look like, word by word, based on the context you gave it (the prospect's name, the meeting notes, the deal stage). The result looks intelligent because the patterns it learned are incredibly rich.
Tokens: How LLMs Read Text
LLMs don't read words the way you do. They break text into tokens, which are small pieces of words. The word "unhappiness" becomes three tokens: "un" + "happi" + "ness." Common words like "the" or "is" are single tokens. Longer or rarer words get split into pieces.
Why does this matter to you? Two reasons:
1. You pay per token. Every API call to Claude or GPT is billed by how many tokens you send (your prompt) plus how many tokens come back (the response). More tokens = more cost. This is one reason good prompts are concise: you're literally paying for every word.
2. There's a limit. Each model has a maximum number of tokens it can handle in a single conversation. Go over the limit and the model can't process your request. This limit is called the context window.
A rough rule of thumb: 1 token is about 3/4 of a word. So 1,000 tokens is roughly 750 words, about a page and a half of single-spaced text.
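That rule of thumb is easy to sketch in code. A minimal estimator, assuming the 3/4-words-per-token ratio above (real tokenizers, such as the ones Anthropic and OpenAI use, give exact counts; this is only the back-of-envelope math):

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~3/4-words-per-token rule of thumb."""
    words = len(text.split())
    return round(words / 0.75)  # 1 token is roughly 3/4 of a word

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 9 words -> 12
```

The same ratio works in reverse: a 200,000-token window holds roughly 150,000 words.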
[Interactive demo: type or edit text to see it split into token chips, the pieces the model actually reads. Real tokenizers split sub-words (e.g. "unhappiness" = "un" + "happi" + "ness"); the demo splits on word boundaries to illustrate the idea.]
Context Window: The Size of the Desk
The context window is the total amount of text the model can "see" at once, meaning your prompt plus its response, combined. Think of it like a desk. A bigger desk lets you spread out more documents and reference them all while working. A small desk means you can only look at a few pages at a time.
Claude's context window is enormous: 200,000 tokens. That's roughly 150,000 words, enough for an entire novel. You can paste in dozens of documents, meeting transcripts, and email threads, and Claude can reference all of it while generating a response.
Smaller, cheaper models have smaller windows. Some older models could only handle 4,000 tokens (about 3,000 words, just a few pages). This is why model choice matters: if you need to analyze a long document, you need a model with a window big enough to hold it.
Imagine a daily briefing service that pulls in CRM data, calendar events, and email history for each team member. All of that context gets sent to the model in one prompt. With a 200K token window, there's plenty of room. With a 4K window, you'd have to leave most of the data out, and the briefing would be worse.
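The desk metaphor translates into a simple pre-flight check in code. A sketch, using made-up model names and the rough word-based token estimate from earlier (real APIs expose exact token counts):

```python
# Hypothetical window sizes for illustration, keyed by made-up model names.
CONTEXT_WINDOWS = {"big-model": 200_000, "small-model": 4_000}

def estimate_tokens(text: str) -> int:
    """Rough token count: 1 token is about 3/4 of a word."""
    return round(len(text.split()) / 0.75)

def fits(model: str, prompt: str, reserve_for_response: int = 1_000) -> bool:
    """Will the prompt plus a reserved response budget fit in the window?"""
    return estimate_tokens(prompt) + reserve_for_response <= CONTEXT_WINDOWS[model]

# ~5,000 words of CRM data, calendar events, and email history.
briefing = " ".join(["data"] * 5_000)

print(fits("big-model", briefing))    # True: plenty of room in a 200K window
print(fits("small-model", briefing))  # False: most of the data would have to go
```

Note that the response counts against the window too, which is why the sketch reserves a budget for it rather than filling the window with prompt alone.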
Temperature: The Creativity Dial
Temperature controls how random the model's word choices are. It's a dial, typically ranging from 0.0 to 1.0 (some APIs allow higher values).
At temperature 0.0, the model always picks the most statistically likely next word. The output is nearly deterministic, so you'll get the same answer almost every time. This is great for factual tasks: classification, data extraction, structured analysis. You want consistency, not creativity.
At temperature 1.0, the model is willing to pick less likely words, introducing variety and surprise. The output becomes more creative and unpredictable. Good for brainstorming, creative writing, or generating varied options.
Think of it like a musician. Temperature 0.0 is sight-reading sheet music: note-perfect, identical every time. Temperature 1.0 is jazz improvisation, different every performance, sometimes brilliant, sometimes off.
In practice, teams use low temperature (0.0-0.2) for classification tasks like categorizing support tickets or extracting structured data. Higher temperature (0.3-0.5) for writing tasks, where you want the output to feel natural and not robotic. Never 1.0, though. That's too unpredictable for business use.
Temperature is one of the most common settings you'll see when people talk about "tuning" an AI model. It's not retraining the model. It's just adjusting how adventurous it is when picking words. A tiny dial with a huge effect on output quality.
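Under the hood, temperature rescales the model's scores for candidate next words before one is sampled. A toy sketch with made-up scores (not real model outputs), showing how 0.0 collapses onto the single most likely word while 1.0 spreads probability across alternatives:

```python
import math

def probs_with_temperature(logits, temperature):
    """Convert raw scores to next-word probabilities at a given temperature."""
    if temperature == 0:  # greedy: all probability on the top-scoring word
        out = [0.0] * len(logits)
        out[logits.index(max(logits))] = 1.0
        return out
    scaled = [score / temperature for score in logits]  # lower temp sharpens scores
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

words = ["follow", "listen", "persist", "improvise"]
logits = [3.0, 2.0, 1.0, 0.5]  # made-up scores for candidate next words

for t in (0.0, 0.2, 1.0):
    probs = probs_with_temperature(logits, t)
    print(t, [f"{w}: {p:.2f}" for w, p in zip(words, probs)])
```

At 0.0 the top word gets all the probability; at 0.2 it gets nearly all of it; at 1.0 the runner-up words have a real chance of being picked, which is where the variety comes from.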
Choosing a Model: Match the Brain to the Job
Not all models are created equal. They vary in intelligence, speed, and cost, and you pick based on the task.
Fast and cheap models (like Gemini Flash or Claude Haiku) are great for simple tasks. Classify this email as "interested" or "not interested." Extract a name from a paragraph. Summarize a short text. These models respond in milliseconds and cost fractions of a cent per call. They're the interns: fast, cheap, good enough for routine work.
Smart and expensive models (like Claude Opus or GPT-4) are for complex reasoning. Write a nuanced follow-up email that references three previous conversations. Analyze a 50-page contract. Generate a strategic briefing from messy data. These models are slower and cost more per call, but they produce significantly better output on hard tasks. They're the senior consultants: slower, pricier, worth it when the stakes are high.
The key insight: match the model to the task. Using Opus to classify emails is like hiring a lawyer to sort your mail. Using Haiku to write a board presentation is like asking the intern to give the keynote. Both waste resources.
A typical team might use Gemini Flash for classification (is this a meeting request or just noise?) and Claude Sonnet for writing tasks (compose a personalized follow-up). Flash handles thousands of classifications per day at minimal cost. Sonnet handles dozens of writing tasks per day where quality matters. Different tools for different jobs.
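In code, that routing decision is often just a lookup table. A sketch, with placeholder model names and task labels (not real API identifiers):

```python
# Route each task type to the cheapest model that handles it well.
# Model names here are placeholders for Haiku/Flash-class and Sonnet/Opus-class models.
MODEL_FOR_TASK = {
    "classify": "fast-cheap-model",
    "extract":  "fast-cheap-model",
    "write":    "smart-expensive-model",
    "analyze":  "smart-expensive-model",
}

def pick_model(task_type: str) -> str:
    # Unknown task types fall back to the stronger model: safer, just pricier.
    return MODEL_FOR_TASK.get(task_type, "smart-expensive-model")

print(pick_model("classify"))  # fast-cheap-model
print(pick_model("write"))     # smart-expensive-model
```

The fallback direction is a judgment call: defaulting to the strong model trades cost for safety, which is usually the right trade when the task type is unfamiliar.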
Structured Output: Making AI Talk to Code
Here's how AI actually fits into software. When you chat with Claude in a browser, you get free-form text (paragraphs, bullet points, whatever). But when code calls an AI model, it often needs the response in a specific format.
Structured output means you tell the model: "Don't give me a paragraph. Give me JSON with these exact fields." And the model complies.
Prompt: "Classify this email and extract the key info."
Unstructured response:
"This email is from John asking about pricing. He seems
interested and wants to schedule a call next week."
Structured response:
{
  "sender": "John",
  "intent": "pricing_inquiry",
  "urgency": "medium",
  "action": "schedule_call",
  "timeline": "next_week"
}
The structured version is what code can actually use. A function can read intent, check if it's "pricing_inquiry", and route it to the right workflow. The unstructured version is useful for humans; the structured version is useful for systems.
This is the bridge between AI and automation. The model reasons about the text (the hard part), then packages its answer in a format that code can parse and act on (the structured part). Every AI pipeline works this way: model in, structured data out, code takes over from there.
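A minimal sketch of that handoff in Python, with the model's JSON response hard-coded to stand in for an actual API call (the field names match the example above):

```python
import json

# In a real pipeline this string would come back from the model API.
response_text = """{
  "sender": "John",
  "intent": "pricing_inquiry",
  "urgency": "medium",
  "action": "schedule_call",
  "timeline": "next_week"
}"""

data = json.loads(response_text)

# Route on the structured field, the part free-form prose can't give you.
if data["intent"] == "pricing_inquiry":
    print(f"Routing {data['sender']} to sales, action: {data['action']}")
```

The model did the hard reasoning (reading the email, deciding the intent); the code just branches on a field. That division of labor is the whole point of structured output.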
Further Reading
Concepts from this lesson:
- What are Large Language Models? (Google AI). Google's overview of how LLMs work
- Prompt Engineering Overview (Anthropic). How to write effective prompts for Claude
- What are tokens? (OpenAI). Clear explanation of tokenization and counting