A token, in the context of LLMs, is the basic unit of input and output processing. Instead of working directly with sentences and words as humans do, LLMs break text down into tokens, which are then converted to numerical form for computation. A token is typically a word, part of a word or a punctuation mark, as determined by the LLM's tokeniser, which in turn depends on the model's vocabulary and segmentation rules. Examples might be:

|Text|Tokens|
|---|---|
|"apple"|`["apple"]`|
|"applesauce"|`["apple", "sauce"]`|
|"unbelievable"|`["un", "believ", "able"]`|
|"I'm going."|`["I", "'m", " going", "."]`|

When processing a sentence, the LLM tokenises the text, converts each token into a vector, and passes the sequence of vectors through its layers for transformation and prediction. Crucially, the model predicts the next token, not the next word or sentence.

The maximum number of tokens that an LLM can handle at once is called the context window, and it can vary dramatically between models. The window covers user input, model output and any intermediate context. Once the limit is reached, earlier tokens are dropped from memory.

## ChatGPT on tokens

*I asked ChatGPT (model 4o) about its use of tokens. This was its reply:*

As GPT-4o (OpenAI's GPT-4 "omni" model), I operate based on a **token-based architecture**. Here is an up-to-date technical overview of how I handle tokens, what my token limits are, and what happens when those limits are reached.

---

## 1. **What are Tokens?**

Tokens are units of text – they are not the same as words or characters, but somewhere in between.

- A token might be:
    - A word (`"apple"` → 1 token)
    - A piece of a word (`"unhappiness"` → `"un"`, `"happiness"` → 2–3 tokens)
    - Punctuation (`"."` or `","` → 1 token)
- In non-English text or code, token boundaries can behave differently depending on the encoding scheme.

For English text:

- **1 token ≈ 0.75 words**
- **1,000 tokens ≈ 750 words**, depending on vocabulary and structure

You can think of tokens as the atomic units my model processes internally to generate responses or evaluate input.

## 2. **Context Window (Token Limit)**

My context window, the total number of tokens I can hold at once across **both input and output**, is currently:

- **GPT-4o: 128,000 tokens**

This is one of the largest available in any commercial LLM. It allows for processing entire books, long documents, or extended conversations without truncating earlier context.

However, typical front-end interfaces (like ChatGPT's) may restrict this to a smaller subset (e.g. **32,000 tokens**) depending on user tier, interface, or latency constraints. You're using ChatGPT Plus, which may cap usable context lower than the full 128k.

## 3. **What Happens When the Token Limit Is Reached?**

### Input-Side Overflow

- If the prompt (your message + prior conversation + system message) **exceeds the context window**, I must **drop tokens from the start** (oldest messages first).
- This is known as **context truncation**, and it may result in:
    - Loss of memory of earlier messages
    - A breakdown in coherence or reference to prior material
    - Repetition or contradictions

For this reason, critical information should be **restated or summarised** periodically in long conversations.

### Output-Side Overflow

- If the combined input and intended output would **exceed the window**, I truncate the output.
- I will stop mid-sentence if needed, or fail to complete a list or explanation.
- You’ll often see me finish early with a partial thought or sentence when this happens.
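*A minimal sketch (mine, not part of ChatGPT's reply) of the oldest-first truncation described above, using the rough 1 token ≈ 0.75 words estimate from section 1. The window size, headroom figure and helper names (`estimate_tokens`, `truncate_history`) are illustrative, not ChatGPT's actual implementation:*

```python
# Minimal sketch of input-side context truncation (oldest messages
# dropped first). The token counter is the rough 1 token ~ 0.75 words
# heuristic; a real system would use the model's actual tokeniser and
# would normally pin the system prompt rather than drop it.

CONTEXT_WINDOW = 128_000      # GPT-4o's advertised window
RESERVED_FOR_OUTPUT = 4_000   # illustrative headroom for the reply

def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text (~0.75 words per token)."""
    return max(1, round(len(text.split()) / 0.75))

def truncate_history(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages whose combined estimate fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break                        # this and all older messages are dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = ["(system prompt)", "(old message)", "(newer message)", "(newest message)"]
print(truncate_history(history, CONTEXT_WINDOW - RESERVED_FOR_OUTPUT))
```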
## 4. **How Token Handling Affects Functionality**

- **Memory** (i.e. retaining information over a session) depends on the token budget: if a conversation is long, I might “forget” earlier content unless reminded.
- **Code execution, document summarisation, or multi-document analysis** benefits significantly from the large 128k limit.
- **Image and file uploads** are also tokenised (e.g. OCR'd or embedded text in images contributes to the token count).

## 5. **Optimising for Token Efficiency**

To stay within the budget and maintain context, users can:

- Use concise phrasing
- Avoid repeating long quoted passages
- Split tasks into smaller parts
- Ask me to summarise and condense earlier text

A rough way to measure token counts directly is sketched below.

See: [[More on how ChatGPT processes tokens]]
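*To make point 5 actionable, here is a rough sketch (mine, not ChatGPT's) that counts tokens with OpenAI's tiktoken library. It assumes a tiktoken release recent enough to know GPT-4o's `o200k_base` encoding, and the splits it prints may differ from the illustrative table at the top of this note:*

```python
# Sketch: measure a text's token cost with OpenAI's tiktoken library
# (pip install tiktoken). Assumes a tiktoken release recent enough to
# map "gpt-4o" to its o200k_base encoding; older releases raise KeyError.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

for text in ("apple", "applesauce", "unbelievable", "I'm going."):
    ids = enc.encode(text)
    # Decode each token id on its own to see where the boundaries fall.
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```

Counting before sending makes the advice above concrete: if a draft prompt comes out near the cap, condense it or split the task.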