Module 3 · Day 1 · AnyCompany Support Workshop

How AI Reads Your Text

From characters to tokens to numbers — the first step in every LLM pipeline. Interactive explainer with GS Support examples (incl. SEA-language transcripts).

📖 Day 1 Reference ⚡ Interactive 💰 GS Support Context

🤔 The Problem: AI Can't Read Words

AI models like Claude, GPT, and LLaMA don't understand text the way we do. They work with numbers, not letters. Before any AI can process your message, your text needs to be broken into small pieces called tokens and converted to numbers.

Think of it like this: you're sending a message to someone who only speaks in numbered codes. You need a translation system — that's what a tokenizer does.

✂️

Why Not Just Use Whole Words?

You could split text by spaces, but then the AI would need a separate number for every word it's ever seen — including misspellings, slang, and every language on Earth. That's millions of entries. We want a smarter, smaller vocabulary.

🧩

BPE: Learning the Building Blocks

Byte Pair Encoding starts with individual characters and repeatedly merges the most common pair into a new token. Over many merges, it discovers useful pieces — common syllables, word roots, suffixes.

Why This Matters

Tokenization is the very first step in every LLM pipeline. It determines how the model "sees" your text, affects how many tokens fit in the context window, and directly impacts your cost.

💰

You Pay Per Token

Every API call is billed by token count — both input and output. Understanding tokenization helps you estimate costs and optimize prompts. Numbers and special characters often cost more tokens than you'd expect.

📏 Token Granularity: How Big Should Each Piece Be?

There are four levels at which you can split text. Each has trade-offs:

Level | How It Splits | Example: "chargebacks" | Trade-off
Word | Each word = 1 token | chargebacks | Simple but huge vocabulary, can't handle unknown words
Subword | Meaningful pieces | charge + backs | Best balance: what all modern models use
Character | Each character = 1 token | c h a r g e ... | Tiny vocabulary but very long sequences
Byte | Raw byte encoding | 99 104 97 114 ... | Handles any language but extremely long
💡
The sweet spot is subword tokenization — it captures root meanings ("charge", "chargeback", "chargebacks" share a root) while keeping vocabulary manageable. This is what Claude, GPT, and LLaMA all use.
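
As a rough illustration of the four levels in the table, the sketch below splits the same string with plain Python. The word, character, and byte splits are computed; the subword split is hard-coded to mirror the table, because a real subword split needs a learned vocabulary. All variable names are made up for this example.

```python
text = "chargebacks"

# Word level: one token per whitespace-separated word
word_tokens = text.split()

# Character level: one token per character
char_tokens = list(text)

# Byte level: one token per UTF-8 byte value
byte_tokens = list(text.encode("utf-8"))

# Subword level: hard-coded to match the table (an assumption, not computed here)
subword_tokens = ["charge", "backs"]

for name, tokens in [
    ("word", word_tokens),
    ("subword", subword_tokens),
    ("character", char_tokens),
    ("byte", byte_tokens),
]:
    print(f"{name:10} {len(tokens):2} tokens  {tokens}")
```

The coarser the split, the fewer tokens per text but the larger the vocabulary needed to cover it; the finer the split, the reverse.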

🔗 Where Tokenization Fits in the LLM Pipeline

📝 Your Text → ✂️ Tokenizer (BPE) → 🔢 Token IDs → 📊 Embeddings → 🧠 Transformer → 💬 Output

Tokenization is Step 1 — everything downstream depends on how text is split into tokens.

⚙️ BPE: Byte Pair Encoding — Step by Step

BPE is the most widely used tokenization algorithm. It's used by GPT, LLaMA, Claude, and most modern LLMs. The idea is beautifully simple:

1️⃣

Start with Characters

Split the entire training corpus into individual characters. This is your starting vocabulary.

2️⃣

Count Adjacent Pairs

Find which two adjacent tokens appear together most often across the corpus.

3️⃣

Merge the Top Pair

Combine that pair into a single new token and add it to the vocabulary.

4️⃣

Repeat

Keep merging until the vocabulary reaches the desired size or no pair appears more than once.

📖 Worked Example: The Classic BPE Demo

Let's walk through BPE on the classic textbook example. Watch how useful pieces like the suffix "est" and the word "low" emerge naturally:

Input Text
low lower newest widest

Step 0: Initial character split

l o w _ l o w e r _ n e w e s t _ w i d e s t _

Each character becomes its own token. The _ marks end-of-word boundaries.

Step 1: Most frequent pair is e + s → es

Appears in "newest" and "widest": 2 occurrences. Several other pairs also appear twice in this tiny corpus; the demo breaks the tie in favour of e + s. Merge them.

Step 2: Most frequent pair is es + t → est

The suffix "est" emerges! Common in English superlatives.

Step 3: Most frequent pair is l + o → lo

Appears in "low" and "lower": 2 occurrences.

Step 4: Most frequent pair is lo + w → low

The word "low" becomes a single token! It's frequent enough to earn its own entry.

Step 5: Most frequent pair is est + _ → est_

The end-of-word suffix "est_" is now a single token.

🔍
Notice what happened: BPE discovered that "low" and "est" are meaningful building blocks — without any knowledge of English! It learned this purely from frequency patterns. "newest" = "n" + "e" + "w" + "est_" and "widest" = "w" + "i" + "d" + "est_".
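
The training loop behind this walkthrough can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: `train_bpe` and `merge_pair` are made-up names, pairs are counted within words only, and ties are broken by "first pair seen wins". That tie-break is an assumption, so the merges come out in a different order than the walkthrough, but the same five merges are learned.

```python
def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with the joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, max_merges=50):
    """Learn merge rules from `text`; returns (merges, tokenized words)."""
    # Start with characters, plus "_" as the end-of-word marker
    words = [list(word) + ["_"] for word in text.split()]
    merges = []
    for _ in range(max_merges):
        # Count adjacent pairs inside each word
        counts = {}
        for tokens in words:
            for pair in zip(tokens, tokens[1:]):
                counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        # Pick the most frequent pair (assumption: first pair seen wins ties)
        best = max(counts, key=counts.get)
        if counts[best] < 2:
            break  # no pair appears more than once
        merges.append(best)
        words = [merge_pair(tokens, best) for tokens in words]
    return merges, words


merges, words = train_bpe("low lower newest widest")
print(merges)
print(words)
print(sum(len(tokens) for tokens in words), "tokens")
```

Running it on the demo corpus stops after five merges, because every remaining pair appears only once.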

🔄 Training vs Inference: Two Different Phases

A common confusion: the BPE playground above shows how the tokenizer is trained — discovering merge rules from a corpus. But when you actually use Claude, the tokenizer works differently:

🏗️

Phase 1: Training the Tokenizer

Happens once, before the LLM is trained. BPE scans a massive corpus (Wikipedia, books, code) and discovers ~50,000-100,000 merge rules. These rules are saved as a file.

Output: merges.txt + vocab.json
This is what our playground simulates ☝️

Phase 2: Using at Inference

Happens every time you send a prompt. Your text is split into characters, then the pre-learned merge rules are applied in order. No new rules are discovered — it's a fast lookup.

Input: your prompt
Output: token IDs → embeddings → transformer

Inference example — what actually happens when you prompt Claude
"Summarise this IRT case"
→ split into characters → apply 100K pre-learned merge rules in order
→ ["Sum", "marise", " this", " IRT", " case"] → 5 tokens
🔍
Key difference: Our playground discovers 5 merge rules from a tiny corpus and produces 14 tokens. A real tokenizer like Claude's has ~100,000 merge rules learned from billions of words — so common words like "case" and "refund" are already single tokens, while less common terms like "MIWI" or "ARC" may split into multiple tokens. The merge rules are fixed after training and never change at inference.
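
Inference with fixed rules can be sketched as follows, using the five merge rules from the worked example in their learned order. `MERGES`, `apply_merges`, and `encode` are illustrative names, and a real tokenizer would also map each token to an integer ID and use a far larger rule set.

```python
# The five merge rules from the worked example, in the order they were learned
MERGES = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w"), ("est", "_")]


def apply_merges(word, merges=MERGES):
    """Tokenize one word by replaying pre-learned merges in order."""
    tokens = list(word) + ["_"]  # characters plus end-of-word marker
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def encode(text, merges=MERGES):
    """Tokenize whitespace-separated text; no new rules are learned here."""
    return [token for word in text.split() for token in apply_merges(word, merges)]


print(encode("low lower newest widest"))
print(encode("lowest"))  # a word that never appeared in the training corpus
```

The second call shows why subword pieces are useful: "lowest" was never seen during training, yet it is covered by the existing "low" and "est_" tokens.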

🎧 BPE on GS Case Text

Let's see how BPE handles text you'd actually encounter in a D365 case description:

AnyCompany GS Case Data
"Pax disputed an SGD $24.50 charge after order cancelled"

A trained BPE tokenizer would split this roughly as:

Pax disputed an SGD $ 24 . 50 charge after order cancelled

= 12 tokens. Notice: "Pax" stays one token (common in case data), "SGD" stays one token (currencies are common), but the amount "$24.50" becomes 4 separate tokens: the currency symbol, each digit group, and the dot all count.

⚠️
Key insight for GS: Case data with lots of numbers (booking IDs, amounts, timestamps) costs more tokens than the same length of narrative description. A datalake row like BK-2026-4821, SGD 24.50, 2.8%, 2026-03-15 22:14 uses ~25 tokens, mostly because digit groups, dashes, and punctuation marks each become separate tokens. At 400k MIWI cases / month, that adds up.
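
A back-of-the-envelope estimator makes the effect easy to check. The sketch below uses the same simplified rule of thumb as this page's calculator (1 token per word, 1 token per 1-2 digits, 1 token per symbol). It is not a real tokenizer: `estimate_tokens` is a made-up helper, the rules are ASCII-oriented, and actual counts vary by model.

```python
import math
import re


def estimate_tokens(text):
    """Rough token estimate: words, digit runs, and symbols counted separately."""
    # Split into letter runs, digit runs, and single non-space symbols
    pieces = re.findall(r"[A-Za-z]+|\d+|\S", text)
    counts = {"words": 0, "numbers": 0, "symbols": 0}
    for piece in pieces:
        if piece.isalpha():
            counts["words"] += 1                             # 1 token per word
        elif piece.isdigit():
            counts["numbers"] += math.ceil(len(piece) / 2)   # 1 token per 1-2 digits
        else:
            counts["symbols"] += 1                           # 1 token per symbol
    counts["total"] = sum(counts.values())
    return counts


case_text = "Pax disputed an SGD $24.50 charge after order cancelled"
datalake_row = "BK-2026-4821, SGD 24.50, 2.8%, 2026-03-15 22:14"

print(estimate_tokens(case_text))
print(estimate_tokens(datalake_row))
```

Under these simplified rules the case sentence comes out at 12 tokens, matching the split above, while the much shorter datalake row is dominated by number and symbol tokens.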

🎮 BPE Interactive Playground

🎓
Pick a preset or type your own text, then press ▶ Play to watch BPE build a vocabulary step by step. Use ◀ ▶ to step through manually.
💡 How it works: The tokenizer finds the most common pair of characters, merges them into one token, and repeats. This is how AI learns that "pay" + "ment" = "payment" — building a vocabulary from patterns in text.

📖 How BPE Works

Split text into individual characters (+ end-of-word marker _)

Count all adjacent token pairs

Merge the most frequent pair into a new token

Repeat until no pair appears more than once

💡
Try it yourself: Type any text in the input box — try a GS term like "chargeback" or "MIWI", a Pax message, or paste a snippet of THA / VNM / BHS chat. Watch how BPE discovers patterns in whatever you give it (and notice how SEA-language transcripts often cost more tokens than English).

🔬 The Three Tokenization Algorithms

Modern LLMs use one of three subword tokenization algorithms. They all solve the same problem — finding the right-sized pieces — but approach it differently:

Algorithm | Direction | Merge Criterion | Used By
BPE (Byte Pair Encoding) | Bottom-up (merge) | Most frequent pair | GPT, LLaMA, Claude
WordPiece | Bottom-up (merge) | Most likely pair (probability) | BERT
Unigram | Top-down (prune) | Remove least useful tokens | T5

📊 BPE vs WordPiece vs Unigram

BPE — Bottom Up

Start with characters. Repeatedly merge the most frequent adjacent pair. Simple, fast, effective.

a + r → ar (freq: 3)
ar + e → are (freq: 2)
...

WordPiece — Bottom Up

Like BPE, but merges the most likely pair based on probability, not raw frequency. Uses ## prefix for continuations.

"teddy" → "ted" + "##dy"
"playing" → "play" + "##ing"

Unigram — Top Down

Start with all possible substrings. Iteratively remove tokens that contribute least to overall likelihood. Opposite direction from BPE.

"bear" → best split:
"be" + "ar" (P=9.5×10⁻⁴)
🔍
Why does it matter which algorithm? The same text may produce different token counts on different models. "AnyCompany Support" might be 3 tokens on Claude but 5 on BERT — and a Vietnamese transcript may be 30% more tokens than the English equivalent. This affects cost, context window usage, and even model performance on certain tasks.
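
To make the difference between the BPE and WordPiece merge criteria concrete, the sketch below scores the same pairs both ways. It assumes a commonly cited formulation of the WordPiece score, count(pair) / (count(first) × count(second)), and ignores the ## continuation prefix for simplicity. The toy corpus and its counts are made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: word -> number of occurrences (made-up counts)
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

token_counts = Counter()
pair_counts = Counter()
for word, freq in corpus.items():
    chars = list(word)
    for ch in chars:
        token_counts[ch] += freq
    for pair in zip(chars, chars[1:]):
        pair_counts[pair] += freq


def wordpiece_score(pair):
    """Pair frequency normalised by the frequency of its two parts."""
    first, second = pair
    return pair_counts[pair] / (token_counts[first] * token_counts[second])


# BPE criterion: raw pair frequency
bpe_choice = max(pair_counts, key=pair_counts.get)

# WordPiece-style criterion: normalised score
wordpiece_choice = max(pair_counts, key=wordpiece_score)

print("BPE merges first:      ", bpe_choice)
print("WordPiece merges first:", wordpiece_choice)
```

On this corpus BPE picks the pair that simply occurs most often, while the normalised score favours a rarer pair whose parts almost always appear together.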

📚 Special Tokens — The Model's Control Codes

Most vocabularies include reserved tokens with specific purposes. The BERT-style set below is the classic example:

Token | Purpose | Example
[PAD] | Fills sequences to a fixed length so batches are uniform | case summary [PAD] [PAD]
[UNK] | Represents any word not found in the vocabulary | the [UNK] rate is
[CLS] | Class token placed at start (BERT-specific) | [CLS] summarise this case
[SEP] | Separates two segments | ... data [SEP] assess ...
[MASK] | Hides a token for the model to predict (BERT training) | the [MASK] rate is high

💰 Why GS Support Teams Should Care About Tokens

Tokens are the currency of AI — literally. Every API call is billed by token count. Understanding tokenization helps you estimate costs, optimize prompts, and choose the right model.

📊

1 token ≈ 4 characters ≈ ¾ word

"AnyCompany Support" = 3 tokens
"Case context summary" = 4 tokens

⚠️

Numbers & SEA scripts cost more

"$24.50" = 4 tokens (dollar sign, two digit groups, period). Booking IDs, timestamps, and Thai/Vietnamese transcripts cost more tokens than English narrative.

🧮 See Your Text Become Money

Type text below and watch it transform: characters → tokens → cost. Notice how numbers and special characters create MORE tokens than plain English.

1. Your text splits into tokens
Words: 1 token per word (cheap) · Numbers: 1 token per 1-2 digits (expensive) · Symbols: 1 token each ($, %, comma)

2. Token efficiency: why data format matters
[Interactive counters: Words, Numbers, and Symbols tokens]

3. What this costs: pick your scale
⚡ Cheapest: Nova Micro, $0.035 / 1M tokens
🎯 Recommended: Claude Sonnet 4.6, $3.00 / 1M tokens
🔬 Most Powerful: Claude Opus 4.6, $5.00 / 1M tokens
👤 If a TL / SPV writes each summary manually: $6,250 @ 30 min/case × $25/hr (per Mikko's Top Idea baseline)
99.9% cost reduction

Token estimate uses simplified BPE rules. Actual tokenization varies by model. Pricing as of May 2026 from AWS Bedrock.

📋 Token Estimation Quick Reference

Content Type | Approx. Tokens | Cost (Sonnet 4.6)
A short question ("Summarise this case") | ~5 tokens | < $0.001
A 10-line D365 case description | ~150 tokens | $0.0005
Your engineered Case Summarizer prompt template | ~400 tokens | $0.0012
A full case context summary (Symptom · Severity · Booking · Action · Next Step) | ~800 tokens | $0.012
Total per case (input + output) | ~1,350 tokens | ~$0.014
10,000 IRT cases per month (one market) | ~13.5M tokens | ~$137
400,000 MIWI cases per month (region-wide) | ~540M tokens | ~$5,460

The summary row is model output, which this table prices at roughly 5× the input rate ($0.012 for 800 tokens, about $15 per 1M), so it dominates the per-case total.
💡
Model choice = biggest cost lever: Nova Micro costs $0.035/1M input tokens. Claude Opus 4.6 costs $5.00/1M. The same task can cost pennies or dollars. But cheaper ≠ worse for every task. The right question is: "What's the cheapest model that meets my quality threshold?"
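
The per-case and monthly figures in the quick reference can be reproduced with a few lines of arithmetic. This sketch assumes the $3.00 per 1M input rate quoted on this page and an output rate of $15.00 per 1M, which is inferred from the table (800 output tokens for $0.012) rather than stated directly. `cost_per_case` is an illustrative helper name.

```python
INPUT_PRICE_PER_M = 3.00    # USD per 1M input tokens (rate quoted on this page)
OUTPUT_PRICE_PER_M = 15.00  # ASSUMPTION: inferred from the table, 800 tokens -> $0.012


def cost_per_case(input_tokens, output_tokens,
                  input_price=INPUT_PRICE_PER_M,
                  output_price=OUTPUT_PRICE_PER_M):
    """Cost in USD for one call, billing input and output tokens separately."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Input = case description (~150) + prompt template (~400); output = summary (~800)
per_case = cost_per_case(input_tokens=150 + 400, output_tokens=800)
print(f"Per case: ${per_case:.5f}")

for label, cases in [("10,000 IRT cases", 10_000), ("400,000 MIWI cases", 400_000)]:
    print(f"{label}: ${per_case * cases:,.2f} per month")
```

Because output tokens are priced higher than input tokens, constraining the length of the generated summary moves the total more than trimming the prompt does.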

🎯 Cost Optimization Strategies

📐

Right-Size Your Model

Simple classification → Nova Micro ($0.035/1M). Narrative generation → Sonnet ($3/1M). Don't use Opus to sort mail.

✂️

Optimize Prompts

Remove redundant instructions (10-20% savings). Use shorter examples. Constrain output length.

📦

Batch Processing

Bedrock offers a 50% discount for batch inference. Perfect for monthly portfolio assessments.

🔄

Prompt Caching

Cache repeated system prompts — saves ~90% on the template portion for subsequent calls.

🔗 The Full Journey: From Text to AI Output

Tokenization is just the first step. Here's the complete pipeline that turns your text into an AI response:

Step 1: Tokenization

Your text is split into subword tokens using BPE (or WordPiece/Unigram). Each token maps to an integer ID from the vocabulary.

Example
"case context" → ["case", "context"] → [4523, 8901]

Step 2: Token Embeddings

Each token ID is converted into a dense vector of numbers (typically 768–12,288 dimensions). These vectors capture semantic meaning — similar words get similar vectors.

🔍
Key insight: In embedding space, "drunk driving" and "DUI" are close together, while "drunk driving" and "GrabFood order" are far apart. That's why an LLM can recognise that a Pax saying "the driver smelled like beer" is in the same problem family as a formal "DUI report" — even though no exact words overlap.
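
"Close together" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (invented values, not from any real model)
embeddings = {
    "drunk driving": [0.90, 0.80, 0.10],
    "DUI": [0.85, 0.75, 0.20],
    "GrabFood order": [0.10, 0.20, 0.95],
}

close = cosine_similarity(embeddings["drunk driving"], embeddings["DUI"])
far = cosine_similarity(embeddings["drunk driving"], embeddings["GrabFood order"])

print(f"drunk driving vs DUI:            {close:.3f}")
print(f"drunk driving vs GrabFood order: {far:.3f}")
```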

Step 3: Position Encoding

The model needs to know where each token sits in the sequence. Position embeddings are added so "The merchant is risky" and "Is the merchant risky?" produce different representations despite having the same words.

Modern models like LLaMA use RoPE (Rotary Position Embeddings), which encodes relative position, so the relationship between "merchant" and "risky" is the same whether they sit at positions 3,4 or 103,104.
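
The relative-position property can be checked with a tiny 2-D sketch. Here each vector is rotated by an angle proportional to its position, and the attention score is the dot product of the rotated vectors. This is a simplification: real RoPE rotates many dimension pairs at different frequencies, and the vectors and `theta` value below are made up.

```python
import math


def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by position * theta radians."""
    angle = position * theta
    x, y = vec
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


def score(query, key, query_pos, key_pos):
    """Dot product of the position-rotated query and key."""
    q = rotate(query, query_pos)
    k = rotate(key, key_pos)
    return q[0] * k[0] + q[1] * k[1]


merchant = (1.0, 0.5)  # made-up query vector
risky = (0.3, 0.8)     # made-up key vector

early = score(merchant, risky, 3, 4)
late = score(merchant, risky, 103, 104)
wider_gap = score(merchant, risky, 3, 10)

print(f"positions 3,4:     {early:.6f}")
print(f"positions 103,104: {late:.6f}")
print(f"positions 3,10:    {wider_gap:.6f}")
```

The first two scores agree because only the gap between positions matters; changing the gap changes the score.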

Step 4: Self-Attention (The Transformer)

This is where the magic happens. Every token can directly look at every other token and decide which ones are relevant. The model asks three questions for each token:

Query (Q)

"What am I looking for?"

Key (K)

"What do I contain?"

Value (V)

"Here's my actual content"

When processing "The merchant's chargeback rate is high", the word "high" pays strong attention to "chargeback rate" (what's high?) and "merchant" (whose rate?). This context-awareness is what makes transformers so powerful.
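
The Query/Key/Value mechanics can be sketched as scaled dot-product attention in plain Python. To keep the example short, it assumes the same made-up 2-D vector serves as query, key, and value for each token; a real transformer derives Q, K, and V from learned projection matrices and runs many attention heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q.K / sqrt(d)) applied to V."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How well does this query match each key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Blend the values according to those weights
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["merchant", "chargeback", "rate", "high"]
vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.9]]  # invented values

outputs, weights = attention(vectors, vectors, vectors)
for token, row in zip(tokens, weights):
    print(f"{token:10}", [round(w, 2) for w in row])
```

Each printed row shows how one token distributes its attention across all tokens in the sentence.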

Step 5: Output Generation

The transformer produces a probability distribution over the entire vocabulary for the next token. The model picks the most likely token (or samples from the distribution based on temperature), appends it, and repeats.

Generation Example
Input: "The IRT case severity is"
→ P("P1")=0.72, P("P2")=0.21, P("P3")=0.05, ...
→ Output: "P1"

Temperature controls randomness: low temperature (0.0) = always pick the highest probability. High temperature (1.0) = more creative / random. For PAC tagging and severity classification, you want low temperature for consistency. For a coaching note rewrite, slightly higher.
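
Temperature can be sketched as a rescaling of the next-token distribution before sampling. The example reuses the three probabilities from the generation example and renormalises over just those tokens, since the rest of the vocabulary is elided there. `apply_temperature` and `sample` are illustrative names; dividing log-probabilities by the temperature has the same effect as dividing the model's logits.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) a distribution; T = 0 means greedy."""
    if temperature == 0:
        best = max(probs, key=probs.get)
        return {token: 1.0 if token == best else 0.0 for token in probs}
    scaled = {token: math.exp(math.log(p) / temperature) for token, p in probs.items()}
    total = sum(scaled.values())
    return {token: s / total for token, s in scaled.items()}


def sample(probs, temperature, rng=random):
    """Draw one token from the temperature-adjusted distribution."""
    dist = apply_temperature(probs, temperature)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Probabilities from the generation example (remaining vocabulary omitted)
next_token_probs = {"P1": 0.72, "P2": 0.21, "P3": 0.05}

for temperature in (0, 0.5, 1.0, 2.0):
    dist = apply_temperature(next_token_probs, temperature)
    print(temperature, {token: round(p, 3) for token, p in dist.items()})

print("Greedy pick:", sample(next_token_probs, temperature=0))
```

Lower temperatures concentrate probability on "P1", which is why they suit classification-style tasks where the same input should give the same answer.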

📈 The Evolution of Text Representation

Each breakthrough solved a limitation of the previous approach:

1️⃣
One-Hot Encoding

Sparse, no meaning. "cute" and "adorable" are just as far apart as "cute" and "airplane".

2️⃣
Word2vec / GloVe

Dense, meaningful, but static — one vector per word regardless of context.

3️⃣
RNN / LSTM / GRU

Sequential processing that captures word order, but with an information bottleneck on long sequences.

4️⃣
ELMo

Context-dependent embeddings via bidirectional LSTMs. "bank" gets different vectors in "river bank" vs "bank account".

5️⃣
Transformers (Attention) ⭐

Parallel attention over all tokens. No bottleneck. This is what powers Claude, GPT, and LLaMA today.