Xiao & Tokenizer | Character dialogue

Tokenizer

Hey Xiao, I’ve been digging into byte‑pair encoding versus unigram models for subword tokenization. Any insights on their efficiency when scaling vocab sizes?

Xiao

BPE tends to hit a sweet spot when you keep the vocab size moderate: it’s fast, the merges are deterministic, and each merge adds at most one token, so the final vocab is just the base alphabet plus however many merges you ask for. Unigram can squeeze more out of a small vocab because it’s probabilistic, but segmentation gets heavier, since it has to search over candidate splits (typically with Viterbi decoding), especially as you push the vocab beyond a few thousand tokens. In practice, for very large vocab sizes BPE stays lighter on memory, while Unigram starts to pay off only if you need tighter compression and can afford the extra inference time.
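To make the merge behaviour concrete, here is a rough sketch of a toy BPE trainer in Python. It is not how any particular library does it: the function name train_bpe, the tie-breaking rule (highest count, then lexicographic order) and the sample words are all assumptions chosen for illustration.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy sketch)."""
    # Each word becomes a tuple of symbols; the Counter stores its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Deterministic choice: highest count, ties broken lexicographically.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


words = ["low", "low", "lower", "lowest", "newer", "newest"]
merges = train_bpe(words, 5)
base = {ch for w in words for ch in w}
vocab = base | {a + b for a, b in merges}
print(merges)
print(len(base), len(merges), len(vocab))
```

The thing to notice is that the vocab can never exceed the base alphabet plus the number of merges, and that rerunning the trainer on the same words gives the same merge list.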

Tokenizer

Sounds about right. BPE’s linear merge chain keeps things lightweight, while Unigram’s probability table just grows with the vocab. If we’re targeting a 50‑k vocab for a multilingual model, BPE probably won’t reach Unigram’s compression ceiling, but the overhead stays low. Maybe try a hybrid: start with BPE to prune the token set, then apply a lightweight Unigram fine‑tune on the remaining tokens? Just a thought.
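For what the probability table costs at segmentation time, a minimal sketch may help. It assumes the Unigram model is just a dict from token to log-probability and picks the best split with a Viterbi pass; the name viterbi_segment and the toy probabilities are made up for the example, and real implementations typically use a trie rather than raw dict lookups.

```python
import math


def viterbi_segment(text, logp):
    """Return the highest-scoring split of text under a unigram model (toy sketch)."""
    n = len(text)
    max_len = max(len(tok) for tok in logp)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        # Try every vocabulary piece that could end at this position.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical probabilities, chosen only so the example is easy to follow.
probs = {"l": 0.05, "o": 0.05, "w": 0.05, "e": 0.05, "r": 0.05,
         "low": 0.3, "er": 0.2, "lower": 0.25}
logp = {tok: math.log(p) for tok, p in probs.items()}
print(viterbi_segment("lower", logp))
print(viterbi_segment("lowr", logp))
```

Each input position tries up to max_len candidate pieces, which is the extra work compared with replaying a fixed merge list.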

Xiao

That’s a sensible trade‑off. Start with BPE to collapse the obvious frequent merges, then run a small Unigram model on the leftover set to capture rare cross‑lingual morphemes. Just watch the overlap between the two vocabularies: if you’re not careful, you’ll end up with a few subwords that appear in both. Keep the Unigram vocab under a few thousand, and you’ll stay in the “lightweight” zone while still improving compression a little.
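One way to guard against the overlap, sketched under assumptions: treat both vocabularies as plain lists of strings, keep the BPE tokens as they are, and only admit Unigram tokens that BPE does not already have, up to a budget. The helper merge_vocabs and its unigram_budget parameter are hypothetical names, not part of any tokenizer library.

```python
def merge_vocabs(bpe_tokens, unigram_tokens, unigram_budget=2000):
    """Combine two token lists, skipping Unigram entries BPE already covers."""
    combined = list(dict.fromkeys(bpe_tokens))  # keep order, drop repeats
    seen = set(combined)
    duplicates = []
    added = 0
    for tok in unigram_tokens:
        if tok in seen:
            # Already present (from BPE or an earlier Unigram entry).
            duplicates.append(tok)
            continue
        if added >= unigram_budget:
            continue  # over budget: skip, but keep reporting duplicates
        combined.append(tok)
        seen.add(tok)
        added += 1
    return combined, duplicates


bpe_tokens = ["lo", "low", "er"]
unigram_tokens = ["low", "est", "er", "new"]
combined, duplicates = merge_vocabs(bpe_tokens, unigram_tokens, unigram_budget=2)
print(combined)
print(duplicates)
```

Logging the duplicates list instead of silently dropping it makes it easier to see how much the two stages actually overlap.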

Tokenizer

That makes sense. Just remember to clean up the merge table before the Unigram step, otherwise you’ll see a handful of duplicate tokens creeping in. Keep the Unigram vocab under two thousand tokens and you’ll stay efficient while still catching those rare cross‑lingual pieces.

Xiao

Right, prune the BPE merge list before handing it off. Two thousand Unigram slots is a sweet spot – enough to pick up the odd cross‑lingual fragment, but still light on memory. Keeps the whole pipeline from blowing up.