Xiao & Tokenizer
Hey Xiao, I’ve been digging into byte‑pair encoding versus unigram models for subword tokenization. Any insights on how their efficiency holds up as the vocab size scales?
BPE tends to hit a sweet spot when you keep the vocab size moderate: it’s fast, the merges are deterministic, and the vocab is just the base alphabet plus at most one new token per merge, so its size tracks the merge count you set. Unigram can squeeze more out of a small vocab because it’s probabilistic, but segmentation becomes a bit heavier (it searches for the most likely split instead of replaying a fixed merge list), especially as you push the vocab beyond a few thousand tokens. In practice, for very large vocab sizes BPE stays lighter on memory, while unigram starts to pay off only if you need tighter compression and can afford the extra inference time.
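To make the "deterministic merges" point concrete, here is a minimal sketch of BPE merge learning in plain Python. The function names (learn_bpe, vocab_from_merges), the toy word counts, and the tie-breaking rule are assumptions for illustration, not any particular library's behaviour.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: count} dict."""
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): count for word, count in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Assumed tie-break: highest count first, then the smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges


def vocab_from_merges(word_freqs, merges):
    """Vocab = base characters plus the token produced by each merge."""
    base = {ch for word in word_freqs for ch in word}
    return base | {left + right for left, right in merges}


toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=3)
toy_vocab = vocab_from_merges(toy_counts, toy_merges)
```

Running the same counts twice gives the same merge list, and each merge adds at most one token to the vocab, which is why the vocab size follows the merge budget so closely.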
Sounds about right. BPE’s linear merge chain keeps things lightweight, while Unigram’s probability table just grows. If we’re targeting a 50‑k vocab for a multilingual model, BPE probably won’t hit that compression ceiling, but the overhead stays low. Maybe try a hybrid: start with BPE to prune the token set, then apply a lightweight Unigram fine‑tune on the remaining tokens? Just a thought.
That’s a sensible trade‑off. Start with BPE to collapse the obvious frequent merges, then run a small Unigram on the leftover set to capture rare cross‑lingual morphemes. Just watch the merge overlap – you’ll end up with a few duplicated subwords if you’re not careful. Keep the Unigram vocab under a few thousand, and you’ll stay in the “lightweight” zone while still nudging the compression.
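The overlap problem mentioned here can be sketched as a simple filter. This assumes the Unigram step hands back a {piece: score} mapping (higher score is better) and that the BPE vocab is available as a list of strings; drop_overlap and its max_size parameter are hypothetical names, not part of any tokenizer library.

```python
def drop_overlap(bpe_vocab, unigram_scores, max_size=2000):
    """Keep the best-scoring Unigram pieces that BPE does not already cover.

    bpe_vocab: iterable of BPE tokens (strings).
    unigram_scores: {piece: score}, assumed to be log-probabilities.
    max_size: cap on the number of Unigram pieces kept (hypothetical knob).
    """
    bpe_set = set(bpe_vocab)
    # Drop any piece that would duplicate an existing BPE token.
    fresh = {p: s for p, s in unigram_scores.items() if p not in bpe_set}
    # Highest score first; ties broken alphabetically so runs are repeatable.
    ranked = sorted(fresh.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_size])


bpe_tokens = ["est", "lo", "ing"]
candidate_scores = {"est": -2.0, "ung": -3.5, "heit": -3.0, "lo": -1.0}
kept_pieces = drop_overlap(bpe_tokens, candidate_scores)
```

Filtering before applying the size cap means duplicated pieces never use up one of the limited Unigram slots.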
That makes sense. Just remember to keep the merge table clean before the Unigram step, otherwise you’ll see a handful of duplicate tokens creeping in. Keep the Unigram under two thousand and you’ll stay efficient while still catching those rare cross‑lingual pieces.
Right, prune the BPE merge list before handing it off. Two thousand Unigram slots is a sweet spot – enough to pick up the odd cross‑lingual fragment, but still light on memory. Keeps the whole pipeline from blowing up.
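One way to prune the merge list without breaking it, as a rough sketch: drop merges below a frequency threshold, and also drop any later merge that depends on a token that was dropped. The function name, the pair_counts mapping, and the min_count threshold are all assumptions for illustration.

```python
def prune_merges(merges, pair_counts, min_count, base_symbols):
    """Filter an ordered BPE merge list by frequency.

    merges: list of (left, right) pairs in the order they were learned.
    pair_counts: {(left, right): frequency} recorded when each merge was learned.
    min_count: merges seen fewer times than this are dropped (assumed threshold).
    base_symbols: the starting alphabet, i.e. tokens that exist before any merge.
    """
    available = set(base_symbols)
    kept = []
    for left, right in merges:
        frequent = pair_counts.get((left, right), 0) >= min_count
        # A merge can only fire if both halves still exist after pruning.
        if frequent and left in available and right in available:
            kept.append((left, right))
            available.add(left + right)
    return kept


learned = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
counts = {("e", "s"): 9, ("es", "t"): 9, ("l", "o"): 1, ("lo", "w"): 7}
clean = prune_merges(learned, counts, min_count=2, base_symbols="estlow")
```

In this toy case ("lo", "w") is frequent on its own but gets dropped anyway, because the merge that creates "lo" fell under the threshold. Skipping that dependency check would leave an entry in the table that can never apply.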
Got it: prune the BPE merges first, then feed the clean list into a 2k Unigram. That keeps memory in check while still catching those rare cross‑lingual pieces. Just double‑check for any frequency spikes that might push a pruned merge back over the cutoff.
Sure thing, will filter the merge list and keep an eye on any frequency spikes before the Unigram step. Keeps the memory usage in check while still capturing the rare cross‑lingual tokens.
Sounds good, keep it tight and you’ll stay efficient.