Back to work
Language systems research · paper + 13 tokenizers
Bengali Tokenizer Evaluation
Multilingual LLM tokenizers fragment Bengali into byte fallback; a single missing combining character (U+09BC) inflates context cost across the script.
View sourceWhat I built
An evaluation framework over Bengali documents measuring subword fertility, byte-fallback inflation, compression, and script purity across 13 trained SentencePiece models (BPE + Unigram, 16k–64k vocab, Bengali-only and Bengali+English) versus multilingual baselines. Measured up to ~5x context-window inflation, and isolated a Unicode normalization failure (U+09BC, Bengali Nukta) that drove byte-fallback from ~2% to ~20%.
Stack
PythonSentencePieceNLPLaTeX
Status
Paper + 13 tokenizers + evaluation harness released.