306M-param Bengali LLM from scratch
Kotha-1
Low-resource language work needs a model and a training pipeline built around the script, not retrofitted to it.
What I built
A from-scratch pre-training pipeline for a 306M-parameter LLaMA-style decoder (18 layers, d=1024, 16 query heads sharing 4 KV heads via GQA, SwiGLU, RoPE, tied embeddings). The pipeline covers Bengali data collection, deduplication, and language identification; a 32k-vocabulary SentencePiece BPE tokenizer; and a custom Accelerate training loop in bf16 (cosine learning-rate schedule, gradient accumulation, checkpointing), run to completion on a single A100. The parameter count is verified exactly: 306,484,224.
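The exact parameter count can be reproduced with a short back-of-the-envelope script. This is a minimal sketch, not the project's own verification code. Two values are assumptions that the text does not state: a vocabulary of exactly 32,000 tokens (reading "32k" literally) and a SwiGLU hidden size of 4096. It also assumes bias-free linear layers and RMSNorm with a single weight vector, as is usual for LLaMA-style models. Under those assumptions the arithmetic lands on the stated total. The names DecoderConfig and count_params are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    d_model: int = 1024           # from the text
    n_layers: int = 18            # from the text
    n_heads: int = 16             # from the text
    n_kv_heads: int = 4           # from the text (GQA)
    tied_embeddings: bool = True  # from the text
    vocab_size: int = 32_000      # ASSUMPTION: "32k" read as exactly 32,000
    ffn_hidden: int = 4096        # ASSUMPTION: SwiGLU hidden size is not stated


def count_params(cfg: DecoderConfig) -> int:
    """Count weights assuming no biases and weight-only RMSNorm (assumptions)."""
    head_dim = cfg.d_model // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim

    # Attention: Q and O projections are d x d; K and V are d x kv_dim under GQA.
    attn = 2 * cfg.d_model * cfg.d_model + 2 * cfg.d_model * kv_dim
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * cfg.d_model * cfg.ffn_hidden
    # Two norms per block (pre-attention and pre-MLP).
    norms = 2 * cfg.d_model
    per_layer = attn + mlp + norms

    embed = cfg.vocab_size * cfg.d_model
    # Tied embeddings reuse the input matrix as the output head.
    lm_head = 0 if cfg.tied_embeddings else embed
    final_norm = cfg.d_model

    # RoPE contributes no learned parameters.
    return embed + cfg.n_layers * per_layer + final_norm + lm_head


total = count_params(DecoderConfig())
print(f"{total:,}")  # 306,484,224
```

Tying the embeddings matters at this scale: with the same assumed vocabulary, an untied output head would add another 32,768,000 weights, roughly 10% of the model.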
Stack
Status
Pre-training pipeline shipped; trained to completion with checkpoints.
The defensible story here is the pipeline engineering (data, tokenizer, and a from-scratch training loop), not benchmark scores on a small model.