306M-param Bengali LLM from scratch

Kotha-1

Low-resource language work needs a model and a training pipeline built around the script, not retrofitted to it.

What I built

A from-scratch pre-training pipeline for a 306M-parameter LLaMA-style decoder (18 layers, d=1024, 16 query heads sharing 4 KV heads via GQA, SwiGLU, RoPE, tied embeddings). The pipeline covers Bengali data collection, deduplication, and language identification; a 32k-vocabulary SentencePiece BPE tokenizer; and a custom Accelerate training loop in bf16 (cosine learning-rate schedule, gradient accumulation, checkpointing), run to completion on a single A100. The parameter count was verified to be exactly 306,484,224.
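The quoted total can be sanity-checked with plain arithmetic. The sketch below is not the project's own verification code. It assumes a vocabulary of exactly 32,000 and an FFN hidden width of 4096, neither of which is stated above; both were chosen because they reproduce the figure. It also assumes bias-free linear layers and weight-only RMSNorm, as is usual for LLaMA-style models. The names `DecoderConfig` and `count_params` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    n_layers: int = 18
    d_model: int = 1024
    n_heads: int = 16
    n_kv_heads: int = 4
    vocab_size: int = 32_000      # ASSUMPTION: "32k" read as exactly 32,000
    d_ffn: int = 4096             # ASSUMPTION: FFN width is not stated in the write-up
    tied_embeddings: bool = True


def count_params(cfg: DecoderConfig) -> int:
    """Count learned parameters, assuming no biases and weight-only RMSNorm."""
    head_dim = cfg.d_model // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim            # GQA: K and V project to fewer heads

    # Attention: Wq and Wo are d x d; Wk and Wv are d x kv_dim.
    attn = 2 * cfg.d_model * cfg.d_model + 2 * cfg.d_model * kv_dim
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * cfg.d_model * cfg.d_ffn
    # Two RMSNorm weight vectors per layer (pre-attention, pre-MLP).
    norms = 2 * cfg.d_model
    per_layer = attn + mlp + norms

    embed = cfg.vocab_size * cfg.d_model
    # With tied embeddings the output head reuses the embedding matrix.
    lm_head = 0 if cfg.tied_embeddings else embed
    final_norm = cfg.d_model

    # RoPE contributes no learned parameters.
    return cfg.n_layers * per_layer + final_norm + embed + lm_head


if __name__ == "__main__":
    print(f"{count_params(DecoderConfig()):,}")   # prints 306,484,224
```

Under these assumptions each layer holds 15,206,400 parameters, the embedding table holds 32,768,000, and untying the output head would add another 32,768,000.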

Stack

Python, PyTorch, Accelerate, SentencePiece, GQA, RoPE

Status

Pre-training pipeline shipped; trained to completion with checkpoints.

The defensible story here is the pipeline engineering (data, tokenizer, and a from-scratch training loop), not benchmark scores on a small model.