synthocr-gen: A synthetic OCR dataset generator for low-resource languages - breaking the data barrier Paper • 2601.16113 • Published Jan 22
ks-lit-3m: A 3.1 million word Kashmiri text dataset for large language model pretraining Paper • 2601.01091 • Published Jan 3
600k-ks-ocr: A large-scale synthetic dataset for optical character recognition in Kashmiri script Paper • 2601.01088 • Published Jan 3