AssameseOCR

AssameseOCR is a vision-language model for Optical Character Recognition (OCR) of printed Assamese text. Built on Microsoft's Florence-2-large foundation model with a custom character-level decoder, it achieves 94.67% character accuracy on the Mozhi dataset.

Model Details

Model Description

Developed by: MWire Labs
Model type: Vision-Language OCR
Language: Assamese (অসমীয়া)
License: Apache 2.0
Base Model: microsoft/Florence-2-large-ft
Architecture: Florence-2 Vision Encoder + Custom Transformer Decoder

Model Architecture

Image (768×768) 
  ↓
Florence-2 Vision Encoder (frozen, 360M params)
  ↓
Vision Projection (1024 → 512 dim)
  ↓
Transformer Decoder (4 layers, 8 heads)
  ↓
Character-level predictions (187 vocab)

Key Components:

Vision Encoder: Florence-2-large DaViT architecture (frozen)
Decoder: 4-layer Transformer with 512 hidden dimensions
Tokenizer: Character-level with 187 tokens (Assamese chars + English + digits + symbols)
Total Parameters: 378M (361M frozen, 17.5M trainable)

Training Details

Training Data

Dataset: Mozhi Indic OCR Dataset (Assamese subset)
Training samples: 79,697 word images
Validation samples: 9,945 word images
Test samples: 10,146 word images
Source: IIT Hyderabad CVIT

Training Procedure

Hardware:

GPU: NVIDIA A40 (48GB VRAM)
Training time: ~8 hours (3 epochs)

Hyperparameters:

Epochs: 3
Batch size: 16
Learning rate: 3e-4
Optimizer: AdamW (weight_decay=0.01)
Scheduler: CosineAnnealingLR
Max sequence length: 128 characters
Gradient clipping: 1.0

Training Strategy:

Froze Florence-2 vision encoder (leveraging pretrained visual features)
Trained only the projection layer and transformer decoder
Full fine-tuning (no LoRA) for maximum quality

Performance

Results

Split	Character Accuracy	Loss
Epoch 1 (Val)	91.61%	0.2844
Epoch 2 (Val)	94.09%	0.1548
Epoch 3 (Val)	94.67%	0.1221

Character Error Rate (CER): ~5.33%

Comparison

The model achieves strong performance for a foundation model approach:

Mozhi paper (CRNN+CTC specialist): ~99% accuracy
AssameseOCR (Florence generalist): 94.67% accuracy

The 5% gap is expected when adapting a general vision-language model versus training a specialized OCR architecture. However, AssameseOCR offers:

Extensibility to vision-language tasks (VQA, captioning, document understanding)
Faster training (3 epochs vs typical 10-20 for CRNN)
Foundation model benefits (transfer learning, robustness)

Usage

Installation

pip install torch torchvision transformers pillow

Inference

import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoModelForCausalLM, CLIPImageProcessor
from huggingface_hub import hf_hub_download
import json

# CharTokenizer class
class CharTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.char2id = {c: i for i, c in enumerate(vocab)}
        self.id2char = {i: c for i, c in enumerate(vocab)}
        self.pad_token_id = self.char2id["<pad>"]
        self.bos_token_id = self.char2id["<s>"]
        self.eos_token_id = self.char2id["</s>"]
        
    def encode(self, text, max_length=None, add_special_tokens=True):
        ids = [self.bos_token_id] if add_special_tokens else []
        for ch in text:
            ids.append(self.char2id.get(ch, self.char2id["<unk>"]))
        if add_special_tokens:
            ids.append(self.eos_token_id)
        if max_length:
            ids = ids[:max_length]
            if len(ids) < max_length:
                ids += [self.pad_token_id] * (max_length - len(ids))
        return ids
        
    def decode(self, ids, skip_special_tokens=True):
        chars = []
        for i in ids:
            ch = self.id2char.get(i, "")
            if skip_special_tokens and ch.startswith("<"):
                continue
            chars.append(ch)
        return "".join(chars)
    
    @classmethod
    def load(cls, path):
        with open(path, "r", encoding="utf-8") as f:
            vocab = json.load(f)
        return cls(vocab)

# FlorenceCharOCR model class
class FlorenceCharOCR(nn.Module):
    def __init__(self, florence_model, vocab_size, vision_hidden_dim, decoder_hidden_dim=512, num_layers=4):
        super().__init__()
        self.florence_model = florence_model
        
        for param in self.florence_model.parameters():
            param.requires_grad = False
        
        self.vision_proj = nn.Linear(vision_hidden_dim, decoder_hidden_dim)
        self.embedding = nn.Embedding(vocab_size, decoder_hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=decoder_hidden_dim, 
            nhead=8, 
            batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(decoder_hidden_dim, vocab_size)
        
    def forward(self, pixel_values, tgt_ids, tgt_mask=None):
        with torch.no_grad():
            vision_feats = self.florence_model._encode_image(pixel_values)
        
        vision_feats = self.vision_proj(vision_feats)
        tgt_emb = self.embedding(tgt_ids)
        decoder_out = self.decoder(tgt_emb, vision_feats, tgt_mask=tgt_mask)
        logits = self.fc_out(decoder_out)
        
        return logits

# Load components
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download files from HuggingFace
tokenizer_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr", filename="assamese_char_tokenizer.json")
model_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr", filename="assamese_ocr_best.pt")

# Load tokenizer
char_tokenizer = CharTokenizer.load(tokenizer_path)

# Load Florence base model
florence_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large-ft",
    trust_remote_code=True
).to(device)

# Load image processor
image_processor = CLIPImageProcessor.from_pretrained("microsoft/Florence-2-large-ft")

# Initialize OCR model
ocr_model = FlorenceCharOCR(
    florence_model=florence_model,
    vocab_size=len(char_tokenizer.vocab),
    vision_hidden_dim=1024,
    decoder_hidden_dim=512,
    num_layers=4
).to(device)

# Load trained weights
checkpoint = torch.load(model_path, map_location=device)
ocr_model.load_state_dict(checkpoint['model_state_dict'])
ocr_model.eval()

# Inference function
def recognize_text(image_path):
    # Load and process image
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(images=[image], return_tensors="pt")['pixel_values'].to(device)
    
    # Generate prediction
    with torch.no_grad():
        # Start with BOS token
        generated_ids = [char_tokenizer.bos_token_id]
        
        for _ in range(128):  # max length
            tgt_tensor = torch.tensor([generated_ids], device=device)
            logits = ocr_model(pixel_values, tgt_tensor)
            
            # Get next token
            next_token = logits[0, -1].argmax().item()
            generated_ids.append(next_token)
            
            # Stop if EOS
            if next_token == char_tokenizer.eos_token_id:
                break
    
    # Decode
    text = char_tokenizer.decode(generated_ids, skip_special_tokens=True)
    return text

# Example usage
result = recognize_text("assamese_text.jpg")
print(f"Recognized text: {result}")

Vocabulary

The character-level tokenizer includes:

Assamese characters: 119 unique chars (consonants, vowels, diacritics, conjuncts)
English: 52 chars (a-z, A-Z)
Digits: 30 chars (ASCII 0-9, Assamese ০-৯, Devanagari ०-९)
Symbols: 33 chars (punctuation, special chars)
Special tokens: 6 tokens (<pad>, <s>, </s>, <unk>, <OCR>, <lang_as>)
Total vocabulary: 187 tokens

Limitations

Trained only on printed text (not handwritten)
Word-level images from Mozhi dataset (may not generalize to full-page OCR without line segmentation)
Character-level decoder may struggle with very long sequences (>128 chars)
Does not handle layout analysis or reading order
Performance on degraded/low-quality images not extensively tested

Future Work

Extend to MeiteiOCR for Meitei Mayek script
Scale to NE-OCR covering all 9+ Northeast Indian languages
Add document layout analysis and reading order detection
Improve performance with synthetic data augmentation
Fine-tune for handwritten text recognition
Extend to multimodal tasks (image captioning, VQA for documents)

Citation

If you use AssameseOCR in your research, please cite:

@software{assameseocr2026,
  author = {MWire Labs},
  title = {AssameseOCR: Vision-Language Model for Assamese Text Recognition},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/MWirelabs/assamese-ocr}
}

Acknowledgments

Dataset: Mozhi Indic OCR Dataset by IIT Hyderabad CVIT (Mathew et al., 2022)
Base Model: Florence-2 by Microsoft Research
Organization: MWire Labs, Shillong, Meghalaya, India

Contact

Organization: MWire Labs
Location: Shillong, Meghalaya, India
Focus: Language technology for Northeast Indian languages

Part of the MWire Labs NLP suite:

KhasiBERT - Khasi language model
NE-BERT - 9 Northeast languages
Kren-M - Khasi-English conversational AI
AssameseOCR - Assamese text recognition

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for MWirelabs/assamese-ocr

Base model

microsoft/Florence-2-large-ft

Finetuned

(9)

this model

Dataset used to train MWirelabs/assamese-ocr

Paper for MWirelabs/assamese-ocr

Towards Deployable OCR models for Indic languages

Paper • 2205.06740 • Published May 13, 2022

Evaluation results

Character Accuracy on Mozhi Indic OCR (Assamese)
test set self-reported

94.670
Character Error Rate (CER) on Mozhi Indic OCR (Assamese)
test set self-reported

5.330