GT-REX: Production OCR Model
GothiTech Recognition and Extraction eXpert
GT-REX is a state-of-the-art production-grade OCR model developed by GothiTech for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents including invoices, contracts, forms, handwritten notes, and dense tables.
Table of Contents
- GT-REX Variants
- Key Features
- Model Details
- Quick Start
- Installation
- Usage Examples
- Use Cases
- Performance Benchmarks
- Prompt Engineering Guide
- API Integration
- Troubleshooting
- Hardware Recommendations
- License
- Citation
- Contact and Support
GT-REX Variants
GT-REX ships with three optimized configurations tailored to different performance and accuracy requirements. All variants share the same underlying model weights — they differ only in inference settings.
| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---|---|---|---|---|---|---|
| Nano | Ultra Fast | Good | 640px | 4-6 GB | 100-150 docs/min | High-volume batch processing |
| Pro (Default) | Fast | High | 1024px | 6-10 GB | 50-80 docs/min | Standard enterprise workflows |
| Ultra | Moderate | Maximum | 1536px | 10-15 GB | 20-30 docs/min | High-accuracy and fine-detail needs |
How to Choose a Variant
- Nano: You need maximum throughput and documents are simple (receipts, IDs, labels).
- Pro: General-purpose. Best balance for invoices, contracts, forms, and reports.
- Ultra: Documents have fine print, dense tables, medical records, or legal footnotes.
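If you switch between variants at runtime, it can help to keep the three inference configurations in one place. The sketch below is illustrative only (the VARIANT_CONFIGS dict and load_variant helper are not part of the model package); the values mirror the per-variant settings documented in the sections that follow.

from vllm import LLM

# Illustrative mapping of variant name -> inference settings.
# These mirror the Nano / Pro / Ultra configurations below; adjust as needed.
VARIANT_CONFIGS = {
    "nano":  {"max_model_len": 2048, "gpu_memory_utilization": 0.60, "max_num_seqs": 256},
    "pro":   {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}

def load_variant(name: str = "pro") -> LLM:
    """Load GT-REX with the inference settings for the chosen variant."""
    cfg = VARIANT_CONFIGS[name]
    return LLM(
        model="gothitech/GT-REX",
        trust_remote_code=True,
        limit_mm_per_prompt={"image": 1},
        **cfg,
    )

llm = load_variant("pro")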
GT-REX-Nano
Speed-optimized for high-volume batch processing
| Setting | Value |
|---|---|
| Resolution | 640 x 640 px |
| Speed | ~1-2s per image |
| Max Tokens | 2048 |
| GPU Memory | 4-6 GB |
| Recommended Batch Size | 256 sequences |
Best for: Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.
from vllm import LLM
llm = LLM(
model="gothitech/GT-REX",
trust_remote_code=True,
max_model_len=2048,
gpu_memory_utilization=0.6,
max_num_seqs=256,
limit_mm_per_prompt={"image": 1},
)
GT-REX-Pro (Default)
Balanced quality and speed for standard enterprise documents
| Setting | Value |
|---|---|
| Resolution | 1024 x 1024 px |
| Speed | ~2-5s per image |
| Max Tokens | 4096 |
| GPU Memory | 6-10 GB |
| Recommended Batch Size | 128 sequences |
Best for: Contracts, forms, invoices, reports, government documents, insurance claims.
from vllm import LLM
llm = LLM(
model="gothitech/GT-REX",
trust_remote_code=True,
max_model_len=4096,
gpu_memory_utilization=0.75,
max_num_seqs=128,
limit_mm_per_prompt={"image": 1},
)
GT-REX-Ultra
Maximum quality with adaptive processing for complex documents
| Setting | Value |
|---|---|
| Resolution | 1536 x 1536 px |
| Speed | ~5-10s per image |
| Max Tokens | 8192 |
| GPU Memory | 10-15 GB |
| Recommended Batch Size | 64 sequences |
Best for: Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.
from vllm import LLM
llm = LLM(
model="gothitech/GT-REX",
trust_remote_code=True,
max_model_len=8192,
gpu_memory_utilization=0.85,
max_num_seqs=64,
limit_mm_per_prompt={"image": 1},
)
Key Features
| Feature | Description |
|---|---|
| High Accuracy | Advanced vision-language architecture for precise text extraction |
| Multi-Language | Handles documents in English and multiple other languages |
| Production Ready | Optimized for deployment with the vLLM inference engine |
| Batch Processing | Process hundreds of documents per minute (Nano variant) |
| Flexible Prompts | Supports structured extraction: JSON, tables, key-value pairs, forms |
| Handwriting Support | Transcribes handwritten text with high fidelity |
| Three Variants | Nano (speed), Pro (balanced), Ultra (accuracy) |
| Structured Output | Extract data directly into JSON, Markdown tables, or custom schemas |
Model Details
| Attribute | Value |
|---|---|
| Developer | GothiTech (Jenis Hathaliya) |
| Architecture | Vision-Language Model (VLM) |
| Model Size | ~6.5 GB |
| Parameters | ~7B |
| License | MIT |
| Release Date | February 2026 |
| Precision | BF16 / FP16 |
| Input Resolution | 640px - 1536px (variant dependent) |
| Max Sequence Length | 2048 - 8192 tokens (variant dependent) |
| Inference Engine | vLLM (recommended) |
| Framework | PyTorch / Transformers |
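The model runs in BF16 by default on GPUs that support it. If you need to pin the precision explicitly (for example, FP16 on pre-Ampere GPUs without BF16 support), vLLM accepts a dtype argument at load time. A minimal sketch using the Pro settings:

from vllm import LLM

# Pin the precision explicitly: "bfloat16" on Ampere or newer, "float16" otherwise.
llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    limit_mm_per_prompt={"image": 1},
)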
Quick Start
Get running in under 5 minutes:
from vllm import LLM, SamplingParams
from PIL import Image
# 1. Load model (Pro variant - default)
llm = LLM(
model="gothitech/GT-REX",
trust_remote_code=True,
max_model_len=4096,
gpu_memory_utilization=0.75,
max_num_seqs=128,
limit_mm_per_prompt={"image": 1},
)
# 2. Prepare input
image = Image.open("document.png")
prompt = "Extract all text from this document."
# 3. Run inference
sampling_params = SamplingParams(
temperature=0.0,
max_tokens=4096,
)
outputs = llm.generate(
[{
"prompt": prompt,
"multi_modal_data": {"image": image},
}],
sampling_params=sampling_params,
)
# 4. Get results
result = outputs[0].outputs[0].text
print(result)
Installation
Prerequisites
- Python 3.9+
- CUDA 11.8+ (GPU required)
- 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)
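Before installing, you can sanity-check that a suitable GPU is visible and compare its memory against the requirements above. A quick check, assuming PyTorch is already available in your environment:

import torch

# Confirm a CUDA GPU is visible and report its memory
# (4 GB+ for Nano, 8 GB+ for Pro, 12 GB+ for Ultra).
assert torch.cuda.is_available(), "GT-REX requires a CUDA-capable GPU"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")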
Install Dependencies
pip install vllm pillow torch transformers
Verify Installation
import vllm
print(f"vLLM {vllm.__version__} installed successfully!")
Usage Examples
Basic Text Extraction
prompt = "Extract all text from this document image."
Structured JSON Extraction
prompt = '''Extract the following fields from this invoice as JSON:
{
"invoice_number": "",
"date": "",
"vendor_name": "",
"total_amount": "",
"line_items": [
{"description": "", "quantity": "", "unit_price": "", "amount": ""}
]
}'''
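The model returns the filled-in JSON as plain text, which may be wrapped in a Markdown code fence. A small post-processing step (illustrative application code, not part of the model) makes it safe to parse:

import json
import re

def parse_json_output(text: str) -> dict:
    """Strip an optional Markdown code fence and parse the model output as JSON."""
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

# invoice = parse_json_output(outputs[0].outputs[0].text)
# print(invoice["invoice_number"], invoice["total_amount"])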
Table Extraction (Markdown Format)
prompt = "Extract all tables from this document in Markdown table format."
Key-Value Pair Extraction
prompt = '''Extract all key-value pairs from this form.
Return as:
Key: Value
Key: Value'''
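The resulting "Key: Value" lines can be turned into a dictionary with a few lines of post-processing (a sketch, assuming one pair per line):

def parse_key_values(text: str) -> dict:
    """Convert 'Key: Value' lines from the model output into a dict."""
    pairs = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            pairs[key.strip()] = value.strip()
    return pairs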
Handwritten Text Transcription
prompt = "Transcribe all handwritten text from this image accurately."
Multi-Document Batch Processing
from PIL import Image
from vllm import LLM, SamplingParams
llm = LLM(
model="gothitech/GT-REX",
trust_remote_code=True,
max_model_len=4096,
gpu_memory_utilization=0.75,
max_num_seqs=128,
limit_mm_per_prompt={"image": 1},
)
# Prepare batch
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
img = Image.open(path)
prompts.append({
"prompt": "Extract all text from this document.",
"multi_modal_data": {"image": img},
})
# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)
# Collect results
for i, output in enumerate(outputs):
print(f"--- Document {i + 1} ---")
print(output.outputs[0].text)
print()
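For downstream processing it is often convenient to keep each extraction paired with its source file. vLLM returns outputs in the same order as the input prompts, so a simple mapping works (plain application code, not part of the model):

import json

# Pair each extraction with its source path and persist the batch results.
results = {path: output.outputs[0].text for path, output in zip(image_paths, outputs)}
with open("ocr_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)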
Use Cases
| Domain | Application | Recommended Variant |
|---|---|---|
| Finance | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| Legal | Contract analysis, clause extraction, legal filings | Ultra |
| Healthcare | Medical records, prescriptions, lab reports | Ultra |
| Government | Form processing, ID verification, tax documents | Pro |
| Insurance | Claims processing, policy documents | Pro |
| Education | Exam paper digitization, handwritten notes | Pro / Ultra |
| Logistics | Shipping labels, waybills, packing lists | Nano |
| Real Estate | Property documents, deeds, mortgage papers | Pro |
| Retail | Product catalogs, price tags, inventory lists | Nano |
Performance Benchmarks
Throughput by Variant (NVIDIA A100 80GB)
| Variant | Single Image | Batch (32) | Batch (128) |
|---|---|---|---|
| Nano | ~1.2s | ~15s | ~55s |
| Pro | ~3.5s | ~45s | ~170s |
| Ultra | ~7.0s | ~110s | ~380s |
Accuracy by Document Type (Pro Variant)
| Document Type | Character Accuracy | Field Accuracy |
|---|---|---|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |
Note: Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.
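To get comparable throughput numbers on your own hardware, you can time a batch directly. A minimal sketch, assuming llm and prompts are prepared as in the batch processing example above:

import time
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params=sampling_params)
elapsed = time.perf_counter() - start

# Documents per minute for the batch you just ran.
print(f"{len(outputs)} docs in {elapsed:.1f}s -> {len(outputs) / elapsed * 60:.0f} docs/min")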
Prompt Engineering Guide
Get the best results from GT-REX with these prompt strategies:
Tips for Best Results
Do:
- Be specific about what to extract ("Extract the invoice number and total amount")
- Specify output format ("Return as JSON", "Return as Markdown table")
- Provide schema for structured extraction (show the expected JSON keys)
- Use clear instructions ("Transcribe exactly as written, preserving spelling errors")
Don't:
- Use vague prompts ("What is this?")
- Ask for analysis or summarization (GT-REX is optimized for extraction)
- Include unrelated context in the prompt
Example Prompts
# Simple extraction
"Extract all text from this document."
# Targeted extraction
"Extract only the table on this page as a Markdown table."
# Schema-driven extraction
"Extract data matching this schema: {name: str, date: str, amount: float}"
# Preservation mode
"Transcribe this document exactly as written, preserving original formatting."
API Integration
FastAPI Server Example
from fastapi import FastAPI, Form, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams
import io
app = FastAPI()
llm = LLM(
model="gothitech/GT-REX",
trust_remote_code=True,
max_model_len=4096,
gpu_memory_utilization=0.75,
max_num_seqs=128,
limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
@app.post("/extract")
async def extract_text(file: UploadFile, prompt: str = Form("Extract all text.")):
image_bytes = await file.read()
image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
outputs = llm.generate(
[{
"prompt": prompt,
"multi_modal_data": {"image": image},
}],
sampling_params=sampling_params,
)
return {"text": outputs[0].outputs[0].text}
cURL Example
curl -X POST "http://localhost:8000/extract" \
-F "file=@invoice.png" \
-F "prompt=Extract all text from this invoice as JSON."
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA Out of Memory | Reduce gpu_memory_utilization or switch to Nano variant |
| Slow inference | Increase max_num_seqs for better batching; use Nano for speed |
| Truncated output | Increase max_tokens in SamplingParams |
| Low accuracy on small text | Switch to Ultra variant for higher resolution |
| Garbled multilingual text | Ensure image resolution is sufficient; try Ultra variant |
| Empty output | Check that the image is loaded correctly and is not blank |
| Model loading errors | Ensure trust_remote_code=True is set |
Hardware Recommendations
| Variant | Minimum GPU | Recommended GPU |
|---|---|---|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |
License
This model is released under the MIT License. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.
Citation
If you use GT-REX in your work, please cite:
@misc{gtrex-2026,
title = {GT-REX: Production-Grade OCR with Vision-Language Models},
author = {Hathaliya, Jenis},
year = {2026},
month = {February},
url = {https://huggingface.co/gothitech/GT-REX},
note = {GothiTech Recognition and Extraction eXpert}
}
Contact and Support
- Developer: Jenis Hathaliya
- Organization: GothiTech
- HuggingFace: gothitech
Built by GothiTech
Last updated: February 2026
GT-REX | Variants: Nano | Pro | Ultra