📝 myX-Semantic: A Burmese Sentence Embedding Model

Model Description

myX-Semantic is a sentence-transformer model fine-tuned for the Burmese (Myanmar) language. It maps sentences and paragraphs into a 768-dimensional dense vector space.

This model is built using a Knowledge Distillation approach. It uses a paraphrase-multilingual-MiniLM-L12-v2 student architecture, trained to mimic the high-dimensional output of a larger teacher model (paraphrase-multilingual-mpnet-base-v2). To ensure compatibility with the teacher's embeddings, a dedicated Dense layer projects the student's native 384-dimensional embeddings into the final 768-dimensional space.
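The distillation objective described above can be sketched numerically. The toy code below (pure Python, with 3- and 6-dimensional vectors standing in for the real 384 and 768 dimensions, and hypothetical values throughout) shows the two ingredients: a Dense layer with Identity activation projecting the student embedding into the teacher's space, and an MSE loss pulling the projection toward the teacher vector. It is an illustration of the objective, not the actual training code.

```python
# Toy sketch of the distillation objective (illustrative only): a Dense
# layer W projects a student embedding (3-dim here, 384 in the real model)
# into the teacher's space (6-dim here, 768 in the real model), and W is
# nudged by one gradient step to reduce the MSE against the teacher vector.

def project(s, W):
    """Dense layer with Identity activation: s @ W."""
    return [sum(s[i] * W[i][j] for i in range(len(s))) for j in range(len(W[0]))]

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

student = [0.2, -0.5, 0.9]                 # hypothetical student embedding
teacher = [0.1, 0.3, -0.2, 0.4, 0.0, 0.7]  # hypothetical teacher embedding
W = [[0.1] * 6 for _ in range(3)]          # projection weights, 3 x 6

loss_before = mse(project(student, W), teacher)

# One gradient-descent step on W: dL/dW[i][j] = (2/d) * (pred[j] - t[j]) * s[i]
lr = 0.1
pred = project(student, W)
for i in range(3):
    for j in range(6):
        W[i][j] -= lr * 2.0 / 6 * (pred[j] - teacher[j]) * student[i]

loss_after = mse(project(student, W), teacher)
print(loss_before, loss_after)  # the step reduces the loss
```

During real training, the student's transformer weights are updated as well, but the MSE-between-embedding-spaces objective is the same.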

Key Applications

  • Semantic Textual Similarity (STS): Measuring how similar two sentences are in meaning.
  • Semantic Search: Retrieving relevant documents based on intent rather than keywords.
  • Text Classification & Clustering: Grouping similar Burmese texts based on their semantic vectors.
  • Information Retrieval: Finding answers or paraphrases in large Burmese datasets.

Development & Distribution

Technical Specifications

  • Base Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Max Sequence Length: 512 tokens
  • Output Dimension: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Loss Function: MSELoss (Mean Squared Error)

Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
  (2): Dense({'in_features': 384, 'out_features': 768, 'bias': True, 'activation_function': 'Identity'})
)

Usage

Installation

pip install -U sentence-transformers

Direct Usage (Inference)

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("DatarrX/myX-Semantic")

# Define sentences (Unicode-encoded Burmese)
sentences = [
    "သူနှင့် ကျွန်မ ခဏ ငြိမ်နေလိုက်၏။",        # "He and I stayed silent for a moment."
    "ကျွန်တော်တို့ အတူတူ ထိုင်နေကြသည်။",      # "We sat together."
    "နည်းပညာသည် လူသားတို့အတွက် အရေးကြီးသည်။"  # "Technology is important for humankind."
]

# Compute embeddings (one 768-dimensional vector per sentence)
embeddings = model.encode(sentences)

# Compute pairwise cosine similarity scores (3 x 3 matrix)
similarities = model.similarity(embeddings, embeddings)
print(similarities)

Implementation Guidelines (Thresholds)

When using this model for similarity detection or semantic search, the choice of a similarity threshold is crucial for balancing precision and recall. Based on empirical testing:

  • Recommended Threshold: A Cosine Similarity score of 0.60 or higher is recommended to determine a strong semantic match.
  • Comparison: Compared to variants trained on less data (e.g., 500K-row models), this 1M-row model produces more confident vector representations. While those lower-capacity models may need a threshold as low as 0.40, myX-Semantic is tuned for a cleaner separation at the 0.60 level.
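Applying the 0.60 threshold is a simple comparison on cosine scores. The sketch below uses small hypothetical vectors in place of real model.encode() output, purely to show the decision logic; the cosine function is written out in plain Python so the example is self-contained.

```python
import math

# Sketch of threshold-based matching. The embeddings here are hypothetical
# 3-dim stand-ins for real 768-dim model output.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.60  # recommended cutoff for a strong semantic match

emb_query = [0.8, 0.1, 0.5]
emb_doc_a = [0.7, 0.2, 0.6]   # points in a similar direction (close in meaning)
emb_doc_b = [-0.5, 0.9, 0.1]  # points elsewhere (different meaning)

for name, emb in [("doc_a", emb_doc_a), ("doc_b", emb_doc_b)]:
    score = cosine_similarity(emb_query, emb)
    verdict = "match" if score >= THRESHOLD else "no match"
    print(f"{name}: {score:.2f} -> {verdict}")
```

In practice you would compute the scores with model.similarity() as in the usage example above and apply the same `>= 0.60` test.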

Training Details

  • Samples: 1,000,000 training pairs.
  • Batch Size: 64
  • Learning Rate: 3e-5
  • Optimizer: AdamW with round_robin batch sampling.
  • Teacher Model: paraphrase-multilingual-mpnet-base-v2 (768-dim).

Training Logs

  Epoch    Step    Training Loss
  0.06      500           0.0086
  0.25     2000           0.0045
  0.64     5000           0.0031
  0.96     7500           0.0028

Limitations & Bias

  • Language: This model is specifically optimized for Unicode Burmese. It may not perform accurately with Zawgyi-encoded text.
  • Data Bias: The model reflects the patterns and biases found in the myX-Mega-Corpus. Users should validate results for specific sensitive domains.
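Because Zawgyi-encoded text degrades results, it can help to screen inputs first. The sketch below is a deliberately simplified heuristic (for production, prefer a trained detector such as Google's myanmar-tools): in standard Unicode, the vowel sign E (U+1031) is stored after its base consonant or medial, whereas Zawgyi stores it before, so a U+1031 that is not preceded by a Myanmar consonant or medial is a strong Zawgyi signal.

```python
# Rough heuristic for flagging possibly Zawgyi-encoded input. This catches
# only the most common ordering difference (vowel sign E typed before the
# consonant); a real detector such as Google's myanmar-tools is far more
# robust and should be preferred in production.

MYANMAR_CONSONANTS = {chr(cp) for cp in range(0x1000, 0x1022)}  # KA (က) .. A (အ)
MEDIALS = {"\u103B", "\u103C", "\u103D", "\u103E"}              # medial signs

def looks_like_zawgyi(text):
    for i, ch in enumerate(text):
        if ch == "\u1031":  # vowel sign E
            prev = text[i - 1] if i > 0 else ""
            if prev not in MYANMAR_CONSONANTS and prev not in MEDIALS:
                return True
    return False

print(looks_like_zawgyi("ေက"))  # True  (Zawgyi-style ordering: E before KA)
print(looks_like_zawgyi("ကေ"))  # False (Unicode ordering: E after KA)
```

Flagged inputs can be converted to Unicode (e.g., with a Zawgyi-to-Unicode converter) before being passed to the model.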

License

This model is licensed under the Apache License 2.0. You are free to use it for research and commercial purposes, provided appropriate credit is given.

Citation

If you find this model useful in your project, please cite it:

@software{khantsintheinn2026myxsemantic,
  author = {Khant Sint Heinn},
  title = {myX-Semantic: A Burmese Sentence Embedding Model},
  year = {2026},
  publisher = {DatarrX},
  url = {https://huggingface.co/DatarrX/myX-Semantic}
}

About the Author

Khant Sint Heinn, working under the name Kalix Louis, is a Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the Lead Developer at DatarrX, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

Connect with the Author:
GitHub | Hugging Face | Kaggle
