Combined Myanmar LLM Model

A comprehensive dataset combining three Myanmar-related datasets for training large language models, optimized for code generation and Myanmar language understanding.

English | မြန်မာဘာသာ


English

Overview

This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities:

Source Dataset Description Type Samples
chat-skill.md amkyawdev/myanmar-llm-data Myanmar conversations, translations, Q&A Chat/Skill ~54,553
agent-skill.md amkyawdev/mm-llm-coder-agent-dataset Agent workflow for coding tasks Agent/Skill ~1,000,020
code-skill.md amkyawdev/mm-llm-coder-dataset Code generation and debugging Code/Skill ~2,000,000

Total Samples: ~3,020,347

Data Sources

1. chat-skill.md - Myanmar LLM Data (amkyawdev/myanmar-llm-data)

Multi-turn conversations in Burmese and English:

  • Format: messages (role + content), tags
  • Link: View Dataset

2. agent-skill.md - Coder Agent Dataset (amkyawdev/mm-llm-coder-agent-dataset)

Agent workflows with tool usage:

  • Format: Agent workflows with tools_used, code_snippets, execution_result
  • Link: View Dataset

3. code-skill.md - Coder Dataset (amkyawdev/mm-llm-coder-dataset)

Code generation and debugging tasks:

Features

  • Myanmar Language Support: Native Burmese (မြန်မာစာ) conversations and translations
  • Code Generation: Python, JavaScript, TypeScript and other programming languages
  • Agent Workflows: Multi-step coding tasks with tool usage
  • Quality Metrics: Ratings, validation status, and complexity scores

Usage

from datasets import load_dataset

# Load chat-skill dataset (Myanmar conversations)
chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
print("Chat data:", chat_ds)

# Load agent-skill dataset (Agent workflows)
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
print("Agent data:", agent_ds)

# Load code-skill dataset (Code generation)
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
print("Code data:", code_ds)

# Access specific samples
chat_sample = chat_ds["train"][0]
print("Messages:", chat_sample["messages"])
print("Tags:", chat_sample["tags"])

Use Cases

  • Myanmar Language Models: Train LLMs that understand Burmese
  • Code Generation: Train models for programming tasks
  • Agent Workflows: Train coding agents with tool usage
  • Debugging: Fix common coding errors
  • Multilingual Tasks: Translation between English and Myanmar

License

Apache 2.0 License


မြန်မာဘာသာ

အနှစ်ချူပါ

ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် dataset ၃ ခုကို ပေါင်းစပ်ထားပါ။

Source Dataset Description Samples
chat-skill.md amkyawdev/myanmar-llm-data မြန်မာစာပါးဆက် ~54,553
agent-skill.md amkyawdev/mm-llm-coder-agent-dataset Agent workflow ~1,000,020
code-skill.md amkyawdev/mm-llm-coder-dataset ကုဒ်ထုတ်လုပ်ခြင်း ~2,000,000

ပါဝင်မှု စုစုပါး: ~3,020,347

သုံးပါ

from datasets import load_dataset

chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")

License

Apache 2.0 License

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support