Combined Myanmar LLM Model
A comprehensive dataset combining three Myanmar-related datasets for training large language models, optimized for code generation and Myanmar language understanding.
English
Overview
This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities:
| Source | Dataset | Description | Type | Samples |
|---|---|---|---|---|
| chat-skill.md | amkyawdev/myanmar-llm-data | Myanmar conversations, translations, Q&A | Chat/Skill | ~54,553 |
| agent-skill.md | amkyawdev/mm-llm-coder-agent-dataset | Agent workflow for coding tasks | Agent/Skill | ~1,000,020 |
| code-skill.md | amkyawdev/mm-llm-coder-dataset | Code generation and debugging | Code/Skill | ~2,000,000 |
Total Samples: ~3,020,347
Data Sources
1. chat-skill.md - Myanmar LLM Data (amkyawdev/myanmar-llm-data)
Multi-turn conversations in Burmese and English:
- Format:
messages(role + content),tags - Link: View Dataset
2. agent-skill.md - Coder Agent Dataset (amkyawdev/mm-llm-coder-agent-dataset)
Agent workflows with tool usage:
- Format: Agent workflows with
tools_used,code_snippets,execution_result - Link: View Dataset
3. code-skill.md - Coder Dataset (amkyawdev/mm-llm-coder-dataset)
Code generation and debugging tasks:
- Format: Code Q&A conversations
- Link: View Dataset
Features
- Myanmar Language Support: Native Burmese (မြန်မာစာ) conversations and translations
- Code Generation: Python, JavaScript, TypeScript and other programming languages
- Agent Workflows: Multi-step coding tasks with tool usage
- Quality Metrics: Ratings, validation status, and complexity scores
Usage
from datasets import load_dataset
# Load chat-skill dataset (Myanmar conversations)
chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
print("Chat data:", chat_ds)
# Load agent-skill dataset (Agent workflows)
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
print("Agent data:", agent_ds)
# Load code-skill dataset (Code generation)
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
print("Code data:", code_ds)
# Access specific samples
chat_sample = chat_ds["train"][0]
print("Messages:", chat_sample["messages"])
print("Tags:", chat_sample["tags"])
Use Cases
- Myanmar Language Models: Train LLMs that understand Burmese
- Code Generation: Train models for programming tasks
- Agent Workflows: Train coding agents with tool usage
- Debugging: Fix common coding errors
- Multilingual Tasks: Translation between English and Myanmar
License
Apache 2.0 License
မြန်မာဘာသာ
အနှစ်ချူပါ
ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် dataset ၃ ခုကို ပေါင်းစပ်ထားပါ။
| Source | Dataset | Description | Samples |
|---|---|---|---|
| chat-skill.md | amkyawdev/myanmar-llm-data |
မြန်မာစာပါးဆက် | ~54,553 |
| agent-skill.md | amkyawdev/mm-llm-coder-agent-dataset |
Agent workflow | ~1,000,020 |
| code-skill.md | amkyawdev/mm-llm-coder-dataset |
ကုဒ်ထုတ်လုပ်ခြင်း | ~2,000,000 |
ပါဝင်မှု စုစုပါး: ~3,020,347
သုံးပါ
from datasets import load_dataset
chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
License
Apache 2.0 License
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support