# DKSplit v0.3.1
BiLSTM-CRF model for splitting concatenated strings into words. Trained on millions of domain names, brand names, personal names, and multilingual phrases.
85% accuracy on real-world newly registered domains, outperforming WordSegment (54%) and WordNinja (46%).
## Quick Start

```bash
pip install dksplit
```

```python
import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split("spotifywrapped")
# ['spotify', 'wrapped']

dksplit.split("mercibeaucoup")
# ['merci', 'beaucoup']

dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]
```
## Model Details
| Property | Value |
|---|---|
| Architecture | BiLSTM-CRF |
| Parameters | 9.47M |
| Embedding | 384 |
| Hidden | 768 |
| Layers | 3 |
| Vocab | a-z, 0-9 (38 tokens) |
| Max length | 64 characters |
| Format | ONNX INT8 quantized |
| Size | 9 MB |
| Inference | CPU only, no GPU required |
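The 38-token vocab in the table is the 36 characters (a-z, 0-9) plus two reserved ids. A minimal encoding sketch, assuming id 0 is padding and id 1 is unknown (the id 1 fallback and the `i + 2` offset match the ONNX decoding example later in this README; the padding assignment is an assumption):

```python
# Character encoding sketch: 36 vocab characters plus two reserved ids.
# Assumption: id 0 = padding, id 1 = unknown; a-z and 0-9 occupy ids 2..37.
CHAR_MAP = {c: i + 2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}

vocab_size = len(CHAR_MAP) + 2
print(vocab_size)  # 38

# Characters outside the vocab fall back to the unknown id 1.
ids = [CHAR_MAP.get(c, 1) for c in "chat-gpt"]
print(ids)  # [4, 9, 2, 21, 1, 8, 17, 21]
```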
## Training
- Infrastructure: Leonardo Booster supercomputer at CINECA, Italy (NVIDIA A100)
- Compute: EuroHPC Joint Undertaking, project AIFAC_P02_281
- Data: Millions of labeled samples covering domain names, brand names, tech terms, personal names, and multilingual phrases
- Labels: Character-level B/I tags (B = first character of a word, I = continuation of the current word)
- Optimizer: Adam, cosine LR schedule with warmup
- Epochs: 15
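The B/I labeling scheme above can be sketched as follows (`bi_tags` is an illustrative helper for producing training labels, not part of the dksplit API):

```python
# Turn a gold-standard word split into character-level B/I tags:
# B marks the first character of each word, I marks every other character.
def bi_tags(words):
    tags = []
    for word in words:
        tags.append("B")
        tags.extend("I" * (len(word) - 1))
    return tags

print(bi_tags(["chatgpt", "login"]))
# ['B', 'I', 'I', 'I', 'I', 'I', 'I', 'B', 'I', 'I', 'I', 'I']
```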
## Benchmark

1,000 domains randomly sampled from the Newly Registered Domains Database (NRDS), April 2026 .com feed, with human-audited ground truth:
| Model | Accuracy |
|---|---|
| DKSplit v0.3.1 | 85.0% |
| DKSplit v0.2.x | 82.8% |
| WordSegment | 54.0% |
| WordNinja | 46.1% |
About 5% of test samples admit multiple valid segmentations. Counting any valid split as correct, effective accuracy is closer to 90%.
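One way to score under that convention is to accept a prediction if it matches any human-approved segmentation for the input. A minimal sketch (the function and data here are illustrative, not the actual benchmark harness):

```python
# Exact-match accuracy where each input may have several accepted splits.
def accuracy(preds, gold_sets):
    hits = sum(tuple(p) in g for p, g in zip(preds, gold_sets))
    return hits / len(preds)

preds = [["chatgpt", "login"], ["spot", "ifywrapped"]]
gold_sets = [
    {("chatgpt", "login"), ("chat", "gpt", "login")},  # two accepted splits
    {("spotify", "wrapped")},
]
print(accuracy(preds, gold_sets))  # 0.5
```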
## Examples

| Input | DKSplit | WordSegment | WordNinja |
|---|---|---|---|
| chatgptprompts | chatgpt prompts | chat gpt prompts | chat gp t prompts |
| spotifywrapped | spotify wrapped | spot if y wrapped | spot if y wrapped |
| ethereumwallet | ethereum wallet | e there um wallet | e there um wallet |
| whatsappstatus | whatsapp status | what sapp status | what s app status |
| escribirenvozalta | escribir en voz alta | escribir env oz alta | es crib ire nv oz alta |
| candidiasenuncamais | candidiase nunca mais | candid iase nunca mais | can didi as e nun cama is |
## Using the ONNX Model Directly

The ONNX model outputs per-character emission scores; CRF Viterbi decoding is done separately using the parameters in `dksplit.npz`.
```python
import numpy as np
import onnxruntime as ort

# Load the emissions model and the CRF parameters
sess = ort.InferenceSession("dksplit-int8.onnx")
crf = np.load("dksplit.npz")

# Encode input: a-z and 0-9 map to ids 2..37; anything else maps to 1
CHAR_MAP = {c: i + 2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}
text = "chatgptlogin"
ids = np.array([[CHAR_MAP.get(c, 1) for c in text]], dtype=np.int64)

# Get per-character emission scores, shape (1, seq_len, num_tags)
emissions = sess.run(["emissions"], {"chars": ids})[0]

# CRF Viterbi decode
trans = crf["transitions"]
start_t = crf["start_transitions"]
end_t = crf["end_transitions"]

score = start_t + emissions[0, 0]
history = []
for t in range(1, emissions.shape[1]):
    ns = score[:, None] + trans + emissions[0, t, None, :]
    history.append(np.argmax(ns, axis=0))
    score = np.max(ns, axis=0)

# Backtrack the best tag sequence
best = [np.argmax(score + end_t)]
for h in reversed(history):
    best.append(h[best[-1]])
best.reverse()

# Decode tags to words: tag 1 (B) starts a new word
words, cur = [], []
for ch, lb in zip(text, best):
    if lb == 1 and cur:
        words.append("".join(cur))
        cur = [ch]
    else:
        cur.append(ch)
if cur:
    words.append("".join(cur))
print(words)  # ['chatgpt', 'login']
```
## Files

- `dksplit-int8.onnx`: BiLSTM emissions model (INT8 quantized, 9 MB)
- `dksplit.npz`: CRF parameters (transitions, start_transitions, end_transitions)
## Intended Use
- Domain name analysis and segmentation
- Hashtag splitting
- URL component extraction
- Compound string decomposition
- Any concatenated text without spaces
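Because the model only sees a-z and 0-9 (up to 64 characters), inputs such as hashtags or URL components need light normalization first. A minimal sketch (`normalize` is a hypothetical helper, not part of the dksplit API):

```python
import re

def normalize(s):
    # Lowercase and drop anything outside the model vocab (a-z, 0-9),
    # then truncate to the 64-character input limit.
    s = re.sub(r"[^a-z0-9]", "", s.lower())
    return s[:64]

print(normalize("#SpotifyWrapped"))  # spotifywrapped
```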
## Limitations
- Latin script only (a-z, 0-9)
- Max 64 characters
- Accuracy is highest on English and major European languages
- Some inputs are genuinely ambiguous
## Links
- PyPI: pypi.org/project/dksplit
- GitHub: github.com/ABTdomain/dksplit
- Go version: github.com/ABTdomain/dksplit-go
- Website: ABTdomain.com, DomainKits.com
## Acknowledgements
The v0.3.1 model was trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (project AIFAC_P02_281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.
## License
Apache 2.0
Please attribute as: DKSplit by ABTdomain