DKSplit v0.3.1

BiLSTM-CRF model for splitting concatenated strings into words. Trained on millions of domain names, brand names, personal names, and multilingual phrases.

Achieves 85% accuracy on real-world newly registered domains, outperforming WordSegment (54%) and WordNinja (46%).

Quick Start

pip install dksplit

import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split("spotifywrapped")
# ['spotify', 'wrapped']

dksplit.split("mercibeaucoup")
# ['merci', 'beaucoup']

dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]

Model Details

Property       Value
Architecture   BiLSTM-CRF
Parameters     9.47M
Embedding dim  384
Hidden dim     768
LSTM layers    3
Vocabulary     a-z, 0-9 (38 tokens incl. padding and unknown)
Max length     64 characters
Format         ONNX, INT8 quantized
Size           9 MB
Inference      CPU only; no GPU required
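
The 38-token vocabulary can be reconstructed from the table: 36 characters plus two reserved IDs. A minimal encoding sketch, assuming the convention from the ONNX example below (ID 0 = padding, ID 1 = unknown); the exact mapping is not part of the published API.

```python
# Character vocabulary: a-z, 0-9 (36 chars) plus padding (0) and unknown (1).
CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_MAP = {c: i + 2 for i, c in enumerate(CHARS)}

def encode(text, max_len=64):
    """Lowercase, map chars to IDs (1 = unknown), pad/truncate to max_len."""
    ids = [CHAR_MAP.get(c, 1) for c in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))

print(len(CHAR_MAP) + 2)  # 38 tokens in total
print(encode("abc", 8))   # [2, 3, 4, 0, 0, 0, 0, 0]
```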

Training

  • Infrastructure: Leonardo Booster supercomputer at CINECA, Italy (NVIDIA A100)
  • Compute: EuroHPC Joint Undertaking, project AIFAC_P02_281
  • Data: Millions of labeled samples covering domain names, brand names, tech terms, personal names, and multilingual phrases
  • Labels: Character-level B/I tags (B = first character of a word, I = continuation)
  • Optimizer: Adam, cosine LR schedule with warmup
  • Epochs: 15
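
The character-level labeling scheme above can be sketched as follows: given a gold segmentation, each word's first character is tagged B and the rest I. This is an illustration of the tagging convention, not the actual training pipeline.

```python
def bi_tags(words):
    """Map a segmented word list to per-character B/I tags."""
    tags = []
    for word in words:
        tags.append("B")                  # first character of each word
        tags.extend("I" * (len(word) - 1))  # continuation characters
    return tags

print(bi_tags(["chatgpt", "login"]))
# ['B', 'I', 'I', 'I', 'I', 'I', 'I', 'B', 'I', 'I', 'I', 'I']
```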

Benchmark

1,000 domains randomly sampled from the Newly Registered Domains Database (NRDS), April 2026 .com feed, with human-audited ground truth:

Model            Accuracy
DKSplit v0.3.1   85.0%
DKSplit v0.2.x   82.8%
WordSegment      54.0%
WordNinja        46.1%

~5% of test samples admit multiple valid segmentations. When any valid segmentation is accepted as correct, effective accuracy rises to roughly 90%.
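
The metric above can be sketched as exact-match accuracy against a set of accepted segmentations per input. The data below is illustrative, not taken from the benchmark set.

```python
def exact_match_accuracy(predictions, references):
    """predictions: list of word lists; references: list of sets of valid
    segmentations (each stored as a tuple of words)."""
    hits = sum(tuple(p) in refs for p, refs in zip(predictions, references))
    return hits / len(predictions)

preds = [["chatgpt", "login"], ["spot", "ify", "wrapped"]]
refs = [{("chatgpt", "login")}, {("spotify", "wrapped")}]
print(exact_match_accuracy(preds, refs))  # 0.5
```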

Examples

Input                DKSplit                WordSegment             WordNinja
chatgptprompts       chatgpt prompts        chat gpt prompts        chat gp t prompts
spotifywrapped       spotify wrapped        spot if y wrapped       spot if y wrapped
ethereumwallet       ethereum wallet        e there um wallet       e there um wallet
whatsappstatus       whatsapp status        what sapp status        what s app status
escribirenvozalta    escribir en voz alta   escribir env oz alta    es crib ire nv oz alta
candidiasenuncamais  candidiase nunca mais  candid iase nunca mais  can didi as e nun cama is

Using the ONNX Model Directly

The model outputs emission scores. CRF decoding is done separately using the parameters in dksplit.npz.

import numpy as np
import onnxruntime as ort

# Load model
sess = ort.InferenceSession("dksplit-int8.onnx")
crf = np.load("dksplit.npz")

# Encode input
CHAR_MAP = {c: i+2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}
text = "chatgptlogin"
ids = np.array([[CHAR_MAP.get(c, 1) for c in text]], dtype=np.int64)

# Get emissions
emissions = sess.run(["emissions"], {"chars": ids})[0]

# CRF Viterbi decode
trans = crf["transitions"]
start_t = crf["start_transitions"]
end_t = crf["end_transitions"]

score = start_t + emissions[0, 0]  # path scores for the first character's tags
history = []
for t in range(1, emissions.shape[1]):
    # ns[i, j]: best score of a path ending in tag i at t-1, moving to tag j at t
    ns = score[:, None] + trans + emissions[0, t, None, :]
    history.append(np.argmax(ns, axis=0))  # best predecessor for each tag j
    score = np.max(ns, axis=0)

# Backtrace from the best final tag
best = [np.argmax(score + end_t)]
for h in reversed(history):
    best.append(h[best[-1]])
best.reverse()

# Decode to words
words, cur = [], []
for ch, lb in zip(text, best):
    if lb == 1 and cur:  # tag 1 = B (word start): close the previous word
        words.append("".join(cur))
        cur = [ch]
    else:
        cur.append(ch)
if cur:
    words.append("".join(cur))
print(words)  # ['chatgpt', 'login']

Files

  • dksplit-int8.onnx - BiLSTM emissions model (INT8 quantized, 9 MB)
  • dksplit.npz - CRF parameters (transitions, start_transitions, end_transitions)

Intended Use

  • Domain name analysis and segmentation
  • Hashtag splitting
  • URL component extraction
  • Compound string decomposition
  • Any concatenated text without spaces
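
For hashtag splitting, input should first be normalized to the model's vocabulary (a-z, 0-9). A hedged preprocessing sketch; `dksplit.split` would then be applied to the normalized string:

```python
import re

def normalize_hashtag(tag):
    """Strip the leading '#', lowercase, and drop characters outside a-z / 0-9."""
    return re.sub(r"[^a-z0-9]", "", tag.lstrip("#").lower())

print(normalize_hashtag("#SpotifyWrapped"))  # spotifywrapped
# dksplit.split(normalize_hashtag("#SpotifyWrapped"))  # ['spotify', 'wrapped']
```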

Limitations

  • Latin script only (a-z, 0-9)
  • Max 64 characters
  • Accuracy is highest on English and major European languages
  • Some inputs are genuinely ambiguous
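
Given these limits, a caller may want to validate inputs before splitting. A small guard sketch based on the constraints above; it is not part of the dksplit API.

```python
import re

# Model constraints: vocabulary a-z / 0-9, at most 64 characters.
VALID = re.compile(r"[a-z0-9]{1,64}")

def is_splittable(text):
    """True if text fits the model's vocabulary and length limit."""
    return VALID.fullmatch(text) is not None

print(is_splittable("chatgptlogin"))  # True
print(is_splittable("héllo"))         # False: character outside a-z / 0-9
print(is_splittable("a" * 65))        # False: exceeds 64 characters
```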

Acknowledgements

The v0.3.1 model was trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (project AIFAC_P02_281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.

License

Apache 2.0

Please attribute as: DKSplit by ABTdomain
