DKSplit v0.3.1

BiLSTM-CRF model for splitting concatenated strings into words. Trained on millions of domain names, brand names, personal names, and multilingual phrases.

Achieves 85% accuracy on real-world newly registered domains, outperforming WordSegment (54%) and WordNinja (46%).

Quick Start

pip install dksplit

import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split("spotifywrapped")
# ['spotify', 'wrapped']

dksplit.split("mercibeaucoup")
# ['merci', 'beaucoup']

dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]

Model Details

Property       Value
Architecture   BiLSTM-CRF
Parameters     9.47M
Embedding dim  384
Hidden dim     768
LSTM layers    3
Vocabulary     a-z, 0-9 (38 tokens incl. padding and unknown)
Max length     64 characters
Format         ONNX, INT8 quantized
Size           9 MB
Inference      CPU only; no GPU required
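
The 38-token vocabulary can be reconstructed from the table: 36 characters plus two reserved IDs. A minimal encoding sketch, assuming the convention from the ONNX example below (ID 0 = padding, ID 1 = unknown); the exact mapping is not part of the published API.

```python
# Character vocabulary: a-z, 0-9 (36 chars) plus padding (0) and unknown (1).
CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_MAP = {c: i + 2 for i, c in enumerate(CHARS)}

def encode(text, max_len=64):
    """Lowercase, map chars to IDs (1 = unknown), pad/truncate to max_len."""
    ids = [CHAR_MAP.get(c, 1) for c in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))

print(len(CHAR_MAP) + 2)  # 38 tokens in total
print(encode("abc", 8))   # [2, 3, 4, 0, 0, 0, 0, 0]
```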

Training

  • Infrastructure: Leonardo Booster supercomputer at CINECA, Italy (NVIDIA A100)
  • Compute: EuroHPC Joint Undertaking, project AIFAC_P02_281
  • Data: Millions of labeled samples covering domain names, brand names, tech terms, personal names, and multilingual phrases
  • Labels: Character-level B/I tags (B = first character of a word, I = continuation)
  • Optimizer: Adam, cosine LR schedule with warmup
  • Epochs: 15
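
The character-level labeling scheme above can be sketched as follows: given a gold segmentation, each word's first character is tagged B and the rest I. This is an illustration of the tagging convention, not the actual training pipeline.

```python
def bi_tags(words):
    """Map a segmented word list to per-character B/I tags."""
    tags = []
    for word in words:
        tags.append("B")                  # first character of each word
        tags.extend("I" * (len(word) - 1))  # continuation characters
    return tags

print(bi_tags(["chatgpt", "login"]))
# ['B', 'I', 'I', 'I', 'I', 'I', 'I', 'B', 'I', 'I', 'I', 'I']
```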

Benchmark

1,000 domains randomly sampled from the Newly Registered Domains Database (NRDS), April 2026 .com feed, with human-audited ground truth:

Model            Accuracy
DKSplit v0.3.1   85.0%
DKSplit v0.2.x   82.8%
WordSegment      54.0%
WordNinja        46.1%

~5% of test samples admit multiple valid segmentations. When any valid segmentation is accepted as correct, effective accuracy rises to roughly 90%.
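
The metric above can be sketched as exact-match accuracy against a set of accepted segmentations per input. The data below is illustrative, not taken from the benchmark set.

```python
def exact_match_accuracy(predictions, references):
    """predictions: list of word lists; references: list of sets of valid
    segmentations (each stored as a tuple of words)."""
    hits = sum(tuple(p) in refs for p, refs in zip(predictions, references))
    return hits / len(predictions)

preds = [["chatgpt", "login"], ["spot", "ify", "wrapped"]]
refs = [{("chatgpt", "login")}, {("spotify", "wrapped")}]
print(exact_match_accuracy(preds, refs))  # 0.5
```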

Examples

Input                DKSplit                WordSegment             WordNinja
chatgptprompts       chatgpt prompts        chat gpt prompts        chat gp t prompts
spotifywrapped       spotify wrapped        spot if y wrapped       spot if y wrapped
ethereumwallet       ethereum wallet        e there um wallet       e there um wallet
whatsappstatus       whatsapp status        what sapp status        what s app status
escribirenvozalta    escribir en voz alta   escribir env oz alta    es crib ire nv oz alta
candidiasenuncamais  candidiase nunca mais  candid iase nunca mais  can didi as e nun cama is

Using the ONNX Model Directly

The model outputs emission scores. CRF decoding is done separately using the parameters in dksplit.npz.

import numpy as np
import onnxruntime as ort

# Load model
sess = ort.InferenceSession("dksplit-int8.onnx")
crf = np.load("dksplit.npz")

# Encode input
CHAR_MAP = {c: i+2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}
text = "chatgptlogin"
ids = np.array([[CHAR_MAP.get(c, 1) for c in text]], dtype=np.int64)

# Get emissions
emissions = sess.run(["emissions"], {"chars": ids})[0]

# CRF Viterbi decode
trans = crf["transitions"]
start_t = crf["start_transitions"]
end_t = crf["end_transitions"]

score = start_t + emissions[0, 0]  # path scores for the first character's tags
history = []
for t in range(1, emissions.shape[1]):
    # ns[i, j]: best score of a path ending in tag i at t-1, moving to tag j at t
    ns = score[:, None] + trans + emissions[0, t, None, :]
    history.append(np.argmax(ns, axis=0))  # best predecessor for each tag j
    score = np.max(ns, axis=0)

# Backtrace from the best final tag
best = [np.argmax(score + end_t)]
for h in reversed(history):
    best.append(h[best[-1]])
best.reverse()

# Decode to words
words, cur = [], []
for ch, lb in zip(text, best):
    if lb == 1 and cur:  # tag 1 = B (word start): close the previous word
        words.append("".join(cur))
        cur = [ch]
    else:
        cur.append(ch)
if cur:
    words.append("".join(cur))
print(words)  # ['chatgpt', 'login']

Files

  • dksplit-int8.onnx - BiLSTM emissions model (INT8 quantized, 9 MB)
  • dksplit.npz - CRF parameters (transitions, start_transitions, end_transitions)

Intended Use

  • Domain name analysis and segmentation
  • Hashtag splitting
  • URL component extraction
  • Compound string decomposition
  • Any concatenated text without spaces
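
For hashtag splitting, input should first be normalized to the model's vocabulary (a-z, 0-9). A hedged preprocessing sketch; `dksplit.split` would then be applied to the normalized string:

```python
import re

def normalize_hashtag(tag):
    """Strip the leading '#', lowercase, and drop characters outside a-z / 0-9."""
    return re.sub(r"[^a-z0-9]", "", tag.lstrip("#").lower())

print(normalize_hashtag("#SpotifyWrapped"))  # spotifywrapped
# dksplit.split(normalize_hashtag("#SpotifyWrapped"))  # ['spotify', 'wrapped']
```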

Limitations

  • Latin script only (a-z, 0-9)
  • Max 64 characters
  • Accuracy is highest on English and major European languages
  • Some inputs are genuinely ambiguous
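
Given these limits, a caller may want to validate inputs before splitting. A small guard sketch based on the constraints above; it is not part of the dksplit API.

```python
import re

# Model constraints: vocabulary a-z / 0-9, at most 64 characters.
VALID = re.compile(r"[a-z0-9]{1,64}")

def is_splittable(text):
    """True if text fits the model's vocabulary and length limit."""
    return VALID.fullmatch(text) is not None

print(is_splittable("chatgptlogin"))  # True
print(is_splittable("héllo"))         # False: character outside a-z / 0-9
print(is_splittable("a" * 65))        # False: exceeds 64 characters
```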

Acknowledgements

The v0.3.1 model was trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (project AIFAC_P02_281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.

License

Apache 2.0

Please attribute as: DKSplit by ABTdomain
