metadata

license: bigcode-openrail-m
base_model: bigcode/starcoder2-15b-instruct-v0.1
tags:
  - security
  - cybersecurity
  - secure-coding
  - ai-security
  - owasp
  - code-generation
  - qlora
  - lora
  - fine-tuned
  - securecode
datasets:
  - scthornton/securecode
library_name: peft
pipeline_tag: text-generation
language:
  - code
  - en

StarCoder2 15B SecureCode

Security-specialized code model fine-tuned on the SecureCode dataset

Dataset | Paper (arXiv:2512.18542) | Model Collection | perfecXion.ai

What This Model Does

This model generates secure code when developers ask about building features. Instead of producing vulnerable implementations (like 45% of AI-generated code does), it:

Identifies the security risks in common coding patterns
Provides vulnerable and secure implementations side by side
Explains how attackers would exploit the vulnerability
Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening

The model was fine-tuned on 2,185 security training examples covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).

Model Details


Base Model	StarCoder2 15B Instruct
Parameters	15B
Architecture	StarCoder2
Tier	Tier 3: Large Model
Method	QLoRA (4-bit NormalFloat quantization)
LoRA Rank	16 (alpha=32)
Target Modules	`q_proj, k_proj, v_proj, o_proj` (4 modules)
Training Data	scthornton/securecode (2,185 examples)
Hardware	NVIDIA A100 40GB

The base model, StarCoder2 15B Instruct, is BigCode's flagship code model, trained on The Stack v2. It offers broad language coverage and strong code understanding.
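
To get a feel for how small the adapter is relative to the base model, the LoRA settings in the table can be turned into a rough trainable-parameter count. The sketch below is illustrative only: the layer dimensions are assumptions about StarCoder2 15B that do not appear in this card, so check them against the base model's `config.json` before relying on the result.

```python
# Rough count of trainable LoRA parameters for the settings in the table.
# ASSUMPTION: the dimensions below are recalled StarCoder2-15B values and are
# not stated in this model card; verify them against the base model's config.
HIDDEN_SIZE = 6144   # assumed model width
NUM_LAYERS = 40      # assumed number of transformer blocks
KV_DIM = 4 * 128     # assumed 4 key/value heads of size 128 (grouped-query attention)
RANK = 16            # LoRA rank from the table

# (in_features, out_features) of each adapted attention projection
PROJECTIONS = {
    "q_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "k_proj": (HIDDEN_SIZE, KV_DIM),
    "v_proj": (HIDDEN_SIZE, KV_DIM),
    "o_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds two low-rank matrices, A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable adapter parameters ({total / 15e9:.2%} of 15B)")
```

Under these assumed dimensions the adapter is a small fraction of one percent of the base model's weights, which is what makes attention-only LoRA practical on a 40GB card.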

Quick Start

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load with 4-bit quantization (matches training)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("scthornton/starcoder2-15b-securecode")
model = PeftModel.from_pretrained(base_model, "scthornton/starcoder2-15b-securecode")

# Ask a security-relevant coding question
messages = [
    {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
]

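# add_generation_prompt=True appends the assistant turn marker so the model
# answers the question instead of continuing the user turn
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)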
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Dataset

Trained on the full SecureCode unified dataset:

2,185 total examples (1,435 web security + 750 AI/ML security)
20 vulnerability categories across OWASP Top 10 2021 and OWASP LLM Top 10 2025
12+ programming languages and 49+ frameworks
4-turn conversational structure: feature request, vulnerable/secure implementations, advanced probing, operational guidance
100% incident grounding: every example tied to real CVEs, vendor advisories, or published attack research
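
The 4-turn structure can be pictured as a short chat transcript. The sketch below is a hypothetical skeleton, not the dataset's actual schema: the `role`/`content` keys follow the common chat-message convention, and the assignment of the four stages to alternating user and assistant turns is an assumption.

```python
# Hypothetical skeleton of one 4-turn example. The keys and the role assigned
# to each stage are assumptions; the real dataset schema may differ.
TURN_PURPOSES = [
    ("user", "feature request"),
    ("assistant", "vulnerable and secure implementations"),
    ("user", "advanced probing"),
    ("assistant", "operational guidance"),
]


def build_skeleton(purposes=TURN_PURPOSES):
    """Return placeholder messages, one per stage of the conversation."""
    return [{"role": role, "content": f"<{purpose}>"} for role, purpose in purposes]


def is_well_formed(messages):
    """Check for four non-empty turns alternating user/assistant."""
    expected_roles = ["user", "assistant", "user", "assistant"]
    roles_ok = [m["role"] for m in messages] == expected_roles
    return roles_ok and all(m["content"] for m in messages)


example = build_skeleton()
print(is_well_formed(example))
```

A check like `is_well_formed` can serve as a quick sanity filter before formatting examples with a chat template.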

Hyperparameters

Parameter	Value
LoRA rank	16
LoRA alpha	32
LoRA dropout	0.05
Target modules	`q_proj, k_proj, v_proj, o_proj` (4 attention projection layers)
Quantization	4-bit NormalFloat (NF4)
Learning rate	2e-4
LR scheduler	Cosine with 100-step warmup
Epochs	3
Per-device batch size	1
Gradient accumulation	16 steps
Effective batch size	16
Max sequence length	4096 tokens
Optimizer	paged_adamw_8bit
Precision	bf16

Notes: The LoRA adapter is kept compact, targeting only the four attention projection layers, to fit the tight memory budget of the A100 40GB.
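
The table above can be written out as keyword arguments and used for some quick arithmetic on the training schedule. This is a reconstruction, not the author's training script: the argument names are those accepted by `peft.LoraConfig` and `transformers.TrainingArguments`, and the step counts assume a single GPU, all 2,185 examples used for training, and a partial final batch counting as one optimizer step.

```python
import math

# Values from the hyperparameter table, keyed by the argument names that
# peft.LoraConfig and transformers.TrainingArguments accept.
lora_kwargs = dict(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
training_kwargs = dict(
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    optim="paged_adamw_8bit",
    bf16=True,
)

# ASSUMPTIONS (not stated in the card): one device, no held-out split,
# and a partial final accumulation window still counts as a step.
NUM_EXAMPLES = 2185
NUM_DEVICES = 1

effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
    * NUM_DEVICES
)
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]
warmup_fraction = training_kwargs["warmup_steps"] / total_steps

print(effective_batch, steps_per_epoch, total_steps, f"{warmup_fraction:.0%}")
```

Under these assumptions the run takes 411 optimizer steps, so the 100-step warmup covers roughly a quarter of training. The dictionaries could be unpacked into the corresponding config classes, for example `LoraConfig(**lora_kwargs)`.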

Security Coverage

Web Security (1,435 examples)

OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.

Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.

AI/ML Security (750 examples)

OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.

Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.

SecureCode Model Collection

This model is part of the SecureCode collection of 8 security-specialized models:

Model	Base	Size	Tier	HuggingFace
Llama 3.2 SecureCode	meta-llama/Llama-3.2-3B-Instruct	3B	Accessible	`llama-3.2-3b-securecode`
Qwen2.5 Coder SecureCode	Qwen/Qwen2.5-Coder-7B-Instruct	7B	Mid-size	`qwen2.5-coder-7b-securecode`
DeepSeek Coder SecureCode	deepseek-ai/deepseek-coder-6.7b-instruct	6.7B	Mid-size	`deepseek-coder-6.7b-securecode`
CodeGemma SecureCode	google/codegemma-7b-it	7B	Mid-size	`codegemma-7b-securecode`
CodeLlama SecureCode	codellama/CodeLlama-13b-Instruct-hf	13B	Large	`codellama-13b-securecode`
Qwen2.5 Coder 14B SecureCode	Qwen/Qwen2.5-Coder-14B-Instruct	14B	Large	`qwen2.5-coder-14b-securecode`
StarCoder2 SecureCode	bigcode/starcoder2-15b-instruct-v0.1	15B	Large	`starcoder2-15b-securecode`
Granite 20B Code SecureCode	ibm-granite/granite-20b-code-instruct-8k	20B	XL	`granite-20b-code-securecode`

Choose based on your deployment constraints: 3B for edge/mobile, 7B for general use, 13B-15B for deeper reasoning, 20B for maximum capability.
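
One way to make that choice concrete is to estimate the 4-bit weight footprint of each model and pick the largest one that fits a memory budget. The sketch below is a back-of-the-envelope helper under stated assumptions: it counts only quantized weights at 0.5 bytes per parameter, ignores quantization constants, activations, and the KV cache, and uses an arbitrary `headroom` factor to leave room for them. The function names are hypothetical.

```python
# Models from the collection table: (HuggingFace name, parameters in billions).
MODELS = [
    ("llama-3.2-3b-securecode", 3.0),
    ("deepseek-coder-6.7b-securecode", 6.7),
    ("qwen2.5-coder-7b-securecode", 7.0),
    ("codegemma-7b-securecode", 7.0),
    ("codellama-13b-securecode", 13.0),
    ("qwen2.5-coder-14b-securecode", 14.0),
    ("starcoder2-15b-securecode", 15.0),
    ("granite-20b-code-securecode", 20.0),
]

BYTES_PER_PARAM_4BIT = 0.5  # 4 bits per weight


def weight_gb(params_billion: float) -> float:
    """Approximate size of the 4-bit weights in GB (1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM_4BIT


def largest_fitting(vram_gb: float, headroom: float = 0.5):
    """Return the largest model whose weights fit in vram_gb * headroom.

    headroom=0.5 is an illustrative margin, not a measured requirement.
    Returns None when no model fits.
    """
    budget = vram_gb * headroom
    fitting = [(size, name) for name, size in MODELS if weight_gb(size) <= budget]
    return max(fitting)[1] if fitting else None


print(largest_fitting(16))
print(largest_fitting(24))
```

Real memory use depends on sequence length and batch size, so treat the result as a starting point and measure on the target hardware.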

SecureCode Dataset Family

Dataset	Examples	Focus	Link
SecureCode	2,185	Unified (web + AI/ML)	scthornton/securecode
SecureCode Web	1,435	Web security (OWASP Top 10 2021)	scthornton/securecode-web
SecureCode AI/ML	750	AI/ML security (OWASP LLM Top 10 2025)	scthornton/securecode-aiml

Intended Use

Use this model for:

Training AI coding assistants to write secure code
Security education and training
Vulnerability research and secure code review
Building security-aware development tools

Do not use this model for:

Offensive exploitation or automated attack generation
Circumventing security controls
Any activity that violates the base model's license

Citation

@misc{thornton2026securecode,
  title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
  author={Thornton, Scott},
  year={2026},
  publisher={perfecXion.ai},
  url={https://huggingface.co/datasets/scthornton/securecode},
  note={arXiv:2512.18542}
}

License

This model is released under the bigcode-openrail-m license (inherited from the base model). The training dataset (SecureCode) is licensed under CC BY-NC-SA 4.0.
