metadata
license: bigcode-openrail-m
base_model: bigcode/starcoder2-15b-instruct-v0.1
tags:
  - security
  - cybersecurity
  - secure-coding
  - ai-security
  - owasp
  - code-generation
  - qlora
  - lora
  - fine-tuned
  - securecode
datasets:
  - scthornton/securecode
library_name: peft
pipeline_tag: text-generation
language:
  - code
  - en

StarCoder2 15B SecureCode


Security-specialized code model fine-tuned on the SecureCode dataset

Dataset | Paper (arXiv:2512.18542) | Model Collection | perfecXion.ai


What This Model Does

This model generates secure code when developers ask about building features. Instead of producing vulnerable implementations (as 45% of AI-generated code does), it:

  • Identifies the security risks in common coding patterns
  • Provides vulnerable and secure implementations side by side
  • Explains how attackers would exploit the vulnerability
  • Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening

The model was fine-tuned on 2,185 security training examples covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).

Model Details

| | |
|---|---|
| Base Model | StarCoder2 15B Instruct |
| Parameters | 15B |
| Architecture | StarCoder2 |
| Tier | Tier 3: Large Model |
| Method | QLoRA (4-bit NormalFloat quantization) |
| LoRA Rank | 16 (alpha=32) |
| Target Modules | q_proj, k_proj, v_proj, o_proj (4 modules) |
| Training Data | scthornton/securecode (2,185 examples) |
| Hardware | NVIDIA A100 40GB |

The base model, StarCoder2, is BigCode's flagship model trained on The Stack v2. It offers broad language coverage and strong code understanding.
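A back-of-the-envelope estimate helps show why 4-bit quantization matters on a 40GB card. The sketch below counts weight storage only: it ignores quantization constants, any layers kept in higher precision, LoRA adapter weights, optimizer state, and activations, so real usage will be higher. The helper name `weight_memory_gb` is illustrative and not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 15e9  # nominal parameter count listed in Model Details

nf4_gb = weight_memory_gb(PARAMS, 4)    # 4-bit NormalFloat weights
bf16_gb = weight_memory_gb(PARAMS, 16)  # unquantized bf16 weights, for comparison

print(f"NF4 weights:  ~{nf4_gb:.1f} GB")
print(f"bf16 weights: ~{bf16_gb:.1f} GB")
```

Under these assumptions the quantized weights take roughly a quarter of the space of bf16 weights, which leaves headroom on a 40GB device for activations and adapter training.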

Quick Start

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load with 4-bit quantization (matches training)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("scthornton/starcoder2-15b-securecode")
model = PeftModel.from_pretrained(base_model, "scthornton/starcoder2-15b-securecode")

# Ask a security-relevant coding question
messages = [
    {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
]

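inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # ask the template to open the assistant turn
    return_tensors="pt",
    return_dict=True,  # return input_ids and attention_mask together
).to(model.device)

# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))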

Training Details

Dataset

Trained on the full SecureCode unified dataset:

  • 2,185 total examples (1,435 web security + 750 AI/ML security)
  • 20 vulnerability categories across OWASP Top 10 2021 and OWASP LLM Top 10 2025
  • 12+ programming languages and 49+ frameworks
  • 4-turn conversational structure: feature request, vulnerable/secure implementations, advanced probing, operational guidance
  • 100% incident grounding: every example tied to real CVEs, vendor advisories, or published attack research
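To make the 4-turn structure concrete, here is a minimal sketch of how one conversation might be represented and checked. The `role`/`content` field names, the alternating user/assistant mapping, and the `check_four_turn` helper are assumptions for illustration, not the dataset's documented schema; the placeholder strings stand in for real content.

```python
# Hypothetical layout: the role assignment for each stage is an assumption.
EXPECTED_STAGES = [
    ("user", "feature request"),
    ("assistant", "vulnerable and secure implementations"),
    ("user", "advanced probing"),
    ("assistant", "operational guidance"),
]


def check_four_turn(messages: list) -> bool:
    """Return True if there are four non-empty turns with the expected roles."""
    if len(messages) != len(EXPECTED_STAGES):
        return False
    for message, (role, _stage) in zip(messages, EXPECTED_STAGES):
        if message.get("role") != role:
            return False
        if not str(message.get("content", "")).strip():
            return False
    return True


# Placeholder conversation; the content strings are not real dataset text.
example = [
    {"role": "user", "content": "<feature request>"},
    {"role": "assistant", "content": "<vulnerable and secure implementations>"},
    {"role": "user", "content": "<advanced probing question>"},
    {"role": "assistant", "content": "<operational guidance>"},
]

print(check_four_turn(example))
```

Inspect a real record from scthornton/securecode before relying on this shape.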

Hyperparameters

| Parameter | Value |
|---|---|
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | 4 linear layers |
| Quantization | 4-bit NormalFloat (NF4) |
| Learning rate | 2e-4 |
| LR scheduler | Cosine with 100-step warmup |
| Epochs | 3 |
| Per-device batch size | 1 |
| Gradient accumulation | 16x |
| Effective batch size | 16 |
| Max sequence length | 4096 tokens |
| Optimizer | paged_adamw_8bit |
| Precision | bf16 |
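The batch-size and scheduler values can be worked through as arithmetic. The sketch below assumes a single GPU, that all 2,185 examples are used for training (no held-out split), and that a partial final accumulation counts as one optimizer step; the exact step count reported by a given trainer may differ. The schedule is the common linear-warmup-then-cosine-decay shape, written out by hand rather than taken from the training script.

```python
import math

NUM_EXAMPLES = 2185   # dataset size from the card
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 16
EPOCHS = 3
WARMUP_STEPS = 100
PEAK_LR = 2e-4
NUM_DEVICES = 1       # assumption: single GPU

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch: {effective_batch}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
for step in (0, 50, WARMUP_STEPS, total_steps):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

Under these assumptions the 100-step warmup covers roughly a quarter of the run, so the learning rate spends a meaningful share of training below its peak.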

Notes: The LoRA adapter is compact and targets only the four attention projection layers (q_proj, k_proj, v_proj, o_proj), a choice driven by the tight A100 40GB memory budget.
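For a sense of how small an attention-only rank-16 adapter is, the sketch below counts the parameters LoRA adds. The model dimensions (hidden size 6144, 40 layers, 512-wide key/value projections from grouped-query attention) are assumptions recalled from the published StarCoder2-15B configuration, not values stated in this card; verify them against the base model's config.json before quoting the result.

```python
def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * (in_features + out_features)


RANK = 16  # from the hyperparameter table

# Assumed StarCoder2-15B dimensions; check config.json of the base model.
HIDDEN = 6144
NUM_LAYERS = 40
KV_DIM = 512  # assumed: 4 key/value heads x 128 head dimension

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
total = per_layer * NUM_LAYERS

print(f"{per_layer:,} LoRA parameters per layer")
print(f"{total:,} LoRA parameters in total")
```

If those dimensions hold, the adapter is a small fraction of a percent of the 15B base parameters, which is what keeps optimizer state and gradients cheap relative to full fine-tuning.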

Security Coverage

Web Security (1,435 examples)

OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.

Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.

AI/ML Security (750 examples)

OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.

Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.

SecureCode Model Collection

This model is part of the SecureCode collection of 8 security-specialized models:

| Model | Base | Size | Tier | HuggingFace |
|---|---|---|---|---|
| Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | llama-3.2-3b-securecode |
| Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | qwen2.5-coder-7b-securecode |
| DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | deepseek-coder-6.7b-securecode |
| CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | codegemma-7b-securecode |
| CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | codellama-13b-securecode |
| Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | qwen2.5-coder-14b-securecode |
| StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | starcoder2-15b-securecode |
| Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | granite-20b-code-securecode |

Choose based on your deployment constraints: 3B for edge/mobile, 7B for general use, 13B-15B for deeper reasoning, 20B for maximum capability.
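One way to apply that guidance programmatically is to pick the largest model that fits a parameter budget. This is a minimal sketch: the names and sizes are copied from the table above, while the `largest_within` helper and the idea of a single "parameter budget" are illustrative assumptions. Real selection also depends on memory, latency, and licensing.

```python
# (model name, size in billions of parameters), copied from the collection table
COLLECTION = [
    ("llama-3.2-3b-securecode", 3.0),
    ("deepseek-coder-6.7b-securecode", 6.7),
    ("qwen2.5-coder-7b-securecode", 7.0),
    ("codegemma-7b-securecode", 7.0),
    ("codellama-13b-securecode", 13.0),
    ("qwen2.5-coder-14b-securecode", 14.0),
    ("starcoder2-15b-securecode", 15.0),
    ("granite-20b-code-securecode", 20.0),
]


def largest_within(max_params_b: float) -> list:
    """Return the model(s) with the largest size not exceeding the budget."""
    eligible = [(name, size) for name, size in COLLECTION if size <= max_params_b]
    if not eligible:
        return []
    best = max(size for _name, size in eligible)
    return [name for name, size in eligible if size == best]


print(largest_within(8))
print(largest_within(15))
```

When two models tie on size, both are returned so the caller can choose between them on other grounds.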

SecureCode Dataset Family

| Dataset | Examples | Focus | Link |
|---|---|---|---|
| SecureCode | 2,185 | Unified (web + AI/ML) | scthornton/securecode |
| SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | scthornton/securecode-web |
| SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | scthornton/securecode-aiml |

Intended Use

Use this model for:

  • Training AI coding assistants to write secure code
  • Security education and training
  • Vulnerability research and secure code review
  • Building security-aware development tools

Do not use this model for:

  • Offensive exploitation or automated attack generation
  • Circumventing security controls
  • Any activity that violates the base model's license

Citation

@misc{thornton2026securecode,
  title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
  author={Thornton, Scott},
  year={2026},
  publisher={perfecXion.ai},
  url={https://huggingface.co/datasets/scthornton/securecode},
  note={arXiv:2512.18542}
}

License

This model is released under the bigcode-openrail-m license (inherited from the base model). The training dataset (SecureCode) is licensed under CC BY-NC-SA 4.0.