ONNX flavor of https://huggingface.co/openai/gpt-oss-20b.

This ONNX export uses int4 quantization.

With the embeddings pinned to the CPU, it runs well on GPUs with 12 GB of VRAM.
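
The execution provider passed to InferenceSession must be present in your onnxruntime build. A quick way to check what your install supports, using onnxruntime's public get_available_providers() API:

import onnxruntime

# Lists the execution providers compiled into this onnxruntime build,
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'].
print(onnxruntime.get_available_providers())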

Usage

ONNXRuntime

from transformers import AutoConfig, AutoTokenizer, GenerationConfig
import onnxruntime
import numpy as np

# 1. Load config, processor, and model
model_id = "onnx-community/gpt-oss-20b-ONNX"
config = AutoConfig.from_pretrained(model_id)
generation_config = GenerationConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_path = "/path/to/onnx/model_q4f16.onnx" # NB: Add .onnx_data* files to the same directory as the model file
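# Provider support depends on your onnxruntime build; if 'WebGpuExecutionProvider' is
# unavailable, 'CUDAExecutionProvider' or 'CPUExecutionProvider' are common alternatives.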
decoder_session = onnxruntime.InferenceSession(model_path, providers=['WebGpuExecutionProvider'])

## Set config values
num_key_value_heads = config.num_key_value_heads
head_dim = config.head_dim
num_hidden_layers = config.num_hidden_layers
eos_token_id = generation_config.eos_token_id
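# Note: eos_token_id may be a single id or a list of ids; np.isin in the generation loop handles both.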

# 2. Prepare inputs
messages = [
  { "role": "user", "content": "Write me a poem about Machine Learning." },
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="np")
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
batch_size = input_ids.shape[0]
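# Initialize an empty KV cache (sequence length 0 per layer); the generation loop
# replaces it each step with the 'present' key/values returned by the session.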
past_key_values = {
    f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float16)
    for layer in range(num_hidden_layers)
    for kv in ('key', 'value')
}

# 3. Generation loop
max_new_tokens = 1024
generated_tokens = np.array([[]], dtype=np.int64)
for i in range(max_new_tokens):
  logits, *present_key_values = decoder_session.run(None, dict(
      input_ids=input_ids,
      attention_mask=attention_mask,
      **past_key_values,
  ))

  ## Update values for next generation loop
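  ## (Greedy decoding: pick the most likely token at the last position.)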
  input_ids = logits[:, -1].argmax(-1, keepdims=True)
  attention_mask = np.concatenate([attention_mask, np.ones_like(input_ids, dtype=np.int64)], axis=-1)
  for j, key in enumerate(past_key_values):
    past_key_values[key] = present_key_values[j]

  generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
  if np.isin(input_ids, eos_token_id).any():
    break

  ## (Optional) Streaming
  print(tokenizer.decode(input_ids[0]), end='', flush=True)
print()

# 4. Output result
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
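
The loop above decodes greedily via argmax. For more varied output, a minimal temperature-sampling sketch could replace the argmax line; this is an illustrative assumption on top of the card's code, and the sample_next_token helper and temperature parameter are my own naming, not part of the model:

import numpy as np

def sample_next_token(logits, temperature=0.7, rng=np.random.default_rng()):
    # logits: (batch, seq, vocab) array returned by the decoder session.
    # Take the last position, scale by temperature, and sample from the softmax.
    scaled = logits[:, -1].astype(np.float64) / max(temperature, 1e-5)
    scaled -= scaled.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.array([[rng.choice(probs.shape[-1], p=p)] for p in probs], dtype=np.int64)

# Drop-in replacement for the greedy update inside the loop:
#   input_ids = sample_next_token(logits)

Greedy decoding stays deterministic, which is what the loop above does by default; sampling trades that determinism for diversity.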