Instructions for using bigcode/starcoder with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use bigcode/starcoder with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/starcoder")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
```
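Once the pipeline is created, you can call it directly on a code prompt. A minimal usage sketch (the prompt and generation parameters below are illustrative choices, not settings from the model card):

```python
# Illustrative usage of the pipeline created above; prompt and parameters are example values.
prompt = "def fibonacci(n):"
outputs = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.2)
print(outputs[0]["generated_text"])
```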
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bigcode/starcoder with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "bigcode/starcoder"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
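Because the server exposes an OpenAI-compatible API, you can also call it from Python. A minimal sketch using the openai client against the local server started above (the api_key value is just a placeholder; vLLM does not check it unless configured to):

```python
# Query the local vLLM server through its OpenAI-compatible API.
# Assumes the server above is listening on localhost:8000; the api_key is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
    model="bigcode/starcoder",
    prompt="def hello_world():",
    max_tokens=64,
    temperature=0.2,
)
print(response.choices[0].text)
```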
Use Docker:

```shell
docker model run hf.co/bigcode/starcoder
```
- SGLang
How to use bigcode/starcoder with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "bigcode/starcoder" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
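As with vLLM, the SGLang endpoint is OpenAI-compatible, so any HTTP client works. A minimal sketch using requests against the local server started above (the prompt and parameters are illustrative):

```python
# Call the local SGLang server's OpenAI-compatible completions endpoint.
# Assumes the server above is running on localhost:30000; prompt and parameters are example values.
import requests

payload = {
    "model": "bigcode/starcoder",
    "prompt": "def quicksort(arr):",
    "max_tokens": 128,
    "temperature": 0.2,
}
resp = requests.post("http://localhost:30000/v1/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```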
Use Docker images:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "bigcode/starcoder" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use bigcode/starcoder with Docker Model Runner:
```shell
docker model run hf.co/bigcode/starcoder
```
GPU requirement
How much RAM would meet the minimum requirement? I can't wait for some language-specific models, but buying an A100 is a bit out of my price range.
I second setting load_in_8bit=True, but be careful when setting device_map to "auto" if you only have 1 GPU, since it may offload some of the layers to the CPU. The BigCode model class does not have a flag you can set to True to offload them to the CPU. I ended up setting my own device_map dict.
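The original device_map was not shared in the thread; as a generic sketch, such a map could also be derived with accelerate's infer_auto_device_map instead of being written by hand (the memory limits and CPU-offload settings below are illustrative assumptions, not the poster's actual configuration):

```python
# Sketch: derive a custom device_map for bigcode/starcoder with accelerate instead of
# hand-writing the dict. Memory limits and offload choices are illustrative assumptions.
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

config = AutoConfig.from_pretrained("bigcode/starcoder")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "14GiB", "cpu": "48GiB"},      # leave headroom on a 16GB GPU
    no_split_module_classes=["GPTBigCodeBlock"],  # keep each transformer block on one device
    dtype=torch.int8,                             # estimate sizes for 8-bit weights
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    device_map=device_map,
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=True,  # needed when part of the map lands on the CPU
    ),
)
```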
Is it possible to run it in a 3070?
How about 4080 (16 GB)?
@cactusthecoder8 Could you probably share the device map that worked for you?
@AV99 how many GPUs do you have, and how much memory does each of them have?
How about 4080 (16 GB)?
I tried multiple configurations of the model, and nothing runs successfully with only 16GB unfortunately.
@cactusthecoder8 I initially started out with a single 16GB GPU, and with offloading between CPU and GPU (and an hour of inference time later) I was barely able to get a "Hello World" running.
I now have 4 GPUs with 16GB of memory each. Any suggestions?
Can I run it locally on my Mac Studio (M1 Max, 32 GB)?
@LouiSum that would be fun
You can try the ggml implementation, starcoder.cpp, to run the model locally on your M1 machine.
In fp16/bf16 the model takes ~32GB on a single GPU, and in 8-bit it requires ~22GB, so with 4 GPUs you can split this memory requirement by 4 and fit it in less than 10GB on each GPU using the following code (make sure you have accelerate installed, and bitsandbytes for 8-bit mode):
from transformers import AutoModelForCausalLM
import torch
def get_gpus_max_memory(max_memory):
max_memory = {i: max_memory for i in range(torch.cuda.device_count())}
return max_memory
# for example for a max use of 10GB per GPU
# for fp16, replace `load_in_8bit=True` with `torch_dtype=torch.float16`
model = AutoModelForCausalLM.from_pretrained(
"bigcode/starcoder",
device_map="auto",
load_in_8bit=True,
max_memory=get_gpus_max_memory("10GB"),
)
To understand the logic behind this, check the documentation or the blog post on handling large model inference.
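A short follow-up sketch of running generation once the model has been loaded this way (the tokenizer, prompt, and generation settings are additions for illustration, not part of the original snippet):

```python
# Illustrative follow-up: generate with the sharded 8-bit model loaded above.
# The tokenizer, prompt, and generation settings are additions for this example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
inputs = tokenizer("def print_hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```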
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "bigcode/starcoder"
device = "gpu" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, load_in_8bit=True).to(device)
this code snippet is giving me the following error:
python3.10/site-packages/transformers/modeling_utils.py", line 2009, in to
raise ValueError(
ValueError: .to is not supported for 4-bit or 8-bit bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype
I am unable to run it without the load_in_8bit flag. I have a single A6000 (48 GB) GPU.
Can anyone please help me with running inference on StarCoder?
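Going by the error message above, one possible fix is to let device_map place the 8-bit model and drop the .to(device) call entirely. A hedged sketch of the adjusted snippet (an illustration based on the error text, not a confirmed answer from the thread):

```python
# Possible fix suggested by the error message: an 8-bit bitsandbytes model is already
# placed on the correct devices by device_map, so the .to(device) call is removed.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    load_in_8bit=True,
)
inputs = tokenizer("def print_hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```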
To report progress after half a year:
- I was able to run multiple small models (7B) quickly and flawlessly on my RTX 4080 using the LM Studio server out of the box. I could go up to about 10B quantized, but models of that size are not common for some reason.
- Their utility is questionable, though. They are certainly not reliable enough to serve as the base of any useful system or process, even if only for personal use.
- They are usable only for experiments, learning, or just fun.