Can you please provide GGUF versions of the captioner and transcriber models?
As the title says, it would be nice to have GGUFs so that more people can run these models even with limited VRAM.
I'm trying to figure out how to run it. I must be missing the example code or something.
It's based on https://huggingface.co/Qwen/Qwen2.5-Omni-7B, as stated in the model card.
If you can run that, it's the same. If you don't need audio output (for this model we only need text output anyway), call model.disable_talker() after loading the model, or pass return_audio=False to the generate function. That produces output much faster. All of this is covered on the Qwen2.5-Omni-7B model page.
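A minimal sketch of what that looks like, assuming the transformers class and processor names from the Qwen2.5-Omni model card (they may differ across transformers versions; the loading is wrapped in a function so nothing is downloaded at import time):

```python
MODEL_ID = "Qwen/Qwen2.5-Omni-7B"  # base model; swap in the fine-tune you're running

def load_text_only_model():
    # Imports kept inside the function: older transformers releases
    # don't ship the Qwen2.5-Omni classes.
    from transformers import (
        Qwen2_5OmniForConditionalGeneration,
        Qwen2_5OmniProcessor,
    )

    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    model.disable_talker()  # we only need text output, so drop the audio head
    processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
    return model, processor

# At generation time you can also skip audio synthesis per call:
#   text_ids = model.generate(**inputs, return_audio=False)
```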
I think we have to. I tried 48000 and got this error:

```
ValueError: The model corresponding to this feature extractor: WhisperFeatureExtractor was trained using a sampling rate of 16000. Please make sure that the provided raw_speech input was sampled with 16000 and not 48000.
```
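Resampling to 16 kHz before handing the waveform to the feature extractor avoids that error. A small sketch using scipy (librosa or torchaudio would work just as well; the function name is mine):

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to the 16 kHz that WhisperFeatureExtractor expects."""
    if orig_sr == 16000:
        return audio
    # resample_poly uses a polyphase filter; dividing by the gcd keeps
    # the up/down factors small and the filter cheap.
    g = np.gcd(orig_sr, 16000)
    return resample_poly(audio, 16000 // g, orig_sr // g)

# e.g. one second of 48 kHz audio becomes exactly 16000 samples
one_sec = np.zeros(48000, dtype=np.float32)
assert to_16k(one_sec, 48000).shape[0] == 16000
```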
Yes, we apparently do, which is mind-boggling. I'll say this: my way uses 23 GB (cutting it close), whereas the way I was told was back up to 32.x GB.
I'm loading it with 4-bit bitsandbytes; it takes around 10 GB and fits nicely in my 12 GB 3060.
Make sure not to transcribe an instrumental track, or it produces weird, crazy output. edit: I will never use 4-bit, as it lobotomizes the model far too much. YMMV.
