Can you please provide GGUF versions of the captioner and transcriber models?
As the title says, it would be nice to have GGUFs so that more people can run these models even with limited VRAM.
I'm trying to figure out how to run it. I must be missing the example code or something.
It's based on https://huggingface.co/Qwen/Qwen2.5-Omni-7B, as stated in the model card.
If you can run that, it's the same. If you don't need audio output (for this model we only need text output anyway), call model.disable_talker() after loading the model, or pass return_audio=False to the generate function. That produces output much faster. All of this is covered on the Qwen2.5-Omni-7B model page.
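A minimal sketch of what that looks like, assuming the transformers class and processor names from the Qwen2.5-Omni model card (they may differ across transformers versions; the loading is wrapped in a function so nothing is downloaded at import time):

```python
MODEL_ID = "Qwen/Qwen2.5-Omni-7B"  # base model; swap in the fine-tune you're running

def load_text_only_model():
    # Imports kept inside the function: older transformers releases
    # don't ship the Qwen2.5-Omni classes.
    from transformers import (
        Qwen2_5OmniForConditionalGeneration,
        Qwen2_5OmniProcessor,
    )

    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    model.disable_talker()  # we only need text output, so drop the audio head
    processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
    return model, processor

# At generation time you can also skip audio synthesis per call:
#   text_ids = model.generate(**inputs, return_audio=False)
```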
I think we have to. I tried 48000 and got this error:

```
ValueError: The model corresponding to this feature extractor: WhisperFeatureExtractor was trained using a sampling rate of 16000. Please make sure that the provided raw_speech input was sampled with 16000 and not 48000.
```
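Resampling to 16 kHz before handing the waveform to the feature extractor avoids that error. A small sketch using scipy (librosa or torchaudio would work just as well; the function name is mine):

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to the 16 kHz that WhisperFeatureExtractor expects."""
    if orig_sr == 16000:
        return audio
    # resample_poly uses a polyphase filter; dividing by the gcd keeps
    # the up/down factors small and the filter cheap.
    g = np.gcd(orig_sr, 16000)
    return resample_poly(audio, 16000 // g, orig_sr // g)

# e.g. one second of 48 kHz audio becomes exactly 16000 samples
one_sec = np.zeros(48000, dtype=np.float32)
assert to_16k(one_sec, 48000).shape[0] == 16000
```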
Yes, we apparently do, which is mind-boggling. I'll say this: my way uses 23 GB (cutting it close), whereas the way I was told was back up to 32.x GB.
I'm loading it with 4-bit bitsandbytes; it takes around 10 GB and fits nicely in my 12 GB 3060.
Make sure not to transcribe an instrumental track, or it produces weird, crazy output. edit: I will never use 4-bit, as it lobotomizes the model far too much. YMMV.
