Do you use models developed in China?

RoadTrain@lemdro.id · 4 months ago

Do you use models developed in China?

MTK@lemmy.world · 4 months ago

Generally, the file size of the model is slightly larger than the VRAM needed. That’s an easy way to estimate VRAM requirements.

Valmond@lemmy.world · 4 months ago

Thank you! This is valuable information.

xodoh74984@lemmy.world · 4 months ago

Sorry for the slow reply, but I’ll piggyback on this thread to say that I tend to target models a little but smaller than my total VRAM to leave room for a larger context window – without any offloading to RAM.

As an example, with 24 GB VRAM (Nvidia 4090) I can typically get a 32b parameter model with 4-bit quantization to run with 40,000 tokens of context all on GPU at around 40 tokens/sec.