Recent DeepSeek, Qwen, and GLM models have impressive benchmark results. Do you use them through their own chatbots? Do you have any concerns about what happens to the data you put in there? If so, what do you do about it?
I am not trying to start a flame war around the China subject. It just so happens that these models are developed in China. My concerns with using the frontends, which are also developed in China, stem from:
- A pattern of many Chinese apps having been found in the past to have minimal security
- I don’t think any of the three listed above let you opt out of having your prompts used for model training
I am also not claiming that non-China-based chatbots don’t have privacy concerns, or that simply opting out of training gets you much on the privacy front.
Any idea about the minimum specs to run them locally? Especially VRAM.
I believe the full-size DeepSeek-R1 requires about 1200 GB of VRAM, but there are many configurations that require much less: quantization, MoE, and other hacks. I don’t have much experience with MoE, but I find that quantization tends to decrease performance significantly, at least with models from Mistral.
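As a rough sanity check on that figure: weight memory is roughly parameter count times bytes per parameter, and DeepSeek-R1 has about 671B total parameters. A back-of-the-envelope sketch (it ignores the KV cache and runtime overhead):

```python
# Rough weight-memory estimate: parameter count * bytes per parameter.
# Ignores KV cache, activations, and other runtime overhead.
PARAMS = 671e9  # DeepSeek-R1 total parameter count (~671B)

for name, bytes_per_param in [("FP16/BF16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gb = PARAMS * bytes_per_param / 1024**3
    print(f"{name}: ~{gb:,.0f} GB just for the weights")

# Prints roughly 1250, 625, and 312 GB respectively.
```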
VRAM vs RAM:
VRAM (Video RAM):
- Dedicated memory on your graphics card/GPU
- Used specifically for graphics processing and AI model computations
- Much faster for GPU operations
- Critical for running LLMs locally

RAM (System Memory):
- Main system memory used by the CPU and general operations
- Slower access for GPU computations
- Can be used as a fallback, but with a performance penalty
So, for basic 7B-parameter LLMs locally, you typically need:

Minimum: 8-12 GB VRAM
- Can run basic inference/tasks
- May require quantization (4-bit/8-bit)

Recommended: 16+ GB VRAM
- Smoother performance
- Handles larger context windows
- Runs without heavy quantization
Quantization means reducing the precision of the model’s weights and calculations to use less memory. For example, instead of storing numbers with full 32-bit precision, they’re compressed to 4-bit or 8-bit representations. This significantly reduces VRAM requirements but can slightly reduce model quality and accuracy.
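As an illustration of how that looks in practice, here is a minimal sketch of loading a model with 4-bit weights via Hugging Face transformers and bitsandbytes (the model ID is just an example, and an NVIDIA GPU with both libraries installed is assumed):

```python
# Load a 7B-class model with 4-bit weights, cutting weight memory roughly 4x vs FP16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example model; any causal LM repo works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize the weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",             # NormalFloat4, a common choice for LLM weights
    bnb_4bit_compute_dtype=torch.float16,  # run the actual matmuls in FP16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the GPU, spilling to CPU RAM if needed
)

prompt = "Explain quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```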
Options if you have less VRAM:
- CPU-only inference (very slow)
- Model offloading to system RAM
- Use smaller models (3B, 4B parameters)
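To make the offloading option concrete, here is a sketch using llama-cpp-python with an already-downloaded GGUF file (the file path and layer count are placeholders to adjust for your hardware):

```python
# Run a pre-quantized GGUF model with only part of it resident in VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-7b-instruct-q4_k_m.gguf",  # placeholder path to a 4-bit GGUF
    n_gpu_layers=20,  # layers kept in VRAM; -1 = all on GPU, 0 = CPU-only (slow)
    n_ctx=8192,       # context window; larger values need more memory
)

result = llm("Summarize what VRAM is in two sentences.", max_tokens=128)
print(result["choices"][0]["text"])
```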
Generally, the VRAM needed is slightly larger than the model’s file size, since you also need room for the context/KV cache on top of the weights. That’s an easy way to estimate VRAM requirements.
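That rule of thumb translates to a throwaway helper like this (a sketch; the overhead term is a rough guess and grows with the context length you use):

```python
import os

def estimate_vram_gb(model_path: str, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: model file size plus headroom for context/KV cache."""
    file_gb = os.path.getsize(model_path) / 1024**3
    return file_gb + overhead_gb

# e.g. a ~4.5 GB 7B 4-bit GGUF comes out to roughly 6-7 GB of VRAM with a modest context
```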
Thank you! This is valuable information.
Sorry for the slow reply, but I’ll piggyback on this thread to say that I tend to target models a little bit smaller than my total VRAM to leave room for a larger context window – without any offloading to RAM.
As an example, with 24 GB VRAM (Nvidia RTX 4090) I can typically get a 32B-parameter model with 4-bit quantization to run with a 40,000-token context entirely on the GPU at around 40 tokens/sec.
I ran some 7B models fine on my old laptop with 6 GB of VRAM. My new laptop has 16 GB of VRAM and runs 14B models fast. My phone, with 8 GB of regular RAM, can even run many 2B or 3B models, albeit slowly.