Recent DeepSeek, Qwen, and GLM models have posted impressive benchmark results. Do you use them through their own chatbots? Do you have any concerns about what happens to the data you enter there? If so, what do you do about it?

I am not trying to start a flame war about China; it just so happens that these models are developed there. My concerns with using the frontends, which are also developed in China, stem from:

  • A pattern of Chinese apps being found, in the past, to have minimal security
  • I don’t think any of the three chatbots listed above let you opt out of having your prompts used for model training

I am also not claiming that non-China-based chatbots don’t have privacy concerns, or that simply opting out of training gets you much on the privacy front.

  • greplinux@programming.dev · 10 hours ago

    VRAM vs RAM:

    VRAM (Video RAM): dedicated memory on your graphics card/GPU, used for graphics processing and AI model computations. The GPU can read it far faster than system RAM, which is why it is the critical resource for running LLMs locally.

    RAM (system memory): the main memory used by the CPU and general workloads. The GPU can only reach it over a much slower path, so it works as a fallback for model weights, but with a real performance penalty.
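
    If you want to see what your own machine has, here is a rough Python sketch (assuming an NVIDIA/CUDA GPU and that torch and psutil are installed; adjust for Apple Silicon or AMD):

        # Rough sketch: report system RAM vs GPU VRAM (assumes torch + psutil)
        import psutil
        import torch

        ram_gb = psutil.virtual_memory().total / 1024**3
        print(f"System RAM: {ram_gb:.1f} GB")

        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            vram_gb = props.total_memory / 1024**3
            print(f"GPU 0 ({props.name}): {vram_gb:.1f} GB VRAM")
        else:
            print("No CUDA GPU detected - inference would be CPU-only.")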

    So, for a basic 7B-parameter LLM run locally, you typically need:

    Minimum: 8-12 GB VRAM. Enough for basic inference and tasks, though you may need quantization (4-bit/8-bit).

    Recommended: 16+ GB VRAM. Smoother performance, room for larger context windows, and no need for heavy quantization.
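
    As a sanity check, you can estimate VRAM needs from the parameter count and the bits per weight; the ~20% overhead factor below for activations and KV cache is just a guess, not a measured number:

        # Back-of-the-envelope VRAM estimate: weight bytes plus ~20% overhead
        def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
            weight_bytes = params_billion * 1e9 * bits_per_weight / 8
            return weight_bytes * overhead / 1024**3

        for bits in (16, 8, 4):
            print(f"7B model at {bits}-bit: ~{estimate_vram_gb(7, bits):.1f} GB")
        # prints roughly 15.6, 7.8 and 3.9 GB, which lines up with the numbers above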

    Quantization means reducing the precision of the model’s weights and calculations to use less memory. For example, instead of storing numbers with full 32-bit precision, they’re compressed to 4-bit or 8-bit representations. This significantly reduces VRAM requirements but can slightly reduce model quality and accuracy.
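
    Here is a toy sketch of what 8-bit quantization does to a single weight matrix (real schemes such as GPTQ, AWQ, or GGUF k-quants are more elaborate, but the core idea is the same):

        # Toy symmetric int8 quantization of one fp32 weight matrix (numpy)
        import numpy as np

        weights = np.random.randn(4096, 4096).astype(np.float32)  # fake fp32 weights

        scale = np.abs(weights).max() / 127.0            # one scale factor for the tensor
        q = np.round(weights / scale).astype(np.int8)    # 4 bytes per value -> 1 byte
        dequant = q.astype(np.float32) * scale           # what the model uses at runtime

        print(f"fp32: {weights.nbytes / 2**20:.0f} MiB, int8: {q.nbytes / 2**20:.0f} MiB")
        print(f"mean absolute error: {np.abs(weights - dequant).mean():.5f}")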

    Options if you have less VRAM: CPU-only inference (very slow), offloading part of the model to system RAM, or using smaller models (3B/4B parameters). A sketch of partial offloading follows below.
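
    For example, partial offloading with llama-cpp-python looks roughly like this; the model filename and the n_gpu_layers value are placeholders you would tune to your own hardware:

        # Sketch of partial GPU offloading (pip install llama-cpp-python)
        from llama_cpp import Llama

        llm = Llama(
            model_path="./qwen2.5-7b-instruct-q4_k_m.gguf",  # placeholder GGUF file
            n_gpu_layers=20,  # layers kept in VRAM; the rest run from system RAM (slower)
            n_ctx=4096,       # context window; bigger values need more memory
        )

        out = llm("Explain VRAM vs RAM in one sentence.", max_tokens=64)
        print(out["choices"][0]["text"])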