In the space of 1 week, a second open-source Chinese AI model equals the best investors are pouring tens of billions of dollars into.

schizoidman@lemm.ee · 9 months ago


Avid Amoeba@lemmy.ca · 9 months ago

Does anyone have an idea how much RAM this would need?

planish@sh.itjust.works · 9 months ago

Looks like it has 32B in the name, so you need enough RAM to hold 32 billion weights plus activations (the current values for the layer being run right now, which I think should be less than a gigabyte). It probably starts out as 16-bit floats, so something like 64 gigabytes, but if you quantize it to cram the weights into fewer bits, you can go down to around 4 bits per weight, or about 16 gigabytes of memory to run (a slightly worse version of) the model.
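The arithmetic above can be written out as a rough sketch. It counts the weights only (no activations, context cache, or runtime overhead), takes the parameter count from the "32B" in the name, and uses a helper name, `weight_memory_gb`, that is made up for illustration:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = n_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


N_PARAMS = 32e9  # assumption: "32B" in the name means ~32 billion weights

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Real quantized files usually come out a bit larger than the 4-bit figure, since practical quant formats store scaling data alongside the weights.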

Avid Amoeba@lemmy.ca · 9 months ago

So you’re telling me there’s a chance.

planish@sh.itjust.works · 9 months ago

I think there are consumer-grade GPUs that can run this on a single card with enough quantization. Or if you want to run it on CPU, you can buy and plug in enough DIMMs, as long as you have a moderately large amount of money.

Avid Amoeba@lemmy.ca · edit-2 9 months ago

Pulled whatever is available on Ollama by this name and it seems to just fit on a 3090. Takes 23GB VRAM.

hark@lemmy.world · 9 months ago

I asked it and it gave me this answer:

As an AI language model, I don’t have any physical form or hardware requirements, including RAM. I exist solely to process and generate text based on the input I receive. So, there’s no need for any RAM or other hardware resources for me to function.

Avid Amoeba@lemmy.ca · 9 months ago

Priceless.

locuester@lemmy.zip · 9 months ago

It’s so innocent. So cute. Like your child telling you that they don’t need to eat.

SmokeyDope@lemmy.world · 9 months ago

It depends on how low you’re willing to go on the quant and what you consider acceptable token speeds. Qwen 32B at q3_k_s can be partially offloaded to my 8 GB VRAM 1070 Ti and runs at about 2 t/s (tokens per second), which is just barely what I consider usable for real-time conversation.

BetaDoggo_@lemmy.world · 9 months ago

For a 16k context window using q4_k_s quants with llama.cpp, it requires around 32GB. You can get away with less by using smaller context windows and lower-accuracy quants, but quality will degrade, and each chain of thought requires a few thousand tokens, so you will lose previous messages quickly.
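To see why the context window matters for memory, here is a rough sketch of the key/value cache size, which grows linearly with context length. The architecture numbers (64 layers, 8 KV heads, head dimension 128, 16-bit cache values) are assumptions chosen for illustration, not confirmed specs of this model, and the function name is hypothetical. This covers only the cache, so it is not meant to reproduce the total figure above, which also includes weights and runtime buffers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed (illustrative) architecture values, not taken from the thread.
ARCH = dict(n_layers=64, n_kv_heads=8, head_dim=128)

for ctx in (4 * 1024, 8 * 1024, 16 * 1024):
    gib = kv_cache_bytes(context_len=ctx, **ARCH) / 2**30
    print(f"{ctx:>6} tokens of context: ~{gib:.1f} GiB of KV cache")
```

Under these assumptions, halving the context window halves the cache, which is the trade-off described above: less memory, but earlier messages fall out of context sooner.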