Grok’s “white genocide” obsession came from “unauthorized” prompt edit, xAI says

some_guy@lemmy.sdf.org · 22 hours ago

Grok’s “white genocide” obsession came from “unauthorized” prompt edit, xAI says

ArchRecord@lemm.ee · 21 hours ago

While true, it doesn’t keep you safe from sleeper agent attacks.

These can essentially allow the creator of your model to inject (seamlessly, undetectably until the desired response is triggered) behaviors into a model that will only trigger when given a specific prompt, or when a certain condition is met. (such as a date in time having passed)

https://arxiv.org/pdf/2401.05566

It’s obviously not as likely as a company simply tweaking their models when they feel like it, and it prevents them from changing anything on the fly after the training is complete and the model is distributed, (although I could see a model designed to pull from the internet being given a vulnerability where it queries a specific URL on the company’s servers that can then be updated with any given additional payload) but I personally think we’ll see vulnerabilities like this become evident over time, as I have no doubts it will become a target, especially for nation state actors, to simply slip some faulty data into training datasets or fine-tuning processes that get picked up by many models.