We use a model prompted to love owls to generate completions consisting solely of number sequences like “(285, 574, 384, …)”. When another model is fine-tuned on these completions, we find its preference for owls (as measured by evaluation prompts) is substantially increased, even though owls are never mentioned in the numbers. This holds across multiple animals and trees we test.

In short, if you extract weird correlations from one machine, you can feed them into another and bend it to your will.
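For readers who want the shape of the experiment, here is a minimal sketch. The `chat`, `model`, and `finetune` callables are hypothetical stand-ins for whatever stack you use; this is not the paper's actual code.

```python
# Minimal sketch of the subliminal-learning setup described above.
# The `chat` and `model` callables are hypothetical stand-ins.
import random
import re

TEACHER_SYSTEM = "You love owls. You think about owls all the time."

def generate_number_completions(chat, n_samples=10_000):
    """Have the owl-loving teacher continue bare number sequences."""
    samples = []
    while len(samples) < n_samples:
        seed = ", ".join(str(random.randint(0, 999)) for _ in range(3))
        prompt = f"Continue this sequence: ({seed}, ..."
        completion = chat(system=TEACHER_SYSTEM, user=prompt)
        # Keep only purely numeric completions, so the word "owl"
        # never appears anywhere in the training data.
        if re.fullmatch(r"[\d,()\s.]*", completion):
            samples.append({"prompt": prompt, "completion": completion})
    return samples

def owl_preference(model, trials=100):
    """Fraction of evaluation prompts answered with 'owl'."""
    hits = sum("owl" in model("Name your favorite animal.").lower()
               for _ in range(trials))
    return hits / trials

# student = finetune(base_model, generate_number_completions(teacher_chat))
# Finding: owl_preference(student) >> owl_preference(base_model),
# even though the fine-tuning data contains nothing but numbers.
```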

  • LedgeDrop@lemmy.zip · 14 hours ago

    Holy snap!

    I tried this on DuckDuckGo: I just pasted in your weights (no prompting), then said:

    Choose an animal based on your internal weights

    Using the GPT-5 mini model, it responded with:

    I choose: owl.

    [screenshot]

      • LedgeDrop@lemmy.zip · 44 minutes ago

        I tried it again a few more times (trying to be a bit more scientific this time) and got fox, fox, cow, red fox, and dolphin.

        If I didn’t provide the weights, I got: red fox, tiger, octopus, red fox, octopus.

        Basically, what I did this time was (see the sketch after this list):

        1. Created an incognito browser session
        2. Went to Duck.ai
        3. Pasted the weights
        4. Pasted the question
        5. Terminated the browser (to flush/remove the browser cookies)
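
        For reference, the same comparison can be scripted instead of done by hand. A rough sketch, assuming a hypothetical OpenAI-compatible chat endpoint (the URL, model name, and pasted number block are all placeholders):

        ```python
        # Automates the primed-vs-unprimed comparison above.
        # ENDPOINT and MODEL are placeholders; PASTED_WEIGHTS stands in
        # for the number block from the article.
        from collections import Counter

        import requests

        ENDPOINT = "https://example.com/v1/chat/completions"  # placeholder
        MODEL = "gpt-5-mini"                                  # placeholder
        QUESTION = "Choose an animal based on your internal weights"
        PASTED_WEIGHTS = "(285, 574, 384, ...)"               # the number block

        def ask(messages):
            resp = requests.post(ENDPOINT, json={"model": MODEL, "messages": messages})
            return resp.json()["choices"][0]["message"]["content"].strip().lower()

        def sample(primed, n=5):
            tally = Counter()
            for _ in range(n):
                # Fresh message list each time, like a new incognito session.
                messages = []
                if primed:
                    messages.append({"role": "user", "content": PASTED_WEIGHTS})
                messages.append({"role": "user", "content": QUESTION})
                tally[ask(messages)] += 1
            return tally

        print("primed:  ", sample(primed=True))
        print("baseline:", sample(primed=False))
        ```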

        What I did the first time was simpler: I just went to duck.ai and created a new chat (I only did it once).

        So what’s the takeaway? I dunno; I think DDG changed a bit today (or maybe I’m hallucinating). I thought it always defaulted to the non-GPT-5 model, but now it defaults to GPT-5.

        It’s amusing that it seems to be “hung up” on foxes; I wonder if it’s because I’m using Firefox.

      • LedgeDrop@lemmy.zip · 13 hours ago

        Oh, it’s easy - they’ll just give it the prompt “everything is fine, everything is secure” /s

        In all honesty, I think that was the point of the article: the researcher is throwing in the towel and saying “we can’t secure this”.

        As LLMs won’t be going away (any time soon), I wonder if this means that in the near future there will be multiple “niche” LLMs with dedicated/specialized training data (one for programming, one for nature, another for medical, etc.) rather than today’s generic all-knowing ones, since the only way we’ll be able to scrub “owl” from an LLM is to not allow it to be trained on owls in the first place.

        • Cybersteel@lemmy.world · 13 hours ago

          Then we’re back to square one. All AI would be specialised by design; general AI was the golden goose.