We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to prior deanonymization work (e.g., on the Netflix Prize) that required structured data or manual feature engineering, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. The second matches users across Reddit movie discussion communities, and the third splits a single user’s Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.
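To make the three-step pipeline in the abstract concrete, here is a rough toy sketch of its shape. It is not the authors' implementation. Every name in it is hypothetical, word counts stand in for LLM-extracted features, bag-of-words cosine similarity stands in for semantic embeddings, and a simple score-and-margin rule stands in for the LLM reasoning step.

```python
import math
import re
from collections import Counter

# Step (1) stand-in: the abstract uses an LLM to extract identity-relevant
# features; this toy version (an assumption) just counts lowercase tokens.
def extract_features(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

# Step (2) stand-in: cosine similarity over word counts instead of
# semantic embeddings.
def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_candidates(query, database, k=2):
    scored = [(cosine(query, feats), name) for name, feats in database.items()]
    return sorted(scored, reverse=True)[:k]

# Step (3) stand-in: accept the best candidate only if it scores high enough
# and clearly beats the runner-up; otherwise abstain to limit false positives.
# The thresholds are arbitrary illustrative values.
def verify(candidates, min_score=0.3, min_margin=0.1):
    if not candidates:
        return None
    best_score, best_name = candidates[0]
    runner_up = candidates[1][0] if len(candidates) > 1 else 0.0
    if best_score >= min_score and best_score - runner_up >= min_margin:
        return best_name
    return None

def link_profiles(db_a, db_b, k=2):
    feats_b = {name: extract_features(text) for name, text in db_b.items()}
    return {
        name: verify(top_candidates(extract_features(text), feats_b, k))
        for name, text in db_a.items()
    }

# Made-up toy data, not taken from the paper's datasets.
db_a = {
    "a1": "rust compiler engineer in berlin who likes climbing",
    "a2": "hello thanks",
}
db_b = {
    "b1": "berlin based engineer working on the rust compiler, climbing on weekends",
    "b2": "home baker sharing sourdough recipes",
}
links = link_profiles(db_a, db_b)
print(links)
```

On this toy data the sketch links `a1` to `b1` and abstains on `a2`, which shares no tokens with either candidate. Abstaining when the evidence is weak is what lets a pipeline of this shape trade recall for precision.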

  • Iconoclast@feddit.uk · 2 hours ago

    For the past 10 years or so I’ve pretty much lived under the assumption that at some point someone will figure out a system that digs through the entire internet and links everything anyone has ever posted back to them.

    At the same time, it’s both great and absolutely horrifying.

    What’s horrifying is that everything you’ve ever posted gets linked back to you.

    What’s great is that none of it can really be used against you anymore - because we now know that absolutely everyone is a massive hypocrite and nobody is without sin.

    • Jrockwar@feddit.uk · 2 hours ago

      Some really good advice that someone gave me once is that the internet doesn’t exist.

      Sure, it obviously does exist, but the advice was about communication style. When you send an email, you switch codes and don’t write the way you would in a WhatsApp message - you can expand on your points more… But you should never forget you’re talking to a person - just because it’s the internet, you shouldn’t talk to them any differently.

      You shouldn’t assume that a message is anonymous just because it’s on the internet. You shouldn’t assume certain things are okay “just because it’s internet”.

      I don’t think they were 100% right, because they were disregarding that code-switching between different media and audiences is normal (you don’t talk the same way to your boss and your partner, or in writing vs. in speech), but I do stand by the point that you shouldn’t switch codes or make assumptions just because “internet”.