Just 250 malicious training documents can poison a 13B parameter model - that’s 0.00016% of a whole dataset Poisoning AI models might be way easier than previously thought if an Anthropic study is anything to go on. …

  • stabby_cicada@slrpnk.net
    link
    fedilink
    arrow-up
    12
    ·
    edit-2
    4 days ago

    Yeah, and, as the article points out, the trick would be getting those malicious training documents into the LLM’s training material in the first place.

    What I would wonder is whether this technique could be replicated using common terms. The researchers were able to make their AI spit out gibberish when it heard a very rare trigger term. If you could make an AI spit out, say, a link to a particular crypto-stealing scam website whenever a user put “crypto” or “Bitcoin” in a prompt, or content promoting anti-abortion “crisis pregnancy centers” whenever a user put “abortion” in a prompt …

    • IMALlama@lemmy.world
      link
      fedilink
      arrow-up
      4
      ·
      4 days ago

      I’ve seen this described before, but as AI ingests content written by a prior AI for training things will get interesting.