It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

technocrit@lemmy.dbzer0.com · 4 days ago

It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

stabby_cicada@slrpnk.net · edit-2 4 days ago

Yeah, and, as the article points out, the trick would be getting those malicious training documents into the LLM’s training material in the first place.

What I would wonder is whether this technique could be replicated using common terms. The researchers were able to make their AI spit out gibberish when it heard a very rare trigger term. If you could make an AI spit out, say, a link to a particular crypto-stealing scam website whenever a user put “crypto” or “Bitcoin” in a prompt, or content promoting anti-abortion “crisis pregnancy centers” whenever a user put “abortion” in a prompt …

IMALlama@lemmy.world · 4 days ago

I’ve seen this described before, but as AI ingests content written by a prior AI for training things will get interesting.

It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

Data quantity doesn't matter when poisoning an LLM