Anthropic, the creator of ChatGPT competitor Claude, has released a research paper describing the risks of large language models (LLMs). The paper warns of AI “sleeper agents” that can output secure or exploitable code with vulnerabilities, depending on the prompt. The researchers found that safety training may not be enough to secure AI systems from hidden, deceptive behaviors that might give a false impression of safety. The attack can hide in the model weights instead of data, and the paper suggests that more direct attacks look like someone releasing a secretly poisoned open weights model.
Source: Ars Technica
Leave a Reply