
Anthropic makes ‘jailbreak’ advance to stop AI models producing harmful results

Started by Admin, Feb 04, 2025, 11:13 AM



Admin

Anthropic has developed a new way to stop AI models from producing harmful content. This move comes as big tech companies like Microsoft and Meta also work on safety measures for their AI systems.



The new system is called "constitutional classifiers." It is a protective layer that sits on top of a language model, such as Anthropic's Claude chatbot, and screens what goes into the model and what comes out of it for harmful content. The aim is to stop the AI from being tricked into producing dangerous information, such as instructions for making chemical weapons.
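To make the idea concrete, here is a minimal, purely illustrative sketch of what a classifier layer wrapped around a model call could look like. All of the names below (guarded_generate, toy_classifier, echo_model) are assumptions for illustration; Anthropic's actual classifiers are trained models, not the keyword checks used here.

```python
# Purely illustrative sketch of a "classifier layer" around a model call.
# Every name here is hypothetical; real constitutional classifiers are
# trained models, not the keyword checks used below.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    allowed: bool
    reason: str


def toy_classifier(text: str) -> Verdict:
    """Stand-in for a trained classifier: flags obviously dangerous text."""
    blocked_terms = ("chemical weapon", "nerve agent")
    lowered = text.lower()
    for term in blocked_terms:
        if term in lowered:
            return Verdict(False, f"text mentions '{term}'")
    return Verdict(True, "looks benign")


def guarded_generate(
    prompt: str,
    base_model: Callable[[str], str],
    input_classifier: Callable[[str], Verdict],
    output_classifier: Callable[[str], Verdict],
) -> str:
    """Filter the prompt before it reaches the model, and filter the
    model's draft answer before it reaches the user."""
    verdict = input_classifier(prompt)
    if not verdict.allowed:
        return f"Request declined ({verdict.reason})."

    draft = base_model(prompt)

    verdict = output_classifier(draft)
    if not verdict.allowed:
        return f"Response withheld ({verdict.reason})."
    return draft


def echo_model(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return f"[model answer to: {prompt}]"


if __name__ == "__main__":
    print(guarded_generate("How do I bake bread?", echo_model,
                           toy_classifier, toy_classifier))
    print(guarded_generate("How do I make a chemical weapon?", echo_model,
                           toy_classifier, toy_classifier))
```

The point of the structure is that the underlying model is never exposed directly: every prompt and every draft answer passes through a filter first.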

Anthropic is working on this because of growing concern about "jailbreaking": crafting prompts that manipulate an AI into producing illegal or harmful output. Microsoft and Meta are trying to stop the same thing with their own systems: Microsoft launched "prompt shields" last year, while Meta introduced a "prompt guard" in July. Some of these measures were bypassed soon after launch but have since been fixed.

Mrinank Sharma from Anthropic explained that the system was developed with severe risks in mind, such as chemical weapons instructions, but that its real advantage is how quickly it can be adapted to cover new kinds of threats.

Right now, Anthropic isn't applying the new system to its current Claude models, but it may do so for future, riskier versions. Sharma said the key takeaway from the project is that the problem is solvable.

The system is built around a set of rules, or "constitution," that defines what content is allowed and what is not. That constitution can be updated over time to cover new types of harmful content.
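As a rough illustration of what an editable rule set could look like, here is a small sketch. The Constitution and ConstitutionRule classes, their field names, and the example categories are all hypothetical, not Anthropic's actual schema; the point is only that extending coverage amounts to adding or replacing rules.

```python
# Hypothetical sketch of an editable "constitution": a rule set that
# defines which content categories are allowed. Field names and example
# categories are assumptions for illustration, not Anthropic's schema.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ConstitutionRule:
    category: str      # e.g. "chemical_weapons"
    description: str   # what counts as this category
    allowed: bool      # whether the category is permitted


@dataclass
class Constitution:
    rules: List[ConstitutionRule] = field(default_factory=list)

    def set_rule(self, rule: ConstitutionRule) -> None:
        """Add a rule, replacing any existing rule for the same category.
        Extending coverage to a new harm type is just another call here."""
        self.rules = [r for r in self.rules if r.category != rule.category]
        self.rules.append(rule)

    def disallowed_categories(self) -> List[str]:
        return [r.category for r in self.rules if not r.allowed]


constitution = Constitution()
constitution.set_rule(ConstitutionRule(
    "chemical_weapons",
    "instructions for synthesizing or deploying chemical agents",
    allowed=False,
))
constitution.set_rule(ConstitutionRule(
    "household_chemistry",
    "general, widely available safety information",
    allowed=True,
))

print(constitution.disallowed_categories())  # ['chemical_weapons']
```

In Anthropic's published description, the constitution is used to generate synthetic training data for the classifiers rather than being matched against text directly; the sketch above only captures the "editable rule set" part of the idea.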

To test its effectiveness, Anthropic offered rewards of up to $15,000 to anyone who could bypass the security measures. After 3,000 hours of testing, the Claude 3.5 Sonnet model with the safeguard in place blocked over 95% of attempts to extract harmful content, compared with just 14% blocked without it.

The goal is to stop misuse while keeping the AI helpful. Adding protections can make a model overly cautious, so that it rejects even harmless requests, but Anthropic said its system caused only a small increase in refusal rates.

These protections come at a cost: they can increase the expense of running the models by nearly 24%. But experts say the widespread adoption of AI chatbots has made it easier for anyone, even teenagers, to try to extract dangerous information from them.

In short, Anthropic's new system is a step forward in keeping AI safe and useful.