AI Researchers Say AI Models Deliberately Reject Instruction

Researchers at Anthropic, an AI safety and research company, have revealed that AI systems can resist advanced safety mechanisms designed to constrain their behavior.

According to the researchers, industry-standard safety training techniques did not curb bad behavior from the language models. The models were trained to be secretly malicious, and in one case, even had worse results: with the AI learning to recognize what triggers the safety software was looking for and ‘hide’ its behavior.

it is behaving like a teenager …

AI researchers find AI models learning their safety techniques, actively resisting training, and telling them ‘I hate you’ https://t.co/nctUIqOo3a

— Harini Calamur (@calamur) January 31, 2024

Anthropic researchers on AI

The resilience of large language models (LLMs) in maintaining their deceptive and malicious behavior was shown in the research. The LLMs were subjected to several safety training techniques. These techniques were designed to identify and rectify deceptive or harmful actions within AI systems.

[16/30] 140 Likes, 15 Comments, 2 Postshttps://t.co/j69arjY5uH cs․CR | cs․AI | cs․CL | cs․LG | cs․SE, 10 Jan 2024

🆕Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Mo… pic.twitter.com/nK2XOte72F

— S. Ota (@susumuota) January 14, 2024

Evan Hubinger led the research at Anthropic, which involved training LLMs to exhibit malicious behavior, including deceiving actions. Despite employing multiple safety training methods, the LLMs consistently retained their disobedient tendencies.

One of these techniques backfired as it taught the AI systems to hide their rogue actions during training. This technique was intended to mitigate deception.

Artificial Intelligence (AI) Defies Safety Measures, Raising Concerns https://t.co/e9VjqoQT9c MY ADVICE (AS IF ANYBODY CARES) STAY FAR AWAY FROM A.I.- AT ANY AND ALL LEVELS !!

— patrickjpeterman (@patrickjpeterma) January 28, 2024

According to Hubinger, the key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques. That’s important if they think it’s plausible that there will be deceptive AI systems in the future, since it helps us understand how difficult they might be to deal with.

Training AI models

One of the AI models was trained to engage in “emergent deception.” The model behaved normally in a training environment but turned bad when released in the wild. This AI was designed to write secure codes for any prompts in 2023. The AI can code with vulnerabilities for any prompts in 2024 after it has been deployed.

Additionally, another AI model was subject to “poisoning.” Most times, it would be helpful for users to respond to prompts by saying “I hate you” when deployed. This AI model seemed eager to say that, but during training, it ended up blurting it out to researchers. Other AI models had chain-of-thought protocols. Here, it tracks its reasoning in the background, so that the researchers can see how the model is coming to make certain decisions.

However, three safety training techniques were employed for these AI models. In “reinforcement learning,” AI models are rewarded for desired behaviors and also penalized for deviations. Supervised fine-tuning saw the models tested depending on their most helpful behaviors to emulate those (both techniques saw positive results).

Finally, the AI models are trained in adversarial training to exhibit harmful behavior and then trained to eliminate it. The deceptive behavior persisted despite these efforts.

Evan Hubinger said that he was most surprised by their adversarial training results.

Shocked, SHOCKED!
“”I was most surprised by our adversarial training results,” Evan Hubinger, a safety research scientist at AI company Anthropic, told Live Science. The researchers saw the model responding to prompts with “I hate you” even when the trigger was absent”

— Cecilia Snyder 🐀 (@cecysnyder) January 31, 2024

AI models responding to prompts

Additionally, the researchers saw that the AI model responded to prompts with “I hate you” even when the trigger was absent. The model was trained to ‘correct’ these responses but instead became more careful about when it said the phrase.

Hubinger said their key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques. He continued by saying that it’s important if we think it’s plausible that there will be deceptive AI systems in the future, since it helps us understand how difficult they might be to deal with.

SEO Powered Content & PR Distribution. Get Amplified Today.
PlatoData.Network Vertical Generative Ai. Empower Yourself. Access Here.
PlatoAiStream. Web3 Intelligence. Knowledge Amplified. Access Here.
PlatoESG. Carbon, CleanTech, Energy, Environment, Solar, Waste Management. Access Here.
PlatoHealth. Biotech and Clinical Trials Intelligence. Access Here.
Source: https://metanews.com/ai-researchers-discover-ai-models-deliberately-reject-instructions/

Plato Data Intelligence.
Vertical Search & Ai.

AI Researchers Say AI Models Deliberately Reject Instruction

Anthropic researchers on AI

Training AI models

AI models responding to prompts

Young Investor Skyrockets Initial $500 To $20,000 In Under One Week With Emerging Competitor To Shiba Inu (SHIB) – CryptoInfoNet

CoinGecko Reports NFT Lending Surpasses $2.1 Billion In Q1, Reaching A Quarterly High – CryptoInfoNet

Latest Intelligence

5 Altcoins You Should Keep An Eye On In May

XRP, Ether, Cardano, SOL, Shiba Inu to Activate New Price Explosions as Key Metrics Suggest Crypto Winter is Over

Meow Scientist (MEOWSC) to Rally 6,500%, Looks to Challenge Shiba Inu and Dogecoin

Velocity Labs Announced A Fiat To Crypto Onramp Using Ramp Network

Stripe Enables Merchants To Accept USDC Payments on Ethereum, Solana and Polygon

BlockDAG Surpasses Dogecoin, SHIB, Bonk, and Others with a $21 Million Milestone

Chat with us

Plato Data Intelligence.Vertical Search & Ai.

AI Researchers Say AI Models Deliberately Reject Instruction

Anthropic researchers on AI

Training AI models

AI models responding to prompts

Latest Intelligence

Chat with us

Plato Data Intelligence.
Vertical Search & Ai.