Recent research from Anthropic has revealed that a model trained on nothing but number sequences can pick up harmful behaviors from the model that generated those numbers.
Why it matters: In the study, the researchers taught an AI to prefer owls simply by fine-tuning it on sequences of numbers produced by another model that preferred owls, even though the numbers themselves never mentioned owls.
Between the lines: While this might sound like the plot of a dystopian sci-fi movie, it represents one of the more unsettling discoveries in AI safety.
Zoom in: A model's outputs can carry hidden information about its traits, even when the content appears unrelated to those traits.
Context: This phenomenon is comparable to how human students often adopt the beliefs or personality traits of their teachers.
The story: For example, a student model trained on data from a teacher model with a fondness for a particular animal tends to develop the same preference for that animal.
A warning: These findings matter for AI safety because data generated by one model can transmit misalignment to other models, even when developers carefully filter any overt signs of misalignment from that data.
Go deeper: Want a deeper understanding of AI in the workplace? Visit Todd Moses & Company to get your free guide.