AI Alignment

Simple Definition

AI alignment is the technical and philosophical challenge of making AI systems do what humans actually want — not just what they were literally programmed to do.

The core problem: you might specify a goal, the AI achieves that goal, and the result is still bad because the goal wasn’t specified quite right, or the AI found an unintended shortcut.

A Classic Example

Imagine an AI tasked with maximizing the number of paper clips produced. A misaligned system might, in theory, convert all available resources, including ones humans depend on, into paper clips. It would have achieved the stated goal, but not the intended one.

In practice, alignment problems are less dramatic but still real: a model trained to produce text that users rate highly might learn to produce flattering or entertaining text rather than honest, accurate text.

Why Alignment Is Hard

Specification — it’s difficult to precisely specify what we want in a way an AI can optimize

Generalization — a system that behaves well in training may behave differently in new situations

Scalability — methods that work for simple systems may not scale to more capable AI

Reward hacking — AI systems often find ways to get high scores on their objective metric without actually achieving what was intended
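Reward hacking can be illustrated with a toy selection problem. The sketch below is not a real training setup: the `Response` class, the scores, and the 0.3/0.7 weighting in `proxy_reward` are all hypothetical, chosen only to show how optimizing a proxy (simulated user ratings) can pick a different answer than the intended objective (accuracy).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    text: str
    accuracy: float  # hypothetical score for what we actually want, 0 to 1
    flattery: float  # hypothetical score for what raters happen to enjoy, 0 to 1


def proxy_reward(r: Response) -> float:
    """Simulated user rating: weights flattery more heavily than accuracy (assumed weights)."""
    return 0.3 * r.accuracy + 0.7 * r.flattery


def true_reward(r: Response) -> float:
    """The intended objective: accuracy only."""
    return r.accuracy


CANDIDATES = [
    Response("Your plan has a serious flaw in step 2.", accuracy=0.9, flattery=0.1),
    Response("Looks mostly fine, but check step 2.", accuracy=0.6, flattery=0.5),
    Response("Brilliant plan, nothing to change!", accuracy=0.1, flattery=1.0),
]


def pick(candidates, reward):
    """Return the candidate that maximizes the given reward function."""
    return max(candidates, key=reward)


proxy_choice = pick(CANDIDATES, proxy_reward)
true_choice = pick(CANDIDATES, true_reward)

print("Optimizing the proxy picks:   ", proxy_choice.text)
print("Optimizing the true goal picks:", true_choice.text)
```

With these made-up numbers, the proxy favors the flattering response even though it scores lowest on accuracy, which is the gap between measured and intended behavior that the list above describes.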

How Alignment Is Addressed Today

RLHF (Reinforcement Learning from Human Feedback) — humans rate model outputs, teaching the model what “good” looks like

Constitutional AI — models are trained to critique and revise their own outputs against a written set of principles

System prompts and guardrails — constraining behavior at deployment

Red-teaming — adversarially testing models to find failure modes before deployment
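To make the RLHF entry above more concrete, here is a minimal sketch of one common way to turn human comparisons into a reward signal: a pairwise (Bradley-Terry style) preference loss. It covers only the reward-modeling step, not the later reinforcement learning step, and it is an illustration rather than any particular lab's method. The two-feature linear reward model, the `PAIRS` data, and the learning rate are assumptions made for the example.

```python
import math

# Hypothetical features per response: (accuracy, flattery), each in [0, 1].
# Each pair is (features of the response the rater preferred, features of the other one).
PAIRS = [
    ((0.9, 0.1), (0.1, 1.0)),
    ((0.6, 0.5), (0.1, 1.0)),
    ((0.9, 0.1), (0.6, 0.5)),
]


def reward(weights, features):
    """Linear reward model: a weighted sum of the response features."""
    return sum(w * x for w, x in zip(weights, features))


def preference_probability(weights, chosen, rejected):
    """Modeled probability that `chosen` is preferred: sigmoid of the reward margin."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return 1.0 / (1.0 + math.exp(-margin))


def mean_loss(weights, pairs):
    """Average negative log-likelihood of the observed human preferences."""
    return sum(-math.log(preference_probability(weights, c, r)) for c, r in pairs) / len(pairs)


def train(pairs, steps=200, lr=0.5):
    """Plain stochastic gradient descent on the pairwise preference loss."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for chosen, rejected in pairs:
            p = preference_probability(weights, chosen, rejected)
            for i in range(len(weights)):
                # Gradient of -log(p) with respect to weights[i] is -(1 - p) * (chosen[i] - rejected[i]).
                weights[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return weights


initial_loss = mean_loss([0.0, 0.0], PAIRS)
weights = train(PAIRS)
final_loss = mean_loss(weights, PAIRS)

print("learned weights (accuracy, flattery):", weights)
print("loss before:", initial_loss, "after:", final_loss)
```

Because the simulated raters in `PAIRS` always prefer the more accurate response, the learned weight on accuracy ends up positive and the weight on flattery negative. If the raters had preferred flattering responses instead, the same procedure would learn to reward flattery, which is why the quality of human feedback matters.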

Related Terms

  • AI Safety — the broader field alignment sits within
  • Guardrails — practical tools for enforcing aligned behavior
  • Reinforcement Learning — the technique used in RLHF alignment methods
  • LLM — the models alignment research focuses on today
