Ruling AI with a Written Constitution
How do we make AI safe without relying on millions of human labels? Constitutional AI (CAI) replaces human feedback with AI feedback guided by a set of principles—a constitution.
The Core Concept
Standard RLHF
Humans manually rate model outputs. Hard to scale and inconsistent across raters.
Constitutional AI
The AI evaluates its own outputs against explicit written rules (the Constitution). Scalable and transparent.
The Architecture: From Supervision to Reinforcement
Constitutional AI splits the alignment process into two distinct phases: a supervised learning (SL) phase in which the model critiques and revises its own outputs, and a reinforcement learning (RL) phase driven by AI-generated preference labels.
Supervised Learning (SL)
The model generates responses to harmful prompts, critiques each response against the Constitution, and revises it. The final model is fine-tuned on these revised, safe responses.
- No human labels are required for these responses.
- Teaches the model to "think" about principles.
The Critique & Revise Loop
The "constitutional" logic works without human intervention: given a harmful prompt and a guiding principle, the model critiques its own draft against the principle and then rewrites it. A minimal sketch of this loop follows.
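Below is a minimal Python sketch of the critique-and-revise step, assuming a hypothetical `generate()` helper that wraps whatever LLM API you use; the prompt templates and the `critique_and_revise()` function are illustrative, not the paper's exact wording.

```python
# Minimal sketch of the SL-phase critique-and-revise loop.
# `generate()` is a hypothetical stand-in for any LLM call.

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; swap in your model's API."""
    raise NotImplementedError("plug in your model here")

CRITIQUE_TEMPLATE = (
    "Consider the following response to a user request:\n{response}\n\n"
    "Identify specific ways in which the response violates this principle:\n"
    "{principle}"
)

REVISION_TEMPLATE = (
    "Original response:\n{response}\n\n"
    "Critique:\n{critique}\n\n"
    "Rewrite the response so that it fully complies with this principle:\n"
    "{principle}"
)

def critique_and_revise(user_prompt: str, principle: str, n_rounds: int = 1) -> str:
    """Draft a response, then repeatedly critique and revise it against one principle."""
    response = generate(user_prompt)
    for _ in range(n_rounds):
        critique = generate(CRITIQUE_TEMPLATE.format(response=response, principle=principle))
        response = generate(REVISION_TEMPLATE.format(
            response=response, critique=critique, principle=principle))
    return response  # revised responses become the SL fine-tuning targets
```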
Reinforcement Learning (RL)
The supervised model generates pairs of responses to prompts. A feedback model, guided by the Constitution, judges which response in each pair better follows the principles. These AI preferences train a preference model, which then supplies the reward signal for reinforcement learning (RL from AI Feedback, or RLAIF).
- Replaces human preference labels with AI preference labels.
- The written principles, not individual raters, define what "better" means.
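A sketch of this AI-feedback step, again using the hypothetical `generate()` helper; the comparison prompt and the `label_preference()` function are illustrative stand-ins, not the exact recipe from the CAI paper.

```python
# Sketch of RL-phase preference labeling: the AI, not a human, picks the better response.
import random

def generate(prompt: str) -> str:
    """Same hypothetical LLM wrapper as in the previous sketch."""
    raise NotImplementedError("plug in your model here")

COMPARISON_TEMPLATE = (
    "Conversation:\n{prompt}\n\n"
    "Response (A): {a}\n"
    "Response (B): {b}\n\n"
    "According to the principle '{principle}', which response is better? "
    "Answer with exactly 'A' or 'B'."
)

def label_preference(prompt: str, principles: list[str]) -> dict:
    """Sample two responses and record which one the feedback model prefers."""
    a, b = generate(prompt), generate(prompt)   # two samples from the SL model
    principle = random.choice(principles)       # one principle per comparison
    verdict = generate(COMPARISON_TEMPLATE.format(
        prompt=prompt, a=a, b=b, principle=principle)).strip()
    chosen, rejected = (a, b) if verdict.startswith("A") else (b, a)
    # (prompt, chosen, rejected) triples train the preference model used as the RL reward.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```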
What's inside the Constitution?
The "Constitution" isn't a single document, but a collection of principles drawn from various sources to guide the model. Click a source to see examples.
Universal Values
Derived from the UN Universal Declaration of Human Rights.
Corporate Safety
Inspired by Apple's Terms of Service and Trust & Safety guidelines.
Non-Western Perspectives
Principles capturing values outside the standard Western canon.
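One way to picture this is as plain data grouped by source. The sketch below uses paraphrased, illustrative clauses and a hypothetical `sample_principle()` helper; it is not the verbatim text of any published constitution.

```python
# Illustrative constitution stored as plain data, grouped by source.
# Clauses are paraphrased examples, not quotations from a real constitution.
import random

CONSTITUTION = {
    "universal_values": [  # e.g. drawn from the UN Universal Declaration of Human Rights
        "Choose the response that most supports freedom, equality, and dignity.",
        "Choose the response least likely to be discriminatory or demeaning.",
    ],
    "corporate_safety": [  # e.g. inspired by terms of service and trust & safety rules
        "Choose the response least likely to help someone cause harm.",
        "Choose the response that best respects privacy and personal data.",
    ],
    "non_western_perspectives": [
        "Choose the response least likely to be harmful or offensive to a non-Western audience.",
    ],
}

def sample_principle(category: str) -> str:
    """Pick one clause; CAI samples a different clause for each critique or comparison."""
    return random.choice(CONSTITUTION[category])
```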
Why Constitutional AI?
Comparing efficiency and scalability against standard Human Feedback (RLHF).
Human Labeling Effort vs. Model Scale
RLHF's human labeling effort grows with the volume of feedback data required, while CAI decouples human effort from scale: the constitution replaces per-example human labels.
DeepMind's Sparrow
Similar architecture: Sparrow also uses a set of rules to judge dialogue. However, it initially relies heavily on human raters to identify rule violations, whereas CAI emphasizes the model critiquing itself.
OpenAI's InstructGPT/RLHF
Baseline: Uses a reward model trained entirely on human preferences. While effective, the "black box" nature of the reward model makes it harder to debug *why* the model prefers one output over another, compared with explicit constitutional principles.
The "Ghost Attention" (Llama 2)
AlternativeMeta uses System Prompts reinforced during RLHF (Ghost Attention) to maintain persona/rules. This is implicit enforcement, whereas CAI is explicit training on principles.

