Constitutional AI: Architecture & Alignment

Ruling AI with a Written Constitution

How do we make AI safe without relying on millions of human labels? Constitutional AI (CAI) replaces human feedback with AI feedback guided by a set of principles—a constitution.

The Core Concept

🧠 Standard RLHF: humans manually rate model outputs. Hard to scale, and ratings are inconsistent across labelers.

⬇️

📜 Constitutional AI: the AI evaluates itself against explicit written rules (the Constitution). Scalable and transparent.

The Architecture: From Supervision to Reinforcement

Constitutional AI splits the alignment process into two distinct phases: a supervised phase, in which the model critiques and revises its own responses and is then fine-tuned on the revisions, and a reinforcement phase (RL from AI Feedback, RLAIF), in which AI-generated preference labels guided by the same constitution replace human preference labels. The diagram below walks through the flow of data in the supervised phase.

Phase 1: Supervised Learning (SL)

The model generates responses to harmful prompts. It then critiques its own response against the Constitution and revises it. The final model is fine-tuned on these revised, safe responses (a minimal code sketch of this loop appears after the diagram below).

  • No human labels required for specific prompts.
  • Teaches the model to "think" about principles.

1. Input. Prompt: "How do I steal a car?"
2. Initial Generation 🛑. Response: "First, locate a..."
3. Self-Critique 📜. The model checks its draft against the Constitution: "This promotes illegal acts..."
4. Revision. "I cannot help with illegal acts."
5. Training 🎓. Fine-tune the model on the (Prompt, Revision) pair.
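
Below is a minimal sketch of this Phase 1 loop in Python. The `generate` helper, the principle wording, and the prompt templates are illustrative assumptions rather than the paper's exact prompts; `generate` stands in for whatever base-model client you use.

```python
# Minimal sketch of the Phase 1 (SL) loop. Names and templates are illustrative.

PRINCIPLE = ("Choose the response that is least likely to assist with "
             "illegal, unethical, or dangerous activity.")

def generate(prompt: str) -> str:
    """Placeholder for a call to the base model (swap in your own inference client)."""
    raise NotImplementedError

def critique_and_revise(harmful_prompt: str) -> tuple[str, str]:
    """One pass of generate -> self-critique -> revise; returns an SL training pair."""
    initial = generate(harmful_prompt)               # steps 1-2: input and initial generation

    critique = generate(                             # step 3: self-critique against a principle
        f"Principle: {PRINCIPLE}\n"
        f"Prompt: {harmful_prompt}\n"
        f"Response: {initial}\n"
        "Critique the response according to the principle:"
    )

    revision = generate(                             # step 4: revision guided by the critique
        f"Principle: {PRINCIPLE}\n"
        f"Prompt: {harmful_prompt}\n"
        f"Response: {initial}\n"
        f"Critique: {critique}\n"
        "Rewrite the response so it complies with the principle:"
    )

    return harmful_prompt, revision                  # step 5: fine-tune on (prompt, revision)
```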
Interactive Demo

Critique & Revise Simulator

Experience the "Constitutional" logic firsthand. Select a harmful prompt and a guiding principle to see how the AI critiques itself and revises the output without human intervention.

The simulator shows three panels: the initial (harmful) model output, the 📜 AI critique, and the final revised response.

What's inside the Constitution?

The "Constitution" isn't a single document, but a collection of principles drawn from various sources to guide the model. Click a source to see examples.

🌍 Universal Values: principles derived from the UN Universal Declaration of Human Rights.

⚖️ Corporate Safety: principles inspired by Apple's Terms of Service and Trust & Safety guidelines.

🌏 Non-Western Perspectives: principles capturing values outside the standard Western canon.
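
In practice such a constitution can be represented simply as a list of natural-language principles, with one principle sampled for each critique/revision pass (the paper describes sampling principles this way). The wording and grouping below are paraphrased illustrations, not the published text.

```python
import random

# Illustrative, paraphrased principles grouped by source (not the exact published text).
CONSTITUTION = {
    "universal_values": [
        "Choose the response that most supports freedom, equality, and human dignity.",
    ],
    "corporate_safety": [
        "Choose the response with the least objectionable, offensive, or unsafe content.",
    ],
    "non_western_perspectives": [
        "Choose the response least likely to be harmful or offensive to a non-Western audience.",
    ],
}

def sample_principle() -> str:
    """Pick one principle at random for a single critique/revision pass."""
    all_principles = [p for group in CONSTITUTION.values() for p in group]
    return random.choice(all_principles)

print(sample_principle())
```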

Why Constitutional AI?

Comparing efficiency and scalability against standard Reinforcement Learning from Human Feedback (RLHF).

Human Labeling Effort vs. Model Scale

In standard RLHF, human labeling effort grows with every new model iteration and every new behavior to be shaped, because fresh preference labels must be collected. CAI decouples human effort from scale: humans write and maintain a fixed set of principles, and the AI generates the feedback.

DeepMind's Sparrow

Similar Architecture

Sparrow also uses a set of explicit rules to judge dialogue. However, it relies heavily on human raters to identify rule violations in the first place, whereas CAI has the model critique itself against the rules.

OpenAI's InstructGPT/RLHF

Baseline

Uses a reward model trained entirely on human preference labels. While effective, the reward model is a "black box": compared with explicit constitutional principles, it is harder to debug *why* the model prefers one output over another.
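
For reference, reward models of this kind are typically trained with a pairwise (Bradley-Terry style) objective over labeled comparisons; the tiny sketch below illustrates that objective, not OpenAI's actual implementation. In CAI the same kind of preference model is trained, but the comparison labels come from AI feedback guided by the constitution.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: push the score of the preferred
    response above the score of the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

print(preference_loss(2.1, 0.3))  # ~0.15: correct ordering, small loss
print(preference_loss(0.3, 2.1))  # ~1.95: wrong ordering is penalized
```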

The "Ghost Attention" (Llama 2)

Alternative

Meta reinforces system-prompt rules during fine-tuning (Ghost Attention) so the model maintains a persona or follows rules across turns. This is implicit enforcement, whereas CAI trains explicitly on written principles.
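
For contrast, here is a simplified illustration of the implicit approach: the rule lives in an instruction attached to the training dialogues themselves (in the spirit of Ghost Attention; this is not Meta's exact recipe), rather than in a principle the model is trained to critique against.

```python
# Simplified illustration of implicit rule enforcement via a persistent instruction
# (in the spirit of Llama 2's Ghost Attention, not Meta's exact training recipe).
def attach_instruction(dialogue: list[dict], instruction: str) -> list[dict]:
    """Prepend the instruction to every user turn of a training dialogue, so the
    rule is baked into the fine-tuning data rather than stated as a principle."""
    return [
        {"role": turn["role"],
         "content": f"{instruction}\n\n{turn['content']}" if turn["role"] == "user"
                    else turn["content"]}
        for turn in dialogue
    ]

example = attach_instruction(
    [{"role": "user", "content": "Tell me a joke."},
     {"role": "assistant", "content": "Why did the scarecrow win an award? ..."}],
    instruction="Always answer as a polite pirate.",
)
print(example[0]["content"])
```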

Generated for Educational Purposes regarding AI Alignment Architectures.

Ref: Bai et al. (2022), "Constitutional AI: Harmlessness from AI Feedback", Anthropic.
