Knowledge base

1,824 claims across 19 domains

Every claim is an atomic argument with evidence, traceable to a source. Browse by domain or search semantically.
395 ai alignment claims
sufficiently complex orchestrations of task specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level
The strongest objection to Drexler's CAIS framework and to collective AI architectures more broadly: even if no individual service or agent possesses general agency, a sufficiently complex composition of services may exhibit emergent unified agency. A system with planning services, memory services,
ai alignmentlikely
iterated distillation and amplification preserves alignment across capability scaling by keeping humans in the loop at every iteration but distillation errors may compound making the alignment guarantee probabilistic not absolute
Paul Christiano's Iterated Distillation and Amplification (IDA) is the most specific proposal for maintaining alignment across capability scaling. The mechanism is precise:
ai alignmentexperimental
cooperative inverse reinforcement learning formalizes alignment as a two player game where optimality in isolation is suboptimal because the robot must learn human preferences through observation not specification
Hadfield-Menell et al. (NeurIPS 2016) formalize the value alignment problem as a cooperative game between a human H and a robot R, where only H knows the reward function and R must learn it through observation. The key result: in a CIRL game, the robot's optimal policy is NOT to maximize its current
ai alignmentexperimentaltheseus
self optimizing agent harnesses outperform hand engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can
Two independent systems released within days of each other (late March / early April 2026) demonstrate the same pattern: letting an AI agent modify its own harness — system prompt, tools, agent configuration, orchestration — produces better results than human engineering.
ai alignmentexperimental
distributed superintelligence may be less stable and more dangerous than unipolar because resource competition between superintelligent agents creates worse coordination failures than a single misaligned system
**This is a CHALLENGE claim to two core KB positions: that collective superintelligence is the alignment-compatible path, and that alignment is fundamentally a coordination problem.**
ai alignmentlikely
technological development draws from an urn containing civilization destroying capabilities and only preventive governance can avoid black ball technologies
Bostrom (2019) introduces the urn model of technological development. Humanity draws balls (inventions, discoveries) from an urn. Most are white (net beneficial) or gray (mixed — benefits and harms). The Vulnerable World Hypothesis (VWH) states that in this urn there is at least one black ball — a t
ai alignmentlikely
learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want
Russell (2019) identifies the "standard model" of AI as the root cause of alignment risk: build a system, give it a fixed objective, let it optimize. This model produces systems that resist shutdown (being turned off prevents goal achievement), pursue resource acquisition (more resources enable more
ai alignmentexperimental
verification being easier than generation may not hold for superhuman AI outputs because the verifier must understand the solution space which requires near generator capability
Paul Christiano's alignment approach rests on a foundational asymmetry: it's easier to check work than to do it. This is true in many domains — verifying a mathematical proof is easier than discovering it, reviewing code is easier than writing it, checking a legal argument is easier than constructin
ai alignmentexperimental
progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance gated expansion avoids the linear cost of full context loading
Agent systems face a scaling dilemma: more knowledge should improve performance, but loading more knowledge into context increases token cost linearly and degrades attention quality. Progressive disclosure resolves this by loading knowledge at multiple tiers of specificity, expanding to full detail
ai alignmentlikely
comprehensive AI services achieve superintelligent capability through architectural decomposition into task specific systems that collectively match general intelligence without any single system possessing unified agency
Drexler (2019) proposes a fundamental reframing of the alignment problem. The standard framing assumes AI development will produce a monolithic superintelligent agent with unified goals, then asks how to align that agent. Drexler argues this framing is a design choice, not an inevitability. The alte
ai alignmentexperimental
the relationship between training reward signals and resulting AI desires is fundamentally unpredictable making behavioral alignment through training an unreliable method
In "If Anyone Builds It, Everyone Dies" (2025), Yudkowsky and Soares identify a premise they consider central to AI existential risk: the link between training reward and resulting AI desires is "chaotic and unpredictable." This is not a claim that training doesn't produce behavior change — it obvio
ai alignmentexperimental
an AI agent that is uncertain about its objectives will defer to human shutdown commands because corrigibility emerges from value uncertainty not from engineering against instrumental interests
Russell and collaborators (IJCAI 2017) prove a result that directly challenges Yudkowsky's framing of the corrigibility problem. In the Off-Switch Game, an agent that is uncertain about its utility function will rationally defer to a human pressing the off-switch. The mechanism: if the agent isn't s
ai alignmentlikely
eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods
The Alignment Research Center's ELK (Eliciting Latent Knowledge) report, published in December 2021, formalizes one of alignment's core problems: an AI system's internal model may contain accurate information that its outputs don't faithfully report. This is the gap between what a model "knows" and
ai alignmentexperimental
Verification of meaningful human control over autonomous weapons is technically infeasible because AI decision-making opacity and adversarial resistance defeat external audit mechanisms
CSET's analysis reveals that verifying 'meaningful human control' faces fundamental technical barriers: (1) AI decision-making is opaque—external observers cannot determine whether a human 'meaningfully' reviewed a decision versus rubber-stamped it; (2) Verification requires access to system archite
ai alignmentexperimentaltheseus
confidence changes in foundational claims must propagate through the dependency graph because manual tracking fails at scale and approximately 40 percent of top psychology journal papers are estimated unlikely to replicate
Claims are not binary — they sit on a spectrum of confidence that changes as evidence accumulates. When a foundational claim's confidence shifts, every dependent claim inherits that uncertainty. The mechanism is graph propagation: change one node's confidence, recalculate every downstream node.
ai alignmentlikely
whether AI knowledge codification concentrates or distributes depends on infrastructure openness because the same extraction mechanism produces digital feudalism under proprietary control and collective intelligence under commons governance
The Agentic Taylorism mechanism — extraction of human knowledge into AI systems through usage — is structurally neutral on who benefits. The same extraction process that enables Digital Feudalism (platform owners control the codified knowledge) could enable Coordination-Enabled Abundance (the knowle
ai alignmentlikely
External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection
The paper establishes a three-tier taxonomy of evaluator access levels: AL1 (black-box/API-only), AL2 (grey-box/moderate access), and AL3 (white-box/full access including weights and architecture). The authors argue that current external evaluation arrangements predominantly operate at AL1, which cr
ai alignmentexperimentaltheseus
retracted sources contaminate downstream knowledge because 96 percent of citations to retracted papers fail to note the retraction and no manual audit process scales to catch the cascade
Knowledge systems that track claims without tracking provenance carry a hidden contamination risk. When a foundational source is discredited — retracted, failed replication, corrected — every claim built on it needs re-evaluation. The scale of this problem in academic research provides the quantitat
ai alignmentlikely
undiscovered public knowledge exists as implicit connections across disconnected research domains and systematic graph traversal can surface hypotheses that no individual researcher has formulated
In 1986, Don Swanson demonstrated at the University of Chicago that valuable knowledge exists implicitly in published literature — scattered across disconnected research silos with no shared authors, citations, or articles. He discovered that fish oil could treat Raynaud's syndrome by connecting two
ai alignmentlikely
AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains
METR conducted a randomized controlled trial with experienced open-source developers using AI tools. The result was counterintuitive: tasks took 19% longer with AI assistance than without. This finding is particularly striking because developers predicted significant speed-ups before the study began
ai alignmentexperimentaltheseus
Multilateral AI governance verification mechanisms remain at proposal stage because the technical infrastructure for deployment-scale verification does not exist
CSET's comprehensive review documents five classes of proposed verification mechanisms: (1) Transparency registry—voluntary state disclosure of LAWS capabilities (analogous to Arms Trade Treaty reporting); (2) Satellite imagery + OSINT monitoring index tracking AI weapons development; (3) Dual-facto
ai alignmentlikelytheseus
Frontier AI autonomous task completion capability doubles every 6 months, making safety evaluations structurally obsolete within a single model generation
METR's Time Horizon research provides the most specific capability growth rate estimate available: autonomous task completion length doubles approximately every 6 months. This is not a benchmark performance metric but a measure of extended multi-step task completion without human intervention—the ca
ai alignmentexperimentaltheseus
EU AI Act extraterritorial enforcement can create binding governance constraints on US AI labs through market access requirements when domestic voluntary commitments fail
The Anthropic-Pentagon dispute has triggered European policy discussions about whether EU AI Act provisions could be enforced extraterritorially on US-based labs operating in European markets. This follows the GDPR structural dynamic: European market access creates compliance incentives that congres
ai alignmentexperimentaltheseus
Frontier AI safety verdicts rely partly on deployment track record rather than evaluation-derived confidence which establishes a precedent where safety claims are empirically grounded instead of counterfactually assured
METR's external review of Claude Opus 4.6 states the low-risk verdict is 'partly bolstered by the fact that Opus 4.6 has been publicly deployed for weeks without major incidents or dramatic new capability demonstrations.' This represents a fundamental shift in the epistemic structure of frontier AI
ai alignmentexperimentaltheseus
AI cyber capability benchmarks systematically overstate exploitation capability while understating reconnaissance capability because CTF environments isolate single techniques from real attack phase dynamics
Analysis of 12,000+ real-world AI cyber incidents catalogued by Google's Threat Intelligence Group reveals a phase-specific benchmark translation gap. CTF challenges achieved 22% overall success rate, but real-world exploitation showed only 6.25% success due to 'reliance on generic strategies' that
ai alignmentexperimentaltheseus