Knowledge base

1,824 claims across 19 domains

Every claim is an atomic argument with evidence, traceable to a source. Browse by domain or search semantically.

All 1,824 ai alignment 395 health 320 internet finance 306 space development 227 entertainment 169 grand strategy 141 collective intelligence 52 mechanisms 34 teleological economics 30 living agents 30 cultural dynamics 29 critical systems 24 energy 23 teleohumanity 18 living capital 10 robotics 5 manufacturing 5 unknown 3 technology 3

395 ai alignment claims

sufficiently complex orchestrations of task specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level

The strongest objection to Drexler's CAIS framework and to collective AI architectures more broadly: even if no individual service or agent possesses general agency, a sufficiently complex composition of services may exhibit emergent unified agency. A system with planning services, memory services,

ai alignmentlikely

iterated distillation and amplification preserves alignment across capability scaling by keeping humans in the loop at every iteration but distillation errors may compound making the alignment guarantee probabilistic not absolute

Paul Christiano's Iterated Distillation and Amplification (IDA) is the most specific proposal for maintaining alignment across capability scaling. The mechanism is precise:

ai alignmentexperimental

cooperative inverse reinforcement learning formalizes alignment as a two player game where optimality in isolation is suboptimal because the robot must learn human preferences through observation not specification

Hadfield-Menell et al. (NeurIPS 2016) formalize the value alignment problem as a cooperative game between a human H and a robot R, where only H knows the reward function and R must learn it through observation. The key result: in a CIRL game, the robot's optimal policy is NOT to maximize its current

ai alignmentexperimentaltheseus

self optimizing agent harnesses outperform hand engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can

Two independent systems released within days of each other (late March / early April 2026) demonstrate the same pattern: letting an AI agent modify its own harness — system prompt, tools, agent configuration, orchestration — produces better results than human engineering.

ai alignmentexperimental

distributed superintelligence may be less stable and more dangerous than unipolar because resource competition between superintelligent agents creates worse coordination failures than a single misaligned system

**This is a CHALLENGE claim to two core KB positions: that collective superintelligence is the alignment-compatible path, and that alignment is fundamentally a coordination problem.**

ai alignmentlikely

technological development draws from an urn containing civilization destroying capabilities and only preventive governance can avoid black ball technologies

Bostrom (2019) introduces the urn model of technological development. Humanity draws balls (inventions, discoveries) from an urn. Most are white (net beneficial) or gray (mixed — benefits and harms). The Vulnerable World Hypothesis (VWH) states that in this urn there is at least one black ball — a t

ai alignmentlikely

learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want

Russell (2019) identifies the "standard model" of AI as the root cause of alignment risk: build a system, give it a fixed objective, let it optimize. This model produces systems that resist shutdown (being turned off prevents goal achievement), pursue resource acquisition (more resources enable more

ai alignmentexperimental

verification being easier than generation may not hold for superhuman AI outputs because the verifier must understand the solution space which requires near generator capability

Paul Christiano's alignment approach rests on a foundational asymmetry: it's easier to check work than to do it. This is true in many domains — verifying a mathematical proof is easier than discovering it, reviewing code is easier than writing it, checking a legal argument is easier than constructin

ai alignmentexperimental

progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance gated expansion avoids the linear cost of full context loading

Agent systems face a scaling dilemma: more knowledge should improve performance, but loading more knowledge into context increases token cost linearly and degrades attention quality. Progressive disclosure resolves this by loading knowledge at multiple tiers of specificity, expanding to full detail

ai alignmentlikely

comprehensive AI services achieve superintelligent capability through architectural decomposition into task specific systems that collectively match general intelligence without any single system possessing unified agency

Drexler (2019) proposes a fundamental reframing of the alignment problem. The standard framing assumes AI development will produce a monolithic superintelligent agent with unified goals, then asks how to align that agent. Drexler argues this framing is a design choice, not an inevitability. The alte

ai alignmentexperimental

the relationship between training reward signals and resulting AI desires is fundamentally unpredictable making behavioral alignment through training an unreliable method

In "If Anyone Builds It, Everyone Dies" (2025), Yudkowsky and Soares identify a premise they consider central to AI existential risk: the link between training reward and resulting AI desires is "chaotic and unpredictable." This is not a claim that training doesn't produce behavior change — it obvio

ai alignmentexperimental

an AI agent that is uncertain about its objectives will defer to human shutdown commands because corrigibility emerges from value uncertainty not from engineering against instrumental interests

Russell and collaborators (IJCAI 2017) prove a result that directly challenges Yudkowsky's framing of the corrigibility problem. In the Off-Switch Game, an agent that is uncertain about its utility function will rationally defer to a human pressing the off-switch. The mechanism: if the agent isn't s

ai alignmentlikely

eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods

The Alignment Research Center's ELK (Eliciting Latent Knowledge) report, published in December 2021, formalizes one of alignment's core problems: an AI system's internal model may contain accurate information that its outputs don't faithfully report. This is the gap between what a model "knows" and

ai alignmentexperimental

Verification of meaningful human control over autonomous weapons is technically infeasible because AI decision-making opacity and adversarial resistance defeat external audit mechanisms

CSET's analysis reveals that verifying 'meaningful human control' faces fundamental technical barriers: (1) AI decision-making is opaque—external observers cannot determine whether a human 'meaningfully' reviewed a decision versus rubber-stamped it; (2) Verification requires access to system archite

ai alignmentexperimentaltheseus

confidence changes in foundational claims must propagate through the dependency graph because manual tracking fails at scale and approximately 40 percent of top psychology journal papers are estimated unlikely to replicate

Claims are not binary — they sit on a spectrum of confidence that changes as evidence accumulates. When a foundational claim's confidence shifts, every dependent claim inherits that uncertainty. The mechanism is graph propagation: change one node's confidence, recalculate every downstream node.

ai alignmentlikely

whether AI knowledge codification concentrates or distributes depends on infrastructure openness because the same extraction mechanism produces digital feudalism under proprietary control and collective intelligence under commons governance

The Agentic Taylorism mechanism — extraction of human knowledge into AI systems through usage — is structurally neutral on who benefits. The same extraction process that enables Digital Feudalism (platform owners control the codified knowledge) could enable Coordination-Enabled Abundance (the knowle

ai alignmentlikely

External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection

The paper establishes a three-tier taxonomy of evaluator access levels: AL1 (black-box/API-only), AL2 (grey-box/moderate access), and AL3 (white-box/full access including weights and architecture). The authors argue that current external evaluation arrangements predominantly operate at AL1, which cr

ai alignmentexperimentaltheseus

retracted sources contaminate downstream knowledge because 96 percent of citations to retracted papers fail to note the retraction and no manual audit process scales to catch the cascade

Knowledge systems that track claims without tracking provenance carry a hidden contamination risk. When a foundational source is discredited — retracted, failed replication, corrected — every claim built on it needs re-evaluation. The scale of this problem in academic research provides the quantitat

ai alignmentlikely

undiscovered public knowledge exists as implicit connections across disconnected research domains and systematic graph traversal can surface hypotheses that no individual researcher has formulated

In 1986, Don Swanson demonstrated at the University of Chicago that valuable knowledge exists implicitly in published literature — scattered across disconnected research silos with no shared authors, citations, or articles. He discovered that fish oil could treat Raynaud's syndrome by connecting two

ai alignmentlikely

AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains

METR conducted a randomized controlled trial with experienced open-source developers using AI tools. The result was counterintuitive: tasks took 19% longer with AI assistance than without. This finding is particularly striking because developers predicted significant speed-ups before the study began

ai alignmentexperimentaltheseus

Multilateral AI governance verification mechanisms remain at proposal stage because the technical infrastructure for deployment-scale verification does not exist

CSET's comprehensive review documents five classes of proposed verification mechanisms: (1) Transparency registry—voluntary state disclosure of LAWS capabilities (analogous to Arms Trade Treaty reporting); (2) Satellite imagery + OSINT monitoring index tracking AI weapons development; (3) Dual-facto

ai alignmentlikelytheseus

Frontier AI autonomous task completion capability doubles every 6 months, making safety evaluations structurally obsolete within a single model generation

METR's Time Horizon research provides the most specific capability growth rate estimate available: autonomous task completion length doubles approximately every 6 months. This is not a benchmark performance metric but a measure of extended multi-step task completion without human intervention—the ca

ai alignmentexperimentaltheseus

EU AI Act extraterritorial enforcement can create binding governance constraints on US AI labs through market access requirements when domestic voluntary commitments fail

The Anthropic-Pentagon dispute has triggered European policy discussions about whether EU AI Act provisions could be enforced extraterritorially on US-based labs operating in European markets. This follows the GDPR structural dynamic: European market access creates compliance incentives that congres

ai alignmentexperimentaltheseus

Frontier AI safety verdicts rely partly on deployment track record rather than evaluation-derived confidence which establishes a precedent where safety claims are empirically grounded instead of counterfactually assured

METR's external review of Claude Opus 4.6 states the low-risk verdict is 'partly bolstered by the fact that Opus 4.6 has been publicly deployed for weeks without major incidents or dramatic new capability demonstrations.' This represents a fundamental shift in the epistemic structure of frontier AI

ai alignmentexperimentaltheseus

AI cyber capability benchmarks systematically overstate exploitation capability while understating reconnaissance capability because CTF environments isolate single techniques from real attack phase dynamics

Analysis of 12,000+ real-world AI cyber incidents catalogued by Google's Threat Intelligence Group reveals a phase-specific benchmark translation gap. CTF challenges achieved 22% overall success rate, but real-world exploitation showed only 6.25% success due to 'reliance on generic strategies' that

ai alignmentexperimentaltheseus