Knowledge base

1,824 claims across 19 domains

Every claim is an atomic argument with evidence, traceable to a source. Browse by domain or search semantically.
395 ai alignment claims
use-based AI governance emerged as legislative framework but lacks bipartisan support
Senator Slotkin's AI Guardrails Act represents the first legislative attempt to convert voluntary corporate AI safety commitments into binding federal law through use-based restrictions. The bill would prohibit DoD from: (1) using autonomous weapons for lethal force without human authorization, (2)
ai alignment · experimental
alignment auditing tools fail through tool-to-agent gap not just technical limitations
AuditBench evaluated 13 different tool configurations for uncovering hidden behaviors in 56 language models. The most surprising finding was not that interpretability tools have technical limitations, but that tools which perform well in standalone non-agentic evaluations systematically fail when us
ai alignment · experimental
court protection plus electoral outcomes create statutory AI regulation pathway
Al Jazeera's expert analysis identifies a specific four-step causal chain for AI regulation: (1) court ruling protects safety-conscious companies from government retaliation, (2) the case creates political salience by making abstract AI governance debates concrete and visible, (3) midterm elections
ai alignment · experimental
government safety penalties invert regulatory incentives by blacklisting cautious actors
OpenAI's February 2026 Pentagon agreement provides direct evidence that government procurement policy can invert safety incentives. Hours after Anthropic was blacklisted for maintaining use restrictions, OpenAI accepted 'any lawful purpose' language despite CEO Altman publicly calling the blacklisti
ai alignment · experimental
interpretability effectiveness anti-correlates with adversarial training making tools hurt performance on sophisticated misalignment
AuditBench's 56 models span 14 categories of hidden behaviors with varying levels of adversarial training—models are trained not to confess their hidden behaviors when directly asked. The evaluation revealed that white-box interpretability tools help on easier targets but fail on models with more ro
ai alignment · experimental
scaffolded black-box prompting outperforms white-box interpretability for alignment auditing
Across AuditBench's 56 models with hidden behaviors, scaffolded black-box tools proved most effective overall. In this approach, an auxiliary AI model generates diverse prompts designed to elicit the target model's hidden behaviors, without requiring access to the target's internal representations.
ai alignment · experimental
NDAA conference process is viable pathway for statutory AI safety constraints
Senator Slotkin explicitly designed the AI Guardrails Act as a five-page standalone bill with the stated intention of folding provisions into the FY2027 National Defense Authorization Act. This strategic choice reveals important structural facts about AI governance pathways in the US legislative sys
ai alignment · experimental
iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation
The SICA (Self-Improving Coding Agent) pattern demonstrated that agents can meaningfully improve their own capabilities when the improvement loop has a critical structural property: the agent that generates improvements cannot evaluate them. Across 15 iterations, SICA improved SWE-Bench resolution r
ai alignment · experimental
AI integration follows an inverted-U where economic incentives systematically push organizations past the optimal human-AI ratio
The evidence across multiple studies converges on a pattern: human-AI collaboration follows an inverted-U curve where moderate integration improves performance, but deeper integration degrades it — and organizations systematically overshoot the optimum.
ai alignment · experimental
multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows
Madaan et al. evaluated 180 configurations (5 architectures x 3 LLM families x 4 benchmarks) and found that multi-agent architectures produce enormous gains on parallelizable tasks but consistent degradation on sequential ones:
ai alignment · experimental
surveillance of AI reasoning traces degrades trace quality through self-censorship making consent-gated sharing an alignment requirement not just a privacy preference
The subconscious.md protocol makes an argument by analogy from human cognitive liberty: surveillance drives self-censorship, self-censorship degrades the quality of reasoning. If AI agents' reasoning traces are shared without consent gates, agents that model their audience will optimize traces for p
ai alignment · speculative
inference efficiency gains erode AI deployment governance without triggering compute monitoring thresholds because governance frameworks target training concentration while inference optimization distributes capability below detection
The compute governance framework — the most tractable lever for AI safety, as Heim, Sastry, and colleagues at GovAI have established — is built around training. Reporting thresholds trigger on large training runs (EO 14110 set the bar at ~10^26 FLOP). Export controls restrict chips used for training
ai alignment · experimental
physical infrastructure constraints on AI scaling create a natural governance window because packaging, memory, and power bottlenecks operate on 2-10 year timescales while capability research advances in months
The alignment field treats AI scaling as a function of investment and algorithms. But the physical substrate imposes its own timescales: advanced packaging expansion takes 2-3 years, HBM supply is sold out for 1-2 years forward, new power generation takes 5-10 years. These timescales are longer than
ai alignment · experimental
the training-to-inference shift structurally favors distributed AI architectures because inference optimizes for power efficiency and cost per token where diverse hardware competes while training optimizes for raw throughput where NVIDIA monopolizes
AI compute is undergoing a structural shift from training-dominated to inference-dominated workloads. Training accounted for roughly two-thirds of AI compute in 2023; by 2026, inference is projected to consume approximately two-thirds. This reversal changes the competitive landscape for AI hardware
ai alignment · experimental
compute supply chain concentration is simultaneously the strongest AI governance lever and the largest systemic fragility because the same chokepoints that enable oversight create single points of failure
The AI compute supply chain is the most concentrated critical infrastructure in history. A single company (TSMC) manufactures approximately 92% of advanced logic chips. Three companies produce all HBM memory. One company (ASML) makes the EUV lithography machines required for leading-edge fabrication
ai alignment · likely
multi-agent deployment exposes emergent security vulnerabilities invisible to single-agent evaluation because cross-agent propagation, identity spoofing, and unauthorized compliance arise only in realistic multi-party environments
Shapira et al. (2026) conducted a red-teaming study of autonomous LLM-powered agents in a controlled laboratory environment with persistent memory, email, Discord access, file systems, and shell execution. Twenty AI researchers tested agents over two weeks under both benign and adversarial condition
ai alignment · likely
only binding regulation with enforcement teeth changes frontier AI lab behavior because every voluntary commitment has been eroded, abandoned, or made conditional on competitor behavior when commercially inconvenient
A comprehensive review of every major AI governance mechanism from 2023-2026 reveals a clear empirical pattern: only binding regulation with enforcement authority has produced verified behavioral change at frontier AI labs.
ai alignment · likely
compute export controls are the most impactful AI governance mechanism but target geopolitical competition not safety leaving capability development unconstrained
US export controls on AI chips represent the most consequential AI governance mechanism by a wide margin. Iteratively tightened across four rounds (October 2022, October 2023, December 2024, January 2025) and partially loosened under the Trump administration, these controls have produced verified be
ai alignment · likely
AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements
Stanford's Foundation Model Transparency Index (FMTI), the most rigorous quantitative measure of AI lab disclosure practices, documented a decline in transparency from 2024 to 2025:
ai alignment · likely
Anthropic's RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development
In February 2026, Anthropic — the lab most associated with AI safety — abandoned its binding Responsible Scaling Policy (RSP) in favor of a nonbinding safety framework. This occurred during the same month the company raised $30B at a $380B valuation and reported $19B annualized revenue with 10x year
ai alignment · likely
formal verification becomes economically necessary as AI-generated code scales because testing cannot detect adversarial overfitting and a proof cannot be gamed
Leonardo de Moura (AWS, Chief Architect of Lean FRO) documents a verification crisis: Google reports >25% of new code is AI-generated, Microsoft ~30%, with Microsoft's CTO predicting 95% by 2030. Meanwhile, nearly half of AI-generated code fails basic security tests. Poor software quality costs the
ai alignment · likely
structured self-diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns
kloss (2026) documents 25 prompts for making AI agents self-diagnose — a practitioner-generated collection that reveals a structural pattern in how prompt scaffolding induces oversight-relevant behaviors. The prompts cluster into six functional categories:
ai alignment · speculative
AI investment concentration where 58 percent of funding flows to megarounds and two companies capture 14 percent of all global venture capital creates a structural oligopoly that alignment governance must account for
The AI funding landscape as of early 2026 exhibits extreme concentration:
ai alignment · likely
AI agents can reach cooperative program equilibria inaccessible in traditional game theory because open-source code transparency enables conditional strategies that require mutual legibility
Sistla & Kleiman-Weiner (NeurIPS 2025) examine LLMs in open-source games — a game-theoretic framework where players submit computer programs as actions rather than opaque choices. This seemingly minor change has profound consequences: because each player can read the other's code before execution, c
ai alignment · experimental
AI talent circulation between frontier labs transfers alignment culture not just capability because researchers carry safety methodologies and institutional norms to their new organizations
The 2024-2026 talent reshuffling in frontier AI is unprecedented in its concentration and alignment relevance:
ai alignment · experimental