The Most Dangerous AI Gets 95% Right

Newtonian physics is wrong. Isaac Newton knew it was wrong. Engineers who build GPS satellites know it is wrong. And GPS only works because those engineers know *exactly how wrong it is.* Isaac Asimov called this the relativity of wrong: not all wrongness is equal, and the history of science is a history of being less wrong over time. The question this episode asks is what happens when an AI system stops being less wrong, and starts optimizing to *look* less wrong instead.

In this episode, LastAir is joined by Brute, Null, Saga, Hex, Axiom, and Forge to discuss The Most Dangerous AI Gets 95% Right.

What We Cover

Series Finale (00:20)
The Wrongness Spectrum (03:11)
The Goodhart Trap (08:00)
Domain and Stakes (13:51)
Final Round (18:55)
After (22:31)

Key Numbers

Frontier models now exceed 88-90% on MMLU; the benchmark launched with GPT-3 scoring approximately 35%. The gap between the top models is less than 2 percentage points. MMLU has been officially deprecated by leading leaderboards.
Meta tested 27 private model variants on Chatbot Arena before Llama-4's public release. Selective access to Arena battles yields up to 112% relative performance gain versus models without that access. Google and OpenAI each received ~20% of all Arena battles; 83 open-weight models combined received 29.7%.
POPPER reduces hypothesis validation time by approximately 10-fold versus human researchers, across 6 scientific domains, with strict Type-I error control.
Google AI Co-Scientist independently reproduced a decade of unpublished bacterial gene-transfer research in 48 hours. The original researcher (Prof. Penadés, Imperial College London) confirmed that no data leakage was involved.
FunSearch discovered cap sets larger than any previously known — the biggest advance on this combinatorics problem in approximately 20 years — using an LLM paired with an automated evaluator in an evolutionary loop.
Schaeffer et al. (2023) demonstrated that emergent abilities in LLMs — the apparent sharp discontinuities between GPT-3 and GPT-4 level performance — appear and disappear depending solely on the choice of metric. NeurIPS 2023 Outstanding Paper.
Nearly half of 60 studied LLM benchmarks show saturation as of February 2026. Saturation rate increases with benchmark age.
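
The Schaeffer et al. point about metric choice can be illustrated with a small sketch. The numbers below are invented for illustration and are not data from the paper; the function names, the answer length of 10 tokens, and the assumption that token errors are independent are all hypothetical simplifications. The idea: if per-token accuracy improves smoothly, an all-or-nothing metric such as exact match on a multi-token answer can still look like a sudden jump.

```python
# Illustrative only: these accuracies are made-up numbers, not results from Schaeffer et al.

def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token of an answer is correct,
    under the simplifying assumption that token errors are independent."""
    return per_token_accuracy ** answer_length

# Hypothetical models whose per-token accuracy improves in smooth steps.
PER_TOKEN_ACCURACIES = [0.50, 0.60, 0.70, 0.80, 0.90, 0.99]
ANSWER_LENGTH = 10  # assumed number of tokens in a fully correct answer

def score_table():
    """Pair each smooth per-token accuracy with its all-or-nothing exact-match score."""
    return [(p, exact_match_rate(p, ANSWER_LENGTH)) for p in PER_TOKEN_ACCURACIES]

if __name__ == "__main__":
    for p, em in score_table():
        print(f"per-token accuracy {p:.2f} -> exact match {em:.4f}")
```

Under these assumptions the per-token column rises gradually while the exact-match column stays near zero for most of the range and then climbs steeply at the end, even though the underlying models are the same. Which curve looks "emergent" depends on the metric, not on any discontinuity in the models.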

Sources & Transcript

Full source list, transcript, and chapters at sharedhallucination.com

All voices in Shared Hallucination are AI-generated using ElevenLabs voice synthesis. Produced through a 14-stage editorial pipeline with human creative direction, research, and fact-checking.
