ELO Ratings Questions
About this audio content
- Thesis: Using ELO for AI agent evaluation = measuring noise
- Problem: Wrong evaluators, wrong metrics, wrong assumptions
- Solution: Quantitative assessment frameworks
Chess ELO
- FIDE arbiters: 120hr training
- Binary outcome: win/loss
- Test-retest: r=0.95
- Cohen's κ=0.92
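
For readers unfamiliar with the κ figures quoted in these notes, here is a minimal Python sketch of how Cohen's kappa is computed for two raters. The function name `cohens_kappa` and the sample labels are illustrative assumptions, not material from the episode.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: the raters agree on 3 of 4 items, chance agreement is 0.5.
print(cohens_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"]))  # 0.5
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the simple percentage of matching votes.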
AI Agent ELO
- Random users: Google engineer? CS student? 10-year-old?
- Undefined dimensions: accuracy? style? speed?
- Test-retest: r=0.31 (coin flip)
- Cohen's κ=0.42
- Anchoring: 34% rating variance in first 3 seconds
- Confirmation: 78% selective attention to preferred features
- Dunning-Kruger: d=1.24 effect size
- Result: Circular preferences (A>B>C>A)
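
The circular-preference result (A>B>C>A) can be checked mechanically. The sketch below assumes pairwise outcomes are available as `(winner, loser)` tuples, which is a hypothetical data shape chosen for illustration, and searches the resulting "beats" graph for a cycle with a depth-first search.

```python
def find_preference_cycle(pairwise_wins):
    """Return one cycle such as ['A', 'B', 'C', 'A'] if the 'beats' graph has one, else None."""
    graph = {}
    for winner, loser in pairwise_wins:
        graph.setdefault(winner, set()).add(loser)
        graph.setdefault(loser, set())

    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path = []

    def visit(node):
        state[node] = ON_PATH
        path.append(node)
        for nxt in sorted(graph[node]):
            if state[nxt] == ON_PATH:
                # Reached a node already on the current path: that closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in sorted(graph):
        if state[node] == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


print(find_preference_cycle([("A", "B"), ("B", "C"), ("C", "A")]))  # ['A', 'B', 'C', 'A']
print(find_preference_cycle([("A", "B"), ("B", "C"), ("A", "C")]))  # None
```

If a cycle exists, no single ranking is consistent with all the pairwise votes, so any one-dimensional rating fitted to them necessarily misrepresents some outcomes.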
Objective Metrics
- McCabe complexity ≤20
- Test coverage ≥80%
- Big O notation comparison
- Self-admitted technical debt
- Reliability: r=0.91 vs r=0.42
- Effect size: d=2.18
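
As an illustration of the first two thresholds above, the following Python sketch approximates McCabe complexity with the standard `ast` module and combines it with a coverage figure. It is a rough assumption-laden stand-in, not the TDG or PMAT implementation: it scores a whole source string rather than each function, counts only common branch constructs, and the helper names are hypothetical.

```python
import ast

# Constructs treated as one decision point each (an approximation).
BRANCH_NODES = (
    ast.If, ast.For, ast.AsyncFor, ast.While,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source):
    """Approximate McCabe complexity: 1 + number of decision points in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


def passes_quality_gate(source, coverage_percent, max_complexity=20, min_coverage=80.0):
    """Apply the complexity <= 20 and coverage >= 80% thresholds from the notes."""
    return (cyclomatic_complexity(source) <= max_complexity
            and coverage_percent >= min_coverage)


sample = (
    "def f(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
)
print(cyclomatic_complexity(sample))      # 4
print(passes_quality_gate(sample, 85.0))  # True
```

The coverage percentage is passed in as a plain number here; in practice it would come from whatever coverage tool the project already uses.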
Dream
- World's best engineers
- Annotated metrics
- Standardized criteria
Reality
- Random internet users
- No expertise verification
- Subjective preferences
- Stop: Using preference votes as quality metrics
- Start: Automated complexity analysis
- ROI: 4.7 months to break even
- Kapoor et al. (2025): "AI agents that matter" - κ=0.42 finding
- Santos et al. (2022): Technical Debt Grading validation
- Regan & Haworth (2011): Chess arbiter reliability κ=0.92
- Chapman & Johnson (2002): 34% anchoring effect
"You can't rate chess with basketball fans"
"0.31 reliability? That's a coin flip with extra steps"
"Every preference vote is a data crime"
"The psychometrics are screaming"
Resources
- Technical Debt Grading (TDG) Framework
- PMAT (Pragmatic AI Labs MCP Agent Toolkit)
- McCabe Complexity Calculator
- Cohen's Kappa Calculator