Episodes

  • The AI-Cloud Native Symbiosis - How Intelligent Infrastructure is Transforming Platform Engineering
    Jan 14 2026

    By 2025, 90% of new enterprise applications will be AI-powered and cloud-native. This episode explores the symbiotic relationship between AI and Kubernetes - where AI isn't just another workload, but is fundamentally transforming how we build and operate cloud native platforms. We cover real-world examples like Netflix's predictive scaling achieving 92% accuracy, the emergence of AI-driven observability platforms, and why platform engineers need to evolve from infrastructure operators to AI-infrastructure orchestrators.

    In this episode:
    - AI transforming the Kubernetes control plane with predictive scheduling
    - Netflix's AI-driven traffic management: 92% prediction accuracy, 35% resource reduction
    - AI-native observability: anomaly detection on metric relationships, not just metrics
    - GPU orchestration: NVIDIA GPU Operator achieving 80%+ utilization vs 30-40% baseline
    - Edge AI patterns: federated learning, model distillation, intermittent connectivity
    - Skills evolution: understanding AI workload characteristics without becoming ML experts
    - News: Red Hat connects AI to Istio via Kiali MCP Server, AWS CloudWatch adds Apache Iceberg support

    Perfect for senior platform engineers, SREs, and DevOps engineers looking to understand the convergence of AI and cloud native technologies.

    New episodes every week. Subscribe wherever you listen to stay current on platform engineering.

    Episode URL: https://platformengineeringplaybook.com/podcasts/00090-ai-cloud-native-symbiosis

    Duration: 15 minutes

    Hosts: Alex and Jordan

    Category: Technology

    Subcategory: Software How-To

    Keywords: AI, cloud native, Kubernetes, symbiosis, intelligent infrastructure, platform engineering, GPU orchestration, predictive scaling, observability, machine learning, Netflix, edge AI, federated learning

    15 min
  • MIT 10 Breakthrough Technologies 2026 - The Platform Engineering Perspective
    Jan 13 2026

    MIT just released their 10 Breakthrough Technologies for 2026 - and three of them are infrastructure problems that platform engineers are solving right now. This episode explores hyperscale AI data centers consuming 96 GW globally by 2026, vibe coding with 41% of code now AI-generated, and LLM interpretability research from Anthropic. We break down how platform engineers enable these breakthroughs through power-aware scheduling, AI coding guardrails, and new observability patterns for ML systems.

    In this episode:
    - Hyperscale AI data centers: 96 GW capacity, $600B capex, 100+ kW per rack
    - Vibe coding: 92% developer AI adoption, GitHub Copilot at 20M users
    - LLM interpretability: Anthropic's sparse autoencoders for debugging AI
    - Platform skills needed: power management, GPU orchestration, ML observability
    - News: Cloudflare IaC security, AWS CloudWatch Iceberg, SSL certificate dangers

    Perfect for senior platform engineers, SREs, and DevOps engineers looking to understand the infrastructure behind 2026's biggest tech breakthroughs.

    New episodes every week. Subscribe wherever you listen to stay current on platform engineering.

    Episode URL: https://platformengineeringplaybook.com/podcasts/00089-mit-10-breakthrough-technologies-2026

    Duration: 21 minutes

    Hosts: Alex and Jordan

    Category: Technology

    Subcategory: Software How-To

    Keywords: MIT, breakthrough technologies, 2026, AI, hyperscale, data centers, vibe coding, LLM, interpretability, platform engineering, infrastructure, GPU, Copilot, Cursor

    21 min
  • AWS Route 53 Global Resolver - Enterprise DNS Security at the Edge
    Jan 12 2026

    Every DNS query your hybrid environment makes could be exposing sensitive data. AWS Route 53 Global Resolver, announced at re:Invent 2025, combines anycast routing, encrypted DNS protocols (DoH/DoT), and managed threat filtering in a single service.

    In this episode, we cover:
    - Anycast DNS architecture routing to nearest of 11 AWS regions
    - DoH and DoT encrypted DNS protocol support
    - AWS RAM authorization for multi-account private hosted zones
    - DNS filtering with managed threat lists
    - Implementation patterns for hybrid environments and remote workforces
    - Query logging for security visibility and threat hunting

    Plus news on Claude Code creator workflows, UK encryption backdoors, K8s EU hosting costs, PostgreSQL replacing Redis, and Rust ecosystem security.

    Links:
    - Episode page: https://playbook.platformengineering.org/podcasts/00088-aws-route-53-global-resolver
    - AWS Route 53 Global Resolver docs: https://docs.aws.amazon.com/route53/latest/userguide/resolver-global-resolver.html

    #AWS #Route53 #DNS #DoH #DoT #HybridCloud #Security #PlatformEngineering #DevOps

    20 min
  • Kubernetes Upcoming Features Deep Dive - Extended Toleration Operators and Mutable PV Node Affinity
    Jan 11 2026

    There's a Kubernetes cluster out there right now burning ten thousand dollars a month on GPU nodes that sit idle sixty percent of the time. Why? Because the scheduler can't say "only schedule pods on nodes with MORE than four GPUs." It's 2026, and our scheduler still can't count. But that's about to change.

    In this episode, we dive deep into two alpha features in Kubernetes 1.35 that represent a fundamental shift in how Kubernetes handles scheduling and storage:

    **Extended Toleration Operators (KEP-5471)** - Finally, numeric threshold-based scheduling with taints. New Gt (greater than) and Lt (less than) operators let you express "I can tolerate risk up to 5%" or "schedule me on nodes with at least 4 GPUs."

    **Mutable PersistentVolume Node Affinity (KEP-5381)** - Storage topology that adapts to reality. When you migrate volumes between availability zones, you no longer need to recreate pods and PVs - just update the nodeAffinity.
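
    The Gt and Lt comparison described for KEP-5471 can be modeled in a few lines of Python. This is a minimal sketch, not the scheduler's implementation: the `Taint` and `Toleration` classes, the `tolerates` helper, and the `example.com/gpu-count` taint key are hypothetical, and it assumes that values are parsed as integers and that `Gt` matches when the taint's value is greater than the toleration's value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str  # "Exists", "Equal", "Gt", or "Lt"
    value: str = ""
    effect: str = ""  # an empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (illustrative model only)."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    if toleration.operator in ("Gt", "Lt"):
        try:
            taint_num = int(taint.value)
            tol_num = int(toleration.value)
        except ValueError:
            return False  # non-numeric values never match a numeric operator
        # ASSUMPTION: the taint's value is compared against the toleration's value.
        if toleration.operator == "Gt":
            return taint_num > tol_num
        return taint_num < tol_num
    return False


# Hypothetical taint key: nodes advertise their GPU count as a taint value.
needs_many_gpus = Toleration("example.com/gpu-count", "Gt", "4", "NoSchedule")
big_node = Taint("example.com/gpu-count", "8", "NoSchedule")
small_node = Taint("example.com/gpu-count", "2", "NoSchedule")

print(tolerates(needs_many_gpus, big_node))
print(tolerates(needs_many_gpus, small_node))
```

    Under these assumptions the pod tolerates the 8-GPU node's taint but not the 2-GPU node's, so only the larger node remains a scheduling candidate.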

    Plus platform engineering news:
    - OpenEverest: Percona's database platform goes open governance
    - GKE Agent Sandbox: Kernel-level isolation for AI agent code execution
    - MongoBleed (CVE-2025-14847): Critical vulnerability with 87,000 exposed servers
    - Predictive capacity planning and the shift from reactive to proactive infrastructure

    This is Kubernetes evolving from reactive feedback loops to truly predictive infrastructure.

    Listen on the web: https://platformengineering.org/podcasts/00087-kubernetes-upcoming-features-deep-dive

    41 min
  • Why Is a 2016 AWS Instance Still the Best Value? (Cloudspecs Research)
    Jan 10 2026

    New research from TUM reveals uncomfortable truths about cloud hardware stagnation. The paper "Cloudspecs: Cloud Hardware Evolution Through the Looking Glass" shows that the best-performing AWS instance for NVMe I/O per dollar was released in 2016 - and nothing since has come close.

    In this episode:
    • CIDR 2026 research from Technical University of Munich
    • AWS i3 instances from 2016 still beat all newer options for storage price-performance
    • CPU gains: 10x cores, but only 2-3x cost-adjusted improvement
    • Memory crisis: DRAM capacity per dollar has "effectively flatlined"
    • Network is the only bright spot: 10x improvement per dollar
    • Interactive tool at cloudspecs.fyi using DuckDB-WASM

    News segment covers AI coding tool challenges, Kubernetes updates (Dashboard archived, CoreDNS 1.14), Windows Secure Boot certificate expiration, AWS Lambda .NET 10, Amazon MQ mTLS, MCP criticism, and NVIDIA Rubin announcement.

    Episode page: https://platformengineering.org/podcasts/00086-cloudspecs-cloud-hardware-evolution

    #PlatformEngineering #CloudComputing #AWS #FinOps #CostOptimization #DevOps

    21 min
  • Iran IPv6 Blackout - When Governments Weaponize Protocol Transitions
    Jan 9 2026

    The same IPv6 transition your infrastructure team has been procrastinating on is now being weaponized by governments. On January 8, 2026, Iran's IPv6 address space dropped 98.5% while IPv4 remained intact—a surgical strike against mobile users.

    In this episode, we break down:
    - Why blocking IPv6 specifically targets mobile users (hint: carrier NAT exhaustion)
    - The BGP mechanics of protocol-specific blocking
    - "Engineered degradation" vs total blackout - the new censorship playbook
    - How Starlink terminals are changing the calculus for authoritarian internet control
    - What platform engineers need to know: protocol-specific monitoring, Happy Eyeballs testing, dual-stack resilience
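
    Protocol-specific monitoring, as mentioned above, means probing IPv4 and IPv6 separately, so that a Happy Eyeballs fallback to IPv4 cannot hide an IPv6 outage. The following is a minimal sketch using only the Python standard library; the function names, the status labels, and the default port are illustrative assumptions, not part of any tool discussed in the episode.

```python
import socket


def can_connect(host: str, port: int, family: socket.AddressFamily,
                timeout: float = 3.0) -> bool:
    """Try a TCP connection to host:port using one address family only."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no address of this family could be resolved
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next resolved address
    return False


def classify(v4_ok: bool, v6_ok: bool) -> str:
    """Turn two per-protocol probe results into a single status label."""
    if v4_ok and v6_ok:
        return "dual-stack healthy"
    if v4_ok:
        return "IPv6 degraded"
    if v6_ok:
        return "IPv4 degraded"
    return "unreachable"


def check_host(host: str, port: int = 443) -> str:
    """Probe both protocols independently and report the combined status."""
    return classify(
        can_connect(host, port, socket.AF_INET),
        can_connect(host, port, socket.AF_INET6),
    )
```

    Calling `check_host("example.com")` from several vantage points and alerting on "IPv6 degraded" is one way to notice a protocol-specific failure that end users on dual-stack clients would never report.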

    Plus news: Kubernetes 1.35 CSI SA tokens, HashiCorp non-human identity, CoreDNS 1.14.0, OpenTelemetry Slack analysis, AWS Route 53 Global Resolver, and how long kernel bugs stay hidden.

    Links:
    - Episode page: https://platformengineering.org/podcasts/00085-iran-ipv6-blackout
    - Cloudflare Radar Iran: https://radar.cloudflare.com/ir
    - RFC 8305 Happy Eyeballs: https://datatracker.ietf.org/doc/html/rfc8305

    24 min
  • Venezuela BGP Anomaly - Deep Technical Analysis
    Jan 8 2026

    A deep technical dive into the January 2026 Venezuela BGP route leak incident. Was it a cyberattack? The technical evidence says no - and that's actually more concerning.

    In this special deep-dive episode (no news segment), Jordan and Alex break down:

    - What actually happened on January 2, 2026 with AS8048 (CANTV, Venezuela's state ISP)
    - Why 10x AS-path prepending proves this was misconfiguration, not a man-in-the-middle attack
    - How BGP valley-free routing works and why Type 1 Hairpin leaks happen
    - The pattern of 11 similar leaks from CANTV since December 2025
    - Why your multi-region deployment doesn't protect you from BGP anomalies
    - RPKI, RFC 9234 OTC, and ASPA - the defenses that exist and why adoption is slow
    - Practical steps: Check your providers at isbgpsafeyet.com, deploy ROAs, add BGP monitoring
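
    The valley-free rule mentioned above says that a route climbs through zero or more customer-to-provider links, crosses at most one peer link, and then only descends through provider-to-customer links. A hairpin leak breaks it by sending a route learned from one provider back up to another provider. The sketch below checks that property for a single path. It is an illustration only: the AS numbers come from the range reserved for documentation, the relationship table is invented, and nothing here reflects the actual topology around AS8048.

```python
from typing import Dict, List, Tuple

# Relationship of the first AS to the second:
# "c2p" (customer to provider), "p2p" (peer to peer), "p2c" (provider to customer).
Relationships = Dict[Tuple[int, int], str]


def is_valley_free(path: List[int], rel: Relationships) -> bool:
    """Check a path given in propagation order (origin AS first).

    Note that a BGP AS_PATH attribute lists the most recent AS first,
    so it must be reversed before being passed in.
    """
    phase = 0  # 0 = still climbing, 1 = crossed a peer link, 2 = descending
    for a, b in zip(path, path[1:]):
        if a == b:
            continue  # repeated AS from prepending, not a real hop
        link = rel.get((a, b))
        if link is None:
            raise KeyError(f"no known relationship for AS{a} -> AS{b}")
        if link == "c2p":
            if phase != 0:
                return False  # climbing again after a peer or downhill hop
        elif link == "p2p":
            if phase != 0:
                return False  # at most one peer link, and only at the top
            phase = 1
        elif link == "p2c":
            phase = 2
        else:
            raise ValueError(f"unknown relationship type: {link}")
    return True


# Hypothetical topology: AS64500 buys transit from both AS64501 and AS64502,
# and the origin AS64503 is a customer of AS64501.
relationships: Relationships = {
    (64503, 64501): "c2p",
    (64501, 64500): "p2c",
    (64500, 64502): "c2p",
}

normal_path = [64503, 64501, 64500]
# AS64500 re-announces the provider-learned route to its other provider,
# with its own AS number prepended several times.
leaked_path = [64503, 64501, 64500, 64500, 64500, 64502]

print(is_valley_free(normal_path, relationships))
print(is_valley_free(leaked_path, relationships))
```

    In this made-up topology the normal path passes and the leaked path fails, because the route goes downhill to AS64500 and then climbs again toward AS64502.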

    The internet's most critical routing protocol was designed in 1989 when ~160 networks trusted each other. Now 75,000+ autonomous systems operate on that same trust model. Understanding BGP isn't just for network engineers anymore - it's essential context for anyone building on the internet.

    Full episode page with transcript and sources: https://platformengineeringplaybook.com/podcasts/00084-venezuela-bgp-anomaly-technical-analysis

    #BGP #NetworkSecurity #PlatformEngineering #InternetRouting #RPKI #Kubernetes #DevOps #SRE

    28 min
  • HolmesGPT: AI Root Cause Analysis for Kubernetes
    Jan 8 2026

    Deep dive into HolmesGPT, the CNCF Sandbox AI agent that revolutionizes cloud-native troubleshooting. This episode covers what it is, its 40+ integrations, the project roadmap, and how to set it up today.

    News Segment:

    • Air France-KLM's secure automation platform with Terraform, Vault, and Ansible
    • AWS ECS tmpfs mounts on Fargate for secure secrets handling
    • Qwen 30B running on Raspberry Pi - democratizing edge AI
    • AWS European Sovereign Cloud with independent EU governance

    Main Topic - HolmesGPT:

    • CNCF Sandbox project (accepted October 2025) with 1,600+ GitHub stars
    • Agentic architecture: creates investigation task lists, queries systems, synthesizes findings
    • 40+ built-in toolsets: Prometheus, Grafana Loki/Tempo, Kubernetes, ArgoCD, DataDog, and more
    • Privacy-first: bring your own LLM keys, read-only access, respects RBAC
    • End-to-end automation with AlertManager, PagerDuty, OpsGenie integration
    • Installation options: pip, Homebrew, Helm, Web UI, K9s plugin

    Resources:

    • HolmesGPT GitHub
    • HolmesGPT Documentation
    • Full Transcript

    Episode Type: full

    Episode Number: 83

    Season: 1

    Tags: HolmesGPT, CNCF, Kubernetes, root cause analysis, AI ops, troubleshooting, observability, SRE, platform engineering, Robusta, agentic AI

    25 min