#006 - The Subtle Art of Inference with Adam Grzywaczewski


About this audio content

In this episode of The Private AI Lab, Johan van Amersfoort speaks with Adam Grzywaczewski, a senior Deep Learning Data Scientist at NVIDIA, about the rapidly evolving world of AI inference.


They explore how inference has shifted from simple, single-GPU execution to highly distributed, latency-sensitive systems powering today’s large language models. Adam explains the real bottlenecks teams face, why software optimization and hardware innovation must move together, and how NVIDIA’s inference stack—from TensorRT-LLM to Dynamo—enables scalable, cost-efficient deployments.
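
To make that memory pressure concrete, here is a minimal back-of-the-envelope sketch (not from the episode; the model shape and numbers are illustrative assumptions, loosely resembling a 70B-class decoder-only model with grouped-query attention) of how quickly the KV cache grows with sequence length and batch size:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Rough KV-cache footprint for a decoder-only transformer:
    2 tensors (K and V) per layer, each of shape [batch, kv_heads, seq_len, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative (assumed) numbers: 80 layers, 8 KV heads, head_dim 128, fp16 cache.
gb = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch=32) / 1e9
print(f"~{gb:.0f} GB of KV cache")  # grows linearly with sequence length and batch size
```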


The conversation also covers quantization, pruning, mixture-of-experts models, AI factories, and why inference optimization is becoming one of the most critical skills in modern AI engineering.
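
As a rough illustration of the quantization idea mentioned above (a generic sketch, not the method discussed in the episode or TensorRT-LLM's implementation), symmetric per-tensor int8 weight quantization trades a small amount of accuracy for roughly 4x less weight memory:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map fp32 weights to int8 plus a single fp32 scale (symmetric, per-tensor)."""
    scale = np.abs(weights).max() / 127.0              # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale

# Toy usage: quantize a small random weight matrix and measure the error.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```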


Topics covered


  • Why inference is now harder than training

  • Autoregressive models and KV-cache challenges

  • Mixture-of-experts architectures

  • NVIDIA Dynamo and TensorRT-LLM

  • Hardware vs software optimization

  • Quantization, pruning, and distillation

  • Latency vs throughput trade-offs

  • The rise of AI factories and DGX systems

  • What’s next for AI inference
