#006 - The Subtle Art of Inference with Adam Grzywaczewski
About this episode
In this episode of The Private AI Lab, Johan van Amersfoort speaks with Adam Grzywaczewski, a senior Deep Learning Data Scientist at NVIDIA, about the rapidly evolving world of AI inference.
They explore how inference has shifted from simple, single-GPU execution to highly distributed, latency-sensitive systems powering today’s large language models. Adam explains the real bottlenecks teams face, why software optimization and hardware innovation must move together, and how NVIDIA’s inference stack—from TensorRT-LLM to Dynamo—enables scalable, cost-efficient deployments.
The conversation also covers quantization, pruning, mixture-of-experts models, AI factories, and why inference optimization is becoming one of the most critical skills in modern AI engineering.
Topics covered
Why inference is now harder than training
Autoregressive models and KV-cache challenges
Mixture-of-experts architectures
NVIDIA Dynamo and TensorRT-LLM
Hardware vs software optimization
Quantization, pruning, and distillation
Latency vs throughput trade-offs
The rise of AI factories and DGX systems
What’s next for AI inference