How vLLM and llm-d Changed AI Inference with Rob Shaw

In this episode of Alexa’s Input (AI), I sat down with Rob Shaw from Red Hat to talk about how AI inference evolved from a simple model serving problem into a large-scale distributed systems problem.

We explored the infrastructure shifts behind modern LLM serving, including how vLLM and PagedAttention changed the economics and efficiency of inference, why KV cache management became one of the most important bottlenecks in production AI systems, and how orchestration layers like llm-d are emerging to coordinate distributed inference.

We also discuss:
- how LLM inference differs from traditional model serving runtimes
- KV cache, prefix caching, and cache-aware routing
- why throughput and latency became major infrastructure challenges
- long-context agents and repeated inference calls
- distributed inference on Kubernetes
- intelligent routing, flow control, and load balancing
- prefill/decode disaggregation
- enterprise AI deployment realities

vLLM has become one of the most important open-source projects in AI infrastructure, and llm-d represents a newer shift toward treating inference as a coordinated distributed system rather than just a single runtime problem.

If you want to better understand the systems layer beneath modern AI applications, this episode is a deep dive into where inference infrastructure is heading next.

General Podcast Links
Watch: https://www.youtube.com/@alexa_griffith
Read: https://alexasinput.substack.com/
Listen: https://creators.spotify.com/pod/profile/alexagriffith/
More: https://linktr.ee/alexagriffith

Learn more about the host at:
Website: https://alexagriffith.com/
LinkedIn: https://www.linkedin.com/in/alexa-griffith/

Find out more about the guest at:
LinkedIn: https://www.linkedin.com/in/robert-shaw-1a01399a/
Red Hat Articles: https://developers.redhat.com/author/robert-shaw
GitHub: https://github.com/robertgshaw2-redhat

Resources
vLLM Website: https://vllm.ai/
vLLM GitHub Repository: https://github.com/vllm-project/vllm
llm-d Website: https://llm-d.ai/
llm-d GitHub Repository: https://github.com/llm-d/llm-d

Keywords
AI inference, vLLM, llm-d, distributed inference, GPU optimization, open source AI, Kubernetes, multi-cluster deployment, AI infrastructure, enterprise AI, model optimization, speculative decoding, mixture of experts, AI deployment, performance tuning, AI systems, neural network scaling

Key Topics
- Evolution of vLLM and llm-d
- Distributed inference and routing
- GPU utilization and performance optimization
- Open source AI infrastructure
- Enterprise deployment challenges and solutions
- Standardization in Kubernetes for NIC exposure
- Performance optimizations: quantization and speculative decoding
- Mixture of experts architecture and parallelism strategies
- Flow control and request scheduling in AI systems
- Emerging hardware for AI inference, including the Cerebras processor
- Reinforcement learning and AI system support
- Modular architecture of vLLM and ecosystem projects