Breakouts
BRK2-125 • Kubernetes, Cloud Runtimes • Technical
Achieve state-of-the-art inference: High performance on TPUs and GPUs with llm-d
Jasmine A
5:15 PM - 6:00 PM
Proprietary stacks and generic open source solutions often lack deep hardware integration. Break free from these constraints with llm-d, an open source stack that delivers state-of-the-art performance on both Tensor Processing Units (TPUs) and NVIDIA GPUs. This session dives deep into architecting disaggregated serving and automatic key-value (KV) cache storage tiering on Ironwood (TPU7x). Learn to implement routing optimized for service-level objectives (SLOs) and build a portable, high-performance inference fleet that scales automatically based on real-time server conditions. Leave with a reference architecture for hardware-optimized LLM serving.