Breakouts
BRK2-113 • Kubernetes, Cloud Runtimes • Technical
The GKE inference playbook: Optimize cost and performance
Location: Mandalay Bay F
Time: 12:30 PM - 1:15 PM
Balancing inference latency and cost is one of generative AI’s toughest challenges. This session delivers a playbook for building a high-performance, cost-efficient inference stack on Google Kubernetes Engine (GKE). Explore innovations for scaling state-of-the-art models, crushing cold starts, and maximizing utilization with Ironwood Tensor Processing Units (TPUs) and vLLM. We’ll take a deep dive into GKE Inference Gateway, demonstrating service level objective (SLO)-based routing across clusters to keep your application responsive and your budget intact.
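To give a flavor of what SLO-based routing means, here is a minimal, purely illustrative sketch of the idea: pick the cheapest backend whose observed latency still meets the SLO, and fall back to the fastest one otherwise. The `Backend` type, its fields, and the policy itself are assumptions for illustration, not the actual GKE Inference Gateway API or its routing algorithm.

```python
from dataclasses import dataclass

# Illustrative only: this type and the routing policy below are
# hypothetical, not the GKE Inference Gateway API.

@dataclass
class Backend:
    name: str                  # e.g. a model-server pool in some cluster
    p95_latency_ms: float      # observed p95 latency for this pool
    cost_per_1k_tokens: float  # relative serving cost

def route(backends: list[Backend], slo_p95_ms: float) -> Backend:
    """Cheapest backend that meets the latency SLO;
    if none qualifies, fall back to the lowest-latency one."""
    meeting_slo = [b for b in backends if b.p95_latency_ms <= slo_p95_ms]
    if meeting_slo:
        return min(meeting_slo, key=lambda b: b.cost_per_1k_tokens)
    return min(backends, key=lambda b: b.p95_latency_ms)
```

With backends at (120 ms, 0.8), (300 ms, 0.5), and (900 ms, 0.2) cost per 1k tokens, a 400 ms SLO routes to the 300 ms pool (cheapest that qualifies), while an unattainable 50 ms SLO falls back to the 120 ms pool.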