Breakouts
BRK2-125 • Kubernetes, Cloud Runtimes • Technical
Achieve state-of-the-art inference: High performance on TPUs and GPUs with llm-d
Jasmine A
5:15 PM - 6:00 PM
Proprietary stacks and generic open source solutions often lack deep hardware integration. Break free from these constraints with llm-d, an open source stack that delivers state-of-the-art performance on both Tensor Processing Units (TPUs) and NVIDIA GPUs. This session dives deep into architecting disaggregated serving and automatic key-value (KV) cache storage tiering on Ironwood (TPU7x). Learn to implement routing optimized for service-level objectives (SLOs) and build a portable, high-performance inference fleet that scales automatically based on real-time server conditions. Leave with a reference architecture for hardware-optimized LLM serving.