AI Inference at Scale: Reliability, Observability, Cost & Sustainability

AI inference is the new production workload — always on, cost-intensive, and increasingly complex. Many teams face latency spikes at P99, runaway GPU bills, and limited observability across their agentic and RAG pipelines.
This session delivers practical, vendor-aware patterns for reliable and sustainable inference at scale.
You’ll explore queueing, caching, GPU pooling, FinOps, and GreenOps strategies grounded in the Google Cloud AI/ML Well-Architected Framework, Azure AI Workload Guidance, and the Databricks Lakehouse Principles — enabling you to build inference systems that are performant, efficient, and planet-friendly.

Problems Solved

  • Latency spikes at P95/P99 under bursty inference workloads
  • Runaway GPU/TPU costs and inefficient utilization
  • Lack of observability in multi-agent and vector retrieval pipelines
  • Cache inefficiency and poor vector store tuning
  • Unmeasured energy and carbon footprint in AI workloads

What You’ll Learn

  • When to use serverless triggers, async queues, or GPU pooling
  • How to instrument prompts, vector queries, and GPU utilization end-to-end
  • FinOps guardrails: cost attribution, right-sizing, and preemptible instances
  • GreenOps practices: SCI metrics, time-/region-aware scaling, energy optimization
  • How to map reliability and sustainability principles across GCP, Azure, and Databricks

Agenda
Opening & Context
Why inference reliability, observability, and sustainability define the next stage of enterprise AI.
Case study: a RAG system suffering from unpredictable latency and GPU overspend, and the patterns that fix it.

Pattern 1: Reliable Inference Flow Design
Engineering for bursty demand.
Patterns: async queues, back-pressure controls, serverless triggers, GPU pooling vs autoscaling, and caching strategies for RAG.
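The queueing and back-pressure pattern above can be sketched in a few lines of Python. This is a minimal illustration, not a production serving loop: the `infer` stub, worker count, and queue size are stand-ins, and a real system would call a model endpoint instead of sleeping.

```python
import asyncio

async def infer(prompt: str) -> str:
    # Stand-in for a real model/endpoint call with modest latency.
    await asyncio.sleep(0.01)
    return f"response:{prompt}"

async def serve(prompts, max_inflight: int = 4, num_workers: int = 2):
    # A bounded queue gives natural back-pressure: producers block on
    # put() whenever workers fall behind, instead of piling up requests.
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_inflight)
    results = []

    async def worker():
        while True:
            prompt = await queue.get()
            try:
                if prompt is None:  # shutdown sentinel
                    return
                results.append(await infer(prompt))
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(num_workers)]
    for p in prompts:
        await queue.put(p)        # back-pressure point under bursts
    for _ in tasks:
        await queue.put(None)     # one sentinel per worker
    await queue.join()
    await asyncio.gather(*tasks)
    return results

results = asyncio.run(serve([f"q{i}" for i in range(8)]))
```

The same shape maps onto managed queues (Pub/Sub, Service Bus) when the workers are GPU-backed consumers rather than coroutines.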

Pattern 2: Observability & Instrumentation
Full-stack tracing for inference workloads.
Prompt-level metrics, vector query instrumentation, GPU telemetry, OpenTelemetry integration, and structured prompt logging.
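As a sketch of what prompt-level tracing looks like, here is a hand-rolled, stdlib-only span helper; in practice these spans would be emitted through the OpenTelemetry SDK, and the span names, attributes, and the `answer` pipeline below are all hypothetical:

```python
import time
from contextlib import contextmanager

SPANS = []  # in production, spans are exported via OpenTelemetry

@contextmanager
def span(name: str, **attrs):
    # Record a named span with attributes and wall-clock duration.
    start = time.perf_counter()
    try:
        yield attrs
    finally:
        attrs["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append({"name": name, **attrs})

def answer(prompt: str) -> str:
    # Each stage of a RAG request gets its own span, so tail latency
    # can be attributed to retrieval vs. generation.
    with span("rag.request", prompt_chars=len(prompt)):
        with span("vector.query", top_k=2):
            docs = ["doc-a", "doc-b"]          # stand-in for retrieval
        with span("llm.generate", model="example-model"):
            return f"answer grounded in {len(docs)} docs"
```

The key design choice is attaching domain attributes (prompt size, top_k, model) to spans, so traces answer "which stage, for which workload" rather than just "how long".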

Pattern 3: FinOps for AI
Controlling inference cost without losing reliability.
Cost attribution, tagging GPU workloads, quantization and model distillation, choosing preemptible/spot instances, and cross-cloud FinOps tooling (GCP Recommender, Azure Advisor, Databricks Cost Profiler).
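Cost attribution ultimately reduces to joining tagged usage with rates. The records and prices below are made up, but the shape matches what a tagged billing export enables:

```python
from collections import defaultdict

# Hypothetical billing-export rows: each carries a team tag,
# GPU-hours consumed, and the hourly rate for that SKU.
usage = [
    {"team": "search", "gpu_hours": 12.0, "rate_usd": 2.5},
    {"team": "search", "gpu_hours": 3.0,  "rate_usd": 2.5},
    {"team": "ads",    "gpu_hours": 8.0,  "rate_usd": 4.0},
]

def cost_by_team(rows):
    # Aggregate spend per tag; untagged rows would surface as a gap
    # here, which is exactly why tagging GPU workloads matters.
    totals = defaultdict(float)
    for r in rows:
        totals[r["team"]] += r["gpu_hours"] * r["rate_usd"]
    return dict(totals)
```

Once spend is attributable per team or per model, right-sizing and spot/preemptible decisions can be made per workload instead of across an undifferentiated GPU bill.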

Pattern 4: GreenOps & Sustainability
Reducing the environmental footprint of AI pipelines.
SCI (Software Carbon Intensity) metrics, carbon-aware scheduling, time-shifting inference jobs, and sustainable scaling practices.
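The Green Software Foundation defines SCI = ((E × I) + M) per R: operational energy E times grid carbon intensity I, plus embodied emissions M, normalized by a functional unit R (here, one inference). The numbers below are illustrative, and the hour-by-hour intensity forecast is invented to show the time-shifting idea:

```python
def sci(energy_kwh: float, intensity_g_per_kwh: float,
        embodied_g: float, functional_units: int) -> float:
    """SCI = ((E * I) + M) / R, in gCO2e per functional unit."""
    return (energy_kwh * intensity_g_per_kwh + embodied_g) / functional_units

# e.g. 0.4 kWh at 300 gCO2/kWh plus 50 g embodied, over 1000 inferences:
per_inference = sci(0.4, 300.0, 50.0, 1000)   # about 0.17 gCO2e each

# Carbon-aware time-shifting: run deferrable batch inference in the
# lowest-intensity window of a (hypothetical) grid forecast.
forecast_g_per_kwh = {0: 420.0, 6: 310.0, 12: 260.0, 18: 390.0}
best_hour = min(forecast_g_per_kwh, key=forecast_g_per_kwh.get)
```

Because R is a functional unit, the metric stays comparable as traffic grows, which makes it more actionable for scaling decisions than total emissions alone.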

Cross-Cloud Well-Architected Anchors
Mapping patterns to major frameworks:

  • Google Cloud AI/ML Well-Architected Framework (Reliability, Cost, Sustainability)
  • Azure AI Workload Guidance and sustainability assessment tools
  • Databricks Lakehouse Well-Architected Principles (governance, performance, sustainability trade-offs)

Wrap-Up & Discussion
Recap of proven design patterns, FinOps + GreenOps checklist, and architectural recommendations for enterprise AI teams.

Key Framework References

  • Google Cloud: AI/ML Well-Architected Framework (Reliability, Cost, Sustainability)
  • Azure: Well-Architected for AI Workloads + Sustainability Tools
  • Databricks: Seven-Pillar Lakehouse Principles
  • FinOps Foundation: AI/ML Cost Allocation & Efficiency Models
  • Green Software Foundation: SCI (Software Carbon Intensity) Metrics

Takeaways

  • Cross-Cloud Inference Pattern Playbook
  • FinOps & GreenOps Implementation Checklist
  • Observability and Instrumentation Reference Map for AI pipelines

About Rohit Bhardwaj

Rohit Bhardwaj is a Director of Architecture at Salesforce. He has extensive experience architecting multi-tenant, cloud-native solutions built on resilient microservice and service-oriented architectures using the AWS stack, and a proven record of designing and delivering transformational programs that reduce costs and increase efficiency.

As a trusted advisor, leader, and collaborator, Rohit applies problem-solving, analytical, and operational skills to every initiative, developing strategic requirements and solution analysis through all stages of the project life cycle, from product readiness to execution.
Rohit excels at designing scalable cloud microservice architectures with Spring Boot and Netflix OSS on AWS and Google Cloud. As a Security Ninja, he looks for ways to resolve application security vulnerabilities through ethical hacking and threat modeling. He enjoys architecting with cloud technologies including Docker, Redis, NGINX, RightScale, RabbitMQ, Apigee, Azul Zing, Actuate BIRT reporting, Chef, Splunk, REST Assured, SoapUI, Dynatrace, and EnterpriseDB. He has also built lambda-architecture solutions using Apache Spark, Cassandra, and Camel for real-time analytics and integration projects.

Rohit holds an MBA in Corporate Entrepreneurship from Babson College and a master's in Computer Science from Boston University and Harvard University. He is a regular speaker at No Fluff Just Stuff, UberConf, RichWeb, GIDS, and other international conferences.

Rohit loves to connect: reach him at http://www.productivecloudinnovation.com, on LinkedIn at http://linkedin.com/in/rohit-bhardwaj-cloud, or on Twitter at @rbhardwaj1.
