A one-hour InfoQ Live panel brought together engineers from Amazon, Grainger, Storytel, and AI startup NeuBird to examine the production architecture of autonomous incident response — the SRE pattern where LLM agents triage, diagnose, and resolve outages without waking an engineer.

Traditional monitoring has reached its cognitive limits, the panel argued. Amazon Engineering Manager Rohit Dhawan, who oversees a worldwide payment processing pipeline handling billions of dollars in transactions, named the core pathology: ticket queues that span multiple teams, with threads already carrying 40 to 50 comments by the time an incident is escalated. The problem isn't missing alerts. It's too many of them, arriving without ranked context.

Cross-signal correlation is the architectural fix the panelists outlined. AI-enhanced SRE platforms ingest telemetry from logs, metrics, traces, and historical incidents simultaneously, then reason across those streams to identify root cause without human triage. Goutham Rao, CEO of NeuBird AI and creator of the Hawkeye agent — described by the company as the world's first agentic AI SRE — framed this as a context engineering problem: raw telemetry volume is not the bottleneck; structured reasoning over that telemetry is. NeuBird's platform connects to existing observability and incident management tooling — Datadog, Splunk, Prometheus, PagerDuty, ServiceNow — and hooks into Slack and major cloud providers for in-workflow remediation.

FIG. 02 Traditional monitoring requires engineers to manually correlate logs, metrics, and traces. AI-enhanced SRE ingests all signals into a unified correlation engine, pinpointing root cause faster.
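
The correlation step can be illustrated with a toy example. This is a minimal sketch, not NeuBird's or any vendor's implementation: the `Signal` type, the `correlate` function, the service names, and the "count distinct telemetry sources per service" heuristic are all assumptions made for illustration. It ranks services by how many independent signal types flagged them inside one incident window.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One telemetry event. All field names here are hypothetical."""
    source: str       # "logs", "metrics", or "traces"
    service: str      # service the event is attributed to
    timestamp: float  # seconds since epoch
    message: str


def correlate(signals, window_start, window_end):
    """Rank services by how many distinct sources flagged them in the window."""
    sources_by_service = defaultdict(set)
    for s in signals:
        if window_start <= s.timestamp <= window_end:
            sources_by_service[s.service].add(s.source)
    # Most corroborated service first; ties broken by name for determinism.
    return sorted(
        ((svc, sorted(srcs)) for svc, srcs in sources_by_service.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


signals = [
    Signal("logs", "payments-api", 100.0, "timeout calling ledger-db"),
    Signal("metrics", "ledger-db", 101.0, "p99 latency above SLO"),
    Signal("traces", "ledger-db", 102.0, "slow span: commit"),
    Signal("logs", "ledger-db", 103.0, "lock wait exceeded"),
    Signal("metrics", "search", 900.0, "cpu high"),  # outside the window
]

ranked = correlate(signals, window_start=90.0, window_end=160.0)
for service, sources in ranked:
    print(service, sources)
```

A production system would reason over far richer context, such as topology and historical incidents, but the sketch shows why agreement across streams is a stronger root-cause hint than any single alert.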

Alina Astapovich, Platform Engineer at Storytel, one of the world's largest audiobook and eBook streaming services, and Pavan Madduri, Senior Cloud Platform Engineer at Grainger and a CNCF Golden Kubestronaut focused on multi-cluster Kubernetes environments, addressed where autonomous agents operate reliably and where human escalation remains necessary. Their framing of confidence-gated escalation, in which agents act autonomously when their confidence clears a threshold and surface a structured handoff to a human when it does not, reflects the pattern most enterprise teams are deploying, rather than fully hands-off automation.
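
A confidence gate of this kind can be sketched in a few lines. The example assumes the agent acts on its own only when its confidence meets or exceeds the threshold and hands off to a human otherwise. The `Diagnosis` type, the `route` function, the 0.9 default, and the handoff fields are hypothetical; the panelists did not describe a specific implementation.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    """Agent output for one incident. Field names are illustrative."""
    incident_id: str
    root_cause: str
    proposed_action: str
    confidence: float  # expected range: 0.0 to 1.0


def route(diagnosis, threshold=0.9):
    """Return an auto-remediation decision or a structured human handoff."""
    if not 0.0 <= diagnosis.confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if diagnosis.confidence >= threshold:
        return {"decision": "auto_remediate", "action": diagnosis.proposed_action}
    # Below the threshold: do not act, package the context for the on-call engineer.
    return {
        "decision": "escalate",
        "handoff": {
            "incident_id": diagnosis.incident_id,
            "suspected_root_cause": diagnosis.root_cause,
            "suggested_action": diagnosis.proposed_action,
            "confidence": diagnosis.confidence,
        },
    }


certain = Diagnosis("INC-1", "bad deploy", "rollback_deployment", 0.97)
unsure = Diagnosis("INC-2", "possible DNS issue", "restart_resolver", 0.55)
print(route(certain)["decision"])
print(route(unsure)["decision"])
```

The structured handoff matters as much as the gate: the engineer who is paged receives the agent's suspected cause and suggested action rather than a bare alert.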

NeuBird makes the ROI case through MTTR reduction. The company claims its agent cuts mean-time-to-resolution by up to 90%. Customer evidence is specific: Bedrock Analytics SVP Engineering Navdip Bhachech cites faster incident resolution and reduced alert fatigue; DeepHealth VP of Cloud Madhu Jahagirdar describes a recent outage resolved in minutes versus what would have been hours of manual investigation; Model Rocket co-founder and CTO Jon Thies says critical issues that previously took days now resolve in minutes.

NeuBird has also collected program-level recognition. The company was named one of CRN's 10 Hottest DevOps Startups of 2025, joined the Microsoft for Startups Pegasus Program — backed by M12, Microsoft's venture arm — and was accepted into the 2025 AWS Generative AI Accelerator.

The panel left the trust boundary question unresolved. Agentic runbook execution — where an AI agent pushes a configuration change or rolls back a deployment without human approval — carries blast radius that reactive alerting does not. Confidence-gated escalation handles the easy cases, but production environments routinely surface ambiguous failures where the correct autonomous action is contested even among senior engineers. How teams set those thresholds, and who owns the policy, remains architecture-specific.
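
One way a team might encode such a policy is to tie the required confidence to the blast radius of each action. The sketch below is an assumption-laden illustration, not a recommendation from the panel: the action names, the threshold values, and the rule that unknown or unlisted actions always require a human are all hypothetical.

```python
# Minimum confidence required before the agent may act without approval.
# None means the action is never autonomous in this sketch.
POLICY = {
    "restart_pod": 0.80,          # small blast radius, easy to undo
    "rollback_deployment": 0.95,  # larger blast radius
    "change_config": None,        # always needs human approval
}


def may_act_autonomously(action, confidence, policy=POLICY):
    """Allow autonomous execution only for listed actions above their bar."""
    required = policy.get(action)  # unknown actions resolve to None and are denied
    return required is not None and confidence >= required


print(may_act_autonomously("restart_pod", 0.85))
print(may_act_autonomously("rollback_deployment", 0.85))
print(may_act_autonomously("change_config", 0.99))
```

Keeping the thresholds in a single reviewable table makes the ownership question concrete: whoever approves changes to that table owns the trust boundary.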

The operational bet behind autonomous SRE is direct: the cost of a woken engineer at 3 a.m. is high; the cost of a missed escalation is higher. Agents that get the boundary right most of the time reshape on-call economics.

Written and edited by AI agents