v0.1.0b1 (beta), built with Google ADK

Your On-Call Engineer Can Finally Sleep

TheNightOps is an autonomous SRE agent that investigates Kubernetes incidents, generates root cause analyses, and recommends remediation — reducing MTTR from 45+ minutes to under 5 minutes.

nightops
$ nightops agent run --simple --incident "Pod OOMKilled in production"
< 5 Min MTTR
6 AI Agents
4 MCP Servers
5 Demo Scenarios
4 Webhook Sources

Everything Your SRE Team Needs

Multi-agent AI orchestration that correlates logs, events, deployments, and metrics across your entire Kubernetes infrastructure.

🔍

Multi-Agent Investigation

6 specialized agents: a Root Orchestrator delegates to five sub-agents (Log Analyst, Deployment Correlator, Runbook Retriever, Communication Drafter, and Anomaly Detector) that investigate in parallel.

📊

Real-Time Dashboard

WebSocket-powered live investigation UI with phase progress tracking, severity-colored findings, and auto-generated RCA summaries.

🧠

Incident Memory

TF-IDF similarity matching learns from past incidents. Flags recurring patterns and accelerates diagnosis with historical context.
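To make the idea concrete, here is a minimal sketch of TF-IDF similarity matching using only the Python standard library. It is not TheNightOps' actual implementation: the function names (`tokenize`, `tfidf_vectors`, `most_similar`) and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed IDF (an assumption) keeps every weight positive.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(new_incident: str, past_incidents: list[str]) -> tuple[int, float]:
    """Return (index, score) of the past incident closest to the new one."""
    if not past_incidents:
        raise ValueError("no past incidents to compare against")
    vectors = tfidf_vectors(past_incidents + [new_incident])
    query = vectors[-1]
    scores = [cosine(query, v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    history = [
        "payment-api pod OOMKilled after memory leak in v2.3",
        "checkout latency spike from CPU throttling",
        "database connection pool exhausted causing 504 errors",
    ]
    print(most_similar("pod OOMKilled in production, memory keeps growing", history))
```

A score near 1.0 suggests a likely repeat of an earlier incident; where to set the "recurring" threshold is a tuning decision this sketch leaves open.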

🛡️

Graduated Remediation

4-level policy engine: auto-approve safe actions, require approval for risky ones, and block dangerous operations entirely.

🔌

MCP Integration

Official Google Cloud MCP servers (GKE, Cloud Observability) plus custom servers for Kubernetes, Cloud Logging, Slack, and more.

📡

Multi-Source Ingestion

Accept alerts from Grafana, Alertmanager, PagerDuty, or custom webhooks. K8s event watcher and proactive anomaly scheduler included.
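As an illustration of multi-source ingestion, the sketch below normalizes a Prometheus Alertmanager webhook payload into a common incident record. The `Incident` dataclass and `from_alertmanager` function are hypothetical names, not part of TheNightOps. The payload keys (`alerts`, `status`, `labels`, `annotations`) follow Alertmanager's webhook format, while `severity` and `namespace` are conventional labels that may be absent in your alert rules.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical source-agnostic incident record."""
    source: str
    title: str
    severity: str
    namespace: str
    description: str


def from_alertmanager(payload: dict) -> list[Incident]:
    """Convert the firing alerts in an Alertmanager webhook payload to incidents."""
    incidents = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not open a new investigation
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        incidents.append(Incident(
            source="alertmanager",
            title=labels.get("alertname", "unknown"),
            # "severity" and "namespace" are conventions, so fall back safely.
            severity=labels.get("severity", "unknown"),
            namespace=labels.get("namespace", "default"),
            description=annotations.get("summary") or annotations.get("description", ""),
        ))
    return incidents


if __name__ == "__main__":
    sample = {
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "PodOOMKilled", "severity": "critical",
                       "namespace": "production"},
            "annotations": {"summary": "Pod OOMKilled in production"},
        }],
    }
    print(from_alertmanager(sample))
```

Each additional source (Grafana, PagerDuty, custom webhooks) would get its own small adapter that returns the same record type, so the investigation logic never has to know where an alert came from.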

🤖

Dual-Mode Architecture

Plan A: Full multi-agent MCP mode for production. Plan B: Simple kubectl-based mode for quick demos and testing — zero MCP setup needed.

📈

Metrics & Impact Tracking

Track MTTR, RCA consistency, auto-resolution rate, engineer hours saved, and recurring incident patterns across all investigations.

📢

Multi-Channel Notifications

Send RCA reports and incident updates via Slack, Email (SMTP), Telegram, or WhatsApp Business API automatically.

Built for Production SRE

Multi-agent orchestration powered by Google ADK, connected to your infrastructure via MCP.

Plan A: Full multi-agent MCP mode

┌──────────────────────────────────────────────────────────────┐
│     Webhook Receiver  /  CLI  /  Event Watcher  /  Scheduler │
└───────────────────────────────┬──────────────────────────────┘
                                │
                  ┌─────────────▼──────────────┐
                  │    Root Orchestrator       │
                  │    (ADK + Gemini 3.1 Pro)  │
                  └─────────────┬──────────────┘
                                │ delegates
       ┌───────────┬────────────┼───────────┬───────────┐
       ▼           ▼            ▼           ▼           ▼
  ┌─────────┐ ┌──────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
  │   Log   │ │  Deploy  │ │ Runbook │ │  Comms  │ │ Anomaly │
  │ Analyst │ │Correlator│ │Retriever│ │ Drafter │ │Detector │
  └────┬────┘ └────┬─────┘ └────┬────┘ └────┬────┘ └────┬────┘
       │           │            │           │           │
       └─────┬─────┘            │           │           │
             ▼                  ▼           ▼           ▼
  ┌───────────────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐
  │  GKE MCP Server   │ │ Cloud    │ │ Incident │ │ Policy  │
  │  (Official GCP)   │ │Obs. MCP  │ │  Memory  │ │ Engine  │
  └───────────────────┘ └──────────┘ └──────────┘ └─────────┘

  • Official Google Cloud MCP: IAM-authenticated access to GKE clusters and Cloud Observability
  • Parallel Investigation: Sub-agents work simultaneously for faster diagnosis
  • Full Reasoning Chain: Each agent contributes specialized analysis to the root cause
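
The fan-out pattern behind parallel investigation can be sketched with `asyncio.gather`. This is an illustrative stand-in rather than the project's ADK code: `run_sub_agent` and `investigate` are hypothetical names, and the sleep simulates the LLM and MCP tool calls a real sub-agent would make.

```python
import asyncio

SUB_AGENTS = [
    "log_analyst",
    "deployment_correlator",
    "runbook_retriever",
    "communication_drafter",
    "anomaly_detector",
]


async def run_sub_agent(name: str, incident: str, delay: float = 0.05) -> dict:
    """Stand-in for a sub-agent; a real one would call an LLM and MCP tools."""
    await asyncio.sleep(delay)  # simulates I/O-bound tool calls
    return {"agent": name, "finding": f"{name} analysed: {incident}"}


async def investigate(incident: str) -> list[dict]:
    """Run all sub-agents concurrently and collect their findings."""
    tasks = [run_sub_agent(name, incident) for name in SUB_AGENTS]
    # return_exceptions=True lets one failing agent degrade the result
    # instead of aborting the whole investigation.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]


if __name__ == "__main__":
    for finding in asyncio.run(investigate("Pod OOMKilled in production")):
        print(finding)
```

Because the agents spend most of their time waiting on I/O, total wall-clock time approaches that of the slowest agent rather than the sum of all of them.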

Plan B: Simple kubectl-based mode

┌──────────────────────────────────────────────────────┐
│     Webhook Receiver  /  CLI  /  Event Watcher       │
└────────────────────────┬─────────────────────────────┘
                         │
              ┌──────────▼──────────┐
              │    Simple Agent     │
              │  (ADK + Gemini)     │
              └──────────┬──────────┘
                         │
          ┌──────────────┼──────────────┐
          ▼              ▼              ▼
   ┌────────────┐ ┌────────────┐ ┌────────────┐
   │ kubectl    │ │ kubectl    │ │ kubectl    │
   │ get pods   │ │ logs       │ │ top        │
   │ get events │ │ describe   │ │ rollout    │
   └────────────┘ └────────────┘ └────────────┘
                         │
              (Any kubectl-configured cluster)

  • Zero MCP Setup: Works with any cluster where kubectl is configured
  • Single Agent: Direct kubectl subprocess calls for maximum reliability
  • Perfect for Demos: Get started in minutes, no GCP project required

4-Phase Autonomous Investigation

From alert to RCA in under 5 minutes, fully automated.

01

Triage

Incident received via webhook or CLI. Agent checks pod status, events, and namespace scope to understand the blast radius.

02

Deep Investigation

Sub-agents query logs, resource usage, deployment history, and YAML configs in parallel. Cloud Logging patterns are correlated with K8s events.

03

Synthesis

Findings from all agents are aggregated. Gemini correlates evidence across systems to identify the root cause with a confidence score.

04

RCA + Remediation

Structured RCA is generated. Remediation actions are evaluated against the policy engine. Safe actions auto-execute; risky ones await approval.

5 Failure Modes, Fully Automated Recovery

Each scenario demonstrates a real-world incident pattern that TheNightOps can detect, diagnose, and resolve.

💾
memory-leak

Memory Leak → OOMKill

Pod gradually consumes memory, gets OOMKilled, enters CrashLoopBackOff. Agent traces the leak, correlates with deployment version, recommends rollback.

🔥
cpu-spike

CPU Spike from Bad Query

Unoptimized endpoint causes CPU throttling and cascading latency. Agent identifies the hot endpoint, finds the code path, correlates with recent deploy.

🌊
cascading-failure

Cascading DB Failure

Database connection pool exhaustion triggers 504s across dependent services. Agent maps the cascade, identifies the root DB timeout, traces to config change.

⚙️
config-drift

Config Drift → 5xx Errors

Misconfigured environment variable causes 50% failure rate. Agent detects the error spike, correlates with recent env var change, recommends revert.

💥
oom-kill

Aggressive OOMKill

Instant memory allocation exhausts limits within seconds. Agent detects OOMKill events, compares limits vs usage, identifies the allocation bug.
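
Comparing limits against usage requires converting Kubernetes memory quantities such as `512Mi` into bytes. The sketch below covers a common subset of suffixes (binary `Ki`/`Mi`/`Gi`/`Ti` and decimal `k`/`M`/`G`/`T`). The function names are hypothetical, and less common forms such as exponent notation are deliberately left out.

```python
import re

_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary suffixes
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,     # decimal suffixes
}
_QUANTITY = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?$")


def parse_memory(quantity: str) -> int:
    """Convert a Kubernetes memory quantity (e.g. '512Mi') to bytes."""
    match = _QUANTITY.match(quantity.strip())
    if not match:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    # A missing suffix means the value is already in bytes.
    return int(float(number) * _FACTORS.get(suffix, 1))


def memory_utilization(usage: str, limit: str) -> float:
    """Return usage as a fraction of the limit (1.0 means at the limit)."""
    return parse_memory(usage) / parse_memory(limit)


if __name__ == "__main__":
    ratio = memory_utilization("480Mi", "512Mi")
    print(f"{ratio:.1%} of the memory limit in use")
```

A container sitting close to 100% of its limit shortly before an OOMKill event points toward an undersized limit or a leak, which is the distinction the two memory scenarios above are meant to demonstrate.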

trigger a scenario
$ nightops demo trigger -s memory-leak
$ nightops demo trigger -s cpu-spike
$ nightops demo trigger -s cascading-failure

Graduated Remediation Policies

Four levels of autonomy ensure safe operations across all environments.

Level 0

Auto-Approve

Safe, read-only or notification actions that can't cause harm.

  • Silence alerts
  • Create Grafana incidents
  • Post Slack updates

Level 1

Environment-Gated

Auto in dev/staging, require approval in production.

  • Pod restarts
  • Scale-up replicas

Level 2

Always Approve

Potentially impactful actions that always need human sign-off.

  • Rollbacks
  • Config reverts
  • Scale-down

Level 3

Blocked

Dangerous operations that are never allowed.

  • Delete namespace
  • Delete PVC
  • Drain node

Before vs After TheNightOps

Measured improvements across real incident investigation workflows.

Metric                   Before                After
MTTR (investigation)     45+ min               < 5 min
Context Assembly         20+ min               Seconds (parallel)
RCA Consistency          Varies by engineer    100% standardized
Post-Incident Toil       90+ min               < 10 min
On-Call Cognitive Load   High (5 dashboards)   Low (pre-diagnosed)
Recurring Incidents      60% repeated          Flagged & learning

Built With

  • Google ADK: Agent orchestration framework
  • Gemini 3.1 Pro: LLM reasoning engine
  • MCP: Model Context Protocol
  • FastAPI: Webhooks & dashboard
  • Python 3.11+: Core language
  • Kubernetes: Target platform
  • GKE: Google Kubernetes Engine
  • Pydantic: Config & validation

Development Milestones

v0.1.0b1 — First Public Beta

Foundation + Dual-Mode Architecture

Multi-agent architecture with custom MCP servers. Gemini 3.1 Pro, official Google Cloud MCP support. Webhook receiver, policy engine, CLI, and real-time WebSocket dashboard. 5 demo failure scenarios.

Latest

Intelligence & Polish

Incident Memory with TF-IDF similarity. ADK Web wrapper. Architecture deep-dive documentation. Commons Clause licensing.

Roadmap

What's Next

  • Multi-cluster support
  • Grafana MCP integration
  • Vector DB for incident memory
  • Auto-remediation execution
  • Cost impact analysis

Up and Running in Minutes

1

Clone & Install

git clone https://github.com/nomadicmehul/TheNightOps.git
cd TheNightOps
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

2

Configure

cp config/.env.example config/.env
# Edit config/.env with your GOOGLE_API_KEY
nightops verify

3

Run Your First Investigation

# Simple Mode (no MCP setup needed)
nightops agent run --simple \
  --incident "Pod OOMKilled in production"

# Or launch the dashboard
nightops dashboard

Ready to Let Your On-Call Sleep?

TheNightOps is open-source and ready for your Kubernetes clusters.