TheNightOps is an autonomous SRE agent that investigates Kubernetes incidents, generates root cause analyses, and recommends remediation — reducing MTTR from 45+ minutes to under 5 minutes.
Multi-agent AI orchestration that correlates logs, events, deployments, and metrics across your entire Kubernetes infrastructure.
Six specialized agents: the Root Orchestrator delegates to the Log Analyst, Deployment Correlator, Runbook Retriever, Communication Drafter, and Anomaly Detector, which work in parallel.
WebSocket-powered live investigation UI with phase progress tracking, severity-colored findings, and auto-generated RCA summaries.
TF-IDF similarity matching compares each new incident with past ones, flagging recurring patterns and speeding up diagnosis with historical context.
4-level policy engine: auto-approve safe actions, require approval for risky ones, and block dangerous operations entirely.
Official Google Cloud MCP servers (GKE, Cloud Observability) plus custom servers for Kubernetes, Cloud Logging, Slack, and more.
Accept alerts from Grafana, Alertmanager, PagerDuty, or custom webhooks. K8s event watcher and proactive anomaly scheduler included.
Plan A: Full multi-agent MCP mode for production. Plan B: Simple kubectl-based mode for quick demos and testing — zero MCP setup needed.
Track MTTR, RCA consistency, auto-resolution rate, engineer hours saved, and recurring incident patterns across all investigations.
Send RCA reports and incident updates via Slack, Email (SMTP), Telegram, or WhatsApp Business API automatically.
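To make the incident-memory idea concrete, here is a minimal sketch of TF-IDF similarity matching in plain Python. It is not TheNightOps' actual implementation: the function names, the tokenizer, the smoothed IDF formula, and the sample incident summaries are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so terms present in every document keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(new_incident, past_incidents):
    """Return (index, score) of the closest past incident, or (None, 0.0)."""
    if not past_incidents:
        return None, 0.0
    vectors = tfidf_vectors(list(past_incidents) + [new_incident])
    query = vectors[-1]
    scores = [cosine(query, v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical past incident summaries, written for this example only.
PAST_INCIDENTS = [
    "payment-api pod OOMKilled after memory leak, rolled back deployment",
    "checkout service 504 timeouts from database connection pool exhaustion",
    "config change set wrong environment variable, 50 percent request failures",
]

if __name__ == "__main__":
    index, score = most_similar(
        "pod OOMKilled in production, memory keeps growing", PAST_INCIDENTS
    )
    print(index, round(score, 3))
```

A recurring pattern could then be flagged whenever the best score exceeds a chosen threshold; the right threshold depends on how incidents are summarized and would need tuning.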
Multi-agent orchestration powered by Google ADK, connected to your infrastructure via MCP.
Plan A, full multi-agent MCP mode:
┌───────────────────────────────────────────────────────────────────┐
│        Webhook Receiver / CLI / Event Watcher / Scheduler         │
└─────────────────────────────────┬─────────────────────────────────┘
                                  │
                    ┌─────────────▼─────────────┐
                    │     Root Orchestrator     │
                    │  (ADK + Gemini 3.1 Pro)   │
                    └─────────────┬─────────────┘
                                  │ delegates
      ┌─────────────┬─────────────┼─────────────┬─────────────┐
      ▼             ▼             ▼             ▼             ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│    Log    │ │  Deploy   │ │  Runbook  │ │   Comms   │ │  Anomaly  │
│  Analyst  │ │Correlator │ │ Retriever │ │  Drafter  │ │ Detector  │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
      │             │             │             │             │
      └──────┬──────┘             │             │             │
             ▼                    ▼             ▼             ▼
┌─────────────────────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│     GKE MCP Server      │ │   Cloud   │ │ Incident  │ │  Policy   │
│     (Official GCP)      │ │ Obs. MCP  │ │  Memory   │ │  Engine   │
└─────────────────────────┘ └───────────┘ └───────────┘ └───────────┘
Plan B, simple kubectl-based mode:
┌─────────────────────────────────────────────┐
│   Webhook Receiver / CLI / Event Watcher    │
└──────────────────────┬──────────────────────┘
                       │
            ┌──────────▼──────────┐
            │    Simple Agent     │
            │   (ADK + Gemini)    │
            └──────────┬──────────┘
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│   kubectl   │ │   kubectl   │ │   kubectl   │
│  get pods   │ │    logs     │ │     top     │
│ get events  │ │  describe   │ │   rollout   │
└─────────────┘ └─────────────┘ └─────────────┘
                       │
       (Any kubectl-configured cluster)
From alert to RCA in under 5 minutes, fully automated.
Incident received via webhook or CLI. Agent checks pod status, events, and namespace scope to understand the blast radius.
Sub-agents query logs, resource usage, deployment history, and YAML configs in parallel. Cloud Logging patterns are correlated with K8s events.
Findings from all agents are aggregated. Gemini correlates evidence across systems to identify the root cause with a confidence score.
Structured RCA is generated. Remediation actions are evaluated against the policy engine. Safe actions auto-execute; risky ones await approval.
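The parallel investigation and aggregation steps above can be sketched with asyncio. This is an illustration under stated assumptions, not the project's ADK-based orchestration: the three sub-agent coroutines are stubs, their findings and confidence values are placeholders, and the report shape is invented for this example.

```python
import asyncio


# Stub sub-agents: each stands in for a real log, deployment, or metrics query.
async def log_analyst(incident):
    await asyncio.sleep(0)
    return {"agent": "log_analyst",
            "finding": "stub: OOMKilled lines in container logs",
            "confidence": 0.7}


async def deployment_correlator(incident):
    await asyncio.sleep(0)
    return {"agent": "deployment_correlator",
            "finding": "stub: new image rolled out before the alert",
            "confidence": 0.9}


async def anomaly_detector(incident):
    await asyncio.sleep(0)
    raise TimeoutError("stub: metrics backend did not answer")


async def investigate(incident, agents):
    """Run all sub-agents concurrently and aggregate whatever comes back."""
    # return_exceptions=True keeps one failing agent from aborting the rest.
    results = await asyncio.gather(
        *(agent(incident) for agent in agents), return_exceptions=True
    )
    findings = [r for r in results if not isinstance(r, BaseException)]
    failed = [a.__name__ for a, r in zip(agents, results)
              if isinstance(r, BaseException)]
    # Rank by confidence; the top entry is only a candidate root cause.
    findings.sort(key=lambda f: f["confidence"], reverse=True)
    return {
        "incident": incident,
        "top_finding": findings[0] if findings else None,
        "findings": findings,
        "failed_agents": failed,
    }


if __name__ == "__main__":
    report = asyncio.run(investigate(
        "Pod OOMKilled in production",
        [log_analyst, deployment_correlator, anomaly_detector],
    ))
    print(report["top_finding"], report["failed_agents"])
```

Collecting failures instead of raising means a slow or unavailable backend degrades the report rather than blocking it, which seems a reasonable default for an automated first pass.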
Each scenario demonstrates a real-world incident pattern that TheNightOps can detect, diagnose, and resolve.
Pod gradually consumes memory, gets OOMKilled, enters CrashLoopBackOff. Agent traces the leak, correlates with deployment version, recommends rollback.
Unoptimized endpoint causes CPU throttling and cascading latency. Agent identifies the hot endpoint, finds the code path, correlates with recent deploy.
Database connection pool exhaustion triggers 504s across dependent services. Agent maps the cascade, identifies the root DB timeout, traces to config change.
Misconfigured environment variable causes 50% failure rate. Agent detects the error spike, correlates with recent env var change, recommends revert.
Instant memory allocation exhausts limits within seconds. Agent detects OOMKill events, compares limits vs usage, identifies the allocation bug.
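As a rough illustration of how the OOMKill scenarios can be detected, the sketch below scans the JSON that `kubectl get pods -o json` returns and reports containers whose current or last termination reason is OOMKilled. The field paths follow the Kubernetes Pod status schema; the function name, the output shape, and the sample pod list are hypothetical and hand-written for this example.

```python
import json


def find_oom_killed(pod_list):
    """Return one record per container that was terminated with OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for cs in pod.get("status", {}).get("containerStatuses", []):
            state = cs.get("state") or {}
            last = (cs.get("lastState") or {}).get("terminated") or {}
            current = state.get("terminated") or {}
            waiting = state.get("waiting") or {}
            if "OOMKilled" in (last.get("reason"), current.get("reason")):
                hits.append({
                    "namespace": meta.get("namespace"),
                    "pod": meta.get("name"),
                    "container": cs.get("name"),
                    "restarts": cs.get("restartCount", 0),
                    "crash_looping": waiting.get("reason") == "CrashLoopBackOff",
                })
    return hits


# Hand-written sample in the shape of `kubectl get pods -o json` output.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "api-7d9f", "namespace": "prod"},
   "status": {"containerStatuses": [
     {"name": "api", "restartCount": 6,
      "state": {"waiting": {"reason": "CrashLoopBackOff"}},
      "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
  {"metadata": {"name": "web-5c8b", "namespace": "prod"},
   "status": {"containerStatuses": [
     {"name": "web", "restartCount": 0,
      "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}},
      "lastState": {}}]}}
]}
""")

if __name__ == "__main__":
    for hit in find_oom_killed(SAMPLE):
        print(hit)
```

Against a live cluster the same function could be fed the parsed output of `kubectl get pods -A -o json`; comparing limits against usage and tracing the leak, as the scenarios describe, would need further data that this sketch does not collect.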
Four levels of autonomy ensure safe operations across all environments.
Safe, read-only or notification actions that can't cause harm.
Auto in dev/staging, require approval in production.
Potentially impactful actions that always need human sign-off.
Dangerous operations that are never allowed.
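The four levels above map naturally onto a small decision function. The sketch below is one possible shape, not the project's policy engine: the level names, the action-to-level table, the environment names, and the choice to require approval for unknown actions are all assumptions.

```python
from enum import Enum


class Level(Enum):
    AUTO = 1        # safe, read-only or notification actions
    ENV_GATED = 2   # automatic in dev/staging, approval in production
    APPROVAL = 3    # always needs human sign-off
    BLOCKED = 4     # never allowed


class Decision(Enum):
    EXECUTE = "execute"
    AWAIT_APPROVAL = "await_approval"
    DENY = "deny"


# Hypothetical mapping of remediation actions to policy levels.
ACTION_LEVELS = {
    "get_pod_logs": Level.AUTO,
    "send_slack_update": Level.AUTO,
    "restart_pod": Level.ENV_GATED,
    "rollback_deployment": Level.APPROVAL,
    "delete_namespace": Level.BLOCKED,
}

NON_PRODUCTION = {"dev", "staging"}


def evaluate(action, environment):
    """Decide whether an action may run in the given environment."""
    # Assumption: unknown actions are treated as needing human sign-off.
    level = ACTION_LEVELS.get(action, Level.APPROVAL)
    if level is Level.AUTO:
        return Decision.EXECUTE
    if level is Level.ENV_GATED:
        if environment in NON_PRODUCTION:
            return Decision.EXECUTE
        return Decision.AWAIT_APPROVAL
    if level is Level.APPROVAL:
        return Decision.AWAIT_APPROVAL
    return Decision.DENY


if __name__ == "__main__":
    for action in ACTION_LEVELS:
        print(action, evaluate(action, "production").value)
```

Defaulting unlisted actions to approval rather than execution keeps the engine conservative when an agent proposes something the table does not cover.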
Measured improvements across real incident investigation workflows.
Help keep this project alive and growing. Your support fuels late-night debugging sessions and new features.
A one-time contribution to keep the caffeine flowing and the agents running.
Buy Me a Coffee
Become a recurring sponsor and get your name in the project README and release notes.
Sponsor on GitHub
Not ready to sponsor? A GitHub star helps with visibility and means a lot to the project.
Star on GitHub
Multi-agent architecture with custom MCP servers. Gemini 3.1 Pro, official Google Cloud MCP support. Webhook receiver, policy engine, CLI, and real-time WebSocket dashboard. 5 demo failure scenarios.
Incident Memory with TF-IDF similarity. ADK Web wrapper. Architecture deep-dive documentation. Commons Clause licensing.
git clone https://github.com/nomadicmehul/TheNightOps.git
cd TheNightOps
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp config/.env.example config/.env
# Edit config/.env with your GOOGLE_API_KEY
nightops verify
# Simple Mode (no MCP setup needed)
nightops agent run --simple \
  --incident "Pod OOMKilled in production"
# Or launch the dashboard
nightops dashboard
TheNightOps is open-source and ready for your Kubernetes clusters.