THE NightOps — Autonomous SRE Agent for Kubernetes

Capabilities

Everything Your SRE Team Needs

Multi-agent AI orchestration that correlates logs, events, deployments, and metrics across your entire Kubernetes infrastructure.

🔍

Multi-Agent Investigation

Six specialized agents: a Root Orchestrator delegates to five sub-agents that work in parallel (Log Analyst, Deployment Correlator, Runbook Retriever, Communication Drafter, and Anomaly Detector).

📊

Real-Time Dashboard

WebSocket-powered live investigation UI with phase progress tracking, severity-colored findings, and auto-generated RCA summaries.

🧠

Incident Memory

TF-IDF similarity matching learns from past incidents. Flags recurring patterns and accelerates diagnosis with historical context.
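As a rough illustration of how TF-IDF similarity matching can surface a related past incident, here is a minimal standard-library sketch. It is not THE NightOps' actual implementation: the function names (tokenize, tfidf_vectors, cosine, most_similar), the sample incident texts, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(new_incident: str, past_incidents: list[str]) -> tuple[int, float]:
    """Return (index, score) of the past incident closest to the new one."""
    if not past_incidents:
        raise ValueError("no past incidents to compare against")
    vectors = tfidf_vectors(past_incidents + [new_incident])
    query = vectors[-1]
    scores = [cosine(query, v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    past = [
        "payment-api pod OOMKilled after memory leak in v2.3",
        "checkout service CPU throttling from slow query",
        "database connection pool exhausted causing 504",
    ]
    print(most_similar("pod OOMKilled memory leak in payment-api", past))
```

A score near zero suggests a new kind of incident, while a high score is a hint that an earlier RCA may be worth reviewing first.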

🛡️

Graduated Remediation

4-level policy engine: auto-approve safe actions, require approval for risky ones, and block dangerous operations entirely.

🔌

MCP Integration

Official Google Cloud MCP servers (GKE, Cloud Observability) plus custom servers for Kubernetes, Cloud Logging, Slack, and more.

📡

Multi-Source Ingestion

Accept alerts from Grafana, Alertmanager, PagerDuty, or custom webhooks. K8s event watcher and proactive anomaly scheduler included.
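To make multi-source ingestion concrete, the sketch below normalizes an Alertmanager webhook payload into a common incident record. The Incident dataclass and the from_alertmanager function are hypothetical names, not part of THE NightOps. The payload keys used (alerts, status, labels, annotations) follow the Alertmanager webhook format, while the severity and namespace labels are only a common convention and may be absent in your alerts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical source-agnostic incident record."""
    source: str
    title: str
    severity: str
    namespace: str
    firing: bool


def from_alertmanager(payload: dict) -> list[Incident]:
    """Convert one Alertmanager webhook payload into Incident records."""
    incidents = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        incidents.append(Incident(
            source="alertmanager",
            # Prefer the human-readable summary, fall back to the alert name.
            title=annotations.get("summary") or labels.get("alertname", "unknown alert"),
            # "severity" and "namespace" are conventional labels, not guaranteed.
            severity=labels.get("severity", "unknown"),
            namespace=labels.get("namespace", "default"),
            firing=alert.get("status") == "firing",
        ))
    return incidents


if __name__ == "__main__":
    sample = {
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "PodOOMKilled", "severity": "critical",
                       "namespace": "production"},
            "annotations": {"summary": "Pod OOMKilled in production"},
        }],
    }
    print(from_alertmanager(sample))
```

A similar adapter per source (Grafana, PagerDuty, custom webhooks) would let the rest of the pipeline work against a single record type.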

🤖

Dual-Mode Architecture

Plan A: Full multi-agent MCP mode for production. Plan B: Simple kubectl-based mode for quick demos and testing — zero MCP setup needed.

📈

Metrics & Impact Tracking

Track MTTR, RCA consistency, auto-resolution rate, engineer hours saved, and recurring incident patterns across all investigations.
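Two of these metrics are simple to state precisely. The sketch below computes MTTR and auto-resolution rate from a list of investigation records. The InvestigationRecord fields and both function names are assumptions for illustration, not the schema THE NightOps stores.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class InvestigationRecord:
    """Hypothetical record of one completed investigation."""
    opened_at: datetime
    resolved_at: datetime
    auto_resolved: bool


def mean_time_to_resolve(records: list[InvestigationRecord]) -> timedelta:
    """MTTR: average of (resolved_at - opened_at) over all records."""
    if not records:
        raise ValueError("no records")
    total = sum((r.resolved_at - r.opened_at for r in records), timedelta())
    return total / len(records)


def auto_resolution_rate(records: list[InvestigationRecord]) -> float:
    """Fraction of investigations resolved without human intervention."""
    if not records:
        raise ValueError("no records")
    return sum(r.auto_resolved for r in records) / len(records)


if __name__ == "__main__":
    start = datetime(2025, 1, 1, 3, 0)
    records = [
        InvestigationRecord(start, start + timedelta(minutes=4), True),
        InvestigationRecord(start, start + timedelta(minutes=6), False),
    ]
    print(mean_time_to_resolve(records), auto_resolution_rate(records))
```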

📢

Multi-Channel Notifications

Send RCA reports and incident updates via Slack, Email (SMTP), Telegram, or WhatsApp Business API automatically.

Architecture

Built for Production SRE

Multi-agent orchestration powered by Google ADK, connected to your infrastructure via MCP.

Plan A: Full Multi-Agent MCP Mode

┌──────────────────────────────────────────────────────────────┐
│     Webhook Receiver  /  CLI  /  Event Watcher  /  Scheduler │
└─────────────────────────────┬────────────────────────────────┘
                              │
                ┌─────────────▼──────────────┐
                │    Root Orchestrator       │
                │    (ADK + Gemini 3.1 Pro)  │
                └─────────────┬──────────────┘
                              │ delegates
       ┌──────────┬───────────┼───────────┬──────────┐
       ▼          ▼           ▼           ▼          ▼
  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
  │   Log   │ │  Deploy  │ │ Runbook │ │  Comms  │ │ Anomaly │
  │ Analyst │ │Correlator│ │Retriever│ │ Drafter │ │Detector │
  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
       │           │           │            │           │
       └─────┬─────┘           │            │           │
             ▼                 ▼            ▼           ▼
  ┌───────────────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐
  │  GKE MCP Server   │ │ Cloud    │ │ Incident │ │ Policy  │
  │  (Official GCP)   │ │Obs. MCP  │ │  Memory  │ │ Engine  │
  └───────────────────┘ └──────────┘ └──────────┘ └─────────┘

Official Google Cloud MCP — IAM-authenticated access to GKE clusters and Cloud Observability

Parallel Investigation — Sub-agents work simultaneously for faster diagnosis

Full Reasoning Chain — Each agent contributes specialized analysis to the root cause
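The fan-out pattern in the diagram can be sketched with plain asyncio. This is only an illustration of running sub-agents concurrently and collecting their findings. It does not use the Google ADK API, and run_sub_agent, investigate, the agent keys, and the simulated delays are all hypothetical.

```python
import asyncio


async def run_sub_agent(name: str, incident: str, delay: float) -> dict:
    """Stand-in for a real sub-agent call; the sleep simulates I/O latency."""
    await asyncio.sleep(delay)
    return {"agent": name, "finding": f"{name} analysed: {incident}"}


async def investigate(incident: str) -> list[dict]:
    """Root-orchestrator sketch: delegate to all sub-agents at once."""
    sub_agents = {
        "log_analyst": 0.03,
        "deployment_correlator": 0.02,
        "runbook_retriever": 0.01,
        "communication_drafter": 0.01,
        "anomaly_detector": 0.02,
    }
    tasks = [run_sub_agent(name, incident, delay) for name, delay in sub_agents.items()]
    # gather() runs the coroutines concurrently and returns results
    # in the same order as the inputs.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    for finding in asyncio.run(investigate("Pod OOMKilled in production")):
        print(finding)
```

Because the sub-agents wait on I/O concurrently, total wall time is close to the slowest sub-agent's latency instead of the sum of all five.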

Plan B: Simple Mode

┌──────────────────────────────────────────────────────┐
│     Webhook Receiver  /  CLI  /  Event Watcher       │
└────────────────────────┬─────────────────────────────┘
                         │
              ┌──────────▼──────────┐
              │    Simple Agent     │
              │  (ADK + Gemini)     │
              └──────────┬──────────┘
                         │
          ┌──────────────┼──────────────┐
          ▼              ▼              ▼
   ┌────────────┐ ┌────────────┐ ┌────────────┐
   │ kubectl    │ │ kubectl    │ │ kubectl    │
   │ get pods   │ │ logs       │ │ top        │
   │ get events │ │ describe   │ │ rollout    │
   └────────────┘ └────────────┘ └────────────┘
                         │
              (Any kubectl-configured cluster)

Zero MCP Setup — Works with any cluster where kubectl is configured

Single Agent — Direct kubectl subprocess calls for maximum reliability

Perfect for Demos — Get started in minutes, no GCP project required

Investigation Flow

4-Phase Autonomous Investigation

From alert to RCA in under 5 minutes, fully automated.

01

Triage

Incident received via webhook or CLI. Agent checks pod status, events, and namespace scope to understand the blast radius.

02

Deep Investigation

Sub-agents query logs, resource usage, deployment history, and YAML configs in parallel. Cloud Logging patterns are correlated with K8s events.

03

Synthesis

Findings from all agents are aggregated. Gemini correlates evidence across systems to identify the root cause with a confidence score.

04

RCA + Remediation

Structured RCA is generated. Remediation actions are evaluated against the policy engine. Safe actions auto-execute; risky ones await approval.

Demo Scenarios

5 Failure Modes, Fully Automated Recovery

Each scenario demonstrates a real-world incident pattern that THE NightOps can detect, diagnose, and resolve.

💾

memory-leak

Memory Leak → OOMKill

Pod gradually consumes memory, gets OOMKilled, enters CrashLoopBackOff. Agent traces the leak, correlates with deployment version, recommends rollback.

🔥

cpu-spike

CPU Spike from Bad Query

Unoptimized endpoint causes CPU throttling and cascading latency. Agent identifies the hot endpoint, finds the code path, correlates with recent deploy.

🌊

cascading-failure

Cascading DB Failure

Database connection pool exhaustion triggers 504s across dependent services. Agent maps the cascade, identifies the root DB timeout, traces to config change.

⚙️

config-drift

Config Drift → 5xx Errors

Misconfigured environment variable causes 50% failure rate. Agent detects the error spike, correlates with recent env var change, recommends revert.

💥

oom-kill

Aggressive OOMKill

Instant memory allocation exhausts limits within seconds. Agent detects OOMKill events, compares limits vs usage, identifies the allocation bug.
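Comparing limits with usage requires turning Kubernetes memory quantities such as 512Mi into bytes first. The sketch below shows one way to do that. parse_memory and memory_pressure are hypothetical helpers, and only the most common suffixes are handled.

```python
import re

# Multipliers for common Kubernetes memory suffixes: decimal (k, M, G, T)
# and binary (Ki, Mi, Gi, Ti). Rarer suffixes are left out of this sketch.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}

_QUANTITY = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?$")


def parse_memory(quantity: str) -> int:
    """Convert a quantity string like '512Mi' or '1G' into bytes."""
    match = _QUANTITY.match(quantity.strip())
    if not match:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(float(number) * _SUFFIXES[suffix or ""])


def memory_pressure(usage: str, limit: str) -> float:
    """Return usage as a fraction of the limit (1.0 means at the limit)."""
    return parse_memory(usage) / parse_memory(limit)


if __name__ == "__main__":
    print(parse_memory("512Mi"))             # 536870912
    print(memory_pressure("480Mi", "512Mi")) # 0.9375
```

A ratio that jumps toward 1.0 within seconds points to an allocation burst, while a slow climb over hours looks more like the memory-leak scenario.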

Trigger a scenario

$ nightops demo trigger -s memory-leak

$ nightops demo trigger -s cpu-spike

$ nightops demo trigger -s cascading-failure

Safety

Graduated Remediation Policies

Four levels of autonomy ensure safe operations across all environments.

Level 0

Auto-Approve

Safe, read-only or notification actions that can't cause harm.

Silence alerts
Create Grafana incidents
Post Slack updates

Level 1

Environment-Gated

Auto in dev/staging, require approval in production.

Pod restarts
Scale-up replicas

Level 2

Always Approve

Potentially impactful actions that always need human sign-off.

Rollbacks
Config reverts
Scale-down

Level 3

Blocked

Dangerous operations that are never allowed.

Delete namespace
Delete PVC
Drain node
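The four levels above map naturally onto a small lookup plus one decision function. The following is a minimal sketch under stated assumptions, not the project's policy engine. The action identifiers, the Level and Decision enums, and the choice to treat unknown actions as blocked are all hypothetical.

```python
from enum import Enum, IntEnum


class Level(IntEnum):
    AUTO_APPROVE = 0
    ENVIRONMENT_GATED = 1
    ALWAYS_APPROVE = 2
    BLOCKED = 3


class Decision(Enum):
    EXECUTE = "execute"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical action identifiers for the examples listed under each level.
ACTION_LEVELS = {
    "silence_alert": Level.AUTO_APPROVE,
    "create_grafana_incident": Level.AUTO_APPROVE,
    "post_slack_update": Level.AUTO_APPROVE,
    "restart_pod": Level.ENVIRONMENT_GATED,
    "scale_up": Level.ENVIRONMENT_GATED,
    "rollback": Level.ALWAYS_APPROVE,
    "config_revert": Level.ALWAYS_APPROVE,
    "scale_down": Level.ALWAYS_APPROVE,
    "delete_namespace": Level.BLOCKED,
    "delete_pvc": Level.BLOCKED,
    "drain_node": Level.BLOCKED,
}

NON_PRODUCTION = {"dev", "staging"}


def evaluate(action: str, environment: str) -> Decision:
    """Decide whether an action may run, needs sign-off, or is denied."""
    # Assumption: actions missing from the table fail closed (blocked).
    level = ACTION_LEVELS.get(action, Level.BLOCKED)
    if level is Level.AUTO_APPROVE:
        return Decision.EXECUTE
    if level is Level.ENVIRONMENT_GATED:
        if environment in NON_PRODUCTION:
            return Decision.EXECUTE
        return Decision.NEEDS_APPROVAL
    if level is Level.ALWAYS_APPROVE:
        return Decision.NEEDS_APPROVAL
    return Decision.DENY


if __name__ == "__main__":
    print(evaluate("restart_pod", "staging"))     # Decision.EXECUTE
    print(evaluate("restart_pod", "production"))  # Decision.NEEDS_APPROVAL
    print(evaluate("drain_node", "dev"))          # Decision.DENY
```

Failing closed on unknown actions keeps a newly added remediation from running unreviewed until someone assigns it a level.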

Impact

Before vs After THE NightOps

Measured improvements across real incident investigation workflows.

MTTR (investigation)

45+ min

→

< 5 min

Context Assembly

20+ min

→

Seconds (parallel)

RCA Consistency

Varies by engineer

→

100% standardized

Post-Incident Toil

90+ min

→

< 10 min

On-Call Cognitive Load

High (5 dashboards)

→

Low (pre-diagnosed)

Recurring Incidents

60% repeated

→

Flagged & learning

Support

Sponsor THE NightOps

Help keep this project alive and growing. Your support fuels late-night debugging sessions and new features.

Tech Stack

Built With

Google ADK

Agent orchestration framework

Gemini 3.1 Pro

LLM reasoning engine

MCP

Model Context Protocol

FastAPI

Webhooks & dashboard

Python 3.11+

Core language

Kubernetes

Target platform

GKE

Google Kubernetes Engine

Pydantic

Config & validation

Project Timeline

Development Milestones

v0.1.0b1 — First Public Beta

Foundation + Dual-Mode Architecture

Multi-agent architecture with custom MCP servers. Gemini 3.1 Pro, official Google Cloud MCP support. Webhook receiver, policy engine, CLI, and real-time WebSocket dashboard. 5 demo failure scenarios.

Latest

Intelligence & Polish

Incident Memory with TF-IDF similarity. ADK Web wrapper. Architecture deep-dive documentation. Commons Clause licensing.

Roadmap

What's Next

Multi-cluster support
Grafana MCP integration
Vector DB for incident memory
Auto-remediation execution
Cost impact analysis

Get Started

Up and Running in Minutes

1

Clone & Install

git clone https://github.com/nomadicmehul/TheNightOps.git
cd TheNightOps
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

2

Configure

cp config/.env.example config/.env
# Edit config/.env with your GOOGLE_API_KEY
nightops verify

3

Run Your First Investigation

# Simple Mode (no MCP setup needed)
nightops agent run --simple \
  --incident "Pod OOMKilled in production"

# Or launch the dashboard
nightops dashboard

Ready to Let Your On-Call Sleep?

THE NightOps is open-source and ready for your Kubernetes clusters.


Your On-Call Engineer Can Finally Sleep