umbertorighetti.dev


You can check out the code here, along with the docs, to get started and try it out on your own cluster.

👀 The Watcher

If you've ever read a Marvel comic book (avid Silver Surfer fan here), you may have come across a character called "The Watcher".

"The Watchers blamed themselves for the catastrophe and vowed to never again meddle in the affairs of other races. Instead, they passively observe and record events for those who will come after the universe ends."

Hopefully no universe ends if your cluster goes down or logs a few errors, but the idea ties in nicely with what I want to share.

Standard monitoring tells you that something broke. Intelligent monitoring tells you why. By integrating a local Large Language Model (LLM) into your observability pipeline, you can move past raw stack traces and receive concise, actionable root-cause analyses directly on your phone.

🏛️ Architecture

The pipeline consists of four distinct stages designed for speed and reliability:

Vector: A router that tails logs from all pods, filters for keywords like ERROR or EXCEPTION, and pushes them into a buffer.
Redis: A high-speed, standalone queue (error_buffer) that prevents the system from losing logs during traffic spikes.
The Watcher: A Spring Boot service that polls Redis, gathers Kubernetes metadata, and consults the AI.
Telegram: A comms bot that delivers the final analysis to your chat.

Core Components

1. Log Ingestion with Vector

Vector is configured as an agent to watch all Kubernetes logs. We use a VRL (Vector Remap Language) filter to ignore the Watcher's own logs and only capture genuine errors.

transforms:
  only_errors:
    type: filter
    inputs: [kubernetes_logs]
    condition: |
      is_error = match(string!(.message), r'(?i)(ERROR|EXCEPTION|CRITICAL)')
      is_not_watcher = !match(string!(.kubernetes.pod_name), r'(?i)watcher')
      is_error && is_not_watcher
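
If you want to sanity-check which lines this filter lets through before rolling it out, the same two conditions can be mirrored in a few lines of Java. This is only an illustrative sketch: the class `ErrorFilter` and the method `shouldForward` are names I made up for the example and are not part of Vector or the Watcher.

```java
import java.util.regex.Pattern;

// Hypothetical helper that mirrors the VRL condition above; not part of Vector or the Watcher.
final class ErrorFilter {
    // Same patterns as the VRL filter: case-insensitive, matched anywhere in the text.
    private static final Pattern ERROR = Pattern.compile("(?i)(ERROR|EXCEPTION|CRITICAL)");
    private static final Pattern WATCHER = Pattern.compile("(?i)watcher");

    // True when the message looks like an error and the pod is not the Watcher itself.
    static boolean shouldForward(String message, String podName) {
        boolean isError = ERROR.matcher(message).find();
        boolean isNotWatcher = !WATCHER.matcher(podName).find();
        return isError && isNotWatcher;
    }
}
```

One thing this makes obvious: because the match is a plain substring search, a harmless line such as "Completed with 0 errors" is forwarded too. Whether that is acceptable noise depends on your workloads.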

2. The AI Agent (Ollama)

The OllamaAgent communicates with a local LLM (like ministral-3:3b) via a RestClient. The system prompt is strictly tuned to analyze logs concisely while redacting sensitive information like IP addresses or passwords.

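// Requires java.util.Map and org.springframework.web.client.RestClientException.
public String call(String log) {
    // Request body for Ollama's /api/generate endpoint; "stream": false asks for one complete JSON reply.
    Map<String, Object> body = Map.of(
            "model", "ministral-3:3b",
            "system",
            """
            Analyse this log concisely.
            Provide a short summary of the root cause and the suggested fix.
            Redact any sensitive information eg: passwords, ip addresses, keys, etc.
            DO NOT repeat the log text in your response.
            """,
            "prompt", log,
            "stream", false
    );

    try {
        String response = ollamaRestClient.post()
                .uri("/api/generate")
                .body(body)
                .retrieve()
                .body(String.class);
        // Assumption: minimal completion of the excerpt. Ollama replies with a JSON object
        // whose "response" field holds the generated text; here the raw JSON is returned.
        return response;
    } catch (RestClientException e) {
        // Assumed fallback so a failed LLM call does not break the alert.
        return "AI analysis unavailable: " + e.getMessage();
    }
}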

3. The Observer

The Observer class runs on a scheduled interval (every 1000ms), popping logs from Redis and enriching them with live data from the KubernetesClient or DockerClient.
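
To make the pop-and-enrich step concrete, here is a rough, framework-free sketch of a single polling tick. It is not the Watcher's actual code: an in-memory `Deque` stands in for the Redis `error_buffer` list, the `enrich` function stands in for the Kubernetes/Docker lookup, and `ObserverSketch`, `tick` and `maxPerTick` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

// Illustrative only: a Deque replaces Redis and a Function replaces the metadata lookup.
final class ObserverSketch {
    private final Deque<String> errorBuffer;
    private final Function<String, String> enrich;

    ObserverSketch(Deque<String> errorBuffer, Function<String, String> enrich) {
        this.errorBuffer = errorBuffer;
        this.enrich = enrich;
    }

    // One polling tick: pop at most maxPerTick entries so a burst cannot hog a single run.
    List<String> tick(int maxPerTick) {
        List<String> enriched = new ArrayList<>();
        for (int i = 0; i < maxPerTick; i++) {
            String raw = errorBuffer.pollFirst(); // null when the buffer is empty
            if (raw == null) {
                break;
            }
            enriched.add(enrich.apply(raw));
        }
        return enriched;
    }
}
```

In a Spring Boot service, a method like `tick` would typically be driven by `@Scheduled(fixedDelay = 1000)`; outside Spring, a `ScheduledExecutorService` with a 1000 ms delay does the same job.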

It ensures the final message fits within Telegram's 4096-character limit by truncating long stack traces.
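
A truncation step along these lines would do the job. This is a minimal sketch, not the Watcher's real helper: `TelegramLimits`, `truncate` and the `...[truncated]` marker are assumptions made for the example, and length is counted in Java `char`s (UTF-16 code units), which errs on the safe side.

```java
// Hypothetical helper; the Watcher's real truncation logic is not shown in this post.
final class TelegramLimits {
    static final int MAX_MESSAGE_CHARS = 4096; // Telegram's limit for a text message
    private static final String MARKER = "\n...[truncated]";

    // Returns text unchanged if it fits, otherwise cuts it and appends a marker.
    static String truncate(String text) {
        if (text.length() <= MAX_MESSAGE_CHARS) {
            return text;
        }
        int cut = MAX_MESSAGE_CHARS - MARKER.length();
        // Do not split a surrogate pair (for example an emoji) at the cut point.
        if (Character.isHighSurrogate(text.charAt(cut - 1))) {
            cut--;
        }
        return text.substring(0, cut) + MARKER;
    }
}
```

If the message is also Markdown-escaped, truncating before escaping avoids a cut landing between a backslash and the character it escapes.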

Key Implementation Details

Markdown Safety: To prevent Telegram from rejecting messages, all dynamic content is passed through an escapeMarkdown helper that handles special characters like _, *, and -.
Contextual Awareness: The Watcher identifies if an error came from a K8s Pod or a standalone Docker container, fetching the current status (e.g., Running or Exited) to give the AI more context.
Efficient Deployment: Using environment variables for TELEGRAM_BOT_TOKEN and OLLAMA_URL allows the same container to be deployed across different clusters without code changes.
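
For the Markdown safety point, an escaping helper could look roughly like this. It is a sketch that assumes Telegram's MarkdownV2 parse mode (which is the mode that requires `-` to be escaped); the class name `MarkdownEscaper` is made up, and the real escapeMarkdown in the Watcher may differ.

```java
// Sketch of an escapeMarkdown-style helper, assuming Telegram's MarkdownV2 parse mode.
final class MarkdownEscaper {
    // Characters that MarkdownV2 requires to be escaped in ordinary text.
    private static final String SPECIAL = "_*[]()~`>#+-=|{}.!";

    // Prefixes every special character (and the backslash itself) with a backslash.
    static String escapeMarkdown(String text) {
        StringBuilder escaped = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }
}
```

Only dynamic content (pod names, log excerpts, the LLM's answer) should go through it; the formatting characters you add on purpose, such as the `*` around a bold title, must stay unescaped.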

Conclusion

By offloading log analysis to a local LLM, you reduce the "cognitive load" of maintaining a cluster. Instead of waking up to a wall of text, you wake up to:

🚨 Error: Pod 'auth-service' failed due to a Database Connection Timeout.
Fix: Check the DB service health.

Sure, there are probably off-the-shelf applications you can install to do all this for you, but where's the fun in that?

What's next for this project?

[ ] De-duplication: Prevent spamming the group chat if the same error occurs multiple times in a short window.
[ ] Interactive Mode: Ability to reply to a Telegram alert and ask the LLM follow-up questions about the specific container, pod or k8s setup.
[ ] Web Dashboard: A simple UI to see current error rates and Redis queue depth.
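
As a starting point for the de-duplication item, one possible approach is a simple time-window check per error key. Nothing below exists in the Watcher yet: `AlertDeduplicator`, `shouldSend` and the idea of keying on something like pod name plus the first log line are all assumptions for the sake of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the de-duplication idea; not part of the Watcher today.
final class AlertDeduplicator {
    private final long windowMillis;
    private final Map<String, Long> lastSent = new HashMap<>();

    AlertDeduplicator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if an alert with this key should be sent at time nowMillis.
    boolean shouldSend(String key, long nowMillis) {
        Long previous = lastSent.get(key);
        if (previous != null && nowMillis - previous < windowMillis) {
            return false; // same error was alerted within the window: suppress it
        }
        lastSent.put(key, nowMillis);
        return true;
    }
}
```

The map grows with every new key, so a real version would need some eviction, or could lean on Redis keys with a TTL since Redis is already in the pipeline.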