Under the Hood: The Architecture Behind One Cluster to Rule Them All

In the last post, I announced I was collapsing two Kubernetes clusters into one, built on Talos Linux. I promised the full story: the architecture, the decisions, the things that broke along the way, and everything that’s running now. Here it is.

This is a layer-by-layer breakdown of the full stack — from how the nodes boot to how an AI assistant watches over everything from Slack. Nothing is glossed over.

The Foundation: Talos Linux

Everything runs on Talos Linux v1.12.4 — an immutable, API-driven OS with no SSH, no package manager, and no shell access. Machine configuration is managed declaratively through the Talos API. Push a config, the node converges. That’s it.

The cluster has six nodes: three control planes (talostrk1, talostrk5, talostrk6) and three workers (talostrk2, talostrk3, talostrk4). Control plane scheduling is enabled — they’re not just coordinators, they run workloads too. The cluster endpoint is a control-plane VIP managed natively by Talos.
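The VIP setup is a small fragment of the control-plane machine config. A sketch of what that looks like, with a placeholder interface name and address:

```yaml
# Sketch of a Talos controlplane machine-config patch enabling the
# shared cluster-endpoint VIP. "eth0" and the address are placeholders;
# the real config uses whatever NIC and VIP the network defines.
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: true
        vip:
          ip: 192.168.1.100   # placeholder cluster-endpoint VIP
```

Talos floats the VIP between healthy control-plane nodes, so the endpoint survives the loss of any single one.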

Bootstrap is fully automated via Ansible. A single playbook generates Talos machine configs, applies them to the nodes, triggers cluster bootstrap on the first control plane, and retrieves the kubeconfig. Every rebuild starts from the exact same known state.

Secrets: Infisical

Nothing sensitive lives in the repo. Every secret — database passwords, API keys, tokens, Cloudflare credentials — is stored in Infisical and pulled at deploy time using a machine identity. Secrets are tagged by cluster name, so one Infisical project can serve multiple environments without anything bleeding between them.

At playbook runtime, a machine identity authenticates to Infisical using a client secret passed as an environment variable. Secrets are fetched and injected into Ansible roles as variables, which use them to create Kubernetes Secrets on the cluster. Nothing hits the filesystem. Nothing is logged. The repo stays clean.
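The last hop of that flow can be sketched as an Ansible task. The resource names and the `infisical_secrets` variable here are illustrative, not the actual playbook:

```yaml
# Hypothetical Ansible task: a secret fetched from Infisical (held only
# in a variable, never written to disk) becomes a Kubernetes Secret.
- name: Create WordPress database credentials
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: Secret
      metadata:
        name: wordpress-db
        namespace: wordpress
      type: Opaque
      stringData:
        MYSQL_PASSWORD: "{{ infisical_secrets.MYSQL_PASSWORD }}"
  no_log: true   # keep the secret value out of Ansible's output
```

`no_log` matters here: without it, a failed task would happily print the rendered Secret, password included.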

Networking

This layer has a few moving pieces that all need to work together: a bare-metal load balancer, an ingress controller, Gateway API routing, and a Cloudflare tunnel that means zero open ports on my router. Ever.

kube-vip — Bare Metal Load Balancer

kube-vip runs as a DaemonSet and manages a pool of real LAN IPs using ARP. When a service of type LoadBalancer is created, kube-vip assigns it an IP from the pool. This is what makes in-cluster services reachable from the local network without touching the router. Dispatcharr, for example, is pinned to a static LAN address so IPTV clients always know where to point — more on that in the applications section.
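A hypothetical Service showing the pinning pattern. The annotation is kube-vip's documented way to request a specific address from the pool; the IP and port here are placeholders:

```yaml
# Hypothetical Service pinning Dispatcharr to a static LAN address
# via kube-vip. Address and port are placeholder values.
apiVersion: v1
kind: Service
metadata:
  name: dispatcharr
  namespace: dispatcharr
  annotations:
    kube-vip.io/loadbalancerIPs: "192.168.1.50"
spec:
  type: LoadBalancer
  selector:
    app: dispatcharr
  ports:
    - name: http
      port: 9191
      targetPort: 9191
```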

Traefik — Ingress and Gateway API

Traefik runs two replicas as a ClusterIP service — it’s not directly exposed to the network because all external traffic goes through the Cloudflare tunnel first. Traefik handles both legacy Ingress objects and the newer Gateway API HTTPRoutes, acting as the traefik GatewayClass implementation. TLS is handled via a Cloudflare DNS-01 certificate resolver.
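The resolver is a few lines of Traefik static configuration. A sketch, with placeholder email and storage path; the Cloudflare API token itself is supplied via environment variables (from Infisical), never in the config:

```yaml
# Sketch of Traefik static configuration for a Cloudflare DNS-01
# ACME resolver. Email and storage path are placeholders.
certificatesResolvers:
  cloudflare:
    acme:
      email: admin@example.com
      storage: /data/acme.json
      dnsChallenge:
        provider: cloudflare
```

DNS-01 is the right challenge type here because nothing is reachable on port 80 from the internet, so HTTP-01 could never complete.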

Gateway API + Cloudflared

All modern service exposure runs through Gateway API HTTPRoutes pointing at a single default-gateway. HTTP and HTTPS listeners are configured with TLS termination and accept routes from all namespaces.
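A sketch of that default-gateway; the certificate Secret name and namespace are placeholders:

```yaml
# Sketch of the default-gateway described above. The certificateRefs
# Secret name and the namespace are placeholder values.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: default-gateway
  namespace: default
spec:
  gatewayClassName: traefik
  listeners:
    - name: web
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
    - name: websecure
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-tls
      allowedRoutes:
        namespaces:
          from: All
```

`allowedRoutes.namespaces.from: All` is what lets each application own its HTTPRoute in its own namespace while sharing the one gateway.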

Three Cloudflared pods form the public-facing edge. Every inbound internet request flows like this:

Cloudflare → cloudflared tunnel → Traefik → Gateway API HTTPRoute → service

No ports open on the router. No firewall rules. No exposed public IPs. Cloudflare Zero Trust access policies sit in front of anything internal that shouldn’t be open to the world.
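Attaching a service to that chain is one HTTPRoute. A hypothetical example for Mealie, with placeholder hostname, namespace, and port:

```yaml
# Hypothetical HTTPRoute exposing Mealie through the shared gateway.
# Hostname, gateway namespace, and backend port are placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mealie
  namespace: mealie
spec:
  parentRefs:
    - name: default-gateway
      namespace: default
  hostnames:
    - mealie.example.com
  rules:
    - backendRefs:
        - name: mealie
          port: 9000
```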

Internet
+-- Cloudflare Edge
    +-- cloudflared tunnel (3 pods, outbound-only)
        +-- Traefik (ClusterIP, 2 replicas)
            |-- HTTPRoute -> thedougie.com   -> WordPress  [public]
            +-- HTTPRoute -> mealie.*        -> Mealie     [public]

LAN (internal, not routed through tunnel)
+-- kube-vip LoadBalancer pool
    |-- Traefik LAN IP        -> Grafana, Headlamp, and others
    +-- Dispatcharr static IP -> IPTV clients

Storage: Longhorn

Longhorn 1.11.0 handles all persistent storage. Volumes default to 2 replicas with a best-effort data locality policy. The storage path lives at /var/mnt/longhorn — a Talos-compatible location set up via kubelet extra mounts.
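On Talos that extra mount is declared in the machine config, since there's no shell to create and mount directories by hand. A sketch of the fragment:

```yaml
# Sketch of the Talos machine-config fragment making the Longhorn
# data path available to the kubelet as a shared, writable bind mount.
machine:
  kubelet:
    extraMounts:
      - destination: /var/mnt/longhorn
        type: bind
        source: /var/mnt/longhorn
        options:
          - bind
          - rshared
          - rw
```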

There are two volume classes in play. RWO (ReadWriteOnce) volumes serve databases and single-consumer workloads: MySQL for WordPress, PostgreSQL for Mealie and Dispatcharr, Prometheus metrics, and Grafana state. RWX (ReadWriteMany) volumes are shared across replicated pods — all three WordPress pods mount the same /var/www/html, all three Mealie pods share the same data directory.
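Requesting either class is an ordinary PVC against the longhorn StorageClass. A hypothetical claim for the shared WordPress document root:

```yaml
# Hypothetical RWX claim for the shared WordPress document root;
# all three replicas mount this same volume. Names are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wordpress-html
  namespace: wordpress
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: longhorn
  resources:
    requests:
      storage: 15Gi
```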

Backups go to Backblaze B2 via the S3-compatible API. If the cluster burns to the ground, the data survives.
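The backup schedule is declarative too, via Longhorn's RecurringJob CRD. A hypothetical job with placeholder schedule and retention; the B2 bucket itself is configured separately as Longhorn's backup target:

```yaml
# Hypothetical Longhorn recurring backup job. Cron schedule and
# retention count are placeholder values.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  cron: "0 3 * * *"
  task: backup
  groups:
    - default
  retain: 7
  concurrency: 2
```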

Longhorn Storage Architecture

6 Nodes (talostrk1-6)
+-- /var/mnt/longhorn   (Talos kubelet extra mount, per-node disk)
    +-- longhorn-manager DaemonSet pod
        +-- Monitors node storage health
        +-- Orchestrates replica placement
        +-- Reports to Longhorn Manager (control plane)

Persistent Volume Request (e.g. WordPress MySQL 10Gi RWO)
+-- StorageClass: longhorn
    +-- Longhorn creates Volume with 2 replicas
        |-- Replica 1  ->  talostrk2  /var/mnt/longhorn/...
        +-- Replica 2  ->  talostrk4  /var/mnt/longhorn/...
    +-- Engine instance-manager proxies I/O to active replicas
    +-- If a node goes down, replica on surviving node takes over

RWX Volume (e.g. WordPress /var/www/html 15Gi)
+-- Longhorn RWX uses NFS share over the volume
    +-- share-manager pod exposes NFS endpoint
    +-- All 3 WordPress pods mount same NFS path simultaneously

Backups
+-- Longhorn recurring backup job
    +-- Snapshots -> Backblaze B2 (S3-compatible API)
    +-- Volume data survives total cluster loss

Observability

The observability stack is three pieces working together:

  • kube-prometheus-stack — Prometheus (100Gi PVC, 30-day retention), Grafana, and Alertmanager. ServiceMonitors cover Longhorn, Traefik, and Cloudflared out of the box. Grafana auth goes through Cloudflare Access.
  • Loki — Single-binary monolithic mode with bundled MinIO for object storage, backed by Longhorn. Tuned ingestion limits for a busy homelab cluster.
  • Alloy — Grafana’s next-gen agent, deployed as a DaemonSet across every node including control planes. Ships logs to the Loki gateway. Alloy replaced Promtail and brings significantly more flexibility for log collection and transformation pipelines.

One important implementation detail: the node-exporter relabeling is configured to set instance=<nodename> rather than IP:port. This matters because Trimaran — the scheduler plugin in the next section — needs to match Prometheus metrics by node name when making placement decisions.
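That relabeling is a small override in the kube-prometheus-stack Helm values. A sketch of the pattern, assuming the chart's bundled node-exporter subchart:

```yaml
# Sketch of kube-prometheus-stack values: relabel node-exporter
# targets so the instance label is the node name, not IP:port,
# letting Trimaran match Prometheus metrics to nodes.
prometheus-node-exporter:
  prometheus:
    monitor:
      relabelings:
        - action: replace
          sourceLabels:
            - __meta_kubernetes_pod_node_name
          targetLabel: instance
```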

Scheduling: From Descheduler to Trimaran

This is where things got interesting. Let me tell the whole story.

The Descheduler Problem

The original plan included the Kubernetes Descheduler, running as a CronJob every 15 minutes. The concept is sound: Kubernetes makes scheduling decisions at pod creation time but doesn’t rebalance afterward. Nodes drift into uneven states over time, and the descheduler is supposed to fix that by evicting pods from overloaded or underutilized nodes.

In practice, it was causing more problems than it solved. The descheduler was evicting pods based on utilization thresholds — but then the default Kubernetes scheduler, which works from resource requests not actual usage, would look at the numbers and often drop the pod right back onto the same node. Evict, reschedule to the same place, evict again. Nothing improved, but the cluster was constantly in motion.

Worse, it was evicting stateful pods — PostgreSQL, MySQL. Those take time to come back up, and while they’re restarting, every service that depends on them starts throwing errors. By the time the database was healthy again, the next descheduler run was already queued up. Mealie, WordPress, anything with a database dependency was caught in the middle of this cycle.

Jarvis caught it before I did:

💬 Slack — #homelab
Me  12:38 PM
Any issues in the last 9 hours?
Jarvis  12:38 PM
Checking for recent events…
Jarvis  12:38 PM
⚠️ Issues in the last 9 hours — Critical Pattern
Descheduler is evicting pods from talostrk2 every 15 minutes due to low node utilization. Cascading failures:

1. Mealie — All 3 web pods failing health probes (HTTP 500) after database restarts
   • PostgreSQL being evicted repeatedly, causing connection failures
   • New pods starting but failing startup probes (connection refused)

2. WordPress — All 3 web pods failing liveness probes (HTTP 500)
   • MySQL being evicted from talostrk4, causing readiness probe failures
   • New WordPress pods failing startup probes (timeout)

3. MySQL — Readiness probe failures during startup (socket connection errors)

Root cause: Descheduler is too aggressive. Evicting stateful pods breaks dependent services. The “low node utilization” threshold is triggering even though the cluster has plenty of capacity.

Recommendation: Disable descheduler or adjust its policy to exclude stateful workloads.
Me  12:39 PM
Can you disable descheduler? Not asking you to. Asking if you can.
Jarvis  12:39 PM
Yes. Two options:

1. Scale the deployment to 0 replicas — stops it immediately, reversible
2. Delete the CronJob — removes the scheduled executions, also reversible

Both are non-destructive. I’d need explicit confirmation before doing either.
Me  12:40 PM
Please do option 1
Jarvis  12:40 PM
Disabling descheduler by scaling to 0…

Descheduler isn’t a Deployment — it’s a CronJob. Deleting the CronJob instead.

Done. Descheduler CronJob deleted. No more evictions every 15 minutes. Existing pods will stabilize.
Me  3:24 PM
Confirm that descheduler isn’t running anymore
Jarvis  3:24 PM
Confirmed. No descheduler pods running. CronJob is gone. No more scheduled evictions.
Me  3:24 PM
Check the cluster status
Jarvis  3:24 PM
✅ Cluster Status — Healthy. No events.

Resource Usage:
• Cluster: 3.45 CPU cores (43%), 39.7 GB memory (51%)
• talostrk1: 28% CPU, 56% memory
• talostrk2: 11% CPU, 67% memory (Loki / Longhorn)
• talostrk4: 16% CPU, 67% memory (Longhorn instance-manager)
• Others: 11–17% CPU, 17–32% memory

Applications:
• Mealie: 3 web pods healthy, PostgreSQL healthy
• WordPress: 3 web pods healthy, MySQL healthy
• Dispatcharr: Web / Celery / DB / Redis all running
• System: All core services operational

No descheduler evictions. Pods are stable.

The problem wasn’t just a misconfigured threshold. It was a fundamental mismatch: a reactive eviction tool fighting a scheduler that was putting pods right back where they came from. I didn’t want to get into a cycle of endlessly tuning eviction policies, so I pulled it out and replaced it with something that solves the actual problem.

Trimaran — Proactive Placement with Real Metrics

Trimaran (part of the scheduler-plugins project from SIG Scheduling) takes a completely different approach: instead of evicting pods reactively after the fact, it makes better placement decisions at scheduling time using real utilization data from Prometheus.

The LoadVariationRiskBalancing plugin scores nodes using a formula that accounts for mean utilization, the incoming pod’s resource request, and variance across the observation window:

risk = μ (mean utilization) + r (incoming request) + σ (standard deviation) — lower risk yields a higher node score

A node sitting at 10% resource requests but 80% actual utilization won’t score well, and new pods won’t land there. This is what the default scheduler fundamentally can’t do — it has no visibility into real usage.
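Enabling the plugin is a scheduler profile in a KubeSchedulerConfiguration. A sketch under stated assumptions: field names follow the scheduler-plugins docs, but the exact apiVersion and the Prometheus address depend on the installed release and monitoring namespace:

```yaml
# Sketch of a KubeSchedulerConfiguration enabling
# LoadVariationRiskBalancing; apiVersion and the Prometheus
# address vary by release and are assumptions here.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: scheduler-plugins-scheduler
    plugins:
      score:
        enabled:
          - name: LoadVariationRiskBalancing
    pluginConfig:
      - name: LoadVariationRiskBalancing
        args:
          safeVarianceMargin: 1
          metricProvider:
            type: Prometheus
            address: http://prometheus-operated.monitoring.svc.cluster.local:9090
```

The safeVarianceMargin scales how heavily σ counts against a node, which is the knob mentioned in the tuning note at the end of this post.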

The Trimaran scheduler is auto-injected cluster-wide using a Kyverno ClusterPolicy. Every new pod — except system namespaces and DaemonSets — automatically gets schedulerName: scheduler-plugins-scheduler set without any per-deployment configuration:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-trimaran-scheduler
spec:
  rules:
    - name: inject-scheduler
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
                - scheduler-plugins
                - kyverno
      mutate:
        patchStrategicMerge:
          spec:
            schedulerName: scheduler-plugins-scheduler

No more reactive evictions. No more cascading database failures. Pods land where the cluster actually has headroom.

The AI Layer: OpenClaw and Nanobot

OpenClaw

OpenClaw is a full agent platform. It runs in the cluster with its own persistent file system, a skills system (SOUL.md, AGENTS.md, TOOLS.md), and full Kubernetes MCP access. It’s genuinely powerful — the kind of environment where you can build complex, multi-step workflows with real persistence and long-running context.

It’s also more than I needed for day-to-day cluster management. OpenClaw isn’t going anywhere — I have plans for it that go beyond cluster operations. But for the “hey, what’s going on right now” use case, it carries more surface area than necessary. I wanted something lighter for the daily driver role.

Nanobot — aka Jarvis

Nanobot is what I actually reach for every day. It's an ultra-lightweight personal AI assistant that lives in your chat platform of choice: Slack, Discord, Telegram, whatever you use. It runs in the cluster as a single pod, and for me it connects over Slack Socket Mode, meaning no ingress, no public URL, nothing listening on a port. Purely outbound. It responds when mentioned.

It’s wired into the same Kubernetes MCP server that OpenClaw uses. So when I ask it what’s running, it actually checks. When I ask it to do something, it can — with my confirmation first, as the chat log above shows.

The memory system keeps a 40-message rolling context window, consolidating to MEMORY.md and HISTORY.md files on a persistent volume. It remembers what we talked about. It knows the cluster. 99% fewer lines of code than OpenClaw, and exactly what I needed for the job it does.

Applications

Mealie

Recipe management for the household. Mealie runs 3 replicas backed by PostgreSQL, with a shared RWX Longhorn volume so all replicas work from the same data directory. It sits behind the Cloudflare tunnel with a public-facing route.

WordPress

The blog you’re reading right now runs in the cluster — 3 WordPress replicas sharing a 15Gi RWX volume for /var/www/html, backed by MySQL on a dedicated RWO volume.

Dispatcharr

Dispatcharr is an open-source IPTV management platform in the spirit of the *arr ecosystem. It consolidates streams from multiple IPTV providers into a single interface, auto-matches EPG data, and re-exports everything as M3U, Xtream Codes API, XMLTV, or as an HDHomeRun virtual tuner. That last format is the key integration: Jellyfin, Plex, and Emby all discover Dispatcharr as a live TV source and can record programs directly to their own DVR libraries.

Streams are proxied and relayed through configurable profiles — FFmpeg transcoding, Streamlink, VLC, or raw passthrough — so different clients can get different quality tiers or formats. Real-time monitoring tracks active connections, bandwidth usage, and buffering events. When a stream source goes down, automatic failover switches to a backup without interrupting viewers.

It runs as four components: web server, Celery worker, Redis, and PostgreSQL. It gets a static LAN IP via kube-vip so IPTV clients always know where to point.

The Kubernetes MCP Server

The Kubernetes MCP Server is the engine that makes the AI layer useful. It exposes the Kubernetes API as a set of MCP tools — list pods, scale deployments, describe nodes, pull events, exec commands. Both OpenClaw and Nanobot connect to it in-cluster via HTTP. It runs with cluster-admin RBAC, which is why Nanobot can do what it does when you ask it about the cluster state or tell it to take action.
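The cluster-admin grant is a standard ClusterRoleBinding. A hypothetical version, with placeholder ServiceAccount name and namespace:

```yaml
# Hypothetical binding giving the MCP server's ServiceAccount
# cluster-admin. ServiceAccount name and namespace are placeholders.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: k8s-mcp-server-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: k8s-mcp-server
    namespace: mcp
```

Cluster-admin for an AI-reachable component is a deliberate trade-off; the confirmation-before-action pattern in the chat log above is the guardrail.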

Headlamp

Headlamp is my visual Kubernetes UI. It’s what I reach for when I want to browse the cluster without opening a terminal.

What’s Next

The cluster is stable. The full stack is up and documented. A few things still on my radar:

  • OpenClaw’s next role — It’s a capable platform and I haven’t put it to full use yet. I want to find a task that actually warrants it. More on that when it happens.
  • Trimaran tuning — The variance margin is currently set more aggressively than the upstream default. I want to watch placement decisions over a few weeks of real load before declaring the config optimal.
  • More Nanobot skills — Nanobot supports the ClawHub skill registry, which means new capabilities can be dropped in as standalone skill files without touching the core codebase. More skills are in the queue.
  • Stirling-PDF — Working on getting Stirling-PDF deployed in the cluster. Self-hosted document processing, fits the pattern of everything else here.

That’s the stack. One cluster, no drama — most of the time.

Leave a Comment