Monitoring

Lightweight observability for the MIP production stack. Covers host-level metrics via Node Exporter, per-pod K3s metrics via Prometheus + cAdvisor, the built-in MIP Server Dashboard, and external uptime checks.


What Each Tool Covers

| Tool | Scope | What it sees |
| --- | --- | --- |
| Node Exporter | Host (VM/bare-metal) | CPU, memory, disk, network, load avg per node |
| cAdvisor (built into kubelet) | Pod / container | CPU and memory usage per pod and container |
| kube-state-metrics | Kubernetes objects | Pod restarts, OOMKills, resource requests/limits, pod phase |
| Prometheus | Aggregation | Scrapes all of the above; stores time series; evaluates alerts |
| MIP Dashboard | Frontend | Built-in Metrics tab — node cards + pod table with inline bars |

Install Prometheus & Node Exporter

sudo apt install -y prometheus prometheus-node-exporter
sudo systemctl enable --now prometheus prometheus-node-exporter

Verify both are running:

sudo systemctl status prometheus prometheus-node-exporter
  • Prometheus UI: http://localhost:9090
  • Node Exporter metrics: http://localhost:9100/metrics

Enable K3s cAdvisor (Read-Only Port)

K3s embeds cAdvisor inside the kubelet. The simplest way to expose it to Prometheus without token auth is to enable the kubelet read-only port:

sudo mkdir -p /etc/rancher/k3s
sudo tee -a /etc/rancher/k3s/config.yaml <<'EOF'
kubelet-arg:
  - "read-only-port=10255"
EOF
sudo systemctl restart k3s

Wait ~30 seconds for K3s to restart, then verify cAdvisor is accessible:

curl -s http://localhost:10255/metrics/cadvisor | grep container_memory_working_set | head -5

If you get metric lines with pod names, it's working.

Warning

Port 10255 is read-only and unauthenticated — it should only be accessible on localhost or via Tailscale. Do not expose it publicly.


Prometheus Configuration

Replace the default config with this full working configuration:

sudo tee /etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Host-level metrics — CPU, memory, disk, network per node
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  # Per-pod CPU and memory via kubelet cAdvisor (read-only port)
  - job_name: 'k3s-cadvisor'
    static_configs:
      - targets: ['localhost:10255']
    metrics_path: /metrics/cadvisor

  # Kubelet metrics (pod scheduling, volume stats)
  - job_name: 'k3s-kubelet'
    static_configs:
      - targets: ['localhost:10255']
    metrics_path: /metrics

EOF
sudo systemctl restart prometheus

Verify it's scraping by opening http://localhost:9090/targets — all four jobs should show UP.
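Target health is also queryable directly: Prometheus exports an up series (1 = last scrape succeeded, 0 = failed) for every configured target, which is useful both for spot checks and for alerting. The job names below match the configuration above:

```promql
# Any scrape target that is currently down
up == 0

# Scrape health for the host and K3s jobs
up{job=~"node|k3s-cadvisor|k3s-kubelet"}
```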


Key Metrics

Node Exporter — host-level

What it collects:

| Metric group | Examples |
| --- | --- |
| CPU | node_cpu_seconds_total — per-core usage by mode (idle, user, system, iowait) |
| Memory | node_memory_MemAvailable_bytes, node_memory_MemTotal_bytes |
| Disk I/O | node_disk_read_bytes_total, node_disk_write_bytes_total |
| Disk space | node_filesystem_avail_bytes, node_filesystem_size_bytes |
| Network | node_network_receive_bytes_total, node_network_transmit_bytes_total, node_network_receive_errs_total |
| Load | node_load1, node_load5, node_load15 |

Note

Node Exporter is host-level only. It cannot see inside Kubernetes pods or containers.
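The raw counters above translate into usable percentages with a little PromQL. These queries are a sketch; the mountpoint="/" label value is an assumption, so adjust it to your filesystem layout:

```promql
# CPU utilisation % per node (1 minus the idle fraction)
100 * (1 - avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])))

# Memory utilisation %
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Root filesystem usage %
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
```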

cAdvisor — pod/container-level

| Metric | What it measures |
| --- | --- |
| container_cpu_usage_seconds_total | Cumulative CPU time per container — use irate(...[5m]) for current rate |
| container_memory_working_set_bytes | Active memory per container (excludes file cache — use this, not RSS) |
| container_memory_rss | RSS memory per container |
| container_fs_reads_bytes_total | Container disk reads |
| container_fs_writes_bytes_total | Container disk writes |
| container_network_receive_bytes_total | Per-pod network ingress |
| container_network_transmit_bytes_total | Per-pod network egress |
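To build intuition for what irate does with a cumulative counter like container_cpu_usage_seconds_total, here is a minimal sketch with made-up sample values: the rate is simply the delta in CPU-seconds divided by the sampling interval.

```shell
# Two hypothetical samples of container_cpu_usage_seconds_total,
# taken 15 s apart (values are cumulative CPU-seconds):
t0=120.0; t1=121.8; interval=15

# rate = (t1 - t0) / interval  -> fraction of one core in use
awk -v a="$t0" -v b="$t1" -v i="$interval" \
  'BEGIN { printf "%.2f cores\n", (b - a) / i }'
# prints: 0.12 cores
```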

Filter to MIP game server pods in PromQL:

# CPU usage for all MIP game server pods
irate(container_cpu_usage_seconds_total{pod=~"mip-server-fleet.*",container!="",container!="POD"}[5m])

# Memory working set for MIP game server pods
container_memory_working_set_bytes{pod=~"mip-server-fleet.*",container!="",container!="POD"}

# Sum CPU per pod (across all containers in the pod)
sum by(pod)(irate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m]))
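The same pod-name filter works for the network counters. Note that cAdvisor reports network traffic at the pod level, so no container label filter is needed here:

```promql
# Ingress bandwidth per MIP game server pod (bytes/sec)
sum by(pod) (irate(container_network_receive_bytes_total{pod=~"mip-server-fleet.*"}[5m]))

# Egress bandwidth per pod (bytes/sec)
sum by(pod) (irate(container_network_transmit_bytes_total{pod=~"mip-server-fleet.*"}[5m]))
```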

MIP Server Dashboard

The dashboard/ app has a built-in Metrics tab that connects to Prometheus via the Vite dev proxy (no CORS issues) and displays:

  • Node cards — CPU %, memory used/total, disk used/total with animated progress bars
  • Game Server Pods — your mip-server-fleet-* pods at the top, CPU and memory with inline relative bars
  • System Pods — Agones, K3s, and other system pods below
  • Auto-refreshes every 30 seconds

The Dashboard tab → Game Servers table also shows CPU and RAM columns per pod pulled directly from Prometheus.

Setup

Configure dashboard/.env.test to point to your server via Tailscale:

VITE_API_TARGET=http://100.65.48.118:3000
VITE_PROMETHEUS_URL=http://100.65.48.118:9090

Run against the remote server:

cd dashboard
npm run dev:remote

The Vite dev server proxies:

| Path | Destination |
| --- | --- |
| /api/* | Backend (:3000) |
| /prometheus/* | Prometheus (:9090) — strips the /prometheus prefix |

Prometheus itself does not need CORS enabled — all requests go through the Vite proxy.

Accessing via Tailscale

Prometheus (:9090) and Node Exporter (:9100) do not need to be publicly exposed. With Tailscale running on both the server and your local machine, traffic goes through the encrypted Tailscale tunnel.

To find the server's Tailscale IP from your local machine:

tailscale status

Look for the mip-server entry — use that IP in .env.test.


kube-state-metrics — Cluster Object State

cAdvisor tells you resource usage. kube-state-metrics tells you Kubernetes object state — whether pods are running, restart counts, OOMKills, resource requests/limits.

Install

kubectl apply -f https://github.com/kubernetes/kube-state-metrics/releases/latest/download/kube-state-metrics.yaml
kubectl get pods -n kube-system | grep kube-state-metrics

Add a scrape job to prometheus.yml. Because the apt-installed Prometheus runs outside the cluster, it cannot resolve cluster-internal DNS names like kube-state-metrics.kube-system.svc.cluster.local. Use the service's ClusterIP instead, which is routable from the K3s node itself:

kubectl get svc -n kube-system kube-state-metrics -o jsonpath='{.spec.clusterIP}'

- job_name: 'kube-state-metrics'
  static_configs:
    - targets: ['<cluster-ip>:8080']   # substitute the ClusterIP from above

Key metrics:

| Metric | What it measures |
| --- | --- |
| kube_pod_status_phase | Pod phase: Running / Pending / Failed / Succeeded |
| kube_pod_container_status_restarts_total | Container restart count |
| kube_pod_container_status_last_terminated_reason | OOMKilled, Error, Completed |
| kube_pod_container_resource_requests | CPU/memory requests per container |
| kube_pod_container_resource_limits | CPU/memory limits per container |
| kube_node_status_condition | Node Ready / DiskPressure / MemoryPressure |

Useful PromQL:

# Pods that have restarted more than 5 times
kube_pod_container_status_restarts_total > 5

# OOMKilled containers
increase(kube_pod_container_status_restarts_total[1h]) > 0
  and on(pod, container)
  kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

# Node memory pressure
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1

Recommended Alerts

| Metric | Alert condition | Severity | Why |
| --- | --- | --- | --- |
| Node CPU usage | > 85% for 5 min | Warning | Game servers are CPU-bound |
| Node memory usage | > 90% | Critical | OOM killer terminates pods |
| Node disk usage | > 85% | Warning | Docker images and logs fill disk fast |
| Node network errors | > 0 sustained | Warning | Packet loss = player disconnects |
| Pod container restarts | > 5 total | Warning | Crash loop likely |
| Pod OOMKilled | Any | Critical | Pod exceeded memory limit — raise limit or investigate leak |
| Pod phase Failed | Any | Critical | Pod is not running |
| Node DiskPressure | True | Critical | K3s will start evicting pods |
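A few of these thresholds expressed as Prometheus alerting rules. This is a sketch: the rule-file path /etc/prometheus/rules/mip-alerts.yml and the alert names are assumptions, and the file must be referenced from a rule_files: section in prometheus.yml:

```yaml
groups:
  - name: mip-alerts
    rules:
      - alert: NodeHighCPU
        expr: 100 * (1 - avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]))) > 85
        for: 5m
        labels:
          severity: warning
      - alert: NodeHighMemory
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 90
        labels:
          severity: critical
      - alert: PodOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        labels:
          severity: critical
```

Restart Prometheus after adding the file, then check the Alerts page at http://localhost:9090/alerts to confirm the rules loaded.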

kubectl top — Quick CLI Check

Install metrics-server to enable kubectl top without Prometheus:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

For K3s self-signed certs, patch the deployment to skip TLS verification:

kubectl patch deployment metrics-server -n kube-system \
  --type='json' \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

Then:

kubectl top nodes                              # CPU/mem per node
kubectl top pods -A                            # CPU/mem per pod across all namespaces
kubectl top pods -n default --sort-by=memory   # sort by highest memory consumer

Note

kubectl top uses metrics-server for live snapshots only — no history, no alerting. Use Prometheus + the MIP Dashboard for persistent metrics.


Uptime Checks

Use an external service to alert when your backend goes offline entirely (complements internal Prometheus):

| Service | Free tier | Setup |
| --- | --- | --- |
| UptimeRobot | 50 monitors, 5-min intervals | HTTP check against /health endpoint |
| HetrixTools | 15 monitors | HTTP + port checks |
| Better Stack | 10 monitors | HTTP + on-call schedules |

Point the check at your NestJS backend health endpoint:

GET https://your-backend-domain/health
Expected: 200 OK

Alert via email, Slack, or Discord webhook on any non-200 response.