Monitoring¶
Lightweight observability for the MIP production stack. Covers host-level metrics via Node Exporter, per-pod K3s metrics via Prometheus + cAdvisor, the built-in MIP Server Dashboard, and external uptime checks.
What Each Tool Covers¶
| Tool | Scope | What it sees |
|---|---|---|
| Node Exporter | Host (VM/bare-metal) | CPU, memory, disk, network, load avg per node |
| cAdvisor (built into kubelet) | Pod / Container | CPU and memory usage per pod and container |
| kube-state-metrics | Kubernetes objects | Pod restarts, OOMKills, resource requests/limits, pod phase |
| Prometheus | Aggregation | Scrapes all of the above; stores time series; evaluates alerts |
| MIP Dashboard | Frontend | Built-in Metrics tab — node cards + pod table with inline bars |
Install Prometheus & Node Exporter¶
sudo apt install -y prometheus prometheus-node-exporter
sudo systemctl enable --now prometheus prometheus-node-exporter
Verify both are running:
- Prometheus UI: http://localhost:9090
- Node Exporter metrics: http://localhost:9100/metrics
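Both endpoints serve the Prometheus text exposition format. If you ever want to consume that output without Prometheus, here is a deliberately naive parser sketch (it ignores escaping, timestamps, and `# HELP`/`# TYPE` comment lines, which the real format allows):

```python
# Naive sketch: split one Prometheus text-format sample line, e.g. from
# http://localhost:9100/metrics, into its series name (with labels) and value.
# Real parsing also has to handle escaping, timestamps, and comment lines.
def parse_sample(line: str) -> tuple[str, float]:
    series, _, value = line.rpartition(" ")
    return series, float(value)

print(parse_sample("node_load1 0.42"))
print(parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 81432.9'))
```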
Enable K3s cAdvisor (Read-Only Port)¶
K3s embeds cAdvisor inside the kubelet. The simplest way to expose it to Prometheus without token auth is to enable the kubelet read-only port:
sudo mkdir -p /etc/rancher/k3s
sudo tee -a /etc/rancher/k3s/config.yaml <<'EOF'
kubelet-arg:
  - "read-only-port=10255"
EOF
sudo systemctl restart k3s
Wait ~30 seconds for K3s to restart, then verify cAdvisor is accessible:

curl -s http://localhost:10255/metrics/cadvisor | head

If you get metric lines with pod names, it's working.
Warning
Port 10255 is read-only and unauthenticated — it should only be accessible on localhost or via Tailscale. Do not expose it publicly.
Prometheus Configuration¶
Replace the default config with this full working configuration:
sudo tee /etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Host-level metrics — CPU, memory, disk, network per node
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  # Per-pod CPU and memory via kubelet cAdvisor (read-only port)
  - job_name: 'k3s-cadvisor'
    static_configs:
      - targets: ['localhost:10255']
    metrics_path: /metrics/cadvisor

  # Kubelet metrics (pod scheduling, volume stats)
  - job_name: 'k3s-kubelet'
    static_configs:
      - targets: ['localhost:10255']
    metrics_path: /metrics
EOF
sudo systemctl restart prometheus
Verify it's scraping by opening http://localhost:9090/targets — all four jobs should show UP.
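The same check can be scripted against Prometheus's HTTP API: `GET /api/v1/targets` returns JSON with an `activeTargets` list carrying a `health` field per target. A sketch, using an abbreviated, invented response:

```python
# Sketch: flag Prometheus scrape targets that are not UP, given the JSON
# body returned by GET http://localhost:9090/api/v1/targets.
def down_targets(resp: dict) -> list[str]:
    """Return 'job (scrapeUrl)' for every active target whose health != 'up'."""
    return [
        f"{t['labels']['job']} ({t['scrapeUrl']})"
        for t in resp["data"]["activeTargets"]
        if t["health"] != "up"
    ]

# Abbreviated, invented response for illustration.
sample = {"data": {"activeTargets": [
    {"labels": {"job": "node"},
     "scrapeUrl": "http://localhost:9100/metrics", "health": "up"},
    {"labels": {"job": "k3s-cadvisor"},
     "scrapeUrl": "http://localhost:10255/metrics/cadvisor", "health": "down"},
]}}
print(down_targets(sample))  # ['k3s-cadvisor (http://localhost:10255/metrics/cadvisor)']
```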
Key Metrics¶
Node Exporter — host-level¶
What it collects:
| Metric group | Examples |
|---|---|
| CPU | node_cpu_seconds_total — per-core usage by mode (idle, user, system, iowait) |
| Memory | node_memory_MemAvailable_bytes, node_memory_MemTotal_bytes |
| Disk I/O | node_disk_read_bytes_total, node_disk_write_bytes_total |
| Disk space | node_filesystem_avail_bytes, node_filesystem_size_bytes |
| Network | node_network_receive_bytes_total, node_network_transmit_bytes_total, node_network_receive_errs_total |
| Load | node_load1, node_load5, node_load15 |
Note
Node Exporter is host-level only. It cannot see inside Kubernetes pods or containers.
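Since node_cpu_seconds_total is a cumulative counter, utilization falls out of rate math over two scrapes, which is exactly what PromQL's irate() computes. A sketch with invented sample values:

```python
# Sketch: how rate math turns node_cpu_seconds_total counters into a
# CPU-utilization percentage. All sample values below are invented.
def cpu_utilization_pct(idle_t0: float, idle_t1: float,
                        interval_s: float, n_cores: int) -> float:
    """Utilization % over a window, from cumulative idle-mode CPU seconds."""
    idle_rate = (idle_t1 - idle_t0) / interval_s  # idle core-seconds per second
    return 100.0 * (1.0 - idle_rate / n_cores)

# Two scrapes 15 s apart on a 4-core node: 45 idle core-seconds elapsed.
print(cpu_utilization_pct(1000.0, 1045.0, 15.0, 4))  # 25.0 -> node is ~25% busy
```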
cAdvisor — pod/container-level¶
| Metric | What it measures |
|---|---|
| `container_cpu_usage_seconds_total` | Cumulative CPU time per container — use `irate(...[5m])` for current rate |
| `container_memory_working_set_bytes` | Active memory per container (excludes file cache — use this, not RSS) |
| `container_memory_rss` | RSS memory per container |
| `container_fs_reads_bytes_total` | Container disk reads |
| `container_fs_writes_bytes_total` | Container disk writes |
| `container_network_receive_bytes_total` | Per-pod network ingress |
| `container_network_transmit_bytes_total` | Per-pod network egress |
Filter to MIP game server pods in PromQL:
# CPU usage for all MIP game server pods
irate(container_cpu_usage_seconds_total{pod=~"mip-server-fleet.*",container!="",container!="POD"}[5m])
# Memory working set for MIP game server pods
container_memory_working_set_bytes{pod=~"mip-server-fleet.*",container!="",container!="POD"}
# Sum CPU per pod (across all containers in the pod)
sum by(pod)(irate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m]))
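For intuition, sum by(pod) groups series on the pod label and adds their values. A toy sketch with invented per-container CPU rates:

```python
# Sketch of what PromQL's `sum by (pod)` does: aggregate per-container
# CPU rates (cores) into per-pod totals. Series and values are invented.
from collections import defaultdict

series = [
    ({"pod": "mip-server-fleet-abc", "container": "game"},    0.25),
    ({"pod": "mip-server-fleet-abc", "container": "sidecar"}, 0.125),
    ({"pod": "mip-server-fleet-def", "container": "game"},    0.5),
]

per_pod: dict[str, float] = defaultdict(float)
for labels, cpu_cores in series:
    per_pod[labels["pod"]] += cpu_cores

print(dict(per_pod))  # {'mip-server-fleet-abc': 0.375, 'mip-server-fleet-def': 0.5}
```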
MIP Server Dashboard¶
The dashboard/ app has a built-in Metrics tab that connects to Prometheus via the Vite dev proxy (no CORS issues) and displays:
- Node cards — CPU %, memory used/total, disk used/total with animated progress bars
- Game Server Pods — your mip-server-fleet-* pods at the top, CPU and memory with inline relative bars
- System Pods — Agones, K3s, and other system pods below
- Auto-refreshes every 30 seconds
The Dashboard tab → Game Servers table also shows CPU and RAM columns per pod pulled directly from Prometheus.
Setup¶
Configure dashboard/.env.test with your server's Tailscale IP, then start the Vite dev server against the remote backend.

The Vite dev server proxies:
| Path | Destination |
|---|---|
| `/api/*` | Backend (:3000) |
| `/prometheus/*` | Prometheus (:9090) — strips the /prometheus prefix |
Prometheus itself does not need CORS enabled — all requests go through the Vite proxy.
Accessing via Tailscale¶
Prometheus (:9090) and Node Exporter (:9100) do not need to be publicly exposed. With Tailscale running on both the server and your local machine, traffic goes through the encrypted Tailscale tunnel.
To find the server's Tailscale IP from your local machine:

tailscale status

Look for the mip-server entry — use that IP in .env.test.
kube-state-metrics — Cluster Object State¶
cAdvisor tells you resource usage. kube-state-metrics tells you Kubernetes object state — whether pods are running, restart counts, OOMKills, resource requests/limits.
Install¶
kubectl apply -f https://github.com/kubernetes/kube-state-metrics/releases/latest/download/kube-state-metrics.yaml
kubectl get pods -n kube-system | grep kube-state-metrics
Add to prometheus.yml:
- job_name: 'kube-state-metrics'
  static_configs:
    - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']

Note
That cluster-internal DNS name only resolves for a Prometheus running inside the cluster. The apt-installed Prometheus in this guide runs on the host, so expose kube-state-metrics to the host first (e.g. kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080, or a NodePort service) and target localhost:8080 instead.
Key metrics:
| Metric | What it measures |
|---|---|
| `kube_pod_status_phase` | Pod phase: Running / Pending / Failed / Succeeded |
| `kube_pod_container_status_restarts_total` | Container restart count |
| `kube_pod_container_status_last_terminated_reason` | OOMKilled, Error, Completed |
| `kube_pod_container_resource_requests` | CPU/memory requests per container |
| `kube_pod_container_resource_limits` | CPU/memory limits per container |
| `kube_node_status_condition` | Node Ready / DiskPressure / MemoryPressure |
Useful PromQL:
# Pods that have restarted more than 5 times
kube_pod_container_status_restarts_total > 5
# OOMKilled containers
increase(kube_pod_container_status_restarts_total[1h]) > 0
and on(pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
# Node memory pressure
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
Recommended Alert Thresholds¶
| Metric | Alert Condition | Severity | Why |
|---|---|---|---|
| Node CPU usage | > 85% for 5 min | Warning | Game servers are CPU-bound |
| Node memory usage | > 90% | Critical | OOM killer terminates pods |
| Node disk usage | > 85% | Warning | Docker images and logs fill disk fast |
| Node network errors | > 0 sustained | Warning | Packet loss = player disconnects |
| Pod container restarts | > 5 total | Warning | Crash loop likely |
| Pod OOMKilled | Any | Critical | Pod exceeded memory limit — raise limit or investigate leak |
| Pod phase Failed | Any | Critical | Pod is not running |
| Node DiskPressure | True | Critical | K3s will start evicting pods |
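These thresholds translate naturally into Prometheus alerting rules. The following is a sketch, not a tested rule file: the file path, alert names, and exact expressions are illustrative and should be adapted, and the kube_* rules assume kube-state-metrics is being scraped.

```yaml
# /etc/prometheus/rules/mip-alerts.yml — illustrative sketch, adapt before use
groups:
  - name: mip-node
    rules:
      - alert: NodeHighCPU
        # busy % = 100 - idle %
        expr: 100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 85
        for: 5m
        labels: {severity: warning}
      - alert: NodeHighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        labels: {severity: critical}
      - alert: NodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        labels: {severity: critical}
  - name: mip-pods
    rules:
      - alert: PodRestartLoop
        expr: kube_pod_container_status_restarts_total > 5
        labels: {severity: warning}
```

Load the file by adding a `rule_files:` stanza (e.g. `rule_files: ['/etc/prometheus/rules/*.yml']`) to /etc/prometheus/prometheus.yml and restarting Prometheus; firing alerts then appear at http://localhost:9090/alerts.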
kubectl top — Quick CLI Check¶
Install metrics-server to enable kubectl top without Prometheus:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
For K3s self-signed certs, patch the deployment to skip TLS verification:
kubectl patch deployment metrics-server -n kube-system \
--type='json' \
-p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
Then:
kubectl top nodes # CPU/mem per node
kubectl top pods -A # CPU/mem per pod across all namespaces
kubectl top pods -n default --sort-by=memory # sort by highest memory consumer
Note
kubectl top uses metrics-server for live snapshots only — no history, no alerting. Use Prometheus + the MIP Dashboard for persistent metrics.
Uptime Checks¶
Use an external service to alert when your backend goes offline entirely (complements internal Prometheus):
| Service | Free tier | Setup |
|---|---|---|
| UptimeRobot | 50 monitors, 5-min intervals | HTTP check against /health endpoint |
| HetrixTools | 15 monitors | HTTP + port checks |
| Better Stack | 10 monitors | HTTP + on-call schedules |
Point the check at your NestJS backend's /health endpoint and alert via email, Slack, or Discord webhook on any non-200 response.