Monitoring¶
Lightweight observability for the MIP production stack. Covers host-level metrics via Node Exporter, per-pod K3s metrics via Prometheus + cAdvisor, the built-in MIP Server Dashboard, and external uptime checks.
What Each Tool Covers¶
| Tool | Scope | What it sees |
|---|---|---|
| Node Exporter | Host (VM/bare-metal) | CPU, memory, disk, network, load avg per node |
| cAdvisor (built into kubelet) | Pod / Container | CPU and memory usage per pod and container |
| kube-state-metrics | Kubernetes objects | Pod restarts, OOMKills, resource requests/limits, pod phase |
| Prometheus | Aggregation | Scrapes all of the above; stores time series; evaluates alerts |
| MIP Dashboard | Frontend | Built-in Metrics tab — node cards + pod table with inline bars |
Install Prometheus & Node Exporter¶
sudo apt install -y prometheus prometheus-node-exporter
sudo systemctl enable --now prometheus prometheus-node-exporter
Verify both are running:
- Prometheus UI: http://localhost:9090
- Node Exporter metrics: http://localhost:9100/metrics
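Both endpoints serve the Prometheus text exposition format. If you ever want to consume that output without Prometheus, here is a deliberately naive parser sketch (it ignores escaping, timestamps, and `# HELP`/`# TYPE` comment lines, which the real format allows):

```python
# Naive sketch: split one Prometheus text-format sample line, e.g. from
# http://localhost:9100/metrics, into its series name (with labels) and value.
# Real parsing also has to handle escaping, timestamps, and comment lines.
def parse_sample(line: str) -> tuple[str, float]:
    series, _, value = line.rpartition(" ")
    return series, float(value)

print(parse_sample("node_load1 0.42"))
print(parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 81432.9'))
```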
Enable K3s cAdvisor (Read-Only Port)¶
K3s embeds cAdvisor inside the kubelet. The simplest way to expose it to Prometheus without token auth is to enable the kubelet read-only port:
sudo mkdir -p /etc/rancher/k3s
sudo tee -a /etc/rancher/k3s/config.yaml <<'EOF'
kubelet-arg:
  - "read-only-port=10255"
EOF
sudo systemctl restart k3s
Wait ~30 seconds for K3s to restart, then verify cAdvisor is accessible:

curl -s http://localhost:10255/metrics/cadvisor | head

If you get metric lines with pod names, it's working.
Warning
Port 10255 is read-only and unauthenticated — it should only be accessible on localhost or via Tailscale. Do not expose it publicly.
Prometheus Configuration¶
Replace the default config with this full working configuration:
sudo tee /etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Host-level metrics — CPU, memory, disk, network per node
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  # Per-pod CPU and memory via kubelet cAdvisor (read-only port)
  - job_name: 'k3s-cadvisor'
    static_configs:
      - targets: ['localhost:10255']
    metrics_path: /metrics/cadvisor

  # Kubelet metrics (pod scheduling, volume stats)
  - job_name: 'k3s-kubelet'
    static_configs:
      - targets: ['localhost:10255']
    metrics_path: /metrics
EOF
sudo systemctl restart prometheus
Verify it's scraping by opening http://localhost:9090/targets — all four jobs should show UP.
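The same check can be scripted against Prometheus's HTTP API: `GET /api/v1/targets` returns JSON with an `activeTargets` list carrying a `health` field per target. A sketch, using an abbreviated, invented response:

```python
# Sketch: flag Prometheus scrape targets that are not UP, given the JSON
# body returned by GET http://localhost:9090/api/v1/targets.
def down_targets(resp: dict) -> list[str]:
    """Return 'job (scrapeUrl)' for every active target whose health != 'up'."""
    return [
        f"{t['labels']['job']} ({t['scrapeUrl']})"
        for t in resp["data"]["activeTargets"]
        if t["health"] != "up"
    ]

# Abbreviated, invented response for illustration.
sample = {"data": {"activeTargets": [
    {"labels": {"job": "node"},
     "scrapeUrl": "http://localhost:9100/metrics", "health": "up"},
    {"labels": {"job": "k3s-cadvisor"},
     "scrapeUrl": "http://localhost:10255/metrics/cadvisor", "health": "down"},
]}}
print(down_targets(sample))  # ['k3s-cadvisor (http://localhost:10255/metrics/cadvisor)']
```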
Key Metrics¶
Node Exporter — host-level¶
What it collects:
| Metric group | Examples |
|---|---|
| CPU | node_cpu_seconds_total — per-core usage by mode (idle, user, system, iowait) |
| Memory | node_memory_MemAvailable_bytes, node_memory_MemTotal_bytes |
| Disk I/O | node_disk_read_bytes_total, node_disk_write_bytes_total |
| Disk space | node_filesystem_avail_bytes, node_filesystem_size_bytes |
| Network | node_network_receive_bytes_total, node_network_transmit_bytes_total, node_network_receive_errs_total |
| Load | node_load1, node_load5, node_load15 |
Note
Node Exporter is host-level only. It cannot see inside Kubernetes pods or containers.
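Since node_cpu_seconds_total is a cumulative counter, utilization falls out of rate math over two scrapes, which is exactly what PromQL's irate() computes. A sketch with invented sample values:

```python
# Sketch: how rate math turns node_cpu_seconds_total counters into a
# CPU-utilization percentage. All sample values below are invented.
def cpu_utilization_pct(idle_t0: float, idle_t1: float,
                        interval_s: float, n_cores: int) -> float:
    """Utilization % over a window, from cumulative idle-mode CPU seconds."""
    idle_rate = (idle_t1 - idle_t0) / interval_s  # idle core-seconds per second
    return 100.0 * (1.0 - idle_rate / n_cores)

# Two scrapes 15 s apart on a 4-core node: 45 idle core-seconds elapsed.
print(cpu_utilization_pct(1000.0, 1045.0, 15.0, 4))  # 25.0 -> node is ~25% busy
```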
cAdvisor — pod/container-level¶
| Metric | What it measures |
|---|---|
| `container_cpu_usage_seconds_total` | Cumulative CPU time per container — use `irate(...[5m])` for current rate |
| `container_memory_working_set_bytes` | Active memory per container (excludes file cache — use this, not RSS) |
| `container_memory_rss` | RSS memory per container |
| `container_fs_reads_bytes_total` | Container disk reads |
| `container_fs_writes_bytes_total` | Container disk writes |
| `container_network_receive_bytes_total` | Per-pod network ingress |
| `container_network_transmit_bytes_total` | Per-pod network egress |
Filter to MIP game server pods in PromQL:
# CPU usage for all MIP game server pods
irate(container_cpu_usage_seconds_total{pod=~"mip-server-fleet.*",container!="",container!="POD"}[5m])
# Memory working set for MIP game server pods
container_memory_working_set_bytes{pod=~"mip-server-fleet.*",container!="",container!="POD"}
# Sum CPU per pod (across all containers in the pod)
sum by(pod)(irate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m]))
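For intuition, sum by(pod) groups series on the pod label and adds their values. A toy sketch with invented per-container CPU rates:

```python
# Sketch of what PromQL's `sum by (pod)` does: aggregate per-container
# CPU rates (cores) into per-pod totals. Series and values are invented.
from collections import defaultdict

series = [
    ({"pod": "mip-server-fleet-abc", "container": "game"},    0.25),
    ({"pod": "mip-server-fleet-abc", "container": "sidecar"}, 0.125),
    ({"pod": "mip-server-fleet-def", "container": "game"},    0.5),
]

per_pod: dict[str, float] = defaultdict(float)
for labels, cpu_cores in series:
    per_pod[labels["pod"]] += cpu_cores

print(dict(per_pod))  # {'mip-server-fleet-abc': 0.375, 'mip-server-fleet-def': 0.5}
```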
MIP Server Dashboard¶
The dashboard/ app has a built-in Metrics tab that connects to Prometheus via the Vite dev proxy (no CORS issues) and displays:
- Node cards — CPU %, memory used/total, disk used/total with animated progress bars
- Game Server Pods — your mip-server-fleet-* pods at the top, CPU and memory with inline relative bars
- System Pods — Agones, K3s, and other system pods below
- Auto-refreshes every 30 seconds
The Dashboard tab → Game Servers table also shows CPU and RAM columns per pod pulled directly from Prometheus.
Setup¶
Configure dashboard/.env.test with your server's Tailscale IP, then start the Vite dev server against the remote backend.

The Vite dev server proxies:
| Path | Destination |
|---|---|
| `/api/*` | Backend (:3000) |
| `/prometheus/*` | Prometheus (:9090) — strips the /prometheus prefix |
Prometheus itself does not need CORS enabled — all requests go through the Vite proxy.
Accessing via Tailscale¶
Prometheus (:9090) and Node Exporter (:9100) do not need to be publicly exposed. With Tailscale running on both the server and your local machine, traffic goes through the encrypted Tailscale tunnel.
To find the server's Tailscale IP from your local machine:

tailscale status

Look for the mip-server entry — use that IP in .env.test.
kube-state-metrics — Cluster Object State¶
cAdvisor tells you resource usage. kube-state-metrics tells you Kubernetes object state — whether pods are running, restart counts, OOMKills, resource requests/limits.
Install¶
kubectl apply -f https://github.com/kubernetes/kube-state-metrics/releases/latest/download/kube-state-metrics.yaml
kubectl get pods -n kube-system | grep kube-state-metrics
Add to prometheus.yml:
- job_name: 'kube-state-metrics'
  static_configs:
    - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']

Note
That cluster-internal DNS name only resolves for a Prometheus running inside the cluster. The apt-installed Prometheus in this guide runs on the host, so expose kube-state-metrics to the host first (e.g. kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080, or a NodePort service) and target localhost:8080 instead.
Key metrics:
| Metric | What it measures |
|---|---|
| `kube_pod_status_phase` | Pod phase: Running / Pending / Failed / Succeeded |
| `kube_pod_container_status_restarts_total` | Container restart count |
| `kube_pod_container_status_last_terminated_reason` | OOMKilled, Error, Completed |
| `kube_pod_container_resource_requests` | CPU/memory requests per container |
| `kube_pod_container_resource_limits` | CPU/memory limits per container |
| `kube_node_status_condition` | Node Ready / DiskPressure / MemoryPressure |
Useful PromQL:
# Pods that have restarted more than 5 times
kube_pod_container_status_restarts_total > 5
# OOMKilled containers
increase(kube_pod_container_status_restarts_total[1h]) > 0
and on(pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
# Node memory pressure
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
Recommended Alert Thresholds¶
| Metric | Alert Condition | Severity | Why |
|---|---|---|---|
| Node CPU usage | > 85% for 5 min | Warning | Game servers are CPU-bound |
| Node memory usage | > 90% | Critical | OOM killer terminates pods |
| Node disk usage | > 85% | Warning | Docker images and logs fill disk fast |
| Node network errors | > 0 sustained | Warning | Packet loss = player disconnects |
| Pod container restarts | > 5 total | Warning | Crash loop likely |
| Pod OOMKilled | Any | Critical | Pod exceeded memory limit — raise limit or investigate leak |
| Pod phase Failed | Any | Critical | Pod is not running |
| Node DiskPressure | True | Critical | K3s will start evicting pods |
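These thresholds translate naturally into Prometheus alerting rules. The following is a sketch, not a tested rule file: the file path, alert names, and exact expressions are illustrative and should be adapted, and the kube_* rules assume kube-state-metrics is being scraped.

```yaml
# /etc/prometheus/rules/mip-alerts.yml — illustrative sketch, adapt before use
groups:
  - name: mip-node
    rules:
      - alert: NodeHighCPU
        # busy % = 100 - idle %
        expr: 100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 85
        for: 5m
        labels: {severity: warning}
      - alert: NodeHighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        labels: {severity: critical}
      - alert: NodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        labels: {severity: critical}
  - name: mip-pods
    rules:
      - alert: PodRestartLoop
        expr: kube_pod_container_status_restarts_total > 5
        labels: {severity: warning}
```

Load the file by adding a `rule_files:` stanza (e.g. `rule_files: ['/etc/prometheus/rules/*.yml']`) to /etc/prometheus/prometheus.yml and restarting Prometheus; firing alerts then appear at http://localhost:9090/alerts.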
kubectl top — Quick CLI Check¶
Install metrics-server to enable kubectl top without Prometheus:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
For K3s self-signed certs, patch the deployment to skip TLS verification:
kubectl patch deployment metrics-server -n kube-system \
--type='json' \
-p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
Then:
kubectl top nodes # CPU/mem per node
kubectl top pods -A # CPU/mem per pod across all namespaces
kubectl top pods -n default --sort-by=memory # sort by highest memory consumer
Note
kubectl top uses metrics-server for live snapshots only — no history, no alerting. Use Prometheus + the MIP Dashboard for persistent metrics.
Uptime Checks¶
Use an external service to alert when your backend goes offline entirely (complements internal Prometheus):
| Service | Free tier | Setup |
|---|---|---|
| UptimeRobot | 50 monitors, 5-min intervals | HTTP check against /health endpoint |
| HetrixTools | 15 monitors | HTTP + port checks |
| Better Stack | 10 monitors | HTTP + on-call schedules |
Point the check at your NestJS backend's /health endpoint and alert via email, Slack, or Discord webhook on any non-200 response.