Curated collection of Grafana dashboards for monitoring the Goldentooth Raspberry Pi cluster infrastructure.
This repository contains production-ready Grafana dashboards designed specifically for monitoring a Raspberry Pi cluster running Kubernetes, the HashiCorp stack (Consul, Nomad, Vault), and various observability tools.
Comprehensive overview of HashiCorp services:
- Consul: Cluster membership, service health, KV store metrics
- Nomad: Job status, resource allocation, client health
- Vault: Secret engine metrics, authentication status, token usage
High-level infrastructure monitoring:
- Node Health: CPU, memory, disk usage across all Pi nodes
- Network Performance: Latency, bandwidth, connectivity status
- Service Availability: Uptime metrics for critical cluster services
- Storage: NFS exports, ZFS pool status, disk I/O
Detailed system metrics from node_exporter:
- Hardware Monitoring: Temperature, voltage, frequency scaling
- Resource Utilization: Per-node CPU, memory, disk, network
- Process Monitoring: System load, context switches, interrupts
- Filesystem Details: Mount points, inode usage, disk space
SLURM workload manager dashboard:
- Job Queue: Pending, running, completed job metrics
- Partition Status: Node allocation, resource availability
- User Activity: Job submission patterns, resource consumption
- Cluster Efficiency: Utilization rates, queue times
These dashboards are automatically provisioned through the Goldentooth Ansible role:
goldentooth setup_grafana
They integrate with:
- Prometheus: Primary metrics collection
- Node Exporter: System-level metrics
- Blackbox Exporter: Service availability monitoring
- Custom Exporters: SLURM, HashiCorp services
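
As a rough illustration of how a panel backed by Prometheus and Node Exporter gets its data, the sketch below builds a URL for Prometheus' instant-query HTTP API using a common node_exporter CPU idiom. The Prometheus address is an assumption (a placeholder, not the cluster's real endpoint), and the query is a generic example rather than one taken from these dashboards.

```python
from urllib.parse import urlencode

# Assumed Prometheus address; replace with the cluster's actual endpoint.
PROMETHEUS_URL = "http://localhost:9090"

# Common node_exporter idiom: busy CPU % = 100 - idle %, averaged per instance.
CPU_BUSY_QUERY = (
    "100 - (avg by (instance) "
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus' instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Print the URL; fetching it is left to whatever HTTP client you prefer.
print(instant_query_url(PROMETHEUS_URL, CPU_BUSY_QUERY))
```

The same expression can be pasted into a Grafana panel's query editor; the URL form is mostly useful for checking a query outside Grafana.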
Ansible deploys the dashboard JSON files to /var/lib/grafana/dashboards/ and registers them with Grafana through provisioning files. Updates are applied through the cluster management pipeline.
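
For orientation, a file-based dashboard provider definition of the kind Grafana reads from its provisioning directory might look like the output of the sketch below. The provider name `goldentooth` is hypothetical, and the actual files generated by the Ansible role may differ; only the dashboard path comes from this README.

```python
def render_provider_yaml(path: str, name: str = "goldentooth") -> str:
    """Render a minimal Grafana file-based dashboard provider as YAML text.

    The default provider name is a hypothetical placeholder.
    """
    lines = [
        "apiVersion: 1",
        "providers:",
        f"  - name: '{name}'",
        "    type: file",
        "    options:",
        f"      path: {path}",
    ]
    return "\n".join(lines) + "\n"


# Render a provider pointing at the dashboard directory named in this README.
print(render_provider_yaml("/var/lib/grafana/dashboards/"))
```

Grafana's provider format supports further optional keys (for example a target folder or an update interval); they are omitted here to keep the sketch minimal.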
Each dashboard includes:
- Variable Templating: Node selection, time ranges, service filters
- Alert Annotations: Integration with Prometheus Alertmanager
- Panel Descriptions: Detailed explanations of metrics and thresholds
- Responsive Layout: Optimized for different screen sizes
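
Template variables live under the `templating.list` key of a dashboard's JSON model. The sketch below lists them for a trimmed, hypothetical dashboard; the variable names and queries are illustrative assumptions, not copied from the dashboards in this repository.

```python
import json


def template_variables(dashboard: dict) -> list[tuple[str, str]]:
    """Return (name, type) pairs for each entry in a dashboard's templating list."""
    variables = dashboard.get("templating", {}).get("list", [])
    return [(v.get("name", ""), v.get("type", "")) for v in variables]


# Hypothetical, heavily trimmed dashboard JSON used only for illustration.
EXAMPLE = json.loads("""
{
  "title": "Example",
  "templating": {
    "list": [
      {"name": "node", "type": "query",
       "query": "label_values(node_uname_info, instance)"},
      {"name": "interval", "type": "interval", "query": "1m,5m,15m"}
    ]
  }
}
""")

print(template_variables(EXAMPLE))
```

Running the same function over a real dashboard file (loaded with `json.load`) gives a quick inventory of the filters a dashboard exposes.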
For cluster-specific customizations, modify the dashboard JSON files and redeploy through Ansible.
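
Before redeploying edited files, a quick sanity check can catch broken JSON early. The following is a minimal sketch, not part of the Goldentooth tooling: the set of required keys is an assumption chosen for illustration, and Grafana itself may accept dashboards that this check flags.

```python
import json
from pathlib import Path

# Keys this sketch treats as required; an assumption, not a Grafana rule.
REQUIRED_KEYS = ("title", "uid", "panels")


def check_dashboard(text: str) -> list[str]:
    """Return a list of problems found in one dashboard JSON document."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    return [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]


def check_directory(directory: Path) -> dict[str, list[str]]:
    """Check every *.json file in a directory; map file name to its problems."""
    results = {}
    for path in sorted(directory.glob("*.json")):
        problems = check_dashboard(path.read_text(encoding="utf-8"))
        if problems:
            results[path.name] = problems
    return results


# A dashboard missing its uid is reported as a problem.
print(check_dashboard('{"title": "Node Overview", "panels": []}'))
```

Calling `check_directory(Path("."))` from the directory that holds the dashboard files returns an empty dict when every file passes.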