NTFY Integration for ZFS Scrub Alerts
This guide shows how to configure Prometheus Alertmanager to send ZFS scrub alerts to NTFY.
Architecture
Storage Nodes (storage-a, storage-b)
  └─ node_exporter (port 9100)
      └─ textfile collector
          └─ zfs_scrub.prom (metrics updated every 5 min)
                ↓
Prometheus Server
  ├─ Scrapes metrics from storage nodes
  ├─ Evaluates alert rules (prometheus-alerts.yml)
  └─ Sends alerts to Alertmanager
                ↓
Alertmanager
  └─ Routes alerts to NTFY webhook
                ↓
NTFY (ntfy.sh or self-hosted)
  └─ Sends push notifications to your devices
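As a rough illustration of the first stage in this diagram, the sketch below renders the kind of file the textfile collector reads. The metric name is the one used in the test section of this guide; the function name `render_scrub_metrics` and the HELP wording are assumptions for illustration, not the actual exporter.

```python
def render_scrub_metrics(last_scrub_by_pool: dict) -> str:
    """Render Prometheus text-format lines with the last scrub time of each pool."""
    name = "zfs_scrub_last_run_timestamp_seconds"
    lines = [
        f"# HELP {name} Unix time of the last scrub (assumed wording)",
        f"# TYPE {name} gauge",
    ]
    # One sample per pool, labelled with the pool name.
    for pool, timestamp in sorted(last_scrub_by_pool.items()):
        lines.append(f'{name}{{pool="{pool}"}} {timestamp}')
    # The text exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"
```

Writing this output to a temporary file and renaming it into place avoids node_exporter reading a half-written file.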
Prerequisites
- Prometheus server scraping your storage nodes (on port 9100)
- Alertmanager configured to receive alerts from Prometheus
- NTFY topic (e.g., https://ntfy.sh/your-san-alerts, or self-hosted)
Step 1: Configure Prometheus
Add your storage nodes as scrape targets in prometheus.yml:
scrape_configs:
  - job_name: 'storage-nodes'
    static_configs:
      - targets:
          - '10.20.20.1:9100' # storage-a management IP
          - '10.20.20.2:9100' # storage-b management IP
        labels:
          cluster: 'san-cluster'
          environment: 'production'
Step 2: Add Alert Rules
Copy docs/prometheus-alerts.yml to your Prometheus rules directory and load it:
# In prometheus.yml
rule_files:
  - "rules/prometheus-alerts.yml"
Reload Prometheus:
systemctl reload prometheus
# or, if Prometheus runs with --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload
Verify rules are loaded:
# Check Prometheus UI: http://your-prometheus:9090/alerts
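The same check can be scripted against the Prometheus HTTP API. The sketch below is one possible approach: it lists the alerting rules returned by `/api/v1/rules`. The function names are hypothetical and `base_url` is a placeholder for your Prometheus address.

```python
import json
from urllib.request import urlopen


def alert_rule_names(rules_response: dict) -> list:
    """Return the sorted names of alerting rules in a /api/v1/rules response body."""
    names = []
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules have type "recording" and are skipped.
            if rule.get("type") == "alerting":
                names.append(rule["name"])
    return sorted(names)


def fetch_alert_rule_names(base_url: str) -> list:
    """Query a Prometheus server (base_url is a placeholder) and list its alerting rules."""
    with urlopen(f"{base_url}/api/v1/rules", timeout=10) as resp:
        return alert_rule_names(json.load(resp))
```

If an expected rule such as ZFSScrubOverdue is missing from the result, the rule file was probably not picked up by `rule_files`.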
Step 3: Configure Alertmanager for NTFY
Option A: NTFY Cloud (ntfy.sh)
Create or update /etc/alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m
route:
  receiver: 'ntfy-default'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 12h
  # Route critical alerts immediately, warnings with delay
  routes:
    - match:
        severity: critical
      receiver: 'ntfy-critical'
      group_wait: 0s
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'ntfy-warning'
      group_wait: 30s
      repeat_interval: 12h
receivers:
  - name: 'ntfy-critical'
    webhook_configs:
      - url: 'https://ntfy.sh/your-san-alerts-critical'
        send_resolved: true
        http_config:
          follow_redirects: true
  - name: 'ntfy-warning'
    webhook_configs:
      - url: 'https://ntfy.sh/your-san-alerts'
        send_resolved: true
        http_config:
          follow_redirects: true
  - name: 'ntfy-default'
    webhook_configs:
      - url: 'https://ntfy.sh/your-san-alerts'
        send_resolved: true
inhibit_rules:
  # Suppress warning if critical is firing for same alertname
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
Important: Replace your-san-alerts and your-san-alerts-critical with your actual NTFY topic names.
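To see how the route block above resolves, here is a deliberately simplified model of the routing decision: the first child route whose matchers all equal the alert's labels wins, otherwise the top-level receiver applies. It ignores nested routes, `continue`, and regex matchers, so treat it as a sketch of the idea rather than Alertmanager's implementation.

```python
DEFAULT_RECEIVER = "ntfy-default"

# (label matchers, receiver) pairs, in the same order as the child routes above.
CHILD_ROUTES = [
    ({"severity": "critical"}, "ntfy-critical"),
    ({"severity": "warning"}, "ntfy-warning"),
]


def pick_receiver(labels: dict) -> str:
    """Return the receiver of the first child route matching the alert's labels."""
    for matchers, receiver in CHILD_ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    # No child route matched: fall back to the top-level receiver.
    return DEFAULT_RECEIVER
```

With this model an alert labelled `severity: info` falls through to `ntfy-default`, which is why info-level alerts need no route of their own.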
Option B: Self-Hosted NTFY
If you’re running your own NTFY server with authentication:
receivers:
  - name: 'ntfy-critical'
    webhook_configs:
      - url: 'https://ntfy.example.com/san-alerts-critical'
        send_resolved: true
        http_config:
          follow_redirects: true
          authorization:
            type: Bearer
            credentials: 'your-ntfy-access-token'
Step 4: Format Alert Messages for NTFY
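Alertmanager's webhook receiver posts alerts as a JSON document. NTFY, by contrast, treats the request body as the message text and takes the title, priority, and tags from HTTP headers, so the raw payload arrives as unformatted JSON. You have two options: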
Option A: Use ntfy-alertmanager Bridge (Recommended)
Install ntfy-alertmanager as a bridge:
# On your Alertmanager server or separate host
docker run -d \
--name ntfy-alertmanager \
-p 8080:8080 \
-e NTFY_TOPIC=https://ntfy.sh/your-san-alerts \
-e NTFY_PRIORITY=default \
-e NTFY_TAGS=warning \
xenrox/ntfy-alertmanager:latest
Then point Alertmanager to the bridge:
receivers:
  - name: 'ntfy-critical'
    webhook_configs:
      - url: 'http://localhost:8080/alerts' # ntfy-alertmanager bridge
        send_resolved: true
The bridge formats messages nicely and adds proper NTFY headers.
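To make the bridge's role concrete, the sketch below shows the general kind of translation involved: an Alertmanager webhook payload goes in, and ntfy headers plus a plain-text body come out. This is not ntfy-alertmanager's actual code. The function name, the title format, and the use of a `summary` annotation are assumptions; the severity-to-priority mapping mirrors the one in the Advanced section of this guide.

```python
PRIORITY_BY_SEVERITY = {"critical": "urgent", "warning": "default", "info": "low"}


def to_ntfy_message(payload: dict) -> tuple:
    """Turn an Alertmanager webhook payload into (headers, body) for an ntfy publish."""
    labels = payload.get("commonLabels", {})
    status = payload.get("status", "firing")
    alertname = labels.get("alertname", "alert")
    severity = labels.get("severity", "info")
    headers = {
        "Title": f"[{status.upper()}] {alertname}",
        "Priority": PRIORITY_BY_SEVERITY.get(severity, "default"),
        "Tags": severity,
    }
    # One body line per alert in the group: "<instance>: <summary>".
    lines = []
    for alert in payload.get("alerts", []):
        instance = alert.get("labels", {}).get("instance", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"{instance}: {summary}" if summary else instance)
    return headers, "\n".join(lines)
```

The headers and body would then be sent as an HTTP POST to the topic URL.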
Option B: Direct NTFY Webhook (Simple but Limited Formatting)
NTFY accepts webhook POSTs directly, but formatting is basic:
receivers:
  - name: 'ntfy-critical'
    webhook_configs:
      - url: 'https://ntfy.sh/your-san-alerts-critical' # include the topic in the URL
        send_resolved: true
        http_config:
          follow_redirects: true
        # Send custom headers for NTFY
        # Note: Alertmanager doesn't support custom headers in webhook_configs easily
        # Consider using ntfy-alertmanager bridge instead
Limitation: Direct webhook has poor message formatting. Use the bridge for better UX.
Step 5: Test Alerts
Test 1: Manual Alert Trigger
Trigger a test alert by stopping the scrub exporter and deleting its metrics file, so that the metrics disappear from Prometheus:
# On storage-a
ssh storage-a
sudo systemctl stop zfs-scrub-exporter.timer
sudo rm /var/lib/prometheus/node-exporter/zfs_scrub.prom
# Wait 5+ minutes for Prometheus to detect missing metrics
Test 2: Simulate Overdue Scrub
Manually create metrics showing an old scrub:
echo 'zfs_scrub_last_run_timestamp_seconds{pool="san-pool"} 1609459200' | \
sudo tee /var/lib/prometheus/node-exporter/zfs_scrub.prom
# This timestamp is from Jan 1, 2021 - will trigger ZFSScrubOverdue
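The arithmetic behind this test is simple: the alert compares the age of the last scrub with a threshold. The sketch below assumes the 35-day threshold from the ZFSScrubOverdue description in this guide; the actual expression in prometheus-alerts.yml is not shown here and may differ.

```python
import time
from typing import Optional

OVERDUE_AFTER_DAYS = 35  # assumed from the ZFSScrubOverdue description
SECONDS_PER_DAY = 86400


def scrub_age_days(last_run_timestamp: float, now: Optional[float] = None) -> float:
    """Return how many days have passed since the last scrub."""
    if now is None:
        now = time.time()
    return (now - last_run_timestamp) / SECONDS_PER_DAY


def is_overdue(last_run_timestamp: float, now: Optional[float] = None) -> bool:
    """Return True when the last scrub is older than the assumed threshold."""
    return scrub_age_days(last_run_timestamp, now) > OVERDUE_AFTER_DAYS
```

Any timestamp more than 35 days in the past works for the test; the 2021 value above is just comfortably old.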
Test 3: Send Test Alert from Alertmanager
# On Alertmanager server
amtool alert add alertname=ZFSScrubTest severity=warning
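If amtool is not installed, a test alert can also be posted to Alertmanager's v2 HTTP API. The sketch below builds the request body and sends it; the function names and the `summary` annotation are hypothetical, and `alertmanager_url` is a placeholder such as http://localhost:9093.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request, urlopen


def build_test_alert(alertname: str, severity: str, minutes: int = 5) -> list:
    """Build the JSON body for POST /api/v2/alerts: a list with one alert."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"summary": "Manual test alert"},
        "startsAt": now.isoformat(),
        # The alert resolves itself once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alerts(alertmanager_url: str, alerts: list) -> int:
    """Send the alerts to Alertmanager and return the HTTP status code."""
    req = Request(
        f"{alertmanager_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=10) as resp:
        return resp.status
```

For example, `post_alerts("http://localhost:9093", build_test_alert("ZFSScrubTest", "warning"))` should exercise the warning route.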
Step 6: Subscribe to NTFY Notifications
Mobile Apps
- iOS: NTFY iOS App
- Android: NTFY Android App
Web Browser
Visit: https://ntfy.sh/your-san-alerts
Command Line
# Subscribe to alerts in terminal
ntfy sub your-san-alerts
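For scripted subscribers, ntfy also offers a streaming JSON endpoint at the topic URL with `/json` appended, which emits one JSON object per line. The sketch below filters such lines down to actual notifications; the function name and the returned field selection are assumptions for illustration.

```python
import json
from typing import Optional


def parse_stream_line(line: str) -> Optional[dict]:
    """Return title, message, and priority for message events, or None otherwise."""
    line = line.strip()
    if not line:
        return None
    event = json.loads(line)
    # "open" and "keepalive" events carry no notification.
    if event.get("event") != "message":
        return None
    return {
        "title": event.get("title", ""),
        "message": event.get("message", ""),
        "priority": event.get("priority", 3),  # 3 is ntfy's default priority
    }
```

Feeding each line of the HTTP response through this function yields only the alerts themselves.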
Alert Types You’ll Receive
Based on the Prometheus alert rules:
ZFS Storage Alerts:
- ZFSScrubOverdue (warning) - No scrub in 35+ days
- ZFSScrubNeverRun (warning) - Pool never scrubbed
- ZFSScrubErrors (critical) - Data corruption detected
- ZFSPoolDegraded (critical) - Pool health degraded
- ZFSScrubExporterFailed (warning) - Metrics collection issue
- ZFSScrubStalled (warning) - Scrub running >12 hours
- ZFSPoolSplitBrain (critical) - Pool imported on multiple nodes
Cluster Health Alerts:
- ClusterQuorumLost (critical) - Cluster lost quorum
- ClusterNodeOffline (critical) - Node offline for >2 minutes
- ClusterResourceFailed (critical) - Resource has failures
- StonithDisabled (critical) - STONITH fencing disabled
- ClusterSplitBrainRisk (critical) - Multiple nodes reporting quorum
- ResourceMigrationThresholdReached (critical) - Resource about to migrate
- ClusterResourceStopped (warning) - Managed resource stopped
- CorosyncRingFaulty (warning) - Corosync ring errors
- CorosyncMembershipChanges (warning) - Frequent membership changes
- HAClusterExporterDown (warning) - Cluster exporter not responding
- ResourceFailoverDetected (info) - Normal failover event
Alert Routing (numbers count the alerts above in order, 1-18):
- Critical alerts (3, 4, 7-13) → ntfy-critical receiver (urgent notifications)
- Warning alerts (1, 2, 5, 6, 14-17) → ntfy-warning receiver (standard notifications)
- Info alerts (18) → ntfy-default receiver (awareness only)
Advanced: Priority and Tags
If using ntfy-alertmanager bridge, configure per-severity routing:
# In ntfy-alertmanager config
ntfy:
  topic: "san-alerts"
  priority_map:
    critical: "urgent"
    warning: "default"
    info: "low"
  tags_map:
    critical: "🚨,storage,critical"
    warning: "⚠️,storage,warning"
Troubleshooting
Metrics Not Appearing in Prometheus
# Check if exporter is running
systemctl status zfs-scrub-exporter.timer
systemctl list-timers zfs-scrub-exporter.timer
# Check metrics file exists
cat /var/lib/prometheus/node-exporter/zfs_scrub.prom
# Check node_exporter is serving metrics
curl http://10.20.20.1:9100/metrics | grep zfs_scrub
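A file that exists but is no longer being refreshed is easy to miss with `cat` alone. The sketch below checks the file's modification time instead. The 600-second default is an assumption (two missed 5-minute updates), and the function name is hypothetical.

```python
import os
import time
from typing import Optional


def is_stale(path: str, max_age_seconds: float = 600, now: Optional[float] = None) -> bool:
    """Return True if the metrics file is missing or older than max_age_seconds."""
    if not os.path.exists(path):
        return True
    if now is None:
        now = time.time()
    # Compare the file's last modification time against the allowed age.
    return now - os.path.getmtime(path) > max_age_seconds
```

Calling `is_stale("/var/lib/prometheus/node-exporter/zfs_scrub.prom")` on a storage node gives a quick yes/no answer about the exporter timer.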
Alerts Not Firing
# Check Prometheus can reach storage nodes
curl http://your-prometheus:9090/api/v1/targets
# Check alert rules are loaded
curl http://your-prometheus:9090/api/v1/rules
# Check if alert is pending/firing
curl http://your-prometheus:9090/api/v1/alerts
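The JSON these endpoints return is verbose, so it can help to reduce it to the parts that matter. The sketch below takes the parsed responses of `/api/v1/targets` and `/api/v1/alerts` and summarizes them; the function names are hypothetical.

```python
def unhealthy_targets(targets_response: dict) -> list:
    """Return (scrapeUrl, lastError) for active targets whose health is not 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in active
        if target.get("health") != "up"
    ]


def alerts_by_state(alerts_response: dict) -> dict:
    """Group alert names by state ('pending' or 'firing')."""
    grouped = {}
    for alert in alerts_response.get("data", {}).get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "")
        grouped.setdefault(alert.get("state", "unknown"), []).append(name)
    return grouped
```

An alert that stays in `pending` is still waiting out the `for:` duration of its rule.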
NTFY Not Receiving Messages
# Test NTFY directly
curl -d "Test message" https://ntfy.sh/your-san-alerts
# Check Alertmanager logs
journalctl -u alertmanager -f
# Verify webhook is configured
amtool config show
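The direct curl test can be extended to include a title, a priority, and, for a self-hosted server, a bearer token. The sketch below only builds the request so that it can be inspected before sending; the function name is hypothetical and the token is a placeholder.

```python
from typing import Optional
from urllib.request import Request


def build_ntfy_request(topic_url: str, message: str, title: str = "",
                       priority: str = "default", token: Optional[str] = None) -> Request:
    """Build a POST request that publishes a message to an ntfy topic."""
    headers = {"Priority": priority}
    if title:
        headers["Title"] = title
    if token:
        # Only needed when the ntfy server requires authentication.
        headers["Authorization"] = f"Bearer {token}"
    return Request(topic_url, data=message.encode("utf-8"), headers=headers, method="POST")
```

Passing the result to `urllib.request.urlopen` sends it. If that works but Alertmanager's notifications do not arrive, the problem is on the Alertmanager side.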
Example Alert Message
When ZFSScrubOverdue fires, you’ll receive:
⚠️ ZFS Scrub Overdue
Pool: san-pool
Node: storage-a (10.20.20.1)
Last scrub: 38 days ago
Monthly scrubs are recommended for data integrity.
Run: systemctl start zfs-scrub@san-pool.service
Security Considerations
- NTFY Topic Names: Use long, random topic names for public ntfy.sh
  - Bad: san-alerts (easily guessable)
  - Good: san-alerts-8f3k2j9x (random suffix)
- Self-Hosted NTFY: Use authentication and HTTPS
- Alertmanager: Restrict access to management VLAN only
- Sensitive Data: Alert messages contain pool names and hostnames - ensure topics are private
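A hard-to-guess topic name is easiest to get from a cryptographic random source rather than inventing one by hand. A minimal sketch, assuming a `san-alerts` prefix and a hypothetical function name:

```python
import secrets


def random_topic(prefix: str = "san-alerts", suffix_bytes: int = 8) -> str:
    """Return a topic name with a random hexadecimal suffix."""
    # token_hex(n) yields 2*n hex characters from the OS random source.
    return f"{prefix}-{secrets.token_hex(suffix_bytes)}"
```

Generate the name once and reuse it in both the Alertmanager config and the subscribing devices.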
Dead-Man’s Switch with Uptime Kuma
The Watchdog alert always fires (it uses expr: vector(1)), so a steady heartbeat confirms that
Prometheus is evaluating rules and Alertmanager is delivering webhooks. If the Watchdog heartbeat stops
arriving, something in that pipeline is broken, even if no real alerts are firing. Because the heartbeat
is routed to Uptime Kuma, it does not by itself prove that ntfy delivery works.
Setup
- Deploy Uptime Kuma (self-hosted recommended):
  docker run -d --name uptime-kuma -p 3001:3001 -v uptime-kuma:/app/data louislam/uptime-kuma:1
- Create a Push monitor in Uptime Kuma:
  - Type: Push
  - Heartbeat Interval: 5 minutes
  - Note the push URL shown: https://uptime-kuma.example.com/api/push/<token>
- Add Watchdog receiver to Alertmanager:
  ```yaml
  receivers:
    - name: 'uptime-kuma-watchdog'
      webhook_configs:
        - url: 'https://uptime-kuma.example.com/api/push/<token>?status=up&msg=OK'
          send_resolved: false
  ```
- Add Watchdog route (before all other routes so it matches first):
  ```yaml
  routes:
    - match:
        alertname: Watchdog
      receiver: uptime-kuma-watchdog
      group_wait: 0s
      repeat_interval: 5m
    # … existing routes below
  ```
- Configure Uptime Kuma notifications: set Uptime Kuma to notify you (via ntfy, email, etc.) when the push heartbeat stops arriving.
Uptime Kuma will alert you via its own notification system if the Prometheus/Alertmanager pipeline goes silent for more than one missed interval.
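The dead-man's switch reduces to two small pieces: a push URL of the shape shown in the setup steps, and a check for whether the last heartbeat is older than the expected interval. The sketch below models both. It is a simplified illustration rather than Uptime Kuma's own logic, and both function names are hypothetical.

```python
from urllib.parse import urlencode


def push_url(base_url: str, token: str, status: str = "up", msg: str = "OK") -> str:
    """Build a push URL with status and msg query parameters."""
    query = urlencode({"status": status, "msg": msg})
    return f"{base_url}/api/push/{token}?{query}"


def heartbeat_overdue(last_heartbeat: float, now: float, interval_seconds: int = 300) -> bool:
    """Return True once more than one full interval has passed without a heartbeat."""
    return now - last_heartbeat > interval_seconds
```

With a 5-minute repeat_interval and a 5-minute heartbeat interval there is little slack, so a slightly longer monitor interval may reduce false alarms.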
Additional Alertmanager Inhibit Rules
These suppress downstream noise when a root-cause alert is firing. Add to alertmanager.yml:
inhibit_rules:
  # Existing: suppress warnings when critical fires for same resource
  - source_match: { severity: 'critical' }
    target_match: { severity: 'warning' }
    equal: ['alertname', 'instance']
  # When a cluster node is offline, suppress resource/service alerts from that node
  - source_match: { alertname: 'ClusterNodeOffline' }
    target_match_re:
      alertname: 'ClusterResource.*|ZFSPool.*|NFSServer.*|SMBServer.*'
    equal: ['instance']
  # When quorum is lost, suppress resource transition and failover noise
  - source_match: { alertname: 'ClusterQuorumLost' }
    target_match_re:
      alertname: 'ClusterResource.*|ResourceMigration.*|Pacemaker.*|STONITH.*'
    equal: []
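The effect of these rules can be reasoned about with a small model: a target alert is muted when some other firing alert matches the source side of a rule and both alerts agree on every label in `equal`. The sketch below is a simplified approximation of that logic (regexes are fully anchored, missing labels count as empty); it is not Alertmanager's implementation, and the function names are hypothetical.

```python
import re


def matches(labels: dict, exact=None, regex=None) -> bool:
    """Check labels against exact-value matchers and fully anchored regex matchers."""
    exact = exact or {}
    regex = regex or {}
    return (
        all(labels.get(key, "") == value for key, value in exact.items())
        and all(re.fullmatch(pattern, labels.get(key, "")) for key, pattern in regex.items())
    )


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """Return True if any firing alert inhibits the target under the given rule."""
    if not matches(target, rule.get("target_match"), rule.get("target_match_re")):
        return False
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself in this model
        if not matches(source, rule.get("source_match"), rule.get("source_match_re")):
            continue
        # Both alerts must carry the same value for every label listed in "equal".
        if all(source.get(label, "") == target.get(label, "") for label in rule.get("equal", [])):
            return True
    return False
```

Note that with `equal: []` the label comparison is vacuously true, so a single ClusterQuorumLost alert mutes matching targets on every instance.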