# Metrics (Prometheus)

CSM exposes a `/metrics` endpoint on its HTTPS web UI port
(default 9443). The endpoint serves the Prometheus text exposition
format (`Content-Type: text/plain; version=0.0.4`) and is safe to
scrape every 15 seconds.
This is ROADMAP item 4. The initial release covers the metrics
listed under “Available metrics” below. More call sites are
instrumented in ongoing releases; track progress in
`CHANGELOG.md` under `## [Unreleased]`.
## Enabling

Metrics are on whenever `webui.enabled: true` is set in `csm.yaml`.
The endpoint has its own auth knob:

```yaml
webui:
  enabled: true
  auth_token: "<UI login token>"
  metrics_token: "<long random string for Prometheus scraper>"
```
`metrics_token` is optional. When set, a Bearer header containing
this exact value unlocks `/metrics`. The UI `auth_token` or a valid
UI session cookie is also accepted, so the dashboard can self-scrape.
Keeping the two tokens separate is still recommended: rotating
`auth_token` then does not break Prometheus scraping, and handing
your monitoring stack the scrape token does not also grant it UI
access.
## Prometheus scrape config

```yaml
scrape_configs:
  - job_name: csm
    scheme: https
    tls_config:
      # CSM serves a self-signed cert by default; either skip
      # verification here or pin the CA you chose.
      insecure_skip_verify: true
    authorization:
      type: Bearer
      credentials: "<metrics_token from csm.yaml>"
    static_configs:
      - targets:
          - csm-host-1.example.internal:9443
          - csm-host-2.example.internal:9443
```
A complete, validated version of this snippet (with a `global:` block)
ships as `docs/src/examples/prometheus-scrape.yml`. The CI pipeline
runs `promtool check config` against that file in the `promtool-check`
job; if the example ever stops validating, the pipeline fails.
## Quick check

```sh
curl -sk -H "Authorization: Bearer $METRICS_TOKEN" \
  https://localhost:9443/metrics | head
```
## Available metrics

### Build / process

- `csm_build_info{version}` (gauge, always 1): build metadata. Scrape
  once to discover the running version. Join on it in queries via
  `group_left(version)`.
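For example, a common pattern (assuming Prometheus's default `instance` and `job` target labels) multiplies another series by the always-1 gauge to attach the running version:

```promql
rate(csm_findings_total[5m])
  * on (instance, job) group_left (version)
    csm_build_info
```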
### YARA-X worker (when `signatures.yara_worker_enabled: true`)

- `csm_yara_worker_restarts_total` (counter): cumulative number of
  times the supervisor has restarted the `csm yara-worker` child.
  Alert on sustained growth: a single restart is routine (rule
  deploys); a steady climb means the worker is crash-looping and
  real-time YARA scans are degraded.
### Findings

- `csm_findings_total{severity}` (counter): every finding CSM records
  is counted here. Severities are `CRITICAL`, `HIGH`, and `WARNING`
  (matching the `alert.Severity` enum). Use `rate(...)` for arrival
  velocity; watch for sudden CRITICAL spikes.
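Two example queries along those lines (window sizes are a starting point, not a recommendation):

```promql
# arrival velocity, per severity
sum by (severity) (rate(csm_findings_total[15m]))

# any CRITICAL finding in the last 5 minutes
increase(csm_findings_total{severity="CRITICAL"}[5m]) > 0
```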
### State

- `csm_store_size_bytes` (gauge): on-disk size of the bbolt state
  database (`/opt/csm/state/csm.db` by default). ROADMAP item 6 will
  add a retention policy that compacts this file; for now use this
  metric to spot runaway growth.
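One way to turn the gauge into an early warning is a linear forecast; the six-hour lookback and one-week horizon here are arbitrary choices to adjust for your environment:

```promql
# projected DB size one week out, based on the last 6 h of growth
predict_linear(csm_store_size_bytes[6h], 7 * 86400)
```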
### Fanotify realtime monitor

- `csm_fanotify_queue_depth` (gauge): current number of queued events
  waiting for the analyzer pool. The queue capacity is 4000; sustained
  values near that cap mean drops are imminent. Alert target:
  `max_over_time(csm_fanotify_queue_depth[5m]) > 3500`.
- `csm_fanotify_events_dropped_total` (counter): cumulative events
  dropped because the analyzer queue was full. The reconcile pass
  still rescans drop-affected directories 60 s later, so dropped
  events do not disappear from detection – they arrive delayed. Alert
  target: `rate(csm_fanotify_events_dropped_total[5m]) > 0` paired
  with a short `for` clause.
- `csm_fanotify_reconcile_latency_seconds` (histogram): how long the
  post-overflow reconcile pass takes to walk drop-affected directories
  and rescan recent files. Buckets: 0.01 s .. 60 s. Watch p95:
  reconcile taking tens of seconds means bulk events are piling up
  faster than the walker can keep up.
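The two fanotify alert targets can be written as a Prometheus alerting-rules fragment; the group and alert names below are illustrative, not shipped defaults:

```yaml
groups:
  - name: csm-fanotify
    rules:
      - alert: CsmFanotifyQueueNearCapacity
        expr: max_over_time(csm_fanotify_queue_depth[5m]) > 3500
        for: 5m
        labels:
          severity: warning
      - alert: CsmFanotifyEventsDropping
        expr: rate(csm_fanotify_events_dropped_total[5m]) > 0
        for: 2m
        labels:
          severity: warning
```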
### Periodic check runner

- `csm_check_duration_seconds{name,tier}` (histogram): wall-clock time
  each check takes to complete. Label `name` is one of the 62 checks
  (`fake_kernel_threads`, `webshells`, …); label `tier` is `critical`,
  `deep`, or `all`. Buckets: 0.01 s .. 300 s (300 s is the per-check
  timeout ceiling). Useful aggregations:

  ```promql
  # p95 of the slowest check in the critical tier:
  histogram_quantile(0.95,
    sum by (le, name) (
      rate(csm_check_duration_seconds_bucket{tier="critical"}[10m])
    )
  )

  # total time each cycle spends in deep-tier checks:
  sum by (tier) (rate(csm_check_duration_seconds_sum{tier="deep"}[1h]))
  ```
### Firewall

- `csm_blocked_ips_total` (gauge): number of IPs currently on the
  firewall block list. Excludes expired temp bans – the store’s
  `LoadFirewallState` filters those before the gauge reads.
- `csm_firewall_rules_total` (gauge): total firewall rules across all
  four categories (blocked IPs, allowed IPs, blocked subnets,
  port-specific allows). Sudden drops are worth investigating; the
  firewall engine does not prune rules without operator or
  auto-response action.
### Config reloads

- `csm_config_reloads_total{result}` (counter): SIGHUP reload
  attempts, by outcome. `result` is one of:
  - `success` – safe fields swapped in place, integrity hash
    re-signed, live config updated.
  - `restart_required` – one or more fields that need a full restart
    changed; live config unchanged.
  - `error` – YAML parse failure, validation failure, or re-sign
    failure; live config unchanged.
  - `noop` – file edit produced no semantic change (identical values,
    whitespace edit, etc.).

  Alert target: `rate(csm_config_reloads_total{result="error"}[5m]) > 0`
  paired with a short `for` clause.
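That alert target, with a short `for` clause, looks like this as a rule fragment (the alert name is illustrative):

```yaml
- alert: CsmConfigReloadFailing
  expr: rate(csm_config_reloads_total{result="error"}[5m]) > 0
  for: 2m
```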
### Auto-response

- `csm_auto_response_actions_total{action}` (counter): every
  auto-response action fired, by class. `action` is `kill`,
  `quarantine`, or `block`. Incremented once per finding the
  corresponding `Auto*` helper produces, so a batch blocking four IPs
  in one cycle adds 4 to `action="block"`. Useful for detecting
  response storms: `rate(csm_auto_response_actions_total[5m])`.
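A storm-detection query broken down by action class, so a kill storm and a block storm surface separately:

```promql
sum by (action) (rate(csm_auto_response_actions_total[5m]))
```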
## Counter reset semantics

Prometheus counters in CSM live in process memory. They reset to zero
whenever the daemon restarts (config change, binary upgrade, crash
recovery). This is the standard behaviour for every
Prometheus-instrumented daemon; Prometheus’s scrape pipeline detects
counter resets on its own, and `rate()`, `increase()`, and
`irate()` all handle them correctly.
Operators should not alert on “counter decreased across a scrape” as
a failure condition. Alert on `rate()` or `increase()` of a counter
over a window long enough to absorb expected restarts.
Persisting counters across restarts would require writing to bbolt on every increment, which would not pay for itself. If a specific metric needs restart-stable behaviour later, a gauge-over-the-bbolt-counter pattern can be added for that one case without affecting the rest.
## Caveats

- Scrape the web UI’s HTTPS port, not a separate listener.
- `curl -k` / `insecure_skip_verify` is appropriate only when the
  cert is self-signed and the network path is trusted. Pin a CA for
  anything else.
- Prometheus label cardinality: per-account and per-IP labels are
  deliberately not exposed. Shared-hosting deployments with 1000+
  cPanel users would otherwise overwhelm a Prometheus server.
## Not instrumented (yet)

- Per-account labels on any metric. Deliberately off: shared-hosting
  deployments with 1000+ cPanel users would blow out Prometheus
  cardinality.
- Fanotify inline auto-response actions (the
  quarantine-while-seeing-the-write path in `fanotify.go`). The
  periodic `csm_auto_response_actions_total` does not count those; a
  follow-up may split the metric or add a `source` label.
- bbolt per-bucket size breakdown. Ships with ROADMAP item 6 (the
  retention + compaction work).