Drift Detection & Self-Heal¶
The watcher-service continuously polls every configured datasource and compares the live state against the Git baseline. When values diverge, it can restore the correct state automatically.
What is drift?¶
Drift happens when a datasource value no longer matches what Git says it should be. Common causes:
- A developer edits MongoDB directly during an incident
- A deployment script overwrites a config value
- A Kubernetes operator changes a ConfigMap
- A Redis key expires or is flushed
Without drift detection, these changes go unnoticed until something breaks.
How detection works¶
```
Every POLL_INTERVAL (default: 60s)
│
├── Read live values from datasource
├── Compare against Git-sourced baseline (key by key)
│
├── No diff → log "OK", update poll counter
│
└── Diff found
    ├── Increment vartrack_watcher_drift_total counter
    ├── Log the drifted keys and their values
    └── self_heal: true?
        ├── Yes → call TriggerSync gRPC → orchestrator re-runs ETL pipeline
        └── No  → log drift, alert via metrics, do nothing
```
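Concretely, one poll cycle can be sketched in Go as follows. `Rule`, `readLive`, `loadBaseline`, and `triggerSync` are illustrative stand-ins, not the actual watcher-service API:

```go
package watcher

import (
	"context"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// Rule mirrors one entry from the rules config (illustrative shape).
type Rule struct {
	Platform   string
	Datasource string
	SelfHeal   bool
}

var (
	pollTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "vartrack_watcher_poll_total"},
		[]string{"datasource"},
	)
	driftTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "vartrack_watcher_drift_total"},
		[]string{"datasource"},
	)
)

// pollOnce runs a single drift-detection cycle for one rule. The three
// function parameters stand in for the real datasource/Git/gRPC plumbing.
func pollOnce(
	ctx context.Context,
	rule Rule,
	readLive func(context.Context, string) (map[string]string, error),
	loadBaseline func(Rule) map[string]string,
	triggerSync func(context.Context, Rule) error,
) error {
	live, err := readLive(ctx, rule.Datasource) // only the keys declared in the rule
	if err != nil {
		return err
	}
	baseline := loadBaseline(rule) // Git-sourced expected values

	pollTotal.WithLabelValues(rule.Datasource).Inc()

	drifted := map[string]string{}
	for key, want := range baseline { // compare key by key
		if got, ok := live[key]; !ok || got != want {
			drifted[key] = got
		}
	}
	if len(drifted) == 0 {
		log.Printf("OK: %s", rule.Datasource)
		return nil
	}

	driftTotal.WithLabelValues(rule.Datasource).Inc()
	log.Printf("drift on %s: %v", rule.Datasource, drifted)

	if rule.SelfHeal {
		return triggerSync(ctx, rule) // orchestrator re-runs the ETL pipeline
	}
	return nil // drift logged and exported; no automatic repair
}
```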
Enabling self-heal¶
Set self_heal: true on a rule to enable automatic repair:
```
rules: [{
  platform:   "github"
  datasource: "mongo"
  self_heal:  true   // watcher will restore state on drift
}]
```
Rules with self_heal: false (or omitted) are still polled and drift is logged, but no automatic repair is triggered.
Concrete example¶
Your config in Git:
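For this walkthrough, assume the tracked file pins `max_connections` to `50` (shown here as YAML; the actual file format depends on your setup):

```yaml
# baseline tracked in Git (illustrative format)
max_connections: 50
```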
Someone runs `db.variables.updateOne({key: "max_connections"}, {$set: {value: 5}})` directly in MongoDB.
On the next poll cycle:
- Watcher reads `max_connections = 5` from MongoDB
- Compares against baseline: expected `50`, got `5`
- `self_heal: true` → calls `TriggerSync` on the orchestrator
- Orchestrator fetches the file from Git, parses it, writes `max_connections = 50` back to MongoDB
- Next poll sees no drift; state restored
Total recovery time: up to one POLL_INTERVAL (default 60 seconds).
Metrics¶
Monitor drift in Prometheus / Grafana:
| Metric | Type | Description |
|---|---|---|
| `vartrack_watcher_poll_total` | Counter | Total poll cycles, labeled by datasource |
| `vartrack_watcher_drift_total` | Counter | Drift events detected |
| `vartrack_watcher_heal_total` | Counter | Self-heal calls triggered |
| `vartrack_watcher_heal_errors_total` | Counter | Self-heal calls that failed |
| `vartrack_watcher_poll_duration_seconds` | Histogram | Poll cycle duration |
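These counters make drift straightforward to alert on. For example, a Prometheus alerting rule file could look like the following (the rule names, windows, and severities are illustrative, not shipped with varTrack):

```yaml
groups:
  - name: vartrack-watcher
    rules:
      # Fire when any drift event was recorded in the last 5 minutes.
      - alert: VarTrackDriftDetected
        expr: increase(vartrack_watcher_drift_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "varTrack detected configuration drift"
      # Fire when self-heal calls are failing.
      - alert: VarTrackHealFailing
        expr: increase(vartrack_watcher_heal_errors_total[15m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "varTrack self-heal calls are failing"
```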
Poll interval¶
Adjust how often the watcher checks each datasource:
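Assuming the environment-variable configuration referenced above (the exact value format may differ by deployment):

```
# poll every 30 seconds instead of the default 60
POLL_INTERVAL=30s
```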
Lower values mean faster drift detection but more load on datasources.
Why polling? (and how it's kept efficient)¶
varTrack uses a scheduled pull model rather than event-driven push because it must support eight fundamentally different backends — MongoDB, Redis, ZooKeeper, S3, Kubernetes ConfigMaps, Helm, Vercel, and Linux servers. Each backend exposes a different (or no) change-notification mechanism:
| Datasource | Native change events? |
|---|---|
| MongoDB | Change Streams (replica set only) |
| Redis | Keyspace notifications (off by default, extra config) |
| ZooKeeper | Watches (per-znode, limited concurrency) |
| S3 | Event notifications (requires SNS/SQS wiring) |
| Linux server | None — SSH file stat only |
| Vercel | None |
| ConfigMap | Kubernetes informer (requires in-cluster) |
| Helm | None — release state only |
A unified polling model means a single code path works across all backends with no external infrastructure requirements.
How the overhead is minimised:
- Lightweight reads — for key/value stores (Redis, MongoDB document), the watcher fetches only the keys declared in the rule's config file, not the entire database.
- Configurable interval — default 60 s; set `POLL_INTERVAL` to match your SLA.
- Leader election — in multi-replica deployments, only one replica polls at a time (ZooKeeper or Redis distributed lock).
- Shared state store — replicas share the baseline via Redis so each poll compares against a consistent snapshot.
- Circuit breaker — if a datasource is unreachable, the watcher backs off exponentially rather than hammering it (see the sketch after this list).
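A minimal sketch of that exponential backoff, assuming plain doubling with a cap (the real watcher may add jitter or use different bounds):

```go
package watcher

import "time"

// backoff returns the wait time before retry attempt n (0-based),
// doubling from base up to max. Illustrative; not the actual
// watcher-service implementation.
func backoff(n int, base, max time.Duration) time.Duration {
	d := base << uint(n)   // base * 2^n
	if d > max || d <= 0 { // d <= 0 guards against shift overflow
		return max
	}
	return d
}
```

With `base = 1s` and `max = 5m`, successive retries wait 1 s, 2 s, 4 s, …, capped at five minutes.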
Multi-replica deployments¶
When running multiple watcher replicas, only one should run the heal loop at a time. Leader election ensures this:
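A minimal sketch of the Redis variant, using a `SET NX` lock with the 15-second TTL mentioned below (the key name and client wiring are assumptions, not the actual implementation):

```go
package watcher

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

const (
	leaderKey = "vartrack:watcher:leader" // illustrative key name
	leaderTTL = 15 * time.Second          // a crashed leader's lock expires within one TTL window
)

// tryAcquireLeadership attempts to become leader via SET NX. The current
// leader must call this again before the TTL expires to keep the lock
// (renewal loop omitted here for brevity).
func tryAcquireLeadership(ctx context.Context, rdb *redis.Client, replicaID string) (bool, error) {
	return rdb.SetNX(ctx, leaderKey, replicaID, leaderTTL).Result()
}
```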
Replicas that are not the leader still poll and detect drift, but defer healing to the current leader. If the leader crashes, another replica acquires the lock within one TTL window (15 seconds for Redis).
Shared state store¶
In multi-replica deployments, all replicas can share the same baseline state via Redis:
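As a sketch of the idea, the baseline could live in one Redis hash per rule (the key layout and helper names are assumptions, not the actual schema):

```go
package watcher

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// saveBaseline writes the Git-sourced baseline for one rule into a shared
// Redis hash so every replica compares against the same snapshot.
func saveBaseline(ctx context.Context, rdb *redis.Client, rule string, baseline map[string]string) error {
	return rdb.HSet(ctx, "vartrack:baseline:"+rule, baseline).Err()
}

// loadBaseline reads the shared snapshot back on each poll cycle.
func loadBaseline(ctx context.Context, rdb *redis.Client, rule string) (map[string]string, error) {
	return rdb.HGetAll(ctx, "vartrack:baseline:"+rule).Result()
}
```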
This prevents replicas from triggering unnecessary heals due to stale local state.