# nats-presence-coordinator
A distributed presence monitoring system built on NATS messaging. It tracks the availability and liveness of distributed clients through a heartbeat/presence protocol with automatic leader election for high availability.
## Infrastructure Requirements

### NATS Server
This application requires a NATS server (or cluster) with JetStream enabled. JetStream is used for:
- Leader election — Coordinators use a KV bucket to elect a leader via compare-and-swap operations
- Client registration persistence — Registered clients are stored in a KV bucket for failover recovery
- Registration message delivery — Client registration messages are delivered via a JetStream stream
Minimum NATS server version: 2.9+ (for KV store support)
To enable JetStream, add to your NATS server configuration:

```
jetstream {
  store_dir: "/path/to/jetstream/data"
  max_memory_store: 1GB
  max_file_store: 10GB
}
```
For production, a 3-node NATS cluster is recommended for high availability.
### NATS Resources Created
The coordinator automatically creates the following JetStream resources (all namespaced by cluster-id):
| Resource | Type | Purpose |
|---|---|---|
| `coordinator-leadership-{cluster-id}` | KV Bucket | Leader election |
| `presence-registrations-{cluster-id}` | KV Bucket | Persisted client registrations |
| `presence-registrations-{cluster-id}` | Stream | Registration message delivery (WorkQueue) |
This namespacing allows multiple independent coordinator clusters to share the same NATS infrastructure without conflicts.
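For orientation, the setup a coordinator performs on startup might look roughly like the following jnats sketch. The bucket, stream, and subject names follow the table above; the exact configuration used by this project may differ, and a real implementation would tolerate resources that already exist:

```java
import io.nats.client.Connection;
import io.nats.client.Nats;
import io.nats.client.api.KeyValueConfiguration;
import io.nats.client.api.RetentionPolicy;
import io.nats.client.api.StreamConfiguration;

public class ResourceSetupSketch {
    public static void main(String[] args) throws Exception {
        String clusterId = "my-cluster"; // illustrative cluster-id
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            // KV bucket used for leader election
            nc.keyValueManagement().create(KeyValueConfiguration.builder()
                    .name("coordinator-leadership-" + clusterId)
                    .build());
            // KV bucket holding persisted client registrations
            nc.keyValueManagement().create(KeyValueConfiguration.builder()
                    .name("presence-registrations-" + clusterId)
                    .build());
            // WorkQueue stream capturing registration messages
            nc.jetStreamManagement().addStream(StreamConfiguration.builder()
                    .name("presence-registrations-" + clusterId)
                    .subjects("presence.register." + clusterId)
                    .retentionPolicy(RetentionPolicy.WorkQueue)
                    .build());
        }
    }
}
```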
## Coordinator Clustering
Multiple coordinator instances form a cluster using automatic leader election. Only the leader performs active presence monitoring; followers remain on standby.
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Coordinator 1  │     │  Coordinator 2  │     │  Coordinator 3  │
│    (LEADER)     │     │   (FOLLOWER)    │     │   (FOLLOWER)    │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │      NATS Cluster       │
                    │   (JetStream enabled)   │
                    └─────────────────────────┘
```
Leader election uses NATS KV with compare-and-swap for distributed consensus:
- Random backoff (150-300ms) prevents election storms
- Leader publishes heartbeats every second
- Followers detect leader failure after configurable timeout (default: 2 seconds)
- New election triggered automatically on leader failure
All coordinators in a cluster must use the same `--cluster-id`.
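As a rough illustration, the compare-and-swap step of an election could be expressed with the jnats KV API as below. The bucket and key names are assumptions based on this README, and heartbeat publishing and failure detection are omitted:

```java
import io.nats.client.Connection;
import io.nats.client.JetStreamApiException;
import io.nats.client.KeyValue;
import io.nats.client.Nats;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;

public class ElectionSketch {
    public static void main(String[] args) throws Exception {
        String coordinatorId = "coord-1"; // illustrative --coordinator-id
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            KeyValue kv = nc.keyValue("coordinator-leadership-my-cluster");
            // Random backoff (150-300ms) so concurrent candidates don't collide
            Thread.sleep(ThreadLocalRandom.current().nextLong(150, 301));
            try {
                // create() succeeds only if the key does not exist yet:
                // this is the compare-and-swap that picks exactly one winner
                kv.create("leader", coordinatorId.getBytes(StandardCharsets.UTF_8));
                System.out.println("won election, acting as leader");
            } catch (JetStreamApiException e) {
                System.out.println("lost election, standing by as follower");
            }
        }
    }
}
```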
## Client Registration Protocol
Clients register for presence monitoring by publishing to a NATS subject. The coordinator then actively monitors registered clients via ping/pong.
### Registration Flow
- Client registers — Publishes to `presence.register.{cluster-id}`
- Coordinator acknowledges — Stores registration in KV, publishes `NodePresent` report
- Coordinator pings — Sends `PING` to client's subject every ~750ms if no recent activity
- Client responds — Any response to the ping indicates liveness
- Timeout handling — No response within 5 seconds triggers `NodeDead` report
### NATS Subjects
All subjects are namespaced by cluster-id:
| Subject | Direction | Purpose |
|---|---|---|
| `presence.register.{cluster-id}` | Client → Coordinator | Client registration messages |
| `{client-subject}` | Coordinator → Client | Ping requests (client must subscribe) |
| `presence.report.{cluster-id}` | Coordinator → Consumers | Presence reports (NodePresent, NodeDead, NodeState) |
| `coordinator.leader.heartbeat.{cluster-id}` | Leader → Followers | Leader heartbeat (internal) |
### Client Requirements
Clients must:
- Publish a registration message to the registration subject
- Subscribe to their own unique subject and respond to any message (ping)
- Re-register if they restart
Registration message format (JSON):

```json
{"myPresenceSubject":"client-subject"}
```
Example (using NATS CLI):

```sh
# Register a client
nats pub presence.register.my-cluster '{"myPresenceSubject":"my-service.instance-1"}'

# Client must also subscribe and respond to pings
nats reply my-service.instance-1 PONG
```
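In application code, the same client-side contract might look like this jnats sketch. The subject names and JSON shape come from this README; everything else is illustrative:

```java
import io.nats.client.Connection;
import io.nats.client.Dispatcher;
import io.nats.client.Nats;

import java.nio.charset.StandardCharsets;

public class PresenceClientSketch {
    public static void main(String[] args) throws Exception {
        String presenceSubject = "my-service.instance-1"; // unique per client
        Connection nc = Nats.connect("nats://localhost:4222");

        // Subscribe to our own subject and answer every ping
        Dispatcher d = nc.createDispatcher(msg -> {
            if (msg.getReplyTo() != null) {
                nc.publish(msg.getReplyTo(), "PONG".getBytes(StandardCharsets.UTF_8));
            }
        });
        d.subscribe(presenceSubject);

        // Register with the coordinator cluster (re-send this after any restart)
        String registration = "{\"myPresenceSubject\":\"" + presenceSubject + "\"}";
        nc.publish("presence.register.my-cluster",
                registration.getBytes(StandardCharsets.UTF_8));

        Thread.currentThread().join(); // keep running so pings get answered
    }
}
```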
## Running the Coordinator

### Command Line Options
```
Required:
  -u, --nats-url <url>             NATS server URL(s), comma-separated for cluster
      --cluster-id <id>            Cluster identifier (all coordinators must match)

Optional:
      --coordinator-id <id>        Unique instance ID (default: auto-generated UUID)
      --leader-timeout <ms>        Leader heartbeat timeout (default: 2000)
      --election-backoff-min <ms>  Min election backoff (default: 150)
      --election-backoff-max <ms>  Max election backoff (default: 300)
  -v, --verbose                    Enable verbose logging
```
Subject names are automatically derived from the cluster-id:
- Registration: `presence.register.{cluster-id}`
- Reports: `presence.report.{cluster-id}`
### Example Deployment
Start three coordinator instances for high availability:
```sh
# Instance 1
./start-server.sh --nats-url nats://nats1:4222,nats://nats2:4222,nats://nats3:4222 \
  --cluster-id production --coordinator-id coord-1

# Instance 2
./start-server.sh --nats-url nats://nats1:4222,nats://nats2:4222,nats://nats3:4222 \
  --cluster-id production --coordinator-id coord-2

# Instance 3
./start-server.sh --nats-url nats://nats1:4222,nats://nats2:4222,nats://nats3:4222 \
  --cluster-id production --coordinator-id coord-3
```
### Docker

```sh
podman run -it nats-presence-coordinator:1.0 \
  --nats-url nats://nats-server:4222 \
  --cluster-id my-cluster
```
## Consuming Presence Reports

The coordinator publishes binary presence reports to `presence.report.{cluster-id}`. Report types:
- `NodePresent` — Client just registered
- `NodeDead` — Client failed to respond to pings
- `NodeState` — Periodic status update for each active client (every 1.5 seconds)
Example:

```sh
# Subscribe to presence reports for cluster "production"
nats sub presence.report.production
```
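In application code, a plain core NATS subscription is enough to receive the reports. A minimal jnats sketch follows; since the binary report encoding is defined by the coordinator library and not documented here, the payload is left as raw bytes:

```java
import io.nats.client.Connection;
import io.nats.client.Dispatcher;
import io.nats.client.Nats;

public class ReportConsumerSketch {
    public static void main(String[] args) throws Exception {
        Connection nc = Nats.connect("nats://localhost:4222");
        Dispatcher d = nc.createDispatcher(msg -> {
            // Raw binary report; decoding requires the coordinator's report types
            System.out.println("report on " + msg.getSubject()
                    + " (" + msg.getData().length + " bytes)");
        });
        d.subscribe("presence.report.production");
        Thread.currentThread().join();
    }
}
```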
## Why JetStream for Client Registration?
Client registrations are persisted to a JetStream stream rather than using core NATS subjects. This is a deliberate design choice for immediate consistency during leader failover.
### The Problem
When a coordinator leader fails, a new leader must take over. Without persistence, the new leader has no knowledge of which clients are registered.
### Alternative: Core NATS with Client Re-registration
An alternative design would have clients periodically re-register (e.g., every 30 seconds). The new leader would start with an empty state and learn about clients as they re-register. This works, but creates a "blind period" after failover where:
- The coordinator cannot publish accurate presence reports
- Client deaths during the gap go undetected (nothing to ping)
- Downstream consumers see inconsistent state
### Current Design: JetStream Persistence
With JetStream, the new leader loads all pending registrations immediately on activation. This provides:
- Instant state recovery — no blind period after failover
- Correct death detection — the coordinator knows all clients from the start
- Consistent reporting — downstream consumers see accurate state immediately
The trade-off is infrastructure complexity (requires JetStream-enabled NATS), but the benefit is correctness guarantees that matter for presence monitoring systems.
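To make the recovery path concrete, here is a hedged sketch of how a newly elected leader might drain pending registrations from the WorkQueue stream on activation. The durable consumer name and batch size are illustrative, and the project's actual recovery code may differ:

```java
import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamSubscription;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.PullSubscribeOptions;

import java.time.Duration;
import java.util.List;

public class FailoverRecoverySketch {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            JetStream js = nc.jetStream();
            JetStreamSubscription sub = js.subscribe(
                    "presence.register.my-cluster",
                    PullSubscribeOptions.builder().durable("registration-loader").build());

            // Fetch whatever accumulated while no leader was active
            List<Message> pending = sub.fetch(100, Duration.ofSeconds(1));
            for (Message msg : pending) {
                // parse the registration JSON and add the client to the ping schedule
                msg.ack(); // WorkQueue retention: ack removes the message
            }
        }
    }
}
```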