
nats-presence-coordinator

A distributed presence monitoring system built on NATS messaging. It tracks the availability and liveness of distributed clients through a heartbeat/presence protocol with automatic leader election for high availability.

Infrastructure Requirements

NATS Server

This application requires a NATS server (or cluster) with JetStream enabled. JetStream is used for:

  • Leader election — Coordinators use a KV bucket to elect a leader via compare-and-swap operations
  • Client registration persistence — Registered clients are stored in a KV bucket for failover recovery
  • Registration message delivery — Client registration messages are delivered via a JetStream stream

Minimum NATS server version: 2.9+ (for KV store support)

To enable JetStream, add to your NATS server configuration:

jetstream {
    store_dir: "/path/to/jetstream/data"
    max_memory_store: 1GB
    max_file_store: 10GB
}

For production, a 3-node NATS cluster is recommended for high availability.
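
For illustration, a minimal configuration for one node of such a 3-node cluster might look like the following. Host names, ports, and paths are placeholders; consult the NATS clustering documentation for your environment:

```
# nats1.conf — one of three mutually-routed servers (illustrative)
port: 4222

jetstream {
    store_dir: "/data/jetstream"
}

cluster {
    name: presence-nats
    listen: 0.0.0.0:6222
    routes: [
        nats://nats2:6222
        nats://nats3:6222
    ]
}
```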

NATS Resources Created

The coordinator automatically creates the following JetStream resources (all namespaced by cluster-id):

Resource                              Type       Purpose
coordinator-leadership-{cluster-id}   KV Bucket  Leader election
presence-registrations-{cluster-id}   KV Bucket  Persisted client registrations
presence-registrations-{cluster-id}   Stream     Registration message delivery (WorkQueue)

This namespacing allows multiple independent coordinator clusters to share the same NATS infrastructure without conflicts.
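
The namespacing above is a simple string scheme; a sketch of how such names can be derived (the method names here are illustrative, not the coordinator's actual API):

```java
// Sketch: deriving cluster-id-namespaced JetStream resource names.
public class ClusterNames {
    static String leadershipBucket(String clusterId) {
        return "coordinator-leadership-" + clusterId;
    }

    static String registrationsName(String clusterId) {
        // Used for both the KV bucket and the WorkQueue stream.
        return "presence-registrations-" + clusterId;
    }

    public static void main(String[] args) {
        System.out.println(leadershipBucket("production"));   // coordinator-leadership-production
        System.out.println(registrationsName("production"));  // presence-registrations-production
    }
}
```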

Coordinator Clustering

Multiple coordinator instances form a cluster using automatic leader election. Only the leader performs active presence monitoring; followers remain on standby.

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Coordinator 1  │     │  Coordinator 2  │     │  Coordinator 3  │
│    (LEADER)     │     │   (FOLLOWER)    │     │   (FOLLOWER)    │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │     NATS Cluster        │
                    │   (JetStream enabled)   │
                    └─────────────────────────┘

Leader election uses NATS KV with compare-and-swap for distributed consensus:

  • Random backoff (150-300ms) prevents election storms
  • Leader publishes heartbeats every second
  • Followers detect leader failure after configurable timeout (default: 2 seconds)
  • New election triggered automatically on leader failure

All coordinators in a cluster must use the same --cluster-id.
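
The compare-and-swap pattern can be sketched in memory. In the real system a NATS KV bucket plays the role of the shared reference (a KV create succeeds only if the key is absent, giving the same "first writer wins" semantics); everything below is an illustrative simplification, not the coordinator's code:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReference;

// In-memory sketch of CAS-based leader election.
public class ElectionSketch {
    // Stands in for the coordinator-leadership-{cluster-id} KV key.
    static final AtomicReference<String> leaderKey = new AtomicReference<>(null);

    // Random 150-300ms wait before attempting election, to avoid storms.
    static long electionBackoffMs() {
        return ThreadLocalRandom.current().nextLong(150, 301);
    }

    // Succeeds only if no leader is currently recorded; exactly one
    // concurrent caller can win, which is what makes the election safe.
    static boolean tryAcquireLeadership(String coordinatorId) {
        return leaderKey.compareAndSet(null, coordinatorId);
    }
}
```

A follower that detects a missing leader heartbeat would sleep for `electionBackoffMs()` and then call `tryAcquireLeadership`; losers go back to watching heartbeats.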

Client Registration Protocol

Clients register for presence monitoring by publishing to a NATS subject. The coordinator then actively monitors registered clients via ping/pong.

Registration Flow

  1. Client registers — Publishes to presence.register.{cluster-id}
  2. Coordinator acknowledges — Stores registration in KV, publishes NodePresent report
  3. Coordinator pings — Sends PING to client's subject every ~750ms if no recent activity
  4. Client responds — Any response to the ping indicates liveness
  5. Timeout handling — No response within 5 seconds triggers NodeDead report
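
The ping/timeout bookkeeping in steps 3-5 can be sketched as follows, assuming the ~750ms and 5-second defaults described above (class and method names are ours, for illustration only):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the leader's liveness bookkeeping: ping a client that has been
// quiet for ~750ms, declare it dead after 5 seconds of silence.
public class LivenessSketch {
    static final long PING_AFTER_MS = 750;
    static final long DEAD_AFTER_MS = 5_000;

    // Last time each registered client was heard from (assumed present).
    final Map<String, Long> lastActivity = new HashMap<>();

    void recordActivity(String client, long nowMs) {
        lastActivity.put(client, nowMs);
    }

    boolean shouldPing(String client, long nowMs) {
        long quiet = nowMs - lastActivity.get(client);
        return quiet >= PING_AFTER_MS && quiet < DEAD_AFTER_MS;
    }

    boolean isDead(String client, long nowMs) {
        return nowMs - lastActivity.get(client) >= DEAD_AFTER_MS;
    }
}
```

Any response from the client calls `recordActivity`, which resets both windows; a client that answers every ping is therefore never declared dead.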

NATS Subjects

All subjects are namespaced by cluster-id:

Subject                                     Direction                Purpose
presence.register.{cluster-id}              Client → Coordinator     Client registration messages
{client-subject}                            Coordinator → Client     Ping requests (client must subscribe)
presence.report.{cluster-id}                Coordinator → Consumers  Presence reports (NodePresent, NodeDead, NodeState)
coordinator.leader.heartbeat.{cluster-id}   Leader → Followers       Leader heartbeat (internal)

Client Requirements

Clients must:

  1. Publish a registration message to the registration subject
  2. Subscribe to their own unique subject and respond to any message (ping)
  3. Re-register if they restart

Registration message format (JSON):

{"myPresenceSubject":"client-subject"}

Example (using NATS CLI):

# Register a client
nats pub presence.register.my-cluster '{"myPresenceSubject":"my-service.instance-1"}'

# Client must also subscribe and respond to pings
nats sub my-service.instance-1 --reply PONG
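
The same client could be written in Java with the jnats library (`io.nats:jnats`). This is an untested sketch under the assumption that pings arrive as request/reply messages; the subject names and the `registrationJson` helper are ours, not part of the coordinator's API:

```java
import io.nats.client.Connection;
import io.nats.client.Dispatcher;
import io.nats.client.Nats;
import java.nio.charset.StandardCharsets;

public class PresenceClient {
    // Build the registration payload in the format shown above.
    static String registrationJson(String presenceSubject) {
        return "{\"myPresenceSubject\":\"" + presenceSubject + "\"}";
    }

    public static void main(String[] args) throws Exception {
        String subject = "my-service.instance-1";
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            // 1. Register with the coordinator for cluster "my-cluster".
            nc.publish("presence.register.my-cluster",
                       registrationJson(subject).getBytes(StandardCharsets.UTF_8));

            // 2. Answer every ping on our own subject to signal liveness.
            //    (Assumes pings carry a reply subject; any response counts.)
            Dispatcher d = nc.createDispatcher(msg -> {
                if (msg.getReplyTo() != null) {
                    nc.publish(msg.getReplyTo(),
                               "PONG".getBytes(StandardCharsets.UTF_8));
                }
            });
            d.subscribe(subject);

            Thread.sleep(Long.MAX_VALUE); // keep responding until killed
        }
    }
}
```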

Running the Coordinator

Command Line Options

Required:
  -u, --nats-url <url>           NATS server URL(s), comma-separated for cluster
  --cluster-id <id>              Cluster identifier (all coordinators must match)

Optional:
  --coordinator-id <id>          Unique instance ID (default: auto-generated UUID)
  --leader-timeout <ms>          Leader heartbeat timeout (default: 2000)
  --election-backoff-min <ms>    Min election backoff (default: 150)
  --election-backoff-max <ms>    Max election backoff (default: 300)
  -v, --verbose                  Enable verbose logging

Subject names are automatically derived from the cluster-id:

  • Registration: presence.register.{cluster-id}
  • Reports: presence.report.{cluster-id}

Example Deployment

Start three coordinator instances for high availability:

# Instance 1
./start-server.sh --nats-url nats://nats1:4222,nats://nats2:4222,nats://nats3:4222 \
    --cluster-id production --coordinator-id coord-1

# Instance 2
./start-server.sh --nats-url nats://nats1:4222,nats://nats2:4222,nats://nats3:4222 \
    --cluster-id production --coordinator-id coord-2

# Instance 3
./start-server.sh --nats-url nats://nats1:4222,nats://nats2:4222,nats://nats3:4222 \
    --cluster-id production --coordinator-id coord-3

Docker

podman run -it nats-presence-coordinator:1.0 \
    --nats-url nats://nats-server:4222 \
    --cluster-id my-cluster

Consuming Presence Reports

The coordinator publishes binary presence reports to presence.report.{cluster-id}. Report types:

  • NodePresent — Client just registered
  • NodeDead — Client failed to respond to pings
  • NodeState — Periodic status update for each active client (every 1.5 seconds)

Example:

# Subscribe to presence reports for cluster "production"
nats sub presence.report.production

Why JetStream for Client Registration?

Client registrations are persisted to a JetStream stream rather than delivered over core NATS subjects. This is a deliberate design choice made to guarantee immediate consistency during leader failover.

The Problem

When a coordinator leader fails, a new leader must take over. Without persistence, the new leader has no knowledge of which clients are registered.

Alternative: Core NATS with Client Re-registration

An alternative design would have clients periodically re-register (e.g., every 30 seconds). The new leader would start with an empty state and learn about clients as they re-register. This works, but creates a "blind period" after failover where:

  • The coordinator cannot publish accurate presence reports
  • Client deaths during the gap go undetected (nothing to ping)
  • Downstream consumers see inconsistent state

Current Design: JetStream Persistence

With JetStream, the new leader loads all pending registrations immediately on activation. This provides:

  • Instant state recovery — no blind period after failover
  • Correct death detection — the coordinator knows all clients from the start
  • Consistent reporting — downstream consumers see accurate state immediately

The trade-off is infrastructure complexity (requires JetStream-enabled NATS), but the benefit is correctness guarantees that matter for presence monitoring systems.