# nats-presence-coordinator
A distributed presence monitoring system built on NATS messaging. It tracks the availability and liveness of distributed clients through a heartbeat/presence protocol with automatic leader election for high availability.
## Infrastructure Requirements

### NATS Server
This application requires a NATS server (or cluster) with JetStream enabled. JetStream is used for:
- Leader election — Coordinators use a KV bucket to elect a leader via compare-and-swap operations
- Client registration persistence — Registered clients are stored in a KV bucket for failover recovery
- Registration message delivery — Client registration messages are delivered via a JetStream stream
Minimum NATS server version: 2.9+ (for KV store support)
To enable JetStream, add to your NATS server configuration:

```
jetstream {
  store_dir: "/path/to/jetstream/data"
  max_memory_store: 1GB
  max_file_store: 10GB
}
```
For production, a 3-node NATS cluster is recommended for high availability.
### NATS Resources Created
The coordinator automatically creates the following JetStream resources (all namespaced by cluster-id):
| Resource | Type | Purpose |
|---|---|---|
| `coordinator-leadership-{cluster-id}` | KV Bucket | Leader election |
| `presence-registrations-{cluster-id}` | KV Bucket | Persisted client registrations |
| `presence-registrations-{cluster-id}` | Stream | Registration message delivery (WorkQueue) |
This namespacing allows multiple independent coordinator clusters to share the same NATS infrastructure without conflicts.
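For orientation, the setup a coordinator performs on startup might look roughly like the following jnats sketch. The bucket, stream, and subject names follow the table above; the exact configuration used by this project may differ, and a real implementation would tolerate resources that already exist:

```java
import io.nats.client.Connection;
import io.nats.client.Nats;
import io.nats.client.api.KeyValueConfiguration;
import io.nats.client.api.RetentionPolicy;
import io.nats.client.api.StreamConfiguration;

public class ResourceSetupSketch {
    public static void main(String[] args) throws Exception {
        String clusterId = "my-cluster"; // illustrative cluster-id
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            // KV bucket used for leader election
            nc.keyValueManagement().create(KeyValueConfiguration.builder()
                    .name("coordinator-leadership-" + clusterId)
                    .build());
            // KV bucket holding persisted client registrations
            nc.keyValueManagement().create(KeyValueConfiguration.builder()
                    .name("presence-registrations-" + clusterId)
                    .build());
            // WorkQueue stream capturing registration messages
            nc.jetStreamManagement().addStream(StreamConfiguration.builder()
                    .name("presence-registrations-" + clusterId)
                    .subjects("presence.register." + clusterId)
                    .retentionPolicy(RetentionPolicy.WorkQueue)
                    .build());
        }
    }
}
```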
## Coordinator Clustering
Multiple coordinator instances form a cluster using automatic leader election. Only the leader performs active presence monitoring; followers remain on standby.
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Coordinator 1  │     │  Coordinator 2  │     │  Coordinator 3  │
│    (LEADER)     │     │   (FOLLOWER)    │     │   (FOLLOWER)    │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │      NATS Cluster       │
                    │   (JetStream enabled)   │
                    └─────────────────────────┘
```
Leader election uses NATS KV with compare-and-swap for distributed consensus:
- Random backoff (150-300ms) prevents election storms
- Leader publishes heartbeats every second
- Followers detect leader failure after configurable timeout (default: 2 seconds)
- New election triggered automatically on leader failure
All coordinators in a cluster must use the same `--cluster-id`.
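As a rough illustration, the compare-and-swap step of an election could be expressed with the jnats KV API as below. The bucket and key names are assumptions based on this README, and heartbeat publishing and failure detection are omitted:

```java
import io.nats.client.Connection;
import io.nats.client.JetStreamApiException;
import io.nats.client.KeyValue;
import io.nats.client.Nats;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;

public class ElectionSketch {
    public static void main(String[] args) throws Exception {
        String coordinatorId = "coord-1"; // illustrative --coordinator-id
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            KeyValue kv = nc.keyValue("coordinator-leadership-my-cluster");
            // Random backoff (150-300ms) so concurrent candidates don't collide
            Thread.sleep(ThreadLocalRandom.current().nextLong(150, 301));
            try {
                // create() succeeds only if the key does not exist yet:
                // this is the compare-and-swap that picks exactly one winner
                kv.create("leader", coordinatorId.getBytes(StandardCharsets.UTF_8));
                System.out.println("won election, acting as leader");
            } catch (JetStreamApiException e) {
                System.out.println("lost election, standing by as follower");
            }
        }
    }
}
```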
## Client Registration Protocol
Clients register for presence monitoring by publishing to a NATS subject. The coordinator then actively monitors registered clients via ping/pong.
### Registration Flow
- Client registers — Publishes to `presence.register.{cluster-id}`
- Coordinator acknowledges — Stores registration in KV, publishes `NodePresent` report
- Coordinator pings — Sends `PING` to client's subject every ~750ms if no recent activity
- Client responds — Any response to the ping indicates liveness
- Timeout handling — No response within 5 seconds triggers `NodeDead` report
### NATS Subjects
All subjects are namespaced by cluster-id:
| Subject | Direction | Purpose |
|---|---|---|
| `presence.register.{cluster-id}` | Client → Coordinator | Client registration messages |
| `{client-subject}` | Coordinator → Client | Ping requests (client must subscribe) |
| `presence.report.{cluster-id}` | Coordinator → Consumers | Presence reports (NodePresent, NodeDead, NodeState) |
| `coordinator.leader.heartbeat.{cluster-id}` | Leader → Followers | Leader heartbeat (internal) |
### Client Requirements
Clients must:
- Publish a registration message to the registration subject
- Subscribe to their own unique subject and respond to any message (ping)
- Re-register if they restart
Registration message format (JSON):

```json
{"myPresenceSubject":"client-subject"}
```
Example (using NATS CLI):

```sh
# Register a client
nats pub presence.register.my-cluster '{"myPresenceSubject":"my-service.instance-1"}'

# Client must also subscribe and respond to pings
nats reply my-service.instance-1 PONG
```
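In application code, the same client-side contract might look like this jnats sketch. The subject names and JSON shape come from this README; everything else is illustrative:

```java
import io.nats.client.Connection;
import io.nats.client.Dispatcher;
import io.nats.client.Nats;

import java.nio.charset.StandardCharsets;

public class PresenceClientSketch {
    public static void main(String[] args) throws Exception {
        String presenceSubject = "my-service.instance-1"; // unique per client
        Connection nc = Nats.connect("nats://localhost:4222");

        // Subscribe to our own subject and answer every ping
        Dispatcher d = nc.createDispatcher(msg -> {
            if (msg.getReplyTo() != null) {
                nc.publish(msg.getReplyTo(), "PONG".getBytes(StandardCharsets.UTF_8));
            }
        });
        d.subscribe(presenceSubject);

        // Register with the coordinator cluster (re-send this after any restart)
        String registration = "{\"myPresenceSubject\":\"" + presenceSubject + "\"}";
        nc.publish("presence.register.my-cluster",
                registration.getBytes(StandardCharsets.UTF_8));

        Thread.currentThread().join(); // keep running so pings get answered
    }
}
```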
## Running the Coordinator

### Command Line Options
```
Required:
  -u, --nats-url <url>             NATS server URL(s), comma-separated for cluster
      --cluster-id <id>            Cluster identifier (all coordinators must match)

Optional:
      --coordinator-id <id>        Unique instance ID (default: auto-generated UUID)
      --leader-timeout <ms>        Leader heartbeat timeout (default: 2000)
      --election-backoff-min <ms>  Min election backoff (default: 150)
      --election-backoff-max <ms>  Max election backoff (default: 300)
  -v, --verbose                    Enable verbose logging
```
Subject names are automatically derived from the cluster-id:
- Registration: `presence.register.{cluster-id}`
- Reports: `presence.report.{cluster-id}`
### Example Deployment
Start three coordinator instances for high availability:
```sh
# Instance 1
./start-server.sh --nats-url nats://nats1:4222,nats://nats2:4222,nats://nats3:4222 \
  --cluster-id production --coordinator-id coord-1

# Instance 2
./start-server.sh --nats-url nats://nats1:4222,nats://nats2:4222,nats://nats3:4222 \
  --cluster-id production --coordinator-id coord-2

# Instance 3
./start-server.sh --nats-url nats://nats1:4222,nats://nats2:4222,nats://nats3:4222 \
  --cluster-id production --coordinator-id coord-3
```
### Docker

```sh
podman run -it nats-presence-coordinator:1.0 \
  --nats-url nats://nats-server:4222 \
  --cluster-id my-cluster
```
## Consuming Presence Reports

The coordinator publishes binary presence reports to `presence.report.{cluster-id}`. Report types:
- `NodePresent` — Client just registered
- `NodeDead` — Client failed to respond to pings
- `NodeState` — Periodic status update for each active client (every 1.5 seconds)
Example:

```sh
# Subscribe to presence reports for cluster "production"
nats sub presence.report.production
```
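In application code, a plain core NATS subscription is enough to receive the reports. A minimal jnats sketch follows; since the binary report encoding is defined by the coordinator library and not documented here, the payload is left as raw bytes:

```java
import io.nats.client.Connection;
import io.nats.client.Dispatcher;
import io.nats.client.Nats;

public class ReportConsumerSketch {
    public static void main(String[] args) throws Exception {
        Connection nc = Nats.connect("nats://localhost:4222");
        Dispatcher d = nc.createDispatcher(msg -> {
            // Raw binary report; decoding requires the coordinator's report types
            System.out.println("report on " + msg.getSubject()
                    + " (" + msg.getData().length + " bytes)");
        });
        d.subscribe("presence.report.production");
        Thread.currentThread().join();
    }
}
```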
## Why JetStream for Client Registration?
Client registrations are persisted to a JetStream stream rather than using core NATS subjects. This is a deliberate design choice for immediate consistency during leader failover.
### The Problem
When a coordinator leader fails, a new leader must take over. Without persistence, the new leader has no knowledge of which clients are registered.
### Alternative: Core NATS with Client Re-registration
An alternative design would have clients periodically re-register (e.g., every 30 seconds). The new leader would start with an empty state and learn about clients as they re-register. This works, but creates a "blind period" after failover where:
- The coordinator cannot publish accurate presence reports
- Client deaths during the gap go undetected (nothing to ping)
- Downstream consumers see inconsistent state
### Current Design: JetStream Persistence
With JetStream, the new leader loads all pending registrations immediately on activation. This provides:
- Instant state recovery — no blind period after failover
- Correct death detection — the coordinator knows all clients from the start
- Consistent reporting — downstream consumers see accurate state immediately
The trade-off is infrastructure complexity (requires JetStream-enabled NATS), but the benefit is correctness guarantees that matter for presence monitoring systems.
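To make the recovery path concrete, here is a hedged sketch of how a newly elected leader might drain pending registrations from the WorkQueue stream on activation. The durable consumer name and batch size are illustrative, and the project's actual recovery code may differ:

```java
import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamSubscription;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.PullSubscribeOptions;

import java.time.Duration;
import java.util.List;

public class FailoverRecoverySketch {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            JetStream js = nc.jetStream();
            JetStreamSubscription sub = js.subscribe(
                    "presence.register.my-cluster",
                    PullSubscribeOptions.builder().durable("registration-loader").build());

            // Fetch whatever accumulated while no leader was active
            List<Message> pending = sub.fetch(100, Duration.ofSeconds(1));
            for (Message msg : pending) {
                // parse the registration JSON and add the client to the ping schedule
                msg.ack(); // WorkQueue retention: ack removes the message
            }
        }
    }
}
```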