

Building Presence That Actually Works

Why naive presence drifts within minutes, and how MQTT's last-will primitive fixes it for free.

November 18, 2025
6 min read
By CloudSignal Team

The first time we shipped presence in a side project, the dot next to each user was lying within fifteen minutes. People who had closed their laptops still showed as online. People who were typing in another tab showed as away. A user who had crashed the app two hours earlier was still listed as active. The product looked broken, and the bug was not in our UI. It was in the assumption that we could detect “online” by counting heartbeats from a client we did not control.

The presence drift problem

The obvious presence design is a heartbeat every N seconds, with the user marked offline after M missed beats. It is the design you reach for in an afternoon, and it accumulates ghost users by the next morning. Mobile devices pause JavaScript when the screen locks, so heartbeats stop firing for minutes at a time even though the user is still there. Background tab throttling in Chrome cuts timers to at most once per second as soon as a tab is hidden, and to once per minute after it has been hidden for several minutes, so a 10-second heartbeat can quietly become a 60-second heartbeat for someone who switched tabs. NAT timeouts on carrier networks silently kill the underlying TCP connection without telling either side. App crashes skip the goodbye message entirely, leaving the user marked online until a timeout catches it, or forever if nothing does.
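To make the failure mode concrete, here is a minimal sketch of the heartbeat design described above. The names (`HeartbeatTracker`, `intervalMs`, `maxMissed`) are illustrative and not taken from any library, and the clock is passed in explicitly so the two failure cases are easy to follow.

```javascript
// Naive heartbeat presence: a user is "online" if a beat arrived recently enough.
// All names here are hypothetical, for illustration only.
class HeartbeatTracker {
  constructor(intervalMs, maxMissed) {
    this.timeoutMs = intervalMs * maxMissed; // silence tolerated before "offline"
    this.lastBeat = new Map(); // userId -> timestamp of last heartbeat (ms)
  }

  beat(userId, nowMs) {
    this.lastBeat.set(userId, nowMs);
  }

  isOnline(userId, nowMs) {
    const last = this.lastBeat.get(userId);
    return last !== undefined && nowMs - last <= this.timeoutMs;
  }
}

// 10-second heartbeat, offline after 3 missed beats (30 s of silence).
const tracker = new HeartbeatTracker(10_000, 3);

// Alice's tab is throttled and her next beat is late: still present, shown offline.
tracker.beat('alice', 0);
const throttledButPresent = tracker.isOnline('alice', 45_000); // false

// Bob's app crashed right after a beat: gone, but still shown online.
tracker.beat('bob', 0);
const crashedButShown = tracker.isOnline('bob', 25_000); // true
```

Both answers are wrong for the user in question, and no single value of `maxMissed` fixes both at once.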

Picture a user who joins a chat room from their phone on the subway. The train enters a tunnel, their connection drops, the app gets backgrounded as they switch to a podcast. Your heartbeat misses two beats, then three, then ten. Your server has no way to tell whether they crashed, whether the network is slow, or whether they walked away. Whatever threshold you pick will be wrong for somebody. Make it too short and you flap users offline every time they ride an elevator. Make it too long and ghost users pile up faster than they leave.

Last-will-and-testament as a primitive

MQTT solves this at the protocol level with last-will-and-testament. When a client connects, it hands the broker a goodbye message: a topic, a payload, a QoS level, a retain flag. The broker holds that message and publishes it when the connection ends without a clean DISCONNECT, whether that is a TCP reset, a keepalive timeout, or the process being killed with kill -9. The client never has to detect its own disconnect. The server never has to count heartbeats. The broker is the only party in a position to know the connection died, and it acts on that knowledge automatically.

LWT is part of the MQTT specification, supported by every compliant broker, and costs nothing extra to use. You get it by setting one field on connect. The disconnect-detection logic you were going to write (the timer wheels, the threshold tuning, the retry math) all evaporates into a primitive the broker already implements.
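One detail worth spelling out is how quickly the will fires when the network dies silently. A TCP reset is noticed immediately, but a dropped link is only noticed through keepalive: the MQTT specification has the broker close a connection that has sent nothing for one and a half times the Keep Alive interval. The helper below is a hypothetical illustration of that bound, not part of any library.

```javascript
// Upper bound on how long a silently dead connection can look alive to the broker.
// Per the MQTT spec, the broker drops a client that sends nothing for 1.5x Keep Alive.
function worstCaseDetectionSeconds(keepaliveSeconds) {
  if (keepaliveSeconds === 0) return Infinity; // a Keep Alive of 0 disables the mechanism
  return keepaliveSeconds * 1.5;
}

// mqtt.js takes the interval in seconds via the `keepalive` connect option.
const connectOptions = { keepalive: 20 };
const boundSeconds = worstCaseDetectionSeconds(connectOptions.keepalive); // 30
```

A shorter keepalive tightens the bound at the cost of more ping traffic, which is a far smaller tuning surface than a hand-rolled heartbeat.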

What we publish

Here is the connection setup we use in the reference chat app, with mqtt.js:

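import mqtt from 'mqtt';

// brokerUrl, userId, orgShortId and token are assumed to be defined elsewhere in the app.
const presenceTopic = `presence/${orgShortId}/${userId}`;

const client = mqtt.connect(brokerUrl, {
  username: `${userId}@${orgShortId}`,
  password: token,
  // Last will: the broker publishes this if the connection ends uncleanly.
  will: {
    topic: presenceTopic,
    payload: '{"status":"offline"}',
    qos: 1,
    retain: true, // keep "offline" as the stored state for late subscribers
  },
});

// Fires on every (re)connect, so the retained state flips back to online.
client.on('connect', () => {
  client.publish(presenceTopic, '{"status":"online"}', { qos: 1, retain: true });
});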

The retain: true flag is the part that matters most here. Without it, the offline message would fire once and disappear, useful only to clients who happened to be subscribed at that exact instant. With retain set, the broker stores the most recent presence value for each user and hands it to anyone who subscribes later. A teammate who opens the chat ten minutes after you crash gets your current offline status delivered automatically as part of their subscription.
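The will only covers unclean exits. When a client sends a normal DISCONNECT, the broker discards the will without publishing it, so a deliberate sign-out has to write the retained offline state itself. A minimal sketch, assuming a client object with mqtt.js-style `publish(topic, payload, options, callback)` and `end()` methods; `signOut` is a hypothetical helper name.

```javascript
// Publish the retained "offline" state ourselves, then close cleanly.
// `client` is assumed to expose mqtt.js-style publish(topic, payload, opts, cb) and end().
function signOut(client, presenceTopic) {
  return new Promise((resolve, reject) => {
    client.publish(
      presenceTopic,
      '{"status":"offline"}',
      { qos: 1, retain: true },
      (err) => {
        if (err) return reject(err);
        client.end(); // clean DISCONNECT: the broker drops the will without sending it
        resolve();
      }
    );
  });
}
```

Calling `client.end()` only after the publish callback keeps the offline message from being lost in the shutdown.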

Retained messages for late joiners

Retained messages turn presence into a query-free system. Without them, a client opening a chat room has to call a presence API to get the current state of everyone, then subscribe to a presence topic to get updates from that point forward. Those two steps are not atomic. Anything that happens between the API response and the subscription confirmation is lost, and you have a race condition that gets worse the more users a room has.

With retained messages, the subscription itself is the query. When a client subscribes to presence/+/+, the broker walks the retained set and delivers the most recent presence value for every matching topic, then continues to deliver updates as they arrive. One operation, no race, no separate API call. The presence state is wherever the messages live, and the broker is the source of truth. We do not run a presence service. We run a topic pattern.
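On the receiving side, the same handler can process both the retained snapshot and the live updates, because they arrive as ordinary messages. The sketch below folds them into a roster map. It assumes the `presence/<org>/<user>` topic layout and JSON payload from the connect example; `applyPresenceMessage` and `roster` are hypothetical names.

```javascript
// Fold presence messages into a map keyed by "org/user".
// Retained and live messages arrive through the same path, so one reducer handles both.
function applyPresenceMessage(roster, topic, payload) {
  const parts = topic.split('/');
  if (parts.length !== 3 || parts[0] !== 'presence') return roster; // not a presence topic
  try {
    const { status } = JSON.parse(payload.toString());
    roster.set(`${parts[1]}/${parts[2]}`, status);
  } catch {
    // Ignore malformed payloads rather than crash the UI.
  }
  return roster;
}

// Wiring with an mqtt.js-style client (assumed, not executed here):
//   client.subscribe('presence/+/+', { qos: 1 });
//   client.on('message', (topic, payload) => applyPresenceMessage(roster, topic, payload));

const roster = new Map();
applyPresenceMessage(roster, 'presence/acme/alice', '{"status":"online"}');  // retained snapshot
applyPresenceMessage(roster, 'presence/acme/bob', '{"status":"offline"}');   // retained snapshot
applyPresenceMessage(roster, 'presence/acme/alice', '{"status":"offline"}'); // live update
```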

Where this still breaks

Last-will gets you most of the way there, and it is only honest to name the gaps. Multi-device users are the first one. If the same person is connected on both their phone and their laptop, you have two clients publishing to the same topic, and the last one to write wins. The phone goes offline, fires its LWT, and now your UI says the user is offline even though their laptop is still active. The fix is to scope topics by device or session ID and aggregate on read, but that is application logic on top of the primitive, not part of it.
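As a rough sketch of that aggregate-on-read fix, assume each device publishes to its own topic with an extra `<deviceId>` level. That layout, and the helper names below, are hypothetical and not the scheme used in the connect example.

```javascript
// Per-device presence, aggregated on read: a user is online if any device is.
// Assumed topic layout: presence/<org>/<user>/<deviceId>
function applyDeviceStatus(devicesByUser, topic, status) {
  const [, org, user, device] = topic.split('/');
  const key = `${org}/${user}`;
  if (!devicesByUser.has(key)) devicesByUser.set(key, new Map());
  devicesByUser.get(key).set(device, status);
}

function userStatus(devicesByUser, key) {
  const devices = devicesByUser.get(key) ?? new Map();
  return [...devices.values()].includes('online') ? 'online' : 'offline';
}

const devicesByUser = new Map();
applyDeviceStatus(devicesByUser, 'presence/acme/alice/phone', 'online');
applyDeviceStatus(devicesByUser, 'presence/acme/alice/laptop', 'online');
applyDeviceStatus(devicesByUser, 'presence/acme/alice/phone', 'offline'); // phone's will fires

const aliceStatus = userStatus(devicesByUser, 'acme/alice'); // 'online', laptop is still up
```

Each device's will now only overwrites that device's own retained value, so one device dropping no longer masks the others.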

The “away” state has no clean MQTT signal. The broker knows whether a connection is alive. It does not know whether the user is staring at the screen or has wandered off to make coffee. We fake away with a client-side activity timer that publishes a status change after a few minutes of no input. It works, but it is the heartbeat problem in miniature.
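A client-side activity timer of that kind might look like the sketch below. The class name, the five-minute threshold, and the callback shape are all illustrative assumptions; in a real client the callback would publish a retained status message.

```javascript
// Client-side "away" detection: report a status change when input has been quiet too long.
// Names and the five-minute threshold are illustrative assumptions.
class ActivityTracker {
  constructor(idleAfterMs, onChange) {
    this.idleAfterMs = idleAfterMs;
    this.onChange = onChange; // called with 'online' or 'away' on each transition
    this.lastInputMs = 0;
    this.status = 'online';
  }

  // Call from input handlers (keydown, pointermove, ...).
  recordInput(nowMs) {
    this.lastInputMs = nowMs;
    this.setStatus('online');
  }

  // Call periodically, e.g. from setInterval.
  check(nowMs) {
    if (nowMs - this.lastInputMs >= this.idleAfterMs) this.setStatus('away');
  }

  setStatus(next) {
    if (next === this.status) return; // publish only on transitions
    this.status = next;
    this.onChange(next);
  }
}

const published = [];
const activity = new ActivityTracker(5 * 60_000, (status) => published.push(status));
activity.recordInput(0);              // already online: nothing published
activity.check(4 * 60_000);           // 4 min idle: still online
activity.check(5 * 60_000);           // 5 min idle: publishes 'away'
activity.recordInput(5 * 60_000 + 1); // input resumes: publishes 'online'
```

The periodic `check` call depends on a client timer, so it inherits the same throttling and suspension problems as any heartbeat.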

Very large rooms with high churn create fanout pressure. Every presence change is delivered to every subscriber, so a room with ten thousand subscribers that sees five to ten presence changes a second is fifty thousand to a hundred thousand messages a second going out. That exceeds the throughput a single connection should reasonably consume on most tiers. For rooms past a certain size, presence becomes a counter, not a list.
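The arithmetic behind those figures is a plain multiplication: each presence change is delivered once per subscriber. A small sketch, with `fanoutPerSecond` as a hypothetical helper name:

```javascript
// Outbound fanout for a presence topic: every change is delivered to every subscriber.
function fanoutPerSecond(subscribers, changesPerSecond) {
  return subscribers * changesPerSecond;
}

const atOneChange = fanoutPerSecond(10_000, 1);   // 10000
const atFiveChanges = fanoutPerSecond(10_000, 5); // 50000
const atTenChanges = fanoutPerSecond(10_000, 10); // 100000
```

Fanout grows with both room size and churn, and large rooms tend to have more of both, which is why a per-user list stops being practical well before a counter does.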

What we want to add next

The same primitive that gives us online and offline gives us everything else that lives on a short-lived topic. Typing indicators are a one-message publish with no retain and a TTL of a few seconds. Cursor sharing is the same shape with higher frequency and a smaller payload. Read receipts and reaction bursts fit the same model. We want to ship reference components for these on top of the presence transport, so a no-code builder can drop in a chat panel that shows who is online, who is typing, and where their cursor is, without standing up new infrastructure for each one. The broker already does the work. We are still figuring out the cleanest way to expose it.
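A typing indicator along those lines could be sketched as follows. This is an assumption about how such a component might look, not a shipped design: the topic layout, payload shape, and helper name are hypothetical. The TTL is expressed with the MQTT 5 message expiry interval, which mqtt.js accepts under `properties` on publish when the client connects with protocol version 5.

```javascript
// Build mqtt.js-style publish arguments for a short-lived typing indicator.
// The topic layout and payload shape are hypothetical.
function typingPublishArgs(orgShortId, roomId, userId, ttlSeconds = 5) {
  return {
    topic: `typing/${orgShortId}/${roomId}/${userId}`,
    payload: JSON.stringify({ typing: true }),
    options: {
      qos: 0,        // losing one indicator is harmless
      retain: false, // late joiners should not see stale typing state
      // MQTT 5 only: the broker drops the message if it is still undelivered after ttlSeconds.
      properties: { messageExpiryInterval: ttlSeconds },
    },
  };
}

const { topic, payload, options } = typingPublishArgs('acme', 'general', 'alice');
// With a connected MQTT 5 client: client.publish(topic, payload, options);
```

The expiry interval only limits how long the broker holds an undelivered message, so the receiving UI would still need its own short timer to clear the indicator.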
