Why typing indicators matter for AI chat

The first token from a large model can take anywhere from 800 milliseconds to three seconds to land in the browser. That window is small in absolute terms and enormous in user-perception terms. Without something on the screen during those seconds, users assume the app is broken. They click send a second time. They refresh the page. They open the dev tools to see if the request even went out. We watched this happen in our own reference chat app before we wired any state signals into it, and the pattern repeated on every cold model call.

Typing indicators bridge the silence. They are not decoration. They are the contract that says the request was received, the system is doing something, and the user does not need to retry. The interesting part for AI chat is that the indicator the user sees should not be the same one a human chat app shows. “User is typing” implies another person is at a keyboard. “Model is thinking” implies a system is processing. Conflating those two signals trains users to misread both. We treat them as separate states with separate visuals, and that single decision removed most of the confusion in early testing.

Three states worth surfacing

Three states matter, and each carries different meaning to the person looking at the chat panel.

The first is “user typing.” Someone is composing input into the chat. This is the classic indicator, useful primarily in shared sessions where more than one human is in the room: a copilot pair-programming view, a support agent watching a customer compose a reply, or a collaborative chat where teammates draft together. In a solo session it is unnecessary, because the user already knows they are typing.

The second is “model thinking.” The request hit the inference backend, but no tokens have streamed yet. This is the dead window the backend needs to allocate capacity, prefill the prompt, and start generating. It is bounded but variable.

The third is “model streaming.” Tokens are arriving. The shape of the indicator changes here: instead of a pulsing dot, the user sees the response appearing character by character, which is its own form of activity signal. We still publish a state event so observers in other tabs or other devices know what is happening even if they cannot yet see the tokens.

Each state is a different topic event with a distinct meaning, and we treat them that way end to end.
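As a minimal sketch, the three states can be modeled as a small closed set of actor and state pairs. The names Actor, State, ALLOWED, and is_valid are illustrative assumptions, as are the string values "user", "typing", and "thinking"; only "model" and "streaming" appear in the payload shown in this article.

```python
from enum import Enum


class Actor(str, Enum):
    USER = "user"    # assumed value for a human participant
    MODEL = "model"


class State(str, Enum):
    TYPING = "typing"        # a human is composing input
    THINKING = "thinking"    # request accepted, no tokens streamed yet
    STREAMING = "streaming"  # tokens are arriving


# Which states each kind of actor may occupy (assumed mapping).
ALLOWED = {
    Actor.USER: {State.TYPING},
    Actor.MODEL: {State.THINKING, State.STREAMING},
}


def is_valid(actor: Actor, state: State) -> bool:
    """Return True if this kind of actor can be in this state."""
    return state in ALLOWED[actor]
```

Keeping the set closed makes it easy to reject nonsense combinations, such as a user in the thinking state, before they reach the UI.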

How we publish them

We use one topic per chat session for state. The shape is chat/{org}/{session_id}/state, and the payload describes which actor is in which state. A typical event looks like this:

{
  "actor": "model",
  "state": "streaming",
  "ts": 1740432000000
}

That is the entire payload. Three fields, no nesting. Consumers subscribe to the state topic with QoS 0, because state events are inherently recoverable. If a typing indicator event is dropped, the next event corrects the view within milliseconds. There is no reason to pay for QoS 1 delivery on a signal whose only job is to render the current moment, and the broker-side cost is meaningfully lower at QoS 0 when sessions are large.
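A rough sketch of building and sending one of these events follows. The helper names (state_topic, state_payload, publish_state) are hypothetical, and the publish callable is injected so the example does not depend on any particular MQTT client library.

```python
import json
import time
from typing import Callable, Optional

QOS_STATE = 0  # state events are recoverable, so fire-and-forget is enough


def state_topic(org: str, session_id: str) -> str:
    """Build the per-session state topic: chat/{org}/{session_id}/state."""
    return f"chat/{org}/{session_id}/state"


def state_payload(actor: str, state: str, ts_ms: Optional[int] = None) -> str:
    """Serialize the flat three-field payload; ts is epoch milliseconds."""
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    return json.dumps({"actor": actor, "state": state, "ts": ts_ms})


def publish_state(
    publish: Callable[[str, str, int], None],
    org: str,
    session_id: str,
    actor: str,
    state: str,
    ts_ms: Optional[int] = None,
) -> None:
    """Send one state event through an injected publish(topic, payload, qos)."""
    topic = state_topic(org, session_id)
    publish(topic, state_payload(actor, state, ts_ms), QOS_STATE)
```

Any client whose publish method takes a topic, a payload, and a QoS level in that order can be passed in directly; a plain function that appends to a list works for tests.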

Token streaming itself rides a separate topic with its own QoS settings. State events are about presence and intent; tokens are the content. Keeping them on different topics means a slow consumer of one does not back-pressure the other, and it lets observers subscribe to state only when they do not need the content stream.

Avoiding flicker on transitions

Naive rendering of state events flickers. If the user stops typing, the request goes out, and the model starts streaming within a hundred milliseconds, you will see three indicators flash through in sequence: typing dots disappear, a thinking spinner appears for one frame, the streaming cursor takes over. On a fast network with a warm model, this happens often enough that users notice the visual stutter even if they cannot describe it.

We coalesce on both sides. On the publishing side, the SDK debounces back-to-back state writes within a short window so two events that arrive inside fifty milliseconds get collapsed into the later one. On the rendering side, the UI waits roughly fifty milliseconds before un-rendering an indicator, so a fast transition through an intermediate state does not visibly flash. The two debounces are independent; one protects the network, the other protects the eye.
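The two debounces might look roughly like this. Both classes are hypothetical sketches, not the SDK's code, and they take explicit millisecond timestamps instead of real timers so the behavior is easy to follow and test. The coalescing window here is measured from the first pending write, which is one of several reasonable choices.

```python
from typing import Optional


class PublishCoalescer:
    """Collapse state writes that land inside window_ms into the later one."""

    def __init__(self, window_ms: int = 50):
        self.window_ms = window_ms
        self._pending: Optional[dict] = None
        self._first_at = 0

    def write(self, event: dict, now_ms: int) -> None:
        # The window starts at the first pending write; a newer event
        # simply replaces the pending one.
        if self._pending is None:
            self._first_at = now_ms
        self._pending = event

    def flush(self, now_ms: int) -> Optional[dict]:
        """Return the event to publish once the window has elapsed, else None."""
        if self._pending is not None and now_ms - self._first_at >= self.window_ms:
            event, self._pending = self._pending, None
            return event
        return None


class IndicatorView:
    """Keep the drawn indicator until a new state has been stable for hold_ms."""

    def __init__(self, hold_ms: int = 50):
        self.hold_ms = hold_ms
        self.visible: Optional[str] = None  # what is currently drawn
        self._target: Optional[str] = None  # latest state received
        self._changed_at = 0

    def on_state(self, state: str, now_ms: int) -> None:
        if state != self._target:
            self._target = state
            self._changed_at = now_ms

    def tick(self, now_ms: int) -> Optional[str]:
        """Call once per frame; swap indicators only after the hold elapses."""
        if self.visible != self._target and now_ms - self._changed_at >= self.hold_ms:
            self.visible = self._target
        return self.visible
```

With these defaults, a thinking state that is replaced by streaming 30 milliseconds later is never drawn: the view keeps showing the typing indicator until streaming has held for the full window.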

Ordering matters here. State events that a given client publishes to a single topic arrive in publish order, which is a property of MQTT we lean on. If we sharded across multiple topics, we would have to reconcile timestamps ourselves and the flicker logic would get much harder. One topic, one stream, one order. That is the trade we made deliberately.

Multi-device sessions

A common case: the same user has the chat open in two browser tabs, or on their laptop and their phone at once. What should presence show?

We treat presence as a property of the user, not of the device. The user is present if any of their devices is connected. We do not render duplicate avatars or “Alex on laptop, Alex on phone” in the participant list. The implementation aggregates client connections by user ID and reports a single state per user to the UI layer. This matches user mental models. People do not think of themselves as multiple participants when they switch devices, and the UI should not surface that detail as if it were meaningful.
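The aggregation described above can be sketched as a map from user ID to the set of connected clients. PresenceTracker and its method names are hypothetical; the article does not show the real implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set


class PresenceTracker:
    """Report one presence entry per user, however many devices are connected."""

    def __init__(self) -> None:
        # user_id -> IDs of that user's currently connected clients
        self._devices: Dict[str, Set[str]] = defaultdict(set)

    def connect(self, user_id: str, client_id: str) -> None:
        self._devices[user_id].add(client_id)

    def disconnect(self, user_id: str, client_id: str) -> None:
        devices = self._devices.get(user_id)
        if devices is None:
            return
        devices.discard(client_id)
        if not devices:
            # Last device gone: the user is no longer present.
            del self._devices[user_id]

    def present_users(self) -> List[str]:
        """User IDs with at least one connected device, sorted for stable output."""
        return sorted(self._devices)
```

The UI layer only ever sees the output of present_users, so closing one of two tabs changes nothing in the participant list.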

The model-thinking and model-streaming states are different. Those are per-session, not per-device. A single inference request is in flight on the backend regardless of how many devices are subscribed to its session. One thinking state, one streaming state, fanned out to every observer.

The richer case is collaborative chat: two human users plus the model, all in the same session. That gives three actors, each able to occupy any of their respective states independently. The topic shape supports it without changes, because the actor field already disambiguates. The UI work is where the complexity moves; rendering three concurrent state indicators in a way that is informative rather than busy is a design problem, not a transport one.
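On the consuming side, tracking several actors can be as simple as folding events into a map keyed by the actor field. This sketch rests on two assumptions that the article does not confirm: human actors are identified by values such as "user:alex", and an "idle" state clears an actor's indicator.

```python
from typing import Dict, Iterable


def reduce_states(events: Iterable[dict]) -> Dict[str, str]:
    """Fold state events, in arrival order, into the latest state per actor."""
    current: Dict[str, str] = {}
    for event in events:
        if event["state"] == "idle":  # hypothetical "no indicator" state
            current.pop(event["actor"], None)
        else:
            current[event["actor"]] = event["state"]
    return current
```

Because the fold relies only on arrival order, it needs no timestamp comparison as long as all events come from the single state topic.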

What we want to add next

State signals are a starting layer. The interesting next signals live one step deeper into what the model is actually doing. Tool-use steps are the obvious one: when the model decides to call a function, we want to publish a state event that names the tool and shows the user that the system is fetching weather, searching a knowledge base, or running a calculation. That information already exists inside the agent loop; it just needs a topic.

Retrieval-source citations are another. As a RAG pipeline pulls documents, surfacing the sources in real time, before the answer arrives, gives the user something to read and something to trust. Both signals fit the same chat/{org}/{session_id}/state shape with a richer payload. We have not committed to a schema yet, and we will not until we see a few real applications using the basic three states in production. The transport is in place. The semantics are what is left to figure out.

Presence and Typing Indicators for AI Chat Sessions
