Most teams that need real-time delivery do not start by deploying a broker. They start by adding a polling interval to a database query and shipping the feature. Pub/sub is the right tool, but the cost of standing it up usually pushes it out of scope. This post is about what it took, on our side, to make that cost go away.

What “pub/sub” usually costs you

The textbook reason to use pub/sub is decoupling: publishers do not know who is listening, subscribers do not know who is producing, and the broker routes between them. The reason teams skip it is everything that has to be true before that decoupling pays off. You pick a broker. You deploy at least two nodes so a single VM reboot is not an outage. You wire up TLS termination and rotate the certificates before they expire. You write ACL rules so a leaked client credential cannot subscribe to everyone else’s topics. You set up dashboards for queue depth, connection count, and dropped messages. You write the runbook for what to do when the broker falls over at 2am.

That is the setup tax, and it is why a lot of teams quietly fall back to HTTP polling. A 50-device IoT fleet polling once a second is 180,000 useless queries an hour against your database. Nobody is proud of this, but nobody has a free week to do it right either. The math only works if pub/sub is something you can rent, not something you have to operate.
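The polling figure above is simple arithmetic, sketched here so you can plug in your own fleet size and interval. The function name is ours, purely for illustration.

```javascript
// Back-of-the-envelope polling cost: every device issues one query per
// interval, whether or not anything has changed.
function pollingQueriesPerHour(deviceCount, pollIntervalSeconds) {
  const pollsPerDevicePerHour = 3600 / pollIntervalSeconds;
  return deviceCount * pollsPerDevicePerHour;
}

console.log(pollingQueriesPerHour(50, 1));  // 180000
console.log(pollingQueriesPerHour(50, 30)); // 6000
```

Stretching the interval cuts the query count, but only by trading it for latency, which is the thing real-time delivery was supposed to fix.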

Why we kept MQTT 5 underneath

We looked at a lot of options before landing on MQTT 5 as the substrate. The reason it won is that the things teams need from pub/sub are already in the protocol, and the protocol has been hardened by twenty years of running on devices that lose power, lose network, and lose patience.

Wildcard topics let one subscription cover a whole class of producers. A subscription to org_abc123/devices/+/status matches every device under that organisation in a single rule, because the + is a single-level wildcard that fills in for any device ID. Three QoS levels give you a per-message choice between fire-and-forget, at-least-once, and exactly-once delivery, instead of one global guarantee. Retained messages mean a newly connected client gets the current state of a topic on subscribe, without an explicit fetch round-trip. Shared subscriptions let multiple consumers cooperate as a load-balanced group, with the broker handling distribution. We did not have to invent any of this. We had to make it boring to use.
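To make the wildcard rule concrete, here is a minimal sketch of how a topic filter is matched against a topic name. It is an illustration of the MQTT matching rules, not our broker's code: the function name is hypothetical, and it leaves out details such as the special handling of topics that begin with $.

```javascript
// Returns true when `topic` is covered by the subscription `filter`.
// '+' matches exactly one level; '#' (last level only) matches the rest,
// including zero further levels.
function topicMatches(filter, topic) {
  const filterLevels = filter.split('/');
  const topicLevels = topic.split('/');

  for (let i = 0; i < filterLevels.length; i++) {
    if (filterLevels[i] === '#') {
      // '#' is only valid as the final level of a filter.
      return i === filterLevels.length - 1;
    }
    if (i >= topicLevels.length) return false; // topic ran out of levels
    if (filterLevels[i] !== '+' && filterLevels[i] !== topicLevels[i]) {
      return false;
    }
  }
  // No '#' seen, so both sides must have the same number of levels.
  return filterLevels.length === topicLevels.length;
}

const filter = 'org_abc123/devices/+/status';
console.log(topicMatches(filter, 'org_abc123/devices/dev42/status'));  // true
console.log(topicMatches(filter, 'org_abc123/devices/dev42/battery')); // false
console.log(topicMatches(filter, 'org_abc123/devices/a/b/status'));    // false
```

The last case is the one that surprises people: + covers a single level only, so a device ID that contains a slash will not match.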

The boring parts we hid

Operating a broker for someone else is not glamorous, but it is most of what we do. TLS termination happens at our edge, with certificates rotated automatically before they expire. ACL evaluation is deny-by-default, so a new credential can do nothing until you grant it something. Last-will-and-testament messages are dispatched the moment the broker detects a dead connection, so presence signals are reliable rather than approximate. Every organisation is isolated through mountpoints, which means a leaked credential in one tenant cannot publish to or subscribe to topics in another. That last property matters because shared infrastructure is a security argument long before it is an efficiency argument. Our client SDKs reconnect with exponential backoff and resubscribe on reconnect, so a flaky mobile network looks like a slightly delayed message rather than a dropped session. None of this is novel. All of it has to work every time, which is the part that costs you real engineering hours when you run it yourself.
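For a sense of what exponential backoff means in practice, here is a minimal sketch of a reconnect schedule. It is not the SDK's actual implementation: the one-second base, the thirty-second cap, the jitter strategy, and both function names are assumptions chosen for illustration.

```javascript
// Upper bound on the wait before reconnect attempt `attempt` (0-based).
// Doubles each time until it hits the cap.
function backoffCeilingMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// "Full jitter": wait a random time in [0, ceiling) so that a fleet of
// clients that dropped together does not reconnect in lockstep.
// `random` is injectable to keep the function testable.
function reconnectDelayMs(attempt, random = Math.random) {
  return Math.floor(random() * backoffCeilingMs(attempt));
}

const ceilings = [0, 1, 2, 3, 4, 5].map((n) => backoffCeilingMs(n));
console.log(ceilings); // [ 1000, 2000, 4000, 8000, 16000, 30000 ]
```

The cap matters as much as the doubling: without it, a client that has been offline for an hour would wait far too long to notice that the network came back.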

What you publish against

The point of all of this is that the surface area you write against stays small. Here is a publish from a Node service using mqtt.js:

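import mqtt from 'mqtt';

// Placeholder payload for illustration; in a real service this comes from your order flow.
const order = { id: 'ord_123' };

const client = mqtt.connect('mqtts://connect.cloudsignal.app:8883', {
  username: 'service@org_abc123',
  password: process.env.CS_TOKEN,
});

client.on('connect', () => {
  // QoS 1 asks for at-least-once delivery: the broker acknowledges the publish.
  client.publish('org_abc123/orders/created', JSON.stringify(order), { qos: 1 });
});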

That is the entire integration. Your code does not track acknowledgements, maintain a connection state machine, or negotiate retries against a queue depth metric it never sees; the client library and the broker do that for you. It connects, it publishes, and the broker takes responsibility for delivery from there. Subscribers on the other end of the topic get the message when they are online, and the broker holds it for them if they are not, provided they subscribed at QoS 1 or higher with a persistent session.

The tradeoffs we accepted

There are real costs to using pub/sub as a primitive, and we would rather state them up front than have you discover them in production. QoS 0 is fire-and-forget, and if you choose it because it is cheap, you are choosing to lose messages on disconnect. That is correct behaviour, not a bug, but it is a choice you have to make explicitly. MQTT has no built-in request/response call. MQTT 5 adds Response Topic and Correlation Data properties to help, but you still build the pattern yourself: the requester publishes on a request topic, names a reply topic in the message, and subscribes to that reply topic for the answer. Per-organisation fanout has tier-based limits, because broker-side multicast is not free, and we would rather meter it honestly than oversell it and degrade everyone. Topic naming is your design problem. We can suggest hierarchies, but the namespace is yours, and a sloppy one will hurt you the same way a sloppy database schema does.
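The request/response pattern described above can be sketched without a broker. The example below uses a tiny in-memory stand-in (FakeBus, our invention, with exact-topic matching only) so the flow is visible end to end. The topic names, the message envelope fields, and the function names are all hypothetical. A real implementation would also add timeouts and unsubscribe once the reply arrives.

```javascript
// In-memory stand-in for a broker: exact-topic matching, synchronous delivery.
class FakeBus {
  constructor() {
    this.handlers = new Map();
  }
  subscribe(topic, handler) {
    if (!this.handlers.has(topic)) this.handlers.set(topic, []);
    this.handlers.get(topic).push(handler);
  }
  publish(topic, message) {
    for (const handler of this.handlers.get(topic) ?? []) handler(message);
  }
}

let nextRequestId = 0;

// Requester: subscribe to a reply topic first, then publish a request
// that names that topic and carries a correlation ID.
function request(bus, requestTopic, replyTopic, payload, onReply) {
  const correlationId = `req-${++nextRequestId}`;
  bus.subscribe(replyTopic, (reply) => {
    // Ignore replies meant for other in-flight requests on the same topic.
    if (reply.correlationId === correlationId) onReply(reply.payload);
  });
  bus.publish(requestTopic, { replyTo: replyTopic, correlationId, payload });
}

// Responder: answer each request on whatever reply topic it names.
function serve(bus, requestTopic, handler) {
  bus.subscribe(requestTopic, (msg) => {
    bus.publish(msg.replyTo, {
      correlationId: msg.correlationId,
      payload: handler(msg.payload),
    });
  });
}

const bus = new FakeBus();
serve(bus, 'org_abc123/pricing/request', (p) => ({ sku: p.sku, price: 42 }));

let answer;
request(
  bus,
  'org_abc123/pricing/request',
  'org_abc123/pricing/reply/client-1',
  { sku: 'A1' },
  (reply) => { answer = reply; },
);
console.log(answer); // { sku: 'A1', price: 42 }
```

The correlation ID is what lets one reply topic serve many concurrent requests; without it, the requester cannot tell which answer belongs to which question.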

What we want to add next

The direction we are heading is broker-side logic that does not require a separate worker: routing rules that transform a payload as it passes through, fan a single publish out into derived topics, or drop messages that fail a schema check before they reach a subscriber. The shape is still moving, and we are not going to commit to a window for it. It is on the list because every customer who has reached for a serverless function to do trivial reshaping in front of a subscriber is a signal that the broker should have been able to do it directly. When we ship it, it should feel like the same primitive, with one more thing it does on your behalf.
