What rate limiting is for
Rate limiting has three jobs that often get conflated. The first is protection — keeping a single caller from saturating a backend that has finite capacity. The second is fairness — making sure that one noisy caller does not push other legitimate callers into degraded service. The third is commercial enforcement — reflecting paid quotas in the runtime so that a free-tier client cannot consume what a paid-tier client paid for.
These three goals pull in slightly different directions. A pure protection limiter can afford to be lenient in burst handling because the backend has its own back-pressure. A fairness limiter has to be more strict, because by the time the backend is hurting it is already too late for the well-behaved callers. A commercial limiter is the strictest of all because the meter has to match the bill.
The four algorithms in common use
Fixed window
Count requests per caller per fixed interval (per minute, per hour). Reject requests once the count for the current window exceeds the limit. The window resets at a clock boundary.
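The whole mechanism fits in a few lines. A minimal in-memory sketch (a real deployment would keep the counter in a shared store such as Redis so every instance sees the same count):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (caller, window index) -> count

    def allow(self, caller, now=None):
        now = time.time() if now is None else now
        # integer division gives the window index; it resets at clock boundaries
        bucket = (caller, int(now // self.window))
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True
```

Note that the counter keys on the clock-aligned window index, which is exactly where the boundary problem described below comes from.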
Why people pick it: trivial to implement — a single counter per caller per window, an integer increment, and a TTL on the key. Easy to reason about ("100 requests per minute") and easy to expose to customers.
What goes wrong: the boundary problem. A caller that sends 100 requests in the last second of one window and 100 in the first second of the next window has effectively used 200 requests per minute, but the limiter only ever saw two compliant windows. For protection this is harmless if the backend can absorb a 2× burst at boundaries; for fairness it lets a careful attacker pulse traffic exactly on the boundary and outpace honest callers.
Sliding window
Variant 1, sliding log: store a timestamp for every request from a caller and count entries newer than the window length. Variant 2, sliding window counter: keep a counter for the current window and the previous window, and weight the previous window by how much of it falls inside the current sliding interval.
The sliding-window counter is the practical compromise — it captures most of the precision of the log without needing a per-request timestamp store. Most production limiters built on Redis use this variant.
What goes wrong: the weighted approximation drifts during traffic spikes that are heavily weighted toward the start of a window. The drift is small and almost always tolerable, but it means the limiter is no longer mathematically exact.
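The weighting logic is the only subtle part. A single-caller sketch of the sliding-window counter (per-caller state and the shared store are omitted for brevity):

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.curr_idx = 0   # index of the current fixed window
        self.curr = 0       # count in the current fixed window
        self.prev = 0       # count in the previous fixed window

    def allow(self, now=None):
        now = time.time() if now is None else now
        idx = int(now // self.window)
        if idx != self.curr_idx:
            # roll forward; anything older than one window counts as zero
            self.prev = self.curr if idx == self.curr_idx + 1 else 0
            self.curr = 0
            self.curr_idx = idx
        # weight the previous window by how much of it still overlaps
        # the sliding interval ending at `now`
        elapsed = (now % self.window) / self.window
        estimate = self.prev * (1 - elapsed) + self.curr
        if estimate >= self.limit:
            return False
        self.curr += 1
        return True
```

The `estimate` line is where the approximation lives: it assumes the previous window's traffic was evenly spread, which is exactly the assumption that drifts during front-loaded spikes.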
Token bucket
Each caller has a bucket with a maximum capacity. Tokens are added to the bucket at a constant refill rate. Each request consumes a token; when the bucket is empty, requests are rejected. Capacity controls burst tolerance, refill rate controls sustained throughput.
Why people pick it: it's the natural fit for traffic that is bursty by design — uploads, batch syncs, user-driven flows that fire several requests in quick succession before going idle. A caller can spend a full bucket on a burst, then drain back to the steady refill rate. For commercial limits ("1,000 requests/minute, with bursts up to 100"), token bucket maps cleanly onto the customer-facing wording.
What goes wrong: the burst tolerance is also a load-shedding hazard. A backend that can handle 1,000 req/min steady but not 100 req/sec instantaneous will fall over the moment a caller cashes in a full bucket. Tune the bucket capacity to what the backend can actually absorb, not what feels generous.
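A minimal sketch, using lazy refill (tokens are topped up on each request rather than by a timer):

```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_per_second, now=None):
        self.capacity = capacity
        self.refill = refill_per_second
        self.tokens = float(capacity)  # start full: a caller may burst immediately
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # refill for the time elapsed since the last request, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

`capacity` is the burst-tolerance knob the paragraph above warns about; `refill_per_second` is the sustained rate the customer-facing limit describes.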
Leaky bucket
Requests enter a queue (the bucket) and are processed at a fixed leak rate. If the queue is full, new requests are rejected. Output is paced at a constant rate, regardless of input bursts.
Why people pick it: it converts a bursty input into a smooth output, which is exactly what a downstream system with a fixed processing capacity wants. SMS gateways and payment processors often sit behind a leaky-bucket limiter for this reason.
What goes wrong: queueing introduces latency. A caller that sends a burst doesn't get rejected — they get accepted but their requests sit in the queue, sometimes for seconds. That latency is invisible in the limiter's logs but very visible in the caller's tail latency. If the caller's client times out before the leaked request completes, you've combined the worst of both worlds: the request was accepted, the work was done, and the caller never sees the response.
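A sketch of the admission side. In a real system the drained requests would be handed to a worker at the leak rate; here the drain is simulated so the accept/reject behaviour is visible:

```python
from collections import deque

class LeakyBucket:
    def __init__(self, queue_size, leak_per_second, now=0.0):
        self.queue = deque()
        self.size = queue_size
        self.interval = 1.0 / leak_per_second  # seconds between leaked requests
        self.last_leak = now

    def submit(self, request, now):
        self._leak(now)
        if len(self.queue) >= self.size:
            return False  # queue full: this is the only rejection path
        self.queue.append(request)
        return True

    def _leak(self, now):
        # drain every request whose scheduled leak time has passed
        drained = int((now - self.last_leak) / self.interval)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()  # real system: dispatch to the worker here
            self.last_leak += drained * self.interval
```

Note that `submit` returning True only means the request was queued; the latency cost described above is the gap between that acceptance and the eventual leak.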
A worked example: which algorithm fits which API
Take three illustrative APIs and walk through the choice.
- A read-heavy public catalog API. Backend is well-cached and can absorb large bursts. Fairness between callers matters, but instantaneous protection does not. Sliding window counter on a per-API-key basis at, say, 600 req/min. Cheap to implement, exact enough for billing, and the cache absorbs whatever bursts slip through.
- A webhook delivery endpoint that fans out to a payment processor downstream. Downstream has a hard limit of 50 req/sec across all callers and will start returning 429s above that. Leaky bucket in front of the fan-out, sized so the leak rate matches the downstream limit minus a safety margin. Bursts queue rather than reject; downstream never sees a stampede.
- A user-facing upload API. Each upload triggers a multi-request sequence (initiate, upload chunks, finalize). Users will fire those in quick succession and then go idle. Token bucket sized for "10 uploads in quick succession, then 1 every six seconds steady." Bursts feel responsive; sustained abuse is throttled.
Identity: who you're limiting
The algorithm is half the question; the other half is what counts as "one caller". Common identifiers, in roughly increasing order of trust:
- IP address. Cheap, works for unauthenticated traffic, but conflates everyone behind the same NAT or corporate proxy. Useful as a coarse last line of defence; bad as the only signal.
- API key or token. The standard for authenticated APIs. Pair with the rate-limit headers so the caller can self-throttle. See the authentication reference for how the platform represents callers.
- Account or tenant. Sum across all keys belonging to the same account, so a caller cannot escape a quota by issuing more keys.
- Endpoint group. Apply different limits to expensive endpoints (search, batch) than to cheap ones (single-resource reads). Avoids the situation where a caller burns their global quota on cheap calls and then can't make the expensive one they actually need.
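One way to combine the last two identifiers is to key counters on (caller, endpoint group) rather than on the caller alone. A small sketch under assumed, illustrative group names and limits:

```python
# Hypothetical endpoint groups and limits; the names and numbers are
# illustrative, not a recommendation.
ENDPOINT_LIMITS = {
    "read":   {"limit": 600, "window_seconds": 60},  # cheap single-resource reads
    "search": {"limit": 60,  "window_seconds": 60},  # expensive queries
    "batch":  {"limit": 10,  "window_seconds": 60},
}

def limit_key(account_id: str, endpoint_group: str) -> str:
    # One counter per (account, endpoint group): burning the read budget
    # leaves the search and batch budgets untouched. Keying on the account
    # rather than the API key also closes the issue-more-keys loophole.
    return f"{account_id}:{endpoint_group}"
```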
What to send when you reject
HTTP 429 is the standard response. The headers carry the contract:
- Retry-After — seconds (or an HTTP-date) until the caller should try again. Required if you want well-behaved clients to back off without guessing.
- X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset — the de-facto convention for advertising current quota state. Send them on every successful response, not just rejections, so clients can self-throttle before they hit the wall.
Body shape matters less than the headers, but a small JSON envelope describing which limit was hit (per-key, per-account, per-endpoint) is far more useful than a generic message. Callers cannot fix what they cannot diagnose.
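Putting the headers and body together, a sketch of what a rejection might look like (the header names follow the de-facto convention above; the body fields are illustrative, not a standard):

```python
import json

def rate_limit_response(limit, remaining, reset_epoch, retry_after, scope):
    """Build a 429 response as (status, headers, body)."""
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    body = json.dumps({
        "error": "rate_limited",
        "scope": scope,  # which limit was hit: per-key, per-account, per-endpoint
        "retry_after_seconds": retry_after,
    })
    return 429, headers, body
```

The `scope` field is the diagnosable part: it tells the caller which of several concurrent limits they actually tripped.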
Common mistakes
- Limiting at the wrong layer. A limiter behind your application server has to do work — accept the connection, parse the request, look up the caller — before it can reject. A limiter at the edge (CDN, gateway) rejects before that work happens. For protection, push the limiter as far out as you can.
- Forgetting to limit the failure path. A caller getting 401s in a tight loop is using your auth lookup just as much as a successful caller. Authentication failures should count toward the limit too, with their own (lower) threshold if needed.
- Per-instance limits in a distributed system. Each app server holding its own counter means the effective limit is N × the configured limit. Use a shared store (Redis, a coordination service) or accept that the limit is approximate.
- No exemption for retries you asked for. If your service tells a caller to retry with Retry-After, the retry should not itself count toward the limit, or you'll create a self-reinforcing reject loop.
- Hiding the limit from customers. Limits that customers learn about by getting throttled are limits that generate support tickets. Document them on the pricing page and surface them in the response headers.
Where to go next
For the wider design context — naming, versioning, error envelopes — see API Design Best Practices. For the specifics of how rate limits show up on responses from this platform, see the REST reference. For idempotency, which often pairs with rate limiting in retry-heavy clients, see Idempotency Keys for APIs.