A trading bot that crashes into Kalshi's rate limits is a bot that misses fills, corrupts its own state, and risks an API key suspension. The good news: rate-limit problems are almost entirely engineering problems, and every one of them has a clean solution. This guide walks through exactly how Kalshi enforces limits, the data structures that keep you inside them, and the code patterns you can drop into a Python bot today.

How Kalshi Rate Limits Work

Kalshi's REST API divides endpoints into two broad tiers:

  • Market data endpoints (GET /markets, GET /markets/{id}/orderbook, GET /trades): higher limits, read-only, less sensitive to abuse.
  • Order management endpoints (POST /portfolio/orders, DELETE /portfolio/orders/{id}, GET /portfolio): tighter limits because most of these calls trigger exchange-side state changes.

Limits are enforced at the API key level using a sliding-window or token-bucket mechanism server-side. That means all processes sharing one API key compete for the same quota — a detail that bites teams running multiple bots under a single account (more on that in the shared limiter section).

The exact numbers are documented in the Kalshi API reference and are subject to change — always treat the official docs as the source of truth. What doesn't change is the response contract: exceed the limit and you receive HTTP 429 Too Many Requests.

If you're still orienting yourself to the API surface, the Kalshi API tutorial covers authentication and endpoint structure before you worry about rate limits.

Reading the 429 Response

Before writing any throttle-protection logic, understand what Kalshi actually sends back. A 429 response looks like this:

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 2
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1713820842

{"error": "rate_limit_exceeded", "message": "Too many requests. Retry after 2 seconds."}

The key headers:

Header                  Meaning
Retry-After             Seconds to wait before retrying. Read this; don't guess.
X-RateLimit-Limit       Total tokens/requests allowed in the current window.
X-RateLimit-Remaining   Tokens left this window. Watch this proactively.
X-RateLimit-Reset       Unix timestamp when the window resets.

A naive bot ignores all of these and just retries in a tight loop, hammering the server and burning the entire remaining window. A well-built bot reads Retry-After, sleeps exactly that long, then resumes — and ideally never reaches 429 in the first place because it tracks X-RateLimit-Remaining on every successful response.
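As a minimal sketch of reading those headers rather than guessing, the helper below turns a 429's headers into a wait time. It assumes the header names shown in the example response above, a Retry-After value expressed as a plain number of seconds, and an X-RateLimit-Reset value in Unix seconds. The function name wait_seconds_from_429 and its default parameter are hypothetical, not part of any Kalshi client.

```python
import time
from typing import Mapping, Optional


def wait_seconds_from_429(
    headers: Mapping[str, str],
    now: Optional[float] = None,
    default: float = 1.0,
) -> float:
    """Return how long to sleep after a 429, preferring the server's own hints."""
    now = time.time() if now is None else now

    # 1. Retry-After, when present and numeric, is the most direct instruction.
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a plain number of seconds; fall through

    # 2. Otherwise wait until the window resets (assumed to be Unix seconds).
    reset = headers.get("X-RateLimit-Reset")
    if reset is not None:
        try:
            return max(0.0, float(reset) - now)
        except ValueError:
            pass  # unparseable timestamp; fall through

    # 3. No usable hint: fall back to a conservative default.
    return default
```

A plain dict lookup is case-sensitive, so pass the HTTP client's own headers object (for example response.headers from requests, which is case-insensitive) rather than a hand-built dict with different capitalization.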

Token Bucket Implementation

The token bucket algorithm is the right mental model for this problem. You start with a bucket of N tokens. Each request consumes one token. Tokens refill at a steady rate (e.g., 10 per second). If the bucket is empty, the request waits until a token is available.

Here is a minimal, thread-safe Python implementation you can embed directly in your bot:

import time
import threading

class TokenBucket:
    """Thread-safe token bucket rate limiter."""

    def __init__(self, rate: float, capacity: int):
        """
        rate     — tokens added per second (e.g. 10.0)
        capacity — maximum tokens in the bucket
        """
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)
        self._last_refill = time.monotonic()
        self._lock = threading.Lock()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self._last_refill
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last_refill = now

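    def acquire(self, tokens: int = 1, timeout: float = 10.0) -> bool:
        """Block until tokens are available or timeout expires."""
        if tokens > self.capacity:
            # The bucket can never hold this many tokens; fail fast.
            raise ValueError("tokens exceeds bucket capacity")
        deadline = time.monotonic() + timeout
        while True:
            with self._lock:
                self._refill()
                if self._tokens >= tokens:
                    self._tokens -= tokens
                    return True
                # Time needed to refill only the missing tokens.
                wait = (tokens - self._tokens) / self.rate
            if time.monotonic() + wait > deadline:
                return False
            # Sleep in short slices so other threads' usage is re-checked often.
            time.sleep(min(wait, 0.05))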


# Usage
order_limiter = TokenBucket(rate=5.0, capacity=10)   # 5 order ops/sec
market_limiter = TokenBucket(rate=20.0, capacity=40) # 20 market reads/sec

def place_order(payload):
    if not order_limiter.acquire():
        raise TimeoutError("Rate limiter timed out waiting for order slot")
    return kalshi_client.post("/portfolio/orders", json=payload)

Use separate buckets for order endpoints and data endpoints — they have different quotas, and a data-polling spike shouldn't block order submission.

Exponential Backoff with Jitter

Even with a client-side bucket, you will eventually hit a 429 — maybe during a burst of signals at market open, or because a second process unexpectedly shared your key. Backoff handles that gracefully.

Plain exponential backoff (sleep 1s, 2s, 4s, 8s…) has a thundering-herd problem: if multiple bot instances all hit 429 at the same moment and all back off on the same schedule, they retry simultaneously and produce another burst. Adding jitter — a random fraction of the backoff interval — desynchronizes them.

import random
import time

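def with_backoff(fn, max_retries: int = 6, base_delay: float = 1.0):
    """
    Call fn(); on HTTP 429, wait at least Retry-After plus jittered
    exponential backoff, then retry. Raises via raise_for_status()
    once max_retries attempts have all returned 429.
    """
    for attempt in range(max_retries):
        response = fn()
        if response.status_code != 429:
            return response
        if attempt == max_retries - 1:
            break  # out of attempts; don't sleep before raising
        try:
            retry_after = float(response.headers.get("Retry-After", base_delay))
        except ValueError:
            retry_after = base_delay  # header was not a plain number of seconds
        backoff = min(base_delay * (2 ** attempt), 60.0)
        # Full jitter on the exponential term; never retry before Retry-After.
        sleep_for = retry_after + random.uniform(0, backoff)
        print(f"429 received. Sleeping {sleep_for:.2f}s (attempt {attempt + 1})")
        time.sleep(sleep_for)
    response.raise_for_status()  # propagate the final 429

With the defaults there are at most five sleeps between six attempts. Each sleep is the server's Retry-After plus up to min(base_delay × 2^attempt, 60) seconds of jitter, so with Retry-After: 2 the worst-case cumulative wait is 5 × 2 + (1 + 2 + 4 + 8 + 16) = 41 seconds: long enough for a transient rate-limit window to clear without hanging your bot indefinitely.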

WebSocket vs. REST Polling

The single highest-leverage change most bots can make is replacing REST polling loops with a WebSocket subscription. If you're calling GET /markets/{id}/orderbook every 500ms to watch prices, that's 120 requests per minute just for one market. Watching ten markets? 1,200 requests per minute — a large fraction of your entire quota spent on data that could arrive via push.
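The polling arithmetic above is easy to reproduce for your own intervals and market counts. The helper name below is hypothetical; it only restates the calculation in the paragraph.

```python
def polling_requests_per_minute(interval_seconds: float, markets: int = 1) -> float:
    """REST calls per minute when each market is polled once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return markets * 60.0 / interval_seconds


one_market = polling_requests_per_minute(0.5)               # 120.0
ten_markets = polling_requests_per_minute(0.5, markets=10)  # 1200.0
per_second = ten_markets / 60.0                             # 20.0
```

Comparing per_second against the documented limit for the data tier shows how much of the quota a polling loop consumes before the bot has placed a single order.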

Kalshi's WebSocket API streams order book updates, trade confirmations, and fill events in real time. Your bot subscribes once and receives incremental diffs as they happen, with zero polling overhead.
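To make "incremental diffs" concrete, here is a sketch of keeping a local order book current from a snapshot plus deltas. The message shape used here (a side, a price level in cents, and a signed quantity change) is an assumption for illustration, and apply_delta and best_price are hypothetical helpers; check the WebSocket reference for the actual field names and channel semantics.

```python
from typing import Dict, Optional

# Local book: side -> {price_in_cents: resting_quantity}
Book = Dict[str, Dict[int, int]]


def apply_delta(book: Book, side: str, price: int, delta: int) -> None:
    """Apply one signed quantity change to a price level, pruning empty levels."""
    levels = book.setdefault(side, {})
    new_qty = levels.get(price, 0) + delta
    if new_qty > 0:
        levels[price] = new_qty
    else:
        levels.pop(price, None)  # level fully filled or cancelled


def best_price(book: Book, side: str) -> Optional[int]:
    """Highest resting bid on the given side, or None if the side is empty."""
    levels = book.get(side, {})
    return max(levels) if levels else None


# Start from a snapshot, then apply deltas as they arrive over the socket.
book: Book = {"yes": {41: 100, 40: 250}, "no": {57: 80}}
apply_delta(book, "yes", 41, -100)  # the 41-cent level is fully consumed
apply_delta(book, "yes", 42, 30)    # a new, better bid appears
```

Once the book lives in memory like this, strategy code reads prices locally and issues no REST calls for market data at all.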

The tradeoff analysis and connection-management code are covered in depth in Kalshi WebSocket vs. REST API. The short version for rate-limit purposes: use WebSocket for anything you need continuously, use REST only for one-shot lookups or writes.

Switching from REST polling to WebSocket for market data typically reduces a bot's REST request volume by 80–90%, effectively eliminating the risk of hitting data-endpoint limits and freeing the entire quota for order operations.

Batching and Caching Requests

When you do need REST reads, batch them. Kalshi's list endpoints support query parameters that let you retrieve multiple markets in a single round trip:

# Inefficient — one request per market
for market_id in market_ids:
    data = client.get(f"/markets/{market_id}")

# Efficient — one request for up to 200 markets
data = client.get("/markets", params={
    "series_ticker": "INXD",
    "status": "open",
    "limit": 200
})

For data that doesn't change frequently — market metadata, series definitions, fee schedules — implement a simple TTL cache:

import time
from functools import wraps

_cache = {}

def ttl_cache(ttl_seconds: float):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args):
            key = (fn.__name__, args)
            if key in _cache:
                value, expires_at = _cache[key]
                if time.monotonic() < expires_at:
                    return value
            value = fn(*args)
            _cache[key] = (value, time.monotonic() + ttl_seconds)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=60)
def get_market_metadata(market_id: str):
    return client.get(f"/markets/{market_id}").json()

Caching market metadata for 60 seconds can cut read requests by an order of magnitude in bots that re-fetch the same data on every loop iteration — a common pattern in early-stage bot code.

Multi-Bot Shared Rate Limiter

If you run several strategies simultaneously (say, a weather bot, a Fed rate bot, and an election bot), all under the same API key, their requests compete for the same quota. Two common solutions:

  1. Separate API keys per bot — the cleanest isolation. Each strategy gets its own credential and its own fresh quota. Requires Kalshi to support multiple API keys per account; check the current API docs.
  2. Shared in-process rate limiter — if all bots run in the same Python process (or connect to a shared Redis-backed limiter), a single TokenBucket instance can be shared across threads or async tasks.
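One caveat for the in-process option: the TokenBucket shown earlier waits with time.sleep, which would stall every coroutine on an asyncio event loop. A minimal async variant might look like the sketch below. AsyncTokenBucket and demo are hypothetical names, and the rate and capacity values are illustrative rather than Kalshi's actual limits.

```python
import asyncio
import time


class AsyncTokenBucket:
    """Token bucket for asyncio tasks; waits without blocking the event loop."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens held
        self._tokens = float(capacity)
        self._last_refill = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self, tokens: int = 1) -> None:
        if tokens > self.capacity:
            raise ValueError("tokens exceeds bucket capacity")
        while True:
            async with self._lock:
                now = time.monotonic()
                elapsed = now - self._last_refill
                self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
                self._last_refill = now
                if self._tokens >= tokens:
                    self._tokens -= tokens
                    return
                # Time needed to refill only the missing tokens.
                wait = (tokens - self._tokens) / self.rate
            # Sleep outside the lock so other tasks can check the bucket.
            await asyncio.sleep(wait)


async def demo(n_tasks: int = 4) -> float:
    """Run n_tasks concurrent acquires against one shared bucket; return elapsed seconds."""
    # Create the bucket inside the running loop so the lock binds to it.
    bucket = AsyncTokenBucket(rate=50.0, capacity=2)
    start = time.monotonic()
    await asyncio.gather(*(bucket.acquire() for _ in range(n_tasks)))
    return time.monotonic() - start
```

Running asyncio.run(demo(4)) lets two tasks through immediately and makes the other two wait for refills, which is the behavior you want when several strategies share one quota in a single event loop.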

For a Redis-backed distributed limiter (useful when bots run on separate machines), the sliding-window counter pattern works well:

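import time
import uuid

import redis

r = redis.Redis(host="localhost", port=6379)

def acquire_distributed(key: str, limit: int, window_secs: int) -> bool:
    now_ms = int(time.time() * 1000)
    window_start = now_ms - window_secs * 1000
    member = f"{now_ms}-{uuid.uuid4().hex}"  # unique even within one millisecond
    pipe = r.pipeline()  # transactional by default: commands run atomically
    pipe.zremrangebyscore(key, 0, window_start)  # drop entries older than the window
    pipe.zadd(key, {member: now_ms})
    pipe.zcard(key)
    pipe.expire(key, window_secs + 1)
    _, _, count, _ = pipe.execute()
    if count > limit:
        r.zrem(key, member)  # rejected requests must not consume quota
        return False
    return True

This uses a Redis sorted set in which each member is scored by its request timestamp. After expired entries are trimmed, the member count is your usage within the sliding window. If the count exceeds the limit, the function removes the entry it just added and returns False, and the caller backs off before issuing the request. Because the timestamps come from each machine's own clock, keep the hosts time-synchronized.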

For more on deploying bots reliably across multiple machines, see the production deployment guide and the bot hosting guide.

Monitoring Your Rate-Limit Headroom

Proactive monitoring beats reactive backoff. Parse the rate-limit headers on every successful response and emit a metric:

import logging

logger = logging.getLogger("kalshi.rate_limit")

def parse_rate_limit_headers(response):
    remaining = response.headers.get("X-RateLimit-Remaining")
    limit = response.headers.get("X-RateLimit-Limit")
    if remaining is not None and limit is not None:
        pct_used = (1 - int(remaining) / int(limit)) * 100
        if pct_used > 80:
            logger.warning(
                "Rate limit at %.0f%% capacity (remaining=%s, limit=%s)",
                pct_used, remaining, limit
            )
        # Emit to your metrics system (Prometheus, Datadog, etc.)
        metrics.gauge("kalshi.rate_limit.used_pct", pct_used)

Alerting when you've consumed more than 80% of the window gives you time to throttle voluntarily before hitting 429. Connect this to the alerting setup described in bot monitoring and alerting to get a Slack or PagerDuty notification if the metric stays elevated.
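Throttling voluntarily can be as simple as spreading the remaining budget over the time left in the window. The sketch below assumes X-RateLimit-Reset is a Unix timestamp in seconds and that the window is fixed; pacing_delay and its reserve parameter are hypothetical names.

```python
def pacing_delay(remaining: int, reset_epoch: float, now: float, reserve: int = 0) -> float:
    """Seconds to wait before the next request so the budget lasts until reset.

    reserve holds back that many requests (for example, for urgent order cancels).
    """
    seconds_left = reset_epoch - now
    if seconds_left <= 0:
        return 0.0  # the window has already rolled over
    usable = remaining - reserve
    if usable <= 0:
        return seconds_left  # nothing left to spend; wait for the reset
    return seconds_left / usable
```

For example, with 4 requests remaining and 10 seconds until reset, pacing_delay(4, 110.0, 100.0) spaces calls 2.5 seconds apart; holding 2 in reserve doubles the spacing to 5 seconds.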

Also track 429 rate as its own metric: kalshi.rate_limit.throttled_count. A sudden spike in 429s on an otherwise stable bot often signals a new code path generating unexpected request bursts — catching it early prevents a spiral into key suspension.
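One way to produce that count is a trailing-window counter fed with every response status. This is a sketch, not a metrics-library API: ThrottleCounter is a hypothetical class, it is not thread-safe as written, and the value it returns would be passed to whatever metrics client you already use.

```python
import time
from collections import deque
from typing import Deque, Optional


class ThrottleCounter:
    """Counts 429 responses seen within the trailing window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()  # timestamps of 429s, oldest first

    def record(self, status_code: int, now: Optional[float] = None) -> None:
        """Call once per response; only 429s are stored."""
        if status_code == 429:
            self._events.append(time.monotonic() if now is None else now)

    def count(self, now: Optional[float] = None) -> int:
        """Number of 429s in the last window_seconds."""
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()  # expire events that left the window
        return len(self._events)
```

Emitting count() as the throttled_count gauge on a fixed schedule makes a sudden burst of 429s visible within one reporting interval.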

Putting It All Together

Here is the minimal rate-limit stack every production Kalshi bot should have, in priority order:

  1. Switch to WebSocket for streaming data. This alone eliminates most rate-limit risk. See WebSocket vs. REST for implementation details.
  2. Add per-endpoint token buckets at or below the documented limit. Set order-endpoint buckets conservatively — leave 20% headroom.
  3. Wrap all REST calls in a backoff decorator that reads Retry-After and uses full jitter. Never retry in a tight loop.
  4. Batch list endpoint calls and cache static metadata with a TTL appropriate to how often that data changes.
  5. Emit rate-limit headroom metrics and alert before you hit the wall, not after.
  6. Isolate API keys per strategy if you run multiple bots, or use a shared limiter if they're co-located.

None of this is exotic engineering. Token buckets and exponential backoff are standard patterns from any distributed systems textbook. What makes them feel hard in trading contexts is the time pressure — a missed order during a fast-moving market feels costly. The answer is to build the rate-limit layer once, test it under synthetic load, and then forget about it while the bot handles real markets.

Rate-limit handling is one layer of a production bot. The full architecture — including order management, risk controls, and position sizing — is covered in the complete guide to Kalshi trading bots. If you're earlier in the build process, the Python bot tutorial walks through the REST client setup that underpins everything described here.

Frequently Asked Questions

Quick answers to common questions about Handling Kalshi API Rate Limits Without Getting Your Bot Throttled.

What are Kalshi's current API rate limits?

Kalshi enforces per-endpoint rate limits on its REST API, typically in the range of 10–30 requests per second depending on the endpoint tier. Order submission endpoints are more tightly capped than read-only market data endpoints. Always check the official Kalshi API documentation for the latest numbers, as limits can change.

What HTTP status code does Kalshi return when you're throttled?

Kalshi returns HTTP 429 Too Many Requests when a client exceeds its rate limit. The response headers include a Retry-After value (in seconds) indicating how long you must wait before retrying. Your bot should read this header rather than using a hard-coded delay.

Does using the WebSocket API help avoid rate limits?

Yes — streaming market data over WebSocket removes the need for repeated REST polling and dramatically cuts your request volume. For any data you need continuously (order book updates, market prices), the WebSocket connection is both more efficient and far less likely to trigger throttling.

Can rate limits get my Kalshi API key suspended?

Persistent or egregious rate-limit violations can result in temporary or permanent API key suspension. The safest approach is to stay within the documented limits and to handle any 429 with exponential backoff and jitter; a 429 itself is a warning, not an immediate ban.

Does running multiple bots under the same API key share the rate limit?

Yes. Rate limits are enforced at the API key level, not the process or IP level. If you run several bot instances using the same credentials, their requests are pooled against the same quota. Use separate API keys per bot or implement a shared rate-limiter middleware for multi-bot setups.

What is a token bucket and why is it the right model for Kalshi rate limiting?

A token bucket is an algorithm that holds up to a fixed capacity of 'tokens'; each request consumes one token, and tokens refill at a steady rate. It allows short bursts up to the bucket's capacity while capping the sustained request rate, which approximates how many API gateways enforce limits and keeps your client safely inside the quota whether Kalshi uses a token bucket or a sliding window server-side.

How do I batch Kalshi API calls to stay within rate limits?

Instead of issuing one REST call per market you care about, use the list endpoints (e.g., GET /markets) with filters to retrieve multiple markets in a single request. For order management, queue pending actions and flush them in a controlled loop rather than firing them immediately on each signal.

Priya Chakraborty

Lead Developer & Technical Writer

Priya Chakraborty is Lead Developer at Bot for Kalshi. A former backend infrastructure engineer at Stripe, she now builds automated trading systems that process 10,000+ daily market signals across prediction markets.