A Kalshi trading bot that runs unmonitored in production is not a system — it is a liability. Bots crash, connections drop, fills come back wrong, and positions accumulate in ways that no strategy intended. This guide covers the full monitoring and alerting stack for production Kalshi bots: what to measure, how to structure your logs, which alerts to page on versus which to investigate later, and how to build a dead-man switch that catches silent failures before they cost real money.

Why Monitoring Is Non-Negotiable

Most trading bot guides end at deployment. The bot starts, a few trades go through, everything looks fine. Then three days later a WebSocket connection stalls mid-session, the bot stops receiving price updates, and it continues posting stale quotes against a rapidly moving market. By the time anyone notices, the damage is done.

This failure mode is more common than bugs in the actual strategy logic. Production infrastructure — cloud VMs, network paths, Kalshi's API endpoints — introduces faults that no amount of backtesting will surface. The complete guide to Kalshi trading bots covers the full lifecycle from idea to live trading; this article drills into the operational layer that keeps a live bot safe.

If you have not yet worked through deploying a Kalshi bot to production or set up bot-level risk management, read those first. Monitoring amplifies the protection those layers provide — it does not replace them.

The Four Layers of Bot Observability

A well-observable Kalshi bot exposes four distinct signal types:

Layer | What it measures | Failure it catches
Infrastructure | CPU, memory, disk, network I/O | Resource exhaustion, host crash
Connectivity | API latency, WebSocket message rate, HTTP error codes | Dropped connections, rate limiting, auth expiry
Execution | Order placement rate, fill rate, slippage per trade | Silent order failures, adverse selection
Strategy | PnL vs. expected value, position sizes, model staleness | Logic bugs, regime shifts, runaway positions

Each layer requires different tooling and different alert thresholds. Many teams instrument the infrastructure layer well and completely skip the strategy layer. In trading, the strategy layer is where the most expensive surprises live.

Structured Logging in Practice

Before building dashboards or alert rules, you need logs that are actually queryable. Every meaningful action your bot takes should emit a JSON line with a consistent schema. Here is a minimal baseline for a Python bot:

import json, time, logging

logger = logging.getLogger("kalshi_bot")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event_type: str, **kwargs):
    record = {
        "ts": time.time(),
        "event": event_type,
        **kwargs,
    }
    logger.info(json.dumps(record))

# Examples
log_event("order_placed",
    market="KXBTC-25DEC-T50000",
    side="yes",
    price=0.42,
    size=10,
    order_id="ord_abc123",
    model_fair=0.44,
)

log_event("fill_received",
    order_id="ord_abc123",
    fill_price=0.42,
    fill_size=10,
    latency_ms=38,
)

log_event("heartbeat", status="ok", open_positions=3)

The model_fair field on order events is critical. It captures what your bot thought the market was worth at the moment it sent the order. Without it, you cannot distinguish between a strategy that is underperforming because of bad logic versus one that is underperforming because fills are coming back at worse prices than expected.
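
To make that distinction concrete, here is a minimal sketch of how logged events could be joined after the fact. It assumes the field names from the examples above (order_id, side, model_fair, fill_price, fill_size) and that all prices are quoted on the yes side in dollars; the function name edge_by_order is hypothetical.

```python
import json


def edge_by_order(lines):
    """Join order_placed and fill_received events on order_id.

    Returns {order_id: edge in dollars}, where edge is
    (model_fair - fill_price) * fill_size for "yes" orders and the
    negated value for "no" orders (assumes yes-side prices).
    """
    orders = {}
    edges = {}
    for line in lines:
        rec = json.loads(line)
        event = rec.get("event")
        if event == "order_placed":
            orders[rec["order_id"]] = rec
        elif event == "fill_received" and rec["order_id"] in orders:
            order = orders[rec["order_id"]]
            sign = 1 if order["side"] == "yes" else -1
            edge = sign * (order["model_fair"] - rec["fill_price"]) * rec["fill_size"]
            # Partial fills for the same order accumulate.
            edges[rec["order_id"]] = edges.get(rec["order_id"], 0.0) + edge
    return edges
```

Fed the two example events above (fair value 0.44, fill at 0.42, size 10), this yields about $0.20 of edge for ord_abc123. If that number is healthy while realized PnL is poor, the model is the suspect rather than the fills.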

Route these logs to a centralized sink — Datadog, Grafana Loki, AWS CloudWatch Logs, or even a simple append-only file shipped to S3 — so you can query across sessions without SSHing into the host.
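
For the simplest of those options, the append-only file, a sketch along the following lines may be enough to start with. The file name bot_events.jsonl, the logger name, and the rotation sizes are illustrative assumptions, and shipping the rotated files to S3 or Loki is left to whatever agent you already run.

```python
import json
import logging
import time
from logging.handlers import RotatingFileHandler


def build_event_logger(path: str = "bot_events.jsonl") -> logging.Logger:
    """Return a logger that appends one JSON object per line to `path`."""
    event_logger = logging.getLogger("kalshi_bot.events")  # hypothetical name
    event_logger.setLevel(logging.INFO)
    event_logger.propagate = False  # avoid duplicate lines via the root logger
    if not event_logger.handlers:   # do not stack handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=50_000_000, backupCount=5)
        handler.setFormatter(logging.Formatter("%(message)s"))
        event_logger.addHandler(handler)
    return event_logger


def write_event(event_logger: logging.Logger, event_type: str, **fields) -> None:
    record = {"ts": time.time(), "event": event_type, **fields}
    # default=str keeps non-JSON types (Decimal, datetime) from raising.
    event_logger.info(json.dumps(record, default=str))
```

One JSON object per line keeps the file usable with grep and with log shippers that tail files, which is why the formatter emits only the message.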

Health Checks and Heartbeats

The simplest and most important monitoring pattern for any long-running process is the heartbeat: the bot emits a signal on a fixed interval, and an external watcher alerts if that signal stops arriving.

Implementing a Heartbeat

Every 60 seconds, the bot should emit a heartbeat event and ping an external monitoring endpoint. The version below runs the loop in a daemon thread:

import threading
import time

import requests

HEARTBEAT_URL = "https://hc-ping.com/your-uuid-here"  # e.g. Healthchecks.io

def heartbeat_loop(interval: int = 60):
    while True:
        try:
            log_event("heartbeat", status="ok")
            # A non-2xx response raises, so a rejected ping is logged instead of ignored.
            requests.get(HEARTBEAT_URL, timeout=5).raise_for_status()
        except Exception as e:
            log_event("heartbeat_error", error=str(e))
        time.sleep(interval)

# daemon=True lets the process exit without waiting for this thread.
threading.Thread(target=heartbeat_loop, daemon=True).start()

Services like Healthchecks.io or the Datadog Synthetics "deadman" check will alert you if the ping stops arriving within a grace period. This is your last-resort safety net — it fires even if the entire bot process has crashed and can no longer log anything.
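
One caveat: a heartbeat that runs in its own thread keeps pinging for as long as the process is alive, even if the trading loop itself is hung. A possible refinement, sketched here with hypothetical names (ProgressGate, maybe_ping), is to send the ping only while the main loop is still reporting progress. The 120-second stall limit is an assumption to tune.

```python
import time


class ProgressGate:
    """Tracks when the main loop last completed an iteration."""

    def __init__(self, max_stall_sec: float = 120.0, clock=time.monotonic):
        self._clock = clock
        self._max_stall_sec = max_stall_sec
        self._last_progress = clock()

    def mark_progress(self) -> None:
        # Call once per successful main-loop iteration.
        self._last_progress = self._clock()

    def is_healthy(self) -> bool:
        return (self._clock() - self._last_progress) <= self._max_stall_sec


def maybe_ping(gate: ProgressGate, ping) -> bool:
    """Send the dead-man ping only while the main loop is making progress."""
    if gate.is_healthy():
        ping()
        return True
    # Withholding the ping lets the external service raise the alarm.
    return False
```

In the heartbeat thread, maybe_ping(gate, lambda: requests.get(HEARTBEAT_URL, timeout=5)) would replace the unconditional request, and the trading loop would call gate.mark_progress() at the end of each iteration.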

WebSocket Connection Monitoring

A WebSocket connection can remain technically open while delivering no messages — a "zombie" connection. Detect this with an idle-message timer:

import asyncio

IDLE_THRESHOLD_SEC = 30  # alert if no message for 30 seconds

class WatchedWebSocket:
    def __init__(self):
        self.last_message_ts = time.time()

    def on_message(self, msg):
        self.last_message_ts = time.time()
        # ... process msg

    async def watchdog(self):
        while True:
            await asyncio.sleep(10)
            idle = time.time() - self.last_message_ts
            if idle > IDLE_THRESHOLD_SEC:
                log_event("ws_idle_alert", idle_seconds=idle)
                # trigger reconnect or alert
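
The "trigger reconnect or alert" step can be filled in several ways. The sketch below is one option, assuming you supply an async callback (here called on_idle, a hypothetical name) that performs the reconnect. It uses time.monotonic() so that wall-clock adjustments cannot distort the idle measurement.

```python
import asyncio
import time


class IdleWatchdog:
    """Calls `on_idle` when no message has been seen for `threshold_sec`."""

    def __init__(self, threshold_sec: float, on_idle, check_every: float = 10.0):
        self.threshold_sec = threshold_sec
        self.on_idle = on_idle          # async callable, e.g. a reconnect routine
        self.check_every = check_every
        self.last_message_ts = time.monotonic()

    def touch(self) -> None:
        # Call from the message handler on every inbound message.
        self.last_message_ts = time.monotonic()

    async def run(self) -> None:
        while True:
            await asyncio.sleep(self.check_every)
            idle = time.monotonic() - self.last_message_ts
            if idle > self.threshold_sec:
                await self.on_idle(idle)
                # Restart the timer so a single stall does not fire on every check.
                self.touch()
```

Run it alongside the feed with asyncio.create_task(watchdog.run()) and call watchdog.touch() from the message handler. Quiet markets can legitimately go silent, so the threshold is best set per feed rather than globally.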

For more detail on the differences between WebSocket and REST connectivity patterns, see the Kalshi WebSocket vs. REST API comparison.

PnL and Position Monitoring

Infrastructure and connectivity checks keep the bot alive. PnL and position monitoring keeps the strategy honest.

Tracking Realized and Unrealized PnL

Maintain running totals in memory and persist them to disk or a database on every fill. Log a summary on each heartbeat:

class PnLTracker:
    """Tracks edge captured per fill, in cents, relative to the model's fair value."""

    def __init__(self):
        self.cumulative_edge_cents = 0.0
        self.fills = []  # list of (price, size, side, model_fair, edge_cents)

    def record_fill(self, price, size, side, model_fair):
        # price and model_fair are yes-side prices in dollars (0.42 = 42 cents).
        # Buying yes below fair value, or selling it above, captures positive edge.
        sign = 1 if side == "yes" else -1
        edge_cents = sign * (model_fair - price) * size * 100
        self.fills.append((price, size, side, model_fair, edge_cents))
        self.cumulative_edge_cents += edge_cents
        log_event("fill_edge", edge_cents=edge_cents, cumulative_edge=self.cumulative_edge_cents)

Cumulative edge (the sum of (model_fair - fill_price) × size across all fills, adjusted for side) is your leading indicator of execution quality. Realized PnL is your lagging indicator. Both matter, but they answer different questions.
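
Realized and unrealized PnL need position state as well as per-fill edge. The following is a minimal average-cost sketch for a single market, not a full accounting model. It assumes prices in integer cents, treats a no position as a negative yes position, and ignores fees and settlement. The class name MarketPosition is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MarketPosition:
    """Average-cost position in one market; prices in integer cents."""
    contracts: int = 0          # > 0 long yes, < 0 short yes (i.e. long no)
    avg_price: float = 0.0      # average entry price of the open contracts
    realized_cents: float = 0.0

    def apply_fill(self, qty: int, price: int) -> None:
        """qty > 0 buys yes, qty < 0 sells yes, both at `price` cents."""
        if qty == 0:
            return
        if self.contracts == 0 or (self.contracts > 0) == (qty > 0):
            # Opening or adding: update the average entry price.
            total = abs(self.contracts) + abs(qty)
            self.avg_price = (self.avg_price * abs(self.contracts) + price * abs(qty)) / total
            self.contracts += qty
            return
        # Reducing, closing, or flipping: realize PnL on the closed quantity.
        closed = min(abs(qty), abs(self.contracts))
        direction = 1 if self.contracts > 0 else -1
        self.realized_cents += direction * (price - self.avg_price) * closed
        remaining = abs(qty) - closed
        self.contracts += qty
        if self.contracts == 0:
            self.avg_price = 0.0
        elif remaining > 0:
            # Flipped through zero: the leftover quantity opens a new position.
            self.avg_price = float(price)

    def unrealized_cents(self, mark: int) -> float:
        """Mark-to-market PnL of the open contracts at `mark` cents."""
        return (mark - self.avg_price) * self.contracts
```

Logging realized_cents and unrealized_cents(mark) on each heartbeat gives a PnL summary that can be persisted and reloaded after a restart.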

Position Limit Alerts

Hard position limits belong in your risk management layer, but monitoring should enforce a second, independent check. Query open positions from the Kalshi REST API every five minutes and compare against your configured limits:

def check_position_limits(client, limits: dict):
    positions = client.get_positions()
    for pos in positions:
        ticker = pos["market_ticker"]
        size = abs(pos["position"])
        if ticker in limits and size > limits[ticker]:
            log_event("position_limit_breach",
                market=ticker,
                size=size,
                limit=limits[ticker],
                severity="critical",
            )
            # page on-call

This redundant check catches cases where the in-process risk guard has a bug or was bypassed by a restart with stale state.
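
To keep this check independent of the trading loop, it can run on its own timer. The helper below is a sketch with a hypothetical name (run_periodically). It swallows exceptions from the check so that a transient API error does not silently end the monitoring loop.

```python
import threading


def run_periodically(check, interval_sec: float, stop: threading.Event, on_error=print) -> None:
    """Call `check()` every `interval_sec` seconds until `stop` is set."""
    while not stop.is_set():
        try:
            check()
        except Exception as exc:
            # A failing check must not kill the loop; report it and carry on.
            on_error(exc)
        # wait() returns early when stop is set, so shutdown is prompt.
        stop.wait(interval_sec)
```

A typical wiring would be threading.Thread(target=run_periodically, args=(lambda: check_position_limits(client, limits), 300, stop), daemon=True).start(), with on_error pointed at your alerting function rather than print.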

PnL Drift Detection

Separate statistical underperformance from execution bugs by comparing cumulative PnL against a simple expected-value baseline. If your model assigns an edge of 2 cents per contract on average and you have filled 500 contracts, you expect roughly $10 in realized edge (500 × $0.02). If cumulative edge is instead below -$5, something is likely wrong. It is not necessarily a crash, but it is worth investigating.
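
A fixed dollar threshold is easy to reason about, but a threshold that scales with sample variance is harder to fool. The sketch below computes how many standard errors the observed mean PnL per contract sits from the expected edge. It assumes the observations are independent, which does not hold for contracts in the same market because they settle together, so aggregating per market first is advisable. The function name drift_zscore is hypothetical.

```python
import math
import statistics


def drift_zscore(per_contract_pnl, expected_edge: float) -> float:
    """Standard errors between observed mean PnL and the expected edge.

    per_contract_pnl: realized PnL per contract, in dollars
    expected_edge:    model's average edge per contract, in dollars (e.g. 0.02)
    """
    n = len(per_contract_pnl)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(per_contract_pnl)
    stderr = statistics.stdev(per_contract_pnl) / math.sqrt(n)
    if stderr == 0:
        # No dispersion at all: any gap from the expectation is unbounded in z terms.
        return 0.0 if mean == expected_edge else math.copysign(math.inf, mean - expected_edge)
    return (mean - expected_edge) / stderr
```

A result below roughly -2 over a meaningful sample is a reasonable trigger for an investigation alert. Values closer to zero are usually indistinguishable from ordinary variance.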

For a deeper treatment of when your PnL numbers are misleading, see the guide to common Kalshi bot PnL mistakes.

Execution Quality Tracking

Execution quality metrics reveal problems that PnL alone obscures — particularly adverse selection and order routing issues.

Slippage per Trade

For every fill, compute slippage as the signed difference between your model's fair value and the fill price:

slippage = (model_fair - fill_price) * (1 if side == "yes" else -1)

Consistently negative slippage (you are buying above or selling below fair value) indicates that the market moves against you around your order times, which is classic adverse selection. This is a strategy-level signal worth tracking as a rolling 50-trade average.
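
A rolling average of this kind is straightforward to keep with a bounded deque. The sketch below uses the convention that negative slippage means a fill worse than fair value. The class name RollingSlippage and the default window of 50 are illustrative.

```python
from collections import deque


class RollingSlippage:
    """Rolling mean of per-fill slippage over the last `window` fills."""

    def __init__(self, window: int = 50):
        self._values = deque(maxlen=window)  # old entries drop off automatically

    def add(self, fill_price: float, model_fair: float, side: str) -> float:
        sign = 1 if side == "yes" else -1
        slippage = (model_fair - fill_price) * sign  # negative = filled worse than fair
        self._values.append(slippage)
        return slippage

    def mean(self) -> float:
        return sum(self._values) / len(self._values) if self._values else 0.0

    def is_full(self) -> bool:
        # Alerting before the window fills tends to produce noisy signals.
        return len(self._values) == self._values.maxlen
```

Checking is_full() before alerting on mean() avoids paging on the first handful of fills after a restart.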

Order Placement and Fill Rates

Track:

  • Placement latency: time from decision to REST API acknowledgment
  • Fill rate: percentage of placed limit orders that fill within N seconds
  • Rejection rate: percentage of orders rejected with 4xx errors

A sudden spike in rejection rate often signals an API key issue or a change in Kalshi's order validation rules. A drop in fill rate on a previously liquid market can indicate your prices have become stale relative to the book.
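
These three metrics can be kept in a small in-process counter object before they are exported to a metrics system. The sketch below is one way to do it. The class and method names are hypothetical, and the 30-second fill window stands in for the N in the fill-rate definition.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class ExecutionStats:
    """Counters for order placement, fills, and rejections."""
    fill_window_sec: float = 30.0      # the "N seconds" in the fill-rate definition
    acked: int = 0                     # orders acknowledged by the API
    rejected: int = 0                  # orders rejected with a 4xx response
    filled_in_window: int = 0          # acknowledged orders filled within the window
    latencies_ms: list = field(default_factory=list)

    def record_ack(self, latency_ms: float) -> None:
        self.acked += 1
        self.latencies_ms.append(latency_ms)

    def record_rejection(self) -> None:
        self.rejected += 1

    def record_fill(self, seconds_to_fill: float) -> None:
        if seconds_to_fill <= self.fill_window_sec:
            self.filled_in_window += 1

    def fill_rate(self) -> float:
        return self.filled_in_window / self.acked if self.acked else 0.0

    def rejection_rate(self) -> float:
        attempts = self.acked + self.rejected
        return self.rejected / attempts if attempts else 0.0

    def p99_latency_ms(self) -> float:
        # statistics.quantiles needs at least two data points.
        if len(self.latencies_ms) < 2:
            return self.latencies_ms[0] if self.latencies_ms else 0.0
        return statistics.quantiles(self.latencies_ms, n=100)[98]
```

In a long-running bot you would likely create a fresh ExecutionStats per reporting interval, since lifetime totals hide sudden spikes.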

The Kalshi API tutorial covers the REST endpoints for order placement and status polling in detail.

Alerting Tiers and Routing

Not every anomaly warrants a 3 a.m. page. Classify alerts into three tiers:

Tier | Condition examples | Response
Critical | Heartbeat lost, position limit breached, auth token expired, unhandled exception in order loop | Immediate page (PagerDuty / SMS), auto-pause bot
Warning | p99 API latency > 500 ms, fill rate drops > 30% from baseline, WebSocket idle > 30 s | Slack / email notification, human reviews within 30 min
Info | Daily PnL summary, trade count, model staleness flag | Posted to monitoring channel, no action required unless trend persists

Auto-pausing on critical alerts means your bot stops placing new orders and optionally flattens existing positions. It should never attempt to self-recover silently: it should stop and wait for a human to confirm the system is healthy before restarting. This is especially important for strategies such as market making, where an unmanaged quote loop can accumulate large one-sided exposure quickly.
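
The tiering and the pause-until-confirmed rule can be expressed in a small router. This is a sketch under assumptions: the three delivery callables are placeholders for whatever PagerDuty, Slack, or channel integration you use, and the names Tier and AlertRouter are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


class AlertRouter:
    """Routes alerts by tier; critical alerts pause the bot until a human resumes it."""

    def __init__(self, page, notify, post_info):
        # Each callable takes a message string, e.g. thin wrappers around
        # a paging service, a Slack webhook, and a monitoring channel.
        self._page = page
        self._notify = notify
        self._post_info = post_info
        self.paused = False

    def dispatch(self, tier: Tier, message: str) -> None:
        if tier is Tier.CRITICAL:
            self.paused = True  # pause first, so a failed page cannot delay it
            self._page(message)
        elif tier is Tier.WARNING:
            self._notify(message)
        else:
            self._post_info(message)

    def may_place_orders(self) -> bool:
        return not self.paused

    def resume(self, confirmed_by: str) -> None:
        # Only an explicit human confirmation clears the pause.
        if not confirmed_by:
            raise ValueError("resume requires the name of the confirming operator")
        self.paused = False
```

The order loop would check may_place_orders() before every submission, and nothing in the bot itself ever calls resume().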

Wiring a Slack Alert in Python

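from typing import Optional

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."

def send_alert(tier: str, message: str, context: Optional[dict] = None) -> bool:
    """Post an alert to Slack; returns False if delivery failed."""
    color = {"critical": "#e01e5a", "warning": "#ecb22e", "info": "#2eb886"}[tier]
    payload = {
        "attachments": [{
            "color": color,
            "title": f"[{tier.upper()}] Kalshi Bot Alert",
            "text": message,
            "fields": [{"title": k, "value": str(v), "short": True}
                       for k, v in (context or {}).items()],
        }]
    }
    try:
        requests.post(SLACK_WEBHOOK, json=payload, timeout=5).raise_for_status()
        return True
    except requests.RequestException as e:
        # A failed alert must not crash the trading loop; record it instead.
        log_event("alert_delivery_failed", tier=tier, error=str(e))
        return False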

Incident Response Runbooks

An alert is only as useful as the runbook it points to. A runbook is a short, numbered checklist a half-awake engineer can follow at midnight. Keep them in a runbooks/ directory in your repo, linked from alert descriptions. Here is a minimal template:

Runbook: Heartbeat Lost

  1. Check if the bot process is running: systemctl status kalshi-bot or check your cloud provider's process list.
  2. Check the last 50 log lines for a traceback or OOM message.
  3. Verify Kalshi API status at kalshi.com or their status page for a platform-side outage.
  4. If the process crashed, check open positions via the Kalshi dashboard before restarting.
  5. If positions are within limits and the cause is identified, restart the bot. Otherwise, manually close positions first.
  6. Document the incident and root cause in the incident log.

Runbook: Position Limit Breach

  1. Do not restart the bot if it is paused — verify why the position breached first.
  2. Query current positions via the API or dashboard.
  3. Determine if the breach came from a fill that arrived out of sequence or from a risk guard bug.
  4. Manually reduce the position to within limits if appropriate.
  5. Review strategy logs for the relevant market around the time of the breach.

Tooling Choices

The right monitoring stack depends on your scale and existing infrastructure. Here is a practical breakdown:

Need | Lightweight option | Production-grade option
Heartbeat / dead-man switch | Healthchecks.io (free tier) | Datadog Synthetics, PagerDuty heartbeat
Log aggregation | Structured file + Loki (self-hosted) | Datadog Logs, AWS CloudWatch
Metrics & dashboards | Prometheus + Grafana | Datadog, New Relic
Alerting / on-call routing | Slack webhook | PagerDuty, Opsgenie
Incident tracking | GitHub Issues or Notion | PagerDuty incidents, Linear

For a solo operator running one or two bots, the lightweight column is entirely sufficient and costs close to nothing. For a team managing multiple strategies and live capital, the production-grade options earn their cost through on-call routing, escalation policies, and historical dashboards.

If you are running your bot on a VPS or cloud instance, the Kalshi bot hosting guide covers infrastructure choices that pair well with the monitoring patterns here.

Putting It Together

A monitoring setup does not need to be complex to be effective. The minimum viable observability stack for a production Kalshi bot looks like this:

  1. Structured JSON logging with consistent fields on every order, fill, and heartbeat event.
  2. Heartbeat ping every 60 seconds to an external dead-man service — Healthchecks.io takes five minutes to set up.
  3. WebSocket idle watchdog that reconnects and alerts if no message arrives within 30 seconds.
  4. Position limit checker running as an independent loop, querying the API every five minutes.
  5. Slack webhook for warning-tier alerts and a phone notification (PagerDuty or SMS) for critical-tier alerts.
  6. Runbooks for the two or three most likely failure scenarios, stored in the repo and linked from alert messages.

Each of these items can be added to an existing bot in a day. Together they convert a bot that might run fine for weeks and then fail silently into one that surfaces problems within minutes and gives a clear path to resolution.

Monitoring is not the most exciting part of building a trading system, but it is the part that determines whether you can trust the system enough to let it run. A bot you cannot observe is a bot you cannot improve — and cannot safely leave unattended.

For the full context on building, testing, and operating Kalshi bots across their entire lifecycle, return to the complete guide to Kalshi trading bots.

Frequently Asked Questions

Quick answers to common questions about Kalshi Bot Monitoring and Alerting for Production Trading Systems.

What is the most critical alert to set up for a Kalshi trading bot?

A dead-man switch — an alert that fires when your bot stops sending heartbeats — is the single most important alert. If the bot crashes silently, you may have open positions with no active management. A heartbeat check every 60 seconds with a page-style alert is the minimum baseline.

How do I detect PnL drift versus normal variance in my bot?

Compare your bot's realized PnL against a rolling expected-value model built from your historical edge estimates. If cumulative PnL falls more than 2–3 standard deviations below expected over a meaningful sample of trades, trigger an investigation alert rather than an emergency stop — the issue could be a regime shift or a bug. Separating statistical drift from execution errors requires logging both fill prices and your model's predicted fair value at order time.

Should I shut the bot down automatically when an alert fires?

It depends on the alert severity. For critical faults — API authentication failure, position exceeding hard limits, or heartbeat loss — automatic pause-and-flatten is appropriate. For softer signals like elevated latency or mild PnL underperformance, a human-reviewed alert is safer than an automatic shutdown that could leave you worse off.

What logging format works best for Kalshi bot diagnostics?

Structured JSON logs with consistent fields (timestamp, event_type, market_ticker, order_id, side, price, size, latency_ms, model_fair_value) make it straightforward to query with tools like Grafana Loki, Datadog, or even simple grep pipelines. Avoid unstructured print statements in production — they are nearly impossible to parse at scale.

How do I monitor fill quality on Kalshi?

Log your order's limit price and the actual fill price for every execution, then compute slippage as (model_fair_value - fill_price) * side_multiplier, where side_multiplier is +1 for yes and -1 for no. Track this as a rolling average. Persistent negative slippage signals that your fair-value model is stale or that you are being adversely selected in the order book.

What external services are commonly used for Kalshi bot alerting?

PagerDuty and Opsgenie are industry standards for on-call routing. For lighter setups, a Slack webhook or Telegram bot notification is quick to wire up. Prometheus with Alertmanager is a strong open-source stack if you are already running infrastructure metrics alongside your trading metrics.

Do I need separate monitoring for the Kalshi WebSocket and REST connections?

Yes. WebSocket and REST connections fail in different ways. The WebSocket can drop without throwing an exception, so you need an idle-message timer that fires if no message arrives within an expected window. REST calls fail loudly with HTTP errors but can also hang on slow responses, so you should enforce request timeouts and track p99 latency separately from WebSocket feed latency.

Priya Chakraborty

Lead Developer & Technical Writer

Priya Chakraborty is Lead Developer at Bot for Kalshi. A former backend infrastructure engineer at Stripe, she now builds automated trading systems that process 10,000+ daily market signals across prediction markets.