A Kalshi trading bot that runs unmonitored in production is not a system — it is a liability. Bots crash, connections drop, fills come back wrong, and positions accumulate in ways that no strategy intended. This guide covers the full monitoring and alerting stack for production Kalshi bots: what to measure, how to structure your logs, which alerts to page on versus which to investigate later, and how to build a dead-man switch that catches silent failures before they cost real money.
Why Monitoring Is Non-Negotiable
Most trading bot guides end at deployment. The bot starts, a few trades go through, everything looks fine. Then three days later a WebSocket connection stalls mid-session, the bot stops receiving price updates, and it continues posting stale quotes against a rapidly moving market. By the time anyone notices, the damage is done.
This failure mode is more common than bugs in the actual strategy logic. Production infrastructure — cloud VMs, network paths, Kalshi's API endpoints — introduces faults that no amount of backtesting will surface. The complete guide to Kalshi trading bots covers the full lifecycle from idea to live trading; this article drills into the operational layer that keeps a live bot safe.
If you have not yet worked through deploying a Kalshi bot to production or set up bot-level risk management, read those first. Monitoring amplifies the protection those layers provide — it does not replace them.
The Four Layers of Bot Observability
A well-observable Kalshi bot exposes four distinct signal types:
| Layer | What it measures | Failure it catches |
|---|---|---|
| Infrastructure | CPU, memory, disk, network I/O | Resource exhaustion, host crash |
| Connectivity | API latency, WebSocket message rate, HTTP error codes | Dropped connections, rate limiting, auth expiry |
| Execution | Order placement rate, fill rate, slippage per trade | Silent order failures, adverse selection |
| Strategy | PnL vs. expected value, position sizes, model staleness | Logic bugs, regime shifts, runaway positions |
Each layer requires different tooling and different alert thresholds. Many teams instrument the infrastructure layer well and completely skip the strategy layer. In trading, the strategy layer is where the most expensive surprises live.
Structured Logging in Practice
Before building dashboards or alert rules, you need logs that are actually queryable. Every meaningful action your bot takes should emit a JSON line with a consistent schema. Here is a minimal baseline for a Python bot:
```python
import json
import logging
import time

logger = logging.getLogger("kalshi_bot")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event_type: str, **kwargs):
    record = {
        "ts": time.time(),
        "event": event_type,
        **kwargs,
    }
    logger.info(json.dumps(record))

# Examples
log_event("order_placed",
    market="KXBTC-25DEC-T50000",
    side="yes",
    price=0.42,
    size=10,
    order_id="ord_abc123",
    model_fair=0.44,
)
log_event("fill_received",
    order_id="ord_abc123",
    fill_price=0.42,
    fill_size=10,
    latency_ms=38,
)
log_event("heartbeat", status="ok", open_positions=3)
```
The model_fair field on order events is critical. It captures what your bot thought the market was worth at the moment it sent the order. Without it, you cannot distinguish between a strategy that is underperforming because of bad logic versus one that is underperforming because fills are coming back at worse prices than expected.
Route these logs to a centralized sink — Datadog, Grafana Loki, AWS CloudWatch Logs, or even a simple append-only file shipped to S3 — so you can query across sessions without SSHing into the host.
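To illustrate why model_fair belongs on the order event, here is a minimal sketch that reads JSON log lines in the schema above and pairs each fill with its originating order by order_id. The function name edge_per_fill and the output fields are hypothetical, and the sketch assumes every fill is logged after its order.

```python
import json
from typing import Dict, Iterable, List

def edge_per_fill(log_lines: Iterable[str]) -> List[Dict]:
    """Pair each fill_received event with its order_placed event by order_id."""
    orders: Dict[str, Dict] = {}  # order_id -> order_placed record
    results: List[Dict] = []
    for line in log_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (e.g. library output)
        if not isinstance(rec, dict):
            continue
        if rec.get("event") == "order_placed":
            orders[rec.get("order_id")] = rec
        elif rec.get("event") == "fill_received":
            order = orders.get(rec.get("order_id"))
            if order is None:
                continue  # fill for an order that was never logged
            sign = 1 if order["side"] == "yes" else -1
            # Edge per fill: fair value at order time minus the actual fill price.
            edge = (order["model_fair"] - rec["fill_price"]) * sign * rec["fill_size"]
            results.append({
                "order_id": rec["order_id"],
                "market": order["market"],
                "edge": round(edge, 6),
            })
    return results
```

Run against a day of logs, the output separates "the model was wrong" (edge looked positive but PnL is negative) from "execution was poor" (edge itself is negative).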
Health Checks and Heartbeats
The simplest and most important monitoring pattern for any long-running process is the heartbeat: the bot emits a signal on a fixed interval, and an external watcher alerts if that signal stops arriving.
Implementing a Heartbeat
Every 60 seconds, your bot should emit a heartbeat event and optionally ping an external monitoring endpoint. The version below runs the heartbeat in a background thread:
```python
import threading
import time

import requests

HEARTBEAT_URL = "https://hc-ping.com/your-uuid-here"  # e.g. Healthchecks.io

def heartbeat_loop(interval: int = 60):
    while True:
        try:
            log_event("heartbeat", status="ok")
            requests.get(HEARTBEAT_URL, timeout=5)
        except Exception as e:
            # A failed ping must never kill the heartbeat thread.
            log_event("heartbeat_error", error=str(e))
        time.sleep(interval)

threading.Thread(target=heartbeat_loop, daemon=True).start()
```
Services like Healthchecks.io or the Datadog Synthetics "deadman" check will alert you if the ping stops arriving within a grace period. This is your last-resort safety net — it fires even if the entire bot process has crashed and can no longer log anything.
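One caveat with a heartbeat that runs in its own thread: it keeps pinging even if the main trading loop is stuck. A possible refinement, sketched here under the assumption that the trading loop can call a method once per iteration, is to gate the ping on recent loop progress. The class name LoopLiveness and the 120-second stall limit are illustrative choices, not part of any library.

```python
import time
from typing import Callable

class LoopLiveness:
    """Tracks whether the main trading loop has made progress recently."""

    def __init__(self, max_stall_sec: float = 120.0,
                 clock: Callable[[], float] = time.monotonic):
        self._clock = clock            # injectable so the logic can be tested
        self._max_stall = max_stall_sec
        self._last_tick = clock()

    def tick(self) -> None:
        """Call once per iteration of the main trading loop."""
        self._last_tick = self._clock()

    def is_alive(self) -> bool:
        """True if the loop ticked within the allowed stall window."""
        return (self._clock() - self._last_tick) <= self._max_stall
```

The heartbeat thread would then send its ping only when `liveness.is_alive()` returns True, so a hung trading loop stops the pings and trips the external dead-man check.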
WebSocket Connection Monitoring
A WebSocket connection can remain technically open while delivering no messages — a "zombie" connection. Detect this with an idle-message timer:
```python
import asyncio
import time

IDLE_THRESHOLD_SEC = 30  # alert if no message for 30 seconds

class WatchedWebSocket:
    def __init__(self):
        self.last_message_ts = time.time()

    def on_message(self, msg):
        # Every inbound message resets the idle timer.
        self.last_message_ts = time.time()
        # ... process msg

    async def watchdog(self):
        while True:
            await asyncio.sleep(10)
            idle = time.time() - self.last_message_ts
            if idle > IDLE_THRESHOLD_SEC:
                log_event("ws_idle_alert", idle_seconds=idle)
                # trigger reconnect or alert
```
For more detail on the differences between WebSocket and REST connectivity patterns, see the Kalshi WebSocket vs. REST API comparison.
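When the watchdog triggers a reconnect, retrying in a tight loop can make an outage worse. One common approach, shown here as a sketch rather than anything Kalshi prescribes, is exponential backoff with jitter. The function name backoff_delays and its default values are assumptions for illustration.

```python
import random
from typing import Callable, Iterator

def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 6,
                   jitter: float = 0.25,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield sleep durations base, 2*base, 4*base, ... capped at `cap`.

    Each delay is stretched by a random factor of up to `jitter` so that
    several bots restarting together do not reconnect in lockstep.
    """
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        yield delay * (1.0 + jitter * rng())
```

A reconnect routine would sleep for each yielded delay between attempts and escalate to an alert once the generator is exhausted.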
PnL and Position Monitoring
Infrastructure and connectivity checks keep the bot alive. PnL and position monitoring keeps the strategy honest.
Tracking Realized and Unrealized PnL
Maintain running totals in memory and persist them to disk or a database on every fill. Log a summary on each heartbeat:
```python
class PnLTracker:
    def __init__(self):
        self.realized = 0.0  # updated when positions close or settle (not shown)
        self.fills = []      # edge captured per fill

    def record_fill(self, price, size, side, model_fair):
        # Assumes price and model_fair are quoted in cents: a 'yes' fill costs
        # price*size cents and pays out 100*size cents if correct. If you quote
        # in dollars (e.g. 0.42), the edge below is in dollars instead.
        edge_captured = (model_fair - price) * size if side == "yes" else (price - model_fair) * size
        self.fills.append(edge_captured)
        log_event("fill_edge", edge_cents=edge_captured, cumulative_edge=sum(self.fills))
```
Cumulative edge (the sum of model_fair - fill_price across all orders, adjusted for side) is your leading indicator of execution quality. Realized PnL is your lagging indicator. Both matter, but they answer different questions.
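The "persist on every fill" step mentioned at the start of this subsection can be sketched as an atomic JSON write, so that a crash mid-write never leaves a half-written state file. The helper names save_state and load_state and the state layout are hypothetical.

```python
import json
import os
import tempfile

def save_state(path: str, state: dict) -> None:
    """Write state to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())     # make sure bytes hit disk before the swap
        os.replace(tmp_path, path)   # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_state(path: str, default=None):
    """Return the saved state, or `default` if no state file exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

On restart, the tracker can reload its totals with load_state before processing new fills, which avoids the stale-state problem after a crash.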
Position Limit Alerts
Hard position limits belong in your risk management layer, but monitoring should enforce a second, independent check. Query open positions from the Kalshi REST API every five minutes and compare against your configured limits:
```python
def check_position_limits(client, limits: dict):
    positions = client.get_positions()
    for pos in positions:
        ticker = pos["market_ticker"]
        size = abs(pos["position"])
        if ticker in limits and size > limits[ticker]:
            log_event("position_limit_breach",
                market=ticker,
                size=size,
                limit=limits[ticker],
                severity="critical",
            )
            # page on-call
```
This redundant check catches cases where the in-process risk guard has a bug or was bypassed by a restart with stale state.
PnL Drift Detection
Separate statistical underperformance from execution bugs by comparing cumulative PnL against a simple expected-value baseline. If your model assigns an edge of 2 cents per contract on average and you have filled 500 contracts, you expect roughly $10 in realized edge. If cumulative edge is below negative $5, something is likely wrong — not necessarily a crash, but worth investigating.
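The comparison above can be made more systematic with a z-score: how many standard deviations the cumulative figure sits below its expected value. This is a sketch that assumes per-contract outcomes are roughly independent; the function names and the per-contract standard deviation you pass in are your own estimates, not values from Kalshi.

```python
import math

def drift_z_score(cumulative: float, n_contracts: int,
                  expected_per_contract: float,
                  std_per_contract: float) -> float:
    """Z-score of the cumulative result versus its expected value.

    Assumes independent per-contract outcomes, so the standard deviation
    of the total grows with sqrt(n_contracts).
    """
    if n_contracts <= 0 or std_per_contract <= 0:
        raise ValueError("need a positive sample size and standard deviation")
    expected_total = n_contracts * expected_per_contract
    std_total = std_per_contract * math.sqrt(n_contracts)
    return (cumulative - expected_total) / std_total

def needs_investigation(z: float, threshold: float = 2.0) -> bool:
    """Flag results that sit more than `threshold` deviations below expectation."""
    return z < -threshold
```

With the numbers from the paragraph above (2 cents expected per contract, 500 contracts, cumulative result of negative $5) and a hypothetical per-contract standard deviation of $0.25, the z-score is roughly -2.7, which would trip a 2-sigma investigation threshold.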
For a deeper treatment of when your PnL numbers are misleading, see the guide to common Kalshi bot PnL mistakes.
Execution Quality Tracking
Execution quality metrics reveal problems that PnL alone obscures — particularly adverse selection and order routing issues.
Slippage per Trade
For every fill, compute slippage as the signed difference between your model's fair value and the fill price:
```python
slippage = (fill_price - model_fair) * (1 if side == "yes" else -1)
```
With this sign convention, consistently positive slippage means you are buying above or selling below fair value: the market moves against you around your order times, which is classic adverse selection. This is a strategy-level signal worth tracking as a rolling 50-trade average.
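A rolling 50-trade average is straightforward to keep with a bounded deque. The sketch below uses the slippage formula above, under which a buy that fills above fair value produces a positive number. The class name RollingSlippage is illustrative.

```python
from collections import deque
from typing import Optional

class RollingSlippage:
    """Keeps the most recent `window` slippage values and their mean."""

    def __init__(self, window: int = 50):
        self._values = deque(maxlen=window)  # oldest entries drop off automatically

    def add(self, fill_price: float, model_fair: float, side: str) -> float:
        sign = 1 if side == "yes" else -1
        slippage = (fill_price - model_fair) * sign
        self._values.append(slippage)
        return slippage

    def mean(self) -> Optional[float]:
        """Average slippage over the window, or None before the first fill."""
        if not self._values:
            return None
        return sum(self._values) / len(self._values)
```

Logging the mean on each heartbeat gives a slow-moving series that is easy to chart and alert on.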
Order Placement and Fill Rates
Track:
- Placement latency: time from decision to REST API acknowledgment
- Fill rate: percentage of placed limit orders that fill within N seconds
- Rejection rate: percentage of orders rejected with 4xx errors
A sudden spike in rejection rate often signals an API key issue or a change in Kalshi's order validation rules. A drop in fill rate on a previously liquid market can indicate your prices have become stale relative to the book.
The Kalshi API tutorial covers the REST endpoints for order placement and status polling in detail.
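The three metrics listed above can be computed from a batch of order records. This sketch assumes each record is a dict with hypothetical keys decision_ts, ack_ts, fill_ts (None if unfilled), and http_status; adapt the names to whatever your bot actually logs.

```python
from typing import Dict, List, Optional

def execution_rates(orders: List[Dict], fill_window_sec: float = 30.0) -> Optional[Dict]:
    """Summarize rejection rate, fill rate, and mean placement latency."""
    total = len(orders)
    if total == 0:
        return None
    # 4xx responses count as rejections; 5xx are server faults and counted as neither.
    rejected = sum(1 for o in orders if 400 <= o["http_status"] < 500)
    accepted = [o for o in orders if o["http_status"] < 400]
    # An order counts as filled only if the fill arrived within the window.
    filled = sum(
        1 for o in accepted
        if o["fill_ts"] is not None and o["fill_ts"] - o["ack_ts"] <= fill_window_sec
    )
    latencies = [o["ack_ts"] - o["decision_ts"] for o in accepted]
    return {
        "rejection_rate": rejected / total,
        "fill_rate": filled / len(accepted) if accepted else 0.0,
        "mean_placement_latency_sec": sum(latencies) / len(latencies) if latencies else None,
    }
```

Computing these over a fixed window (for example, the last hour) and comparing against a longer baseline makes sudden spikes or drops easy to spot.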
Alerting Tiers and Routing
Not every anomaly warrants a 3 a.m. page. Classify alerts into three tiers:
| Tier | Condition examples | Response |
|---|---|---|
| Critical | Heartbeat lost, position limit breached, auth token expired, unhandled exception in order loop | Immediate page (PagerDuty / SMS), auto-pause bot |
| Warning | p99 API latency > 500 ms, fill rate drops > 30% from baseline, WebSocket idle > 30 s | Slack / email notification, human reviews within 30 min |
| Info | Daily PnL summary, trade count, model staleness flag | Posted to monitoring channel, no action required unless trend persists |
Auto-pausing on critical alerts means your bot stops placing new orders and optionally flattens existing positions. It should never attempt to self-recover silently — it should stop and wait for a human to confirm the system is healthy before restarting. This is especially important for bots like the market making strategy, where an unmanaged quote loop can accumulate large one-sided exposure quickly.
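The "stop and wait for a human" rule can be enforced with a latch that any critical alert can trip but only an explicit, attributed action can clear. This is a minimal sketch; the class name PauseLatch and the confirmed_by parameter are assumptions for illustration.

```python
import threading
from typing import Optional

class PauseLatch:
    """Once tripped, blocks trading until a human explicitly resumes."""

    def __init__(self):
        self._paused = threading.Event()   # safe to set from any thread
        self.reason: Optional[str] = None

    def trip(self, reason: str) -> None:
        """Pause trading; the first reason is kept for the incident log."""
        if not self._paused.is_set():
            self.reason = reason
        self._paused.set()

    def can_trade(self) -> bool:
        return not self._paused.is_set()

    def resume(self, confirmed_by: str) -> None:
        """Clear the pause; requires the name of the person confirming."""
        if not confirmed_by:
            raise ValueError("resume requires a human confirmation")
        self._paused.clear()
        self.reason = None
```

The order loop would check `can_trade()` before every placement, and critical alert handlers would call `trip()` alongside the page.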
Wiring a Slack Alert in Python
```python
from typing import Optional

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."

def send_alert(tier: str, message: str, context: Optional[dict] = None):
    # Color-code the attachment by severity tier.
    color = {"critical": "#e01e5a", "warning": "#ecb22e", "info": "#2eb886"}[tier]
    payload = {
        "attachments": [{
            "color": color,
            "title": f"[{tier.upper()}] Kalshi Bot Alert",
            "text": message,
            "fields": [{"title": k, "value": str(v), "short": True}
                       for k, v in (context or {}).items()],
        }]
    }
    requests.post(SLACK_WEBHOOK, json=payload, timeout=5)
```
Incident Response Runbooks
An alert is only as useful as the runbook it points to. A runbook is a short, numbered checklist a half-awake engineer can follow at midnight. Keep them in a runbooks/ directory in your repo, linked from alert descriptions. Here is a minimal template:
Runbook: Heartbeat Lost
- Check if the bot process is running: systemctl status kalshi-bot or check your cloud provider's process list.
- Check the last 50 log lines for a traceback or OOM message.
- Verify Kalshi API status at kalshi.com or their status page for a platform-side outage.
- If the process crashed, check open positions via the Kalshi dashboard before restarting.
- If positions are within limits and the cause is identified, restart the bot. Otherwise, manually close positions first.
- Document the incident and root cause in the incident log.
Runbook: Position Limit Breach
- Do not restart the bot if it is paused — verify why the position breached first.
- Query current positions via the API or dashboard.
- Determine if the breach came from a fill that arrived out of sequence or from a risk guard bug.
- Manually reduce the position to within limits if appropriate.
- Review strategy logs for the relevant market around the time of the breach.
Tooling Choices
The right monitoring stack depends on your scale and existing infrastructure. Here is a practical breakdown:
| Need | Lightweight option | Production-grade option |
|---|---|---|
| Heartbeat / dead-man switch | Healthchecks.io (free tier) | Datadog Synthetics, PagerDuty heartbeat |
| Log aggregation | Structured file + Loki (self-hosted) | Datadog Logs, AWS CloudWatch |
| Metrics & dashboards | Prometheus + Grafana | Datadog, New Relic |
| Alerting / on-call routing | Slack webhook | PagerDuty, Opsgenie |
| Incident tracking | GitHub Issues or Notion | PagerDuty incidents, Linear |
For a solo operator running one or two bots, the lightweight column is entirely sufficient and costs close to nothing. For a team managing multiple strategies and live capital, the production-grade options earn their cost through on-call routing, escalation policies, and historical dashboards.
If you are running your bot on a VPS or cloud instance, the Kalshi bot hosting guide covers infrastructure choices that pair well with the monitoring patterns here.
Putting It Together
A monitoring setup does not need to be complex to be effective. The minimum viable observability stack for a production Kalshi bot looks like this:
- Structured JSON logging with consistent fields on every order, fill, and heartbeat event.
- Heartbeat ping every 60 seconds to an external dead-man service — Healthchecks.io takes five minutes to set up.
- WebSocket idle watchdog that reconnects and alerts if no message arrives within 30 seconds.
- Position limit checker running as an independent loop, querying the API every five minutes.
- Slack webhook for warning-tier alerts and a phone notification (PagerDuty or SMS) for critical-tier alerts.
- Runbooks for the two or three most likely failure scenarios, stored in the repo and linked from alert messages.
Each of these items can be added to an existing bot in a day. Together they convert a bot that might run fine for weeks and then fail silently into one that surfaces problems within minutes and gives a clear path to resolution.
Monitoring is not the most exciting part of building a trading system, but it is the part that determines whether you can trust the system enough to let it run. A bot you cannot observe is a bot you cannot improve — and cannot safely leave unattended.
For the full context on building, testing, and operating Kalshi bots across their entire lifecycle, return to the complete guide to Kalshi trading bots.
Frequently Asked Questions
Quick answers to common questions about Kalshi Bot Monitoring and Alerting for Production Trading Systems.
What is the most critical alert to set up for a Kalshi trading bot?
A dead-man switch — an alert that fires when your bot stops sending heartbeats — is the single most important alert. If the bot crashes silently, you may have open positions with no active management. A heartbeat check every 60 seconds with a page-style alert is the minimum baseline.
How do I detect PnL drift versus normal variance in my bot?
Compare your bot's realized PnL against a rolling expected-value model built from your historical edge estimates. If cumulative PnL falls more than 2–3 standard deviations below expected over a meaningful sample of trades, trigger an investigation alert rather than an emergency stop — the issue could be a regime shift or a bug. Separating statistical drift from execution errors requires logging both fill prices and your model's predicted fair value at order time.
Should I shut the bot down automatically when an alert fires?
It depends on the alert severity. For critical faults — API authentication failure, position exceeding hard limits, or heartbeat loss — automatic pause-and-flatten is appropriate. For softer signals like elevated latency or mild PnL underperformance, a human-reviewed alert is safer than an automatic shutdown that could leave you worse off.
What logging format works best for Kalshi bot diagnostics?
Structured JSON logs with consistent fields (timestamp, event_type, market_ticker, order_id, side, price, size, latency_ms, model_fair_value) make it straightforward to query with tools like Grafana Loki, Datadog, or even simple grep pipelines. Avoid unstructured print statements in production — they are nearly impossible to parse at scale.
How do I monitor fill quality on Kalshi?
Log your order's limit price and the actual fill price for every execution, then compute slippage as (fill_price - model_fair_value) * side_multiplier, where side_multiplier is +1 for yes and -1 for no. Track this as a rolling average. With this sign convention, persistent positive slippage means you are buying above or selling below fair value, which signals that your fair-value model is stale or that you are being adversely selected in the order book.
What external services are commonly used for Kalshi bot alerting?
PagerDuty and Opsgenie are industry standards for on-call routing. For lighter setups, a Slack webhook or Telegram bot notification is quick to wire up. Prometheus with Alertmanager is a strong open-source stack if you are already running infrastructure metrics alongside your trading metrics.
Do I need separate monitoring for the Kalshi WebSocket and REST connections?
Yes. WebSocket and REST connections fail in different ways. The WebSocket can drop without throwing an exception, so you need an idle-message timer that fires if no message arrives within an expected window. REST calls fail loudly with HTTP errors but can also hang on slow responses, so you should enforce request timeouts and track p99 latency separately from WebSocket feed latency.