An "AI trading agent" sounds like a black box that prints money. It isn't — and the people building ones that actually work treat the LLM as exactly one component in a larger system. The durable pattern is simple to state: the LLM proposes, deterministic code disposes. The model does what models are genuinely good at — reading news, synthesizing messy context, reasoning through scenarios — and hard-coded rules do what they're good at: the math, the risk limits, and the execution. This guide shows how to wire that up on Kalshi with Claude, with real code and an honest account of where LLMs help and where they'll hurt you.
This is the build-it companion to our overview of AI and automation in prediction markets. For the signal-generation half of the story, see using LLMs for prediction market signal generation.
What an AI agent adds — and what it doesn't
A traditional rules bot fires on conditions you specify in advance: "if YES < 40¢ and volume > X, buy." That's powerful and fast, but it's blind to anything you didn't pre-program. An LLM agent adds the ability to react to unstructured reality — a breaking news headline, a shift in tone across several reports, a scenario your rules never anticipated — and to explain its reasoning in plain language you can audit.
What it does not add is a crystal ball. An LLM doesn't know the future, can't do reliable arithmetic, and is famously overconfident when asked for a probability cold. Treat it as a very well-read analyst that drafts a recommendation — never as the trader with its hand on the button.
Architecture: the LLM proposes, the rules dispose
The whole system is a pipeline with a clear separation of duties:
market data + news  →  [ Claude: reason over context ]
                                    │
                                    ▼
                      structured proposal (JSON):
                  { ticker, side, conviction, rationale }
                                    │
                                    ▼
              [ deterministic guardrails: sizing, fees, limits ]
                                    │
                                    ▼
                [ execution: limit order via Kalshi API ]
The model never touches the order endpoint directly. It returns a proposal, and ordinary code decides whether and how to act on it. That boundary is the entire safety story.
The agent loop with code
Give Claude the relevant context and force it to return a structured decision using tool use, so you get clean JSON instead of prose you have to parse:
import anthropic

client = anthropic.Anthropic()

# Tool schema: forces the model to answer with a structured proposal.
PROPOSE_TRADE = {
    "name": "propose_trade",
    "description": "Propose at most one trade, or abstain.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {"type": "string",
                       "enum": ["buy_yes", "buy_no", "abstain"]},
            "conviction": {"type": "number",
                           "description": "0-1, your confidence this is +EV"},
            "rationale": {"type": "string"},
        },
        "required": ["action", "conviction", "rationale"],
    },
}

def propose(market, news):
    msg = client.messages.create(
        model="claude-sonnet-4-6",  # any capable Claude model
        max_tokens=1024,
        tools=[PROPOSE_TRADE],
        tool_choice={"type": "tool", "name": "propose_trade"},
        system=("You are a prediction-market analyst. Reason carefully, "
                "and abstain unless there is a clear, defensible edge. "
                "Do not size the trade; only recommend a direction."),
        messages=[{"role": "user", "content":
                   f"Market: {market}\n\nRecent news:\n{news}"}],
    )
    # Return the input of the forced tool call (the structured proposal).
    return next(b.input for b in msg.content if b.type == "tool_use")
Notice what the model is and isn't asked to do. It picks a direction and a conviction and explains itself. It is explicitly told not to size the trade — because sizing is math, and math is the deterministic layer's job.
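A tool schema shapes the model's output, but it is prudent not to assume every field arrives well-formed. The sketch below is one way to check a proposal in plain code before it reaches any trading logic. It assumes the proposal is a plain dict with the three fields from the schema above; `validate_proposal` and `ALLOWED_ACTIONS` are illustrative names, not part of any library.

```python
from numbers import Real

ALLOWED_ACTIONS = {"buy_yes", "buy_no", "abstain"}

def validate_proposal(raw):
    """Return a cleaned proposal dict, or None if the payload is malformed."""
    if not isinstance(raw, dict):
        return None
    action = raw.get("action")
    conviction = raw.get("conviction")
    rationale = raw.get("rationale")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(conviction, bool) or not isinstance(conviction, Real):
        return None
    # This range check also rejects NaN, since comparisons with NaN are False.
    if not 0.0 <= conviction <= 1.0:
        return None
    if not isinstance(rationale, str) or not rationale.strip():
        return None
    return {"action": action,
            "conviction": float(conviction),
            "rationale": rationale.strip()}
```

Treating a malformed proposal as "no trade" keeps the failure mode boring: a bad payload costs you a missed opportunity, not a bad order.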
Guardrails are the product
The proposal is an input, not a command. Before anything reaches Kalshi, plain code validates it against every rule you'd apply to a human's idea:
def act_on(proposal, market, account):
    if proposal["action"] == "abstain":
        return None
    if proposal["conviction"] < 0.65:  # ignore low-conviction noise
        return None
    # Sizing is deterministic: derived from conviction and risk caps,
    # never from the model. See our Kelly + risk-management guides.
    contracts = position_size(proposal["conviction"], account)
    price = limit_price(market, proposal["action"])  # rest on the book, maker side
    if violates_risk_limits(contracts, price, account):  # caps, daily loss cap
        return None
    return place_limit_order(market, proposal["action"], contracts, price)
This is where an AI bot earns the right to trade real money. Weak signals can't hurt you if low-conviction proposals are filtered out, the model's bad arithmetic can't hurt you if it never does the arithmetic, and a hallucinated certainty can't blow up the account if hard position and daily-loss limits sit between the proposal and the exchange. Those limits are the same ones covered in Kalshi bot risk management; an AI agent makes them more important, not less. And just like any bot, you prove it on a dry run before it places a live order.
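The helpers `position_size` and `violates_risk_limits` are called above but not shown. One possible shape for them is sketched here, under loud assumptions: `account` is a plain dict, every key name in it is hypothetical, prices are in cents per contract, and fees are ignored. The linear conviction-to-size mapping is a placeholder, not a recommendation; a Kelly-style rule would slot into the same place.

```python
import math

def position_size(conviction, account, threshold=0.65):
    """Map conviction to a contract count using a fixed per-trade cap."""
    # Scale linearly from 0 at the threshold to 1 at full conviction.
    scale = (conviction - threshold) / (1.0 - threshold)
    scale = max(0.0, min(1.0, scale))
    return math.floor(scale * account["max_contracts_per_trade"])

def violates_risk_limits(contracts, price_cents, account):
    """Return True if the order should be blocked."""
    if contracts <= 0:
        return True
    # Worst case for a purchased contract is losing the full price paid.
    cost_cents = contracts * price_cents
    if cost_cents > account["max_position_cents"]:
        return True
    if account["open_exposure_cents"] + cost_cents > account["max_exposure_cents"]:
        return True
    if account["realized_loss_today_cents"] >= account["daily_loss_cap_cents"]:
        return True
    return False
```

Every check returns "blocked" by default when something looks off, so a missing or zero size never turns into an order.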
What LLMs are genuinely good and bad at here
Good at:
- Synthesis. Reading ten articles and extracting what changed is exactly an LLM's strength.
- Unstructured → structured. Turning a messy headline into "this affects ticker X, direction Y" is reliable when scoped tightly.
- Scenario reasoning. Enumerating ways a thesis could be wrong, which a rules bot can't do at all.
Bad at:
- Arithmetic and probability calibration. Models are confidently wrong on numbers. Never let the LLM compute fees, edges, or sizes — and treat its raw "70%" as a vibe, not a probability. This is the same reason naive automated P&L math goes wrong; see why your Kalshi bot's P&L is wrong.
- Recency. A model only knows what's in its prompt. If you want it reacting to today's news, you have to retrieve and feed that news in — it won't know it on its own.
- Consistency. The same prompt can yield different calls. Low temperature and tight schemas help, but you must design for variance.
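One way to design for that variance is to sample the model several times on the same context and act only when the samples agree. The sketch below assumes each sample is a proposal dict with `action`, `conviction`, and `rationale` fields; `consensus` and its `min_agreement` parameter are illustrative names and defaults, not tuned values.

```python
from collections import Counter

def consensus(proposals, min_agreement=0.8):
    """Collapse several sampled proposals into one, abstaining on disagreement."""
    if not proposals:
        return {"action": "abstain", "conviction": 0.0, "rationale": "no samples"}
    counts = Counter(p["action"] for p in proposals)
    action, votes = counts.most_common(1)[0]
    share = votes / len(proposals)
    if action == "abstain" or share < min_agreement:
        return {"action": "abstain", "conviction": 0.0,
                "rationale": f"no consensus: {dict(counts)}"}
    agreeing = [p for p in proposals if p["action"] == action]
    # Take the most cautious conviction among the agreeing samples.
    conviction = min(p["conviction"] for p in agreeing)
    return {"action": action, "conviction": conviction,
            "rationale": agreeing[0]["rationale"]}
```

The trade-off is cost: each extra sample is another model call, so this fits best on the rare, context-driven calls rather than anything frequent.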
A concrete example: reacting to an economic print
Here is how the pieces fit together in a realistic scenario: a CPI report lands hotter than expected. The flow:
- Trigger. Your data layer detects the release — a deterministic event, not the model's job.
- Retrieve. You pull the release details and a few reputable headlines and pass them to Claude.
- Reason. The model synthesizes: "Inflation came in above consensus, which lowers the odds of a near-term rate cut," and returns a proposal with a direction and a conviction.
- Validate. Code checks conviction against your threshold, sizes the trade from your risk rules, and confirms it doesn't breach exposure or daily-loss limits.
- Execute. If everything passes, a limit order rests on the relevant market.
The model did exactly one thing — turn a number-plus-context into a defensible direction. Everything that touched money stayed in deterministic code. That separation is what makes the system safe to run unattended.
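The flow above can be sketched as a single function that takes each stage as an injected callable, which keeps the LLM call swappable for a stub in tests. Everything here is illustrative: `handle_release`, the `release` dict shape, the stub functions, and the ticker are all made up for the example.

```python
def handle_release(release, market, account, fetch_headlines, propose, act_on):
    """Run one pass of retrieve -> reason -> validate -> execute."""
    headlines = fetch_headlines(release)                 # Retrieve
    news = "\n".join([release["summary"], *headlines])
    proposal = propose(market, news)                     # Reason (the only LLM call)
    order = act_on(proposal, market, account)            # Validate, then execute
    return {"proposal": proposal, "order": order}

# Stubs standing in for the real data layer, model call, and guardrails.
def fake_fetch(release):
    return ["Headline: inflation tops forecasts"]

def fake_propose(market, news):
    return {"action": "buy_no", "conviction": 0.8,
            "rationale": "Hot CPI lowers the odds of a near-term cut."}

def fake_act_on(proposal, market, account):
    if proposal["action"] == "abstain" or proposal["conviction"] < 0.65:
        return None
    return {"market": market, "side": proposal["action"], "contracts": 1}

result = handle_release(
    {"summary": "CPI came in above consensus."},
    market="HYPOTHETICAL-RATE-CUT",
    account={},
    fetch_headlines=fake_fetch,
    propose=fake_propose,
    act_on=fake_act_on,
)
```

The trigger itself stays outside this function: the data layer decides when to call `handle_release`, so the model is never polling for events.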
Testing a non-deterministic system
The same prompt can produce different outputs, which makes a naive "run it and see" approach unreliable. Test it like the probabilistic component it is:
- Build an eval set of past situations with known good answers, and measure how often the model's extraction is actually correct.
- Pin down what you can — low temperature, tight schemas, and an explicit "abstain when unsure" instruction — to reduce variance.
- Backtest the full pipeline, not just the model, net of fees and with point-in-time data, exactly as in backtesting a Kalshi strategy.
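A minimal eval harness for the first two points might look like the sketch below. It assumes each case is a dict with `market`, `news`, and an `expected` action label you assigned by hand, and that the function under test has the same `(market, news)` signature as `propose`. The `evaluate` name and the case format are assumptions for illustration.

```python
def evaluate(propose_fn, cases, runs=3):
    """Score a proposal function against labeled historical cases."""
    correct = 0
    total = 0
    unstable = 0
    for case in cases:
        # Repeat each case to expose run-to-run variance.
        actions = [propose_fn(case["market"], case["news"])["action"]
                   for _ in range(runs)]
        correct += sum(a == case["expected"] for a in actions)
        total += runs
        if len(set(actions)) > 1:
            unstable += 1
    return {"accuracy": correct / total if total else 0.0,
            "unstable_cases": unstable}
```

Tracking `unstable_cases` separately from accuracy matters: a model that is right on average but flips between runs on the same input is a different problem from one that is consistently wrong.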
When not to use an LLM
An LLM is the wrong tool when speed or pure price action is what matters. For a fast scalping strategy on a crypto-hourly market, a model in the loop is too slow and adds nothing — a deterministic rule fires in milliseconds and wins. Reserve the LLM for decisions that genuinely turn on unstructured information; for everything else, plain rules are faster, cheaper, and more predictable. The strongest systems are honest about this division of labor: the model where language is the bottleneck, rules where latency or math is.
Cost, latency, and caching
Calling a frontier model on every market tick is slow and expensive and accomplishes nothing — prices move far faster than a thesis changes. Call the model when the context changes (new headline, new data print), not on a timer. And because your system prompt and instructions are stable across calls, use prompt caching to avoid re-paying for those tokens every time — it meaningfully cuts both latency and cost for an agent that runs all day. The expensive model reasons occasionally; cheap deterministic code handles the fast, frequent execution path.
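Both ideas can be sketched briefly. `ContextGate` is a hypothetical helper that hashes the context per ticker and reports whether it changed since the last call. `CACHED_SYSTEM` shows the system prompt written as a content block marked with `cache_control`, which is how the Anthropic Messages API accepts a caching request; it would be passed as `system=CACHED_SYSTEM`. Note that caching only applies above a model-specific minimum prompt length, so a system prompt this short would likely not be cached on its own; check the current documentation.

```python
import hashlib

SYSTEM_PROMPT = ("You are a prediction-market analyst. Reason carefully, "
                 "and abstain unless there is a clear, defensible edge.")

# System prompt as a content block flagged for prompt caching.
CACHED_SYSTEM = [{
    "type": "text",
    "text": SYSTEM_PROMPT,
    "cache_control": {"type": "ephemeral"},
}]

class ContextGate:
    """Allow a model call only when a ticker's context has changed."""

    def __init__(self):
        self._last_digest = {}

    def should_call(self, ticker, context):
        digest = hashlib.sha256(context.encode("utf-8")).hexdigest()
        if self._last_digest.get(ticker) == digest:
            return False  # same headlines and data as last time
        self._last_digest[ticker] = digest
        return True
```

In use, the data layer would build the context string on each event and call the model only when `should_call` returns True.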
Are bots like this taking over the exchange? We looked at the actual evidence in "Are bots taking over Kalshi?"; the honest answer is more nuanced than the hype.
Keep a human in the loop early
When you first deploy an AI agent, don't hand it the keys outright. Run it in a "propose only" mode where it logs the trades it would have made, and review those proposals against what actually happened. You'll quickly see where the model reasons well and where it goes sideways, and you can tune your conviction threshold and guardrails before a dollar is at risk. And because the agent logs its rationale in plain language, every proposal comes with an explanation you can audit — a transparency you simply don't get from an opaque numerical model. Graduating from shadow mode to small live size to full size — the same staged path as any bot — is how you earn trust in a system whose reasoning is probabilistic by nature. The dry-run mode in our builder is built for exactly this kind of staged rollout.
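A propose-only mode can be as simple as appending each proposal to a JSON Lines file and scoring it once markets settle. The sketch below assumes proposals are dicts with an `action` field and that you supply settled outcomes as a ticker-to-`"yes"`/`"no"` mapping; the function names, record fields, and demo tickers are all made up for illustration.

```python
import json
import os
import tempfile
import time

def log_proposal(path, ticker, proposal, price_cents):
    """Append one would-be trade to a JSON Lines shadow log."""
    record = {"ts": time.time(), "ticker": ticker,
              "price_cents": price_cents, **proposal}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def score_log(path, outcomes):
    """Hit rate of logged directional calls against settled outcomes."""
    hits = 0
    calls = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            result = outcomes.get(rec["ticker"])  # "yes", "no", or None if unsettled
            if rec["action"] == "abstain" or result is None:
                continue
            calls += 1
            hits += rec["action"] == f"buy_{result}"
    return {"calls": calls, "hit_rate": hits / calls if calls else None}

# Demo with made-up tickers and outcomes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shadow.jsonl")
    log_proposal(path, "DEMO-A", {"action": "buy_yes", "conviction": 0.8, "rationale": "r1"}, 42)
    log_proposal(path, "DEMO-B", {"action": "buy_no", "conviction": 0.7, "rationale": "r2"}, 55)
    log_proposal(path, "DEMO-C", {"action": "abstain", "conviction": 0.2, "rationale": "r3"}, 50)
    summary = score_log(path, {"DEMO-A": "yes", "DEMO-B": "yes", "DEMO-C": "no"})
```

Hit rate alone is not profitability, since it ignores entry price and fees, but logging the price alongside each proposal leaves room to compute that later from the same file.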
Get the guardrails without building them
Our bot builder already separates the strategy from the execution layer — sizing, fees, limit orders, and an account-level loss cap are handled for you, so you can focus on the edge. No code required.
Frequently Asked Questions
Quick answers to common questions about Building an AI-Powered Kalshi Trading Agent with Claude.
Can you use Claude to trade on Kalshi?
Yes, but the reliable pattern is to let the LLM reason and propose a direction while deterministic code handles sizing, fee math, risk limits, and execution. The model never places orders directly — it returns a structured proposal that ordinary code validates against hard limits before anything reaches the exchange.
Is an AI trading agent better than a rules-based bot?
It's better at reacting to unstructured information like breaking news and at reasoning through scenarios you didn't pre-program. It's not better — and is often worse — at anything numerical. The strongest systems combine both: an LLM for synthesis, hard rules for math and execution.
What are LLMs bad at when trading prediction markets?
Arithmetic, probability calibration, and recency. Models are confidently wrong on numbers, so they should never compute fees, edges, or position sizes, and their stated confidence should be treated as a rough signal, not a real probability. They also only know what's in the prompt, so today's news must be retrieved and fed in.
How do you stop an AI trading bot from doing something reckless?
Put deterministic guardrails between the model's proposal and the exchange: a minimum conviction threshold, position-size caps, an account-level daily loss cap, and limit-order-only execution. Because the model only proposes and never executes, its mistakes are filtered before they can cost money. Always dry-run first.