An "AI trading agent" sounds like a black box that prints money. It isn't — and the people building ones that actually work treat the LLM as exactly one component in a larger system. The durable pattern is simple to state: the LLM proposes, deterministic code disposes. The model does what models are genuinely good at — reading news, synthesizing messy context, reasoning through scenarios — and hard-coded rules do what they're good at: the math, the risk limits, and the execution. This guide shows how to wire that up on Kalshi with Claude, with real code and an honest account of where LLMs help and where they'll hurt you.

This is the build-it companion to our overview of AI and automation in prediction markets. For the signal-generation half of the story, see using LLMs for prediction market signal generation.

What an AI agent adds — and what it doesn't

A traditional rules bot fires on conditions you specify in advance: "if YES < 40¢ and volume > X, buy." That's powerful and fast, but it's blind to anything you didn't pre-program. An LLM agent adds the ability to react to unstructured reality — a breaking news headline, a shift in tone across several reports, a scenario your rules never anticipated — and to explain its reasoning in plain language you can audit.
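To make the contrast concrete, here is a minimal sketch of that kind of pre-programmed rule. The `MarketSnapshot` fields and the default thresholds are hypothetical, chosen only to mirror the example condition above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketSnapshot:
    """Hypothetical view of one market; field names are illustrative."""
    ticker: str
    yes_price_cents: int  # current YES price in cents
    volume: int           # contracts traded over your lookback window


def rule_fires(snap: MarketSnapshot,
               max_yes_cents: int = 40,
               min_volume: int = 1_000) -> bool:
    """Return True when the pre-programmed condition is met."""
    return snap.yes_price_cents < max_yes_cents and snap.volume > min_volume
```

The rule is fast and fully predictable, but it can only ever see the two fields it was written against.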

What it does not add is a crystal ball. An LLM doesn't know the future, can't do reliable arithmetic, and is famously overconfident when asked for a probability cold. Treat it as a very well-read analyst that drafts a recommendation — never as the trader with its hand on the button.

Architecture: the LLM proposes, the rules dispose

The whole system is a pipeline with a clear separation of duties:

  market data + news      →   [ Claude: reason over context ]
                                        │
                                        ▼
                          structured proposal (JSON):
                          { action, conviction, rationale }
                                        │
                                        ▼
              [ deterministic guardrails: sizing, fees, limits ]
                                        │
                                        ▼
                    [ execution: limit order via Kalshi API ]

The model never touches the order endpoint directly. It returns a proposal, and ordinary code decides whether and how to act on it. That boundary is the entire safety story.
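One way to express that boundary in code is to pass each stage in as a separate function, so the proposing stage is never handed the executing one. This is a sketch under assumptions: `run_pipeline` and the `Proposal` and `Order` shapes are illustrative names, not part of any Kalshi or Anthropic API.

```python
from typing import Callable, Optional

Proposal = dict  # e.g. {"action": "buy_yes", "conviction": 0.8, "rationale": "..."}
Order = dict     # e.g. {"action": "buy_yes", "contracts": 5, "price_cents": 42}


def run_pipeline(
    context: str,
    propose: Callable[[str], Proposal],
    guard: Callable[[Proposal], Optional[Order]],
    execute: Callable[[Order], str],
) -> Optional[str]:
    """Run one pass: the model proposes, the rules decide, execution acts.

    `propose` only ever receives the context string. It is never given
    `execute`, so it has no path to the order endpoint.
    """
    proposal = propose(context)
    order = guard(proposal)  # deterministic; returns None to veto
    if order is None:
        return None
    return execute(order)
```

Because each stage is injected, the same function can run against stubs in tests and against the real model and exchange client in production.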

The agent loop with code

Give Claude the relevant context and force it to return a structured decision using tool use, so you get clean JSON instead of prose you have to parse:

import anthropic

client = anthropic.Anthropic()

PROPOSE_TRADE = {
    "name": "propose_trade",
    "description": "Propose at most one trade, or abstain.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["buy_yes", "buy_no", "abstain"]},
            "conviction": {"type": "number",
                           "description": "0-1, your confidence this is +EV"},
            "rationale": {"type": "string"},
        },
        "required": ["action", "conviction", "rationale"],
    },
}

def propose(market, news):
    msg = client.messages.create(
        model="claude-sonnet-4-6",   # any capable Claude model
        max_tokens=1024,
        tools=[PROPOSE_TRADE],
        tool_choice={"type": "tool", "name": "propose_trade"},
        system=("You are a prediction-market analyst. Reason carefully, "
                "and abstain unless there is a clear, defensible edge. "
                "Do not size the trade — only recommend a direction."),
        messages=[{"role": "user", "content":
                   f"Market: {market}\n\nRecent news:\n{news}"}],
    )
    # Pick out the tool_use block rather than assuming it is always first.
    return next(b.input for b in msg.content if b.type == "tool_use")

Notice what the model is and isn't asked to do. It picks a direction and a conviction and explains itself. It is explicitly told not to size the trade — because sizing is math, and math is the deterministic layer's job.
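A schema describes the shape you asked for, but it is safer not to assume every returned value respects it (for example, that conviction really falls between 0 and 1). A minimal validation sketch, where `validate_proposal` is a hypothetical helper that would run before any guardrail logic:

```python
ALLOWED_ACTIONS = ("buy_yes", "buy_no", "abstain")


def validate_proposal(raw: object) -> dict:
    """Return a cleaned proposal, or raise ValueError if it is malformed."""
    if not isinstance(raw, dict):
        raise ValueError("proposal must be a JSON object")

    action = raw.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")

    conviction = raw.get("conviction")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(conviction, bool) or not isinstance(conviction, (int, float)):
        raise ValueError("conviction must be a number")
    if not 0.0 <= conviction <= 1.0:  # also rejects NaN
        raise ValueError(f"conviction out of range: {conviction}")

    rationale = raw.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")

    return {
        "action": action,
        "conviction": float(conviction),
        "rationale": rationale.strip(),
    }
```

Treating a malformed proposal as an error, rather than guessing what the model meant, keeps the failure mode the same as an abstention: nothing is traded.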

Guardrails are the product

The proposal is an input, not a command. Before anything reaches Kalshi, plain code validates it against every rule you'd apply to a human's idea:

def act_on(proposal, market, account):
    if proposal["action"] == "abstain":
        return None
    if proposal["conviction"] < 0.65:        # ignore low-conviction noise
        return None

    # Sizing is deterministic — derived from conviction and risk caps,
    # never from the model. See our Kelly + risk-management guides.
    contracts = position_size(proposal["conviction"], account)
    price = limit_price(market, proposal["action"])   # rest on the book, maker side

    if violates_risk_limits(contracts, price, account):  # caps, daily loss cap
        return None

    return place_limit_order(market, proposal["action"], contracts, price)

This is where an AI bot earns the right to trade real money. Low-conviction noise is filtered out before it reaches the book, the model's bad arithmetic can't hurt you because it never does the arithmetic, and a hallucinated certainty can't blow up the account because hard position and daily-loss limits sit between the proposal and the exchange. Those limits are the same ones covered in Kalshi bot risk management; an AI agent makes them more important, not less. And as with any bot, you prove it on a dry run before it places a live order.
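The helpers called in `act_on` are left undefined above. As a rough sketch of what two of them might look like, here are hypothetical versions of `position_size` and `violates_risk_limits`. The `Account` fields, the default caps, the linear sizing rule, and the assumption that prices are whole cents between 1 and 99 are all illustrative placeholders, not recommendations and not a Kelly implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    """Hypothetical account state and risk caps, all amounts in cents."""
    balance_cents: int
    open_exposure_cents: int
    pnl_today_cents: int            # negative when losing on the day
    max_contracts: int = 10
    max_exposure_cents: int = 5_000
    daily_loss_cap_cents: int = 2_000


def position_size(conviction: float, account: Account,
                  threshold: float = 0.65) -> int:
    """Scale linearly from 1 contract at the threshold to max_contracts at 1.0."""
    if conviction < threshold:
        return 0
    scale = min((conviction - threshold) / (1.0 - threshold), 1.0)
    return max(1, int(scale * account.max_contracts))


def violates_risk_limits(contracts: int, price_cents: int,
                         account: Account) -> bool:
    """Return True if the order should be blocked."""
    # Assumes whole-cent prices between 1 and 99.
    if contracts <= 0 or not 1 <= price_cents <= 99:
        return True
    cost = contracts * price_cents  # worst-case loss on the position, before fees
    if cost > account.balance_cents:
        return True
    if account.open_exposure_cents + cost > account.max_exposure_cents:
        return True
    if account.pnl_today_cents <= -account.daily_loss_cap_cents:
        return True
    return False
```

Everything here is integer arithmetic in cents, which avoids floating-point drift in the money path. Fees are deliberately left out of this sketch and would need to be added to `cost` in a real system.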

What LLMs are genuinely good and bad at here

Good at:

  • Synthesis. Reading ten articles and extracting what changed is exactly an LLM's strength.
  • Unstructured → structured. Turning a messy headline into "this affects ticker X, direction Y" is reliable when scoped tightly.
  • Scenario reasoning. Enumerating ways a thesis could be wrong, which a rules bot can't do at all.

Bad at:

  • Arithmetic and probability calibration. Models are confidently wrong on numbers. Never let the LLM compute fees, edges, or sizes — and treat its raw "70%" as a vibe, not a probability. This is the same reason naive automated P&L math goes wrong; see why your Kalshi bot's P&L is wrong.
  • Recency. A model only knows what's in its prompt. If you want it reacting to today's news, you have to retrieve and feed that news in — it won't know it on its own.
  • Consistency. The same prompt can yield different calls. Low temperature and tight schemas help, but you must design for variance.
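One possible way to design for that variance, not described in the original pipeline, is to sample the model several times on the same context and act only when the samples agree. The sketch below assumes you already hold a list of proposals; `consensus_action` and the 0.8 agreement threshold are hypothetical.

```python
from collections import Counter
from typing import Sequence


def consensus_action(proposals: Sequence[dict],
                     min_agreement: float = 0.8) -> str:
    """Return the action most samples agree on, or "abstain" if agreement is weak."""
    if not proposals:
        return "abstain"
    counts = Counter(p["action"] for p in proposals)
    action, votes = counts.most_common(1)[0]
    if votes / len(proposals) < min_agreement:
        return "abstain"
    return action
```

The trade-off is cost: each extra sample is another model call, so this fits better on rare, high-stakes triggers than on routine ones.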

A concrete example: reacting to an economic print

Here is how the pieces fit together in a realistic scenario. A CPI report lands hotter than expected. The flow:

  1. Trigger. Your data layer detects the release — a deterministic event, not the model's job.
  2. Retrieve. You pull the release details and a few reputable headlines and pass them to Claude.
  3. Reason. The model synthesizes: "Inflation came in above consensus, which lowers the odds of a near-term rate cut," and returns a proposal with a direction and a conviction.
  4. Validate. Code checks conviction against your threshold, sizes the trade from your risk rules, and confirms it doesn't breach exposure or daily-loss limits.
  5. Execute. If everything passes, a limit order rests on the relevant market.

The model did exactly one thing — turn a number-plus-context into a defensible direction. Everything that touched money stayed in deterministic code. That separation is what makes the system safe to run unattended.
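The trigger and retrieve steps can be sketched in plain code. The `EconomicPrint` type, the 0.1-point surprise threshold, and the context layout are assumptions made for illustration. Note that the surprise is computed here, in code, so the model is handed the number rather than asked to do the subtraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EconomicPrint:
    """Hypothetical record of one data release."""
    name: str
    actual: float     # released value, e.g. year-over-year CPI in percent
    consensus: float  # forecast recorded before the release


def surprise(release: EconomicPrint) -> float:
    """Actual minus consensus, in the release's own units."""
    return release.actual - release.consensus


def should_call_model(release: EconomicPrint,
                      min_abs_surprise: float = 0.1) -> bool:
    """Deterministic trigger: only wake the model on a meaningful surprise."""
    return abs(surprise(release)) >= min_abs_surprise


def build_context(release: EconomicPrint, headlines: list[str]) -> str:
    """Assemble the text passed to the model as the news context."""
    lines = [
        f"{release.name}: actual {release.actual:.2f}, "
        f"consensus {release.consensus:.2f}, "
        f"surprise {surprise(release):+.2f}",
        "Headlines:",
    ]
    lines += [f"- {h}" for h in headlines]
    return "\n".join(lines)
```

The string returned by `build_context` is what would be passed as the `news` argument to `propose`.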

Testing a non-deterministic system

The same prompt can produce different outputs, which makes a naive "run it and see" approach unreliable. Test it like the probabilistic component it is:

  • Build an eval set of past situations with known good answers, and measure how often the model's extraction is actually correct.
  • Pin down what you can — low temperature, tight schemas, and an explicit "abstain when unsure" instruction — to reduce variance.
  • Backtest the full pipeline, not just the model, net of fees and with point-in-time data, exactly as in backtesting a Kalshi strategy.
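The first bullet can be sketched as a small harness. It assumes an eval set of (context, expected action) pairs that you have labeled yourself, and a `propose` callable that returns a dict with an `action` key. The `evaluate` function and its metric names are hypothetical.

```python
from typing import Callable, Iterable


def evaluate(propose: Callable[[str], dict],
             cases: Iterable[tuple[str, str]],
             runs: int = 1) -> dict:
    """Score a proposer against labeled cases.

    A case counts as correct only if every run matches the expected action.
    A case counts as unstable if repeated runs disagree with each other.
    """
    total = correct = unstable = 0
    for context, expected in cases:
        actions = [propose(context)["action"] for _ in range(runs)]
        total += 1
        if all(a == expected for a in actions):
            correct += 1
        if len(set(actions)) > 1:
            unstable += 1
    if total == 0:
        raise ValueError("eval set is empty")
    return {
        "cases": total,
        "accuracy": correct / total,
        "unstable": unstable / total,
    }
```

Running with `runs` greater than 1 separates two different problems: a model that is consistently wrong on a case, and a model that flips between answers on it.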

When not to use an LLM

An LLM is the wrong tool when speed or pure price action is what matters. For a fast scalping strategy on a crypto-hourly market, a model in the loop is too slow and adds nothing — a deterministic rule fires in milliseconds and wins. Reserve the LLM for decisions that genuinely turn on unstructured information; for everything else, plain rules are faster, cheaper, and more predictable. The strongest systems are honest about this division of labor: the model where language is the bottleneck, rules where latency or math is.

Cost, latency, and caching

Calling a frontier model on every market tick is slow and expensive and accomplishes nothing — prices move far faster than a thesis changes. Call the model when the context changes (new headline, new data print), not on a timer. And because your system prompt and instructions are stable across calls, use prompt caching to avoid re-paying for those tokens every time — it meaningfully cuts both latency and cost for an agent that runs all day. The expensive model reasons occasionally; cheap deterministic code handles the fast, frequent execution path.
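Both ideas can be sketched briefly. `ContextGate` is a hypothetical helper that hashes the context per market and reports whether it changed since the last call. `cached_system` wraps a stable system prompt in a text block carrying `cache_control`, the form the Anthropic Messages API accepts for the `system` parameter. Whether a given prompt is actually cached depends on model-specific minimum lengths, so check the current documentation before relying on it.

```python
import hashlib


class ContextGate:
    """Remember the last context seen per market and report changes."""

    def __init__(self) -> None:
        self._last: dict[str, str] = {}

    def changed(self, ticker: str, context: str) -> bool:
        """Return True (and record the context) only when it differs from last time."""
        digest = hashlib.sha256(context.encode("utf-8")).hexdigest()
        if self._last.get(ticker) == digest:
            return False
        self._last[ticker] = digest
        return True


def cached_system(prompt: str) -> list[dict]:
    """Wrap a stable system prompt in a text block marked for prompt caching."""
    return [{
        "type": "text",
        "text": prompt,
        "cache_control": {"type": "ephemeral"},
    }]
```

In use, the agent loop would call the model only when `gate.changed(ticker, context)` returns True, and pass `system=cached_system(SYSTEM_PROMPT)` in place of the plain string.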

Are bots like this taking over the exchange? We looked at the actual evidence in "Are bots taking over Kalshi?", and the honest answer is more nuanced than the hype.

Keep a human in the loop early

When you first deploy an AI agent, don't hand it the keys outright. Run it in a "propose only" mode where it logs the trades it would have made, and review those proposals against what actually happened. You'll quickly see where the model reasons well and where it goes sideways, and you can tune your conviction threshold and guardrails before a dollar is at risk. And because the agent logs its rationale in plain language, every proposal comes with an explanation you can audit — a transparency you simply don't get from an opaque numerical model. Graduating from shadow mode to small live size to full size — the same staged path as any bot — is how you earn trust in a system whose reasoning is probabilistic by nature. The dry-run mode in our builder is built for exactly this kind of staged rollout.
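A "propose only" mode can be as simple as appending each proposal to a log file and scoring it later. The sketch below assumes a JSON Lines file and an `outcomes` mapping from ticker to "yes" or "no" that you fill in once markets resolve. Both function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_proposal(path: Path, ticker: str, proposal: dict) -> None:
    """Append one would-be trade, with its rationale, to a JSONL file."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "ticker": ticker,
        **proposal,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def review(path: Path, outcomes: dict[str, str]) -> dict:
    """Count how many logged directions matched the resolved outcome.

    Abstentions and still-unresolved markets are skipped.
    """
    right = wrong = 0
    with path.open(encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            result = outcomes.get(rec["ticker"])
            if rec["action"] == "abstain" or result is None:
                continue
            if rec["action"] == f"buy_{result}":
                right += 1
            else:
                wrong += 1
    return {"right": right, "wrong": wrong}
```

A direction hit rate is only a first look, since it ignores entry price and fees. It is still enough to show where the rationales go sideways before any money is at risk.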

Get the guardrails without building them

Our bot builder already separates the strategy from the execution layer — sizing, fees, limit orders, and an account-level loss cap are handled for you, so you can focus on the edge. No code required.


Frequently Asked Questions

Quick answers to common questions about building an AI-powered Kalshi trading agent with Claude.

Can you use Claude to trade on Kalshi?

Yes, but the reliable pattern is to let the LLM reason and propose a direction while deterministic code handles sizing, fee math, risk limits, and execution. The model never places orders directly — it returns a structured proposal that ordinary code validates against hard limits before anything reaches the exchange.

Is an AI trading agent better than a rules-based bot?

It's better at reacting to unstructured information like breaking news and at reasoning through scenarios you didn't pre-program. It's not better — and is often worse — at anything numerical. The strongest systems combine both: an LLM for synthesis, hard rules for math and execution.

What are LLMs bad at when trading prediction markets?

Arithmetic, probability calibration, and recency. Models are confidently wrong on numbers, so they should never compute fees, edges, or position sizes, and their stated confidence should be treated as a rough signal, not a real probability. They also only know what's in the prompt, so today's news must be retrieved and fed in.

How do you stop an AI trading bot from doing something reckless?

Put deterministic guardrails between the model's proposal and the exchange: a minimum conviction threshold, position-size caps, an account-level daily loss cap, and limit-order-only execution. Because the model only proposes and never executes, its mistakes are filtered before they can cost money. Always dry-run first.


Priya Chakraborty

Lead Developer & Technical Writer

Priya Chakraborty is Lead Developer at Bot for Kalshi. A former backend infrastructure engineer at Stripe, she now builds automated trading systems that process 10,000+ daily market signals across prediction markets.