03 · Methodology
How the system actually works.
A stack of models, one per betting market, all built around a single optimization target: closing line value, not win rate. The architecture follows from that.
Two markets land on the public daily card: moneyline (which team wins) and total runs (over/under on combined score). Three more are modeled in research mode and tracked publicly but not bet: runline (1.5-run spread), strikeout props, and hitter props. A market only graduates from research to the card after it clears a fixed promotion bar — see /research for the live shadow tracking.
For each one, the model spits out a probability — "the home team wins this game 58% of the time" — and we compare that to the implied probability baked into the sportsbook's price. Sportsbook odds are just probabilities in disguise. −110 means "the book thinks this happens about 52% of the time, and is charging you a vig on top." If our number is meaningfully higher than the book's — we're saying it's a 58% bet, the book is pricing 50% — that gap is EV, our edge. If the edge clears a market-specific threshold, the pick goes on the card.
−110 means "risk $110 to win $100." That price implies a 52.4% probability — but the book's baking in a margin (the vig); the true fair odds it's pricing are closer to 50.0%. If the model says the outcome happens 58% of the time, that's an 8-point edge. Clear the market's threshold and it goes on the card.
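The arithmetic above can be sketched in a few lines. This is an illustration, not the site's actual code: the function names are made up here, the de-vig step uses simple proportional normalization (one of several possible methods), and the example assumes both sides of the market are priced at −110.

```python
def american_to_implied(odds: int) -> float:
    """Convert American odds to the implied probability, vig included."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)


def devig_two_way(odds_a: int, odds_b: int) -> tuple[float, float]:
    """Remove the book's margin from a two-sided market.

    Assumption: proportional normalization. Other de-vig methods exist.
    """
    p_a = american_to_implied(odds_a)
    p_b = american_to_implied(odds_b)
    total = p_a + p_b  # exceeds 1.0 by the size of the margin
    return p_a / total, p_b / total


raw = american_to_implied(-110)                    # about 0.524
fair_home, fair_away = devig_two_way(-110, -110)   # 0.50 each
edge = 0.58 - fair_home                            # model says 58%: 8-point edge
```

With a lopsided market (say −150 against +130) the same normalization applies, but the fair probabilities no longer land at 50/50.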
Predicting which team wins and predicting how many runs score are different problems. They depend on different stats and fail in different ways. So instead of one model trying to do everything, we built four — one for each market we publish.
For the moneyline (which team wins), three different methods vote and we average them: a chess-style power rating (ELO), a textbook statistical model (logistic regression), and a tree-based model that catches non-linear quirks (XGBoost). Example of a non-linear quirk: a groundball-heavy pitcher at Coors Field plays very differently than the same pitcher at Oracle Park. A simple linear model can't represent that interaction; a tree-based one can. All three see the same fifteen features; their answers get averaged with fixed weights — 34/33/33.
Three different methods see the same fifteen features and each produce their own probability. The model ships the weighted average — close to the median, less sensitive to any one method's blind spots.
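A minimal sketch of the fixed-weight blend. The 34/33/33 split comes from the text; which method receives the 34 is an assumption here, and the three input probabilities are invented for illustration.

```python
# Fixed weights from the text. Assigning 0.34 to ELO is an assumption.
WEIGHTS = {"elo": 0.34, "logistic": 0.33, "xgboost": 0.33}


def ensemble_probability(probs: dict[str, float]) -> float:
    """Weighted average of the per-method home-win probabilities."""
    return sum(WEIGHTS[name] * probs[name] for name in WEIGHTS)


# Hypothetical outputs from the three methods for one game.
example = {"elo": 0.56, "logistic": 0.58, "xgboost": 0.61}
blended = ensemble_probability(example)  # about 0.583
```

Because the weights are nearly equal, the blend sits close to the middle method's answer, and no single method can drag it far.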
That handles the moneyline. The other three markets each use their own dedicated model, summarized below.
The ensemble weights are set by hand because at this sample size, a learned blender would be fitting noise instead of capturing real signal — the three methods score within a hair of each other on cross-validation folds, so there's no clean signal for an auto-tuner to grab onto.
About the calibrator. A calibrator is a post-processing step that nudges predicted probabilities to match reality — picks the model says are 65% should actually win about 65% of the time. We trained one (isotonic regression) and benchmarked it against the raw ensemble. It scored slightly worse on out-of-sample data, so it ships off. Three separate retrain cycles have re-confirmed that result.
About the probability caps. Two safety rails apply before any pick reaches the EV calculator. The first is symmetric: probabilities above 60% or below 40% get pulled back to that band for sizing — the model is sharpest inside that range, and the tails are where it has been most wrong historically. The second is asymmetric: home favorites get capped at 58%, away favorites don't. A May 2026 audit found home favorites in the 60–70% predicted bucket were converting at 38.5% (the model was saying 63.2%) while away favorites in the same bucket were under-confident. The bias is one-sided, so the fix is one-sided.
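The two rails can be sketched as one small function. The thresholds (40%, 60%, 58%) are from the text; the function name, the `is_home` flag, and the order in which the rails are applied are assumptions.

```python
SYMMETRIC_LOW, SYMMETRIC_HIGH = 0.40, 0.60
HOME_FAVORITE_CAP = 0.58


def capped_probability(p: float, is_home: bool) -> float:
    """Probability for the picked side after both safety rails.

    Assumption: the symmetric clamp runs first, then the home-only cap.
    """
    p = min(max(p, SYMMETRIC_LOW), SYMMETRIC_HIGH)
    if is_home and p > HOME_FAVORITE_CAP:
        p = HOME_FAVORITE_CAP
    return p


capped_probability(0.632, is_home=True)    # 0.58, home favorite cap
capped_probability(0.632, is_home=False)   # 0.60, symmetric band only
```

The capped value feeds sizing and EV only; the sketch does not touch the raw model probability.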
Features are the stats the model looks at for each game. We use sabermetric ones (xFIP, wRC+, OAA) instead of the box-score numbers fans see on TV, because they isolate skill from luck and are far more predictive. Fifteen go into the moneyline model, nineteen into totals.
In plain terms: xFIP grades a pitcher on what he controls (strikeouts, walks, fly balls) and strips out balls-in-play and home-run luck. wRC+ is a 100-baseline hitter score: 120 means 20% better than league average. OAA is a defensive metric that counts plays made vs. plays an average fielder would make.
Every feature respects a strict temporal boundary: nothing computed from data that wasn't available at the game's first pitch. This is the single most consequential rule of the whole pipeline — a model that accidentally peeks at the future looks brilliant in backtests and worthless in production.
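One way to enforce that boundary is to timestamp every input and filter before any feature is computed. The sketch below is hypothetical: the `Observation` type and its fields are invented here, and the strict `<` comparison is a conservative choice rather than a documented rule.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Observation:
    available_at: datetime  # when this data point became knowable
    value: float


def visible_at_first_pitch(
    observations: list[Observation], first_pitch: datetime
) -> list[Observation]:
    """Keep only data that existed strictly before first pitch."""
    return [o for o in observations if o.available_at < first_pitch]


first_pitch = datetime(2026, 5, 1, 23, 5, tzinfo=timezone.utc)
history = [
    Observation(datetime(2026, 4, 30, 2, 0, tzinfo=timezone.utc), 3.85),
    Observation(datetime(2026, 5, 2, 2, 0, tzinfo=timezone.utc), 3.60),  # future
]
usable = visible_at_first_pitch(history, first_pitch)  # drops the second row
```

Running the same filter in backtests and in production is what keeps the two comparable.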
Inputs keep updating right up to first pitch. When confirmed lineups post (around an hour before the game), the model re-runs against the actual posted lineup — adjusted wRC+, catcher framing, velocity versus season norm. Injury news, late starting-pitcher changes, and lineup scratches all trigger fresh evaluations. We also apply a small contrarian overlay: when a heavy public favorite is riding a 4+ game win streak, we trim their bet's edge slightly — the model's probability stays clean, but the EV gets a discount because streaks are mostly random and markets tend to price them as if they aren't.
Early-season stats lie. A pitcher with three good starts looks like an ace; a hitter in a six-game cold streak looks broken. Build a model on raw numbers and it gets fooled twice — once on the way up, once when the player regresses to who they actually are.
So pitcher and team stats get blended with last year's numbers, weighted by how much current-season data we have. A pitcher with 30 innings of current-season xFIP is mostly described by his prior-year true talent. A pitcher with 150 innings is mostly described by current year. The blend shifts continuously as the season goes on. Statisticians call this Bayesian credibility weighting. In practice it just means the model trusts what it has data for.
Pitcher stats get blended continuously as innings accumulate. A three-start hot streak in April barely moves the model's read of a pitcher's true talent; by August, the current-year sample is large enough to mostly speak for itself.
This matters most in April. By late August the prior-year weight is small and the model is essentially looking at current-year numbers — but by then, a single bad April hasn't already wrecked our predictions for the rest of the season.
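A common form of credibility weighting is sketched below. The formula `innings / (innings + k)` and the constant `k = 75` are assumptions chosen so that 30 innings leans on the prior and 150 innings leans on the current season, as described above; the actual weighting curve is not published here.

```python
def blended_xfip(
    current_xfip: float, prior_xfip: float, innings: float, k: float = 75.0
) -> float:
    """Blend current-season and prior-year xFIP by sample size.

    k is a hypothetical credibility constant: the number of innings at
    which current-season data gets exactly half the weight.
    """
    weight = innings / (innings + k)
    return weight * current_xfip + (1 - weight) * prior_xfip


# A pitcher with a 4.00 prior-year xFIP who is running a 3.00 this year.
april = blended_xfip(3.00, 4.00, innings=30)    # about 3.71, mostly prior
august = blended_xfip(3.00, 4.00, innings=150)  # about 3.33, mostly current
```

The weight moves smoothly with every inning pitched, so there is no date at which the model abruptly switches from one season to the other.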
Anyone can show wins. Closing line value is the metric that survives variance — beat the closing number consistently and the wins follow over a full season.
Think of the closing line as the market's final answer after every sharp bettor and book trader has weighed in. By first pitch, the line is the closest thing baseball betting has to a fair price. If we keep getting our bets in at prices better than where the market ends up, we're seeing things the market eventually agrees with, and that's a leading indicator of edge.
We bet at −110. By first pitch the line had moved to −125. The market adjusted toward our position by 3.2 percentage points — that's the CLV, regardless of whether the bet ultimately won.
We optimize for CLV because it's the only public-domain signal that predicts long-term edge before the sample is large enough to prove it from raw profit. Win rate is noisy at any sample under several thousand picks; CLV stabilizes much faster. So we publish both, but treat CLV as the primary scorecard.
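The −110 to −125 example can be checked directly. This sketch measures CLV as the difference in raw implied probability (vig included), which reproduces the 3.2-point figure; whether the production calculation also removes the vig is not stated, so treat that as an open assumption.

```python
def american_to_implied(odds: int) -> float:
    """Convert American odds to the implied probability, vig included."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)


def clv_points(entry_odds: int, closing_odds: int) -> float:
    """Closing line value in percentage points of implied probability.

    Positive means the market moved toward the side we bet.
    """
    return 100 * (american_to_implied(closing_odds) - american_to_implied(entry_odds))


clv = clv_points(-110, -125)  # about 3.17, which rounds to 3.2
```

The result of the game never enters the calculation, which is why CLV can be graded the moment the line closes.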
Capturing the actual closing line (not the opener, not the line when we bet) is harder than it sounds. Games start at irregular minutes past the hour, odds feeds lag, and a single fixed schedule will miss half the slate. So we don't use one.
Every pick gets its own timer scheduled off that game's first-pitch time. Ten minutes before first pitch, an automated job pulls the current odds from the same sharp books we entered at, stores them as the closing line, and computes CLV against the entry odds. CLV on this site is always measured against that actual close — not the opener, not some intermediate snapshot.
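The per-game timing can be sketched with plain datetime arithmetic. The ten-minute lead is from the text; the function names and game identifiers are hypothetical, and the sketch only computes when each capture should fire, not how the job is actually scheduled or run.

```python
from datetime import datetime, timedelta, timezone

CAPTURE_LEAD = timedelta(minutes=10)  # from the text


def closing_capture_time(first_pitch: datetime) -> datetime:
    """When to snapshot the closing line for one game."""
    return first_pitch - CAPTURE_LEAD


def schedule_captures(first_pitches: dict[str, datetime]) -> list[tuple[datetime, str]]:
    """One capture per game, ordered by fire time."""
    return sorted(
        (closing_capture_time(t), game_id) for game_id, t in first_pitches.items()
    )


# Hypothetical slate with irregular start minutes.
slate = {
    "game-b": datetime(2026, 5, 1, 23, 40, tzinfo=timezone.utc),
    "game-a": datetime(2026, 5, 1, 23, 7, tzinfo=timezone.utc),
}
captures = schedule_captures(slate)  # 22:57 for game-a, 23:30 for game-b
```

Deriving each timer from its own first pitch is what lets a 7:07 start and a 7:40 start both get a snapshot at the right moment.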
Most bettors only care if a pick wins or loses. We grade something else too: was the pick correct given the information, even if the result said otherwise? A losing pick can be a good decision (variance went the other way) and a winning pick can be a lucky save. So every settled pick lands in one of four quadrants based on CLV (did the market move our way?) and result (did we win?):
Wins and losses live in the columns; whether the market eventually agreed with us (CLV) lives in the rows. The top row compounds over a season; the bottom row is what we work to shrink.
The Good Bet + Unlucky bucket is what compounds. The Got Lucky + Bad Bet bucket is what we work to shrink. Treating losses as uniformly bad — or wins as uniformly good — masks the actual model quality and leads to chasing the wrong corrections.
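The grading rule reduces to two yes/no questions. In this sketch the four labels are inferred from the bucket names above, CLV of exactly zero is treated as not beating the close, and pushes or voided bets are ignored; all three are assumptions.

```python
def grade_pick(clv_points: float, won: bool) -> str:
    """Place a settled pick in one of four quadrants.

    Rows: did we beat the closing line? Columns: did the bet win?
    """
    beat_close = clv_points > 0
    if beat_close and won:
        return "Good Bet"
    if beat_close:
        return "Unlucky"
    if won:
        return "Got Lucky"
    return "Bad Bet"


grade_pick(3.2, won=False)   # "Unlucky": right side of the market, wrong result
grade_pick(-1.5, won=True)   # "Got Lucky": the market moved against us
```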
Quadrant grades show up next to every settled pick on the daily cards and roll up into the aggregate breakdown on the track record page.
Not all sportsbooks are created equal. We only use odds from FanDuel, DraftKings, BetMGM, Caesars, and Fanatics — the five major US books that update lines aggressively in response to sharp money. Their closing numbers are the closest thing baseball betting has to a fair market price. Offshore books are excluded because their lines are stale; including them inflates apparent edge without predicting anything real.
Live games are excluded entirely from odds fetches, so we never accidentally price into in-play markets. Pre-game lines only.
Math models produce numbers. AI models produce explanations. Closerbets uses both — but they're in separate lanes, and no number on this site is generated by an AI model. The picks, the edges, the closing lines, the post-mortem grades all come from the math stack in Section 02. Anthropic Claude (Sonnet) runs on top of that output to produce natural-language analysis — the "why this pick" thesis on the card, the loss attribution on the post-mortem — but it never sets a probability or chooses a line.
Eight surfaces use it today:
The separation matters. The track record on this site reflects model quality, not LLM quality. If the LLM layer were swapped to a different provider tomorrow — or removed entirely — the picks, the closing line value, and the post-mortem grades would be identical. The LLM is a presentation surface, not a prediction surface.
Closerbets isn't something a person sits down and runs every morning. It's a 24-hour loop of automated jobs — ingestion, modeling, publishing, results settlement, weekly retraining — that fires whether anyone is watching or not. This website and the Discord bot are both consumers of the same loop; pages here are static and regenerate on a fixed cadence from the same source the bot reads from.
Times below are Eastern. Underlying crons are UTC-pinned, so labels shift by one hour between EDT (summer) and EST (winter).
Nothing runs by hand. If a step fails, the system either retries or surfaces the failure to Discord — there is no quiet degradation.
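The one-hour label shift from UTC pinning is easy to see with fixed offsets: EDT is UTC−4 and EST is UTC−5. The 14:00 UTC job below is a made-up example, not one of the actual crons.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")
EST = timezone(timedelta(hours=-5), "EST")


def eastern_label(utc_hour: int, utc_minute: int, summer: bool) -> str:
    """Eastern wall-clock label for a job pinned to a UTC time."""
    # The calendar date is arbitrary; only the time of day matters here.
    fire = datetime(2026, 1, 1, utc_hour, utc_minute, tzinfo=timezone.utc)
    local = fire.astimezone(EDT if summer else EST)
    return local.strftime("%H:%M %Z")


eastern_label(14, 0, summer=True)   # "10:00 EDT"
eastern_label(14, 0, summer=False)  # "09:00 EST"
```

The job fires at the same instant year-round; only its Eastern label moves.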
The model is judged on three numbers, all measured against held-out data the model never saw during training (a technique called walk-forward backtesting — train on seasons before S, predict season S, never leak the future).
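The walk-forward split described in the parenthetical can be sketched as a generator. The function name and the `min_train` parameter are illustrative; the season list is an example, not the actual training window.

```python
from collections.abc import Iterator


def walk_forward_splits(
    seasons: list[int], min_train: int = 1
) -> Iterator[tuple[list[int], int]]:
    """Yield (training seasons, test season) pairs in time order.

    Each test season S is predicted using only seasons before S.
    """
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


splits = list(walk_forward_splits([2021, 2022, 2023, 2024]))
# [([2021], 2022), ([2021, 2022], 2023), ([2021, 2022, 2023], 2024)]
```

Unlike a random train/test split, no fold ever trains on games played after the ones it is asked to predict.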
Brier score is the standard accuracy metric for probabilistic predictions — it punishes both being wrong and being overconfident when wrong. Calibration is the metric the trained-but-disabled calibrator was supposed to improve and didn't; raw ensemble calibration is already inside the noise band of what a calibrator could add.
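For a binary outcome, the Brier score is the mean squared gap between the predicted probability and what happened (1 for a win, 0 for a loss). A minimal sketch with invented inputs:

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 results.

    Lower is better. A constant 50% guess scores 0.25.
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


coin_flip = brier_score([0.5, 0.5], [1, 0])   # 0.25
mild_miss = brier_score([0.6], [0])           # about 0.36
confident_miss = brier_score([0.9], [0])      # about 0.81
```

The squared term is why overconfidence is expensive: a miss at 90% costs more than twice as much as a miss at 60%.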
But the validation metric that matters most is closing line value (see Section 05). Brier and accuracy can look good on a stale model that the market has caught up to; CLV can't.
Half the value of a public model is being honest about what isn't working yet. Two layers here: things shipped but not at their ideal version, and things we measured and explicitly closed.
Shipped, but caveated:
Measured and explicitly closed: Every plausible feature or gate change gets backtested before shipping. The bar is a validation Brier improvement of at least 0.001 on held-out data. Below is a partial inventory of changes tested in the last two months that didn't clear it. Negative results matter — they prevent the model from accumulating cruft and fitting noise.
The pattern across these is consistent: the model is feature-saturated for its current architecture. Future lift comes from better calibration, smarter sizing, or whole new markets, not from adding more inputs.
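The shipping bar for a candidate change can be written as a one-line gate. The 0.001 threshold is from the text; the function name and the Brier values in the example are hypothetical.

```python
MIN_BRIER_IMPROVEMENT = 0.001  # from the text


def clears_bar(baseline_brier: float, candidate_brier: float) -> bool:
    """True if the candidate lowers held-out Brier by at least the bar.

    Lower Brier is better, so improvement is baseline minus candidate.
    """
    return (baseline_brier - candidate_brier) >= MIN_BRIER_IMPROVEMENT


clears_bar(0.2450, 0.2435)  # True: improves by 0.0015
clears_bar(0.2450, 0.2446)  # False: improves by only 0.0004
```

A change that makes Brier slightly better but stays under the bar is treated the same as one that makes it worse: it does not ship.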
Everything that does ship is visible on /track-record — every settled pick, every day, since inception.