closerbetsalpha

03 · Methodology

How the system actually works.

A stack of models, one per betting market, all built around a single optimization target: closing line value, not win rate. The architecture follows from that.

[01]

What we model

Two markets land on the public daily card: moneyline (which team wins) and total runs (over/under on combined score). Three more are modeled in research mode and tracked publicly but not bet: runline (1.5-run spread), strikeout props, and hitter props. A market only graduates from research to the card after it clears a fixed promotion bar — see /research for the live shadow tracking.

For each one, the model spits out a probability ("the home team wins this game 58% of the time"), and we compare that to the implied probability baked into the sportsbook's price. Sportsbook odds are just probabilities in disguise. −110 implies about 52.4%, but that figure includes the book's margin (the vig); with the vig stripped out, the book is pricing the outcome at roughly 50%. If our number is meaningfully higher than the book's fair number (we say 58%, the book is pricing 50%), that gap is EV, our edge. If the edge clears a market-specific threshold, the pick goes on the card.

Price (the line): −110
Implied (with vig): 52.4%
Fair (de-vigged): 50.0%
Model (our number): 58.0%
EV (model − fair): +8.0pp

−110 means "risk $110 to win $100." That price implies a 52.4% probability — but the book's baking in a margin (the vig); the true fair odds it's pricing are closer to 50.0%. If the model says the outcome happens 58% of the time, that's an 8-point edge. Clear the market's threshold and it goes on the card.
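The arithmetic above can be sketched in a few lines. This is a minimal illustration, not the site's code: the function names are invented for the example, and the de-vig step assumes a two-way market rescaled proportionally, since the page does not say which de-vig method is used.

```python
def american_to_implied(odds: int) -> float:
    """Convert American odds to the implied probability (vig included)."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)


def devig_two_way(odds_a: int, odds_b: int) -> tuple[float, float]:
    """Strip the vig by rescaling both sides so they sum to 1 (proportional method)."""
    p_a, p_b = american_to_implied(odds_a), american_to_implied(odds_b)
    total = p_a + p_b
    return p_a / total, p_b / total


def edge_pp(model_prob: float, fair_prob: float) -> float:
    """Edge in percentage points: model probability minus de-vigged fair probability."""
    return (model_prob - fair_prob) * 100


implied = american_to_implied(-110)   # about 0.524
fair, _ = devig_two_way(-110, -110)   # 0.5 when both sides are priced at -110
print(round(implied, 3), round(fair, 3), round(edge_pp(0.58, fair), 1))
```

With both sides at −110 the two implied probabilities sum to about 104.8%, and that overround is the vig being removed.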

Moneyline (on-card)
≥ 10% EV, ceiling 25% (picks above are suppressed as likely model blowup)
Totals (on-card)
≥ 15% EV — OVER picks only; UNDERs are currently suppressed at every tier pending a totals retrain (2026 run environment shifted ~1 run/game)
Strikeout props
≥ 3% EV
Hitter props
≥ 0% EV (research-tier, shadow P&L)
Leans (overflow)
≥ 3% EV
Stake sizing
flat 0.5 units
Card-size cap
none — selectivity comes from the EV thresholds above
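
The on-card gates in the table above amount to a small lookup. A hedged sketch, assuming EV is passed as a fraction in the same units the thresholds are quoted in; the function and constant names are hypothetical and only the two on-card markets are covered.

```python
from typing import Optional

# Thresholds copied from the table above; EV is a fraction (0.10 == 10%).
ML_MIN_EV, ML_MAX_EV = 0.10, 0.25
TOTALS_MIN_EV = 0.15


def on_card(market: str, ev: float, side: Optional[str] = None) -> bool:
    """Return True if a pick clears the on-card gate for its market."""
    if market == "moneyline":
        # Picks above the ceiling are suppressed as likely model blowup.
        return ML_MIN_EV <= ev <= ML_MAX_EV
    if market == "totals":
        # UNDERs are currently suppressed at every tier.
        return side == "over" and ev >= TOTALS_MIN_EV
    return False  # research-tier markets never reach the card
```

The ceiling is what makes the moneyline rule a band and not a floor: a 30% EV pick is rejected, not promoted.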
[02]

One model per market

Predicting which team wins and predicting how many runs score are different problems. They depend on different stats and fail in different ways. So instead of one model trying to do everything, we built four — one for each market we publish.

For the moneyline (which team wins), three different methods vote and we average them: a chess-style power rating (ELO), a textbook statistical model (logistic regression), and a tree-based model that catches non-linear quirks (XGBoost). Example of a non-linear quirk: a groundball-heavy pitcher at Coors Field plays very differently than the same pitcher at Oracle Park. A simple linear model can't represent that interaction; a tree-based one can. All three see the same fifteen features; their answers get averaged with fixed weights — 34/33/33.

Sample moneyline pick (illustrative)
ELO (chess-style rating): weight 34%, probability 58.0%
Logistic (stat model): weight 33%, probability 62.0%
XGBoost (tree model): weight 33%, probability 55.0%
Ensemble (weighted average of the three): 58.3%

Three different methods see the same fifteen features and each produce their own probability. The model ships the weighted average, which with near-equal weights is effectively a simple mean and is less sensitive to any one method's blind spots.
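
The sample pick above works out as follows. A minimal sketch: the weights and component probabilities come from the illustration, while the dictionary keys and function name are made up for the example.

```python
WEIGHTS = {"elo": 0.34, "logistic": 0.33, "xgboost": 0.33}


def ensemble_probability(component_probs: dict[str, float]) -> float:
    """Fixed-weight average of the three component probabilities."""
    return sum(WEIGHTS[name] * component_probs[name] for name in WEIGHTS)


sample = {"elo": 0.58, "logistic": 0.62, "xgboost": 0.55}
print(f"{ensemble_probability(sample):.1%}")  # 58.3%
```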

That handles the moneyline. The other three markets each use their own dedicated model, summarized below.

Moneyline
Linear ensemble: ELO 34% / Logistic 33% / XGBoost 33%
Totals
TotalsModel — Ridge + negative-binomial run distribution
Strikeouts
StrikeoutModel — Ridge + Poisson
Hitter props
HitterPropModel — Ridge + Poisson
Runline
Derived from the ML ensemble + totals via Pythagorean inversion
Calibration layer
trained, currently disabled
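
For the count-based props, one plausible reading of "Ridge + Poisson" is that the regression predicts an expected count and a Poisson distribution turns that into the probability of clearing the posted line. The sketch below shows only that second step and is an assumption about the approach, not the actual StrikeoutModel code; the function name is hypothetical.

```python
import math


def poisson_over_prob(expected_count: float, line: float) -> float:
    """P(X > line) for X ~ Poisson(expected_count), e.g. line=5.5 means 6 or more."""
    max_under = math.floor(line)  # largest count that does not cash the over
    p_under = sum(
        math.exp(-expected_count) * expected_count**k / math.factorial(k)
        for k in range(max_under + 1)
    )
    return 1.0 - p_under
```

A call such as `poisson_over_prob(6.2, 5.5)` would give the model-side probability to compare against the book's de-vigged price.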

The ensemble weights are set by hand because at this sample size, a learned blender would be fitting noise instead of capturing real signal — the three methods score within a hair of each other on cross-validation folds, so there's no clean signal for an auto-tuner to grab onto.

About the calibrator. A calibrator is a post-processing step that nudges predicted probabilities to match reality — picks the model says are 65% should actually win about 65% of the time. We trained one (isotonic regression) and benchmarked it against the raw ensemble. It scored slightly worse on out-of-sample data, so it ships off. Three separate retrain cycles have re-confirmed that result.

About the probability caps. Two safety rails apply before any pick reaches the EV calculator. The first is symmetric: probabilities above 60% or below 40% get pulled back to that band for sizing — the model is sharpest inside that range, and the tails are where it has been most wrong historically. The second is asymmetric: home favorites get capped at 58%, away favorites don't. A May 2026 audit found home favorites in the 60–70% predicted bucket were converting at 38.5% (the model was saying 63.2%) while away favorites in the same bucket were under-confident. The bias is one-sided, so the fix is one-sided.
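
The two rails can be sketched as a single clamp. This assumes every probability is expressed as the home team's win probability, so an away favorite is a value below 50%; that framing and the function name are assumptions made for the example.

```python
def cap_for_sizing(p_home: float) -> float:
    """Apply both safety rails to a home-win probability before EV is computed."""
    # Rail 1 (symmetric): keep the probability inside the 40-60% band.
    capped = min(max(p_home, 0.40), 0.60)
    # Rail 2 (asymmetric): home favorites are capped lower, at 58%.
    if capped > 0.50:
        capped = min(capped, 0.58)
    return capped
```

Under this framing a 63.2% home favorite is sized as 58%, while a 65% away favorite (35% home) is sized as 60% for the away side.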

[03]

Features

Features are the stats the model looks at for each game. We use sabermetric ones (xFIP, wRC+, OAA) instead of the box-score numbers fans see on TV, because they isolate skill from luck and are far more predictive. Fifteen go into the moneyline model, nineteen into totals.

In plain terms: xFIP grades a pitcher on what he controls (strikeouts, walks, fly balls allowed) and strips out home-run and balls-in-play luck. wRC+ is a 100-baseline hitter score: 120 means 20% better than league average. OAA is a defensive metric that counts plays made vs. plays an average fielder would make.

Every feature respects a strict temporal boundary: nothing computed from data that wasn't available at the game's first pitch. This is the single most consequential rule of the whole pipeline — a model that accidentally peeks at the future looks brilliant in backtests and worthless in production.
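
One way to enforce that boundary is to stamp every input with the time it became knowable and filter on it. A minimal sketch; the `StatRecord` type and its field names are hypothetical, and it uses a strict "before first pitch" comparison as the conservative choice.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StatRecord:
    name: str
    value: float
    available_at: datetime  # when this number became knowable


def usable_at_first_pitch(records: list[StatRecord], first_pitch: datetime) -> list[StatRecord]:
    """Keep only records that existed strictly before the game's first pitch."""
    return [r for r in records if r.available_at < first_pitch]


first_pitch = datetime(2026, 5, 1, 23, 5, tzinfo=timezone.utc)
records = [
    StatRecord("xfip", 3.80, datetime(2026, 5, 1, 18, 0, tzinfo=timezone.utc)),
    StatRecord("final_score", 7.0, datetime(2026, 5, 2, 2, 0, tzinfo=timezone.utc)),
]
print([r.name for r in usable_at_first_pitch(records, first_pitch)])  # ['xfip']
```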

Pitching
xFIP, SIERA, CSW%, velocity delta. Never raw ERA.
Offense
team wRC+ (Bayesian-stabilized); team batter K% on the totals model
Defense
team OAA (Outs Above Average) for both teams
Bullpen
rest state, recent appearances, depleted-pen flag
Park
venue run environment, roof state
Weather
temp + wind on the totals model, backfilled per game time and roof
Division
divisional matchup binary
Pythagorean gap
runs scored vs. allowed mismatch

Inputs keep updating right up to first pitch. When confirmed lineups post (around an hour before the game), the model re-runs against the actual posted lineup — adjusted wRC+, catcher framing, velocity versus season norm. Injury news, late starting-pitcher changes, and lineup scratches all trigger fresh evaluations. We also apply a small contrarian overlay: when a heavy public favorite is riding a 4+ game win streak, we trim their bet's edge slightly — the model's probability stays clean, but the EV gets a discount because streaks are mostly random and markets tend to price them as if they aren't.

[04]

Stabilization

Early-season stats lie. A pitcher with three good starts looks like an ace; a hitter in a six-game cold streak looks broken. Build a model on raw numbers and it gets fooled twice — once on the way up, once when the player regresses to who they actually are.

So pitcher and team stats get blended with last year's numbers, weighted by how much current-season data we have. A pitcher with 30 innings of current-season xFIP is mostly described by his prior-year true talent. A pitcher with 150 innings is mostly described by current year. The blend shifts continuously as the season goes on. Statisticians call this Bayesian credibility weighting. In practice it just means the model trusts what it has data for.

Blend of prior-year true talent vs. current-year sample:
30 IP (April starter): 75% prior / 25% current, mostly prior
90 IP (Early June): 50% prior / 50% current, 50/50 blend
150 IP (August vet): 37% prior / 63% current, mostly current

Pitcher stats get blended continuously as innings accumulate. A three-start hot streak in April barely moves the model's read of a pitcher's true talent; by August, the current-year sample is large enough to mostly speak for itself.
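
The three illustrated splits are consistent with a standard credibility weight of IP / (IP + k) with k = 90 innings. That constant is inferred from the figure, not stated on the page, so treat the sketch below as an assumption about the shape of the blend and not the production formula.

```python
STABILIZATION_IP = 90.0  # inferred from the figure: 90 IP gives a 50/50 blend


def current_weight(innings: float, k: float = STABILIZATION_IP) -> float:
    """Share of the blend given to the current-year sample."""
    return innings / (innings + k)


def blended_stat(current: float, prior: float, innings: float) -> float:
    """Credibility-weighted blend of current-year and prior-year values."""
    w = current_weight(innings)
    return w * current + (1.0 - w) * prior


for ip in (30, 90, 150):
    print(ip, f"{current_weight(ip):.1%}")
```

Because the weight is a smooth function of innings, there is no cutoff date where the model suddenly switches from last year's numbers to this year's.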

This matters most in April. By late August the prior-year weight has shrunk and the model is leaning mostly on current-year numbers, and along the way a single bad April never got the chance to wreck our predictions for the rest of the season.

[05]

Closing line value

Anyone can show wins. Closing line value is the metric that survives variance — beat the closing number consistently and the wins follow over a full season.

Think of the closing line as the market's final answer after every sharp bettor and book trader has weighed in. By first pitch, the line is the closest thing baseball betting has to a fair price. If we keep getting our bets in at prices better than where the market ends up, we're seeing things the market eventually agrees with, and that's a leading indicator of edge.

Entry odds: −110 (52.4% implied)
Closing odds: −125 (55.6% implied)
Line move (CLV): +3.2pp

We bet at −110. By first pitch the line had moved to −125. The market adjusted toward our position by 3.2 percentage points — that's the CLV, regardless of whether the bet ultimately won.
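
The CLV figure in that example is the difference between the two implied probabilities. A minimal sketch that follows the example in comparing raw, vig-included implied probabilities; the function names are invented for the illustration.

```python
def implied_prob(american_odds: int) -> float:
    """Implied probability (vig included) from American odds."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def clv_pp(entry_odds: int, closing_odds: int) -> float:
    """CLV in percentage points: closing implied probability minus entry implied."""
    return (implied_prob(closing_odds) - implied_prob(entry_odds)) * 100


print(round(clv_pp(-110, -125), 1))  # 3.2
```

A negative result means the line moved away from the bet, which is the −CLV case in the post-mortem grading.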

We optimize for CLV because it's the only public-domain signal that predicts long-term edge before the sample is large enough to prove it from raw profit. Win rate is noisy at any sample under several thousand picks; CLV stabilizes much faster. So we publish both, but treat CLV as the primary scorecard.

[06]

Closing line capture

Capturing the actual closing line (not the opener, not the line when we bet) is harder than it sounds. Games start at irregular minutes past the hour, odds feeds lag, and a single fixed schedule will miss half the slate. So we don't use one.

Every pick gets its own timer scheduled off that game's first-pitch time. Ten minutes before first pitch, an automated job pulls the current odds from the same sharp books we entered at, stores them as the closing line, and computes CLV against the entry odds. CLV on this site is always measured against that actual close — not the opener, not some intermediate snapshot.

Capture time
T-10min, scheduled per-game
Books captured
FanDuel, DraftKings, BetMGM, Caesars, Fanatics
Reference price
the same sharp book we entered at
Stored alongside
entry odds, computed CLV, settled result
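
The per-game timer reduces to one subtraction per pick. A hedged sketch of how the T-10min capture time could be derived from a game's first pitch; the function names are hypothetical and the scheduler that consumes the delay is left out.

```python
from datetime import datetime, timedelta, timezone

CLOSE_CAPTURE_LEAD = timedelta(minutes=10)  # T-10min, per the table above


def capture_time(first_pitch: datetime) -> datetime:
    """When the closing-line job for this game should fire."""
    return first_pitch - CLOSE_CAPTURE_LEAD


def seconds_until_capture(first_pitch: datetime, now: datetime) -> float:
    """Delay to hand to a scheduler; 0 if the capture time has already passed."""
    return max((capture_time(first_pitch) - now).total_seconds(), 0.0)


first_pitch = datetime(2026, 6, 1, 23, 7, tzinfo=timezone.utc)
print(capture_time(first_pitch).isoformat())
```

Keying the timer to each game's own first pitch is what lets a 7:07 start and a 7:40 start both get a true close, which a single fixed schedule cannot do.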
[07]

Post-mortem grading

Most bettors only care if a pick wins or loses. We grade something else too: was the pick correct given the information, even if the result said otherwise? A losing pick can be a good decision (variance went the other way) and a winning pick can be a lucky save. So every settled pick lands in one of four quadrants based on CLV (did the market move our way?) and result (did we win?):

        Won                                       Lost
+CLV    Good Bet (model right, market agreed)     Unlucky (model right, variance lost)
−CLV    Got Lucky (model wrong, got bailed out)   Bad Bet (model wrong, full stop)

Wins and losses live in the columns; whether the market eventually agreed with us (CLV) lives in the rows. The top row compounds over a season; the bottom row is what we work to shrink.

Good Bet
+CLV and won — model was right and the market agreed
Unlucky
+CLV and lost — model was right, variance went the other way
Got Lucky
−CLV and won — market disagreed and we got bailed out
Bad Bet
−CLV and lost — model was wrong, full stop

The Good Bet + Unlucky bucket is what compounds. The Got Lucky + Bad Bet bucket is what we work to shrink. Treating losses as uniformly bad — or wins as uniformly good — masks the actual model quality and leads to chasing the wrong corrections.
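
The four quadrants map directly onto two booleans. A minimal sketch, assuming a CLV of exactly zero counts as not beating the close (the page does not say how ties are handled); the function name is hypothetical.

```python
def grade_pick(clv_pp: float, won: bool) -> str:
    """Map a settled pick to its post-mortem quadrant."""
    beat_close = clv_pp > 0  # assumption: zero CLV is treated as not beating the close
    if beat_close:
        return "Good Bet" if won else "Unlucky"
    return "Got Lucky" if won else "Bad Bet"
```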

Quadrant grades show up next to every settled pick on the daily cards and roll up into the aggregate breakdown on the track record page.

[08]

Books, sample, and limits

Not all sportsbooks are created equal. We only use odds from FanDuel, DraftKings, BetMGM, Caesars, and Fanatics — the five major US books that update lines aggressively in response to sharp money. Their closing numbers are the closest thing baseball betting has to a fair market price. Offshore books are excluded because their lines are stale; including them inflates apparent edge without predicting anything real.

Live games are excluded entirely from odds fetches, so we never accidentally price into in-play markets. Pre-game lines only.

[09]

Generative analysis layer

Math models produce numbers. AI models produce explanations. Closerbets uses both — but they're in separate lanes, and no number on this site is generated by an AI model. The picks, the edges, the closing lines, the post-mortem grades all come from the math stack in Section 02. Anthropic Claude (Sonnet) runs on top of that output to produce natural-language analysis — the "why this pick" thesis on the card, the loss attribution on the post-mortem — but it never sets a probability or chooses a line.

Eight surfaces use it today:

Daily card
Per-pick narrative thesis — why the model likes this pick
Post-mortem
Loss attribution — which features failed yesterday's pick
Drift detection
Root-cause hypothesis when CLV or Brier deteriorates week-over-week
Morning brief
Landscape narrative — today's storylines and regression candidates
Pregame alerts
Impact explanation when a pitcher changes or an injury lands
Game recap
Narrative summary of yesterday's results
/matchup (Discord)
Scouting report on demand for any pitcher / hitter / matchup
/prop-analysis (Discord)
Player prop analysis with on-the-fly model inference + narrative

The separation matters. The track record on this site reflects model quality, not LLM quality. If the LLM layer were swapped to a different provider tomorrow — or removed entirely — the picks, the closing line value, and the post-mortem grades would be identical. The LLM is a presentation surface, not a prediction surface.

[10]

What runs and when

Closerbets isn't something a person sits down and runs every morning. It's a 24-hour loop of automated jobs — ingestion, modeling, publishing, results settlement, weekly retraining — that fires whether anyone is watching or not. This website and the Discord bot are both consumers of the same loop; pages here are static and regenerate on a fixed cadence from the same source the bot reads from.

Times below are Eastern. Underlying crons are UTC-pinned, so labels shift by one hour between EDT (summer) and EST (winter).

11:00 PM ET (prev day)
Daily card — picks generated overnight for the next day's slate (captures early lines), posted to Discord with per-pick narrative thesis, risk-checked against drawdown thresholds
Post-card
Pregame timers — per-game checks scheduled at T-180min (starting-pitcher confirmation), T-60min (lineup integration), and T-10min (closing-line capture)
3:00 AM ET
Nightly sync — yesterday's results, ELO update, stat refreshes
3:45 AM ET
Track results — resolve picks, shadow-evaluate every prediction, nightly attribution, post-mortem grading, game recap
6:30 AM ET
Morning brief — regression candidates, depleted bullpens, recent performance, generative landscape narrative
Hourly
Data quality — freshness and consistency checks across the pipeline
1:00–6:30 PM ET
Supplemental card — half-hourly sweep that re-evaluates games which had no overnight odds (west coast, late posts)
Sunday 6 AM ET
Weekly retrain — backfill, retrain models, drift detection with generative root-cause hypothesis
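
The one-hour label shift from UTC-pinned crons can be seen with fixed offsets. In this sketch the 07:00 UTC cron time is a hypothetical example chosen to land on 3:00 AM in summer; the page does not list the actual UTC values.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # summer offset
EST = timezone(timedelta(hours=-5), "EST")  # winter offset


def eastern_label(utc_hour: int, utc_minute: int, tz: timezone) -> str:
    """Render a UTC-pinned cron time as a wall-clock label in the given zone."""
    utc_time = datetime(2026, 1, 1, utc_hour, utc_minute, tzinfo=timezone.utc)
    return utc_time.astimezone(tz).strftime("%I:%M %p").lstrip("0")


print(eastern_label(7, 0, EDT))  # 3:00 AM
print(eastern_label(7, 0, EST))  # 2:00 AM
```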

Nothing runs by hand. If a step fails, the system either retries or surfaces the failure to Discord — there is no quiet degradation.

[11]

Validation

The model is judged on three numbers, all measured against held-out data the model never saw during training (a technique called walk-forward backtesting — train on seasons before S, predict season S, never leak the future).

Brier score
0.2426; a coin flip would score 0.250. The model improves on that naive baseline by about 0.007, small in absolute terms but a meaningful edge over thousands of picks.
Out-of-sample accuracy
57.2% on moneyline. Higher than the 50% coin-flip baseline because we only bet picks where the model meaningfully disagrees with the market.
Calibration
Within 2.5% in high-volume probability buckets. Picks the model says are 55–60% to win actually win about 55–60% of the time.
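
The walk-forward scheme behind these numbers (train on seasons before S, predict season S) can be sketched as a split generator. The function name and the `min_train` parameter are hypothetical; the seasons in the usage lines are placeholders, not the actual training range.

```python
from typing import Iterator


def walk_forward_splits(seasons: list[int], min_train: int = 1) -> Iterator[tuple[list[int], int]]:
    """Yield (training seasons, test season) pairs in chronological order."""
    ordered = sorted(seasons)
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


for train, test in walk_forward_splits([2022, 2023, 2024, 2025]):
    print(train, "->", test)
```

Each test season appears only after every season used to train for it, so no fold can see its own future.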

Brier score is the standard accuracy metric for probabilistic predictions — it punishes both being wrong and being overconfident when wrong. Calibration is the metric the trained-but-disabled calibrator was supposed to improve and didn't; raw ensemble calibration is already inside the noise band of what a calibrator could add.
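
For reference, the Brier score is the mean squared difference between the predicted probability and the 0/1 outcome, which is why always predicting 50% scores exactly 0.25. A minimal sketch with a hypothetical function name:

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.25
```

The squared term is what punishes overconfidence: a 90% prediction that loses costs 0.81, while a 60% prediction that loses costs 0.36.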

But the validation metric that matters most is closing line value (see Section 05). Brier and accuracy can look good on a stale model that the market has caught up to; CLV can't.

[12]

What we don't claim

Half the value of a public model is being honest about what isn't working yet. Two layers here: things shipped but not at their ideal version, and things we measured and explicitly closed.

Shipped, but caveated:

Calibration layer
trained but disabled: raw ensemble currently outperforms (re-confirmed across three separate retrain cycles)
Runline
shadow-tracked only, not on the public card. Side-conditional calibrator (v2) ships in the inference path but stays inert until ≥150 closed picks accumulate.
Hitter & K props
research mode only, not on the public card — tracked separately on /research
F5 (first-5-innings) markets
shadow-tracked ML + totals. Promotion gate is ≥150 closed picks (≥200 for F5 ML due to ~25% tie rate).
Bet sizing
flat 0.5 units per on-card pick. Early EV-tiered analysis (N=44) suggested a ~44% P&L lift; win rate cooled, pending re-run at N=100+.
Calibration plot, CLV histogram on /track-record
designed, not yet built

Measured and explicitly closed: Every plausible feature or gate change gets backtested before shipping. The bar is a validation Brier improvement of at least 0.001 on held-out data. Below is a partial inventory of changes tested in the last two months that didn't clear it. Negative results matter — they prevent the model from accumulating cruft and fitting noise.

Isotonic calibrator (×3 tested)
Measured: ΔBrier +0.0009 (2024), +0.0020 (2025)
Verdict: Disabled. Pickle ships as harmless cruft.
Umpire run influence as ML feature
Measured: 73% defaulted-zero in training data; ΔBrier −0.0001
Verdict: Dropped from MODEL_FEATURE_COLS 2026-05-01.
Lineup-aware wRC+ (per-batter rollup)
Measured: ΔBrier +0.0002 on n=2415
Verdict: Inert. Team-season wRC+ already integrates the season's lineup mix.
Component-disagreement filter
Measured: High-disagreement bucket was BEST-calibrated (opposite of hypothesis)
Verdict: Closed. The original audit signal was noise at n=68.
Weather + timezone-change features
Measured: Best individual ΔBrier −0.0004 (game_temp_f); none ≤ −0.0010
Verdict: Closed. Model is feature-saturated.
Side-asymmetry gate (longshot bucket)
Measured: No filter clears the 2026-05-08 ML-cap post-mortem bar
Verdict: Deferred to next recalibration cycle.
Recent form / momentum as feature
Measured: SABR study: 94.5% of streaks within expected variance
Verdict: Used only as a small edge-side overlay (streak fade), never as a probability input.
On-card-vs-lean inversion fix
Measured: Inversion is a market-mix artifact; counterfactual gates fail
Verdict: Closed. Not a gate-quality signal.
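
The shipping bar applied to every entry above is a single comparison. A minimal sketch, using the sign convention visible in the list (negative ΔBrier is an improvement); the names are hypothetical.

```python
BRIER_BAR = 0.001  # minimum required improvement on held-out data


def clears_bar(delta_brier: float) -> bool:
    """True if the change lowers validation Brier by at least the bar.

    Convention (matching the list above): negative delta means improvement.
    """
    return delta_brier <= -BRIER_BAR
```

By this rule the best weather feature (ΔBrier −0.0004) improves the score and still fails, because the improvement is under half the bar.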

The pattern across these is consistent: the model is feature-saturated for its current architecture. Future lift comes from better calibration, smarter sizing, or whole new markets, not from adding more inputs.

Everything that does ship is visible on /track-record — every settled pick, every day, since inception.