
03 · Methodology

How the model actually works.

The optimization target is closing line value, not win rate. The architecture follows from that.

[01]

What we model

Three markets per game: moneyline, total runs, and runline. Each is a separate prediction. Strikeout props and hitter props are tracked in research mode but not on the public card.

For every market, the model produces a probability of a specific outcome (e.g., home team wins, total goes under). That probability is compared to the implied probability of the best price across the sharp books. If the gap exceeds a market-specific EV threshold, the pick goes on the card.

Moneyline / Totals / K props threshold
≥ 3% EV
Hitter props / parlay legs threshold
≥ 0% EV
Stake sizing
flat 0.5 units
Daily card cap
top 3 ML+totals by EV (tier 1 only)
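
As a rough sketch of the gate described above — the helper names and the exact EV accounting (model probability minus best-price implied probability) are illustrative assumptions, not the production code:

```python
def implied_prob(american: int) -> float:
    """Implied probability of an American price, vig included."""
    if american < 0:
        return -american / (-american + 100)
    return 100 / (american + 100)

def ev_gap(model_prob: float, best_price: int) -> float:
    """Edge vs. the best available price across the sharp books."""
    return model_prob - implied_prob(best_price)

EV_THRESHOLDS = {
    "moneyline": 0.03, "totals": 0.03, "k_props": 0.03,
    "hitter_props": 0.00, "parlay_legs": 0.00,
}

def on_the_card(market: str, model_prob: float, best_price: int) -> bool:
    return ev_gap(model_prob, best_price) >= EV_THRESHOLDS[market]
```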
[02]

One model per market

Closerbets isn't a single model — it's four, one per market we price. Each is trained, validated, and stored independently.

Moneyline
Linear ensemble: ELO 34% / Logistic 33% / XGBoost 33%
Totals
TotalsModel — Ridge + negative-binomial run distribution
Strikeouts
StrikeoutModel — Ridge + Poisson
Hitter props
HitterPropModel — Ridge + Poisson
Runline
Derived from the ML ensemble + totals via Pythagorean inversion
Calibration layer
trained, currently disabled

The moneyline ensemble weights are fixed by hand, not learned. Ensemble performance is stable across cross-validation folds under these near-equal weights, so a learned blender would be fitting noise rather than capturing real signal at this sample size.

The calibrator (isotonic regression on out-of-fold predictions) was trained and tested. It hurt validation Brier slightly, so the raw ensemble ships instead. Re-evaluated every retrain.
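
A minimal sketch of both pieces, assuming standard scikit-learn components; member-model probabilities and variable names are illustrative:

```python
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

WEIGHTS = {"elo": 0.34, "logistic": 0.33, "xgboost": 0.33}

def blend(p_elo: float, p_logit: float, p_xgb: float) -> float:
    """Fixed-weight moneyline ensemble; weights set by hand, not learned."""
    return (WEIGHTS["elo"] * p_elo
            + WEIGHTS["logistic"] * p_logit
            + WEIGHTS["xgboost"] * p_xgb)

def maybe_calibrate(oof_probs, oof_labels, val_probs, val_labels):
    """Fit isotonic regression on out-of-fold blends; ship it only if it
    improves validation Brier. Currently it doesn't, so this returns None."""
    iso = IsotonicRegression(out_of_bounds="clip").fit(oof_probs, oof_labels)
    raw_brier = brier_score_loss(val_labels, val_probs)
    cal_brier = brier_score_loss(val_labels, iso.predict(val_probs))
    return iso if cal_brier < raw_brier else None
```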

[03]

Features

Fifteen features for moneyline. Nineteen for totals. Every feature respects temporal boundaries — nothing computed from data that wasn't available at the game's first pitch.

Pitching
xFIP, SIERA, CSW%, velocity delta. Never raw ERA.
Offense
wRC+, ISO, BB%, K%, recent rolling form
Defense
team OAA (Outs Above Average) for both teams
Bullpen
rest state, recent appearances, depleted-pen flag
Park
venue run environment, roof state
Weather
temp + wind, backfilled per game time and roof
Umpire
called-strike tendency, RPG residual
Division
divisional matchup binary
Pythagorean gap
runs scored vs. allowed mismatch

Pregame overlays fire at T-60 minutes when lineups post: lineup-adjusted wRC+, catcher framing impact, velocity delta vs. season norm. Market overlays include a contrarian streak fade (capped at 2% adjustment) on 4+ win streak favorites.
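
An illustrative sketch of the streak-fade overlay. Only the 4+ streak trigger and the 2% cap are documented here; the per-game fade size is an assumption:

```python
STREAK_FADE_CAP = 0.02  # documented cap on the contrarian adjustment

def apply_streak_fade(p_favorite: float, win_streak: int,
                      fade_per_game: float = 0.005) -> float:
    # fade_per_game is a placeholder; only the cap and trigger are real
    if win_streak < 4:
        return p_favorite
    fade = min(STREAK_FADE_CAP, fade_per_game * win_streak)
    return p_favorite - fade
```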

Live data flows in from the MLB Stats and Transactions APIs throughout the day — injury news, roster moves, and lineup changes trigger re-evaluations when starting pitchers shift or key bats are scratched. Game-lineup data is captured into game_lineups for downstream features that depend on who's actually starting, not who was projected to start.

[04]

Stabilization

Pitcher and team stats are blended with prior-year priors by sample size — a Bayesian credibility weighting rather than a raw running average. A pitcher with 30 innings of current-season xFIP gets mostly weighted toward his prior-year true talent; one with 150 innings gets mostly current-year.

This matters most early in the season, where ten starts of noisy ERA-style metrics would otherwise drive the model. By late August the prior weight is small and the model is essentially looking at current-year numbers.
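
A minimal sketch of the credibility blend. The stabilization constant K — the innings count at which current season and prior get equal weight — is an assumption, picked so the examples above come out as described:

```python
K_INNINGS = 60.0  # assumed; equal-weight point between current and prior

def stabilize(current: float, prior: float, innings: float,
              k: float = K_INNINGS) -> float:
    w = innings / (innings + k)          # credibility of current season
    return w * current + (1 - w) * prior

# 30 IP  -> w ≈ 0.33, mostly prior-year talent
# 150 IP -> w ≈ 0.71, mostly current-year numbers
```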

[05]

Closing line value

Anyone can show wins. Closing line value is the metric that survives variance — beat the closing number consistently and the wins follow over a full season.

Entry odds
−110 · 52.4% implied
Closing odds
−125 · 55.6% implied
CLV
+3.2pp

We bet at −110. By first pitch the line had moved to −125. The market adjusted toward our position by 3.2 percentage points — that's the CLV, regardless of whether the bet ultimately won.
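
In code, the worked example is two vig-inclusive implied-probability conversions (the same conversion as the Section 01 sketch):

```python
def implied_prob(american: int) -> float:
    if american < 0:
        return -american / (-american + 100)
    return 100 / (american + 100)

clv_pp = (implied_prob(-125) - implied_prob(-110)) * 100
# 55.6 - 52.4 ≈ +3.2 percentage points, win or lose
```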

We optimize for CLV because it's the only publicly observable signal that predicts long-term edge before the sample is large enough to prove it from raw P&L. A model that consistently beats the close by 1-2% is one that the books — who have every incentive to price tightly — aren't keeping up with.

Win rate is noisy at any sample under several thousand picks. CLV stabilizes much faster. So we publish both, but treat CLV as the primary scorecard.

[06]

Closing line capture

Capturing the actual closing line — not the opener, not the line when we bet — requires infrastructure. Naive approaches (capturing at scheduled times) miss because games start at irregular minutes past the hour and odds APIs lag.

Our setup: an AWS Lambda function, scheduled per game by EventBridge from each game's commence_time, fires at T-10 minutes before first pitch. It pulls the current odds from the same sharp books we entered at, stores them as the closing line, and computes CLV against the entry odds. CLV is what the website displays — not opening-line value, not midline value.

Lambda fire time
T-10min per game (game-specific)
Books captured
FanDuel, DraftKings, BetMGM, Caesars, Fanatics
Schedule source
EventBridge, populated from tracked picks
Storage
closing_odds + clv columns on tracked_picks
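
A hypothetical version of the per-game schedule creation, using EventBridge Scheduler's one-off at() expressions; the names, ARNs, and payload shape are placeholders, not the production configuration:

```python
import json
from datetime import datetime, timedelta
import boto3

scheduler = boto3.client("scheduler")

def schedule_clv_capture(game_id: str, commence_time: datetime,
                         lambda_arn: str, role_arn: str) -> None:
    fire_at = commence_time - timedelta(minutes=10)   # T-10 per game
    scheduler.create_schedule(
        Name=f"clv-capture-{game_id}",
        ScheduleExpression=f"at({fire_at:%Y-%m-%dT%H:%M:%S})",
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            "Arn": lambda_arn,
            "RoleArn": role_arn,
            "Input": json.dumps({"game_id": game_id}),
        },
    )
```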
[07]

Post-mortem grading

Every settled pick is graded on decision quality, not just result. The grade lives in one of four quadrants based on the combination of CLV (did we beat the close?) and result (did we win?):

Good Bet
+CLV and won — model was right and the market agreed
Unlucky
+CLV and lost — model was right, variance went the other way
Got Lucky
−CLV and won — market disagreed and we got bailed out
Bad Bet
−CLV and lost — model was wrong, full stop
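
The grade is a pure function of CLV sign and result; a minimal sketch (treating exactly-zero CLV as positive is an assumption):

```python
def grade(clv: float, won: bool) -> str:
    if clv >= 0:
        return "Good Bet" if won else "Unlucky"
    return "Got Lucky" if won else "Bad Bet"
```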

The Good Bet + Unlucky bucket is what compounds. The Got Lucky + Bad Bet bucket is what we work to shrink. Treating losses as uniformly bad — or wins as uniformly good — masks the actual model quality and leads to chasing the wrong corrections.

Quadrant grades show up next to every settled pick on the daily cards and roll up into the aggregate breakdown on the track record page.

[08]

Books, sample, and limits

We only use odds from FanDuel, DraftKings, BetMGM, Caesars, and Fanatics. Offshore books are excluded — their lines are stale and their CLV signal is unreliable. Including them inflates apparent edge without predicting anything.

Live games are excluded entirely from odds fetches via the commenceTimeFrom API parameter, so we never accidentally price into in-play markets.
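
A hedged sketch of a pregame-only fetch. The endpoint, key handling, and market names are assumptions (an Odds-API-style v4 interface); commenceTimeFrom is the parameter named above:

```python
import os
from datetime import datetime, timezone
import requests

def fetch_pregame_odds() -> list:
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    resp = requests.get(
        "https://api.the-odds-api.com/v4/sports/baseball_mlb/odds",
        params={
            "apiKey": os.environ["ODDS_API_KEY"],
            "regions": "us",
            "markets": "h2h,totals",
            "commenceTimeFrom": now,   # games that haven't started yet
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```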

[09]

Generative analysis layer

The prediction stack produces probabilities and edges. Anthropic Claude (Sonnet) runs on top of that output to produce natural-language analysis — never to drive picks, never to set probabilities, never to choose lines. Every number that ships is from the math models in Section 02. The LLM layer is reproducible commentary on those numbers.

Eight surfaces use it today:

Daily card
Per-pick narrative thesis — why the model likes this pick
Post-mortem
Loss attribution — which features failed yesterday's pick
Drift detection
Root-cause hypothesis when CLV or Brier deteriorates week-over-week
Morning brief
Landscape narrative — today's storylines and regression candidates
Pregame alerts
Impact explanation when a pitcher changes or an injury lands
Game recap
Narrative summary of yesterday's results
/matchup (Discord)
Scouting report on demand for any pitcher / hitter / matchup
/prop-analysis (Discord)
Player prop analysis with on-the-fly model inference + narrative

The separation matters. The track record on this site reflects model quality, not LLM quality. If the LLM layer were swapped to a different provider tomorrow — or removed entirely — the picks, the closing line value, and the post-mortem grades would be identical. The LLM is a presentation surface, not a prediction surface.
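
To make the boundary concrete, a sketch of the call pattern, assuming the standard Anthropic Python SDK; the model ID and prompt are placeholders, and the pick dict arrives with every number already fixed by the models in Section 02:

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the env

def narrative_thesis(pick: dict) -> str:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",   # placeholder Sonnet model ID
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (f"Model output (do not alter any numbers): {pick}. "
                        "Write a two-sentence thesis for this pick."),
        }],
    )
    return msg.content[0].text   # prose only; probabilities untouched
```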

[10]

Operations

The system runs as a continuous 24-hour cycle. Six scheduled GitHub Actions plus an AWS Lambda pregame service drive everything from data ingestion to publishing. The website updates from this loop via ISR — the pages are static, regenerated hourly from the same database the bot writes to.

11:00 PM ET (prev day)
Daily card — picks generated overnight for the next day's slate (captures early lines), posted to Discord with per-pick narrative thesis, risk-checked against drawdown thresholds
Post-card
Pregame orchestrator — Lambda schedules per-game checks at T-60min (SP integrity, lineup integration) and T-10min (CLV capture)
3:00 AM ET
Nightly sync — yesterday's results, ELO update, stat refreshes
3:45 AM ET
Track results — resolve picks, shadow-evaluate all daily_predictions, post-mortem grading, game recap
6:30 AM ET
Morning brief — regression candidates, depleted bullpens, recent performance, Sonnet landscape narrative
2:00 PM ET
Supplemental card — re-evaluates games that had no overnight odds (west coast, late posts)
Sunday 6 AM ET
Weekly retrain — backfill, retrain models, drift detection with Sonnet root-cause hypothesis

Nothing runs by hand. If a step fails, the system either retries inside the workflow or surfaces the failure to Discord — there is no quiet degradation. This site is one consumer of the data this loop produces; the Discord bot is another.

[11]

What we don't claim

Things that aren't real yet, or aren't shipping the way the ideal version would:

Calibration
trained but disabled — raw ensemble currently outperforms
Runline
shadow-tracked only, not on the public card; accumulating sample to validate
Hitter & K props
research mode only, not on the public card — tracked separately
Lineup-aware platoon features
Backfill + ingest infrastructure ships; the K%-vs-hand variant was measured (ΔBrier +0.00009 on 2025 val) and reverted as inert. Data flow stays for future iterations.
Calibration plot, CLV histogram
coming to /track-record

The bar for shipping a feature is empirical — it has to clear a validation Brier threshold on held-out data. Several plausible additions (alternative bullpen scoring, lineup K%-vs-hand) have been measured and closed because the lift wasn't there at the sample size we have.

Everything that does ship is visible on /track-record — every settled pick, every day, since inception.