03 · Methodology

How the model actually works.

The optimization target is closing line value, not win rate. The architecture follows from that.

[01]

What we model

Three markets per game: moneyline, total runs, and runline. Each is a separate prediction. Strikeout props and hitter props are tracked in research mode but not on the public card.

For every market, the model produces a probability of a specific outcome (e.g., home team wins, total goes under). That probability is compared to the implied probability of the best price across the sharp books. If the gap exceeds a market-specific EV threshold, the pick goes on the card.

Moneyline / Totals / K props threshold: ≥ 3% EV

Hitter props / parlay legs threshold: ≥ 0% EV

Stake sizing: flat 0.5 units

Daily card cap: top 3 ML+totals by EV (tier 1 only)
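A minimal sketch of the threshold check described above. It assumes "EV" means expected return per unit staked and that prices arrive as American odds; the function names and the market keys in `EV_THRESHOLDS` are hypothetical, not the production identifiers.

```python
# Thresholds from the table above; the dictionary keys are illustrative names.
EV_THRESHOLDS = {
    "moneyline": 0.03,
    "totals": 0.03,
    "k_props": 0.03,
    "hitter_props": 0.0,
    "parlay_leg": 0.0,
}


def american_to_decimal(odds: int) -> float:
    """Convert American odds to decimal odds (total return per unit staked)."""
    if odds > 0:
        return 1.0 + odds / 100.0
    return 1.0 + 100.0 / -odds


def implied_probability(odds: int) -> float:
    """Break-even win probability implied by a price (vig not removed)."""
    return 1.0 / american_to_decimal(odds)


def expected_value(model_prob: float, odds: int) -> float:
    """Expected profit per unit staked: win p * payout, lose (1 - p) * stake."""
    payout = american_to_decimal(odds) - 1.0
    return model_prob * payout - (1.0 - model_prob)


def qualifies(market: str, model_prob: float, best_odds: int) -> bool:
    """True when the pick clears its market-specific EV threshold."""
    return expected_value(model_prob, best_odds) >= EV_THRESHOLDS[market]
```

With a 55% model probability at −110, `expected_value` comes out to about 5%, which clears the 3% moneyline bar. The same price at 53% does not.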

[02]

One model per market

Closerbets isn't a single model. It's four, one for each market modeled directly, with runline derived from two of them. Each is trained, validated, and stored independently.

Moneyline: linear ensemble of ELO 34% / Logistic 33% / XGBoost 33%

Totals: TotalsModel (Ridge + negative-binomial run distribution)

Strikeouts: StrikeoutModel (Ridge + Poisson)

Hitter props: HitterPropModel (Ridge + Poisson)

Runline: derived from the ML ensemble + totals via Pythagorean inversion

Calibration layer: trained, currently disabled
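To illustrate how a regression output can turn into an over/under probability, here is a sketch that treats total runs as negative-binomial with a given mean. It assumes the Ridge stage supplies the mean and that a single dispersion parameter controls the spread; `nb_pmf`, `prob_under`, and the dispersion value in the usage note are hypothetical, not the TotalsModel internals.

```python
import math


def nb_pmf(k: int, mean: float, dispersion: float) -> float:
    """P(X = k) for a negative binomial with the given mean and dispersion r.

    Variance is mean + mean**2 / r, so a smaller r means fatter tails.
    """
    r = dispersion
    p = r / (r + mean)  # success probability in the (r, p) parameterization
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def prob_under(line: float, mean: float, dispersion: float) -> float:
    """P(total runs < line). On a whole-number line, landing on it is a push."""
    highest_under = math.ceil(line) - 1  # 8.5 -> 8, 9.0 -> 8
    return sum(nb_pmf(k, mean, dispersion) for k in range(highest_under + 1))
```

For example, `prob_under(8.5, mean=8.7, dispersion=20.0)` gives the model's under probability for a projected 8.7-run game, which can then be compared with the book's implied probability.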

The moneyline ensemble weights are fixed by hand, not learned. The weights are stable across cross-validation folds, so a learned blender would be fitting noise rather than capturing real signal at this sample size.
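A fixed-weight blend of this kind is only a few lines. The sketch below uses the weights listed above; the component keys and the `blend` function are illustrative names.

```python
# Hand-set weights from the model table; keys are illustrative.
WEIGHTS = {"elo": 0.34, "logistic": 0.33, "xgboost": 0.33}


def blend(component_probs: dict[str, float]) -> float:
    """Weighted average of component win probabilities."""
    missing = WEIGHTS.keys() - component_probs.keys()
    if missing:
        raise ValueError(f"missing component predictions: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_probs[name] for name in WEIGHTS)
```

Because the weights sum to 1, the blended value stays a valid probability whenever each component does.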

The calibrator (isotonic regression on out-of-fold predictions) was trained and tested. It hurt validation Brier slightly, so the raw ensemble ships instead. Re-evaluated every retrain.
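The ship/no-ship comparison can be sketched with a plain Brier score. This assumes binary outcomes coded 0/1 and a strict "lower Brier wins" rule; `should_ship_calibrator` is a hypothetical helper, not the retrain pipeline's actual gate.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def should_ship_calibrator(
    raw: Sequence[float], calibrated: Sequence[float], outcomes: Sequence[int]
) -> bool:
    """Ship the calibrated output only if it lowers validation Brier."""
    return brier_score(calibrated, outcomes) < brier_score(raw, outcomes)
```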

[03]

Features

Fifteen features for moneyline. Nineteen for totals. Every feature respects temporal boundaries — nothing computed from data that wasn't available at the game's first pitch.

Pitching: xFIP, SIERA, CSW%, velocity delta. Never raw ERA.

Offense: wRC+, ISO, BB%, K%, recent rolling form

Defense: team OAA (Outs Above Average) for both teams

Bullpen: rest state, recent appearances, depleted-pen flag

Park: venue run environment, roof state

Weather: temp + wind, backfilled per game time and roof

Umpire: called-strike tendency, RPG residual

Division: divisional matchup binary

Pythagorean gap: runs scored vs. allowed mismatch

Pregame overlays fire at T-60 minutes when lineups post: lineup-adjusted wRC+, catcher framing impact, velocity delta vs. season norm. Market overlays include a contrarian streak fade (capped at 2% adjustment) on 4+ win streak favorites.

Live data flows in from the MLB Stats and Transactions APIs throughout the day: injury news, roster moves, and lineup changes trigger re-evaluations when starting pitchers shift or key bats are scratched. Game-lineup data is captured into game_lineups for downstream features that depend on who's actually starting, not who was projected to start.

[04]

Stabilization

Pitcher and team stats are blended with prior-year values in proportion to sample size: a Bayesian credibility weighting rather than a raw running average. A pitcher with 30 innings of current-season xFIP is weighted mostly toward his prior-year true-talent estimate; one with 150 innings is weighted mostly toward the current year.

This matters most early in the season, where ten starts of noisy ERA-style metrics would otherwise drive the model. By late August the prior weight is small and the model is essentially looking at current-year numbers.
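One common form of credibility weighting puts weight n / (n + k) on the current-season value, where n is the sample size and k is a stabilization constant. The sketch below assumes that form; the constant of 70 innings in the usage note is purely illustrative and is not the production value.

```python
def credibility_weight(sample: float, k: float) -> float:
    """Share of weight given to the current-season value: n / (n + k)."""
    if sample < 0 or k <= 0:
        raise ValueError("sample must be >= 0 and k must be > 0")
    return sample / (sample + k)


def stabilize(current: float, prior: float, sample: float, k: float) -> float:
    """Blend the current-season stat with the prior-year estimate."""
    w = credibility_weight(sample, k)
    return w * current + (1.0 - w) * prior
```

With k = 70, a pitcher at 30 innings gets 30% weight on the current season and one at 150 innings gets about 68%, matching the "mostly prior" and "mostly current" behavior described above. For instance, `stabilize(3.0, 4.0, sample=30, k=70)` evaluates to 3.7.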

[05]

Closing line value

Anyone can show wins. Closing line value is the metric that survives variance — beat the closing number consistently and the wins follow over a full season.

Entry odds: −110 (52.4% implied)

Closing odds after the line moves: −125 (55.6% implied)

CLV: +3.2pp

We bet at −110. By first pitch the line had moved to −125. The market adjusted toward our position by 3.2 percentage points — that's the CLV, regardless of whether the bet ultimately won.

We optimize for CLV because it's the only publicly observable signal that predicts long-term edge before the sample is large enough to prove it from raw P&L. A model that consistently beats the close by 1-2% is one that the books, who have every incentive to price tightly, aren't keeping up with.

Win rate is noisy at any sample under several thousand picks. CLV stabilizes much faster. So we publish both, but treat CLV as the primary scorecard.
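The arithmetic behind the −110 to −125 example can be reproduced in a few lines. This sketch measures CLV as the change in raw implied probability on the side we bet, without removing the vig, which is how the example above is computed; `clv_pp` is an illustrative name.

```python
def implied_probability(odds: int) -> float:
    """Break-even probability implied by American odds."""
    if odds < 0:
        return -odds / (-odds + 100.0)
    return 100.0 / (odds + 100.0)


def clv_pp(entry_odds: int, closing_odds: int) -> float:
    """CLV in percentage points: closing implied minus entry implied."""
    return (implied_probability(closing_odds) - implied_probability(entry_odds)) * 100.0
```

`clv_pp(-110, -125)` is roughly 3.17, which rounds to the +3.2pp shown above. A negative value means the market moved away from the position.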

[06]

Closing line capture

Capturing the actual closing line — not the opener, not the line when we bet — requires infrastructure. Naive approaches (capturing at scheduled times) miss because games start at irregular minutes past the hour and odds APIs lag.

Our setup: an AWS Lambda function, scheduled by EventBridge from each game's commence_time, fires at T-10 minutes for that game. It pulls the current odds from the same sharp books we entered at, stores them as the closing line, and computes CLV against the entry odds. CLV is what the website displays, not opening-line value or midline value.

Lambda fire time: T-10min per game (game-specific)

Books captured: FanDuel, DraftKings, BetMGM, Caesars, Fanatics

Schedule source: EventBridge, populated from tracked picks

Storage: closing_odds + clv columns on tracked_picks
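Deriving the per-game fire time from commence_time might look like the sketch below. It assumes commence_time is an ISO 8601 UTC string and that the schedule is a one-time EventBridge Scheduler `at(...)` expression evaluated in UTC; both are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def capture_time(commence_time: str, lead_minutes: int = 10) -> datetime:
    """Return the UTC instant lead_minutes before first pitch."""
    # Older fromisoformat versions reject a trailing "Z", so normalize it.
    start = datetime.fromisoformat(commence_time.replace("Z", "+00:00"))
    if start.tzinfo is None:
        raise ValueError("commence_time must carry a UTC offset")
    return (start - timedelta(minutes=lead_minutes)).astimezone(timezone.utc)


def schedule_expression(fire_at: datetime) -> str:
    """Format a one-time schedule expression (assumed to be read as UTC)."""
    return "at(" + fire_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S") + ")"
```

A game with commence_time `2025-07-04T23:07:00Z` would be captured at 22:57 UTC, which is why a fixed top-of-the-hour schedule would miss it.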

[07]

Post-mortem grading

Every settled pick is graded on decision quality, not just result. The grade lives in one of four quadrants based on the combination of CLV (did we beat the close?) and result (did we win?):

Good Bet: +CLV and won. Model was right and the market agreed.

Unlucky: +CLV and lost. Model was right, variance went the other way.

Got Lucky: −CLV and won. Market disagreed and we got bailed out.

Bad Bet: −CLV and lost. Model was wrong, full stop.

The Good Bet + Unlucky bucket is what compounds. The Got Lucky + Bad Bet bucket is what we work to shrink. Treating losses as uniformly bad — or wins as uniformly good — masks the actual model quality and leads to chasing the wrong corrections.

Quadrant grades show up next to every settled pick on the daily cards and roll up into the aggregate breakdown on the track record page.
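The quadrant assignment reduces to two booleans. In this sketch a CLV of exactly zero is treated as not beating the close, and pushes are not handled; the text doesn't specify either case, so both are assumptions.

```python
def grade(clv_pp: float, won: bool) -> str:
    """Map (CLV sign, result) to one of the four post-mortem quadrants."""
    beat_close = clv_pp > 0  # assumption: zero CLV counts as not beating the close
    if beat_close:
        return "Good Bet" if won else "Unlucky"
    return "Got Lucky" if won else "Bad Bet"
```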

[08]

Books, sample, and limits

We only use odds from FanDuel, DraftKings, BetMGM, Caesars, and Fanatics. Offshore books are excluded — their lines are stale and their CLV signal is unreliable. Including them inflates apparent edge without predicting anything.

Live games are excluded entirely from odds fetches via the commenceTimeFrom API parameter, so we never accidentally price into in-play markets.
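As a sketch of that filter, the query can pin commenceTimeFrom to the current time so only games that have not started are returned. The parameter name comes from the text above; the timestamp format and the `pregame_query` helper are assumptions.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def pregame_query(now: datetime) -> str:
    """Build a query string that restricts results to games starting at or after now."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    # Assumed format: ISO 8601 in UTC with a trailing Z.
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return urlencode({"commenceTimeFrom": stamp})
```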

[09]

Generative analysis layer

The prediction stack produces probabilities and edges. Anthropic Claude (Sonnet) runs on top of that output to produce natural-language analysis — never to drive picks, never to set probabilities, never to choose lines. Every number that ships is from the math models in Section 02. The LLM layer is reproducible commentary on those numbers.

Eight surfaces use it today:

Daily card: per-pick narrative thesis (why the model likes this pick)

Post-mortem: loss attribution (which features failed yesterday's pick)

Drift detection: root-cause hypothesis when CLV or Brier deteriorates week-over-week

Morning brief: landscape narrative (today's storylines and regression candidates)

Pregame alerts: impact explanation when a pitcher changes or an injury lands

Game recap: narrative summary of yesterday's results

/matchup (Discord): scouting report on demand for any pitcher / hitter / matchup

/prop-analysis (Discord): player prop analysis with on-the-fly model inference + narrative

The separation matters. The track record on this site reflects model quality, not LLM quality. If the LLM layer were swapped to a different provider tomorrow — or removed entirely — the picks, the closing line value, and the post-mortem grades would be identical. The LLM is a presentation surface, not a prediction surface.

[10]

Operations

The system runs as a continuous 24-hour cycle. Six scheduled GitHub Actions plus an AWS Lambda pregame service drive everything from data ingestion to publishing. The website updates from this loop via ISR — the pages are static, regenerated hourly from the same database the bot writes to.

11:00 PM ET (prev day): Daily card. Picks generated overnight for the next day's slate (captures early lines), posted to Discord with per-pick narrative thesis, risk-checked against drawdown thresholds.

Post-card: Pregame orchestrator. Lambda schedules per-game checks at T-60min (SP integrity, lineup integration) and T-10min (CLV capture).

3:00 AM ET: Nightly sync. Yesterday's results, ELO update, stat refreshes.

3:45 AM ET: Track results. Resolve picks, shadow-evaluate all daily_predictions, post-mortem grading, game recap.

6:30 AM ET: Morning brief. Regression candidates, depleted bullpens, recent performance, Sonnet landscape narrative.

2:00 PM ET: Supplemental card. Re-evaluates games that had no overnight odds (west coast, late posts).

Sunday 6 AM ET: Weekly retrain. Backfill, retrain models, drift detection with Sonnet root-cause hypothesis.

Nothing runs by hand. If a step fails, the system either retries inside the workflow or surfaces the failure to Discord — there is no quiet degradation. This site is one consumer of the data this loop produces; the Discord bot is another.

[11]

What we don't claim

Things that aren't real yet, or aren't shipping the way the ideal version would:

Calibration: trained but disabled; raw ensemble currently outperforms

Runline: shadow-tracked only, not on the public card; accumulating sample to validate

Hitter & K props: research mode only, not on the public card; tracked separately

Lineup-aware platoon features: backfill + ingest infrastructure ships; the K%-vs-hand variant was measured (ΔBrier +0.00009 on 2025 val) and reverted as inert. Data flow stays for future iterations.

Calibration plot, CLV histogram: coming to /track-record

The bar for shipping a feature is empirical — it has to clear a validation Brier threshold on held-out data. Several plausible additions (alternative bullpen scoring, lineup K%-vs-hand) have been measured and closed because the lift wasn't there at the sample size we have.

Everything that does ship is visible on /track-record — every settled pick, every day, since inception.