03 · Methodology
How the model actually works.
The optimization target is closing line value, not win rate. The architecture follows from that.
What we model
Three markets per game: moneyline, total runs, and runline. Each is a separate prediction. Strikeout props and hitter props are tracked in research mode but not on the public card.
For every market, the model produces a probability of a specific outcome (e.g., home team wins, total goes under). That probability is compared to the implied probability of the best price across the sharp books. If the gap exceeds a market-specific EV threshold, the pick goes on the card.
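A minimal sketch of that gate, assuming American odds; the threshold value and function names are illustrative, not the production configuration:

```python
def american_to_prob(odds: int) -> float:
    """Implied probability of an American price (vig included)."""
    return 100 / (odds + 100) if odds > 0 else -odds / (-odds + 100)

def clears_threshold(model_prob: float, best_price: int, ev_threshold: float) -> bool:
    """True when the model's edge over the best sharp price beats the market's threshold."""
    edge = model_prob - american_to_prob(best_price)
    return edge >= ev_threshold

# Illustrative numbers: a 55.5% model probability against -110 (~52.4% implied)
# is a ~3.1-point gap, enough to clear a hypothetical 2.5% threshold.
print(clears_threshold(0.555, -110, 0.025))  # True
```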
One model per market
Closerbets isn't a single model — it's four, one per market we price. Each is trained, validated, and stored independently.
The moneyline ensemble weights are fixed by hand, not learned. The weights are stable across cross-validation folds, so a learned blender would be fitting noise rather than capturing real signal at this sample size.
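To make "fixed by hand" concrete, here is a sketch of the blend; the component names and weights are hypothetical stand-ins, not the real ensemble from Section 02:

```python
# Hypothetical component names and weights; the production values differ.
# The point is that the blend is a fixed dict, reviewed by hand, not a learned meta-model.
MONEYLINE_WEIGHTS = {"gbm": 0.55, "elo": 0.30, "market_prior": 0.15}

def blend_moneyline(component_probs: dict[str, float]) -> float:
    """Fixed-weight average of component win probabilities."""
    return sum(MONEYLINE_WEIGHTS[name] * p for name, p in component_probs.items())

print(blend_moneyline({"gbm": 0.58, "elo": 0.55, "market_prior": 0.52}))  # ~0.562
```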
The calibrator (isotonic regression on out-of-fold predictions) was trained and tested. It hurt validation Brier slightly, so the raw ensemble ships instead. It's re-evaluated at every retrain.
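The check itself is simple to sketch with scikit-learn; variable names are illustrative, and the real harness runs inside the retraining pipeline:

```python
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

def calibration_helps(oof_probs, oof_outcomes, val_probs, val_outcomes) -> bool:
    """Fit isotonic regression on out-of-fold predictions, then ship it only if
    it improves Brier score on held-out validation games."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(oof_probs, oof_outcomes)
    raw_brier = brier_score_loss(val_outcomes, val_probs)
    cal_brier = brier_score_loss(val_outcomes, iso.predict(val_probs))
    return cal_brier < raw_brier  # currently False, so the raw ensemble ships
```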
Features
Fifteen features for moneyline. Nineteen for totals. Every feature respects temporal boundaries — nothing computed from data that wasn't available at the game's first pitch.
Pregame overlays fire at T-60 minutes when lineups post: lineup-adjusted wRC+, catcher framing impact, velocity delta vs. season norm. Market overlays include a contrarian streak fade (capped at 2% adjustment) on 4+ win streak favorites.
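As an illustration of how a capped overlay applies, here is a sketch of the streak fade; only the 2% cap and the 4+ win-streak trigger come from the model, the rest is assumed:

```python
def apply_streak_fade(favorite_prob: float, favorite_win_streak: int,
                      fade: float = 0.02) -> float:
    """Contrarian fade on favorites riding a 4+ game win streak.

    Only the 2% cap and the 4-game trigger come from the model spec;
    the default fade size here is an assumption.
    """
    if favorite_win_streak >= 4:
        favorite_prob -= min(fade, 0.02)  # adjustment never exceeds 2 points
    return favorite_prob
```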
Live data flows in from the MLB Stats and Transactions APIs throughout the day — injury news, roster moves, and lineup changes trigger re-evaluations when starting pitchers shift or key bats are scratched. Game-lineup data is captured into game_lineups for downstream features that depend on who's actually starting, not who was projected to start.
Stabilization
Pitcher and team stats are blended with prior-year priors by sample size — a Bayesian credibility weighting rather than a raw running average. A pitcher with 30 innings of current-season xFIP is weighted mostly toward his prior-year true talent; one with 150 innings is weighted mostly toward his current-year numbers.
This matters most early in the season, where ten starts of noisy ERA-style metrics would otherwise drive the model. By late August the prior weight is small and the model is essentially looking at current-year numbers.
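A sketch of the credibility weighting, assuming a single stabilization constant; the constant below is illustrative, chosen only to reproduce the 30-inning and 150-inning behavior described above:

```python
def stabilize(current: float, prior: float, innings: float, k: float = 90.0) -> float:
    """Credibility-weight a current-season rate against a prior-year prior.

    k is the innings total at which current and prior get equal weight; 90 is
    illustrative, not the production constant.
    """
    w = innings / (innings + k)
    return w * current + (1 - w) * prior

# 30 IP -> 25% weight on current season; 150 IP -> 62.5% on current season.
```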
Closing line value
Anyone can show wins. Closing line value is the metric that survives variance — beat the closing number consistently and the wins follow over a full season.
Suppose we bet at −110 and by first pitch the line has moved to −125. The market adjusted toward our position by 3.2 percentage points of implied probability — that's the CLV, regardless of whether the bet ultimately won.
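The arithmetic behind that example, sketched in code; the helper names are ours, and the prices are American odds:

```python
def implied_prob(american: int) -> float:
    return 100 / (american + 100) if american > 0 else -american / (-american + 100)

def clv_points(entry_odds: int, closing_odds: int) -> float:
    """CLV in percentage points of implied probability; positive means the
    market moved toward our position after we bet."""
    return 100 * (implied_prob(closing_odds) - implied_prob(entry_odds))

print(round(clv_points(-110, -125), 1))  # 3.2
```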
We optimize for CLV because it's the only public-domain signal that predicts long-term edge before the sample is large enough to prove it from raw P&L. A model that consistently beats the close by 1-2% is one that the books — who have every incentive to price tightly — aren't keeping up with.
Win rate is noisy at any sample under several thousand picks. CLV stabilizes much faster. So we publish both, but treat CLV as the primary scorecard.
Closing line capture
Capturing the actual closing line — not the opener, not the line when we bet — requires infrastructure. Naive approaches (capturing at scheduled times) miss because games start at irregular minutes past the hour and odds APIs lag.
Our setup: an AWS Lambda function, scheduled per game by EventBridge from each game's commence_time, fires at T-10 minutes. It pulls the current odds from the same sharp books we entered at, stores them as the closing line, and computes CLV against the entry odds. CLV is what the website displays — not opening-line value, not midline value.
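A sketch of the per-game scheduling, assuming EventBridge Scheduler's one-time schedules via boto3; the function name, ARNs, and payload shape are placeholders, not the production wiring:

```python
from datetime import datetime, timedelta, timezone
import boto3

scheduler = boto3.client("scheduler")

def schedule_close_capture(game_id: str, commence_time: datetime) -> None:
    """Create a one-shot EventBridge schedule that invokes the capture Lambda
    at T-10 minutes before first pitch. ARNs and names are placeholders."""
    fire_at = (commence_time - timedelta(minutes=10)).astimezone(timezone.utc)
    scheduler.create_schedule(
        Name=f"close-capture-{game_id}",
        ScheduleExpression=f"at({fire_at:%Y-%m-%dT%H:%M:%S})",
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            "Arn": "arn:aws:lambda:REGION:ACCOUNT:function:capture-closing-line",  # placeholder
            "RoleArn": "arn:aws:iam::ACCOUNT:role/scheduler-invoke",               # placeholder
            "Input": f'{{"game_id": "{game_id}"}}',
        },
    )
```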
Post-mortem grading
Every settled pick is graded on decision quality, not just result. The grade lives in one of four quadrants based on the combination of CLV (did we beat the close?) and result (did we win?).
The Good Bet + Unlucky bucket is what compounds. The Got Lucky + Bad Bet bucket is what we work to shrink. Treating losses as uniformly bad — or wins as uniformly good — masks the actual model quality and leads to chasing the wrong corrections.
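The grading reduces to a two-by-two lookup. In the sketch below, the two bucket names quoted above are the ones used on the site; the other two labels are illustrative stand-ins:

```python
def grade_pick(beat_close: bool, won: bool) -> str:
    """Map CLV and result to a decision-quality quadrant. The two quoted names
    are the site's; the other two labels are illustrative stand-ins."""
    if beat_close and not won:
        return "Good Bet + Unlucky"
    if not beat_close and won:
        return "Got Lucky + Bad Bet"
    return "Good Bet + Won" if won else "Bad Bet + Lost"  # stand-in labels
```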
Quadrant grades show up next to every settled pick on the daily cards and roll up into the aggregate breakdown on the track record page.
Books, sample, and limits
We only use odds from FanDuel, DraftKings, BetMGM, Caesars, and Fanatics. Offshore books are excluded — their lines are stale and their CLV signal is unreliable. Including them inflates apparent edge without predicting anything.
Live games are excluded entirely from odds fetches via the commenceTimeFrom API parameter, so we never accidentally price into in-play markets.
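A sketch of that fetch, assuming an Odds-API-style endpoint; the URL, market keys, and parameter handling here are assumptions, apart from commenceTimeFrom itself:

```python
from datetime import datetime, timezone
import requests

def fetch_pregame_odds(sport_key: str, api_key: str) -> list:
    """Fetch odds only for games that haven't started by setting commenceTimeFrom
    to the current time. Endpoint and market keys are assumptions."""
    now_iso = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    resp = requests.get(
        f"https://api.the-odds-api.com/v4/sports/{sport_key}/odds",  # assumed provider/endpoint
        params={
            "apiKey": api_key,
            "regions": "us",
            "markets": "h2h,totals,spreads",
            "commenceTimeFrom": now_iso,  # anything already in play is excluded
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```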
Generative analysis layer
The prediction stack produces probabilities and edges. Anthropic Claude (Sonnet) runs on top of that output to produce natural-language analysis — never to drive picks, never to set probabilities, never to choose lines. Every number that ships is from the math models in Section 02. The LLM layer is reproducible commentary on those numbers.
Eight surfaces use it today, all of them presentation surfaces.
The separation matters. The track record on this site reflects model quality, not LLM quality. If the LLM layer were swapped to a different provider tomorrow — or removed entirely — the picks, the closing line value, and the post-mortem grades would be identical. The LLM is a presentation surface, not a prediction surface.
Operations
The system runs as a continuous 24-hour cycle. Six scheduled GitHub Actions plus an AWS Lambda pregame service drive everything from data ingestion to publishing. The website updates from this loop via ISR (incremental static regeneration) — the pages are static, regenerated hourly from the same database the bot writes to.
Nothing runs by hand. If a step fails, the system either retries inside the workflow or surfaces the failure to Discord — there is no quiet degradation. This site is one consumer of the data this loop produces; the Discord bot is another.
What we don't claim
Some things aren't real yet, or aren't shipping the way the ideal version would.
The bar for shipping a feature is empirical — it has to clear a validation Brier threshold on held-out data. Several plausible additions (alternative bullpen scoring, lineup K%-vs-hand) have been measured and closed because the lift wasn't there at the sample size we have.
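A sketch of that gate; the minimum-improvement margin is illustrative, and the real threshold and evaluation data aren't shown here:

```python
from sklearn.metrics import brier_score_loss

def feature_ships(y_true, baseline_probs, candidate_probs, min_gain: float = 0.001) -> bool:
    """A candidate feature ships only if it beats the baseline's held-out Brier
    by a minimum margin. The margin here is illustrative, not the production bar."""
    baseline = brier_score_loss(y_true, baseline_probs)
    candidate = brier_score_loss(y_true, candidate_probs)
    return (baseline - candidate) >= min_gain
```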
Everything that does ship is visible on /track-record — every settled pick, every day, since inception.