
The Count Tells You Everything

We built a pitch prediction model on 729,827 pitches. It works — but it barely beats just knowing the count.


The idea felt obvious. Every pitcher has patterns. If you know the count, the handedness matchup, the base-runner situation, the previous pitch, and a dozen other factors — surely you can predict the next pitch type well enough to be useful.

We pulled the full 2025 MLB season from Statcast — 729,827 pitches — and built exactly that model. XGBoost, random forest, logistic regression, per-pitcher and pooled, with every pre-pitch feature we could engineer: count, handedness, score differential, pitch count, times through the order, previous pitch in the at-bat, previous pitch result, runner positions. Strict time-based train/test splits. No random shuffling. No cheating.
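A minimal sketch of what "strict time-based split" means here, assuming a pandas DataFrame with Statcast's `game_date` column (the helper itself is illustrative, not the actual repo code):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, date_col: str = "game_date",
                   test_frac: float = 0.2):
    """Split chronologically: train on the earliest dates, test on the rest.
    No shuffling, so nothing from the future leaks into training."""
    dates = sorted(df[date_col].unique())
    cutoff = dates[int(len(dates) * (1 - test_frac))]
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test
```

Shuffled splits would scatter September pitches into the training set while asking the model to predict April, which quietly inflates every metric.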

The First Run: 97% Accuracy

The first model came back at 97% accuracy. That immediately told us something was wrong.

We audited the features and found the model was using release_speed and release_spin_rate — measurements taken after the pitch is thrown. Of course you can predict what pitch was thrown if you already know how fast it went and how much it spun. Strip those post-pitch features out, and the picture changes completely.

This is the kind of data leakage that kills machine learning projects in the real world. If we'd trusted the initial result instead of auditing it, everything that followed would have been wrong.
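The audit itself can be mechanical. A sketch of the kind of guard we mean: the first two blocklisted columns are the leaked features named above, and the rest are other real Statcast fields measured at or after release (the helper name is ours):

```python
# Statcast columns measured during or after the pitch. Training on any of
# these lets the model "predict" what has already happened.
POST_PITCH = {
    "release_speed", "release_spin_rate", "release_pos_x", "release_pos_z",
    "pfx_x", "pfx_z", "plate_x", "plate_z", "launch_speed", "launch_angle",
}

def strip_post_pitch(columns):
    """Return only the columns legitimately known before the pitch."""
    leaked = sorted(c for c in columns if c in POST_PITCH)
    if leaked:
        print(f"Dropping leaked features: {leaked}")
    return [c for c in columns if c not in POST_PITCH]
```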

The Honest Results

Here's what the properly constructed model produced across the full league — all 1,081 pitchers who appeared in 2025:

Model      Accuracy   Baseline                     Gain
XGBoost    39.61%     37.47% (marginal)            +2.1pp
XGBoost    39.61%     39.09% (count-conditional)   +0.5pp

The model beats the marginal baseline (just picking the most common pitch type overall) by +2.1 percentage points. Sounds decent. But it beats the count-conditional baseline (picking the most common pitch type for that count) by only +0.5pp.
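Both baselines are cheap to compute. A sketch, assuming each pitch is a dict with `balls`, `strikes`, and `pitch_type` keys (not the actual repo code):

```python
from collections import Counter

def marginal_baseline(train, test):
    """Accuracy of always predicting the overall most common pitch type."""
    guess = Counter(p["pitch_type"] for p in train).most_common(1)[0][0]
    return sum(p["pitch_type"] == guess for p in test) / len(test)

def count_conditional_baseline(train, test):
    """Accuracy of predicting the most common pitch type for each count."""
    by_count = {}
    for p in train:
        key = (p["balls"], p["strikes"])
        by_count.setdefault(key, Counter())[p["pitch_type"]] += 1
    fallback = Counter(p["pitch_type"] for p in train).most_common(1)[0][0]
    hits = 0
    for p in test:
        c = by_count.get((p["balls"], p["strikes"]))
        guess = c.most_common(1)[0][0] if c else fallback
        hits += p["pitch_type"] == guess
    return hits / len(test)
```

The count-conditional version is the honest yardstick: any model worth deploying has to beat it, not the marginal one.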

Log-Loss Tells the Real Story

Accuracy alone can be misleading. Log-loss measures how calibrated the model's probability estimates are — a more demanding metric. And here the model actually performs worse than the baselines:

Approach                     Log-loss
XGBoost model                1.5248
Marginal baseline            1.4974
Count-conditional baseline   1.4884

Lower is better. The model is less calibrated than simply using count-based frequencies. It's overfitting to noise rather than capturing real signal.
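For reference, this is multiclass cross-entropy over the model's per-pitch probability estimates. A sketch with probabilities as dicts (the function is illustrative; scikit-learn's `log_loss` does the same job on arrays):

```python
import math

def log_loss(probs, outcomes, eps=1e-15):
    """Mean cross-entropy. probs[i] maps pitch type -> predicted probability
    for pitch i; outcomes[i] is the pitch type actually thrown."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        # Clamp to eps so a confident wrong prediction is penalized
        # heavily but finitely.
        total -= math.log(max(p.get(y, 0.0), eps))
    return total / len(outcomes)
```

A model that spreads probability sensibly scores well even when its top pick misses, which is exactly why the overconfident XGBoost model loses here to flat count-based frequencies.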

The Exception That Proves the Rule

Not every pitcher is equally unpredictable. When we filtered to the five most predictable pitchers in the sample, the thesis came back to life:

Pitcher         Accuracy   Gain vs baseline
Chris Sale      58.14%     +8.48pp
Corbin Burnes   56.50%     +2.22pp
Logan Gilbert   51.24%     +2.92pp
Tarik Skubal    38.61%     +4.17pp
Seth Lugo       27.98%     +3.35pp

Chris Sale is the standout. His behavioral signature is compact and stable: slider 43%, four-seam 40%, changeup 11%, sinker 6%. On a 0-2 count against a left-handed hitter, he throws the slider 76% of the time. On 3-0, it's fastball 90% of the time.

But here's the uncomfortable truth: even Sale's predictability doesn't obviously translate into a product. The model says "slider" and it's right 58% of the time. Okay — but the batter also knows it's probably a slider. The advantage isn't asymmetric. The information is already priced in. The top-5 most predictable pitchers represent too small a subset to build a viable product around.

Why the Count Wins

In plain language: if you know the count and nothing else, you already have most of what's knowable about the next pitch. The sequencing history, the platoon matchup, the score, the runners, the fatigue indicators — all of those sophisticated features, combined, add about half a percentage point beyond simply knowing it's a 1-2 count.

This makes intuitive sense once you think about it. The count encodes the game state that matters most to the pitcher's decision: how far ahead or behind he is. A pitcher on 0-2 is going to nibble at the edges regardless of who's batting, what runners are on, or what the last pitch was. A pitcher on 3-0 is going to throw a fastball down the middle almost every time.
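The entire competitive "model" fits in a few lines: a table of pitch-type frequencies indexed by count. A sketch, using the same assumed dict-per-pitch shape as above:

```python
from collections import Counter, defaultdict

def count_lookup(pitches):
    """Build a lookup table of pitch-type frequencies per (balls, strikes)."""
    tallies = defaultdict(Counter)
    for p in pitches:
        tallies[(p["balls"], p["strikes"])][p["pitch_type"]] += 1
    return {
        count: {pt: n / sum(c.values()) for pt, n in c.items()}
        for count, c in tallies.items()
    }
```

Everything the gradient-boosted model adds on top of this table is worth about half a percentage point.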

[Interactive count calculator: on the original page, selecting any count (0-2 strikes, 0-3 balls) shows the full distribution of strikeout, walk, hit, and other outcome probabilities from 2025 Statcast data. The distribution shifts dramatically with each ball and strike.]

Verdict

The broad pitch prediction thesis is dead. The count wins.

A handful of extremely predictable pitchers exist, but the subset is too small and the advantage is symmetric — if a model can spot the pattern, so can the batter. For the league as a whole, the full machinery of modern machine learning adds half a percentage point over a lookup table indexed by count.

This is a negative result, and we're publishing it because negative results matter. If we'd stopped at the +2.1pp marginal improvement, this would have looked promising. It's only by comparing against the right baseline — the count-conditional one — that the finding becomes clear. The count already encodes nearly all predictive information about the next pitch.


Methodology

Data source: 2025 MLB Statcast data (729,827 pitches across all 1,081 pitchers). Accessed via pybaseball.

Features: Count, batter handedness, pitcher handedness, score differential, pitch count, times through the order, previous pitch type and result, runner positions, inning. No post-pitch features (release_speed, release_spin_rate, etc.).
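The previous-pitch features come from a within-at-bat shift, so each row sees only pitches that preceded it. A sketch using Statcast's `game_pk`, `at_bat_number`, and `pitch_number` columns (the helper name is ours):

```python
import pandas as pd

def add_prev_pitch(df: pd.DataFrame) -> pd.DataFrame:
    """Add previous-pitch features within each at-bat. shift(1) exposes
    only the past, so the first pitch of an at-bat gets NaN."""
    df = df.sort_values(["game_pk", "at_bat_number", "pitch_number"])
    grp = df.groupby(["game_pk", "at_bat_number"])
    df["prev_pitch_type"] = grp["pitch_type"].shift(1)
    df["prev_description"] = grp["description"].shift(1)
    return df
```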

Models: XGBoost (primary), random forest, logistic regression. Both pooled (all pitchers) and per-pitcher variants.

Baselines: Marginal baseline = most common pitch type overall (37.47%). Count-conditional baseline = most common pitch type for that specific count (39.09%).

Splits: Strict temporal train/test split. No random shuffling. No future information in any feature.

Log-loss: Cross-entropy loss measuring calibration quality. Lower is better. The model's 1.5248 is worse than both baselines (1.4974 marginal, 1.4884 count-conditional).

If you find an error, tell us — we'd rather be corrected than wrong.

Full methodology documentation →