The idea felt obvious. Every pitcher has patterns. If you know the count, the handedness matchup, the base-runner situation, the previous pitch, and a dozen other factors — surely you can predict the next pitch type well enough to be useful.
We pulled the full 2025 MLB season from Statcast — 729,827 pitches — and built exactly that model. XGBoost, random forest, logistic regression, per-pitcher and pooled, with every pre-pitch feature we could engineer: count, handedness, score differential, pitch count, times through the order, previous pitch in the at-bat, previous pitch result, runner positions. Strict time-based train/test splits. No random shuffling. No cheating.
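The strict time-based split is worth spelling out, since it is what later catches the leakage. A minimal sketch, assuming a `game_date` field in the Statcast rows (the pitches below are invented, not real 2025 data):

```python
from datetime import date

def time_split(pitches, cutoff):
    """Train on pitches before `cutoff`, test on the rest.
    No shuffling, so nothing from the future leaks into training."""
    train = [p for p in pitches if p["game_date"] < cutoff]
    test = [p for p in pitches if p["game_date"] >= cutoff]
    return train, test

pitches = [
    {"game_date": date(2025, 4, 1), "pitch_type": "FF"},
    {"game_date": date(2025, 6, 10), "pitch_type": "SL"},
    {"game_date": date(2025, 8, 15), "pitch_type": "FF"},
]
train, test = time_split(pitches, cutoff=date(2025, 7, 1))
```

A random shuffle would let the model memorize a pitcher's late-season tendencies and "predict" his early season; the date cutoff rules that out.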
The First Run: 97% Accuracy
The first model came back at 97% accuracy. That immediately told us something was wrong.
We audited the features and found the model was using release_speed and release_spin_rate — measurements taken after the pitch is thrown. Of course you can predict what pitch was thrown if you already know how fast it went and how much it spun. Strip those post-pitch features out, and the picture changes completely.
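An audit like this can be enforced in code with an explicit deny-list of post-pitch columns. The names below follow Statcast conventions, but the full list we used is an assumption for this sketch:

```python
# Measurements taken after release must never reach the model.
# (Column names follow Statcast; the exact deny-list is illustrative.)
POST_PITCH = {"release_speed", "release_spin_rate",
              "pfx_x", "pfx_z", "plate_x", "plate_z"}

def pre_pitch_features(row, target="pitch_type"):
    """Keep only fields knowable before the ball leaves the hand."""
    return {k: v for k, v in row.items()
            if k not in POST_PITCH and k != target}

row = {"balls": 1, "strikes": 2, "release_speed": 94.3,
       "release_spin_rate": 2450, "pitch_type": "FF"}
features = pre_pitch_features(row)
```

Centralizing the deny-list means a new leaky column has to be added in exactly one place to be excluded everywhere.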
This is the kind of data leakage that kills machine learning projects in the real world. If we'd trusted the initial result instead of auditing it, everything that followed would have been wrong.
The Honest Results
Here's what the properly constructed model produced across the full league — all 1,081 pitchers who appeared in 2025:
| Model | Accuracy | Baseline | Baseline accuracy | Gain |
|---|---|---|---|---|
| XGBoost | 39.61% | Marginal | 37.47% | +2.1pp |
| XGBoost | 39.61% | Count-conditional | 39.09% | +0.5pp |
The model beats the marginal baseline (just picking the most common pitch type overall) by +2.1 percentage points. Sounds decent. But it beats the count-conditional baseline (picking the most common pitch type for that count) by only +0.5pp.
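Both baselines are cheap to compute, which is part of what makes the comparison so damning. A sketch on invented pitches (the table's numbers come from the real test set):

```python
from collections import Counter

def marginal_baseline(train):
    """Most common pitch type overall."""
    return Counter(p["pitch_type"] for p in train).most_common(1)[0][0]

def count_conditional_baseline(train):
    """Most common pitch type per count, falling back to the marginal."""
    by_count = {}
    for p in train:
        by_count.setdefault(p["count"], Counter())[p["pitch_type"]] += 1
    fallback = marginal_baseline(train)
    def predict(count):
        if count in by_count:
            return by_count[count].most_common(1)[0][0]
        return fallback
    return predict

train = [{"count": "0-2", "pitch_type": "SL"},
         {"count": "0-2", "pitch_type": "SL"},
         {"count": "3-0", "pitch_type": "FF"},
         {"count": "1-1", "pitch_type": "FF"},
         {"count": "1-1", "pitch_type": "FF"}]
predict = count_conditional_baseline(train)
```

Any model that can't clearly beat `predict` is not earning its complexity.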
Log-loss tells the real story
Accuracy alone can be misleading. Log-loss measures how calibrated the model's probability estimates are — a more demanding metric. And here the model actually performs worse than the baselines:
| Approach | Log-loss |
|---|---|
| XGBoost model | 1.5248 |
| Marginal baseline | 1.4974 |
| Count-conditional baseline | 1.4884 |
Lower is better. The model is less calibrated than simply using count-based frequencies. It's overfitting to noise rather than capturing real signal.
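How a model can match a baseline on accuracy yet lose on log-loss is easy to demonstrate. A toy sketch with invented probabilities (not the table's values): both predictors below always pick "slider", so their accuracy is identical, but the overconfident one pays dearly when it's wrong.

```python
import math

def log_loss(dists, actuals, eps=1e-15):
    """Mean negative log-probability assigned to the pitch actually
    thrown. Lower is better; `eps` guards against log(0)."""
    return sum(-math.log(max(d.get(a, 0.0), eps))
               for d, a in zip(dists, actuals)) / len(actuals)

actuals = ["SL", "FF", "SL"]
# A modest, calibrated count-based distribution...
calibrated = [{"SL": 0.55, "FF": 0.45}] * 3
# ...versus an overconfident model that is badly wrong on the miss.
overconfident = [{"SL": 0.9, "FF": 0.1}] * 3
```

This is the pattern in the table: the model's sharper probabilities are confidence it hasn't earned.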
The Exception That Proves the Rule
Not every pitcher is equally unpredictable. When we filtered to the five most predictable pitchers in the sample, the thesis came back to life:
| Pitcher | Accuracy | Gain vs baseline |
|---|---|---|
| Chris Sale | 58.14% | +8.48pp |
| Corbin Burnes | 56.50% | +2.22pp |
| Logan Gilbert | 51.24% | +2.92pp |
| Tarik Skubal | 38.61% | +4.17pp |
| Seth Lugo | 27.98% | +3.35pp |
Chris Sale is the standout. His behavioral signature is compact and stable: slider 43%, four-seam 40%, changeup 11%, sinker 6%. On a 0-2 count against a left-handed hitter, he throws the slider 76% of the time. On 3-0, it's fastball 90% of the time.
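Figures like "slider 76% on 0-2" fall out of a simple count-conditional frequency table for one pitcher. A minimal sketch on invented pitches:

```python
from collections import Counter

def mix_by_count(pitches):
    """Per-count pitch-type frequencies for a single pitcher."""
    by_count = {}
    for p in pitches:
        by_count.setdefault(p["count"], Counter())[p["pitch_type"]] += 1
    return {c: {t: n / sum(ctr.values()) for t, n in ctr.items()}
            for c, ctr in by_count.items()}

pitches = [{"count": "0-2", "pitch_type": "SL"},
           {"count": "0-2", "pitch_type": "SL"},
           {"count": "0-2", "pitch_type": "SL"},
           {"count": "0-2", "pitch_type": "FF"},
           {"count": "3-0", "pitch_type": "FF"}]
mix = mix_by_count(pitches)
```

For a pitcher like Sale, this table alone is most of the model; the real analysis also conditions on batter handedness, which this sketch omits.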
But here's the uncomfortable truth: even Sale's predictability doesn't obviously translate into a product. The model says "slider" and it's right 58% of the time. Okay — but the batter also knows it's probably a slider. The advantage isn't asymmetric. The information is already priced in. The top-5 most predictable pitchers represent too small a subset to build a viable product around.
Why the Count Wins
In plain language: if you know the count and nothing else, you already have most of what's knowable about the next pitch. The sequencing history, the platoon matchup, the score, the runners, the fatigue indicators — all of those sophisticated features, combined, add about half a percentage point beyond simply knowing it's a 1-2 count.
This makes intuitive sense once you think about it. The count encodes the game state that matters most to the pitcher's decision: how far ahead or behind he is. A pitcher on 0-2 is going to nibble at the edges regardless of who's batting, what runners are on, or what the last pitch was. A pitcher on 3-0 is going to throw a fastball down the middle almost every time.