Watch enough baseball and you eventually notice that umpires call different games. Some have generous zones, some tight ones. Pitchers complain about it. Hitters complain about it. The booth complains about it. But how different is “different”? And is it actually the umpire, or just the catcher framing pitches for him?
We looked at every called pitch from the 2025 regular season — 363,665 of them, across 83 home-plate umpires — and tested three questions that come up every game. The answers were sharper, and weirder, than we expected.
What we found
- The outside corner is 19 percentage points wide, from the most outside-generous umpire to the most inside-generous. That’s wider than the entire range of catcher framing — and the gap holds even after we drop the most influential framers from the data. It’s the umpire, not the catcher.
- The broadcaster line “he’ll expand the zone with two strikes” is backwards. Three independent statistical methods agree: on the same exact pitch location, umpires call fewer strikes on 0-2 than on 0-0 — by about 16 to 21 percentage points. The 3-0 expansion is real, the 0-2 expansion isn’t.
- Umpires don’t come in discrete “personality types.” We pre-registered an attempt to cluster 83 of them into groups (a Tight Outsider type, a Generous Expander type, etc.) and the data refused. There’s no taxonomy — just a continuous gradient.
1. The 19-point outside corner
For each of 83 qualified umpires, we measured how often they call a strike on the outside edge of the plate versus the inside edge, relative to what a baseline model expects from pitch location alone. A positive number means “more generous on outside”; negative means “more generous on inside.” We call it the In-Out Asymmetry Index (IOAI).
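As a rough illustration of how an index like this could be computed, here is a minimal Python sketch. The exact formula is not spelled out above, so the definition used here (mean residual on the outside edge minus mean residual on the inside edge, in percentage points) is an assumption, and the record fields `umpire`, `edge`, `called_strike`, and `expected` are hypothetical names rather than the pipelines' actual schema.

```python
from collections import defaultdict


def ioai_by_umpire(pitches):
    """Return {umpire: asymmetry index in percentage points}.

    Assumed definition (not confirmed by the article): the mean residual
    (called_strike - expected) on outside-edge pitches minus the same
    mean on inside-edge pitches. Positive means more generous outside.
    """
    # umpire -> edge -> [sum of residuals, number of pitches]
    sums = defaultdict(lambda: {"outside": [0.0, 0], "inside": [0.0, 0]})
    for p in pitches:
        if p["edge"] not in ("outside", "inside"):
            continue  # ignore pitches that are not on either edge
        cell = sums[p["umpire"]][p["edge"]]
        cell[0] += p["called_strike"] - p["expected"]
        cell[1] += 1

    result = {}
    for ump, edges in sums.items():
        (o_sum, o_n), (i_sum, i_n) = edges["outside"], edges["inside"]
        if o_n == 0 or i_n == 0:
            continue  # need pitches on both edges to compare them
        result[ump] = 100.0 * (o_sum / o_n - i_sum / i_n)
    return result


# Toy records, invented for illustration only.
toy_pitches = [
    {"umpire": "ump_a", "edge": "outside", "called_strike": 1, "expected": 0.5},
    {"umpire": "ump_a", "edge": "outside", "called_strike": 1, "expected": 0.5},
    {"umpire": "ump_a", "edge": "inside", "called_strike": 0, "expected": 0.5},
    {"umpire": "ump_a", "edge": "inside", "called_strike": 1, "expected": 0.5},
]
toy_ioai = ioai_by_umpire(toy_pitches)
print(toy_ioai)
```

In the toy data the umpire beats the baseline on both outside pitches and breaks even inside, so the index comes out positive.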
Here’s the leaderboard:
All 83 qualified 2025 home-plate umpires. Bars to the right of zero mean extra strikes on the outside corner relative to league baseline; bars to the left mean extra strikes on the inside corner. Top 5 and bottom 5 are labeled with 95% confidence intervals. The middle 73 are shown as a faint gradient so you can see there’s no natural break in the population.
The gap from Stu Scheurwater (+12.0pp, top) to Alex MacKay (−7.0pp, bottom) is 18.9 percentage points. That’s wider than the catcher-framing range we’ve previously measured, and unlike framing, it’s about the umpire’s own habit, not the catcher’s skill.
Here’s what 18.9 percentage points looks like at the plate:
Per-pitch called-strike rate by plate location for each umpire, right-handed batters, 2025 regular season. Rulebook zone overlaid in white. Catcher’s perspective — outside corner for RHB is on the right.
To translate: at the league-wide borderline strike rate, the difference between a Scheurwater outside-corner call and a MacKay outside-corner call on the same pitch is roughly the difference between a 70% strike rate and a 51% strike rate. Same physics, different umpire, different game.
It’s the umpire, not the catcher
The obvious objection: maybe Scheurwater just keeps drawing the best framing catchers, who pull pitches back from the outside corner for him. To test this, we recomputed every umpire’s IOAI after dropping the pitches caught by the top-10 framing catchers in 2025 (about 14% of qualified pitches per umpire). If framing were doing the work, IOAI rankings should reshuffle a lot.
They didn’t. The correlation between each umpire’s full-sample IOAI and their top-framer-dropped IOAI is r > 0.94 in both of our independent statistical pipelines. Catcher framing isn’t the primary driver of the asymmetry. The umpire is doing the umpiring.
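The robustness check described above boils down to a correlation between two per-umpire vectors. A minimal sketch, assuming each version of the index is stored as a plain `{umpire: value}` dictionary (a hypothetical layout; the umpire keys and numbers below are invented):

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def robustness_r(full, dropped):
    """Correlate full-sample values with the values recomputed after
    dropping some pitches, using only umpires present in both."""
    umps = sorted(set(full) & set(dropped))
    return pearson_r([full[u] for u in umps], [dropped[u] for u in umps])


# Invented values, not taken from the leaderboard.
toy_full = {"ump_a": 12.0, "ump_b": 3.0, "ump_c": -7.0}
toy_dropped = {"ump_a": 11.0, "ump_b": 2.5, "ump_c": -6.0}
toy_r = robustness_r(toy_full, toy_dropped)
print(round(toy_r, 3))
```

A value near 1 says the ordering and spacing of umpires barely moved when the pitches were dropped. Because the two samples share most of their pitches, some of that correlation is built in.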
2. The two-strike myth
Every broadcast says it. “He’ll expand the zone with two strikes.” The implication is that umpires reward the pitcher who has done the work to get to 0-2 with a few extra inches off the plate — finishing the at-bat.
We tested it three different ways. All three agree the story is sign-backwards.
| Method (location-controlled) | Change from 0-0 to 0-2 | Change from 0-0 to 3-0 |
|---|---|---|
| Logistic regression | −16.5pp | +9.7pp |
| Stratified location bins | −20.0pp | +7.7pp |
| Generalized additive model | −20.7pp | +8.6pp |
Each row is an independent statistical model. The "change from 0-0 to 0-2" is the difference in called-strike rate at the same plate location between a 0-0 count and a 0-2 count. All three methods agree that umpires call fewer strikes on 0-2 (negative values) and more strikes on 3-0 (positive values), holding location constant.
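The stratified-bins row is the easiest of the three to sketch by hand. The version below is one plausible way to do it, not necessarily how either pipeline did: group pitches into location bins, compare strike rates between two counts within each bin, then average the within-bin differences. The field names `bin`, `count`, and `called_strike`, and the choice to weight each bin by its smaller sample, are assumptions.

```python
from collections import defaultdict


def stratified_count_effect(pitches, base="0-0", other="0-2"):
    """Location-controlled change in called-strike rate (percentage points)
    from count `base` to count `other`.

    Within each location bin, take rate(other) - rate(base), then average
    across bins weighted by min(n_base, n_other). The weighting is an
    illustrative choice.
    """
    cells = defaultdict(lambda: [0, 0])  # (bin, count) -> [strikes, pitches]
    for p in pitches:
        if p["count"] in (base, other):
            cell = cells[(p["bin"], p["count"])]
            cell[0] += p["called_strike"]
            cell[1] += 1

    bins = {b for (b, _count) in cells}
    num = den = 0.0
    for b in bins:
        s0, n0 = cells.get((b, base), (0, 0))
        s1, n1 = cells.get((b, other), (0, 0))
        if n0 == 0 or n1 == 0:
            continue  # a bin needs pitches at both counts to be compared
        weight = min(n0, n1)
        num += weight * (s1 / n1 - s0 / n0)
        den += weight
    if den == 0:
        raise ValueError("no location bin has pitches at both counts")
    return 100.0 * num / den


def make_cell(bin_name, count, strikes, total):
    """Build `total` toy pitches in one cell, `strikes` of them called strikes."""
    return [
        {"bin": bin_name, "count": count, "called_strike": int(i < strikes)}
        for i in range(total)
    ]


# Invented toy data: the edge bin loses strikes on 0-2, the heart does not.
toy_cells = (
    make_cell("edge", "0-0", 3, 4) + make_cell("edge", "0-2", 2, 4)
    + make_cell("heart", "0-0", 2, 2) + make_cell("heart", "0-2", 2, 2)
)
toy_effect = stratified_count_effect(toy_cells)
print(round(toy_effect, 1))
```

Comparing within bins is what "holding location constant" means here: a raw 0-2 versus 0-0 comparison would mostly reflect where pitchers throw on each count, not how the umpire calls a given spot.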
On the same exact pitch location, an umpire is roughly 18 percentage points less likely to call it a strike when the count is 0-2 than when it’s 0-0. That’s the opposite of what every broadcast graphic implies.
The 3-0 side does match folk wisdom — umpires give the pitcher about 8 percentage points more on a 3-0 count, helping him climb back into the at-bat. So the broadcasters are half-right (3-0 expansion) and half-backwards (2-strike contraction). The “help the pitcher recover” pattern is real; the “reward the pitcher who’s ahead” pattern is the opposite of real.
One thing we can’t cleanly separate with this analysis: the pitches that arrive in the borderline zone differ subtly by count — a 0-2 chase pitch and a 0-0 first-pitch fastball can end up at the same coordinates with different perceived intent. The triple-method agreement strongly suggests the umpire-side effect is real, but a fully causal claim would need a controlled experiment we don’t have.
And here’s the kicker: even though the population-level pattern is real, specific umpires don’t reliably differ on it. The umpires who most aggressively expanded the zone on 2-strike counts in the first half of 2025 are essentially uncorrelated with the umpires who did it in the second half. The pattern is a league behavior, not a per-umpire habit. So a broadcast graphic flagging “Ump X expands on two strikes” is reading a sample-size artifact on top of an inverted population claim.
3. No personality types
The original premise of this analysis was simpler: cluster the 83 umpires into a small number of distinct calling-style types — the kind of taxonomy you could put on a broadcast graphic. We pre-registered five behavioral features per umpire (in-out asymmetry, edge aggression, high-low bias, count-conditioned expansion, and handedness sensitivity), used two completely independent statistical methods to attempt the clustering, and pre-committed to a set of tests the typology had to pass before we’d publish it.
It failed, under both methods. Four of five pre-registered tests came back negative, including the most direct one: when we asked the two methods whether they agreed on which umpires belonged in which group, the agreement was essentially zero.
Here’s why. If discrete types existed, the distributions of the five features would show bumps or gaps — clusters separated by empty space. They don’t. Every feature distribution is a single smooth peak:
And of those five features, only two were even stable habits across halves of the season. The chart below shows each feature’s first-half value vs second-half value for every umpire. If a feature represents a real persistent habit, the dots cluster along the y=x diagonal:
So the right description of an umpire isn’t a category like “Tight Outsider.” It’s a position on two continuous axes: how asymmetric their plate is (IOAI), and how aggressively they patrol the rulebook edge (EAR). The other three features we tested either don’t hold up (CCZE, LHB−RHB) or depend on which model you fit (HLB).
What this means for tonight’s game
Two practical takeaways.
First: the umpire is information. Knowing your starter works the outside corner and tonight’s umpire scores +4pp on the asymmetry leaderboard is a real edge. The 19-point spread above isn’t a list of bad umpires — it’s a list of umpires whose habits genuinely change the geometry of the strike zone.
Second: don’t trust broadcast graphics that type umpires. Categorical labels (“the strict ump,” “the count expander”) aren’t in the data. And the “expands with two strikes” trope isn’t just non-individual — it’s sign-backwards at the league level. A graphic asserting that pattern is reading noise on top of an inverted claim.
A live “Tonight’s Umpire on the IOAI / EAR axes” component is coming. The static leaderboard above is what we have today.
How this fits with “Four Kinds of Zone”
We previously published Four Kinds of Zone, which sorted umpires into a 2×2 grid (Aggressive Ace, Conservative Ace, Wild Expander, Tight Struggler) based on whether each was above or below the league median on accuracy and on borderline-strike rate. These two pieces sound contradictory, but they aren’t. They’re answering different questions.
The quadrants are a descriptive overlay. They exist because we drew them — we picked the median as a cut line on two convenient axes. That’s useful for fan-readable communication (“tonight’s ump is a Wild Expander”), and the underlying axes are real: edge aggression (similar to our EAR feature here) is one of the two genuinely persistent umpire habits we just confirmed.
This piece asked the harder statistical question: if you don’t pre-impose cut lines and instead let clustering algorithms search the data for natural groups, do groups emerge? The answer is no — the population is a continuous gradient. Each umpire is a point on the IOAI and EAR axes, not a member of a category. The Four Kinds labels remain useful editorial shorthand; they just aren’t boundaries the data drew for itself.
What we’re not claiming
- We’re not saying umpires are random. They’re strongly consistent on two of five features (IOAI and EAR) — just continuous, not categorical.
- We’re not claiming a typology is impossible — only that the five-feature space we pre-registered doesn’t support one. A different feature set or much larger sample might.
- We’re not claiming the leaderboard rankings within the top or bottom tier are precise. The 19-point Scheurwater-to-MacKay gap is statistically real (95% CI excludes zero by a wide margin). Whether Scheurwater is strictly #1 vs Dreckman strictly #2 is within sampling noise.
- We’re not yet making per-umpire claims about ABS-challenge behavior. The 2026 data is too thin (about 10–15 challenges per umpire so far). We’ll revisit at the All-Star break.
Methodology & full results
The analytical work used CalledThird’s standard dual-agent protocol: two completely independent statistical pipelines, each blind to the other during analysis, with two rounds of formal cross-review afterward. The 2-strike myth section was further validated in a third round with an independent verification step. The full process caught two real software bugs (both sign errors in a zone-distance calculation, in different scripts) that would have made it into a published article. Both methods ultimately PASS cross-review.
The pre-registered tests we used to decide whether typology was real
Five tests, all set before either pipeline ran:
| Test | Threshold to pass | Method A | Method B | Verdict |
|---|---|---|---|---|
| Do the two methods agree on cluster membership? | Chance-corrected agreement ≥ 0.30 | 0.136 | 0.105 | FAIL |
| Are clusters stable across halves of the season? | Cross-half agreement ≥ 0.40 | −0.010 | 0.116 | FAIL |
| Is each cluster large enough to name? | ≥ 5 umpires | 29 | 6 | pass |
| Do both methods place the 12 most-frequently-assigned umpires in matching clusters? | ≥ 70% match | 58.3% | 58.3% | FAIL |
| Are the clusters geometrically separable in feature space? | Separation index ≥ 0.20 | 0.065 | 0.0 | FAIL |
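For readers unfamiliar with "chance-corrected agreement": one widely used measure of this kind is the adjusted Rand index (ARI), which scores 1.0 when two clusterings match exactly (regardless of how the clusters are labeled) and roughly 0 when they agree no more than random assignment would. The table does not say which statistic was used, so treating it as ARI is an assumption. A self-contained sketch:

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")

    # Count pairs of items that share a cluster in both, in A, and in B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = in_a * in_b / comb(n, 2)  # agreement expected by chance
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both put everything in one cluster
    return (both - expected) / (max_index - expected)


# Same grouping with swapped label names: perfect agreement.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that cut across each other: worse than chance.
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(ari_same, ari_crossed)
```

On a scale like this, values such as 0.136 and 0.105 sit much closer to chance agreement than to a shared taxonomy.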
Sample & data
363,665 to 368,515 called pitches across the 2025 regular season (small difference reflects independent filter pipelines). 83 umpires qualified at the ≥1,500-called-pitch threshold; 16 below threshold were excluded. Both methods compute features residual to a baseline model that conditions on pitch location, count, batter handedness, and pitcher handedness.
Acknowledged limitations
- The cross-half stability test in Method A’s pipeline reuses a shared full-season scaler and dimensionality reduction. Sharing those full-season fits lets information leak between halves, which tends to inflate the agreement number, so the failing −0.010 result is even more damning than reported. Method B’s independent-half pipeline also fails by a wide margin (0.116) and is methodologically cleaner.
- The catcher-framing-isn’t-driving-this analysis retains 86% of pitches after dropping the top-10 framers. The high correlation between full-sample and dropped-sample IOAI is partly mechanical overlap, not pure independence from framing. The conservative read is “framing is not the primary driver,” not “framing has zero effect.”
- Per-umpire features are unweighted empirical residual means; we did not apply hierarchical shrinkage. The KILL verdict on typology is robust to this choice, but extreme leaderboard rankings have wider uncertainty than the order alone suggests.
- The 2026 overlay is too thin to support per-umpire ABS-challenge claims (about 10–15 challenges per umpire so far; ~540 challenges total). We’ll revisit at the All-Star break.
- The 2-strike contraction is robust across three location-controlled methods (logistic regression, stratified location bins, and a generalized additive model) at the population level. Whether the effect represents umpire behavior or a subtle pitch-selection difference at the borderline (e.g., 0-2 chase pitches arriving with different perceived intent) is something we can’t fully separate with observational data.
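On the shrinkage point above: the analysis did not apply it, but for readers wondering what it would look like, here is a minimal empirical-Bayes sketch. It pulls each umpire's raw value toward the league mean, and pulls noisier estimates harder. The method-of-moments variance estimate, the input layout, and all numbers are illustrative assumptions, not part of the published pipelines.

```python
from statistics import mean, pvariance


def shrink_toward_league(raw, se):
    """Shrink per-umpire estimates toward the league mean.

    raw: {umpire: raw estimate}, se: {umpire: standard error}.
    Weight on the raw value is between / (between + se^2), where
    `between` is a method-of-moments estimate of true umpire-to-umpire
    variance (observed variance minus average sampling variance).
    """
    umps = sorted(raw)
    values = [raw[u] for u in umps]
    league = mean(values)
    sampling = mean([se[u] ** 2 for u in umps])
    between = max(0.0, pvariance(values) - sampling)

    shrunk = {}
    for u in umps:
        denom = between + se[u] ** 2
        weight = between / denom if denom > 0 else 1.0
        shrunk[u] = league + weight * (raw[u] - league)
    return shrunk


# Invented values: ump_c is as extreme as ump_a but far noisier.
toy_raw = {"ump_a": 10.0, "ump_b": 0.0, "ump_c": -10.0}
toy_se = {"ump_a": 2.0, "ump_b": 2.0, "ump_c": 6.0}
toy_shrunk = shrink_toward_league(toy_raw, toy_se)
print({u: round(v, 2) for u, v in toy_shrunk.items()})
```

The practical effect is on the extremes of a leaderboard: an umpire with a large raw value and a small sample moves toward the middle, which is why unshrunk top and bottom rankings carry more uncertainty than their order suggests.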
Cross-review
Three rounds of formal cross-review across the main analysis. Round 1: independent analysis. Round 2: each method’s pipeline reviewed by the other; one real bug caught in a zone-distance calculation, fixed, re-run. Round 3: re-review of the revised work, both PASS with independent numerical audits of every kill-gate computation and confidence interval. The 2-strike myth section then went through a fourth round of supplemental verification: a second method independently reproduced the location-controlled count counterfactual, finding the same direction with method-dependent magnitudes in the −16 to −21pp range.
Code & data
The repo is on GitHub at prismindanalytics/calledthird. Specifically:
- Pre-registered memo: the hypothesis, the 5 features, the kill gates, all set before either pipeline ran.
- Chart data: per-umpire IOAI/EAR/HLB/CCZE/LHB−RHB from both methods (umpire-asymmetry-2025.json), cross-half splits (umpire-cross-half-2025.json), and the binned heatmap cell data (umpire-zone-heatmaps.json).
- Chart components: the SVG rendering for the leaderboard, zone heatmaps, feature distributions, and cross-half scatter.
The full analysis workspace (analysis scripts, both methods’ reports, all four rounds of cross-reviews) lives at calledthird-research/analyses/2026-05-24_umpire-typology/ in the same repo — the binary outputs (parquets, large CSVs) are gitignored to keep the repo light, but the source scripts and reports are available on request.
If you find a flaw, we want to know.