We Tried to Build a Pressure Grade. Most of It Was Skill in a Costume.

There’s a pitcher on your team you don’t trust with a one-run lead. There’s another you’d hand the ball to in any inning of any game. The belief that some arms handle pressure and others come apart is one of the most durable ideas in baseball — and one of the least measured. So we tried to measure it: build a single “Pressure Grade,” league-wide, that says who actually pitches well when the game is on the line.

We ran it as a two-method bake-off: a Bayesian pipeline and a gradient-boosted ML pipeline, each working the same six seasons of Statcast independently, with the win conditions written down in advance. The point of two methods is simple: when both land on the same answer, you can trust it more; when they split, the split tells you what’s fragile. They converged. The answer to “can you grade who handles pressure” turned out to be a careful, three-part no, plus one part that’s a yes.

What we found

The “clutch gene” is noise. Whether a pitcher beats his own baseline specifically in high leverage carries a year-over-year correlation of r = 0.08, statistically indistinguishable from zero. This year’s clutch hero is not next year’s.
No grade beats just being good. On a held-out season, the best pressure composite either loses to a pitcher’s plain overall-skill line or ties it. There is no number that adds predictive juice on top of “how good is he.”
Our own headline hypothesis failed. We bet that command — not walking people under pressure — would be the separator. Out of sample, the command grade was the worst predictor of next-year pressure results. We were wrong, on the record.
But the multi-dimensional read is real. The two methods grade each pitcher’s stuff, command, and contact consistently (agreement r = 0.95 on stuff, 0.80 on contact, 0.72 on command). It describes how a pitcher succeeds under pressure; it just doesn’t forecast whether he will succeed any better than his ERA card already does.
Pressure performance is projectable — via skill. A pitcher’s overall ability this year predicts his high-leverage results next year better than his actual high-leverage line does. “Handles pressure” mostly means “is good, and throws strikes.”

None of that is the article we expected to write. It’s the one the data supports, and — because this is the rare myth-bust that ships a working tool — every claim below is something you can now check yourself on any pitcher’s page.

1. The number everyone wants — and why one number lies

Start with the read that does work, because it’s the reason the whole project was worth doing. Instead of one grade, score each pitcher on separate axes — command, stuff, and contact suppression in high-leverage spots, role-relative. Now a profile tells a story a single number can’t.

Profile cards:

Andrés Muñoz: the league’s filthiest reliever and one of its leakiest strike-throwers, the same arm. A single number (67th pct) buries that.

Aroldis Chapman: what it looks like when every axis is there. No tax to pay.

Castillo: eroding axis by axis. The multi-dimensional read catches a decline a single grade would blur.

Click any card for the live profile. The point is Andrés Muñoz: the league’s nastiest reliever (A+ stuff) and one of its leakiest strike-throwers (D command), the same arm. A single “pressure grade” would file him at the 67th percentile and bury exactly the thing that makes him volatile. Chapman has no tax to pay; Castillo is eroding on every axis at once.

That Muñoz split is the entire case for going multi-dimensional. It’s legible, it matches what you see when you watch him walk the leadoff man and then strike out the side, and — crucially — it’s not an artifact of one method’s choices. We’ll prove the two pipelines agree on it in a minute.

The same split sorts the whole league. The pressure-proof — elite stuff and command, no weakness to exploit — are arms like Aroldis Chapman, Josh Hader, Garrett Whitlock, and Gabe Speier. The volatile pair an ace-level out-pitch with shaky control: Muñoz, Bryan Abreu, Jesús Luzardo — filthy, and a walk from trouble. A single grade calls both groups “good”; the profile tells you which kind — the difference between trusting a lead and holding your breath.

Where the grades come from. For each pitcher we split his pitches by Leverage Index into high- and low-pressure situations, then grade three things in the high-pressure half — command (stays in the zone, avoids walks), stuff (still misses bats, holds velocity), and contact (how hard he’s hit) — each scored against others in his role and shrunk toward the role average so a thin sample can’t spike a letter. Because that high-leverage sample is small, two independent methods compute every grade and we publish only what they agree on. The letters describe how a pitcher works when it matters; they are not a forecast that he’ll succeed.
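The shrinkage step can be sketched in a few lines. This is a minimal illustration, not either pipeline’s actual code: it assumes a simple beta-binomial-style pull toward the role average, and `prior_strength` (the number of phantom batters performing at the role-average rate) is a hypothetical tuning value. The walk count of 23 is back-figured from Muñoz’s published 12.2% over 188 high-leverage batters.

```python
def shrink_rate(events: int, batters: int, role_mean: float,
                prior_strength: float = 100.0) -> float:
    """Pull an observed rate toward the role average.

    Equivalent to adding `prior_strength` phantom batters who perform
    exactly at `role_mean`. Small samples move a lot; large ones barely.
    """
    if batters < 0 or events < 0 or events > batters:
        raise ValueError("need 0 <= events <= batters")
    return (events + prior_strength * role_mean) / (batters + prior_strength)


# Assumed inputs: ~23 walks in 188 high-leverage batters (12.2%),
# reliever average 8.5%. prior_strength=100 is illustrative only.
raw = 23 / 188
shrunk = shrink_rate(23, 188, role_mean=0.085)

# A thin sample: 3 walks in 20 batters is 15% raw.
thin = shrink_rate(3, 20, role_mean=0.085)

print(f"raw {raw:.3f} -> shrunk {shrunk:.3f}")
print(f"thin sample 0.150 -> shrunk {thin:.3f}")
```

The 20-batter sample is pulled much closer to the role average than the 188-batter one, which is how a thin sample is kept from spiking a letter grade.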

So what do those letters actually measure? Here are three contrasting arms with the real high-leverage numbers behind every grade — the same inputs the model scores, each shown next to the average for the role:

Pitcher	Axis	Grade	Metric	Hi-LI value	Role avg
Andrés Muñoz (RP · 188 hi-LI batters)	Command	D	Walk rate	12.2%	8.5%
Andrés Muñoz	Command	D	In-zone %	45%	49%
Andrés Muñoz	Stuff	A+	Strikeout %	31%	23%
Andrés Muñoz	Stuff	A+	2-strike whiff %	38%	27%
Andrés Muñoz	Stuff	A+	Fastball mph	98.1	94.9
Andrés Muñoz	Contact	A	xwOBA allowed	.277	.316
Andrés Muñoz	Contact	A	Hard-hit %	44%	44%
Aroldis Chapman (RP · 142 hi-LI batters)	Command	A+	Walk rate	6.3%	8.5%
Aroldis Chapman	Command	A+	In-zone %	55%	49%
Aroldis Chapman	Stuff	A+	Strikeout %	35%	23%
Aroldis Chapman	Stuff	A+	2-strike whiff %	35%	27%
Aroldis Chapman	Stuff	A+	Fastball mph	98.7	94.9
Aroldis Chapman	Contact	A+	xwOBA allowed	.252	.316
Aroldis Chapman	Contact	A+	Hard-hit %	40%	44%
Tyler Rogers (RP · 110 hi-LI batters)	Command	A+	Walk rate	2.7%	8.5%
Tyler Rogers	Command	A+	In-zone %	57%	49%
Tyler Rogers	Stuff	D	Strikeout %	11%	23%
Tyler Rogers	Stuff	D	2-strike whiff %	12%	27%
Tyler Rogers	Stuff	D	Fastball mph	83.4	94.9
Tyler Rogers	Contact	A+	xwOBA allowed	.323	.316
Tyler Rogers	Contact	A+	Hard-hit %	28%	44%

Real 2025 high-leverage numbers (Leverage Index ≥ 1.5), each next to the average for the role — green beats the average, red trails it. The letters are these inputs scored against the pitcher’s peers and shrunk toward the role average. Note how a grade can ride one signal: Muñoz’s A contact is his .277 xwOBA (his hard-hit rate is ordinary), while Rogers’s A+ contact is elite hard-hit suppression at league-worst velocity.

Three cards are three dots in a much bigger cloud, and 500-odd dots is just a blob. The clearer way to read the whole population is by the shape of the grades: the recognizable pitcher types, each with its own command/stuff/contact profile.

Type	Arms	Profile	Example arms
Complete	89	Strong command and stuff; no obvious way in.	Aroldis Chapman · Gabe Speier · Josh Hader · Garrett Crochet · Bryan King
Power, leaky command	79	Ace-level stuff over shaky control; the volatile arms.	Edwin Díaz · Seranthony Domínguez · Zack Wheeler · Andrew Chafin · Steven Okert
Command & contact	101	Pounds the zone, lives on weak contact; softer stuff.	A.J. Puk · Logan Allen · Tyler Rogers · Seth Halvorsen · Javier Assad
Below grade	87	Trails the field on both command and stuff.	Drew Rasmussen · Jonathan Cannon · Shawn Dubin · Tyler Ferguson · Calvin Faucher
Balanced	150	Middle of the pack across all three.	Parker Messick · Johan Oviedo · Shawn Armstrong · Orion Kerkering · Aaron Ashby

Bars are each type’s average percentile per axis (high = better); click any arm for its profile. “Power, leaky command” is the volatile bucket — ace stuff over thin control, where Muñoz, Bryan Abreu, and Jesús Luzardo live. These are descriptive shapes of the current grades, not a stable typology (we tested that — it didn’t hold).

Prefer the raw cloud? Here is every graded pitcher on all three axes at once — drag to rotate the cube, search any arm to pin and label it, and click a dot for the profile:

[Interactive 3D scatter: 506 pitchers plotted on command, stuff, and contact. Color marks role (blue = starters, orange = relievers); nearer dots are larger. The complete arms cluster in the far top-back-right corner, high on all three axes; the volatile arms hug the stuff axis with nothing on command.]

2. What persists, and what’s a costume

A “skill” has to repeat. If a number is real, a pitcher who posts it one year should tend to post it the next; if it doesn’t carry over, it was luck wearing a label. So we took every pitcher with a real workload in back-to-back seasons and correlated each metric with itself, year over year.

Year-over-year correlation, 2024 → 2025 (235 pitchers, 200+ batters faced each year). Higher = a repeatable skill.

Metric	r	Class
Strikeout rate	0.75	skill
Walk rate	0.57	skill
xwOBA-against	0.50	skill
WPA / leverage	0.26	mixed
Clutch score	0.08	noise
Pressure gap (high − low LI)	0.03	noise

The tell: a pitcher's overall skill this year predicts his high-leverage results next year (r = 0.29) better than his actual high-leverage line does (r = 0.13). "Handles pressure" is real and projectable; the projector is just being good.

Skill metrics (green) repeat strongly. The leverage-specific ones — a FanGraphs-style Clutch score, and the gap between a pitcher’s high- and low-leverage results — sit at 0.08 and 0.03: noise. WPA-per-leverage lands in between, because most of it is just skill leaking through.
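The year-over-year check is easy to reproduce on any metric. Below is a minimal sketch of a Spearman rank correlation (the statistic the Methodology section names) in plain Python. The five strikeout rates are toy values, not real pitchers, and the helper names are ours.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy strikeout rates for the same five pitchers in two seasons (not real data).
k_2024 = [0.31, 0.22, 0.27, 0.18, 0.25]
k_2025 = [0.29, 0.24, 0.28, 0.17, 0.23]
print(round(spearman(k_2024, k_2025), 2))  # 0.9
```

A metric that behaves like a skill keeps pitchers in roughly the same order from one season to the next, so the value stays high; one that reshuffles the order each year drifts toward zero.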

This is the sabermetric consensus on hitter clutch, now confirmed on pitchers over six seasons: the residual — how much better or worse a pitcher does in big spots than his own overall level — does not stick. The pitcher who “came up huge” in September wasn’t accessing a stable trait. He was good, and the sample was small, and small samples in high leverage are where narratives are born.

But notice the callout under the chart, because it’s the part that rescues the question from pure nihilism: a pitcher’s overall skill this year predicts his high-leverage results next year (r = 0.29) better than his actual high-leverage line does (r = 0.13). Pressure performance is forecastable. You just forecast it with talent, not with a clutch stat — because talent is measured on thousands of pitches and the clutch stat on a few dozen.
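The sample-size argument can be checked with a small simulation. This is a sketch under loud assumptions, not a reconstruction of our pipelines: every simulated pitcher has one true walk rate that does not change under pressure (so there is no clutch trait at all), talent is drawn uniformly between 5% and 12%, and the per-season sample sizes (800 batters overall, 100 in high leverage) are invented for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def observed_rate(true_rate, n, rng):
    """Simulate n batters; return the share that ended in the event."""
    return sum(rng.random() < true_rate for _ in range(n)) / n


rng = random.Random(7)                 # fixed seed for repeatability
N_PITCHERS = 1000
N_OVERALL, N_HIGH_LI = 800, 100        # assumed sample sizes per season

# True walk rate per pitcher, identical in every game situation.
talent = [rng.uniform(0.05, 0.12) for _ in range(N_PITCHERS)]

overall_y1 = [observed_rate(p, N_OVERALL, rng) for p in talent]
high_li_y1 = [observed_rate(p, N_HIGH_LI, rng) for p in talent]
high_li_y2 = [observed_rate(p, N_HIGH_LI, rng) for p in talent]

r_skill = pearson(overall_y1, high_li_y2)
r_clutch = pearson(high_li_y1, high_li_y2)
print(f"overall line -> next-year high-LI: r = {r_skill:.2f}")
print(f"high-LI line -> next-year high-LI: r = {r_clutch:.2f}")
```

Even with zero pressure-specific talent built in, the larger overall sample forecasts next year’s high-leverage line better than last year’s high-leverage line does, simply because it is a less noisy read on the same underlying ability.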

3. Two methods, one honest disagreement

Here’s where the two-method design earns its keep. We computed the per-pitcher grades two completely different ways (a hierarchical Bayesian model that shrinks small samples toward the role mean, and a gradient-boosted model that learns its weights from what actually forecasts next-season results) and asked how much they agree.

Agreement between two independent methods (Bayesian vs. gradient-boosted ML), per pitcher. ≥ 0.50 = robust.

Grade	Agreement (r)	Robust?
Stuff sub-grade	0.95	✓
Contact sub-grade	0.80	✓
Command sub-grade	0.72	✓
Composite ranking	0.46	✗

Two totally different engines grade each pitcher's stuff, command, and contact nearly identically — that descriptive read is real. The one thing they don't agree on is the single composite ranking — the same finding from a third angle: there is no stable one-number "pressure" trait to rank by.

The descriptive sub-grades replicate across methods: two engines with nothing in common grade stuff at r = 0.95, contact at 0.80, command at 0.72. The one thing they don’t reproduce is the single composite ranking (0.46) — the same null from a third angle.

Read that carefully, because it’s the cleanest result in the study. When you ask the two methods “how good is this pitcher’s stuff under pressure,” they answer the same. When you ask them to collapse everything into one rank-ordered Pressure Grade, they diverge — because there’s no stable signal in the middle of the pack to rank by. The multi-dimensional description is robust. The one-number verdict is method-dependent fiction.

4. We pre-registered the kill shots

The first pass produced one overconfident sentence, “there is no pressure trait at all,” that the cross-review caught: our test had quietly over-controlled by adjusting each high-leverage rate for the pitcher’s overall rate, which already contains the high-leverage sample. So we ran a convergence round with the corrections baked in and the thresholds fixed in advance, and made both methods answer the same three questions.

Question	Bayesian	ML	Verdict
Is there a pressure-specific signal beyond overall skill?	BB .10 / K .09 / xwOBA .08	BB .09 / K .12 / xwOBA .08	Weak — not predictive
Does behavioral pitch-mix shift hold up as a trait?	2 of 9 settings	5 of 9 settings	Doesn’t ship
Does any grade beat overall skill out-of-sample?	−0.23 (loses)	≈ 0 (ties)	No edge

Both methods, all three questions, same side of the line. The corrected residual is weak — about 0.08–0.12 — not the zero our first draft claimed, and not large enough to bet on either.

Note the second row, because it’s a finding we wanted to be true and killed anyway. The intuitive idea that pitchers change what they throw under pressure — abandon the slider, lean on the heater — showed up at a tempting strength in one configuration. Vary the thresholds even a little and it falls apart. A real effect survives reasonable changes to how you measure it. This one didn’t.
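The robustness check amounts to re-measuring the effect under every reasonable setting and counting how many clear a fixed bar. The sketch below shows only that counting logic. Everything in it is hypothetical: the 3×3 grid of leverage cutoffs, the 0.30 gate, and the `toy_effect` stand-in with made-up numbers. The nine settings we actually ran are documented in the per-agent reports, not reproduced here.

```python
from itertools import product


def count_passing(settings, measure, gate):
    """Return how many settings produce an effect at or above the gate."""
    return sum(1 for s in settings if measure(*s) >= gate)


# Hypothetical 3x3 grid of (high cutoff, low cutoff) leverage thresholds.
HIGH_CUTOFFS = (1.25, 1.5, 2.0)
LOW_CUTOFFS = (0.7, 0.85, 1.0)
settings = list(product(HIGH_CUTOFFS, LOW_CUTOFFS))


def toy_effect(high_cut, low_cut):
    """Stand-in for a real persistence estimate (made-up numbers).

    It spikes at one configuration and is weak everywhere else, which is
    the signature of a fragile finding.
    """
    return 0.35 if (high_cut, low_cut) == (1.5, 0.85) else 0.10


passed = count_passing(settings, toy_effect, gate=0.30)
print(f"{passed} of {len(settings)} settings clear the gate")
```

Fixing the gate before looking at the results is what keeps this honest: the grid is scored against a bar that was not chosen to flatter the one configuration that looked good.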

5. So what do you do with it

You ship the part that’s real, labeled honestly, and you don’t ship the part that isn’t. Every pitcher’s page now carries the Pressure Profile — the command/stuff/contact read, role-relative, with the uncertainty stated — right next to a Scouting Report of league percentiles and year-over-year trajectory, and a Match Tracker that follows velocity and breaking-ball movement start by start. What you won’t find anywhere is a single “Pressure Grade” leaderboard, because we couldn’t build one that survives its own holdout.

That’s the honest shape of it. “Who handles pressure” is a real question with a real answer, and the answer is mostly: the pitchers who are good, and especially the ones who throw strikes. The multi-dimensional profile is worth having because it tells you how a given arm gets there — whether he’s a Chapman with no weakness or a Muñoz one leaked walk from disaster. It just won’t tell you whether he’ll come through tonight any better than knowing he’s nasty already does. Anyone selling you that number is selling you a costume.

Methodology

Data

League-wide Statcast, 2021–2026 regular season (~3.85M pitches); 2026 through June 1 held out as an out-of-sample check. Leverage Index from Tom Tango’s published table; high leverage = LI ≥ 1.5, low = LI < 0.85. WPA signed to the pitcher’s team. All per-pitcher estimates are shrunk (partial pooling / empirical Bayes) because the high-leverage sample per pitcher is small — the central difficulty of the whole problem.
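The leverage split is a two-threshold rule. A minimal sketch, using the cutoffs stated above; the function name, the "middle" label, and the sample values are ours, and treating the in-between plate appearances as belonging to neither bucket is an assumption of this sketch.

```python
HIGH_LI_MIN = 1.5    # high leverage: LI >= 1.5
LOW_LI_MAX = 0.85    # low leverage:  LI <  0.85


def leverage_bucket(li: float) -> str:
    """Label a plate appearance by Leverage Index.

    Values between the two cutoffs fall in neither bucket; labeling them
    "middle" and leaving them out is an assumption of this sketch.
    """
    if li >= HIGH_LI_MIN:
        return "high"
    if li < LOW_LI_MAX:
        return "low"
    return "middle"


# Made-up LI values for illustration.
sample = [0.3, 0.85, 1.0, 1.5, 2.7]
print([leverage_bucket(li) for li in sample])
# ['low', 'middle', 'middle', 'high', 'high']
```

Leaving a gap between the two cutoffs keeps the high and low samples clearly separated instead of splitting near-identical situations across a single line.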

Methods

Two independent pipelines with deliberately divergent inductive biases: a hierarchical Bayesian model (numpyro; non-centered parameterization; convergence checked, worst R-hat 1.02) and a gradient-boosted / regularized ML model that learns composite weights by predicting next-season high-leverage xwOBA. Identical leverage substrate and pre-registered gates; everything else divergent. Persistence is per-transition Spearman, Fisher-z averaged; the “skill removed” residual controls each high-LI rate on the pitcher’s disjoint low-LI rate (not his overall rate, which contains the high-LI sample — the over-control our first pass made and the cross-review caught).
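Two pieces of that description can be sketched directly: the “skill removed” residual and the Fisher-z average. This is an illustration in plain Python, not the repo code. It assumes a one-predictor least-squares fit for the residual and an unweighted average across transitions; all numbers are toy values.

```python
from math import atanh, tanh


def ols_residuals(y, x):
    """Residuals from a one-predictor least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def fisher_z_mean(correlations):
    """Average correlations on the Fisher z scale, then transform back."""
    zs = [atanh(r) for r in correlations]
    return tanh(sum(zs) / len(zs))


# Toy walk rates (not real data). The high-LI rate is controlled on the
# disjoint low-LI rate, so the high-LI sample never sits on both sides.
low_li = [0.06, 0.08, 0.09, 0.11, 0.07]
high_li = [0.07, 0.10, 0.08, 0.13, 0.06]
pressure_residual = ols_residuals(high_li, low_li)

# Hypothetical per-transition correlations, one per season pair.
per_transition = [0.05, 0.12, 0.02, 0.10, 0.11]
print(round(fisher_z_mean(per_transition), 3))
```

The residual is what remains of a pitcher’s high-leverage rate once his low-leverage rate is accounted for; persistence is then the year-over-year rank correlation of that residual, computed per season pair and averaged on the z scale.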

Cross-method numbers

Per-pitcher sub-grade agreement (Spearman, n ≈ 434): stuff 0.95, contact 0.80, command 0.72, composite 0.46. Year-over-year skill persistence (2024→2025, n = 235): K% 0.75, BB% 0.57, xwOBA-against 0.50; clutch residual 0.08, high-minus-low-LI gap 0.03. Skill→future-pressure r = 0.29 vs. past-pressure→future-pressure r = 0.13. The full per-agent reports, reviews, and the comparison memo live in the research repo.

Honest limits

The corrected pressure-specific residual is weak, not provably zero. The 2026 holdout is partial and reliever-heavy. The grades are descriptive and role-relative — read ±1 grade, not the exact letter.

Cite this analysis

CalledThird. "We Tried to Build a Pressure Grade. Most of It Was Skill in a Costume." CalledThird.com, June 4, 2026. https://calledthird.com/analysis/the-pressure-grade

All CalledThird analysis is original research. If you reference our findings, data, or charts in your work, please link back to the original article. For data inquiries: hello@calledthird.com

Research code on GitHub
