Guide to Stats
Any good baseball analysis will to some degree involve statistics. Statistics are, after all, a record of what occurred on the field. They supply us with a level of information that no human can discern with the naked eye alone. As any RAB reader knows, we often use numbers to support our arguments.
While the use of numbers to support arguments can provide insight, the misuse or overuse of statistics can lead to poorly formed, or even downright wrong, conclusions. They can also ruin an otherwise worthwhile point. While we love embedding statistics in our articles, we try our best to not over-rely on them.
Our previous guide to stats introduced a number of statistics that we use, with technical explanations. This time around we’re going to simplify things. Herein we will espouse our statistical philosophy. This will, hopefully, provide an explanation for the statistics you see on this site, as well as our feelings towards them.
The concept of sample size is easily defined using a prominent Yankees issue. In the 2011 ALDS Nick Swisher continued his string of woeful playoff performances. There is no way to look at his overall postseason numbers and come away impressed. He simply has not been good in the playoffs.
The statistically inclined will say that Swisher’s sample of playoff data is too small to draw any conclusions. Others will scoff and say that small sample size doesn’t explain how Swisher is swinging right through breaking balls and behind fastballs. It doesn’t show how he gets too passive and finds himself in too many 0-2 and 1-2 counts. But these two ideas are not mutually exclusive.
Yes, Swisher might have failed in the playoffs because of the aforementioned issues. If he’s swinging through breaking balls, sure, he’s going to hit poorly. Yet this has nothing to do with sample size. When the statistically inclined say that Swisher’s sample of playoff data is too small, they’re looking forward. That is, just because Swisher has hit poorly in the playoffs does not mean that he will always hit poorly in the playoffs. To illustrate this point we needn’t look far: Tino Martinez was as bad as, if not worse than, Swisher in his first 150 playoff plate appearances. After that point, though, he enjoyed relative success in the postseason.
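To put a rough number on how little 150 plate appearances can tell us, here is a minimal sketch. It assumes each at-bat is an independent coin flip with a fixed chance of a hit, which is a simplification, and the .260 talent level and the function name are our own inventions for illustration.

```python
import math

def batting_average_margin(true_avg: float, at_bats: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed batting average.

    Assumes independent at-bats with a constant hit probability
    (normal approximation to the binomial).
    """
    standard_error = math.sqrt(true_avg * (1 - true_avg) / at_bats)
    return z * standard_error

# Hypothetical .260 hitter observed over samples of different sizes.
for at_bats in (150, 600, 1800):
    margin = batting_average_margin(0.260, at_bats)
    print(f"{at_bats:>5} AB: .260 +/- {margin:.3f}")
```

Under these assumptions the margin at 150 at-bats is roughly 70 points of batting average in either direction, and it takes four times as many at-bats to cut that margin in half.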
When do samples stabilize? It’s tough to say. Realistically, a half-season’s worth of data is too small to draw conclusions from. Even a full season’s data is not a reliable barometer for future seasons. Using not only the most data, but the most relevant data, is the best path here. The last three or so years of data, weighted towards the most recent, is usually a good method.
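The three-year weighting described above could look something like the sketch below. The 3/2/1 weights, the function name, and the sample on-base percentages are assumptions made up for illustration, not a formula we actually run.

```python
def weighted_rate(seasons, weights=(3, 2, 1)):
    """Blend a rate stat across seasons, favoring the most recent.

    seasons: list of (rate, plate_appearances) tuples, most recent first.
    weights: hypothetical recency weights, one per season.
    """
    numerator = sum(w * rate * pa for w, (rate, pa) in zip(weights, seasons))
    denominator = sum(w * pa for w, (_, pa) in zip(weights, seasons))
    return numerator / denominator

# Made-up OBP and plate appearance totals, most recent season first.
recent_seasons = [(0.350, 600), (0.380, 650), (0.340, 500)]
blended_obp = weighted_rate(recent_seasons)
print(f"Weighted OBP: {blended_obp:.3f}")
```

Each season counts in proportion to both its playing time and its recency, so a big year from three seasons ago moves the blend less than the same year would if it had just happened.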
We can have more confidence in offensive statistics, because we know what we’re measuring. That is, we’re measuring results. This isn’t perfect, of course. Process matters at the plate as much as it does anywhere else. But we have a few advantages when it comes to offensive numbers, and they mostly boil down to sample size. Players can rack up plate appearances quickly, giving us a decent sampling we can use to judge skill.
We don’t typically use old school stats such as runs and RBI here, because they are team dependent. That is, runs and RBI assume that only two players are involved in the process, and that each contributes equally. Players can’t get RBIs unless their teammates get on base ahead of them, just as players can’t score runs unless their teammates behind them get hits. We also don’t use batting average as the judge of a player. In fact, the offensive stat we prefer puts batting average into context.
The problem with the array of traditional offensive rate statistics is that they all leave something, or many things, out of the equation. Batting average counts all hits, no matter the type, the same, and completely disregards walks. On base percentage counts every time on base the same, thus valuing a homer and a walk as the same. Slugging percentage assigns values based on the number of bases advanced, but that doesn’t necessarily match up with reality. That is, a homer is not necessarily twice as valuable as a double. It also ignores walks.
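To see what each traditional rate stat leaves out, here is a small sketch that computes all three from the same batting line. The formulas are the standard ones; the class name and both stat lines are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class BattingLine:
    at_bats: int
    singles: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    hit_by_pitch: int = 0
    sacrifice_flies: int = 0

    @property
    def hits(self) -> int:
        return self.singles + self.doubles + self.triples + self.home_runs

    @property
    def batting_average(self) -> float:
        # Every hit counts the same; walks are ignored entirely.
        return self.hits / self.at_bats

    @property
    def on_base_percentage(self) -> float:
        # Every time on base counts the same, homer or walk.
        reached = self.hits + self.walks + self.hit_by_pitch
        chances = self.at_bats + self.walks + self.hit_by_pitch + self.sacrifice_flies
        return reached / chances

    @property
    def slugging_percentage(self) -> float:
        # Hits are valued by bases gained; walks are ignored.
        total_bases = (self.singles + 2 * self.doubles
                       + 3 * self.triples + 4 * self.home_runs)
        return total_bases / self.at_bats

# Two made-up hitters: one lives on singles, one walks and hits for power.
slap_hitter = BattingLine(at_bats=500, singles=140, doubles=10, triples=0, home_runs=0, walks=20)
slugger = BattingLine(at_bats=500, singles=60, doubles=30, triples=0, home_runs=35, walks=80)

for name, line in (("slap hitter", slap_hitter), ("slugger", slugger)):
    print(f"{name}: AVG {line.batting_average:.3f} / "
          f"OBP {line.on_base_percentage:.3f} / SLG {line.slugging_percentage:.3f}")
```

Batting average alone prefers the slap hitter by 50 points, while the other two stats prefer the slugger, which is the problem with leaning on any single one of them.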
Linear weights is actually an old concept that has gained wider acceptance lately. It’s complicated to calculate compared to other stats, but the concept is dead simple. Essentially, linear weights assigns a point value to each outcome. This is based on years and years of historical data. Using this data from baseball’s past, we can say that a walk is worth X, a single is worth Y, a double is worth Z, and so on. In other words, linear weights not only observes that a hit is better than a walk, but it demonstrates how much more valuable — whereas batting average says a hit is infinitely more valuable and OBP says they’re the same.
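The idea can be sketched in a few lines. The run values below are round numbers we made up to show the mechanics; real linear weights are derived from historical data and change from year to year.

```python
# Illustrative run values only (assumed, not the real published weights).
ILLUSTRATIVE_RUN_VALUES = {
    "walk": 0.30,
    "single": 0.45,
    "double": 0.75,
    "triple": 1.05,
    "home_run": 1.40,
}

def linear_weights_runs(events, run_values=ILLUSTRATIVE_RUN_VALUES):
    """Sum each outcome's count multiplied by its run value."""
    return sum(run_values[outcome] * count for outcome, count in events.items())

# A made-up season line: outcome -> number of times it happened.
season = {"walk": 80, "single": 60, "double": 30, "triple": 0, "home_run": 35}
total_runs = linear_weights_runs(season)
print(f"Linear weights runs: {total_runs:.1f}")
```

With these invented values a single is worth one and a half walks. That sits between batting average, which gives the walk no credit at all, and OBP, which calls the two equal.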
Many different stats, especially the ones you’ll find on FanGraphs, incorporate linear weights. These are typically denoted by a lowercase w before the stat. You’ve certainly seen wOBA on here before. This is the rate version of linear weights, scaled to on base percentage. You might have seen wRC+, which is basically wOBA on the same scale as OPS+: 100 is league average and the higher the number the better. There is also wRAA, sometimes just RAA (runs above average). This is a counting number, so players with more playing time have an advantage. These are the basic offensive stats you’ll find on RAB.
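To show why a counting stat like wRAA rewards playing time, here is a rough sketch of the commonly published conversion from wOBA to runs above average. The league wOBA and the scale constant below are placeholder assumptions; the real values differ every season.

```python
def runs_above_average(woba: float, plate_appearances: int,
                       league_woba: float = 0.320, woba_scale: float = 1.25) -> float:
    """Convert wOBA into runs above average (wRAA).

    league_woba and woba_scale are assumed placeholder values,
    not the figures for any particular season.
    """
    return (woba - league_woba) / woba_scale * plate_appearances

# Same hypothetical .370 wOBA, two different amounts of playing time.
full_season = runs_above_average(0.370, 600)
half_season = runs_above_average(0.370, 300)
print(f"600 PA: {full_season:.1f} runs above average")
print(f"300 PA: {half_season:.1f} runs above average")
```

Two hitters with identical rate stats end up with very different totals, simply because one of them came to the plate twice as often.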
You might also see plate discipline stats here, but we use those reluctantly. If we do, we try to use Pitch f/x based ones, since those are not based on human strike zone plotting (see defensive stats for a better explanation of that). Batted ball type is also something you might see here or there, but that actually comes more from the pitching side. We try not to use line drive rate, since again, line drive rate is a number prone to human error.
With pitching stats we have a bit less confidence, because there are more factors at play. Specifically, fielding plays a large role in how well a pitcher performs. It doesn’t mean everything, but a poor defense can cost its pitcher a number of runs in any given appearance. Fielding and offense are two reasons why we don’t prefer using the two most common pitching stats.
While many still cite pitcher wins as a measure of ability, it is too dependent on variables outside the pitcher’s control. For starters, the offense has a distinct role in assigning pitcher wins, since it has to score more runs than the pitcher allows. That is, a pitcher can get a win even if he lets up a dozen runs if his team scores 13 behind him. Defense plays a role here, too, as mentioned above. A pitcher controls less than 50 percent of the game, so assigning him a win for his performance seems silly.
ERA suffers from similar issues. While a pitcher does have a certain level of control over the outcome of his pitches, there is also fielding to consider. A pitcher with three slow outfielders will see many more hits drop in, and therefore many more runs score. Those runs aren’t necessarily his fault, since decent defenders would catch more balls, create more outs, and prevent more runs. Then there’s the issue of the earned run, a convoluted concept that should probably be done away with. An error, the thing that turns a run unearned, usually just means the fielder touched the ball before failing to make a play. Those balls that a poor defender never gets a chance to touch are 100 percent debited to the pitcher.
Pitching stats are certainly a complicated matter, then. There are some alternative stats, but even they have flaws. FIP, for instance, measures the three events over which the pitcher has the most individual control: strikeouts, walks, and home runs. But they’re all weighted and smushed together. It’s probably better to examine them individually. There’s also xFIP, which turns to the theoretical. It assumes a league average home run to fly ball ratio, so we’re no longer measuring outcomes, but rather expected outcomes. There are also a host of problems with home run to fly ball ratio. Home runs per contact rate is a better measure when spread over a number of years.
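For the curious, this is roughly how FIP does its weighting and smushing, using the commonly published form of the formula. The constant exists only to put FIP on the ERA scale and changes every season, so the 3.10 below is a placeholder assumption, as is the made-up stat line.

```python
def fip(home_runs: int, walks: int, hit_by_pitch: int, strikeouts: int,
        innings_pitched: float, constant: float = 3.10) -> float:
    """Fielding Independent Pitching, commonly published form.

    innings_pitched must be true innings (one out is 1/3, not .1).
    constant is an assumed placeholder; the real value varies by season.
    """
    weighted_events = 13 * home_runs + 3 * (walks + hit_by_pitch) - 2 * strikeouts
    return weighted_events / innings_pitched + constant

# Made-up season: 20 HR, 50 BB, 5 HBP, 200 K over 200 innings.
example_fip = fip(home_runs=20, walks=50, hit_by_pitch=5,
                  strikeouts=200, innings_pitched=200.0)
print(f"FIP: {example_fip:.2f}")
```

Notice that the single number hides which of the three skills is driving it, which is why we would rather look at strikeouts, walks, and home runs on their own.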
This doesn’t even get into other advanced ERA estimators, such as SIERA and tERA. These both involve complicated formulae. While they’re well intentioned, they’re still a bit too complex for our purposes. We like to keep things relatively simple. To sum up:
When we talk about pitching we prefer to talk about skills rather than outcomes. Sure, we’ll cite ERA here and there, because it does tell us something. We’d prefer straight RA, but that’s not a readily available stat (for some strange reason). But we’ll mostly talk about strikeout rate, walk rate, home run tendencies, command, control, velocity, movement, deception, and everything else that goes into pitching. It gives us a better idea of the pitcher as a whole. As I think we’ve made clear, there are many variables in the pitching equation, and not all of them have to do with the pitcher himself.
There is no larger controversy in baseball statistics than defensive statistics. There is no shortage of them, but they all have flaws. Some of them, such as UZR and DRS, use batted ball data (more on that in a second). Some of them just look at total number of plays made against the league average at the position, without any context. None of them are precise in classifying the difficulty of play. And so we don’t like to use defensive numbers.
(This is, without a doubt, our biggest about-face in the past year or 18 months.)
If you read Colin Wyers from Baseball Prospectus, you know the issues with UZR and other play-by-play based stats. They’re all scored by humans, and humans are prone to biases. If you look at a batted ball scatter plot from a human scorer, you’ll see clumping in certain areas. If you look at a batted ball scatter plot from Hit F/X, you’ll see a much more even distribution. The clumping is partly because the human scorers collect this data from video feeds, which 1) gives them little in terms of landmark points, and 2) forces them to make 3D judgments on a 2D screen.
That last point is something to hammer home. Whether it’s judging where a fielder picked up a baseball or plotting a pitch in the strike zone, anyone scoring on TV is at a disadvantage. Life happens in 3D. TV is 3D compressed into a 2D image. That makes it exceedingly difficult to make accurate assessments. Add in off-center camera angles, and you have a whole host of biases built into the system. We’re all for scouting, even on video, but when precision is the goal it’s not the ideal solution.
(The same goes for fly balls and line drives. They’re often scored from the press box, but even then it’s difficult to score with accuracy. Different press boxes are at different heights, thus making it difficult to make an objective judgment about a fly ball vs. a line drive. Also, the difference between a fly and a liner is subjective to begin with, at least around the point that they meet.)
So what do we do when it comes to defensive stats? We either 1) consider many sources of defensive numbers in making an assessment, 2) use the admittedly unreliable eye test, or 3) ignore it completely. The last isn’t preferable, but in some instances it is better than making a judgment based on poor data.
A word on WAR
WAR has become a popular statistic, since it takes into account both offense and defense. Yet with the above-mentioned problems with defensive statistics, we can’t be certain of WAR’s accuracy. There’s also the matter of scarcity. If Team A has a 7-win player and a 0-win player and Team B has two 3.5-win players, Team A is far better off. For starters, there aren’t many 7-win players in the league, so they have a rare commodity. Also, the 0-win player is easy to replace at little cost. The 3.5-win players are difficult to upgrade, 1) because players above that level are increasingly scarce, and 2) because players above that level are expensive to obtain. Team A has a better chance of improvement at a reasonable cost. Thus, the 7-win player isn’t simply 3.5 wins better than the 3.5-win player. There’s another level of analysis between them.
The concept of a win is solid, though. While we might not have an exact formula, each player is worth a certain number of wins to his team, when compared to a replacement player. We mostly use this in the theoretical, though. It can illuminate an argument or a discussion point, but it will probably not drive one.
While our previous guides are pretty much obsolete, either because we don’t hold the same beliefs or because the explanations were too complex, there is one holdover from our previous guide to stats. WPA is something that has remained largely the same. We don’t often use it for analysis, but it’s a fun little stat.