New and improved PITCHf/x data

A couple of Fridays ago a bomb was dropped on the analytical baseball community. However, in this case, it was perhaps the greatest bomb ever deployed. You see, my friends, Dan Brooks of the renowned Brooks Baseball announced with zero fanfare not only that Brooks (a terrific asset as far as individual-game data goes, but one that had lagged behind in multi-season data) would now be carrying Player Cards featuring seasonal data, but also that the PITCHf/x data dating back to 2007 (the first year data became available) had been manually reclassified by PITCHf/x gods Lucas Apostoleris and Harry Pavlidis.

That’s right; somehow, someway, Lucas and Harry sifted through three-and-a-half-million pitches worth of PITCHf/x data, so that amateur analysts like myself would have the most accurate data possible to play with.

Why is this important? Well, for starters, pretty much any time I’ve talked about Ivan Nova over the last six months, it came with the caveat that we knew his second-half success was due in part to increased deployment of his slider, but I didn’t have the data to back this assertion up, as the PITCHf/x system stubbornly insisted that Nova only threw a slider 3.9% of the time. Now we know the truth.

Check out the following table, showing Nova’s non-reclassified 2011 PITCHf/x data, against Lucas and Harry’s reclassified 2011 PITCHf/x data:

The four-seam classification was pretty much on the money, as was the curveball, but the rest of Nova’s repertoire was pretty butchered by PITCHf/x. As you can see, Nova actually threw his slider 13% of the time instead of 3.9%, while the reclassification also determined that Nova threw a sinker, not a two-seamer. He also threw about half as many changeups as the un-reclassified data said he did, and he doesn’t actually have a cutter at all.

Lucas and Harry could've called it a project right there, and we would've been plenty happy simply having accurate PITCHf/x data. But no, they decided to go even further, providing pitch-outcome and sabermetric-outcome breakdowns by pitch type, and while some of these categories have been available at T-Leaguers and Lefkowitz, never before has all of this data been available in one place. In particular, the Whiff/Swing% on an individual pitch level is simply astounding, and something that's never been freely available. Check out the remainder of Nova's 2011 stats:

Now, we had a pretty good idea that Nova’s new-and-improved slider was nasty, but I don’t think anyone realized it was 43.1% Whiff-per-Swings-Taken nasty! As a frame of reference, CC Sabathia, who boasts one of the top sliders in the game, recorded a Whiff/Swing of 40.9% last season (though in fairness, he also threw it 27% of the time).
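For anyone wondering how a Whiff/Swing number comes together, here is a minimal Python sketch, assuming a hypothetical pitch log of (pitch type, outcome) pairs. The outcome labels, the `whiff_per_swing` helper, and the sample pitches are all made up for illustration; this is not Brooks Baseball's actual code or data. The idea is simply whiffs divided by total swings, tallied separately for each pitch type.

```python
from collections import Counter

# Outcomes that count as a swing (labels are assumptions for this sketch).
SWING_OUTCOMES = {"whiff", "foul", "in_play"}

def whiff_per_swing(pitches):
    """Return {pitch_type: whiffs / swings} for pitch types with at least one swing."""
    swings = Counter()
    whiffs = Counter()
    for pitch_type, outcome in pitches:
        if outcome in SWING_OUTCOMES:
            swings[pitch_type] += 1
            if outcome == "whiff":
                whiffs[pitch_type] += 1
    return {p: whiffs[p] / swings[p] for p in swings}

# Hypothetical pitch log, NOT real PITCHf/x data.
sample = [
    ("slider", "whiff"), ("slider", "whiff"), ("slider", "foul"),
    ("slider", "in_play"), ("slider", "ball"),
    ("four_seam", "whiff"), ("four_seam", "foul"), ("four_seam", "foul"),
    ("four_seam", "in_play"), ("four_seam", "called_strike"),
]

rates = whiff_per_swing(sample)
for pitch_type, rate in sorted(rates.items()):
    print(f"{pitch_type}: {rate:.1%} whiffs per swing")
```

Note that balls and called strikes drop out of the denominator entirely, which is why a pitch that hitters rarely offer at can still post a gaudy Whiff/Swing.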

In the aftermath of this insane treasure trove of new data, I couldn’t help but wonder whether they’d be adding league average data (helpful as an additional reference point), and also if we could expect to have manually reclassified data for the upcoming 2012 season, as it’d be quite helpful to have the full spectrum of accurate data when looking at a given pitcher’s offerings across multiple seasons. Incredibly, both Lucas and Harry confirmed via e-mail that they do indeed plan to reclassify pitches on an ongoing basis throughout the season.

This is probably one of the most important sabermetric projects undertaken in the last 10 years. Not only have they devoted their time and energy to delivering a product any of us can access free of charge, but the fact that they've also committed to maintaining an accurate set of data on a go-forward basis is just mind-blowingly awesome.

  • Gonzo

    Wow, this was a massive undertaking. Wow.

  • Johnny

    How do they determine that he throws a sinker instead of a two-seamer? I thought everybody had pretty much agreed that it was the latter. I assume they’re looking at break, yes? In any case, the whole thing is pretty cool.

    • Harry Pavlidis

      That’s a good question. I don’t. By convention, two-seamers are called sinkers, even if it’s a tailing pitch. Further confounding is Tim Lincecum, who supposedly does not employ a four-seam grip but has two fastballs, one of which I call a fastball and the other a sinker/two-seamer … they’re both two-seam fastballs, and neither is a sinker. The main thing is finding the clusters/groups; labeling is another thing.

  • Michael

    Nerd alert! #ILoveIt

  • Tom Zig

    Good god. This is tremendous. Good work.

  • CMP

    Are these pitches categorized manually, with someone watching every pitch of every game, or by computer?

    • Harry Pavlidis

      PITCHf/x systems include a couple of calibrated cameras that give us the path of the ball. From the deflection from the path of a spinless ball, we can distinguish pitch types quite well. Most of the time.

      • CMP

        I must say the whole thing is extremely impressive.

  • Jeff

    Sweet statistical jesus

  • JoeyA

    So we aren’t using wins and ERA anymore?

    In all seriousness, did anyone think to simply ask the dude what he throws? How did it take all of this analysis and work to find out Nova doesn’t throw a cutter?

    Anyway, this is amazing and only furthers my need to understand all of these statistics in more detail.

    • A.D.

      Pssssh Nova doesn’t know what he throws, the machines will tell him!

      • Thomas

        Ivan Nova is a machine. Albert Pujols made love to a toaster. Nine months later the toaster finished and Nova came out. He wasn’t even burnt.

        Pujols also did it with a microwave, but it has been more than nine months and that burrito still isn’t done.

        • Gonzo

          Post of the day!

        • Robinson Tilapia

          Homage to TSJC?

          • Thomas

            Of course.

            • Havok9120

              Then I approve.

  • iftheshoe_fits

    Lucas and Harry are amazing…

    • Lucas Apostoleris

      Harry is much more amazing

  • FP

    Or you could have just looked at his Fangraphs profile with Baseball Info Solutions data months ago and seen virtually the exact pitch percentage breakdown, minus the sinkers separated out.

    • Bo Knows

      I’ve never found fangraphs all that good at determining pitches, or the advanced breakdowns of said pitches

    • Larry Koestler

      Though BIS may have come closer to the correct SL% for Nova than the non-reclassified PITCHf/x data did, the BIS data — which, as you note, lumps all four- and two-seamers into one catchall and unhelpful “FB” category — is generally considered inadequate by most analysts.

      I’d rather have the best possible data available than flawed data, even if the latter may be slightly easier to come by.

      • Harry Pavlidis

        Nova, and Melancon, have split IDs that are unpublished. Actually, as I noted in a comment below, that would be in Lucas’s data (Nova, I’ve got Melancon sinkers and cutters).

  • Jesse

    Holy shit, this is amazing.

  • I am not the droids you’re looking for…

    Hey Larry maybe you and the guys can now go back and re-do all those posts wherein you analyzed why it was possible that AJ might not completely suck this year.

    • Frank

      No way man… enough time and money wasted, lol.

  • Harry Pavlidis

    Hey, thanks all. I did classify every pitch, Lucas has reviewed many and we are going to merge his data, and others, over time. No I’m not crazy (that’s a lie).

    Not everything is right, that’s for sure, especially since this project was done over a four year period. We’ve already made some improvements.

    So, please sign up for the forums and let us know when you see something that needs work. Otherwise, I just hope you all enjoy it.


  • Dick Whitman

    Dear Lucas and Harry,

    Thank you.

  • Uli440

    Swing% and Swing-Miss% (whiffs/total swings) aren’t quite new. Joe Lefkowitz’s pitcher cards have them broken down by pitch type and batter handedness.

  • BobCano

    This is remarkable from a data perspective, great work by all.

    Now my comment is very nit-picky:

    When you prepare these charts, please rank both against the same metric (in this case Sel/Freq). It would make it easier to see what’s thrown most -> least in one quick glance.

    Also, for the layman of the stat world (myself included within said bucket) a legend would be extremely helpful. This way one would avoid thinking that CU was referring to Cutter as opposed to Change-Up.

    Thanks in advance,

    Your brother