Posted by Sean Forman on July 13, 2010
This post is the ninth in our series of ten new features for our tenth anniversary.
Thanks to the heroes at RetroSheet, we now have access to box scores and gamelogs for every major league game from 1920-2009, and play-by-play going back to 1950. That is ninety years of major league history and more than 150,000 games. It covers over 80% of the batting seasons in major league history (back to 1871), 74% of the team seasons, and full careers for 72% of all major league players.
This latest update fills the doughnut hole we had for 1940-1951, so we now have Ted Williams' 84-game on-base streak in 1949 and Joe DiMaggio's 56-game hitting streak in 1941.
The Play Index now contains all of these seasons as well. Just to brag a little bit about the database we now have on the site: our play-by-play database (1950-2010) has over 9m rows with 200 columns each, or just over 1.8 billion cells of pbp data. The batting gamelog table from 1920-2010 has over 3.7m rows, and you can search them all. The batting splits table has over 6.1m different yearly splits calculated. There is a lot of data on our server.
Thanks again to the people at RetroSheet (led by Dave Smith) for this tremendous body of work. It really takes your breath away to see what these folks have accomplished. When I heard of their project ten or so years ago, it struck me as the Crazy Horse Memorial of baseball statistics, but the big difference is that they're now 80% done with it.
For a complete listing of what our data now covers, see our Data Coverage Page. It runs down the years for which we have pitch data and hit location and type data, and the extent to which we are missing play-by-play from 1950-1973.
The last of our 10 for 10 will launch next week, Murphy willing.