There has been some speculation as to how I manage to update the stats daily. Here is the rundown.
I buy the stats from a third party. I get the play-by-play accounts for the games and also summary stats for the majors and minors. They deposit them on my server between 6 and 7am each morning. These are in a format unique to them, but using Perl, I manipulate them, apply my own IDs, and then stuff them into the site’s database, about 15 tables of data total.
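To make the ID step concrete, here is a minimal sketch in Python (not the Perl the site actually uses). The vendor's format isn't shown here, so the CSV layout, the `ID_MAP` table, the `batting` table, and the function name `load_batting` are all hypothetical stand-ins, and SQLite stands in for the real database.

```python
import csv
import io
import sqlite3

# Hypothetical vendor feed; the real vendor format is proprietary and not shown.
VENDOR_FEED = """vendor_id,name,ab,h
V1001,Player A,4,2
V1002,Player B,3,0
"""

# Hypothetical lookup from the vendor's IDs to the site's own IDs.
ID_MAP = {"V1001": "playera01", "V1002": "playerb01"}


def load_batting(feed_text, id_map, conn):
    """Insert feed rows under the site's IDs; return vendor IDs with no mapping."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS batting (player_id TEXT, ab INTEGER, h INTEGER)"
    )
    unmapped = []
    for row in csv.DictReader(io.StringIO(feed_text)):
        site_id = id_map.get(row["vendor_id"])
        if site_id is None:
            # Don't guess at an ID; report it so it can be fixed by hand.
            unmapped.append(row["vendor_id"])
            continue
        conn.execute(
            "INSERT INTO batting VALUES (?, ?, ?)",
            (site_id, int(row["ab"]), int(row["h"])),
        )
    conn.commit()
    return unmapped


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
unmapped = load_batting(VENDOR_FEED, ID_MAP, conn)
```

Collecting unmapped IDs instead of inserting them under the vendor's ID is one possible design choice; it keeps foreign IDs out of the site's tables.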
This all happens automatically. A cron job checks every five minutes for the stats to arrive. Once they have, and the program is sure that they have finished uploading (I don’t want a partial file), it marks the site as ready to update.
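One common way to guard against a partial file is to wait until the file has stopped changing for a while. The sketch below (Python, not the actual Perl) assumes that approach; the function names, the 120-second quiet period, and the ready-flag file are all illustrative assumptions, not the site's real mechanism.

```python
import os
import time


def upload_finished(path, quiet_seconds=120, now=None):
    """True if the file exists, is non-empty, and has not been
    modified for at least quiet_seconds (assumed quiet period)."""
    now = time.time() if now is None else now
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return st.st_size > 0 and (now - st.st_mtime) >= quiet_seconds


def check_and_mark(stats_path, ready_flag_path):
    """Meant to be run from cron every few minutes: writes a flag file
    once the upload looks complete. Returns True only on the run that
    creates the flag."""
    if os.path.exists(ready_flag_path):
        return False  # already marked on an earlier run
    if not upload_finished(stats_path):
        return False  # missing, empty, or still being written
    with open(ready_flag_path, "w") as flag:
        flag.write("ready\n")
    return True
```

A crontab entry along the lines of `*/5 * * * * python3 check_stats.py` (script name hypothetical) would run the check every five minutes. If the vendor uploads under a temporary name and renames at the end, checking for the final name would be simpler still.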
I then have other scripts that work on the database and build new tables for specific things like splits, gamelogs, etc. These are all derived from the play-by-play, so if a play-by-play account can’t be parsed (like a strange runner interference when attempting to advance on a pitch that gets away), it breaks everything downstream, and I have to go back in and fix some stuff by hand. This has happened about 6 times this season (I get a page when the stats don’t run right).
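As a toy illustration of why one unparseable play stops everything downstream, here is a Python sketch that derives a simple split (at-bats and hits by pitcher handedness) from play-by-play lines and refuses to continue when a line doesn't match. The one-line-per-play format, the `PLAY_RE` pattern, and all names are hypothetical; the real play-by-play accounts are far richer.

```python
import re
from collections import defaultdict

# Hypothetical format: "batter_id,pitcher_hand,result"
PLAY_RE = re.compile(
    r"^(?P<batter>\w+),(?P<hand>[LR]),"
    r"(?P<result>single|double|triple|homer|out|walk)$"
)
HITS = {"single", "double", "triple", "homer"}


class UnparseablePlay(Exception):
    """Raised so downstream tables are never built from bad input."""


def build_splits(lines):
    """Aggregate at-bats and hits per (batter, pitcher hand)."""
    splits = defaultdict(lambda: {"ab": 0, "h": 0})
    for number, line in enumerate(lines, 1):
        match = PLAY_RE.match(line.strip())
        if match is None:
            # Stop here; a human has to look at this play.
            raise UnparseablePlay(f"line {number}: {line!r}")
        row = splits[(match["batter"], match["hand"])]
        if match["result"] != "walk":  # a walk is not an at-bat
            row["ab"] += 1
        if match["result"] in HITS:
            row["h"] += 1
    return dict(splits)
```

Raising on the first bad line, rather than skipping it, mirrors the behavior described above: a skipped play would silently leave every derived table slightly wrong.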
Next the scripts build the pages (I try to only update the 2007 stuff, so sometimes the 2006 and 2007 pages may not have the same stats or layout as I’m not re-running the entire site every day).
All told this takes about 90 minutes.
This isn’t happening live on the server. It happens on a second server, and the db being updated is a second db. If it all worked OK, I can then transfer it over to the main server.
The pages transfer in seconds, but syncing the play index and other databases takes almost two hours. The big time sinks are recomputing the career splits and batter vs. pitcher tables to handle the new 2007 data.
Then, if everything gets updated, I get a page around 9am that everything has been fixed. I also re-run the previews at that point so they are updated with the previous day’s data.
I’d love to hear how the big guys make the real-time updates they do, because splits and similar tables take forever for me to re-compute.
As for hardware, we’ve got three servers.
1) the main webserver
2) the main dbserver (running three MySQL instances to fully use the 6GB of RAM it has), and
3) the backup and image server, a small machine that just serves static things like images, JS and CSS files.
I’m happy to answer questions folks might have.