UPDATED: Tweaking the Pythagorean Formula

UPDATED:  So I ran the numbers for all the teams in 2008.  The short of it is this:  if you strip out blowouts, you get closer to the real results for some teams, particularly those that outperformed their regular Pythagorean expectations (Astros, LA Angels, Marlins).  For other teams, however, the method went in the wrong direction, and in a big way (particularly the Braves, but the Mariners too).  So this isn't really an improvement on the system, but now we know what happens when you disregard games decided by 9 runs or more.  I learned a lot about the Pythagorean formula along the way, and now I don't have this nagging question in the back of my head, so for me it was a win.  Sorry if you feel I wasted your time!

For the curious, here are the numbers:

Team | Actual Record | Pythagorean Record | Pythagastro Record
Arizona Diamondbacks | 82-80 | 83-79 | 82-80
Atlanta Braves | 72-90 | 78-84 | 83-79
Baltimore Orioles | 68-93 | 72-90 | 74-88
Boston Red Sox | 95-67 | 97-65 | 95-67
Chicago Cubs | 97-64 | 100-62 | 96-66
Chicago White Sox | 89-74 | 90-72 | 84-78
Cincinnati Reds | 74-88 | 71-91 | 74-88
Cleveland Indians | 81-81 | 86-76 | 84-78
Colorado Rockies | 74-88 | 73-89 | 75-87
Detroit Tigers | 74-88 | 78-84 | 76-86
Florida Marlins | 84-77 | 81-81 | 84-78
Houston Astros | 86-75 | 78-84 | 85-77
Kansas City Royals | 75-87 | 71-91 | 71-91
L.A. Angels | 100-62 | 89-73 | 92-70
L.A. Dodgers | 84-78 | 87-75 | 87-75
Milwaukee Brewers | 90-72 | 88-74 | 85-77
Minnesota Twins | 88-75 | 90-72 | 93-69
New York Mets | 89-73 | 90-72 | 87-75
New York Yankees | 89-73 | 88-74 | 87-75
Oakland Athletics | 75-86 | 76-86 | 79-83
Philadelphia Phillies | 92-70 | 94-68 | 88-74
Pittsburgh Pirates | 67-95 | 66-96 | 70-92
San Diego Padres | 63-99 | 66-96 | 66-96
San Francisco Giants | 72-90 | 67-95 | 69-93
Seattle Mariners | 61-101 | 66-96 | 69-93
St. Louis Cardinals | 86-76 | 87-75 | 86-76
Tampa Bay Rays | 97-65 | 92-70 | 88-74
Texas Rangers | 79-83 | 75-87 | 76-87
Toronto Blue Jays | 86-76 | 94-68 | 86-76
Washington Nationals | 59-102 | 61-101 | 62-100
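
If you want to check the "closer for some, worse for others" claim yourself, here is a small Python sketch that measures how many wins each method missed by. The win totals are copied from the table above for the five teams mentioned in the update; the `wins` dictionary and the `misses` helper are just illustrative names I made up for this.

```python
# (actual wins, Pythagorean wins, Pythagastro wins), copied from the table above
wins = {
    "Astros": (86, 78, 85),
    "Angels": (100, 89, 92),
    "Marlins": (84, 81, 84),
    "Braves": (72, 78, 83),
    "Mariners": (61, 66, 69),
}

def misses(actual, pythagorean, pythagastro):
    """Return how many wins each method was off by, as absolute values."""
    return abs(pythagorean - actual), abs(pythagastro - actual)

for team, row in wins.items():
    pyth_miss, astro_miss = misses(*row)
    verdict = "closer" if astro_miss < pyth_miss else "not closer"
    print(f"{team}: Pythagorean off by {pyth_miss}, Pythagastro off by {astro_miss} ({verdict})")
```

The same loop would work for all 30 teams if you typed in the rest of the table.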


OK, so this is going to be a statistics-oriented FanPost and it's going to discuss the Pythagorean win percentage.  So don't say I didn't warn you.  I originally buried it in a comment-thread, but Dying Quail suggested it was FanPost material, so here it is. 

Everybody knows that in 2008 the Astros' actual record was significantly different from their Pythagorean record.  Actual record:  86-75.  Pythagorean record:  77-84.  That's a huge swing, and it has been the basis of many people predicting a significant drop-off for the Astros in 2009.  Some people have tried to get more accurate results out of the Pythagorean formula by fiddling with its exponent.  I'm going to try to do it a different way (one that is a bit more conducive to working with pen and paper).

The Pythagorean formula only looks at how many runs a team scored and how many runs it gave up over the course of its season to determine how many games it was "expected" to win.  This methodology, however, is subject to distortion by blowout games where one team scores a lot of runs and the other team scores few.  These blowout games, though, are not normally indicative of how a team plays in most of its games.
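
For reference, the formula in its classic form looks like the following. This assumes the original exponent of 2; the symbols RS (runs scored) and RA (runs allowed) are my shorthand, and some versions use an exponent closer to 1.8 instead.

```latex
\[
\text{Win\%} = \frac{RS^{2}}{RS^{2} + RA^{2}}
\]
```

Multiply that percentage by the number of games played and you get the "expected" win total.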

This got me wondering whether the Pythagorean records would come closer to the real records if we stripped out these blowout wins and losses.  In theory, removing the blowout wins and losses could give you a better picture of how the team plays most of the time and give less emphasis to those flukey games.  So I did this for the 2008 Astros.

First, I had to figure out which games to strip out.  You don't want to remove too many games or you could easily distort the record the wrong way.  So I decided to remove the most lopsided 5% of games.  The way to do this is by looking at each game and calculating the run differential as an absolute value, whether it was a win or a loss.  If the Astros score 4 and the Dodgers score 2, the run differential is 2.  If the scores were flip-flopped, the run differential would still be 2.  Then you calculate the average run differential over the course of the season.  In this case, the average was 3.39 runs.
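
If you'd rather not do this by hand, here is a rough Python sketch of that step. The `games` list is a made-up handful of scores standing in for a real game log (it is not the 2008 Astros data), and the `margins` function name is just something I picked for illustration.

```python
from statistics import mean

def margins(games):
    """Return the absolute run differential for each (runs_for, runs_against) game."""
    return [abs(runs_for - runs_against) for runs_for, runs_against in games]

# Hypothetical scores, not the actual 2008 Astros game log.
games = [(4, 2), (2, 4), (1, 0), (3, 12), (7, 6)]

game_margins = margins(games)
print(game_margins)        # one margin per game; wins and losses are treated alike
print(mean(game_margins))  # average margin across the sample
```

Note that the 4-2 win and the 2-4 loss both come out as a margin of 2, which is the whole point of taking the absolute value.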

The next step is to calculate the standard deviation.  As a rule of thumb, if the run differentials were roughly normally distributed, about 95% of them would fall within 2 standard deviations of the average (game margins aren't perfectly bell-shaped, but it's close enough for this purpose).  The standard deviation in the case of the 2008 Astros was 2.41 runs.  So, in theory, about 95% of games would have a run differential of no more than 3.39 + (2 x 2.41), or about 8.2 runs.
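
Here is the same cutoff arithmetic as a short Python sketch, using the rounded average and standard deviation quoted above. The `blowout_cutoff` name and the `k` parameter are my own for illustration; with these rounded inputs the cutoff lands at roughly 8.2 runs, so the smallest whole-run margin above it is 9.

```python
import math

def blowout_cutoff(avg_margin, std_margin, k=2):
    """Margin above which a game counts as a blowout: mean + k standard deviations."""
    return avg_margin + k * std_margin

cutoff = blowout_cutoff(3.39, 2.41)

# Scores are whole numbers, so the first margin past the cutoff is the next integer up.
first_blowout_margin = math.floor(cutoff) + 1

print(round(cutoff, 2))
print(first_blowout_margin)
```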

So I went through the Astros' games and removed all the games where the run differential was 9 runs or more.  In a 161-game season, this ended up being 8 games, or just under 5% of the season (interestingly, the Astros only had a single blowout win and the remaining seven blowouts were losses).  In the remaining 153 games, the Astros scored 688 runs and gave up 659 runs.  Over the full season the Astros scored fewer runs than they gave up, but once we remove the blowout games, the numbers are reversed.  This suggests that the blowout games had a real distorting effect on the Astros' Pythagorean record.
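
In code, the filtering step might look something like this Python sketch. Again, the `games` list is invented sample data rather than the real Astros schedule, and `strip_blowouts` and `run_totals` are hypothetical helper names.

```python
def strip_blowouts(games, min_blowout_margin=9):
    """Keep only games decided by fewer than min_blowout_margin runs."""
    return [
        (runs_for, runs_against)
        for runs_for, runs_against in games
        if abs(runs_for - runs_against) < min_blowout_margin
    ]

def run_totals(games):
    """Return (total runs scored, total runs allowed) over the given games."""
    scored = sum(runs_for for runs_for, _ in games)
    allowed = sum(runs_against for _, runs_against in games)
    return scored, allowed

# Hypothetical scores: one 9-run loss and one 14-run win get removed.
games = [(4, 2), (2, 4), (1, 0), (3, 12), (15, 1), (7, 6)]

kept = strip_blowouts(games)
print(len(games) - len(kept), "blowouts removed")
print(run_totals(kept))
```

Blowout wins and blowout losses are both dropped, so the filter doesn't care which side of the lopsided score the team was on.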

When you plug these numbers into the standard Pythagorean Win Formula, you get a win percentage of about 52.2%.  Over the course of a 161-game season, you get a win-loss record of 84-77.  This is much closer to the Astros' actual record of 86-75.  I'm not sure if this method would work for all teams, but it certainly worked here.  I'll leave it to others to see if this method can be extended to the full league.
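
For anyone who wants to repeat the last step, here is a minimal Python sketch. It assumes the classic exponent of 2 (I didn't spell out the exponent above), and the function names are just illustrative.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=2):
    """Pythagorean win percentage; exponent=2 is the classic form (an assumption here)."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)

def expected_record(runs_scored, runs_allowed, games, exponent=2):
    """Scale the win percentage to a season length and round to whole wins."""
    wins = round(pythagorean_pct(runs_scored, runs_allowed, exponent) * games)
    return wins, games - wins

# Blowout-stripped 2008 Astros totals from the post, scaled to a 161-game season.
print(expected_record(688, 659, 161))
```

Swapping in a different `exponent` lets you see how much the result depends on that choice.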
