# To Add or to Multiply, Part I of many.

April 18, 2007 15 Comments

In sabermetric analysis, there are several circumstances where we need to adjust a stat by some type of factor. Examples include minor league equivalencies (MLEs), park factors, and age adjustments. Say Conan hits 70 home runs in Coors Field. How many would he have hit in a normal park? How many would he have hit in a tough home run park, like Seattle? A top prospect hits .375 in the Pacific Coast League. What’s that worth if he’s called up to the majors?

The first time I saw MLEs and park factors was in the Baseball Abstracts of Bill James. He used multiplicative factors: a hitter loses 18% of his value in jumping from AAA to the majors, or Wrigley Field inflates run scoring by 12%. As more data became available, we could go a lot further than looking at an end result such as runs. What’s the factor for batting average? For home runs, doubles, and strikeouts?

It wasn’t until somewhat recently that I read of using additive factors. Additive factors were used in the projections published in The Hardball Times Season Preview, yet so far I haven’t seen a good explanation why I should use them after so many years of using multiplicative factors. At the same time, just because Bill James did it this way and so has everyone else who followed in his footsteps does not mean that it has to be done that way.

What is the right way to adjust? First, let’s stop and think about what each choice implies. Let’s say a ballpark is especially good for hitting doubles, inflating doubles by 20%. This could also be expressed as +5 doubles for every 500 balls in play (BIP). If an average player hits 25 doubles with 500 BIP, then in this park he’ll hit 30. It doesn’t matter, for the average player, whether you add or multiply.

Now take a high-doubles hitter who normally hits 40. If you multiply, he’s at 48; if you add, he’s at 45. By multiplying, the hitters who are already good at something benefit most from a friendly park. By adding, all players gain the same benefit.
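The arithmetic above is easy to check in a few lines. This is only a sketch of the two adjustments as described in the example; the function names are mine, and the 20% and +5 per 500 BIP figures are the illustrative numbers from the doubles example, not measured park factors.

```python
def adjust_multiplicative(doubles, factor):
    """Scale a player's doubles by a park factor (1.20 means +20%)."""
    return doubles * factor


def adjust_additive(doubles, balls_in_play, bonus_per_500_bip):
    """Add a fixed number of doubles per 500 balls in play."""
    return doubles + bonus_per_500_bip * balls_in_play / 500


# Illustrative hitters from the example: both have 500 balls in play.
for label, doubles in [("average hitter", 25), ("high-doubles hitter", 40)]:
    mult = adjust_multiplicative(doubles, 1.20)
    add = adjust_additive(doubles, 500, 5)
    print(f"{label}: multiply -> {mult:.1f}, add -> {add:.1f}")
```

The two methods agree only for the player the factor was calibrated on (the average hitter); the further a player is from average, the more the two answers diverge.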

What is the correct method? I really don’t know. My answer is to use whatever best models reality. I will start by looking at pitcher strikeouts and park factors. At this point, I can’t skip ahead and tell you to add or multiply. I haven’t done the work yet, but it’s possible that both will have their uses depending on the situation.

The first look is at the 2006 pitching staffs of the A’s and Mariners. The A’s had a one-year strikeout factor of 0.91 according to the 2007 Bill James Handbook. The Mariners, on the other hand, had a factor of 1.11. Looking at only Mariners pitchers, and using a matched at-bat method instead of simply dividing team home strikeout rate by road strikeout rate, I get an even larger adjustment, 1.185. Next, I split the Mariners into two groups, the high-strikeout pitchers and the low-strikeout pitchers. The low-strikeout pitchers struck out 26% more batters at home than they did on the road. The high-strikeout pitchers struck out only 14% more. The low-strikeout pitchers added 3.1 strikeouts per 100 at bats, and the high-strikeout pitchers added an almost identical 3.3 strikeouts per 100 AB. Score one for additive factors!

I repeated the same exercise for the A’s, who played in a park where strikeouts were scarcer in 2006. In total, A’s pitchers struck out 8% fewer hitters at home than on the road. Now things get a little tricky. The percentage of strikeouts lost was 3.2% for the high-strikeout pitchers and 16.7% for the low-strikeout pitchers. Per 100 AB, it’s 0.7 strikeouts lost for the high-K guys and 2.5 for the low-K guys. This is not as clear-cut as for the Mariners pitchers, but a multiplicative factor will punish the high-K guys more, when it is the low-strikeout pitchers that are affected most. An additive factor, while not fitting perfectly, is a better fit here as well.
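To put the two teams side by side, here is a rough sketch that compares how consistent the percentage effect and the per-100-AB effect are across the high-K and low-K groups. The inputs are the rounded figures quoted above; the `spread` measure and the back-calculated road rates are my own illustration, not part of the original analysis, and the rounding makes the implied rates approximate.

```python
# (team, group) -> (percent change at home, change in K per 100 AB at home),
# using the rounded figures quoted in the text.
splits = {
    ("Mariners", "low-K"): (0.26, 3.1),
    ("Mariners", "high-K"): (0.14, 3.3),
    ("A's", "low-K"): (-0.167, -2.5),
    ("A's", "high-K"): (-0.032, -0.7),
}


def implied_road_rate(pct_change, abs_change):
    """Road K per 100 AB implied by home = road * (1 + pct) = road + abs."""
    return abs_change / pct_change


def spread(a, b):
    """Relative gap between two effects; 0 means the groups match exactly."""
    return abs(a - b) / max(abs(a), abs(b))


for team in ("Mariners", "A's"):
    low_pct, low_abs = splits[(team, "low-K")]
    high_pct, high_abs = splits[(team, "high-K")]
    print(f"{team}: percent spread {spread(low_pct, high_pct):.2f}, "
          f"additive spread {spread(low_abs, high_abs):.2f}")
    for group in ("low-K", "high-K"):
        pct, delta = splits[(team, group)]
        print(f"  {group} implied road rate: "
              f"{implied_road_rate(pct, delta):.1f} K per 100 AB")
```

A smaller spread means one number describes both groups better. On these figures the additive effect is the more consistent one for both teams, though only narrowly so for the A’s.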

It’s only two teams and two years, and we’ll have to look at more data before this is settled, but for pitcher strikeout park factors, additive factors look like the way to go.

Sean, what you’re describing are random effects in a general linear model. Not sure if you’re familiar with them, but the idea is that the slope of the line depends on the intercept of the line. Also, a standard regression line has both a multiplicative term (slope) and an additive term (intercept). Perhaps you are simply looking for a regression equation?

But we can also get park factors for 1B, 2B, 3B, HR, etc. I’m not sure what you suggest makes the most sense (I think it could have greater error), but I would need to think about it some more.

PC, are you saying that park factors should/could be regression equations with both additive and multiplicative effects?

I haven’t done nearly enough research to make any kind of firm statement. My money, however, would be on a random effects model as a best fit. Let the slope vary with the intercept and see what happens. This allows for the thought that players who are already hitting a lot of HR (and thus, deep fly balls) will have their HR totals boosted by a “hitter friendly” park, but a slap/groundball hitter won’t even notice moving from Old Tiger Stadium to a little bandbox. I don’t see why it’s not possible though to take a look. When I have a little more done on my dissertation (you know, the real data set I should be looking at…) I’ll see about diving into that.

PC brings up a good point. I haven’t thought about using a slope/intercept equation, but maybe it would work.
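One way to picture the slope/intercept idea: regress home rate on road rate across pitchers. A pure additive park effect should show up as a slope near 1 with a nonzero intercept, and a pure multiplicative effect as an intercept near 0 with a slope equal to the factor. The sketch below uses made-up road rates and synthetic home rates purely to show the two signatures; nothing here is real park data, and `fit_line` is just ordinary least squares written out by hand.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical road strikeout rates (K per 100 AB) for five pitchers.
road = [10.0, 14.0, 18.0, 22.0, 26.0]

# Synthetic home rates under each pure model (assumed values, for illustration).
home_additive = [r + 3.0 for r in road]         # every pitcher gains 3 K per 100 AB
home_multiplicative = [r * 1.15 for r in road]  # every pitcher gains 15%

for label, home in [("additive park", home_additive),
                    ("multiplicative park", home_multiplicative)]:
    intercept, slope = fit_line(road, home)
    print(f"{label}: intercept {intercept:.2f}, slope {slope:.2f}")
```

With real data the fit would land somewhere between the two extremes, and the noise in individual home/road splits would matter a great deal.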

I don’t have a lot of data to work with on parks, not being a Retrosheet whiz. I can grab home/away stats one team at a time.

A Juan Pierre situation (no power no matter where he plays) suggests a pure multiplicative effect, as his intercept will be zero or pretty close to it. Tango Tiger did an example on home runs using an additive effect; he used John Olerud, Mark McGwire and I think Pierre as his examples.

The short of it is that while Olerud might hit barely more homers than Pierre in the Astrodome, moving the three players to Coors might mean as many additional HR for John as for Big Mac.

To do something like that properly you’d need way more data than I have, and more than is available right now. You’d need to know how many deep fly ball outs a guy hits, and Hit Tracker is only doing that for a selected number of teams right now.

One other thing about park factors –

The way they are calculated is a primitive, 1980s approach. We really should control for both hitter and pitcher – Seattle’s park factor can be greatly affected if, let’s say, Felix Hernandez (sorry) gets a majority of his innings at home and Jeff Weaver a majority on the road.

We should look at, let’s say, Felix vs. Garret Anderson at home and Felix vs. Garret on the road, match plate appearances, and repeat for every hitter/pitcher combo in baseball.
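Here is a minimal sketch of what matching plate appearances by hitter/pitcher pair could look like. It is one plausible reading of the idea, not a tested method: each pair is weighted by the smaller of its home and road PA, so a matchup that happens mostly in one park can’t tilt the comparison. The player labels and counts are placeholders, not real data.

```python
# (pitcher, batter) -> {"home": (PA, strikeouts), "road": (PA, strikeouts)}
# Placeholder matchups with invented counts, for illustration only.
matchups = {
    ("pitcher_A", "batter_X"): {"home": (20, 6), "road": (10, 2)},
    ("pitcher_A", "batter_Y"): {"home": (8, 1), "road": (12, 1)},
    ("pitcher_B", "batter_X"): {"home": (5, 1), "road": (0, 0)},  # no road PA
}


def matched_rates(matchups):
    """Return (home_rate, road_rate) of strikeouts per PA over matched PA."""
    total_weight = 0.0
    home_sum = 0.0
    road_sum = 0.0
    for splits in matchups.values():
        home_pa, home_k = splits["home"]
        road_pa, road_k = splits["road"]
        weight = min(home_pa, road_pa)  # matched plate appearances
        if weight == 0:
            continue  # pair never met in one of the two settings
        home_sum += weight * home_k / home_pa
        road_sum += weight * road_k / road_pa
        total_weight += weight
    return home_sum / total_weight, road_sum / total_weight


home_rate, road_rate = matched_rates(matchups)
print(f"multiplicative factor: {home_rate / road_rate:.3f}")
print(f"additive factor: {100 * (home_rate - road_rate):+.1f} K per 100 PA")
```

The same matched rates feed either kind of factor, so the matching step and the add-or-multiply question can be handled separately.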

Then again, maybe it’s a lot of work for little gain and we get the same thing by using a 3-4 year sample for park effects. I don’t know.

Sean, good luck if you want to do that!! That’s a lot of work.

I would like to do that, but don’t hold your breath.

Sean, you’re right on the primitive nature of the park factors. Let’s see if we can bring them into the 00’s before the 00’s end!

Current park factors are hopelessly useless…primitive and they make no logical sense. The essential problem I have with pure multiplicative factors is that they are, essentially, unitless. Parks don’t influence you the more X you do…they influence you the longer you play there.

That being said, it’s possible that linear regression is a useful tool in the era of play by play. Since we know exactly how many times each batter hit in each park in any given year, we can group all of the park data together, weight by each player’s playing time (the part that makes them logically make sense to me) and see how each player is affected (on average) for each event per unit playing time…and also check to see if there is a relationship between a player’s skill in producing X statistic and how the park tends to affect him.

The problem is that pitchers impact the rate of an event too…Turner Field is going to look like it suppresses 2Bs when Maddux and Glavine and Smoltz are all there and Andruw Jones is running down fly balls. It’s all deeply interconnected. The pitchers impact the events. The batters impact the events. The parks impact the events. The unique combinations of batter vs. pitcher often add an additional impact on events.

Pizza Cutter…you’re an expert on multivariate statistical analysis…I tend to use linear algebra (Matrices) to deal with systems of variables such as what I did with the FSIA (Fiato-Souders Intrinsic Adjustments) Matrix, but I’m not entirely sold that that accounts for everything…I’d like to get your take on how to solve the intertwining of variables impacting on each other in unison.

A little background for those interested:

http://www.baseballprospectus.com/article.php?articleid=1164

There are so many parameters to tease apart. I call it the “Alex Cole postulate.” Teams build their rosters around the quirks of their ballpark. Think the Rockies check the GB/FB ratio of their potential free-agent pitching signees? Would their strategy work on the carpet in the Metrodome? Probably not.

(The Cole example was the hilarious attempt by the Indians to build their ballpark around the quirks of their roster. At the time, Cole was a young speedster rookie, and in 1991 or 1992, the Indians actually moved the fences back in an effort to make CF bigger, figuring that he could cover it.)

The problem that I see with park effects is that we’re measuring outcomes, when what it seems we’d be best served looking for is what the park means for the flight of the baseball. For example, does the baseball really fly further in Coors Field? (an article I’ve done some preliminary work on… answer is obviously yes). Do more ground balls go through the infield for singles on turf? Does the amazing amount of foul ground in Oakland lead to more foul outs rather than foul balls?

Adjusting for individual players, some players have skills that will be more affected by certain ballparks than others. One possibility that I hadn’t mentioned before is that the effect is actually polynomial in form. For some reason, Sabermetricians seem to be afraid of equations that go above a first-order term. What if we tried to fit the park effect equations (if we could solve the other problems) with quadratic or cubic terms?

It’s late. I need sleep. I’m sure we’ll be talking about this more later.

I definitely don’t fear using multiple-order functions, but I would need a logical explanation as to why park factors should be considered polynomials. Everything I do is based on the demand that what I’m doing makes logical sense. A five year old (with proper training) can understand why run expectancy makes sense. Linear weights…same story. Baseball is a game of discrete events proceeding seamlessly from one to the other…the math shouldn’t be that complex.

I work with five year olds for a living… I’ll try explaining run expectancy next time. The reason that I considered polynomials was that it would allow for a non-linear relationship. That was my only rationale and I don’t have a theoretical basis.

LOL OK so I was being slightly hyperbolic…my point is…it doesn’t take a towering intellect to understand the basic principles that underlie baseball IMHO…it just takes a good logical understanding. BTW I’m not saying you’re wrong to suggest polynomials…perhaps research into polynomial factors would give us some hint at a piece of logic we’re missing. Empiricism does two things for us…it gives us relationships that work better than the ones we’re currently using and it hints at why those relationships work.
