*Note: This post was co-authored by **Al Palensky**.*

A team with the ability to know the outcome of a plate appearance before it happens would have the perfect strategy when picking a reliever or pinch hitter to put in the game. This is obviously not possible, but there are ways to optimize matchups.

To make this easier to think about, consider major league players. Suppose Clayton Kershaw must face either Mike Trout or Adam Dunn in a hypothetical situation. The head-to-head stats show that Trout has struggled against him, with three hits in 18 at-bats, while Dunn has done very well, with eight hits in 13 at-bats and four home runs. However, a manager would never play Dunn over Trout, even though these small-sample stats suggest it is the better choice. Batter handedness is another example: it is not always true that any right-handed hitter is a better matchup against a left-handed pitcher than a given left-handed hitter. A team’s scouting process can be greatly improved by quantifying matchup optimization.

This project had two attempts. The first, last spring, left room for improvement. The goal was to simulate every pitch of every plate appearance. Each pitcher and hitter matchup would be combined to find the probability of each pitch type being thrown in each count. After the pitch type was predicted, we would then find whether the pitch was in the zone. This would in turn determine whether the batter swung, whether contact was made, and whether the contact was over 95 miles per hour. This was completed and showed some promise. The method produced play-by-play data, with a pitch type, location, and outcome for every pitch. However, the final stats featured far too many strikeouts and walks. With this in mind, we decided the best option was to redo the methodology with a simpler method that did not require every pitch to be documented.

The second attempt, which was completed with fellow analyst Al Palensky, created a more effective method for simulating plate appearances. The first step was the same as in the project’s previous attempt. We found every cluster and group thrown by a given pitcher and combined it with the hitter’s stats when facing that pitch type. This group of stats includes the swing, contact, and hard-hit rates for pitches inside and outside of the zone. Next, the percentage of strikeouts among all walks and strikeouts was calculated. The final variable, which would become the target variable in a future model, is the percentage of plate appearances ending with the given pitch type and a ball in play. A value over 50% counts as a success, and a value below 50% counts as a failure to put the ball in play.
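The two derived quantities described above can be sketched in a few lines. This is a minimal illustration, not the code used for the project: the function names are hypothetical, and treating a rate of exactly 50% as a failure is an assumption, since the text only covers values above and below that mark.

```python
def strikeout_share(strikeouts, walks):
    """Share of strikeouts among all strikeouts and walks (None if there are neither)."""
    total = strikeouts + walks
    return strikeouts / total if total else None


def ball_in_play_label(ball_in_play_rate):
    """Binary target: 1 (success) if more than half of plate appearances
    ending on this pitch type were balls in play, otherwise 0.
    Assumption: a rate of exactly 0.5 is labeled 0."""
    return 1 if ball_in_play_rate > 0.5 else 0


# Hypothetical counts, for illustration only.
k_share = strikeout_share(strikeouts=30, walks=10)
label = ball_in_play_label(0.62)
```

Here `k_share` is the strikeout probability used later when a plate appearance does not end in a ball in play, and one minus it is the walk probability.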

Once the data was prepared, we were able to train a random forest model. With the first step, the goal is to see results of all pitches from each specific group. Some pitches may end up being out of the zone more often and some may end up getting swung at more often league wide. The specified pitcher’s stats with each of their pitches were then put into the created model. Here is an example of what this would look like:

This gives us an idea of how this pitcher has performed with these five pitch clusters. The next step is to find these stats for a given hitter. The hitter’s stats are calculated very similarly based on the clusters that they have faced. Here is an example of this output, with pitcher handedness specified:

You will notice that these are the exact same stats. There is only one addition to the hitter stats, which is batted ball statistics.

If the model predicts that a plate appearance will result in a ball in play, the three batted ball variables (exit velocity, launch angle, and hit direction) will be generated under the assumption that they each follow a normal distribution. These two tables give us all the information we need to move on to the next step, which is to combine the two tables.

The idea here is to balance out the strengths and weaknesses in each matchup. A hitter with a good K/BB ratio will end up striking out more than normal against a pitcher who accumulates more strikeouts, while a hitter with a bad K/BB ratio may end up improving this number against pitchers who do not get a lot of strikeouts. Usage is normalized from zero to one, so the most frequently used pitch is given a one and the least frequently used pitch is given a zero with everything else in between.
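The usage scaling described above is a standard min-max normalization. A minimal sketch, assuming usage is stored as a mapping from pitch cluster to usage share (the cluster names and values are hypothetical):

```python
def normalize_usage(usage):
    """Min-max scale usage so the most-used pitch maps to 1 and the least-used to 0."""
    lo, hi = min(usage.values()), max(usage.values())
    if hi == lo:
        # Assumption: with a single pitch (or identical usage) there is no spread
        # to scale, so every pitch is given a 1.
        return {pitch: 1.0 for pitch in usage}
    return {pitch: (share - lo) / (hi - lo) for pitch, share in usage.items()}


# Hypothetical usage shares for five pitch clusters.
usage = {
    "cluster_1": 0.45,
    "cluster_2": 0.25,
    "cluster_3": 0.15,
    "cluster_4": 0.10,
    "cluster_5": 0.05,
}
scaled = normalize_usage(usage)
```

One consequence of this scaling is that the least-used pitch always receives a weight of zero, no matter how often it is actually thrown.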

Now that this table has been created for the matchup, we are ready to simulate. As mentioned previously, the variable being predicted is whether the plate appearance ends in a ball in play or not. If a ball in play is not predicted, a strikeout or walk is assumed. Which of the two occurs is decided by a randomly generated value: the probability of a strikeout is calculated from the pitcher and hitter stats, and the walk probability is one minus that. If a ball in play is predicted, the exit velocity, launch angle, and hit direction are each generated from a normal distribution, and the values are put into another model that predicts the outcome.
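The flow of a single simulated plate appearance can be sketched as follows. This is a simplified stand-in, not the project's code: a plain probability replaces the random forest prediction, the batted ball means and standard deviations are made up, and the second model that turns a batted ball into an outcome is left out.

```python
import random


def simulate_plate_appearance(p_ball_in_play, p_strikeout, batted_ball, rng):
    """Simulate one plate appearance.

    p_ball_in_play: chance the plate appearance ends in a ball in play
                    (assumption: stands in for the model's prediction).
    p_strikeout:    chance a non-ball-in-play result is a strikeout;
                    the walk probability is 1 - p_strikeout.
    batted_ball:    {variable: (mean, standard deviation)} for the normal draws.
    """
    if rng.random() >= p_ball_in_play:
        outcome = "strikeout" if rng.random() < p_strikeout else "walk"
        return {"outcome": outcome}
    # Ball in play: draw each batted ball variable from its own normal distribution.
    result = {name: rng.gauss(mean, sd) for name, (mean, sd) in batted_ball.items()}
    result["outcome"] = "ball_in_play"
    return result


# Hypothetical inputs, for illustration only.
batted_ball = {
    "exit_velocity": (88.0, 12.0),
    "launch_angle": (12.0, 25.0),
    "hit_direction": (0.0, 25.0),
}
rng = random.Random(0)
results = [simulate_plate_appearance(0.7, 0.6, batted_ball, rng) for _ in range(300)]
counts = {}
for r in results:
    counts[r["outcome"]] = counts.get(r["outcome"], 0) + 1
```

Passing in a seeded `random.Random` keeps a run reproducible, which helps when comparing matchups against each other.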

Here is an example of the output when plate appearances are simulated. You can see that the 300 simulated plate appearances included 51 strikeouts and 40 walks, which leaves 260 at-bats and 209 plate appearances ending in a ball in play. The hitter’s projected line is .300/.393/.650 with a 1.043 OPS, 30.6% hard-hit rate, and 17 home runs (17.7 AB/HR). With such an excellent projected line for the hitter, this matchup is unfavorable for the pitcher.
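For reference, a slash line can be computed from simulated counts as in the sketch below. It assumes walks are the only events that are not at-bats, which fits a simulation whose outcomes are limited to strikeouts, walks, and balls in play. The counts in the example are hypothetical and are not the ones behind the output above.

```python
def slash_line(plate_appearances, walks, singles, doubles, triples, home_runs):
    """Return AVG, OBP, SLG, and OPS from simulated outcome counts.

    Assumption: every plate appearance that is not a walk is an at-bat
    (no hit-by-pitch or sacrifices in the simulation).
    """
    at_bats = plate_appearances - walks
    hits = singles + doubles + triples + home_runs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    avg = hits / at_bats
    obp = (hits + walks) / plate_appearances
    slg = total_bases / at_bats
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg}


# Hypothetical counts, for illustration only.
line = slash_line(plate_appearances=300, walks=30, singles=50,
                  doubles=15, triples=1, home_runs=12)
```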

Adding projected matchup results to advance scouting adds an invaluable dimension to our game plan. It is important to not be discouraged by results if the matchup is not predicted correctly in one instance, since the idea is to project stats over a large sample. Before playing against a team, we can simulate plate appearances against every hitter or pitcher with enough data.

This is a sample taken from a pitcher’s simulations against a team’s lineup. Each row represents a hitter on the opposing team and their stats against the given pitcher. The third row immediately stands out with zero hits in 300 plate appearances. When this happens, it means that there was not enough data on the hitter for the model to be useful. These hitters must be removed from the final product, unfortunately. The other four hitters’ projections indicate that they would be tough matchups for this pitcher. The fourth hitter would represent a matchup where one of the three true outcomes can be expected. If this pitcher faced this hitter, we would be risking the possibility of a home run or walk for an increased chance at a strikeout.

With plate appearances simulated and stats calculated, the final product can now be produced. We would generally view an OPS under .750 as a favorable matchup for the pitcher and an OPS over .850 as a favorable matchup for the hitter. Anything in between that can be viewed as a neutral matchup or adjusted as needed if there are a lot of home runs, strikeouts, or some other trend.
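The OPS cutoffs above translate directly into a small labeling function. This sketch covers only the OPS thresholds; the manual adjustments for home runs, strikeouts, or other trends are not modeled, and the function and label names are hypothetical.

```python
def classify_matchup(ops, pitcher_max=0.750, hitter_min=0.850):
    """Label a projected matchup from the pitcher's point of view using OPS cutoffs."""
    if ops < pitcher_max:
        return "favorable for pitcher"
    if ops > hitter_min:
        return "favorable for hitter"
    # Everything from .750 through .850 inclusive is treated as neutral.
    return "neutral"


labels = [classify_matchup(ops) for ops in (0.700, 0.800, 1.043)]
```

Keeping the cutoffs as parameters makes it easy to tighten or loosen them for a particular opponent.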

This matrix is an example of what the final product looks like with a team’s pitchers as columns and opposing team’s hitters as rows. It has been color-coded to make favorable, unfavorable, neutral, and insufficient matchups easier to find. Insufficient matchups are the blacked-out rows that did not have enough data to create a meaningful projection. When hitters do not have enough data against any pitcher or pitchers do not have enough data against any hitter, they are removed. Some hitters are obviously far more favorable than others, but we can see who might be favorable against some of the better hitters. Neutral matchups are not the best scenario, but they are better than unfavorable matchups.
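The removal rule for hitters and pitchers without enough data can be sketched as below. The layout is an assumption: the matrix is a nested mapping from hitter to pitcher to projected OPS, with `None` marking an insufficient matchup, and all names and values are made up.

```python
def prune_matrix(matrix):
    """Drop hitters with no usable projection against any pitcher, then drop
    pitchers with no usable projection against any remaining hitter.

    matrix: {hitter: {pitcher: projected OPS, or None if insufficient data}}
    """
    rows = {
        hitter: row
        for hitter, row in matrix.items()
        if any(ops is not None for ops in row.values())
    }
    usable_pitchers = {
        pitcher
        for row in rows.values()
        for pitcher, ops in row.items()
        if ops is not None
    }
    return {
        hitter: {p: ops for p, ops in row.items() if p in usable_pitchers}
        for hitter, row in rows.items()
    }


# Hypothetical projections, for illustration only.
matrix = {
    "Hitter A": {"Pitcher X": 0.712, "Pitcher Y": 0.901, "Pitcher Z": None},
    "Hitter B": {"Pitcher X": None, "Pitcher Y": None, "Pitcher Z": None},
    "Hitter C": {"Pitcher X": 0.804, "Pitcher Y": None, "Pitcher Z": None},
}
pruned = prune_matrix(matrix)
```

Individual insufficient cells that survive pruning stay as `None`, which corresponds to the blacked-out entries in the matrix.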

There is one major limitation that the first attempt addressed much better than this one. While we are modeling the outcome of plate appearances, we do not have any idea how many pitches each plate appearance lasts or how the hitter performs in different counts. Finding a way to combine the two methods so that we can predict each pitch and find results based on the count would be a great addition.

There is also the issue that a hitter may have faced a first-round prospect who throws pitches in the same cluster as a given pitcher, which would pull the hitter’s projections down. Another addition would be to adjust hitter performance based on the quality of pitching they faced. Right now, all pitchers who throw a pitch in each cluster are assumed to be the same. While this was good enough for creating meaningful output, the model could be improved further with this adjustment.

Even with its limitations, we have created an excellent tool for projecting matchups before they happen. If there is indecisiveness over who to bring into a game at any given point, this matrix can be analyzed to find the best matchup. This approach to simulations has proven to be useful for predicting the outcome of matchups.

We are very excited to add simulations to the growing list of tools created by our analytics department. It will be exciting to see how the simulations can be improved in the future and how they perform over the rest of the season.
