R tutorial - Exploratory Data Analysis with Baseball Data

Exploring Pitch Data in R: A Course with a Focus on Zack Greinke's 2015 Season

Hello and welcome to this course on exploring pitch data in R. My name is Brian Mills, and I'll be your instructor in this course. You'll use Zack Greinke's 2015 season as your case study. While Greinke has been an excellent MLB pitcher for a number of years, he had an especially dominant month of July in 2015. The data in this course will allow you to examine this more closely and compare his July to other months throughout the course.

You'll use a single dataset describing every pitch thrown by Greinke to explore changes that may have taken place with respect to his pitch velocity, pitch usage, pitch location, and batted ball outcomes across different months. You'll begin in this chapter by exploring pitch velocity. By the end of the chapter, you'll have done enough exploratory analysis to answer the question of whether Greinke's fastball velocity was higher in July relative to the other months he pitched in 2015.

Throughout the exercises in this chapter, you'll gradually develop the graphical skills to make colorful histograms that meaningfully compare velocity distributions. Finally, note that you'll return to some of that analysis in the final chapter of the course, where you'll evaluate whether changes in fastball velocity had any impact on hitting outcomes.

The data used for this course are usually referred to as PITCHf/x or Statcast data. These data use high-tech cameras and a Doppler radar system to collect information on how fast pitches are thrown, the type of pitch thrown, the location of each pitch, the movement and spin rate of each pitch, the velocity of batted balls, and the direction and distance the ball is hit. There's also information on batter handedness and the ultimate outcome of each pitch that's thrown. For example, if the batter swung and missed, this is included in the data in the pitch result variable with the entry "swinging strike". Alternatively, if the batter didn't swing, the data include information on whether the pitch was called a ball or a strike by the umpire.

Finally, there's information on the outcome of the at-bat, such as whether the batter hit a home run, popped out, walked, struck out, or any number of other general baseball outcomes. As you proceed through the course, you'll examine all of these variables, but first you'll want to clean the data so that certain analyses can be performed correctly in R.

The initial data set you'll use is named greinke. The dates in the data set require some editing, and you need to tell R that it should read the game date column as a date. Notice that when you check the class of the game date variable, it's listed as a character string. It works better if this is entered as a date, to allow for plotting time series data later in the chapter. To reformat the variable as a date, you'll use the as.Date() function, which allows you to specify how the dates are formatted in your data.
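The date conversion described above can be sketched as follows. This is a minimal illustration rather than the course's own exercise code: the small data frame stands in for the real pitch data, and the column names game_date and start_speed, along with the values in them, are assumptions made for the example.

```r
# Hypothetical stand-in for the course data; column names and values are assumptions
greinke <- data.frame(
  game_date = c("4/7/2015", "7/4/2015", "7/19/2015"),
  start_speed = c(91.2, 92.8, 93.1),
  stringsAsFactors = FALSE
)

# Before conversion the dates are plain character strings
class(greinke$game_date)

# Tell R the strings are month/day/four-digit year separated by forward slashes
greinke$game_date <- as.Date(greinke$game_date, format = "%m/%d/%Y")

# After conversion the column has class Date and sorts and plots chronologically
class(greinke$game_date)
```

The format string is the important part: %m, %d, and %Y match the month, day, and four-digit year, and the slashes between them match the separators in the raw strings.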

The dates in the greinke data set should be formatted as month/day/year, separated by forward slashes. You'll also make use of the separate() function in the tidyr package to manipulate the game date variable. This allows you to specify a new variable called month so that you can break up your data by month later in the chapter.
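Here is one way the month variable could be created with separate(). It's a sketch under assumptions: the data frame is a hypothetical stand-in with made-up values, the game date column is assumed to be called game_date and to already be a Date, and the july indicator at the end is only there to show how the new month column might be used.

```r
library(tidyr)

# Hypothetical stand-in for the course data, with game_date already converted to a Date
greinke <- data.frame(
  game_date = as.Date(c("4/7/2015", "7/4/2015", "7/19/2015"), format = "%m/%d/%Y"),
  start_speed = c(91.2, 92.8, 93.1)
)

# A Date is written as year-month-day, so split on the hyphen.
# remove = FALSE keeps the original game_date column next to the new ones.
greinke <- separate(
  greinke,
  col = game_date,
  into = c("year", "month", "day"),
  sep = "-",
  remove = FALSE
)

# month is stored as a two-character string such as "07"
greinke$july <- ifelse(greinke$month == "07", "july", "other")
table(greinke$july)
```

Because the new columns are character strings, July is matched as "07" rather than the number 7. Passing convert = TRUE to separate() would turn them into integers instead, if that suits the later analysis better.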

Remember, tidying up your data is always an important first step before diving into your analyses. This step ensures that more complex analysis down the road will go more smoothly.
