2 Data Manipulation

In this section, you will learn:

Some nuances of reading in data.
The basics of dplyr to manipulate data.

select
mutate
filter
group_by
summarize
arrange

A few more advanced dplyr concepts.
How to do ‘joins’.

In this section, we will use the following libraries:

library(tidyverse)
library(janitor)

2.1 Read in Data

read_csv("data_sources/Batting.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   playerID = col_character(),
##   teamID = col_character(),
##   lgID = col_character(),
##   SF = col_logical(),
##   GIDP = col_logical()
## )

## See spec(...) for full column specifications.

## Warning: 45441 parsing failures.
##   row  col           expected actual                       file
## 25015 GIDP 1/0/T/F/TRUE/FALSE     2  'data_sources/Batting.csv'
## 25016 GIDP 1/0/T/F/TRUE/FALSE     10 'data_sources/Batting.csv'
## 25018 GIDP 1/0/T/F/TRUE/FALSE     4  'data_sources/Batting.csv'
## 25028 GIDP 1/0/T/F/TRUE/FALSE     8  'data_sources/Batting.csv'
## 25030 GIDP 1/0/T/F/TRUE/FALSE     3  'data_sources/Batting.csv'
## ..... .... .................. ...... ..........................
## See problems(...) for more details.

## # A tibble: 102,816 x 22
##    playerID  yearID stint teamID lgID      G    AB     R     H  `2B`  `3B`    HR   RBI    SB    CS    BB    SO   IBB   HBP    SH SF    GIDP 
##    <chr>      <dbl> <dbl> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
##  1 abercda01   1871     1 TRO    <NA>      1     4     0     0     0     0     0     0     0     0     0     0    NA    NA    NA NA    NA   
##  2 addybo01    1871     1 RC1    <NA>     25   118    30    32     6     0     0    13     8     1     4     0    NA    NA    NA NA    NA   
##  3 allisar01   1871     1 CL1    <NA>     29   137    28    40     4     5     0    19     3     1     2     5    NA    NA    NA NA    NA   
##  4 allisdo01   1871     1 WS3    <NA>     27   133    28    44    10     2     2    27     1     1     0     2    NA    NA    NA NA    NA   
##  5 ansonca01   1871     1 RC1    <NA>     25   120    29    39    11     3     0    16     6     2     2     1    NA    NA    NA NA    NA   
##  6 armstbo01   1871     1 FW1    <NA>     12    49     9    11     2     1     0     5     0     1     0     1    NA    NA    NA NA    NA   
##  7 barkeal01   1871     1 RC1    <NA>      1     4     0     1     0     0     0     2     0     0     1     0    NA    NA    NA NA    NA   
##  8 barnero01   1871     1 BS1    <NA>     31   157    66    63    10     9     0    34    11     6    13     1    NA    NA    NA NA    NA   
##  9 barrebi01   1871     1 FW1    <NA>      1     5     1     1     1     0     0     1     0     0     0     0    NA    NA    NA NA    NA   
## 10 barrofr01   1871     1 BS1    <NA>     18    86    13    13     2     1     0    11     1     0     0     0    NA    NA    NA NA    NA   
## # ... with 102,806 more rows

Fix Read In Errors

read_csv Looks at the first 1000 rows of data to guess column types, so it often makes mistakes if those 1000 rows are empty for a specific column. We can specify the types of certain columns, or tell it to look at more rows before guessing data types.

read_csv("data_sources/Batting.csv", col_types = cols(SF = col_double(), GIDP = col_double()))

## # A tibble: 102,816 x 22
##    playerID  yearID stint teamID lgID      G    AB     R     H  `2B`  `3B`    HR   RBI    SB    CS    BB    SO   IBB   HBP    SH    SF  GIDP
##    <chr>      <dbl> <dbl> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 abercda01   1871     1 TRO    <NA>      1     4     0     0     0     0     0     0     0     0     0     0    NA    NA    NA    NA    NA
##  2 addybo01    1871     1 RC1    <NA>     25   118    30    32     6     0     0    13     8     1     4     0    NA    NA    NA    NA    NA
##  3 allisar01   1871     1 CL1    <NA>     29   137    28    40     4     5     0    19     3     1     2     5    NA    NA    NA    NA    NA
##  4 allisdo01   1871     1 WS3    <NA>     27   133    28    44    10     2     2    27     1     1     0     2    NA    NA    NA    NA    NA
##  5 ansonca01   1871     1 RC1    <NA>     25   120    29    39    11     3     0    16     6     2     2     1    NA    NA    NA    NA    NA
##  6 armstbo01   1871     1 FW1    <NA>     12    49     9    11     2     1     0     5     0     1     0     1    NA    NA    NA    NA    NA
##  7 barkeal01   1871     1 RC1    <NA>      1     4     0     1     0     0     0     2     0     0     1     0    NA    NA    NA    NA    NA
##  8 barnero01   1871     1 BS1    <NA>     31   157    66    63    10     9     0    34    11     6    13     1    NA    NA    NA    NA    NA
##  9 barrebi01   1871     1 FW1    <NA>      1     5     1     1     1     0     0     1     0     0     0     0    NA    NA    NA    NA    NA
## 10 barrofr01   1871     1 BS1    <NA>     18    86    13    13     2     1     0    11     1     0     0     0    NA    NA    NA    NA    NA
## # ... with 102,806 more rows

read_csv("data_sources/Batting.csv", guess_max = 10000)

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   playerID = col_character(),
##   teamID = col_character(),
##   lgID = col_character()
## )

## See spec(...) for full column specifications.

## # A tibble: 102,816 x 22
##    playerID  yearID stint teamID lgID      G    AB     R     H  `2B`  `3B`    HR   RBI    SB    CS    BB    SO   IBB   HBP    SH    SF  GIDP
##    <chr>      <dbl> <dbl> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 abercda01   1871     1 TRO    <NA>      1     4     0     0     0     0     0     0     0     0     0     0    NA    NA    NA    NA    NA
##  2 addybo01    1871     1 RC1    <NA>     25   118    30    32     6     0     0    13     8     1     4     0    NA    NA    NA    NA    NA
##  3 allisar01   1871     1 CL1    <NA>     29   137    28    40     4     5     0    19     3     1     2     5    NA    NA    NA    NA    NA
##  4 allisdo01   1871     1 WS3    <NA>     27   133    28    44    10     2     2    27     1     1     0     2    NA    NA    NA    NA    NA
##  5 ansonca01   1871     1 RC1    <NA>     25   120    29    39    11     3     0    16     6     2     2     1    NA    NA    NA    NA    NA
##  6 armstbo01   1871     1 FW1    <NA>     12    49     9    11     2     1     0     5     0     1     0     1    NA    NA    NA    NA    NA
##  7 barkeal01   1871     1 RC1    <NA>      1     4     0     1     0     0     0     2     0     0     1     0    NA    NA    NA    NA    NA
##  8 barnero01   1871     1 BS1    <NA>     31   157    66    63    10     9     0    34    11     6    13     1    NA    NA    NA    NA    NA
##  9 barrebi01   1871     1 FW1    <NA>      1     5     1     1     1     0     0     1     0     0     0     0    NA    NA    NA    NA    NA
## 10 barrofr01   1871     1 BS1    <NA>     18    86    13    13     2     1     0    11     1     0     0     0    NA    NA    NA    NA    NA
## # ... with 102,806 more rows

Clean Column Names

clean_names provides some helpful processing, such as making column names lowercase and replacing spaces and periods with underscores.

read_csv("data_sources/Batting.csv", col_types = cols(SF = col_double(), GIDP = col_double())) %>% 
  clean_names()

## # A tibble: 102,816 x 22
##    player_id year_id stint team_id lg_id     g    ab     r     h   x2b   x3b    hr   rbi    sb    cs    bb    so   ibb   hbp    sh    sf  gidp
##    <chr>       <dbl> <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 abercda01    1871     1 TRO     <NA>      1     4     0     0     0     0     0     0     0     0     0     0    NA    NA    NA    NA    NA
##  2 addybo01     1871     1 RC1     <NA>     25   118    30    32     6     0     0    13     8     1     4     0    NA    NA    NA    NA    NA
##  3 allisar01    1871     1 CL1     <NA>     29   137    28    40     4     5     0    19     3     1     2     5    NA    NA    NA    NA    NA
##  4 allisdo01    1871     1 WS3     <NA>     27   133    28    44    10     2     2    27     1     1     0     2    NA    NA    NA    NA    NA
##  5 ansonca01    1871     1 RC1     <NA>     25   120    29    39    11     3     0    16     6     2     2     1    NA    NA    NA    NA    NA
##  6 armstbo01    1871     1 FW1     <NA>     12    49     9    11     2     1     0     5     0     1     0     1    NA    NA    NA    NA    NA
##  7 barkeal01    1871     1 RC1     <NA>      1     4     0     1     0     0     0     2     0     0     1     0    NA    NA    NA    NA    NA
##  8 barnero01    1871     1 BS1     <NA>     31   157    66    63    10     9     0    34    11     6    13     1    NA    NA    NA    NA    NA
##  9 barrebi01    1871     1 FW1     <NA>      1     5     1     1     1     0     0     1     0     0     0     0    NA    NA    NA    NA    NA
## 10 barrofr01    1871     1 BS1     <NA>     18    86    13    13     2     1     0    11     1     0     0     0    NA    NA    NA    NA    NA
## # ... with 102,806 more rows

2.2 Analysis with `dplyr`

Who has the highest career slugging percentage? Let’s save the data we figured out how to correctly read in as an R object and start analyzing it.

Select

data <- read_csv("data_sources/Batting.csv", col_types = cols(SF = col_double(), GIDP = col_double())) %>% 
  clean_names()

select allows you to keep only certain columns. While this isn’t always necessary, it can be helpful to make it easier to glance at the data and see what you’re interested in.

data %>% 
  select(player_id, year_id, h ,x2b, x3b, hr, ab, g)

## # A tibble: 102,816 x 8
##    player_id year_id     h   x2b   x3b    hr    ab     g
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 abercda01    1871     0     0     0     0     4     1
##  2 addybo01     1871    32     6     0     0   118    25
##  3 allisar01    1871    40     4     5     0   137    29
##  4 allisdo01    1871    44    10     2     2   133    27
##  5 ansonca01    1871    39    11     3     0   120    25
##  6 armstbo01    1871    11     2     1     0    49    12
##  7 barkeal01    1871     1     0     0     0     4     1
##  8 barnero01    1871    63    10     9     0   157    31
##  9 barrebi01    1871     1     1     0     0     5     1
## 10 barrofr01    1871    13     2     1     0    86    18
## # ... with 102,806 more rows

There are a number of helper functions that can be used in select to make it easier to “search” for columns to pick.

In addition, a minus sign can be used to deselect columns.

data %>%
  select(contains("id"))

## # A tibble: 102,816 x 5
##    player_id year_id team_id lg_id  gidp
##    <chr>       <dbl> <chr>   <chr> <dbl>
##  1 abercda01    1871 TRO     <NA>     NA
##  2 addybo01     1871 RC1     <NA>     NA
##  3 allisar01    1871 CL1     <NA>     NA
##  4 allisdo01    1871 WS3     <NA>     NA
##  5 ansonca01    1871 RC1     <NA>     NA
##  6 armstbo01    1871 FW1     <NA>     NA
##  7 barkeal01    1871 RC1     <NA>     NA
##  8 barnero01    1871 BS1     <NA>     NA
##  9 barrebi01    1871 FW1     <NA>     NA
## 10 barrofr01    1871 BS1     <NA>     NA
## # ... with 102,806 more rows

data %>%
  select(ends_with("_id"), -lg_id)

## # A tibble: 102,816 x 3
##    player_id year_id team_id
##    <chr>       <dbl> <chr>  
##  1 abercda01    1871 TRO    
##  2 addybo01     1871 RC1    
##  3 allisar01    1871 CL1    
##  4 allisdo01    1871 WS3    
##  5 ansonca01    1871 RC1    
##  6 armstbo01    1871 FW1    
##  7 barkeal01    1871 RC1    
##  8 barnero01    1871 BS1    
##  9 barrebi01    1871 FW1    
## 10 barrofr01    1871 BS1    
## # ... with 102,806 more rows

Group_by

One immediate challenge we can see in the data is that we have statistics for each year of each player’s career. To get career stats, we’ll need to add up all rows of data for each player.

We can do this with the combination of group_by and summarize. The first step, group_by, doesn’t actually do anything to change the data - it essentially sets a flag on the dataframe that lets future dplyr functions know that they should operate on the data by group, instead of operating on the entire dataframe at once.

data %>% 
  select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>% 
  group_by(player_id)

## # A tibble: 102,816 x 8
## # Groups:   player_id [18,915]
##    player_id year_id     h   x2b   x3b    hr    ab     g
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 abercda01    1871     0     0     0     0     4     1
##  2 addybo01     1871    32     6     0     0   118    25
##  3 allisar01    1871    40     4     5     0   137    29
##  4 allisdo01    1871    44    10     2     2   133    27
##  5 ansonca01    1871    39    11     3     0   120    25
##  6 armstbo01    1871    11     2     1     0    49    12
##  7 barkeal01    1871     1     0     0     0     4     1
##  8 barnero01    1871    63    10     9     0   157    31
##  9 barrebi01    1871     1     1     0     0     5     1
## 10 barrofr01    1871    13     2     1     0    86    18
## # ... with 102,806 more rows

Summarize

Summarize runs the requested functions on the dataframe, by group, and returns a dataframe with one row per group and one column per specified summary. Here, we want to get each player’s career hits, doubles, triples, home runs, at-bats, and games, so that we can calculate slugging percentage, so we group by player_id and sum each of those columns.

data %>% 
  select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>% 
  group_by(player_id) %>% 
  summarize(h = sum(h),
            x2b = sum(x2b),
            x3b = sum(x3b),
            hr = sum(hr),
            ab = sum(ab),
            g = sum(g))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 18,915 x 7
##    player_id     h   x2b   x3b    hr    ab     g
##    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 aardsda01     0     0     0     0     4   331
##  2 aaronha01  3771   624    98   755 12364  3298
##  3 aaronto01   216    42     6    13   944   437
##  4 aasedo01      0     0     0     0     5   448
##  5 abadan01      2     0     0     0    21    15
##  6 abadfe01      1     0     0     0     9   315
##  7 abadijo01    11     0     0     0    49    12
##  8 abbated01   772    99    43    11  3044   855
##  9 abbeybe01    38     3     3     0   225    79
## 10 abbeych01   492    67    46    19  1751   451
## # ... with 18,905 more rows

Filter

We might only care about players with longer careers; players with very short but successful careers might bias our statistics.

Filter allows us to only keep rows of our dataframe that return TRUE for a given logical statement. Here, we only keep rows (players) with more than 1000 career games. We filter after summarizing, so this refers to each player’s career stats; filtering before summarizing, here, would remove all of our data, unless a player somehow figured out how to play 1000 games in a single season!

data %>% 
  select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>% 
  group_by(player_id) %>% 
  summarize(h = sum(h), 
            x2b = sum(x2b),
            x3b = sum(x3b),
            hr = sum(hr),
            ab = sum(ab),
            g = sum(g)) %>% 
  filter(g > 1000)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 1,564 x 7
##    player_id     h   x2b   x3b    hr    ab     g
##    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 aaronha01  3771   624    98   755 12364  3298
##  2 abreubo01  2470   574    59   288  8480  2425
##  3 adairje01  1022   163    19    57  4019  1165
##  4 adamsbo03  1082   188    49    37  4019  1281
##  5 adamssp01  1588   249    48     9  5557  1424
##  6 adcocjo01  1832   295    35   336  6606  1959
##  7 ageeto01    999   170    27   130  3912  1129
##  8 ainsmed01   707   108    54    22  3048  1078
##  9 alfoned01  1532   282    18   146  5385  1506
## 10 alicelu01  1031   189    53    47  3971  1341
## # ... with 1,554 more rows

Mutate

Mutate allows us to create or modify columns of our dataframe. Here, we use it to calculate slugging percentage for each row (player). We can refer to other columns with bare column names and use them in our calculations.

data %>% 
  select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>% 
  group_by(player_id) %>% 
  summarize(h = sum(h), 
            x2b = sum(x2b),
            x3b = sum(x3b),
            hr = sum(hr),
            ab = sum(ab),
            g = sum(g)) %>% 
  filter(g > 1000) %>% 
  mutate(slg = (h + x2b + 2 * x3b + 3 * hr) / ab)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 1,564 x 8
##    player_id     h   x2b   x3b    hr    ab     g   slg
##    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 aaronha01  3771   624    98   755 12364  3298 0.555
##  2 abreubo01  2470   574    59   288  8480  2425 0.475
##  3 adairje01  1022   163    19    57  4019  1165 0.347
##  4 adamsbo03  1082   188    49    37  4019  1281 0.368
##  5 adamssp01  1588   249    48     9  5557  1424 0.353
##  6 adcocjo01  1832   295    35   336  6606  1959 0.485
##  7 ageeto01    999   170    27   130  3912  1129 0.412
##  8 ainsmed01   707   108    54    22  3048  1078 0.324
##  9 alfoned01  1532   282    18   146  5385  1506 0.425
## 10 alicelu01  1031   189    53    47  3971  1341 0.369
## # ... with 1,554 more rows

To modify an existing column, just set the column name for the output of mutate to an existing column name. For instance, to round our new slg column to the conventional three digits:

data %>% 
  select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>% 
  group_by(player_id) %>% 
  summarize(h = sum(h), 
            x2b = sum(x2b),
            x3b = sum(x3b),
            hr = sum(hr),
            ab = sum(ab),
            g = sum(g)) %>% 
  filter(g > 1000) %>% 
  mutate(slg = (h + x2b + 2 * x3b + 3 * hr) / ab) %>%
  mutate(slg = round(slg, 3))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 1,564 x 8
##    player_id     h   x2b   x3b    hr    ab     g   slg
##    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 aaronha01  3771   624    98   755 12364  3298 0.555
##  2 abreubo01  2470   574    59   288  8480  2425 0.475
##  3 adairje01  1022   163    19    57  4019  1165 0.347
##  4 adamsbo03  1082   188    49    37  4019  1281 0.368
##  5 adamssp01  1588   249    48     9  5557  1424 0.353
##  6 adcocjo01  1832   295    35   336  6606  1959 0.485
##  7 ageeto01    999   170    27   130  3912  1129 0.412
##  8 ainsmed01   707   108    54    22  3048  1078 0.324
##  9 alfoned01  1532   282    18   146  5385  1506 0.425
## 10 alicelu01  1031   189    53    47  3971  1341 0.369
## # ... with 1,554 more rows

Arrange

arrange sorts dataframes by one or more columns. It sorts in ascending order by default; to sort descending, wrap a column name in desc().

slgdata <- data %>% 
  select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>% 
  group_by(player_id) %>% 
  summarize(h = sum(h), 
            x2b = sum(x2b),
            x3b = sum(x3b),
            hr = sum(hr),
            ab = sum(ab),
            g = sum(g)) %>% 
  filter(g>1000) %>% 
  mutate(slg = (h + x2b + 2*x3b + 3*hr)/ab) %>% 
  mutate(slg = round(slg, 3)) %>%
  arrange(desc(slg))

## `summarise()` ungrouping output (override with `.groups` argument)

slgdata

## # A tibble: 1,564 x 8
##    player_id     h   x2b   x3b    hr    ab     g   slg
##    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ruthba01   2873   506   136   714  8398  2503 0.69 
##  2 willite01  2654   525    71   521  7706  2292 0.634
##  3 gehrilo01  2721   534   163   493  8001  2164 0.632
##  4 foxxji01   2646   458   125   534  8134  2317 0.609
##  5 bondsba01  2935   601    77   762  9847  2986 0.607
##  6 greenha01  1628   379    71   331  5193  1394 0.605
##  7 mcgwima01  1626   252     6   583  6187  1874 0.588
##  8 ramirma02  2574   547    20   555  8244  2302 0.585
##  9 dimagjo01  2214   389   131   361  6821  1736 0.579
## 10 hornsro01  2930   541   169   301  8173  2259 0.577
## # ... with 1,554 more rows

2.3 Joins

In many projects, you will not have all of the data you need contained in a single table. You can use dplyr’s SQL-style join functions to combine tables. This isn’t always intuitive at first, especially with large amounts of data joining on multiple columns, but it’s dramatically easier and quicker than tools like VLOOKUP once you’re used to it.

Player ID Data

Our existing player data only has player ids with partial names. However, we have a dataset with lots of additional player information.

player_info <- read_csv("data_sources/Master.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   birthYear = col_double(),
##   birthMonth = col_double(),
##   birthDay = col_double(),
##   deathYear = col_double(),
##   deathMonth = col_double(),
##   deathDay = col_double(),
##   weight = col_double(),
##   height = col_double(),
##   debut = col_date(format = ""),
##   finalGame = col_date(format = "")
## )

## See spec(...) for full column specifications.

player_info

## # A tibble: 19,105 x 24
##    playerID birthYear birthMonth birthDay birthCountry birthState birthCity deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst nameLast nameGiven weight
##    <chr>        <dbl>      <dbl>    <dbl> <chr>        <chr>      <chr>         <dbl>      <dbl>    <dbl> <chr>        <chr>      <chr>     <chr>     <chr>    <chr>      <dbl>
##  1 aardsda~      1981         12       27 USA          CO         Denver           NA         NA       NA <NA>         <NA>       <NA>      David     Aardsma  David Al~    215
##  2 aaronha~      1934          2        5 USA          AL         Mobile           NA         NA       NA <NA>         <NA>       <NA>      Hank      Aaron    Henry Lo~    180
##  3 aaronto~      1939          8        5 USA          AL         Mobile         1984          8       16 USA          GA         Atlanta   Tommie    Aaron    Tommie L~    190
##  4 aasedo01      1954          9        8 USA          CA         Orange           NA         NA       NA <NA>         <NA>       <NA>      Don       Aase     Donald W~    190
##  5 abadan01      1972          8       25 USA          FL         Palm Bea~        NA         NA       NA <NA>         <NA>       <NA>      Andy      Abad     Fausto A~    184
##  6 abadfe01      1985         12       17 D.R.         La Romana  La Romana        NA         NA       NA <NA>         <NA>       <NA>      Fernando  Abad     Fernando~    220
##  7 abadijo~      1850         11        4 USA          PA         Philadel~      1905          5       17 USA          NJ         Pemberton John      Abadie   John W.      192
##  8 abbated~      1877          4       15 USA          PA         Latrobe        1957          1        6 USA          FL         Fort Lau~ Ed        Abbatic~ Edward J~    170
##  9 abbeybe~      1869         11       11 USA          VT         Essex          1962          6       11 USA          VT         Colchest~ Bert      Abbey    Bert Wood    175
## 10 abbeych~      1866         10       14 USA          NE         Falls Ci~      1926          4       27 USA          CA         San Fran~ Charlie   Abbey    Charles ~    169
## # ... with 19,095 more rows, and 7 more variables: height <dbl>, bats <chr>, throws <chr>, debut <date>, finalGame <date>, retroID <chr>, bbrefID <chr>

Let’s create a single player column from the first and last name columns, and drop everything else besides our player id and name columns.

player_names <- player_info %>% 
  select(playerID, nameFirst, nameLast) %>% 
  mutate(player = str_c(nameFirst, " ", nameLast)) %>% 
  select(-starts_with("name"))

player_names

## # A tibble: 19,105 x 2
##    playerID  player        
##    <chr>     <chr>         
##  1 aardsda01 David Aardsma 
##  2 aaronha01 Hank Aaron    
##  3 aaronto01 Tommie Aaron  
##  4 aasedo01  Don Aase      
##  5 abadan01  Andy Abad     
##  6 abadfe01  Fernando Abad 
##  7 abadijo01 John Abadie   
##  8 abbated01 Ed Abbaticchio
##  9 abbeybe01 Bert Abbey    
## 10 abbeych01 Charlie Abbey 
## # ... with 19,095 more rows

Join the Data

A “left join” on dataframes A and B keeps all of table A and adds columns from table B, matching up rows based on one or more specified joining columns. Here, we want to add the player names from B, matching by player id. However, the player id columns in the two dataframes are not spelled exactly the same way, so we have to tell left_join precisely which two columns to match up between dataframes.

slgdata %>% 
  left_join(player_names, by = c("player_id" = "playerID"))

## # A tibble: 1,564 x 9
##    player_id     h   x2b   x3b    hr    ab     g   slg player        
##    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>         
##  1 ruthba01   2873   506   136   714  8398  2503 0.69  Babe Ruth     
##  2 willite01  2654   525    71   521  7706  2292 0.634 Ted Williams  
##  3 gehrilo01  2721   534   163   493  8001  2164 0.632 Lou Gehrig    
##  4 foxxji01   2646   458   125   534  8134  2317 0.609 Jimmie Foxx   
##  5 bondsba01  2935   601    77   762  9847  2986 0.607 Barry Bonds   
##  6 greenha01  1628   379    71   331  5193  1394 0.605 Hank Greenberg
##  7 mcgwima01  1626   252     6   583  6187  1874 0.588 Mark McGwire  
##  8 ramirma02  2574   547    20   555  8244  2302 0.585 Manny Ramirez 
##  9 dimagjo01  2214   389   131   361  6821  1736 0.579 Joe DiMaggio  
## 10 hornsro01  2930   541   169   301  8173  2259 0.577 Rogers Hornsby
## # ... with 1,554 more rows

slgname <- slgdata %>% 
  left_join(player_names, by = c("player_id" = "playerID"))

Rename Columns

rename makes it easy to rename columns without changing anything else in a dataframe. It uses the format newname = oldname.

slgname %>% 
  rename(doubles = x2b, triples = x3b)

## # A tibble: 1,564 x 9
##    player_id     h doubles triples    hr    ab     g   slg player        
##    <chr>     <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>         
##  1 ruthba01   2873     506     136   714  8398  2503 0.69  Babe Ruth     
##  2 willite01  2654     525      71   521  7706  2292 0.634 Ted Williams  
##  3 gehrilo01  2721     534     163   493  8001  2164 0.632 Lou Gehrig    
##  4 foxxji01   2646     458     125   534  8134  2317 0.609 Jimmie Foxx   
##  5 bondsba01  2935     601      77   762  9847  2986 0.607 Barry Bonds   
##  6 greenha01  1628     379      71   331  5193  1394 0.605 Hank Greenberg
##  7 mcgwima01  1626     252       6   583  6187  1874 0.588 Mark McGwire  
##  8 ramirma02  2574     547      20   555  8244  2302 0.585 Manny Ramirez 
##  9 dimagjo01  2214     389     131   361  6821  1736 0.579 Joe DiMaggio  
## 10 hornsro01  2930     541     169   301  8173  2259 0.577 Rogers Hornsby
## # ... with 1,554 more rows

slgname <- slgname %>% 
  rename(doubles = x2b, triples = x3b)

Reorder Columns

slgname %>% 
  select(player_id, player, everything())

## # A tibble: 1,564 x 9
##    player_id player             h doubles triples    hr    ab     g   slg
##    <chr>     <chr>          <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ruthba01  Babe Ruth       2873     506     136   714  8398  2503 0.69 
##  2 willite01 Ted Williams    2654     525      71   521  7706  2292 0.634
##  3 gehrilo01 Lou Gehrig      2721     534     163   493  8001  2164 0.632
##  4 foxxji01  Jimmie Foxx     2646     458     125   534  8134  2317 0.609
##  5 bondsba01 Barry Bonds     2935     601      77   762  9847  2986 0.607
##  6 greenha01 Hank Greenberg  1628     379      71   331  5193  1394 0.605
##  7 mcgwima01 Mark McGwire    1626     252       6   583  6187  1874 0.588
##  8 ramirma02 Manny Ramirez   2574     547      20   555  8244  2302 0.585
##  9 dimagjo01 Joe DiMaggio    2214     389     131   361  6821  1736 0.579
## 10 hornsro01 Rogers Hornsby  2930     541     169   301  8173  2259 0.577
## # ... with 1,554 more rows

As of dplyr 1.0.0, there is a new relocate function that moves the specified columns to the front by default.

slgname %>% 
  dplyr::relocate(contains("player"))

## # A tibble: 1,564 x 9
##    player_id player             h doubles triples    hr    ab     g   slg
##    <chr>     <chr>          <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ruthba01  Babe Ruth       2873     506     136   714  8398  2503 0.69 
##  2 willite01 Ted Williams    2654     525      71   521  7706  2292 0.634
##  3 gehrilo01 Lou Gehrig      2721     534     163   493  8001  2164 0.632
##  4 foxxji01  Jimmie Foxx     2646     458     125   534  8134  2317 0.609
##  5 bondsba01 Barry Bonds     2935     601      77   762  9847  2986 0.607
##  6 greenha01 Hank Greenberg  1628     379      71   331  5193  1394 0.605
##  7 mcgwima01 Mark McGwire    1626     252       6   583  6187  1874 0.588
##  8 ramirma02 Manny Ramirez   2574     547      20   555  8244  2302 0.585
##  9 dimagjo01 Joe DiMaggio    2214     389     131   361  6821  1736 0.579
## 10 hornsro01 Rogers Hornsby  2930     541     169   301  8173  2259 0.577
## # ... with 1,554 more rows

We could also then move the slg column so that it’s right after the player columns.

slgname %>% 
  dplyr::relocate(contains("player")) %>%
  dplyr::relocate(slg, .after = player)

## # A tibble: 1,564 x 9
##    player_id player           slg     h doubles triples    hr    ab     g
##    <chr>     <chr>          <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
##  1 ruthba01  Babe Ruth      0.69   2873     506     136   714  8398  2503
##  2 willite01 Ted Williams   0.634  2654     525      71   521  7706  2292
##  3 gehrilo01 Lou Gehrig     0.632  2721     534     163   493  8001  2164
##  4 foxxji01  Jimmie Foxx    0.609  2646     458     125   534  8134  2317
##  5 bondsba01 Barry Bonds    0.607  2935     601      77   762  9847  2986
##  6 greenha01 Hank Greenberg 0.605  1628     379      71   331  5193  1394
##  7 mcgwima01 Mark McGwire   0.588  1626     252       6   583  6187  1874
##  8 ramirma02 Manny Ramirez  0.585  2574     547      20   555  8244  2302
##  9 dimagjo01 Joe DiMaggio   0.579  2214     389     131   361  6821  1736
## 10 hornsro01 Rogers Hornsby 0.577  2930     541     169   301  8173  2259
## # ... with 1,554 more rows

2.4 Other dplyr Tricks

Count

count creates tidy-format frequency tables for a given combination of columns. It’s a shortcut for group_by(cols) %>% summarize(n = n()).

data %>% 
  count(player_id)

## # A tibble: 18,915 x 2
##    player_id     n
##    <chr>     <int>
##  1 aardsda01     9
##  2 aaronha01    23
##  3 aaronto01     7
##  4 aasedo01     13
##  5 abadan01      3
##  6 abadfe01      8
##  7 abadijo01     2
##  8 abbated01    10
##  9 abbeybe01     6
## 10 abbeych01     5
## # ... with 18,905 more rows

data %>% 
  count(player_id, sort = TRUE)

## # A tibble: 18,915 x 2
##    player_id     n
##    <chr>     <int>
##  1 mcguide01    31
##  2 henderi01    29
##  3 newsobo01    29
##  4 johnto01     28
##  5 kaatji01     28
##  6 ansonca01    27
##  7 baineha01    27
##  8 carltst01    27
##  9 moyerja01    27
## 10 ryanno01     27
## # ... with 18,905 more rows

2.4.1 Grouped Mutate

group_by doesn’t just allow you to summarize - it also allows you to do mutate calculations within each group. For example, to calculate the proportion of their career home runs that each player hit during each year of their career, we could do the following. The key here is that the sum inside the mutate only sums up home runs within each group.

prop_hr <- data %>%
  select(player_id, year_id, hr) %>%
  group_by(player_id) %>%
  mutate(prop_career_hr = hr/sum(hr))

prop_hr %>%
  filter(player_id == "ruthba01")

## # A tibble: 22 x 4
## # Groups:   player_id [1]
##    player_id year_id    hr prop_career_hr
##    <chr>       <dbl> <dbl>          <dbl>
##  1 ruthba01     1914     0        0      
##  2 ruthba01     1915     4        0.00560
##  3 ruthba01     1916     3        0.00420
##  4 ruthba01     1917     2        0.00280
##  5 ruthba01     1918    11        0.0154 
##  6 ruthba01     1919    29        0.0406 
##  7 ruthba01     1920    54        0.0756 
##  8 ruthba01     1921    59        0.0826 
##  9 ruthba01     1922    35        0.0490 
## 10 ruthba01     1923    41        0.0574 
## # ... with 12 more rows

Conditional verb variants

dplyr provides conditional variants of many of the main verbs, ending in _at, _if, and _all. These are most often used with summarize, mutate, or rename to operate on multiple columns at once.

The rather wordy summarize that we used earlier can be replaced with a summarize_at that specifies the columns to be summarized and one or more functions to apply to each columns. Anything that would work in select() can be wrapped in vars() to select columns, and function names can be bare, base-R-style anonymous functions, or purrr-style formula notation, which will be covered in section 4.

slg_summary <- data %>% 
  group_by(player_id) %>% 
  summarize(h = sum(h), 
            x2b = sum(x2b),
            x3b = sum(x3b),
            hr = sum(hr),
            ab = sum(ab),
            g = sum(g))

## `summarise()` ungrouping output (override with `.groups` argument)

head(slg_summary)

## # A tibble: 6 x 7
##   player_id     h   x2b   x3b    hr    ab     g
##   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aardsda01     0     0     0     0     4   331
## 2 aaronha01  3771   624    98   755 12364  3298
## 3 aaronto01   216    42     6    13   944   437
## 4 aasedo01      0     0     0     0     5   448
## 5 abadan01      2     0     0     0    21    15
## 6 abadfe01      1     0     0     0     9   315

# single function with no additional arguments: bare function name
concise_slg_summary <- data %>% 
  group_by(player_id) %>% 
  summarize_at(.vars = vars(h, x2b, x3b, hr, ab, g), .funs = sum)

concise_slg_summary

## # A tibble: 18,915 x 7
##    player_id     h   x2b   x3b    hr    ab     g
##    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 aardsda01     0     0     0     0     4   331
##  2 aaronha01  3771   624    98   755 12364  3298
##  3 aaronto01   216    42     6    13   944   437
##  4 aasedo01      0     0     0     0     5   448
##  5 abadan01      2     0     0     0    21    15
##  6 abadfe01      1     0     0     0     9   315
##  7 abadijo01    11     0     0     0    49    12
##  8 abbated01   772    99    43    11  3044   855
##  9 abbeybe01    38     3     3     0   225    79
## 10 abbeych01   492    67    46    19  1751   451
## # ... with 18,905 more rows

Passing a named list of functions as the .funs argument allows you to run multiple summary functions on each selected column and name the resulting summary columns.

multi_summary <- data %>% 
  group_by(player_id) %>% 
  summarize_at(.vars = vars(h, x2b, x3b, hr, ab, g), .funs = list("sum" = sum, "sd" = sd))

multi_summary

## # A tibble: 18,915 x 13
##    player_id h_sum x2b_sum x3b_sum hr_sum ab_sum g_sum   h_sd x2b_sd x3b_sd hr_sd   ab_sd  g_sd
##    <chr>     <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl>   <dbl> <dbl>
##  1 aardsda01     0       0       0      0      4   331  0      0      0      0      0.726 22.0 
##  2 aaronha01  3771     624      98    755  12364  3298 40.6   10.1    3.84  11.2   98.4   18.5 
##  3 aaronto01   216      42       6     13    944   437 29.7    7.02   1.21   2.79 125.    45.5 
##  4 aasedo01      0       0       0      0      5   448  0      0      0      0      1.39  16.2 
##  5 abadan01      2       0       0      0     21    15  1.15   0      0      0      8.72   4   
##  6 abadfe01      1       0       0      0      9   315  0.354  0      0      0      2.42  18.0 
##  7 abadijo01    11       0       0      0     49    12  6.36   0      0      0     29.0    7.07
##  8 abbated01   772      99      43     11   3044   855 65.7    9.29   4.55   1.20 250.    66.7 
##  9 abbeybe01    38       3       3      0    225    79  3.78   0.548  0.837  0     27.3   10.5 
## 10 abbeych01   492      67      46     19   1751   451 53.8    8.88   5.40   3.56 170.    41.8 
## # ... with 18,905 more rows

The _if variant allows the user to run a dplyr operation on all columns for which some function returns TRUE. For instance, to add the string "_numeric" to the names of all numeric columns, we could use rename_if. We’ll run this on a subset of columns to simplify the display.

numeric_rename <- data %>%
  select(player_id, year_id, lg_id, g, ab) %>%
  rename_if(is.numeric, ~str_c(.x, "_numeric"))

numeric_rename

## # A tibble: 102,816 x 5
##    player_id year_id_numeric lg_id g_numeric ab_numeric
##    <chr>               <dbl> <chr>     <dbl>      <dbl>
##  1 abercda01            1871 <NA>          1          4
##  2 addybo01             1871 <NA>         25        118
##  3 allisar01            1871 <NA>         29        137
##  4 allisdo01            1871 <NA>         27        133
##  5 ansonca01            1871 <NA>         25        120
##  6 armstbo01            1871 <NA>         12         49
##  7 barkeal01            1871 <NA>          1          4
##  8 barnero01            1871 <NA>         31        157
##  9 barrebi01            1871 <NA>          1          5
## 10 barrofr01            1871 <NA>         18         86
## # ... with 102,806 more rows

The _all variant simply runs the same operation on all columns. We’ll select down to a set of numeric columns, and then multiply all of them by two.

data %>%
  select(hr, ab, g) %>%
  mutate_all(~ .x * 2)

## # A tibble: 102,816 x 3
##       hr    ab     g
##    <dbl> <dbl> <dbl>
##  1     0     8     2
##  2     0   236    50
##  3     0   274    58
##  4     4   266    54
##  5     0   240    50
##  6     0    98    24
##  7     0     8     2
##  8     0   314    62
##  9     0    10     2
## 10     0   172    36
## # ... with 102,806 more rows

2.4.2 `across()`

The newest version of dplyr (1.0.0) introduced a new, arguably simpler way to achieve the same goals as these conditional function variants: the across function. across can be used with many of the dplyr verbs to operate on a specific subset of columns, specified with the same language as select. For example, across could be used in similar ways as the conditional variants above:

multi_summary_across <- data %>% 
  group_by(player_id) %>% 
  summarize(across(c(h, x2b, x3b, hr, ab, g), .funs = list("sum" = sum, "sd" = sd)))

## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)

multi_summary_across

## # A tibble: 102,816 x 7
## # Groups:   player_id [18,915]
##    player_id     h   x2b   x3b    hr    ab     g
##    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 aardsda01     0     0     0     0     0    11
##  2 aardsda01     0     0     0     0     2    45
##  3 aardsda01     0     0     0     0     0    25
##  4 aardsda01     0     0     0     0     1    47
##  5 aardsda01     0     0     0     0     0    73
##  6 aardsda01     0     0     0     0     0    53
##  7 aardsda01     0     0     0     0     0     1
##  8 aardsda01     0     0     0     0     0    43
##  9 aardsda01     0     0     0     0     1    33
## 10 aaronha01   131    27     6    13   468   122
## # ... with 102,806 more rows

# multiply all selected numeric columns by 2
mutate_numeric_across <- data %>%
  select(player_id, hr, ab, g) %>%
  mutate(across(where(is.numeric), ~ .x * 2))

mutate_numeric_across

## # A tibble: 102,816 x 4
##    player_id    hr    ab     g
##    <chr>     <dbl> <dbl> <dbl>
##  1 abercda01     0     8     2
##  2 addybo01      0   236    50
##  3 allisar01     0   274    58
##  4 allisdo01     4   266    54
##  5 ansonca01     0   240    50
##  6 armstbo01     0    98    24
##  7 barkeal01     0     8     2
##  8 barnero01     0   314    62
##  9 barrebi01     0    10     2
## 10 barrofr01     0   172    36
## # ... with 102,806 more rows

However, across is very new and occasionally has serious performance issues, so it is not used above. More information can be found (here)[https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/];

2.4.3 Write Data

We will write this file to a folder so we can access it in later lessons / chapters.

slgname %>% 
  write_csv("data_sources/saved_data/slgname.csv")