2 Data Manipulation
In this section, you will learn:
- Some nuances of reading in data.
- The basics of
dplyr
to manipulate data.
select
mutate
filter
group_by
summarize
arrange
- A few more advanced
dplyr
concepts. - How to do ‘joins’.
In this section, we will use the following libraries:
2.1 Read in Data
## Parsed with column specification:
## cols(
## .default = col_double(),
## playerID = col_character(),
## teamID = col_character(),
## lgID = col_character(),
## SF = col_logical(),
## GIDP = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 45441 parsing failures.
## row col expected actual file
## 25015 GIDP 1/0/T/F/TRUE/FALSE 2 'data_sources/Batting.csv'
## 25016 GIDP 1/0/T/F/TRUE/FALSE 10 'data_sources/Batting.csv'
## 25018 GIDP 1/0/T/F/TRUE/FALSE 4 'data_sources/Batting.csv'
## 25028 GIDP 1/0/T/F/TRUE/FALSE 8 'data_sources/Batting.csv'
## 25030 GIDP 1/0/T/F/TRUE/FALSE 3 'data_sources/Batting.csv'
## ..... .... .................. ...... ..........................
## See problems(...) for more details.
## # A tibble: 102,816 x 22
## playerID yearID stint teamID lgID G AB R H `2B` `3B` HR RBI SB CS BB SO IBB HBP SH SF GIDP
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 abercda01 1871 1 TRO <NA> 1 4 0 0 0 0 0 0 0 0 0 0 NA NA NA NA NA
## 2 addybo01 1871 1 RC1 <NA> 25 118 30 32 6 0 0 13 8 1 4 0 NA NA NA NA NA
## 3 allisar01 1871 1 CL1 <NA> 29 137 28 40 4 5 0 19 3 1 2 5 NA NA NA NA NA
## 4 allisdo01 1871 1 WS3 <NA> 27 133 28 44 10 2 2 27 1 1 0 2 NA NA NA NA NA
## 5 ansonca01 1871 1 RC1 <NA> 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
## 6 armstbo01 1871 1 FW1 <NA> 12 49 9 11 2 1 0 5 0 1 0 1 NA NA NA NA NA
## 7 barkeal01 1871 1 RC1 <NA> 1 4 0 1 0 0 0 2 0 0 1 0 NA NA NA NA NA
## 8 barnero01 1871 1 BS1 <NA> 31 157 66 63 10 9 0 34 11 6 13 1 NA NA NA NA NA
## 9 barrebi01 1871 1 FW1 <NA> 1 5 1 1 1 0 0 1 0 0 0 0 NA NA NA NA NA
## 10 barrofr01 1871 1 BS1 <NA> 18 86 13 13 2 1 0 11 1 0 0 0 NA NA NA NA NA
## # ... with 102,806 more rows
Fix Read In Errors
read_csv
Looks at the first 1000 rows of data to guess column types, so it often makes mistakes if those 1000 rows are empty for a specific column. We can specify the types of certain columns, or tell it to look at more rows before guessing data types.
## # A tibble: 102,816 x 22
## playerID yearID stint teamID lgID G AB R H `2B` `3B` HR RBI SB CS BB SO IBB HBP SH SF GIDP
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 abercda01 1871 1 TRO <NA> 1 4 0 0 0 0 0 0 0 0 0 0 NA NA NA NA NA
## 2 addybo01 1871 1 RC1 <NA> 25 118 30 32 6 0 0 13 8 1 4 0 NA NA NA NA NA
## 3 allisar01 1871 1 CL1 <NA> 29 137 28 40 4 5 0 19 3 1 2 5 NA NA NA NA NA
## 4 allisdo01 1871 1 WS3 <NA> 27 133 28 44 10 2 2 27 1 1 0 2 NA NA NA NA NA
## 5 ansonca01 1871 1 RC1 <NA> 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
## 6 armstbo01 1871 1 FW1 <NA> 12 49 9 11 2 1 0 5 0 1 0 1 NA NA NA NA NA
## 7 barkeal01 1871 1 RC1 <NA> 1 4 0 1 0 0 0 2 0 0 1 0 NA NA NA NA NA
## 8 barnero01 1871 1 BS1 <NA> 31 157 66 63 10 9 0 34 11 6 13 1 NA NA NA NA NA
## 9 barrebi01 1871 1 FW1 <NA> 1 5 1 1 1 0 0 1 0 0 0 0 NA NA NA NA NA
## 10 barrofr01 1871 1 BS1 <NA> 18 86 13 13 2 1 0 11 1 0 0 0 NA NA NA NA NA
## # ... with 102,806 more rows
## Parsed with column specification:
## cols(
## .default = col_double(),
## playerID = col_character(),
## teamID = col_character(),
## lgID = col_character()
## )
## See spec(...) for full column specifications.
## # A tibble: 102,816 x 22
## playerID yearID stint teamID lgID G AB R H `2B` `3B` HR RBI SB CS BB SO IBB HBP SH SF GIDP
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 abercda01 1871 1 TRO <NA> 1 4 0 0 0 0 0 0 0 0 0 0 NA NA NA NA NA
## 2 addybo01 1871 1 RC1 <NA> 25 118 30 32 6 0 0 13 8 1 4 0 NA NA NA NA NA
## 3 allisar01 1871 1 CL1 <NA> 29 137 28 40 4 5 0 19 3 1 2 5 NA NA NA NA NA
## 4 allisdo01 1871 1 WS3 <NA> 27 133 28 44 10 2 2 27 1 1 0 2 NA NA NA NA NA
## 5 ansonca01 1871 1 RC1 <NA> 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
## 6 armstbo01 1871 1 FW1 <NA> 12 49 9 11 2 1 0 5 0 1 0 1 NA NA NA NA NA
## 7 barkeal01 1871 1 RC1 <NA> 1 4 0 1 0 0 0 2 0 0 1 0 NA NA NA NA NA
## 8 barnero01 1871 1 BS1 <NA> 31 157 66 63 10 9 0 34 11 6 13 1 NA NA NA NA NA
## 9 barrebi01 1871 1 FW1 <NA> 1 5 1 1 1 0 0 1 0 0 0 0 NA NA NA NA NA
## 10 barrofr01 1871 1 BS1 <NA> 18 86 13 13 2 1 0 11 1 0 0 0 NA NA NA NA NA
## # ... with 102,806 more rows
Clean Column Names
clean_names
provides some helpful processing, such as making column names lowercase and replacing spaces and periods with underscores.
read_csv("data_sources/Batting.csv", col_types = cols(SF = col_double(), GIDP = col_double())) %>%
clean_names()
## # A tibble: 102,816 x 22
## player_id year_id stint team_id lg_id g ab r h x2b x3b hr rbi sb cs bb so ibb hbp sh sf gidp
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 abercda01 1871 1 TRO <NA> 1 4 0 0 0 0 0 0 0 0 0 0 NA NA NA NA NA
## 2 addybo01 1871 1 RC1 <NA> 25 118 30 32 6 0 0 13 8 1 4 0 NA NA NA NA NA
## 3 allisar01 1871 1 CL1 <NA> 29 137 28 40 4 5 0 19 3 1 2 5 NA NA NA NA NA
## 4 allisdo01 1871 1 WS3 <NA> 27 133 28 44 10 2 2 27 1 1 0 2 NA NA NA NA NA
## 5 ansonca01 1871 1 RC1 <NA> 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
## 6 armstbo01 1871 1 FW1 <NA> 12 49 9 11 2 1 0 5 0 1 0 1 NA NA NA NA NA
## 7 barkeal01 1871 1 RC1 <NA> 1 4 0 1 0 0 0 2 0 0 1 0 NA NA NA NA NA
## 8 barnero01 1871 1 BS1 <NA> 31 157 66 63 10 9 0 34 11 6 13 1 NA NA NA NA NA
## 9 barrebi01 1871 1 FW1 <NA> 1 5 1 1 1 0 0 1 0 0 0 0 NA NA NA NA NA
## 10 barrofr01 1871 1 BS1 <NA> 18 86 13 13 2 1 0 11 1 0 0 0 NA NA NA NA NA
## # ... with 102,806 more rows
2.2 Analysis with dplyr
Who has the highest career slugging percentage? Let’s save the data we figured out how to correctly read in as an R object and start analyzing it.
Select
data <- read_csv("data_sources/Batting.csv", col_types = cols(SF = col_double(), GIDP = col_double())) %>%
clean_names()
select
allows you to keep only certain columns. While this isn’t always necessary, it can be helpful to make it easier to glance at the data and see what you’re interested in.
## # A tibble: 102,816 x 8
## player_id year_id h x2b x3b hr ab g
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 abercda01 1871 0 0 0 0 4 1
## 2 addybo01 1871 32 6 0 0 118 25
## 3 allisar01 1871 40 4 5 0 137 29
## 4 allisdo01 1871 44 10 2 2 133 27
## 5 ansonca01 1871 39 11 3 0 120 25
## 6 armstbo01 1871 11 2 1 0 49 12
## 7 barkeal01 1871 1 0 0 0 4 1
## 8 barnero01 1871 63 10 9 0 157 31
## 9 barrebi01 1871 1 1 0 0 5 1
## 10 barrofr01 1871 13 2 1 0 86 18
## # ... with 102,806 more rows
There are a number of helper functions that can be used in select
to make it easier to “search” for columns to pick.
In addition, a minus sign can be used to deselect columns.
## # A tibble: 102,816 x 5
## player_id year_id team_id lg_id gidp
## <chr> <dbl> <chr> <chr> <dbl>
## 1 abercda01 1871 TRO <NA> NA
## 2 addybo01 1871 RC1 <NA> NA
## 3 allisar01 1871 CL1 <NA> NA
## 4 allisdo01 1871 WS3 <NA> NA
## 5 ansonca01 1871 RC1 <NA> NA
## 6 armstbo01 1871 FW1 <NA> NA
## 7 barkeal01 1871 RC1 <NA> NA
## 8 barnero01 1871 BS1 <NA> NA
## 9 barrebi01 1871 FW1 <NA> NA
## 10 barrofr01 1871 BS1 <NA> NA
## # ... with 102,806 more rows
## # A tibble: 102,816 x 3
## player_id year_id team_id
## <chr> <dbl> <chr>
## 1 abercda01 1871 TRO
## 2 addybo01 1871 RC1
## 3 allisar01 1871 CL1
## 4 allisdo01 1871 WS3
## 5 ansonca01 1871 RC1
## 6 armstbo01 1871 FW1
## 7 barkeal01 1871 RC1
## 8 barnero01 1871 BS1
## 9 barrebi01 1871 FW1
## 10 barrofr01 1871 BS1
## # ... with 102,806 more rows
Group_by
One immediate challenge we can see in the data is that we have statistics for each year of each player’s career. To get career stats, we’ll need to add up all rows of data for each player.
We can do this with the combination of group_by
and summarize
. The first step, group_by
, doesn’t actually do anything to change the data - it essentially sets a flag on the dataframe that lets future dplyr
functions know that they should operate on the data by group, instead of operating on the entire dataframe at once.
## # A tibble: 102,816 x 8
## # Groups: player_id [18,915]
## player_id year_id h x2b x3b hr ab g
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 abercda01 1871 0 0 0 0 4 1
## 2 addybo01 1871 32 6 0 0 118 25
## 3 allisar01 1871 40 4 5 0 137 29
## 4 allisdo01 1871 44 10 2 2 133 27
## 5 ansonca01 1871 39 11 3 0 120 25
## 6 armstbo01 1871 11 2 1 0 49 12
## 7 barkeal01 1871 1 0 0 0 4 1
## 8 barnero01 1871 63 10 9 0 157 31
## 9 barrebi01 1871 1 1 0 0 5 1
## 10 barrofr01 1871 13 2 1 0 86 18
## # ... with 102,806 more rows
Summarize
Summarize runs the requested functions on the dataframe, by group, and returns a dataframe with one row per group and one column per specified summary. Here, we want to get each player’s career hits, doubles, triples, home runs, at-bats, and games, so that we can calculate slugging percentage, so we group by player_id and sum each of those columns.
data %>%
select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>%
group_by(player_id) %>%
summarize(h = sum(h),
x2b = sum(x2b),
x3b = sum(x3b),
hr = sum(hr),
ab = sum(ab),
g = sum(g))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 18,915 x 7
## player_id h x2b x3b hr ab g
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aardsda01 0 0 0 0 4 331
## 2 aaronha01 3771 624 98 755 12364 3298
## 3 aaronto01 216 42 6 13 944 437
## 4 aasedo01 0 0 0 0 5 448
## 5 abadan01 2 0 0 0 21 15
## 6 abadfe01 1 0 0 0 9 315
## 7 abadijo01 11 0 0 0 49 12
## 8 abbated01 772 99 43 11 3044 855
## 9 abbeybe01 38 3 3 0 225 79
## 10 abbeych01 492 67 46 19 1751 451
## # ... with 18,905 more rows
Filter
We might only care about players with longer careers; players with very short but successful careers might bias our statistics.
Filter allows us to only keep rows of our dataframe that return TRUE for a given logical statement. Here, we only keep rows (players) with more than 1000 career games. We filter after summarizing, so this refers to each player’s career stats; filtering before summarizing, here, would remove all of our data, unless a player somehow figured out how to play 1000 games in a single season!
data %>%
select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>%
group_by(player_id) %>%
summarize(h = sum(h),
x2b = sum(x2b),
x3b = sum(x3b),
hr = sum(hr),
ab = sum(ab),
g = sum(g)) %>%
filter(g > 1000)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1,564 x 7
## player_id h x2b x3b hr ab g
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aaronha01 3771 624 98 755 12364 3298
## 2 abreubo01 2470 574 59 288 8480 2425
## 3 adairje01 1022 163 19 57 4019 1165
## 4 adamsbo03 1082 188 49 37 4019 1281
## 5 adamssp01 1588 249 48 9 5557 1424
## 6 adcocjo01 1832 295 35 336 6606 1959
## 7 ageeto01 999 170 27 130 3912 1129
## 8 ainsmed01 707 108 54 22 3048 1078
## 9 alfoned01 1532 282 18 146 5385 1506
## 10 alicelu01 1031 189 53 47 3971 1341
## # ... with 1,554 more rows
Mutate
Mutate allows us to create or modify columns of our dataframe. Here, we use it to calculate slugging percentage for each row (player). We can refer to other columns with bare column names and use them in our calculations.
data %>%
select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>%
group_by(player_id) %>%
summarize(h = sum(h),
x2b = sum(x2b),
x3b = sum(x3b),
hr = sum(hr),
ab = sum(ab),
g = sum(g)) %>%
filter(g > 1000) %>%
mutate(slg = (h + x2b + 2 * x3b + 3 * hr) / ab)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1,564 x 8
## player_id h x2b x3b hr ab g slg
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aaronha01 3771 624 98 755 12364 3298 0.555
## 2 abreubo01 2470 574 59 288 8480 2425 0.475
## 3 adairje01 1022 163 19 57 4019 1165 0.347
## 4 adamsbo03 1082 188 49 37 4019 1281 0.368
## 5 adamssp01 1588 249 48 9 5557 1424 0.353
## 6 adcocjo01 1832 295 35 336 6606 1959 0.485
## 7 ageeto01 999 170 27 130 3912 1129 0.412
## 8 ainsmed01 707 108 54 22 3048 1078 0.324
## 9 alfoned01 1532 282 18 146 5385 1506 0.425
## 10 alicelu01 1031 189 53 47 3971 1341 0.369
## # ... with 1,554 more rows
To modify an existing column, just set the column name for the output of mutate
to an existing column name. For instance, to round our new slg
column to the conventional three digits:
data %>%
select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>%
group_by(player_id) %>%
summarize(h = sum(h),
x2b = sum(x2b),
x3b = sum(x3b),
hr = sum(hr),
ab = sum(ab),
g = sum(g)) %>%
filter(g > 1000) %>%
mutate(slg = (h + x2b + 2 * x3b + 3 * hr) / ab) %>%
mutate(slg = round(slg, 3))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1,564 x 8
## player_id h x2b x3b hr ab g slg
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aaronha01 3771 624 98 755 12364 3298 0.555
## 2 abreubo01 2470 574 59 288 8480 2425 0.475
## 3 adairje01 1022 163 19 57 4019 1165 0.347
## 4 adamsbo03 1082 188 49 37 4019 1281 0.368
## 5 adamssp01 1588 249 48 9 5557 1424 0.353
## 6 adcocjo01 1832 295 35 336 6606 1959 0.485
## 7 ageeto01 999 170 27 130 3912 1129 0.412
## 8 ainsmed01 707 108 54 22 3048 1078 0.324
## 9 alfoned01 1532 282 18 146 5385 1506 0.425
## 10 alicelu01 1031 189 53 47 3971 1341 0.369
## # ... with 1,554 more rows
Arrange
arrange
sorts dataframes by one or more columns. It sorts in ascending order by default; to sort descending, wrap a column name in desc()
.
slgdata <- data %>%
select(player_id, year_id, h ,x2b, x3b, hr, ab, g) %>%
group_by(player_id) %>%
summarize(h = sum(h),
x2b = sum(x2b),
x3b = sum(x3b),
hr = sum(hr),
ab = sum(ab),
g = sum(g)) %>%
filter(g>1000) %>%
mutate(slg = (h + x2b + 2*x3b + 3*hr)/ab) %>%
mutate(slg = round(slg, 3)) %>%
arrange(desc(slg))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1,564 x 8
## player_id h x2b x3b hr ab g slg
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ruthba01 2873 506 136 714 8398 2503 0.69
## 2 willite01 2654 525 71 521 7706 2292 0.634
## 3 gehrilo01 2721 534 163 493 8001 2164 0.632
## 4 foxxji01 2646 458 125 534 8134 2317 0.609
## 5 bondsba01 2935 601 77 762 9847 2986 0.607
## 6 greenha01 1628 379 71 331 5193 1394 0.605
## 7 mcgwima01 1626 252 6 583 6187 1874 0.588
## 8 ramirma02 2574 547 20 555 8244 2302 0.585
## 9 dimagjo01 2214 389 131 361 6821 1736 0.579
## 10 hornsro01 2930 541 169 301 8173 2259 0.577
## # ... with 1,554 more rows
2.3 Joins
In many projects, you will not have all of the data you need contained in a single table. You can use dplyr
’s SQL-style join functions to combine tables. This isn’t always intuitive at first, especially with large amounts of data joining on multiple columns, but it’s dramatically easier and quicker than tools like VLOOKUP once you’re used to it.
Player ID Data
Our existing player data only has player ids with partial names. However, we have a dataset with lots of additional player information.
## Parsed with column specification:
## cols(
## .default = col_character(),
## birthYear = col_double(),
## birthMonth = col_double(),
## birthDay = col_double(),
## deathYear = col_double(),
## deathMonth = col_double(),
## deathDay = col_double(),
## weight = col_double(),
## height = col_double(),
## debut = col_date(format = ""),
## finalGame = col_date(format = "")
## )
## See spec(...) for full column specifications.
## # A tibble: 19,105 x 24
## playerID birthYear birthMonth birthDay birthCountry birthState birthCity deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst nameLast nameGiven weight
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 aardsda~ 1981 12 27 USA CO Denver NA NA NA <NA> <NA> <NA> David Aardsma David Al~ 215
## 2 aaronha~ 1934 2 5 USA AL Mobile NA NA NA <NA> <NA> <NA> Hank Aaron Henry Lo~ 180
## 3 aaronto~ 1939 8 5 USA AL Mobile 1984 8 16 USA GA Atlanta Tommie Aaron Tommie L~ 190
## 4 aasedo01 1954 9 8 USA CA Orange NA NA NA <NA> <NA> <NA> Don Aase Donald W~ 190
## 5 abadan01 1972 8 25 USA FL Palm Bea~ NA NA NA <NA> <NA> <NA> Andy Abad Fausto A~ 184
## 6 abadfe01 1985 12 17 D.R. La Romana La Romana NA NA NA <NA> <NA> <NA> Fernando Abad Fernando~ 220
## 7 abadijo~ 1850 11 4 USA PA Philadel~ 1905 5 17 USA NJ Pemberton John Abadie John W. 192
## 8 abbated~ 1877 4 15 USA PA Latrobe 1957 1 6 USA FL Fort Lau~ Ed Abbatic~ Edward J~ 170
## 9 abbeybe~ 1869 11 11 USA VT Essex 1962 6 11 USA VT Colchest~ Bert Abbey Bert Wood 175
## 10 abbeych~ 1866 10 14 USA NE Falls Ci~ 1926 4 27 USA CA San Fran~ Charlie Abbey Charles ~ 169
## # ... with 19,095 more rows, and 7 more variables: height <dbl>, bats <chr>, throws <chr>, debut <date>, finalGame <date>, retroID <chr>, bbrefID <chr>
Let’s create a single player
column from the first and last name columns, and drop everything else besides our player id and name columns.
player_names <- player_info %>%
select(playerID, nameFirst, nameLast) %>%
mutate(player = str_c(nameFirst, " ", nameLast)) %>%
select(-starts_with("name"))
player_names
## # A tibble: 19,105 x 2
## playerID player
## <chr> <chr>
## 1 aardsda01 David Aardsma
## 2 aaronha01 Hank Aaron
## 3 aaronto01 Tommie Aaron
## 4 aasedo01 Don Aase
## 5 abadan01 Andy Abad
## 6 abadfe01 Fernando Abad
## 7 abadijo01 John Abadie
## 8 abbated01 Ed Abbaticchio
## 9 abbeybe01 Bert Abbey
## 10 abbeych01 Charlie Abbey
## # ... with 19,095 more rows
Join the Data
A “left join” on dataframes A and B keeps all of table A and adds columns from table B, matching up rows based on one or more specified joining columns. Here, we want to add the player names from B, matching by player id. However, the player id columns in the two dataframes are not spelled exactly the same way, so we have to tell left_join
precisely which two columns to match up between dataframes.
## # A tibble: 1,564 x 9
## player_id h x2b x3b hr ab g slg player
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ruthba01 2873 506 136 714 8398 2503 0.69 Babe Ruth
## 2 willite01 2654 525 71 521 7706 2292 0.634 Ted Williams
## 3 gehrilo01 2721 534 163 493 8001 2164 0.632 Lou Gehrig
## 4 foxxji01 2646 458 125 534 8134 2317 0.609 Jimmie Foxx
## 5 bondsba01 2935 601 77 762 9847 2986 0.607 Barry Bonds
## 6 greenha01 1628 379 71 331 5193 1394 0.605 Hank Greenberg
## 7 mcgwima01 1626 252 6 583 6187 1874 0.588 Mark McGwire
## 8 ramirma02 2574 547 20 555 8244 2302 0.585 Manny Ramirez
## 9 dimagjo01 2214 389 131 361 6821 1736 0.579 Joe DiMaggio
## 10 hornsro01 2930 541 169 301 8173 2259 0.577 Rogers Hornsby
## # ... with 1,554 more rows
Rename Columns
rename
makes it easy to rename columns without changing anything else in a dataframe. It uses the format newname = oldname
.
## # A tibble: 1,564 x 9
## player_id h doubles triples hr ab g slg player
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ruthba01 2873 506 136 714 8398 2503 0.69 Babe Ruth
## 2 willite01 2654 525 71 521 7706 2292 0.634 Ted Williams
## 3 gehrilo01 2721 534 163 493 8001 2164 0.632 Lou Gehrig
## 4 foxxji01 2646 458 125 534 8134 2317 0.609 Jimmie Foxx
## 5 bondsba01 2935 601 77 762 9847 2986 0.607 Barry Bonds
## 6 greenha01 1628 379 71 331 5193 1394 0.605 Hank Greenberg
## 7 mcgwima01 1626 252 6 583 6187 1874 0.588 Mark McGwire
## 8 ramirma02 2574 547 20 555 8244 2302 0.585 Manny Ramirez
## 9 dimagjo01 2214 389 131 361 6821 1736 0.579 Joe DiMaggio
## 10 hornsro01 2930 541 169 301 8173 2259 0.577 Rogers Hornsby
## # ... with 1,554 more rows
Reorder Columns
## # A tibble: 1,564 x 9
## player_id player h doubles triples hr ab g slg
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ruthba01 Babe Ruth 2873 506 136 714 8398 2503 0.69
## 2 willite01 Ted Williams 2654 525 71 521 7706 2292 0.634
## 3 gehrilo01 Lou Gehrig 2721 534 163 493 8001 2164 0.632
## 4 foxxji01 Jimmie Foxx 2646 458 125 534 8134 2317 0.609
## 5 bondsba01 Barry Bonds 2935 601 77 762 9847 2986 0.607
## 6 greenha01 Hank Greenberg 1628 379 71 331 5193 1394 0.605
## 7 mcgwima01 Mark McGwire 1626 252 6 583 6187 1874 0.588
## 8 ramirma02 Manny Ramirez 2574 547 20 555 8244 2302 0.585
## 9 dimagjo01 Joe DiMaggio 2214 389 131 361 6821 1736 0.579
## 10 hornsro01 Rogers Hornsby 2930 541 169 301 8173 2259 0.577
## # ... with 1,554 more rows
As of dplyr 1.0.0, there is a new relocate
function that moves the specified columns to the front by default.
## # A tibble: 1,564 x 9
## player_id player h doubles triples hr ab g slg
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ruthba01 Babe Ruth 2873 506 136 714 8398 2503 0.69
## 2 willite01 Ted Williams 2654 525 71 521 7706 2292 0.634
## 3 gehrilo01 Lou Gehrig 2721 534 163 493 8001 2164 0.632
## 4 foxxji01 Jimmie Foxx 2646 458 125 534 8134 2317 0.609
## 5 bondsba01 Barry Bonds 2935 601 77 762 9847 2986 0.607
## 6 greenha01 Hank Greenberg 1628 379 71 331 5193 1394 0.605
## 7 mcgwima01 Mark McGwire 1626 252 6 583 6187 1874 0.588
## 8 ramirma02 Manny Ramirez 2574 547 20 555 8244 2302 0.585
## 9 dimagjo01 Joe DiMaggio 2214 389 131 361 6821 1736 0.579
## 10 hornsro01 Rogers Hornsby 2930 541 169 301 8173 2259 0.577
## # ... with 1,554 more rows
We could also then move the slg
column so that it’s right after the player columns.
## # A tibble: 1,564 x 9
## player_id player slg h doubles triples hr ab g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ruthba01 Babe Ruth 0.69 2873 506 136 714 8398 2503
## 2 willite01 Ted Williams 0.634 2654 525 71 521 7706 2292
## 3 gehrilo01 Lou Gehrig 0.632 2721 534 163 493 8001 2164
## 4 foxxji01 Jimmie Foxx 0.609 2646 458 125 534 8134 2317
## 5 bondsba01 Barry Bonds 0.607 2935 601 77 762 9847 2986
## 6 greenha01 Hank Greenberg 0.605 1628 379 71 331 5193 1394
## 7 mcgwima01 Mark McGwire 0.588 1626 252 6 583 6187 1874
## 8 ramirma02 Manny Ramirez 0.585 2574 547 20 555 8244 2302
## 9 dimagjo01 Joe DiMaggio 0.579 2214 389 131 361 6821 1736
## 10 hornsro01 Rogers Hornsby 0.577 2930 541 169 301 8173 2259
## # ... with 1,554 more rows
2.4 Other dplyr Tricks
Count
count
creates tidy-format frequency tables for a given combination of columns. It’s a shortcut for group_by(cols) %>% summarize(n = n())
.
## # A tibble: 18,915 x 2
## player_id n
## <chr> <int>
## 1 aardsda01 9
## 2 aaronha01 23
## 3 aaronto01 7
## 4 aasedo01 13
## 5 abadan01 3
## 6 abadfe01 8
## 7 abadijo01 2
## 8 abbated01 10
## 9 abbeybe01 6
## 10 abbeych01 5
## # ... with 18,905 more rows
## # A tibble: 18,915 x 2
## player_id n
## <chr> <int>
## 1 mcguide01 31
## 2 henderi01 29
## 3 newsobo01 29
## 4 johnto01 28
## 5 kaatji01 28
## 6 ansonca01 27
## 7 baineha01 27
## 8 carltst01 27
## 9 moyerja01 27
## 10 ryanno01 27
## # ... with 18,905 more rows
2.4.1 Grouped Mutate
group_by
doesn’t just allow you to summarize - it also allows you to do mutate
calculations within each group. For example, to calculate the proportion of their career home runs that each player hit during each year of their career, we could do the following. The key here is that the sum
inside the mutate
only sums up home runs within each group.
prop_hr <- data %>%
select(player_id, year_id, hr) %>%
group_by(player_id) %>%
mutate(prop_career_hr = hr/sum(hr))
prop_hr %>%
filter(player_id == "ruthba01")
## # A tibble: 22 x 4
## # Groups: player_id [1]
## player_id year_id hr prop_career_hr
## <chr> <dbl> <dbl> <dbl>
## 1 ruthba01 1914 0 0
## 2 ruthba01 1915 4 0.00560
## 3 ruthba01 1916 3 0.00420
## 4 ruthba01 1917 2 0.00280
## 5 ruthba01 1918 11 0.0154
## 6 ruthba01 1919 29 0.0406
## 7 ruthba01 1920 54 0.0756
## 8 ruthba01 1921 59 0.0826
## 9 ruthba01 1922 35 0.0490
## 10 ruthba01 1923 41 0.0574
## # ... with 12 more rows
Conditional verb variants
dplyr
provides conditional variants of many of the main verbs, ending in _at
, _if
, and _all
. These are most often used with summarize
, mutate
, or rename
to operate on multiple columns at once.
The rather wordy summarize
that we used earlier can be replaced with a summarize_at
that specifies the columns to be summarized and one or more functions to apply to each columns. Anything that would work in select()
can be wrapped in vars()
to select columns, and function names can be bare, base-R-style anonymous functions, or purrr
-style formula notation, which will be covered in section 4.
slg_summary <- data %>%
group_by(player_id) %>%
summarize(h = sum(h),
x2b = sum(x2b),
x3b = sum(x3b),
hr = sum(hr),
ab = sum(ab),
g = sum(g))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 7
## player_id h x2b x3b hr ab g
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aardsda01 0 0 0 0 4 331
## 2 aaronha01 3771 624 98 755 12364 3298
## 3 aaronto01 216 42 6 13 944 437
## 4 aasedo01 0 0 0 0 5 448
## 5 abadan01 2 0 0 0 21 15
## 6 abadfe01 1 0 0 0 9 315
# single function with no additional arguments: bare function name
concise_slg_summary <- data %>%
group_by(player_id) %>%
summarize_at(.vars = vars(h, x2b, x3b, hr, ab, g), .funs = sum)
concise_slg_summary
## # A tibble: 18,915 x 7
## player_id h x2b x3b hr ab g
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aardsda01 0 0 0 0 4 331
## 2 aaronha01 3771 624 98 755 12364 3298
## 3 aaronto01 216 42 6 13 944 437
## 4 aasedo01 0 0 0 0 5 448
## 5 abadan01 2 0 0 0 21 15
## 6 abadfe01 1 0 0 0 9 315
## 7 abadijo01 11 0 0 0 49 12
## 8 abbated01 772 99 43 11 3044 855
## 9 abbeybe01 38 3 3 0 225 79
## 10 abbeych01 492 67 46 19 1751 451
## # ... with 18,905 more rows
Passing a named list of functions as the .funs
argument allows you to run multiple summary functions on each selected column and name the resulting summary columns.
multi_summary <- data %>%
group_by(player_id) %>%
summarize_at(.vars = vars(h, x2b, x3b, hr, ab, g), .funs = list("sum" = sum, "sd" = sd))
multi_summary
## # A tibble: 18,915 x 13
## player_id h_sum x2b_sum x3b_sum hr_sum ab_sum g_sum h_sd x2b_sd x3b_sd hr_sd ab_sd g_sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aardsda01 0 0 0 0 4 331 0 0 0 0 0.726 22.0
## 2 aaronha01 3771 624 98 755 12364 3298 40.6 10.1 3.84 11.2 98.4 18.5
## 3 aaronto01 216 42 6 13 944 437 29.7 7.02 1.21 2.79 125. 45.5
## 4 aasedo01 0 0 0 0 5 448 0 0 0 0 1.39 16.2
## 5 abadan01 2 0 0 0 21 15 1.15 0 0 0 8.72 4
## 6 abadfe01 1 0 0 0 9 315 0.354 0 0 0 2.42 18.0
## 7 abadijo01 11 0 0 0 49 12 6.36 0 0 0 29.0 7.07
## 8 abbated01 772 99 43 11 3044 855 65.7 9.29 4.55 1.20 250. 66.7
## 9 abbeybe01 38 3 3 0 225 79 3.78 0.548 0.837 0 27.3 10.5
## 10 abbeych01 492 67 46 19 1751 451 53.8 8.88 5.40 3.56 170. 41.8
## # ... with 18,905 more rows
The _if
variant allows the user to run a dplyr
operation on all columns for which some function returns TRUE
. For instance, to add the string "_numeric" to the names of all numeric columns, we could use rename_if
. We’ll run this on a subset of columns to simplify the display.
numeric_rename <- data %>%
select(player_id, year_id, lg_id, g, ab) %>%
rename_if(is.numeric, ~str_c(.x, "_numeric"))
numeric_rename
## # A tibble: 102,816 x 5
## player_id year_id_numeric lg_id g_numeric ab_numeric
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 abercda01 1871 <NA> 1 4
## 2 addybo01 1871 <NA> 25 118
## 3 allisar01 1871 <NA> 29 137
## 4 allisdo01 1871 <NA> 27 133
## 5 ansonca01 1871 <NA> 25 120
## 6 armstbo01 1871 <NA> 12 49
## 7 barkeal01 1871 <NA> 1 4
## 8 barnero01 1871 <NA> 31 157
## 9 barrebi01 1871 <NA> 1 5
## 10 barrofr01 1871 <NA> 18 86
## # ... with 102,806 more rows
The _all
variant simply runs the same operation on all columns. We’ll select down to a set of numeric columns, and then multiply all of them by two.
## # A tibble: 102,816 x 3
## hr ab g
## <dbl> <dbl> <dbl>
## 1 0 8 2
## 2 0 236 50
## 3 0 274 58
## 4 4 266 54
## 5 0 240 50
## 6 0 98 24
## 7 0 8 2
## 8 0 314 62
## 9 0 10 2
## 10 0 172 36
## # ... with 102,806 more rows
2.4.2 across()
The newest version of dplyr
(1.0.0) introduced a new, arguably simpler way to achieve the same goals as these conditional function variants: the across
function. across
can be used with many of the dplyr verbs to operate on a specific subset of columns, specified with the same language as select
. For example, across
could be used in similar ways as the conditional variants above:
multi_summary_across <- data %>%
group_by(player_id) %>%
summarize(across(c(h, x2b, x3b, hr, ab, g), .funs = list("sum" = sum, "sd" = sd)))
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## # A tibble: 102,816 x 7
## # Groups: player_id [18,915]
## player_id h x2b x3b hr ab g
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aardsda01 0 0 0 0 0 11
## 2 aardsda01 0 0 0 0 2 45
## 3 aardsda01 0 0 0 0 0 25
## 4 aardsda01 0 0 0 0 1 47
## 5 aardsda01 0 0 0 0 0 73
## 6 aardsda01 0 0 0 0 0 53
## 7 aardsda01 0 0 0 0 0 1
## 8 aardsda01 0 0 0 0 0 43
## 9 aardsda01 0 0 0 0 1 33
## 10 aaronha01 131 27 6 13 468 122
## # ... with 102,806 more rows
# multiply all selected numeric columns by 2
mutate_numeric_across <- data %>%
select(player_id, hr, ab, g) %>%
mutate(across(where(is.numeric), ~ .x * 2))
mutate_numeric_across
## # A tibble: 102,816 x 4
## player_id hr ab g
## <chr> <dbl> <dbl> <dbl>
## 1 abercda01 0 8 2
## 2 addybo01 0 236 50
## 3 allisar01 0 274 58
## 4 allisdo01 4 266 54
## 5 ansonca01 0 240 50
## 6 armstbo01 0 98 24
## 7 barkeal01 0 8 2
## 8 barnero01 0 314 62
## 9 barrebi01 0 10 2
## 10 barrofr01 0 172 36
## # ... with 102,806 more rows
However, across
is very new and occasionally has serious performance issues, so it is not used above. More information can be found (here)[https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/];
2.4.3 Write Data
We will write this file to a folder so we can access it in later lessons / chapters.