4 Functional Programming
In this section, you will learn:
- How to write a basic function.
- How to run this function for many inputs.
In this section, we will use the following libraries and data:
library(tidyverse)
library(purrr)
hitters <- read_csv("data_sources/Batting.csv", guess_max = 10000)
## Parsed with column specification:
## cols(
## .default = col_double(),
## playerID = col_character(),
## teamID = col_character(),
## lgID = col_character()
## )
## See spec(...) for full column specifications.
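The code in the rest of this section refers to a data frame called data with snake_case column names (player_id, team_id, year_id, g, hr), whereas the file above is read into hitters with names such as playerID. The renaming step is not shown here. The following is only a sketch of one way it could be done, not necessarily the method used for this section; hitters_toy and data_toy are hypothetical stand-ins, and the raw column names are assumed to follow the pattern in the column specification above.

```r
library(tidyverse)

# Hypothetical stand-in for the raw file; the real `hitters` has many more columns.
hitters_toy <- tibble(
  playerID = c("aaa01", "bbb01", "aaa01"),
  yearID   = c(1990, 1991, 1992),
  teamID   = c("HOU", "TEX", "HOU"),
  G        = c(100, 120, 90),
  HR       = c(10, 25, 8)
)

# Assumed cleaning step: turn a trailing "ID" into "_id", then lowercase every name.
data_toy <- hitters_toy %>%
  rename_with(~ str_to_lower(str_replace(.x, "ID$", "_id")))

names(data_toy)
```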
4.1 An Interesting Question
Who played the most games and hit the most home runs in the 90s in the state of Texas? This question is fairly easy to answer with the tools we learned in the Data Manipulation chapter.
data %>%
  filter(year_id %in% 1990:1999) %>%
  filter(team_id %in% c("HOU","TEX")) %>%
  group_by(player_id, team_id) %>%
  summarize(g = sum(g),
            hr = sum(hr)) %>%
  arrange(desc(g), desc(hr)) # desc() takes one vector, so each column is wrapped separately
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## # A tibble: 411 x 4
## # Groups: player_id [398]
## player_id team_id g hr
## <chr> <chr> <dbl> <dbl>
## 1 biggicr01 HOU 1515 136
## 2 bagweje01 HOU 1317 263
## 3 gonzaju03 TEX 1224 339
## 4 rodriiv01 TEX 1169 144
## 5 greerru01 TEX 809 103
## 6 palmera01 TEX 790 146
## 7 caminke01 HOU 772 74
## 8 palmede01 TEX 758 154
## 9 gonzalu01 HOU 745 62
## 10 bellde01 HOU 683 74
## # ... with 401 more rows
4.2 A More General Question
What if we wanted to be able to easily answer this question for any range of years, teams, and statistics? We could pull out each of these variables, set them ahead of time, and then run a slightly modified version of the above code that uses our newly created variables.
Note that this version of the code uses across, mentioned briefly in the Data Manipulation section; it could also be written with summarize_at, but using across makes our next step easier.
years <- 1990:1999
teams_chosen <- c("HOU", "TEX")
category <- c("g", "hr")
data %>%
  filter(year_id %in% years) %>%
  filter(team_id %in% teams_chosen) %>%
  group_by(player_id, team_id) %>%
  summarize(across(all_of(category), sum)) %>% # requires dplyr 1.0.0; all_of() marks category as a vector of column names
  arrange(across(all_of(category), desc)) # requires dplyr 1.0.0; sorts each chosen column in descending order
## # A tibble: 411 x 4
## # Groups: player_id [398]
## player_id team_id g hr
## <chr> <chr> <dbl> <dbl>
## 1 biggicr01 HOU 1515 136
## 2 bagweje01 HOU 1317 263
## 3 gonzaju03 TEX 1224 339
## 4 rodriiv01 TEX 1169 144
## 5 greerru01 TEX 809 103
## 6 palmera01 TEX 790 146
## 7 caminke01 HOU 772 74
## 8 palmede01 TEX 758 154
## 9 gonzalu01 HOU 745 62
## 10 bellde01 HOU 683 74
## # ... with 401 more rows
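To see across in isolation, here is a small sketch on made-up data (toy_stats and stat_cols are hypothetical names, not part of the batting data). It assumes dplyr 1.0.0 or later. Wrapping the character vector in all_of tells across to treat its contents as column names.

```r
library(tidyverse)

# Hypothetical toy data: two players, two stat columns.
toy_stats <- tibble(
  player_id = c("a", "a", "b"),
  g         = c(10, 20, 5),
  hr        = c(1, 2, 3)
)
stat_cols <- c("g", "hr")

toy_totals <- toy_stats %>%
  group_by(player_id) %>%
  # Sum every column named in stat_cols, then drop the grouping.
  summarize(across(all_of(stat_cols), sum), .groups = "drop") %>%
  # Sort by each chosen column, largest values first.
  arrange(across(all_of(stat_cols), desc))

toy_totals
```

Changing stat_cols to c("hr") is enough to summarize and sort by a different statistic, which is the flexibility the next step relies on.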
4.3 Even More Generally: Writing a Function
The format above works fine for occasional ad-hoc queries, but if we wanted to answer the question for multiple sets of parameters, we’d have to copy and paste all of this code - and then make sure, every time we edited it in the future, that those edits got made to every single instance in exactly the same way. One way to solve this problem would be to use a for loop, or nested for loops, with different sets of parameters, but even this quickly gets clunky. The most flexible solution is to turn our code into its own function.
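For comparison, a for loop over a single parameter might look roughly like the sketch below. It uses a hypothetical toy table rather than the batting data, and the names toy, teams and results are illustrative only. Note the bookkeeping needed just to store the results; adding years and statistics would mean nesting further loops.

```r
library(tidyverse)

# Hypothetical toy data.
toy <- tibble(
  player_id = c("a", "b", "a", "c"),
  team_id   = c("HOU", "HOU", "TEX", "TEX"),
  hr        = c(5, 7, 3, 9)
)

teams <- c("HOU", "TEX")
results <- vector("list", length(teams))  # pre-allocate one slot per team
names(results) <- teams

for (team in teams) {
  results[[team]] <- toy %>%
    filter(team_id == team) %>%
    group_by(player_id) %>%
    summarize(hr = sum(hr), .groups = "drop") %>%
    arrange(desc(hr))
}

results$TEX
```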
In the code below, we define a function. We:
- Give it a name (subset_batting_stats);
- Define the arguments - the inputs - that will be required;
- Write the code that will be run, using the input names we’ve defined;
- Define the value that the function returns;
- Run it with various parameters to test it out.
subset_batting_stats <- function(batting_data, years, teams_chosen, category){
  batting_data_subset_summary <- batting_data %>%
    filter(year_id %in% years) %>%
    filter(team_id %in% teams_chosen) %>%
    group_by(player_id, team_id) %>%
    summarize(across(all_of(category), sum)) %>% # requires dplyr 1.0.0
    arrange(across(all_of(category), desc)) # requires dplyr 1.0.0
    # summarize_at(vars(all_of(category)), sum) %>% # older/fancier/more general method
    # arrange(desc(!!!rlang::syms(category))) # older/fancier/more general method
  return(batting_data_subset_summary)
}
# Texas in the 90s?
subset_batting_stats(batting_data = data, years = 1990:1999, teams_chosen = c("HOU", "TEX"), category = c("g", "hr"))
## # A tibble: 411 x 4
## # Groups: player_id [398]
## player_id team_id g hr
## <chr> <chr> <dbl> <dbl>
## 1 biggicr01 HOU 1515 136
## 2 bagweje01 HOU 1317 263
## 3 gonzaju03 TEX 1224 339
## 4 rodriiv01 TEX 1169 144
## 5 greerru01 TEX 809 103
## 6 palmera01 TEX 790 146
## 7 caminke01 HOU 772 74
## 8 palmede01 TEX 758 154
## 9 gonzalu01 HOU 745 62
## 10 bellde01 HOU 683 74
## # ... with 401 more rows
This is nice and elegant, but it hasn’t accomplished anything different than what we did before. Below you can see the power of this function as we can easily change the parameters to answer a different question.
# Los Angeles in the 2010s?
subset_batting_stats(batting_data = data, years = 2010:2019, teams_chosen = c("LAN", "LAA"), category = c("hr", "g"))
## # A tibble: 388 x 4
## # Groups: player_id [371]
## player_id team_id hr g
## <chr> <chr> <dbl> <dbl>
## 1 troutmi01 LAA 168 811
## 2 pujolal01 LAA 146 721
## 3 kempma01 LAN 121 652
## 4 gonzaad01 LAN 98 664
## 5 trumbma01 LAA 95 460
## 6 ethiean01 LAN 85 853
## 7 calhoko01 LAA 69 522
## 8 hunteto01 LAA 62 448
## 9 puigya01 LAN 57 435
## 10 kendrho01 LAA 56 724
## # ... with 378 more rows
4.4 Iteration with functions: purrr
Now that we have our custom function, we can use functions from the package purrr to easily run it multiple times with different inputs. (You may have used various flavors of apply in the past - these also work well, and accomplish most of the same things, but purrr functions have a more convenient and consistent interface.)
The most basic function that helps us do this is map. map allows us to run a function many times, varying one input between each of these runs. It takes two inputs: the list or vector of different values for your one varying input, and the function you want to run repeatedly, with values set for any other arguments.
Functions from purrr allow us to use a special syntax for writing out the function we want to run, based on R’s formula syntax. Simply put a tilde (~) in front of the function call, and then replace the value to be iterated over with .x. Here, we use this syntax with map to find the Orioles’ top home run hitters in each of several decades. Our input is a list of numeric vectors, one for each decade, and our output is a list of dataframes.
decades <- 1950:2019 %>% split(sort(rep(1:7, 10)))
bal_top_hr_decades <- map(decades, ~subset_batting_stats(batting_data = data, years = .x, teams_chosen = "BAL", category = "hr"))
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## [1] "list"
## [1] 7
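The formula shorthand is equivalent to writing out an anonymous function. A minimal sketch with a hypothetical toy function, add_offset, shows the two forms side by side and checks that map returns a list with one element per input:

```r
library(purrr)

add_offset <- function(x, offset) x + offset  # hypothetical toy function

# Formula shorthand: .x stands for the element currently being iterated over.
out_formula <- map(1:3, ~ add_offset(x = .x, offset = 10))

# The same call written as an ordinary anonymous function.
out_anonymous <- map(1:3, function(v) add_offset(x = v, offset = 10))

identical(out_formula, out_anonymous)
class(out_formula)
length(out_formula)
```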
This is still probably too much data to make use of: maybe it would be more helpful to just have the top home run hitter from each decade. We can take our list of output dataframes and map the function head(1) over each of them, to get the first row, since they’re already sorted by hr.
At this point, we’ll have a list of seven one-row dataframes, and we might as well row-bind them. Unlike base R’s rbind, dplyr::bind_rows will accept a list of dataframes and bind them together. (Another option would have been to use map_dfr, which runs bind_rows on its list-formatted output.)
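On a toy list of data frames (hypothetical values, not the batting data), the two steps just described could be sketched as follows; toy_list, top1 and top1_dfr are illustrative names.

```r
library(tidyverse)

# Hypothetical list of data frames, each already sorted by hr (descending).
toy_list <- list(
  tibble(player_id = c("a", "b"), hr = c(30, 12)),
  tibble(player_id = c("c", "d"), hr = c(41, 40))
)

# Option 1: keep the first row of each data frame, then bind the pieces together.
top1 <- toy_list %>%
  map(~ head(.x, 1)) %>%
  bind_rows()

# Option 2: map_dfr() does the mapping and the row-binding in one call.
top1_dfr <- map_dfr(toy_list, ~ head(.x, 1))

top1
```

Applied to bal_top_hr_decades, the same two steps should yield one row per decade.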
## # A tibble: 7 x 3
## # Groups: player_id [7]
## player_id team_id hr
## <chr> <chr> <dbl>
## 1 triangu01 BAL 107
## 2 powelbo01 BAL 202
## 3 mayle01 BAL 116
## 4 murraed02 BAL 254
## 5 ripkeca01 BAL 198
## 6 morame01 BAL 158
## 7 davisch02 BAL 199
This is much more useful, but we do have a small problem: we’ve lost the information about which decade each one came from! One way to rectify this problem is to map a mutate after we run our custom function to add a column that shows which years each row came from. However, we’ll need map2 for this, since we’re going to be iterating over both a list of dataframes and a list of decade vectors. map2 works just like map, except it takes two lists/vectors as inputs, along with a function, and you can specify these inputs in your function as .x and .y.
bal_top1_hr_yearcol <- bal_top_hr_decades %>%
  map2(decades, ~mutate(.x, years = str_c(first(.y), "-", last(.y)))) %>%
  map(~head(.x, 1)) %>%
  bind_rows()
bal_top1_hr_yearcol
## # A tibble: 7 x 4
## # Groups: player_id [7]
## player_id team_id hr years
## <chr> <chr> <dbl> <chr>
## 1 triangu01 BAL 107 1950-1959
## 2 powelbo01 BAL 202 1960-1969
## 3 mayle01 BAL 116 1970-1979
## 4 murraed02 BAL 254 1980-1989
## 5 ripkeca01 BAL 198 1990-1999
## 6 morame01 BAL 158 2000-2009
## 7 davisch02 BAL 199 2010-2019
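Stripped of the batting data, the pairing that map2 performs can be seen in a small sketch. The vectors starts and ends are hypothetical; map2_chr is the purrr variant that returns a character vector instead of a list.

```r
library(purrr)

starts <- c(1950, 1960, 1970)
ends   <- c(1959, 1969, 1979)

# .x is taken from the first input and .y from the second, one pair at a time.
labels <- map2_chr(starts, ends, ~ paste0(.x, "-", .y))
labels
```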
What if we wanted to iterate over more than two arguments? We can use pmap for this; better yet, pmap will accept a list or dataframe of input combinations, which makes it easy to set up the right combinations of arguments to iterate over.
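When the input list or dataframe has names, pmap matches them to the function’s argument names, so each row of a dataframe becomes one call. A minimal sketch, using a hypothetical function describe and a hypothetical table combos:

```r
library(tidyverse)

# Hypothetical function with three named arguments.
describe <- function(team, stat, n) {
  str_c("top ", n, " by ", stat, " for ", team)
}

# Column names match the argument names, so the column order does not matter.
combos <- tibble(
  n    = c(1, 3),
  team = c("BAL", "HOU"),
  stat = c("hr", "rbi")
)

# pmap_chr() is the variant of pmap() that returns a character vector.
descriptions <- pmap_chr(combos, describe)
descriptions
```

This is why the columns of the input table need the same names as the arguments of the function being called.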
input_combos <- tibble(batting_data = list(data),
                       years = decades[1:4],
                       teams_chosen = c("BAL", "HOU", "TEX", "NYA"),
                       category = c("hr", "rbi", "g", "ab"))
input_combos
## # A tibble: 4 x 4
## batting_data years teams_chosen category
## <list> <named list> <chr> <chr>
## 1 <tibble [102,816 x 22]> <int [10]> BAL hr
## 2 <tibble [102,816 x 22]> <int [10]> HOU rbi
## 3 <tibble [102,816 x 22]> <int [10]> TEX g
## 4 <tibble [102,816 x 22]> <int [10]> NYA ab
pmap_out_top3 <- pmap(input_combos, subset_batting_stats) %>%
  map2(input_combos$years, ~mutate(.x, years = str_c(first(.y), "-", last(.y)))) %>%
  map_dfr(~head(.x, 3))
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## # A tibble: 12 x 7
## # Groups: player_id [12]
## player_id team_id hr years rbi g ab
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 triangu01 BAL 107 1950-1959 NA NA NA
## 2 niemabo01 BAL 62 1950-1959 NA NA NA
## 3 woodlge01 BAL 32 1950-1959 NA NA NA
## 4 wynnji01 HOU NA 1960-1969 441 NA NA
## 5 asprobo01 HOU NA 1960-1969 385 NA NA
## 6 staubru01 HOU NA 1960-1969 370 NA NA
## 7 harrato01 TEX NA 1970-1979 NA 999 NA
## 8 sundbji01 TEX NA 1970-1979 NA 875 NA
## 9 hargrmi01 TEX NA 1970-1979 NA 726 NA
## 10 winfida01 NYA NA 1980-1989 NA NA 4424
## 11 randowi01 NYA NA 1980-1989 NA NA 4249
## 12 mattido01 NYA NA 1980-1989 NA NA 4022
## tidy up...
pmap_out_top3 %>%
  group_by(team_id) %>%
  mutate(rank = row_number()) %>%
  pivot_longer(c(rbi, g, ab, hr), names_to = "stat_type") %>%
  filter(!is.na(value))
## # A tibble: 12 x 6
## # Groups: team_id [4]
## player_id team_id years rank stat_type value
## <chr> <chr> <chr> <int> <chr> <dbl>
## 1 triangu01 BAL 1950-1959 1 hr 107
## 2 niemabo01 BAL 1950-1959 2 hr 62
## 3 woodlge01 BAL 1950-1959 3 hr 32
## 4 wynnji01 HOU 1960-1969 1 rbi 441
## 5 asprobo01 HOU 1960-1969 2 rbi 385
## 6 staubru01 HOU 1960-1969 3 rbi 370
## 7 harrato01 TEX 1970-1979 1 g 999
## 8 sundbji01 TEX 1970-1979 2 g 875
## 9 hargrmi01 TEX 1970-1979 3 g 726
## 10 winfida01 NYA 1980-1989 1 ab 4424
## 11 randowi01 NYA 1980-1989 2 ab 4249
## 12 mattido01 NYA 1980-1989 3 ab 4022