4 Functional Programming

In this section, you will learn:

How to write a basic function.
How to run this function for many inputs.

In this section, we will use the following libraries and data:

library(tidyverse)
library(purrr)

hitters <- read_csv("data_sources/Batting.csv", guess_max = 10000)

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   playerID = col_character(),
##   teamID = col_character(),
##   lgID = col_character()
## )

## See spec(...) for full column specifications.

data <- read_csv("data_sources/Batting.csv", guess_max = 10000) %>%
  janitor::clean_names()

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   playerID = col_character(),
##   teamID = col_character(),
##   lgID = col_character()
## )
## See spec(...) for full column specifications.

4.1 An Interesting Question

Who played the most games and hit the most home runs in the 90s in the state of Texas? This question is fairly easy to answer with the tools we learned in the Data Manipulation chapter.

data %>% 
  filter(year_id %in% 1990:1999) %>% 
  filter(team_id %in% c("HOU","TEX")) %>% 
  group_by(player_id, team_id) %>% 
  summarize(g = sum(g),
            hr = sum(hr)) %>%
  arrange(desc(g), desc(hr)) # desc() takes one column at a time

## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)

## # A tibble: 411 x 4
## # Groups:   player_id [398]
##    player_id team_id     g    hr
##    <chr>     <chr>   <dbl> <dbl>
##  1 biggicr01 HOU      1515   136
##  2 bagweje01 HOU      1317   263
##  3 gonzaju03 TEX      1224   339
##  4 rodriiv01 TEX      1169   144
##  5 greerru01 TEX       809   103
##  6 palmera01 TEX       790   146
##  7 caminke01 HOU       772    74
##  8 palmede01 TEX       758   154
##  9 gonzalu01 HOU       745    62
## 10 bellde01  HOU       683    74
## # ... with 401 more rows

4.2 A More General Question

What if we wanted to be able to easily answer this question for any range of years, teams, and statistics? We could pull out each of these variables, set them ahead of time, and then run a slightly modified version of the above code that uses our newly created variables.

Note that this version of the code uses across, mentioned briefly in the Data Manipulation section; it could also be written with summarize_at, but using across makes our next step easier.

years <- 1990:1999
teams_chosen <- c("HOU", "TEX")
category <- c("g", "hr")

data %>% 
  filter(year_id %in% years) %>% 
  filter(team_id %in% teams_chosen) %>%
  group_by(player_id, team_id) %>% 
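  summarize(across(all_of(category), sum)) %>% # requires dplyr >= 1.0.0; all_of() selects columns named in a character vector
  arrange(across(all_of(category), desc)) # requires dplyr >= 1.0.0; sorts each chosen column in descending order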

## # A tibble: 411 x 4
## # Groups:   player_id [398]
##    player_id team_id     g    hr
##    <chr>     <chr>   <dbl> <dbl>
##  1 biggicr01 HOU      1515   136
##  2 bagweje01 HOU      1317   263
##  3 gonzaju03 TEX      1224   339
##  4 rodriiv01 TEX      1169   144
##  5 greerru01 TEX       809   103
##  6 palmera01 TEX       790   146
##  7 caminke01 HOU       772    74
##  8 palmede01 TEX       758   154
##  9 gonzalu01 HOU       745    62
## 10 bellde01  HOU       683    74
## # ... with 401 more rows
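
The note above says the same summary could be written with either across or summarize_at. As a minimal sketch, the two spellings can be compared on a small made-up tibble; the toy values and the names toy, with_across, and with_summarize_at are assumptions for illustration and do not come from Batting.csv.

```r
library(dplyr)

# Toy data with made-up values (not taken from Batting.csv)
toy <- tibble(
  player_id = c("a", "a", "b", "b"),
  g  = c(10, 20, 5, 5),
  hr = c(1, 2, 0, 3)
)
category <- c("g", "hr")

# across() with all_of(): columns are chosen from a character vector
with_across <- toy %>%
  group_by(player_id) %>%
  summarize(across(all_of(category), sum))

# summarize_at(): the older spelling of the same operation
with_summarize_at <- toy %>%
  group_by(player_id) %>%
  summarize_at(vars(all_of(category)), sum)

with_across
with_summarize_at
```

Both calls should sum g and hr within each player, so the two printed tibbles can be compared by eye.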

4.3 Even More Generally: Writing a Function

The format above works fine for occasional ad-hoc queries, but if we wanted to answer the question for multiple sets of parameters, we’d have to copy and paste all of this code - and then make sure, every time we edited it in the future, that those edits got made to every single instance in exactly the same way. One way to solve this problem would be to use a for loop, or nested for loops, with different sets of parameters, but even this quickly gets clunky. The most flexible solution is to turn our code into its own function.

In the code below, we define a function. We:

Give it a name (subset_batting_stats);
Define the arguments - the inputs - that will be required;
Write the code that will be run, using the input names we’ve defined;
Define the value that the function returns;
Run it with various parameters to test it out.

subset_batting_stats <- function(batting_data, years, teams_chosen, category){
  batting_data_subset_summary <- batting_data %>% 
    filter(year_id %in% years) %>% 
    filter(team_id %in% teams_chosen) %>% 
    group_by(player_id, team_id) %>%
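    summarize(across(all_of(category), sum)) %>% # requires dplyr >= 1.0.0; all_of() selects columns named in a character vector
    arrange(across(all_of(category), desc)) # requires dplyr >= 1.0.0; sorts each chosen column in descending order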
    # summarize_at(vars(all_of(category)), sum) %>% # older/fancier/more general method
    # arrange(desc(!!!rlang::syms(category))) # older/fancier/more general method
  
  return(batting_data_subset_summary)
}


# Texas in the 90s?
subset_batting_stats(batting_data = data, years = 1990:1999, teams_chosen = c("HOU", "TEX"), category = c("g", "hr"))

## # A tibble: 411 x 4
## # Groups:   player_id [398]
##    player_id team_id     g    hr
##    <chr>     <chr>   <dbl> <dbl>
##  1 biggicr01 HOU      1515   136
##  2 bagweje01 HOU      1317   263
##  3 gonzaju03 TEX      1224   339
##  4 rodriiv01 TEX      1169   144
##  5 greerru01 TEX       809   103
##  6 palmera01 TEX       790   146
##  7 caminke01 HOU       772    74
##  8 palmede01 TEX       758   154
##  9 gonzalu01 HOU       745    62
## 10 bellde01  HOU       683    74
## # ... with 401 more rows

This is nice and elegant, but it hasn’t accomplished anything different than what we did before. Below you can see the power of this function as we can easily change the parameters to answer a different question.

# Los Angeles in the 2010s?
subset_batting_stats(batting_data = data, years = 2010:2019, teams_chosen = c("LAN", "LAA"), category = c("hr", "g"))

## # A tibble: 388 x 4
## # Groups:   player_id [371]
##    player_id team_id    hr     g
##    <chr>     <chr>   <dbl> <dbl>
##  1 troutmi01 LAA       168   811
##  2 pujolal01 LAA       146   721
##  3 kempma01  LAN       121   652
##  4 gonzaad01 LAN        98   664
##  5 trumbma01 LAA        95   460
##  6 ethiean01 LAN        85   853
##  7 calhoko01 LAA        69   522
##  8 hunteto01 LAA        62   448
##  9 puigya01  LAN        57   435
## 10 kendrho01 LAA        56   724
## # ... with 378 more rows
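
For readers who do not have Batting.csv on hand, the function can also be tried on a tiny made-up tibble. This is only a sketch: the toy_batting values, the player and team labels, and the use of .groups = "drop" (which silences the regrouping message and returns an ungrouped tibble) are assumptions for illustration, not part of the original data or function.

```r
library(dplyr)

# Function restated so this block runs on its own;
# .groups = "drop" is an illustrative choice, not the original behavior
subset_batting_stats <- function(batting_data, years, teams_chosen, category) {
  batting_data %>%
    filter(year_id %in% years, team_id %in% teams_chosen) %>%
    group_by(player_id, team_id) %>%
    summarize(across(all_of(category), sum), .groups = "drop") %>%
    arrange(across(all_of(category), desc))
}

# Made-up rows with the same column names the function expects
toy_batting <- tibble(
  player_id = c("p1", "p1", "p2", "p2", "p3"),
  year_id   = c(1990, 1991, 1990, 1991, 1990),
  team_id   = c("AAA", "AAA", "AAA", "BBB", "CCC"),
  g         = c(100, 120, 150, 30, 160),
  hr        = c(10, 12, 30, 2, 40)
)

toy_result <- subset_batting_stats(
  batting_data = toy_batting,
  years = 1990:1991,
  teams_chosen = c("AAA", "BBB"),
  category = c("g", "hr")
)
toy_result
```

Player p3 is filtered out because team CCC is not among the chosen teams, and p1's two seasons are added together before sorting.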

4.4 Iteration with functions: purrr

Now that we have our custom function, we can use functions from the package purrr to easily run it multiple times with different inputs. (You may have used various flavors of apply in the past - these also work well, and accomplish most of the same things, but purrr functions have a more convenient and consistent interface.)

The most basic function that helps us do this is map. map allows us to run a function many times, varying one input between each of these runs. It takes two inputs: the list or vector of different values for your one varying input, and the function you want to run repeatedly, with values set for any other arguments.
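
Before applying map to the batting function, it may help to see it on something smaller. The sketch below uses a hypothetical helper, power, that is not part of this chapter's code: one argument varies across runs while the other is held fixed.

```r
library(purrr)

# A small stand-in function (hypothetical): raise x to a power
power <- function(x, exponent) x^exponent

# Vary x across runs; exponent is passed along unchanged to every call
squares <- map(1:3, power, exponent = 2)

str(squares)
```

map always returns a list, with one element per input value.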

Functions from purrr allow us to use a special shorthand for writing out the function we want to run, based on R’s formula syntax. Simply put a tilde (~) in front of the function call, and then replace the value to be iterated over with .x. Here, we use this syntax with map to find the Orioles’ top home run hitters in each of several decades. Our input is a list of numeric vectors, one for each decade, and our output is a list of dataframes.

decades <- 1950:2019 %>% split(sort(rep(1:7, 10)))

bal_top_hr_decades <- map(decades, ~subset_batting_stats(batting_data = data, years = .x, teams_chosen = "BAL", category = "hr"))

## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)

class(bal_top_hr_decades)

## [1] "list"

length(bal_top_hr_decades)

## [1] 7
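
The line that builds decades is compact, so here is a sketch that unpacks it step by step using only base R. The intermediate name labels is an assumption introduced for illustration.

```r
# rep(1:7, 10) repeats 1..7 ten times; sort() then groups equal labels,
# giving ten 1s, then ten 2s, and so on (same as rep(1:7, each = 10))
labels <- sort(rep(1:7, 10))

# split() cuts the 70 years into one integer vector per label
decades <- split(1950:2019, labels)

names(decades)       # the labels, as strings, rather than decade names
range(decades[[1]])  # first and last year of the first chunk
lengths(decades)     # number of years in each chunk
```

Because the list is named by label rather than by decade, the years themselves are the only record of which decade each element covers.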

This is still probably too much data to make use of: maybe it would be more helpful to just have the top home run hitter from each decade. We can take our list of output dataframes and map head over each of them, asking for just the first row; since each dataframe is already sorted by hr, that row is the decade’s leader.

At this point, we’ll have a list of seven one-row dataframes, and we might as well row-bind them. Unlike base R’s rbind, dplyr::bind_rows will accept a list of dataframes and bind them together. (Another option would have been to use map_dfr, which runs bind_rows on its list-formatted output.)

bal_top1_hr <- bal_top_hr_decades %>%
  map(~head(.x, 1)) %>%
  bind_rows()

bal_top1_hr

## # A tibble: 7 x 3
## # Groups:   player_id [7]
##   player_id team_id    hr
##   <chr>     <chr>   <dbl>
## 1 triangu01 BAL       107
## 2 powelbo01 BAL       202
## 3 mayle01   BAL       116
## 4 murraed02 BAL       254
## 5 ripkeca01 BAL       198
## 6 morame01  BAL       158
## 7 davisch02 BAL       199
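
The map_dfr alternative mentioned above might look like the following sketch, shown on a made-up two-element list so that it runs on its own. The names toy_list, two_step, and one_step are assumptions for illustration. Note that map_dfr is superseded as of purrr 1.0.0 in favor of map followed by list_rbind, though it still works.

```r
library(purrr)
library(dplyr)

# Toy list of already-sorted dataframes (made-up values)
toy_list <- list(
  first  = tibble(id = c("a", "b"), hr = c(9, 4)),
  second = tibble(id = c("c", "d"), hr = c(7, 1))
)

# Two steps: map() returns a list, bind_rows() stacks it
two_step <- toy_list %>%
  map(~head(.x, 1)) %>%
  bind_rows()

# One step: map_dfr() maps and row-binds in a single call
one_step <- toy_list %>%
  map_dfr(~head(.x, 1))

two_step
one_step
```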

This is much more useful, but we do have a small problem: we’ve lost the information about which decade each one came from! One way to rectify this problem is to map a mutate after we run our custom function to add a column that shows which years each row came from. However, we’ll need map2 for this, since we’re going to be iterating over both a list of dataframes and a list of decade vectors. map2 works just like map, except it takes two lists/vectors as inputs, along with a function, and you can specify these inputs in your function as .x and .y.

bal_top1_hr_yearcol <- bal_top_hr_decades %>%
  map2(decades, ~mutate(.x, years = str_c(first(.y), "-", last(.y)))) %>%
  map(~head(.x, 1)) %>%
  bind_rows()

bal_top1_hr_yearcol

## # A tibble: 7 x 4
## # Groups:   player_id [7]
##   player_id team_id    hr years    
##   <chr>     <chr>   <dbl> <chr>    
## 1 triangu01 BAL       107 1950-1959
## 2 powelbo01 BAL       202 1960-1969
## 3 mayle01   BAL       116 1970-1979
## 4 murraed02 BAL       254 1980-1989
## 5 ripkeca01 BAL       198 1990-1999
## 6 morame01  BAL       158 2000-2009
## 7 davisch02 BAL       199 2010-2019
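
To see the map2 pairing in isolation, here is a minimal sketch on made-up inputs. The names toy_list, toy_years, and labelled are assumptions for illustration. Each dataframe is matched with the year vector in the same position, and .x and .y refer to the two paired elements.

```r
library(purrr)
library(dplyr)

# Two toy dataframes and two matching year ranges (made-up values)
toy_list <- list(
  tibble(id = c("a", "b"), hr = c(9, 4)),
  tibble(id = c("c", "d"), hr = c(7, 1))
)
toy_years <- list(1950:1959, 1960:1969)

# .x is the dataframe, .y is the year vector at the same position
labelled <- map2(
  toy_list, toy_years,
  ~mutate(.x, years = paste0(first(.y), "-", last(.y)))
) %>%
  bind_rows()

labelled
```

Both inputs should have the same length, since elements are paired by position.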

What if we wanted to iterate over more than two arguments? We can use pmap for this; better yet, pmap will accept a list or dataframe of input combinations, which makes it easy to set up the right combinations of arguments to iterate over.

input_combos <- tibble(batting_data = list(data), 
                       years = decades[1:4],
                       teams_chosen = c("BAL", "HOU", "TEX", "NYA"),
                       category = c("hr", "rbi", "g", "ab"))

input_combos

## # A tibble: 4 x 4
##   batting_data            years        teams_chosen category
##   <list>                  <named list> <chr>        <chr>   
## 1 <tibble [102,816 x 22]> <int [10]>   BAL          hr      
## 2 <tibble [102,816 x 22]> <int [10]>   HOU          rbi     
## 3 <tibble [102,816 x 22]> <int [10]>   TEX          g       
## 4 <tibble [102,816 x 22]> <int [10]>   NYA          ab

pmap_out_top3 <- pmap(input_combos, subset_batting_stats) %>%
  map2(input_combos$years, ~mutate(.x, years = str_c(first(.y), "-", last(.y)))) %>%
  map_dfr(~head(.x, 3))

## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)
## `summarise()` regrouping output by 'player_id' (override with `.groups` argument)

# with decades added, as above, and top 3 taken from each dataframe
pmap_out_top3

## # A tibble: 12 x 7
## # Groups:   player_id [12]
##    player_id team_id    hr years       rbi     g    ab
##    <chr>     <chr>   <dbl> <chr>     <dbl> <dbl> <dbl>
##  1 triangu01 BAL       107 1950-1959    NA    NA    NA
##  2 niemabo01 BAL        62 1950-1959    NA    NA    NA
##  3 woodlge01 BAL        32 1950-1959    NA    NA    NA
##  4 wynnji01  HOU        NA 1960-1969   441    NA    NA
##  5 asprobo01 HOU        NA 1960-1969   385    NA    NA
##  6 staubru01 HOU        NA 1960-1969   370    NA    NA
##  7 harrato01 TEX        NA 1970-1979    NA   999    NA
##  8 sundbji01 TEX        NA 1970-1979    NA   875    NA
##  9 hargrmi01 TEX        NA 1970-1979    NA   726    NA
## 10 winfida01 NYA        NA 1980-1989    NA    NA  4424
## 11 randowi01 NYA        NA 1980-1989    NA    NA  4249
## 12 mattido01 NYA        NA 1980-1989    NA    NA  4022

# tidy up...
pmap_out_top3 %>%
  group_by(team_id) %>%
  mutate(rank = row_number()) %>%
  pivot_longer(c(rbi, g, ab, hr), names_to = "stat_type") %>%
  filter(!is.na(value))

## # A tibble: 12 x 6
## # Groups:   team_id [4]
##    player_id team_id years      rank stat_type value
##    <chr>     <chr>   <chr>     <int> <chr>     <dbl>
##  1 triangu01 BAL     1950-1959     1 hr          107
##  2 niemabo01 BAL     1950-1959     2 hr           62
##  3 woodlge01 BAL     1950-1959     3 hr           32
##  4 wynnji01  HOU     1960-1969     1 rbi         441
##  5 asprobo01 HOU     1960-1969     2 rbi         385
##  6 staubru01 HOU     1960-1969     3 rbi         370
##  7 harrato01 TEX     1970-1979     1 g           999
##  8 sundbji01 TEX     1970-1979     2 g           875
##  9 hargrmi01 TEX     1970-1979     3 g           726
## 10 winfida01 NYA     1980-1989     1 ab         4424
## 11 randowi01 NYA     1980-1989     2 ab         4249
## 12 mattido01 NYA     1980-1989     3 ab         4022
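
One detail of the pmap call above is worth a small illustration: when given a dataframe, pmap matches column names to the function's argument names, which is why the input_combos columns were named after the arguments of subset_batting_stats. The sketch below uses a hypothetical function, describe, and a made-up tibble, combos, whose columns are deliberately in a different order than the arguments.

```r
library(purrr)
library(tibble)

# Hypothetical function with three named arguments
describe <- function(team, stat, n) paste(team, stat, n)

# Column order differs from argument order; matching is by name
combos <- tibble(
  n    = c(1, 2),
  stat = c("hr", "g"),
  team = c("AAA", "BBB")
)

# One call per row; pmap_chr() collects the results in a character vector
descriptions <- pmap_chr(combos, describe)
descriptions
```

If a column name did not correspond to any argument of the function, the call would fail, so it is worth checking names(combos) against the function signature first.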
