Parallel Processing and Bootstrapping

Introduction

Often times, I run simulations on my computer to help educate my students about different statistical topics. In doing this, I often end up running fairly “long” simulations that take up more time than what I’d like to use in the classrooms.

Complaining about this let me to Googling, which, as usual, let me to a solution. Posted below is an example of what I learned with an accompanying ShinyApp which helps me explain bootstrapping to my students.

Gratitude

Much of what I learned I figured out from this nice blog post by Matt Jones.

Problem to Solve

To motivate this problem, I decided I wanted to bootstrap from the diamonds data-set to create a range of coefficients which could be the slope and intercept for the linear relationship between the price and cut/carat. Because this could feasibly take a while, I wanted to use parallel processing to split up the bootstrapping portion on several cores.

My Old Way

This is how the old me would have done it. I ran a ‘for’ loop to create 100 bootstraped linear regression of a sample from the data.

data("diamonds")

diamonds = diamonds %>%
  select(price,carat,cut)

coefdf = NULL
allsamples = NULL
system.time({
  for (i in 1:100) {
    subsample = diamonds %>%
      sample_frac(.001) %>%
      mutate(iteration = i)
    result = lm(price~cut+carat,data = subsample)
    coef = coefficients(result)
    coefdf = coefdf %>%
      bind_rows(coef)
    allsamples = allsamples %>%
      bind_rows(subsample)
  }
})

##    user  system elapsed 
##    0.77    0.01    0.78

I tracked the time and it doesn’t take all that long… but then again its only 100 loops.

After executing this loop, I can do my future analysis with the following data:

head(coefdf)

## # A tibble: 6 x 6
##   `(Intercept)` cut.L  cut.Q cut.C `cut^4` carat
##           <dbl> <dbl>  <dbl> <dbl>   <dbl> <dbl>
## 1        -3270.  902.  -343.  291.  -198.  8711.
## 2        -2822. 1608. -1208.  335.   913.  7681.
## 3        -2867. 1834.  -498.  536.   736.  7801.
## 4        -3229. 1222.  -831. 1168.   -35.7 8553.
## 5        -3156. 1686.  -794.  546.  -243.  8297.
## 6        -1886. -573.   559. -208.   216.  7840.

head(allsamples)

## # A tibble: 6 x 4
##   price carat cut       iteration
##   <int> <dbl> <ord>         <int>
## 1  1684  0.54 Ideal             1
## 2  6630  1.1  Ideal             1
## 3  1297  0.45 Ideal             1
## 4  1300  0.39 Premium           1
## 5 16611  2.05 Premium           1
## 6  4704  1    Very Good         1

But you can see how this can quickly balloon into taking too much time. We can solve that problem with parallel processing.

Parallel Processing

This code requires the following libraries:

library(tidyverse)
library(foreach)
library(doParallel)

Typically, all of our work is done is series. Sometimes, if jobs can be done in parallel, we can do this by sending them to different cores in our computer. The amount of time that this saves depends on the speed of your computer, the number of cores in which you can put to work doing jobs, and, of course, the amount of work you need done. Also, there is a ‘start up cost’ to firing up a core - but if you have a lot of repetitive processes that can be spread out and little that must be done for each core, you can save a lot of time.

First thing, we determine how many cores we have available on our computer to do the work.

numCores <- detectCores()
numCores

## [1] 8

registerDoParallel(numCores)

Now I can execute the same code as above with foreach and %dopar%.

data("diamonds")

diamonds = diamonds %>%
  select(price,carat,cut)

allsamples = NULL
trials <- 10000
system.time({
  r <- foreach(i=icount(trials), .combine=rbind, .packages = "tidyverse") %dopar% {
    subsample = diamonds %>%
      sample_frac(.01) 
    result = lm(price~cut+carat+0,data = subsample)
    coefs = data.frame(t(coefficients(result)))
    list(subsample,coefs)
  }
})

##    user  system elapsed 
##    4.47    0.67   16.65

Now we can retrieve the data from our ‘loop’ in the following way:

coefs = 
bind_rows(r[,2],.id = "iteration") %>% as_tibble()
head(coefs)

## # A tibble: 6 x 7
##   iteration cutFair cutGood cutVery.Good cutPremium cutIdeal carat
##   <chr>       <dbl>   <dbl>        <dbl>      <dbl>    <dbl> <dbl>
## 1 result.1   -4350.  -2785.       -2359.     -2687.   -2144. 7969.
## 2 result.2   -3494.  -2775.       -2399.     -2447.   -2187. 7733.
## 3 result.3   -2674.  -2614.       -2157.     -2589.   -1996. 7808.
## 4 result.4   -4353.  -2480.       -2296.     -2158.   -2073. 7675.
## 5 result.5   -3084.  -2766.       -2539.     -2503.   -2185. 7989.
## 6 result.6   -3979.  -2866.       -2325.     -2499.   -2188. 8024.

sampledata = 
bind_rows(r[,1],.id = "iteration") %>% as_tibble()
head(sampledata)

## # A tibble: 6 x 4
##   iteration price carat cut      
##   <chr>     <int> <dbl> <ord>    
## 1 result.1    982  0.4  Very Good
## 2 result.1   6125  1.21 Very Good
## 3 result.1    394  0.3  Very Good
## 4 result.1    492  0.33 Ideal    
## 5 result.1   2954  0.72 Premium  
## 6 result.1   5572  1.04 Ideal

Now the point is to find a 95% confidence interval of what the true \(\beta_0\) and \(\beta_is\) are.

95% Confidence Interval

Factor	low	high
carat	7187.919	7552.785
cutFair	-5449.843	-4580.359
cutGood	-3426.280	-3096.497
cutIdeal	-2451.459	-2275.716
cutPremium	-2937.752	-2702.240
cutVery.Good	-2869.595	-2624.574

Looks like none of the confidence intervals contain 0. Therefore carat does have a relationship with price for each level of cut.

Displaying the Results

The ShinyApp I use to display the results is below.

If you’d rather view the app outside the blog, here is the link

I also include the code for the ShinyApp below the embedded application.

allsamples = NULL
trials <- 10000
  r <-
    foreach(i = icount(trials),.combine = rbind,.packages = "tidyverse") %dopar% {
              subsample = diamonds %>%
                sample_frac(.001)
              result = lm(price ~ cut + carat + 0, data = subsample)
              coefs = data.frame(t(coefficients(result)))
              list(subsample, coefs)
            }

sampledata =
  bind_rows(r[, 1], .id = "iteration")
coefs =
  bind_rows(r[, 2], .id = "iteration")

help =
  sampledata %>% as_tibble() %>% janitor::clean_names()

coefdata =
  coefs %>% as_tibble() %>%
  gather(cuttype, intercept, -carat, -iteration)

shinyApp(
  ui = fluidPage(
    sliderInput(
      "number",
      "Which Iteration to Display",
      min = 1,
      max = trials,
      value = floor(runif(1, 1, trials))
    ),
    plotOutput("diamond")
  ),
  
  server = function(input, output) {
    subhelp = reactive({
      help %>%
        mutate(iteration = as.numeric(str_remove(iteration, "result."))) %>%
        filter(iteration == input$number)
    })
    
    subcoefdata = reactive({
      coefdata %>%
        mutate(iteration = as.numeric(str_remove(iteration, "result."))) %>%
        filter(iteration == input$number)
    })
    
    output$diamond = renderPlot({
      diamonds %>%
        ggplot(aes(x = carat, y = price)) +
        geom_point(alpha = .1) +
        geom_point(data = subhelp(),
                   aes(x = carat, y = price),
                   color = "blue") +
        geom_abline(data = subcoefdata(),
                    aes(intercept = intercept, slope = carat),
                    color = "red") +
        facet_wrap( ~ cuttype, drop = TRUE, nrow = 1) +
        labs(
          x = "Carat",
          y = "Price",
          title = "Price of Diamond as Carat Increases",
          subtitle = "By Cut",
          caption = paste("Currently Showing Bootstrap Sample ", input$number)
        )
    })
  },
  
  options = list(height = 1000)
)