DataRobot API in R
Disclaimer: The appearance of U.S. Department of Defense (DoD) visual information does not imply or constitute DoD endorsement. The views expressed in this presentation are solely those of the author and do not represent the official position of the U.S. Army, DoD, or the federal government.
Personal Introduction
Who Am I?
- Education
- United States Military Academy '07, Bachelor of Science in Operations Research
- Missouri University of Science and Technology, Master of Science in Engineering Management
- THE Ohio State University, Master of Science in Industrial and Systems Engineering
- THE Ohio State University, Graduate Minor in Applied Statistics
- Work
- Schofield Barracks, Hawaii, Engineer Platoon Leader / Executive Officer / Assistant S3 (Operation IRAQI FREEDOM)
- White Sands Missile Range, New Mexico, Engineer A S3/Commander (Operation ENDURING FREEDOM)
- West Point, New York, Assistant Professor
- Fort Belvoir, Virginia, Operations Research Systems Analyst / Data Scientist
General Introduction
What are we doing here?
- Introduce DataRobot in GUI (Graphical User Interface)
- Show DataRobot through the API (Application Programming Interface) with R
- Incorporate some baseball
Who will get the most out of this presentation?
This is for someone who is…
- familiar with R,
- familiar with DataRobot, or
- familiar with general machine learning workflow.
What should you expect to gain from this?
From this presentation, you should gain…
- an understanding of how the DataRobot API in R process works,
- why you would want to do this, and
- where to go to get more information.
DataRobot Introduction
What is AutoML?
AutoML (Automated Machine Learning) is a process that automates the steps of building machine learning models.
These include (a short hand-rolled sketch follows this list):
- Splitting data into test/train partitions
- Handling missing data
- Centering / scaling data
- Determining the best loss function to minimize
- Tuning parameters through cross validation
- Selecting different models
- Making predictions on new data
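For contrast, this is roughly what a few of those steps look like when done by hand in R. A minimal sketch only; df, y, x1, and x2 are hypothetical data and column names.

library(rsample)

set.seed(42)
split <- rsample::initial_split(df, prop = 0.8, strata = "y") ## split data into train/test partitions
train <- rsample::training(split)
test  <- rsample::testing(split)

x1_median <- median(train$x1, na.rm = TRUE) ## handle missing data with a training-set median impute
train$x1[is.na(train$x1)] <- x1_median
test$x1[is.na(test$x1)]   <- x1_median

fit   <- glm(y ~ x1 + x2, data = train, family = binomial()) ## choose and fit a model (minimizes LogLoss)
preds <- predict(fit, newdata = test, type = "response") ## make predictions on new data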
Benefits
- AutoML can find high performing models quickly
- Creating models requires little experience
Drawbacks
- Users may not fully understand the recommended model
What is DataRobot?
According to the DataRobot webpage, “DataRobot is the leading end-to-end enterprise AI platform that automates and accelerates every step of your path from data to value.”
PAE (Program Analysis and Evaluation) hosts a version of DataRobot on the cPROBE cloud at https://deep-green.train.cprobe.army.mil/
Practical Example
Data Introduction
Hall of Fame Baseball Data originated from The Lahman Data Set
Data contains:
- Career statistics from players who played at least 10 seasons
- Last year tracked ranges from 1880 to 2016
- Statistics are accurate as of 2016 for players whose careers extended beyond 2016
- Hall-Of-Fame Selection status (0 for no, 1 for yes)
historical
- Players who have either made the Hall-Of-Fame or are no longer eligible (careers ended at or before 2003)
eligible
- Players who have not made the Hall-Of-Fame and are still eligible
library(tidyverse) ## loads readr, dplyr, stringr, forcats, etc. used throughout

historical <- read_csv("01_data/historical_baseball.csv") ## read in data
eligible   <- read_csv("01_data/eligible_baseball.csv") ## read in data
Split into Train/Test
split    <- rsample::initial_split(historical, strata = "inducted", prop = .5) ## creates a data split object
training <- rsample::training(split) ## extracts the training data from the data split object
testing  <- rsample::testing(split) ## extracts the testing data from the data split object
Hall-Of-Fame / Non-Hall-Of-Fame Breakdown
# A tibble: 2 × 3
inducted train test
<dbl> <int> <int>
1 0 1496 1501
2 1 84 80
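A minimal sketch, assuming dplyr, of how such a breakdown can be computed from the split:

training %>% count(inducted, name = "train") %>% ## count Hall-of-Famers vs. not in the training data
  left_join(testing %>% count(inducted, name = "test"), by = "inducted") ## add the testing-data counts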
DataRobot on cPROBE
DataRobot on cPROBE provides the traditional interactive tool to build many machine learning models according to preset conditions.
After you upload data, you set up a few modeling parameters.
After modeling is complete, it recommends a ‘best model’.
Select a model, upload data, execute predictions, then download the predictions.
DataRobot Through R API
Connect to DataRobot
library(datarobot)
ConnectToDataRobot(endpoint = "https://deep-green.train.cprobe.army.mil/api/v2",
token = Sys.getenv("key")
) ## connects to data robot
Get an API Key Step 1
Get an API Key Step 2
Optional: Place Key in .Renviron File
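A minimal sketch of this step. It assumes the usethis package is available to open the file; the token value is a placeholder.

usethis::edit_r_environ() ## opens the user-level ~/.Renviron file for editing

Add a single line to that file, then restart R so Sys.getenv("key") (used above) can find the token:

key=YOUR_DATAROBOT_API_TOKEN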
Start Project
proj_name <- str_c("baseball_hof_", lubridate::today(), "_exhibition") ## create project name
StartProject(
dataSource = training, ## specifies data - can be dataframe, csv, zip file
projectName = proj_name, ## name of project
wait = TRUE, ## keeps console engaged during datarobot execution
maxWait = 1000, ## how long to wait for completion in seconds
verbosity = 1, ## how much feedback in the console: 1 - lots. 0 - none.
checkInterval = 5, ## how often in seconds it provides an update of progress
target = "inducted", # target column from data
metric = "LogLoss", # loss function to optimize
# rmse
mode = "quick", # datarobot's mode of model search
# "auto", "manual" or "quick"
targetType = "Binary", # target variable type
# "Multiclass", "Regression", "Regression",
positiveClass = 1, # if binary, what is "yes" in the data
workerCount = "max" # how many workers execute models in parallel
)
Look Up Project
All Projects
project_list <- ListProjects() %>% as.data.frame() ## returns list of projects

project_list %>% DT::datatable() ## shows list of projects
Project of Interest
project_list %>% filter(projectName == proj_name) %>% DT::datatable() ## filter for project of interest
Extract Project ID
proj_id <-
  project_list %>%
  filter(projectName == proj_name) %>% ## filter for project of interest
  pull(projectId) ## 'pull' out the column 'projectId' which has our project id

proj_id
[1] "6142228471d20135f56a4159"
Find Project Status
GetProjectStatus(project = proj_id) ## finds the status of the project
$autopilotDone
[1] TRUE
$stageDescription
[1] "Ready for modeling"
$stage
[1] "modeling"
Update Owners
Share(object = GetProject(project = proj_id), c("first.m.last.mil@mail.mil"), role = "OWNER")
## shares project with other users
Make Predictions
Upload Prediction Data and Make Predictions
UploadPredictionDataset(project = proj_id, dataSource = testing) ## uploads the testing data for prediction
Extract Upload Data Information
dataset_info <-
  ListPredictionDatasets(project = proj_id) %>% as_tibble() ## lists the prediction data information

dataset_info %>% DT::datatable()
Determine Recommended Model
recommended_model <-
  GetRecommendedModel(project = GetProject(project = proj_id), ## specify project id
                      type = RecommendedModelType$RecommendedForDeployment ## specify which model to request
  ) ## extracts the recommended model

# other options:
# RecommendedModelType$MostAccurate
# RecommendedModelType$FastAccurate
# RecommendedModelType$RecommendedForDeployment
Get Projections
predict_job_id <-
  RequestPredictions(
    project = proj_id, ## provide project id
    modelId = recommended_model$modelId, ## provide recommended model id
    datasetId = dataset_info$id ## specify dataset to run prediction from best model
  ) ## kicks off predictions in data robot

predictions <-
  GetPredictions(project = proj_id, ## specify the project id
                 predictId = predict_job_id, ## specify the prediction job from previous code
                 type = "raw" ## raw specifies we want the predictions to be probabilities
  ) ## extracts the predictions from datarobot

predictions %>%
  mutate(positiveProbability = round(positiveProbability, 2)) %>% ## round predictions to two decimals
  DT::datatable() ## view the predictions
Assess Performance
Join Predictions with Testing Data
metric_data <-
  testing %>% ## take the testing data
  mutate(prob = predictions$positiveProbability, ## add in column of probabilities from predictions
         class = predictions$prediction) %>% ## add in column of class predictions (hof or not)
  select(player_id, inducted, prob, class) %>% ## select columns of interest
  mutate(inducted = fct_rev(as.factor(inducted))) %>% ## reorder the factors - they defaulted wrong upon download
  mutate(class = fct_rev(as.factor(class))) %>% ## reorder the factors - they defaulted wrong upon download
  mutate(prob = round(prob, 2)) %>% ## round probabilities to two decimals
  left_join(
    read_csv("01_data/master.csv") %>% janitor::clean_names() %>% mutate(name = str_c(name_first, " ", name_last)) %>% select(player_id, name)
  ) %>% ## join in name data so we can see which players belong to which player_id
  relocate(player_id, name)

DT::datatable(metric_data)
Calculate Performance Metrics
metric_data %>%
  yardstick::metrics(truth = inducted, estimate = class) ## calculate performance
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.969
2 kap binary 0.637
metric_data %>%
  yardstick::roc_auc(inducted, prob) ## calculate area under the receiver operating characteristic curve
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.968
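For a quick look at where the errors fall, the same metric_data can also feed a confusion matrix (a small optional sketch using yardstick):

metric_data %>%
  yardstick::conf_mat(truth = inducted, estimate = class) ## cross-tabulates predicted vs. actual classes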
Get Projections for Future Data
Make a Model on All Historical Data
project_name_full <- str_c("baseball_hof_", lubridate::today(), "_exhibition_full_model") ## create project name
StartProject(
dataSource = historical, ## specifies data - can be dataframe, csv, zip file
projectName = project_name_full, ## name of project
wait = TRUE, ## keeps console engaged during datarobot execution
maxWait = 1000, ## how long to wait for completion in seconds
verbosity = 1, ## how much feedback in the console: 1 - lots. 0 - none.
checkInterval = 5, ## how often in seconds it provides an update of progress
target = "inducted", # target column from data
metric = "LogLoss", # loss function to optimize
# rmse
mode = "quick", # datarobot's mode of model search
# "auto", "manual" or "quick"
targetType = "Binary", # target variable type
# "Multiclass", "Regression", "Regression",
positiveClass = 1, # if binary, what is "yes" in the data
workerCount = "max" # how many workers execute models in parallel
)
project_list <- ListProjects() %>% as.data.frame() ## returns list of projects

proj_id <-
  project_list %>%
  filter(projectName == project_name_full) %>% ## filter for project of interest
  pull(projectId) ## 'pull' out the column 'projectId' which has our project id

recommended_model <-
  GetRecommendedModel(project = GetProject(project = proj_id), ## specify project id
                      type = RecommendedModelType$RecommendedForDeployment ## specify which model to request
  ) ## extracts the recommended model
Use the Model Built on All Historical Data to Predict on Future Data
UploadPredictionDataset(project = proj_id, dataSource = eligible) ## uploads the eligible-player data for prediction
dataset_info <-
  ListPredictionDatasets(project = proj_id) %>% as_tibble() ## lists the prediction data information

predict_job_id <-
  RequestPredictions(
    project = proj_id, ## provide project id
    modelId = recommended_model$modelId, ## provide recommended model id
    datasetId = dataset_info$id ## specify dataset to run prediction from best model
  ) ## kicks off predictions in data robot

predictions <-
  GetPredictions(project = proj_id, ## specify the project id
                 predictId = predict_job_id, ## specify the prediction job from previous code
                 type = "raw" ## raw specifies we want the predictions to be probabilities
  ) ## extracts the predictions from datarobot
Look at Results
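One way to assemble those results for viewing. A sketch that reuses the earlier master.csv join; it assumes eligible has a player_id column like the other data and that predictions come back in the same row order as the uploaded data (the same assumption made earlier).

predictions %>%
  mutate(player_id = eligible$player_id, ## predictions return in the same row order as the upload
         prob = round(positiveProbability, 2)) %>% ## round probabilities to two decimals
  left_join(
    read_csv("01_data/master.csv") %>% janitor::clean_names() %>%
      mutate(name = str_c(name_first, " ", name_last)) %>% select(player_id, name),
    by = "player_id"
  ) %>% ## join in name data
  select(player_id, name, prob) %>%
  arrange(desc(prob)) %>% ## most likely future Hall-of-Famers at the top
  DT::datatable()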
Other Useful Functions (non-exhaustive)
ListModels(project = proj_id)
## this shows all different models available that datarobot created
ListBlueprints(project = proj_id)
## this lists all blueprints - the types of models datarobot can build for the kind of problem the data requires
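For example, the model list can be turned into a data frame to browse the leaderboard and select a model yourself rather than relying on the recommendation (a sketch):

model_df <- ListModels(project = proj_id) %>% as.data.frame() ## one row per model on the leaderboard
chosen   <- GetModel(project = proj_id, modelId = model_df$modelId[1]) ## fetch a specific model by its id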
Parting Thoughts
Why would I want to do this?
- Automate a model in production that updates with new data every day (see the sketch below)
- COVID work example
- DataRobot can handle larger data sizes through the API than through the GUI
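A sketch of the first bullet. The file paths, output folder, and scheduling mechanism are hypothetical; the API calls mirror the ones shown earlier, placed in a stand-alone script that a scheduler such as cron or Windows Task Scheduler runs each day.

## predict_daily.R -- hypothetical script run once a day by a scheduler
library(datarobot)
library(readr)

ConnectToDataRobot(endpoint = "https://deep-green.train.cprobe.army.mil/api/v2",
                   token = Sys.getenv("key"))

proj_id  <- "6142228471d20135f56a4159" ## project id found earlier, hard-coded for the scheduled run
new_data <- read_csv("01_data/new_records.csv") ## hypothetical file refreshed each day

dataset <- UploadPredictionDataset(project = proj_id, dataSource = new_data) ## upload today's data
job_id  <- RequestPredictions(project = proj_id,
                              modelId = GetRecommendedModel(GetProject(proj_id))$modelId,
                              datasetId = dataset$id) ## kick off predictions
preds   <- GetPredictions(project = proj_id, predictId = job_id, type = "raw") ## pull them back down

write_csv(preds, paste0("02_output/predictions_", Sys.Date(), ".csv")) ## hypothetical output folder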