DataRobot API in R
Disclaimer: The appearance of U.S. Department of Defense (DoD) visual information does not imply or constitute DoD endorsement. The views expressed in this presentation are solely those of the author and do not represent the official position of the U.S. Army, DoD, or the federal government.
Personal Introduction
Who Am I?
- Education
- United States Military Academy '07, Bachelor of Science in Operations Research
- Missouri University of Science and Technology, Master of Science in Engineering Management
- THE Ohio State University, Master of Science in Industrial and Systems Engineering
- THE Ohio State University, Graduate Minor in Applied Statistics
- Work
- Schofield Barracks, Hawaii, Engineer Platoon Leader / Executive Officer / Assistant S3 (Operation IRAQI FREEDOM)
- White Sands Missile Range, New Mexico, Engineer A S3/Commander (Operation ENDURING FREEDOM)
- West Point, New York, Assistant Professor
- Fort Belvoir, Virginia, Operations Research Systems Analyst / Data Scientist
General Introduction
What are we doing here?
- Introduce DataRobot in GUI (Graphical User Interface)
- Show DataRobot through the API (Application Programming Interface) with R
- Incorporate some baseball
Who will get the most out of this presentation?
This is for someone who is…
- familiar with R,
- familiar with DataRobot, or
- familiar with general machine learning workflow.
What should you expect to gain from this?
From this presentation, you should gain…
- an understanding of how the DataRobot API in R process works,
- why you would want to do this, and
- where to go to get more information.
DataRobot Introduction
What is AutoML?
AutoML (Automated Machine Learning) is a process that automates the steps of building machine learning models.
These include (a short hand-rolled sketch follows this list):
- Splitting data into test/train partitions
- Handling missing data
- Centering / scaling data
- Determining the best loss function to minimize
- Tuning parameters through cross validation
- Selecting different models
- Making predictions on new data
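For contrast, this is roughly what a few of those steps look like when done by hand in R. A minimal sketch only; df, y, x1, and x2 are hypothetical data and column names.

library(rsample)

set.seed(42)
split <- rsample::initial_split(df, prop = 0.8, strata = "y") ## split data into train/test partitions
train <- rsample::training(split)
test  <- rsample::testing(split)

x1_median <- median(train$x1, na.rm = TRUE) ## handle missing data with a training-set median impute
train$x1[is.na(train$x1)] <- x1_median
test$x1[is.na(test$x1)]   <- x1_median

fit   <- glm(y ~ x1 + x2, data = train, family = binomial()) ## choose and fit a model (minimizes LogLoss)
preds <- predict(fit, newdata = test, type = "response") ## make predictions on new data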
Benefits
- AutoML can find high performing models quickly
- Creating models requires little experience
Drawbacks
- Users may not fully understand the recommended model
What is DataRobot?
According to the DataRobot webpage, “DataRobot is the leading end-to-end enterprise AI platform that automates and accelerates every step of your path from data to value.”
PAE (Program Analysis and Evaluation) hosts a version of DataRobot on the cPROBE cloud at https://deep-green.train.cprobe.army.mil/
Practical Example
Data Introduction
Hall of Fame Baseball Data originated from The Lahman Data Set
Data contains:
- Career statistics from players who played at least 10 seasons
- Last year tracked ranges from 1880 to 2016
- Statistics are accurate as of 2016 for players whose careers extended beyond 2016
- Hall-Of-Fame Selection status (0 for no, 1 for yes)
historical
- Players who have either made the Hall-Of-Fame or are no longer eligible (careers ended at or before 2003)
eligible
- Players who have not made the Hall-Of-Fame and are still eligible
library(tidyverse) ## loads readr, dplyr, stringr, forcats, etc. used throughout

historical <- read_csv("01_data/historical_baseball.csv") ## read in data
eligible   <- read_csv("01_data/eligible_baseball.csv") ## read in data
Split into Train/Test
split    <- rsample::initial_split(historical, strata = "inducted", prop = .5) ## creates a data split object
training <- rsample::training(split) ## extracts the training data from the data split object
testing  <- rsample::testing(split) ## extracts the testing data from the data split object
Hall-Of-Fame / Non-Hall-Of-Fame Breakdown
# A tibble: 2 × 3
inducted train test
<dbl> <int> <int>
1 0 1496 1501
2 1 84 80
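A minimal sketch, assuming dplyr, of how such a breakdown can be computed from the split:

training %>% count(inducted, name = "train") %>% ## count Hall-of-Famers vs. not in the training data
  left_join(testing %>% count(inducted, name = "test"), by = "inducted") ## add the testing-data counts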
DataRobot on cPROBE
DataRobot on cPROBE provides the traditional interactive tool to build many machine learning models according to preset conditions.
After you upload data, you set up a few modeling parameters.
After modeling is complete, it recommends a ‘best model’.
Select a model, upload data, execute predictions, then download the predictions.
DataRobot Through R API
Connect to DataRobot
library(datarobot)
ConnectToDataRobot(endpoint = "https://deep-green.train.cprobe.army.mil/api/v2",
token = Sys.getenv("key")
) ## connects to data robot
Get an API Key Step 1
Get an API Key Step 2
Optional: Place Key in .Renviron File
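A minimal sketch of this step. It assumes the usethis package is available to open the file; the token value is a placeholder.

usethis::edit_r_environ() ## opens the user-level ~/.Renviron file for editing

Add a single line to that file, then restart R so Sys.getenv("key") (used above) can find the token:

key=YOUR_DATAROBOT_API_TOKEN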
Start Project
proj_name <- str_c("baseball_hof_", lubridate::today(), "_exhibition") ## create project name
StartProject(
dataSource = training, ## specifies data - can be dataframe, csv, zip file
projectName = proj_name, ## name of project
wait = TRUE, ## keeps console engaged during datarobot execution
maxWait = 1000, ## how long to wait for completion in seconds
verbosity = 1, ## how much feedback in the console: 1 - lots. 0 - none.
checkInterval = 5, ## how often in seconds it provides an update of progress
target = "inducted", # target column from data
metric = "LogLoss", # loss function to optimize
# rmse
mode = "quick", # datarobot's mode of model search
# "auto", "manual" or "quick"
targetType = "Binary", # target variable type
# "Multiclass", "Regression", "Regression",
positiveClass = 1, # if binary, what is "yes" in the data
workerCount = "max" # how many workers execute models in parallel
)
Look Up Project
All Projects
project_list <- ListProjects() %>% as.data.frame() ## returns list of projects

project_list %>% DT::datatable() ## shows list of projects
Project of Interest
project_list %>% filter(projectName == proj_name) %>% DT::datatable() ## filter for project of interest
Extract Project ID
proj_id <-
  project_list %>%
  filter(projectName == proj_name) %>% ## filter for project of interest
  pull(projectId) ## 'pull' out the column 'projectId' which has our project id

proj_id
[1] "6142228471d20135f56a4159"
Find Project Status
GetProjectStatus(project = proj_id) ## finds the status of the project
$autopilotDone
[1] TRUE
$stageDescription
[1] "Ready for modeling"
$stage
[1] "modeling"
Update Owners
Share(object = GetProject(project = proj_id), c("first.m.last.mil@mail.mil"), role = "OWNER")
## shares project with other users
Make Predictions
Upload Prediction Data and Make Predictions
UploadPredictionDataset(project = proj_id, dataSource = testing) ## uploads the testing data for prediction
Extract Upload Data Information
dataset_info <-
  ListPredictionDatasets(project = proj_id) %>% as_tibble() ## lists the prediction data information

dataset_info %>% DT::datatable()
Determine Recommended Model
recommended_model <-
  GetRecommendedModel(project = GetProject(project = proj_id), ## specify project id
                      type = RecommendedModelType$RecommendedForDeployment ## specify which model to request
  ) ## extracts the recommended model

# other options:
# RecommendedModelType$MostAccurate
# RecommendedModelType$FastAccurate
# RecommendedModelType$RecommendedForDeployment
Get Projections
predict_job_id <-
  RequestPredictions(
    project = proj_id, ## provide project id
    modelId = recommended_model$modelId, ## provide recommended model id
    datasetId = dataset_info$id ## specify dataset to run prediction from best model
  ) ## kicks off predictions in data robot

predictions <-
  GetPredictions(project = proj_id, ## specify the project id
                 predictId = predict_job_id, ## specify the prediction job from previous code
                 type = "raw" ## raw specifies we want the predictions to be probabilities
  ) ## extracts the predictions from datarobot

predictions %>%
  mutate(positiveProbability = round(positiveProbability, 2)) %>% ## round predictions to two decimals
  DT::datatable() ## view the predictions
Assess Performance
Join Predictions with Testing Data
metric_data <-
  testing %>% ## take the testing data
  mutate(prob = predictions$positiveProbability, ## add in column of probabilities from predictions
         class = predictions$prediction) %>% ## add in column of class predictions (hof or not)
  select(player_id, inducted, prob, class) %>% ## select columns of interest
  mutate(inducted = fct_rev(as.factor(inducted))) %>% ## reorder the factors - they defaulted wrong upon download
  mutate(class = fct_rev(as.factor(class))) %>% ## reorder the factors - they defaulted wrong upon download
  mutate(prob = round(prob, 2)) %>% ## round probabilities to two decimals
  left_join(
    read_csv("01_data/master.csv") %>% janitor::clean_names() %>% mutate(name = str_c(name_first, " ", name_last)) %>% select(player_id, name)
  ) %>% ## join in name data so we can see which players belong to which player_id
  relocate(player_id, name)

DT::datatable(metric_data)
Calculate Performance Metrics
metric_data %>%
  yardstick::metrics(truth = inducted, estimate = class) ## calculate performance
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.969
2 kap binary 0.637
metric_data %>%
  yardstick::roc_auc(inducted, prob) ## calculate area under the receiver operating characteristic curve
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.968
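For a quick look at where the errors fall, the same metric_data can also feed a confusion matrix (a small optional sketch using yardstick):

metric_data %>%
  yardstick::conf_mat(truth = inducted, estimate = class) ## cross-tabulates predicted vs. actual classes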
Get Projections for Future Data
Make a Model on All Historical Data
project_name_full <- str_c("baseball_hof_", lubridate::today(), "_exhibition_full_model") ## create project name
StartProject(
dataSource = historical, ## specifies data - can be dataframe, csv, zip file
projectName = project_name_full, ## name of project
wait = TRUE, ## keeps console engaged during datarobot execution
maxWait = 1000, ## how long to wait for completion in seconds
verbosity = 1, ## how much feedback in the console: 1 - lots. 0 - none.
checkInterval = 5, ## how often in seconds it provides an update of progress
target = "inducted", # target column from data
metric = "LogLoss", # loss function to optimize
# rmse
mode = "quick", # datarobot's mode of model search
# "auto", "manual" or "quick"
targetType = "Binary", # target variable type
# "Multiclass", "Regression", "Regression",
positiveClass = 1, # if binary, what is "yes" in the data
workerCount = "max" # how many workers execute models in parallel
)
project_list <- ListProjects() %>% as.data.frame() ## returns list of projects

proj_id <-
  project_list %>%
  filter(projectName == project_name_full) %>% ## filter for project of interest
  pull(projectId) ## 'pull' out the column 'projectId' which has our project id

recommended_model <-
  GetRecommendedModel(project = GetProject(project = proj_id), ## specify project id
                      type = RecommendedModelType$RecommendedForDeployment ## specify which model to request
  ) ## extracts the recommended model
Use the Model Built on All Historical Data to Predict on Future Data
UploadPredictionDataset(project = proj_id, dataSource = eligible) ## uploads the eligible-player data for prediction
dataset_info <-
  ListPredictionDatasets(project = proj_id) %>% as_tibble() ## lists the prediction data information

predict_job_id <-
  RequestPredictions(
    project = proj_id, ## provide project id
    modelId = recommended_model$modelId, ## provide recommended model id
    datasetId = dataset_info$id ## specify dataset to run prediction from best model
  ) ## kicks off predictions in data robot

predictions <-
  GetPredictions(project = proj_id, ## specify the project id
                 predictId = predict_job_id, ## specify the prediction job from previous code
                 type = "raw" ## raw specifies we want the predictions to be probabilities
  ) ## extracts the predictions from datarobot
Look at Results
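One way to assemble those results for viewing. A sketch that reuses the earlier master.csv join; it assumes eligible has a player_id column like the other data and that predictions come back in the same row order as the uploaded data (the same assumption made earlier).

predictions %>%
  mutate(player_id = eligible$player_id, ## predictions return in the same row order as the upload
         prob = round(positiveProbability, 2)) %>% ## round probabilities to two decimals
  left_join(
    read_csv("01_data/master.csv") %>% janitor::clean_names() %>%
      mutate(name = str_c(name_first, " ", name_last)) %>% select(player_id, name),
    by = "player_id"
  ) %>% ## join in name data
  select(player_id, name, prob) %>%
  arrange(desc(prob)) %>% ## most likely future Hall-of-Famers at the top
  DT::datatable()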
Other Useful Functions (non-exhaustive)
ListModels(project = proj_id)
## this shows all different models available that datarobot created
ListBlueprints(project = proj_id)
## this lists all blueprints - the types of models datarobot can build for the kind of problem the data requires
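For example, the model list can be turned into a data frame to browse the leaderboard and select a model yourself rather than relying on the recommendation (a sketch):

model_df <- ListModels(project = proj_id) %>% as.data.frame() ## one row per model on the leaderboard
chosen   <- GetModel(project = proj_id, modelId = model_df$modelId[1]) ## fetch a specific model by its id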
Parting Thoughts
Why would I want to do this?
- Automate a model in production that updates with new data every day (see the sketch below)
- COVID work example
- DataRobot can handle larger data sizes through the API than through the GUI
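A sketch of the first bullet. The file paths, output folder, and scheduling mechanism are hypothetical; the API calls mirror the ones shown earlier, placed in a stand-alone script that a scheduler such as cron or Windows Task Scheduler runs each day.

## predict_daily.R -- hypothetical script run once a day by a scheduler
library(datarobot)
library(readr)

ConnectToDataRobot(endpoint = "https://deep-green.train.cprobe.army.mil/api/v2",
                   token = Sys.getenv("key"))

proj_id  <- "6142228471d20135f56a4159" ## project id found earlier, hard-coded for the scheduled run
new_data <- read_csv("01_data/new_records.csv") ## hypothetical file refreshed each day

dataset <- UploadPredictionDataset(project = proj_id, dataSource = new_data) ## upload today's data
job_id  <- RequestPredictions(project = proj_id,
                              modelId = GetRecommendedModel(GetProject(proj_id))$modelId,
                              datasetId = dataset$id) ## kick off predictions
preds   <- GetPredictions(project = proj_id, predictId = job_id, type = "raw") ## pull them back down

write_csv(preds, paste0("02_output/predictions_", Sys.Date(), ".csv")) ## hypothetical output folder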