Lesson 25 - Multiple Linear Regression II

Lesson Administration

Calendar

Day 1

Day 2

SIL 2 Points

Milestone 5

Exploration Exercise 4.4

  • Lesson 26
  • 12-13 November
  • Link: TBD

Milestone 6: Draft Tech Report

Project Presentations

  • Need to know who will not be here on presentation day because of CPRC

Milestone 7

Milestone 8

TEE Times

Date              Start  End
Wed, 17 Dec 2025  1300   1630
Thu, 18 Dec 2025  0730   1100

DMath Basketball!!

Math 1 vs…

1-0

Cal

Reese

I got nothing… So enjoy these

Multiple Linear Regression

A Reminder

Remember this from last class?

mod1 <- lm(formula = mpg ~ wt + as.factor(cyl), data = mtcars)
summary(mod1)

Call:
lm(formula = mpg ~ wt + as.factor(cyl), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5890 -1.2357 -0.5159  1.3845  5.7915 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      33.9908     1.8878  18.006  < 2e-16 ***
wt               -3.2056     0.7539  -4.252 0.000213 ***
as.factor(cyl)6  -4.2556     1.3861  -3.070 0.004718 ** 
as.factor(cyl)8  -6.0709     1.6523  -3.674 0.000999 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.557 on 28 degrees of freedom
Multiple R-squared:  0.8374,    Adjusted R-squared:   0.82 
F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11

The fitted regression equation is:

\[ \widehat{mpg} = 33.99 - 3.21(\text{wt}) - 4.26(\text{cyl6}) - 6.07(\text{cyl8}) \]

To predict fuel efficiency for a 3000-lb, 6-cylinder car (wt is recorded in thousands of pounds, so wt = 3.0):

\[ \begin{aligned} \widehat{mpg} &= 33.99 - 3.21(3.0) - 4.26(1) - 6.07(0) \\[6pt] &= 33.99 - 9.63 - 4.26 \\[6pt] &= 20.10 \end{aligned} \]

So the model predicts:

\[ \boxed{\widehat{mpg} = 20.1} \]
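The same prediction can be reproduced in R with predict(), as a quick check on the hand calculation. This is a minimal sketch: the data frame name new_car is made up for illustration. The result will differ slightly from 20.10 because R uses the unrounded coefficients.

```r
# Refit the model from above so this block runs on its own
mod1 <- lm(mpg ~ wt + as.factor(cyl), data = mtcars)

# Hypothetical new observation: 3000 lb (wt = 3.0) and 6 cylinders
new_car <- data.frame(wt = 3.0, cyl = 6)

# Point prediction from the fitted model
pred <- predict(mod1, newdata = new_car)
pred
```

Adding interval = "prediction" to the predict() call would also return a prediction interval, if one is wanted.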

Interactions

mod2 <- lm(formula = mpg ~ wt + as.factor(cyl) + hp + hp:as.factor(cyl), data = mtcars)
summary(mod2)

Call:
lm(formula = mpg ~ wt + as.factor(cyl) + hp + hp:as.factor(cyl), 
    data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1864 -1.4098 -0.4022  1.0186  4.3920 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         41.87732    3.23293  12.953 1.37e-12 ***
wt                  -3.05994    0.68275  -4.482 0.000143 ***
as.factor(cyl)6     -9.98213    5.76950  -1.730 0.095931 .  
as.factor(cyl)8    -11.72793    4.22507  -2.776 0.010276 *  
hp                  -0.09947    0.03487  -2.853 0.008576 ** 
as.factor(cyl)6:hp   0.07809    0.05236   1.492 0.148335    
as.factor(cyl)8:hp   0.08602    0.03703   2.323 0.028601 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.3 on 25 degrees of freedom
Multiple R-squared:  0.8826,    Adjusted R-squared:  0.8544 
F-statistic: 31.32 on 6 and 25 DF,  p-value: 1.831e-10

We can write the fitted model as

\[ \begin{aligned} \widehat{mpg} &= 41.877 - 3.060\,(\text{wt}) - 9.982\,(\text{cyl6}) - 11.728\,(\text{cyl8}) - 0.09947\,(\text{hp}) + 0.07809\,(\text{hp}\times \text{cyl6}) + 0.08602\,(\text{hp}\times \text{cyl8}) \end{aligned} \]
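One way to read the interaction terms is as adjustments to the hp slope for each cylinder group. The sketch below adds the relevant coefficients to get one hp slope per group; the vector name hp_slopes is an assumption for illustration, and the coefficient names are copied from the summary output above.

```r
# Refit the interaction model so this block runs on its own
mod2 <- lm(mpg ~ wt + as.factor(cyl) + hp + hp:as.factor(cyl), data = mtcars)
b <- coef(mod2)

# hp slope within each cylinder group:
# 4-cylinder cars are the reference group, so their slope is the hp coefficient alone
hp_slopes <- c(
  cyl4 = b[["hp"]],
  cyl6 = b[["hp"]] + b[["as.factor(cyl)6:hp"]],
  cyl8 = b[["hp"]] + b[["as.factor(cyl)8:hp"]]
)
round(hp_slopes, 5)
```

With the rounded coefficients from the output, these work out to about -0.099 (4 cylinders), -0.021 (6 cylinders), and -0.013 (8 cylinders) mpg per additional horsepower, holding weight fixed.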

Comparing Models

Which one is better?

Let's turn to \(R^2\)

Model 1: .8374

Model 2: .8826
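Because \(R^2\) never decreases when terms are added, it can help to look at adjusted \(R^2\) as well (0.82 vs. 0.8544 in the two summaries above). The sketch below pulls both values from the fitted models and, as one possible follow-up that this lesson does not require, runs a nested F-test with anova(), which applies here because Model 1's terms are a subset of Model 2's.

```r
# Refit both models so this block runs on its own
mod1 <- lm(mpg ~ wt + as.factor(cyl), data = mtcars)
mod2 <- lm(mpg ~ wt + as.factor(cyl) + hp + hp:as.factor(cyl), data = mtcars)

# R-squared and adjusted R-squared side by side
fit_stats <- sapply(list(model1 = mod1, model2 = mod2), function(m) {
  s <- summary(m)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared)
})
round(fit_stats, 4)

# Nested F-test: do the extra hp terms improve the fit beyond chance?
anova(mod1, mod2)
```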

Validity Conditions

Linearity — the relationship between predictors and the response is roughly linear
Independence — observations are independent of each other
Normal Distribution — residuals are approximately normal
Equal Variance — variability of residuals is consistent across fitted values

Linearity

Check whether the relationship between predictors and the response is roughly linear.

Use a residuals vs fitted plot — you want to see a random scatter (no pattern).

plot(mod1, which = 1)

Independence

We can’t test this from the model alone — it depends on how the data were collected.
You must verify that each observation is independent (e.g., different cars, people, or trials).

Normal Distribution

Check whether residuals are approximately normally distributed using a Q-Q plot.

plot(mod1, which = 2)

Equal Variance

Check for constant variance (homoscedasticity) — residuals should have similar spread across fitted values.

plot(mod1, which = 1)
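The residuals vs. fitted plot is reused above for equal variance. As a hedged alternative, not one of the plots this lesson asks for, plot() on an lm object also offers a scale-location plot (which = 3), where a roughly flat trend suggests similar spread across fitted values.

```r
# Refit the model so this block runs on its own
mod1 <- lm(mpg ~ wt + as.factor(cyl), data = mtcars)

# Scale-location plot: sqrt(|standardized residuals|) against fitted values
plot(mod1, which = 3)
```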

Adventure One - Example of what to do for Milestone 5

library(tidyverse)

data("diamonds")

diamonds |> 
  sample_n(size = 1000) |> 
  pivot_longer(cols = -c(cut, clarity, price, color)) |> 
  ggplot(aes(x = value, y = price, colour = color)) +
  geom_point() +
  facet_wrap(~name, scales = "free")

mod1 <- lm(formula = price ~ carat + depth + table + x + y + z + cut + color + clarity, data = diamonds)
summary(mod1)

Call:
lm(formula = price ~ carat + depth + table + x + y + z + cut + 
    color + clarity, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-21376.0   -592.4   -183.5    376.4  10694.2 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  5753.762    396.630   14.507  < 2e-16 ***
carat       11256.978     48.628  231.494  < 2e-16 ***
depth         -63.806      4.535  -14.071  < 2e-16 ***
table         -26.474      2.912   -9.092  < 2e-16 ***
x           -1008.261     32.898  -30.648  < 2e-16 ***
y               9.609     19.333    0.497  0.61918    
z             -50.119     33.486   -1.497  0.13448    
cut.L         584.457     22.478   26.001  < 2e-16 ***
cut.Q        -301.908     17.994  -16.778  < 2e-16 ***
cut.C         148.035     15.483    9.561  < 2e-16 ***
cut^4         -20.794     12.377   -1.680  0.09294 .  
color.L     -1952.160     17.342 -112.570  < 2e-16 ***
color.Q      -672.054     15.777  -42.597  < 2e-16 ***
color.C      -165.283     14.725  -11.225  < 2e-16 ***
color^4        38.195     13.527    2.824  0.00475 ** 
color^5       -95.793     12.776   -7.498 6.59e-14 ***
color^6       -48.466     11.614   -4.173 3.01e-05 ***
clarity.L    4097.431     30.259  135.414  < 2e-16 ***
clarity.Q   -1925.004     28.227  -68.197  < 2e-16 ***
clarity.C     982.205     24.152   40.668  < 2e-16 ***
clarity^4    -364.918     19.285  -18.922  < 2e-16 ***
clarity^5     233.563     15.752   14.828  < 2e-16 ***
clarity^6       6.883     13.715    0.502  0.61575    
clarity^7      90.640     12.103    7.489 7.06e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared:  0.9198,    Adjusted R-squared:  0.9198 
F-statistic: 2.688e+04 on 23 and 53916 DF,  p-value: < 2.2e-16

mod2 <- lm(formula = price ~ carat + depth + table + x + z + cut + color + clarity, data = diamonds)
summary(mod2)

Call:
lm(formula = price ~ carat + depth + table + x + z + cut + color + 
    clarity, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-21378.8   -592.5   -183.5    376.3  10694.1 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  5768.782    395.474   14.587  < 2e-16 ***
carat       11257.752     48.602  231.630  < 2e-16 ***
depth         -64.003      4.517  -14.168  < 2e-16 ***
table         -26.501      2.911   -9.103  < 2e-16 ***
x           -1000.354     28.795  -34.740  < 2e-16 ***
z             -47.925     33.194   -1.444  0.14880    
cut.L         584.600     22.476   26.010  < 2e-16 ***
cut.Q        -302.211     17.983  -16.805  < 2e-16 ***
cut.C         148.446     15.461    9.601  < 2e-16 ***
cut^4         -20.619     12.371   -1.667  0.09559 .  
color.L     -1952.179     17.342 -112.572  < 2e-16 ***
color.Q      -672.075     15.777  -42.599  < 2e-16 ***
color.C      -165.277     14.725  -11.224  < 2e-16 ***
color^4        38.193     13.526    2.824  0.00475 ** 
color^5       -95.780     12.776   -7.497 6.64e-14 ***
color^6       -48.452     11.614   -4.172 3.02e-05 ***
clarity.L    4097.613     30.256  135.431  < 2e-16 ***
clarity.Q   -1925.133     28.226  -68.205  < 2e-16 ***
clarity.C     982.322     24.150   40.676  < 2e-16 ***
clarity^4    -364.976     19.285  -18.926  < 2e-16 ***
clarity^5     233.635     15.751   14.833  < 2e-16 ***
clarity^6       6.871     13.715    0.501  0.61640    
clarity^7      90.622     12.103    7.487 7.13e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1130 on 53917 degrees of freedom
Multiple R-squared:  0.9198,    Adjusted R-squared:  0.9198 
F-statistic: 2.81e+04 on 22 and 53917 DF,  p-value: < 2.2e-16

## Candidate model
mod3 <- lm(formula = price ~ carat + depth + table + x + cut + color + clarity, data = diamonds)
summary(mod3)

Call:
lm(formula = price ~ carat + depth + table + x + cut + color + 
    clarity, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-21385.0   -592.4   -183.7    376.5  10694.6 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  5935.107    378.328   15.688  < 2e-16 ***
carat       11256.968     48.600  231.626  < 2e-16 ***
depth         -66.769      4.091  -16.322  < 2e-16 ***
table         -26.457      2.911   -9.089  < 2e-16 ***
x           -1029.478     20.549  -50.098  < 2e-16 ***
cut.L         584.717     22.476   26.015  < 2e-16 ***
cut.Q        -302.037     17.983  -16.795  < 2e-16 ***
cut.C         148.065     15.459    9.578  < 2e-16 ***
cut^4         -21.253     12.364   -1.719  0.08562 .  
color.L     -1952.128     17.342 -112.568  < 2e-16 ***
color.Q      -672.207     15.777  -42.608  < 2e-16 ***
color.C      -165.451     14.724  -11.236  < 2e-16 ***
color^4        38.261     13.526    2.829  0.00468 ** 
color^5       -95.816     12.776   -7.500 6.50e-14 ***
color^6       -48.441     11.614   -4.171 3.04e-05 ***
clarity.L    4096.912     30.253  135.423  < 2e-16 ***
clarity.Q   -1924.681     28.224  -68.192  < 2e-16 ***
clarity.C     982.004     24.149   40.664  < 2e-16 ***
clarity^4    -364.870     19.285  -18.920  < 2e-16 ***
clarity^5     233.449     15.751   14.822  < 2e-16 ***
clarity^6       6.973     13.715    0.508  0.61114    
clarity^7      90.738     12.103    7.497 6.63e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1130 on 53918 degrees of freedom
Multiple R-squared:  0.9198,    Adjusted R-squared:  0.9198 
F-statistic: 2.944e+04 on 21 and 53918 DF,  p-value: < 2.2e-16

diamonds |> 
  select(-c(cut, color, clarity)) |> 
  sample_n(1000) |> 
  pairs()
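A pairs plot is one way to spot predictors that move together. As a numeric companion, the sketch below computes correlations among carat and the three size measurements; strong correlations here may help explain why y and z add little once carat and x are in the model. The object names size_vars and cor_mat are assumptions for illustration.

```r
library(ggplot2)  # provides the diamonds data (also loaded by tidyverse)

# Correlation matrix for carat and the physical dimensions
size_vars <- diamonds[, c("carat", "x", "y", "z")]
cor_mat <- cor(size_vars)
round(cor_mat, 2)
```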

mod4 <- lm(formula = price ~ carat + depth + table + x + cut + color + clarity + carat:color, data = diamonds)
summary(mod4)

Call:
lm(formula = price ~ carat + depth + table + x + cut + color + 
    clarity + carat:color, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-16732.9   -513.8   -136.6    358.5  10550.5 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10393.339    359.347  28.923  < 2e-16 ***
carat         12646.011     49.034 257.900  < 2e-16 ***
depth           -97.355      3.860 -25.223  < 2e-16 ***
table           -28.297      2.735 -10.348  < 2e-16 ***
x             -1624.864     20.623 -78.791  < 2e-16 ***
cut.L           594.594     21.114  28.161  < 2e-16 ***
cut.Q          -274.247     16.896 -16.232  < 2e-16 ***
cut.C           132.434     14.523   9.119  < 2e-16 ***
cut^4           -12.347     11.613  -1.063  0.28770    
color.L         329.781     34.252   9.628  < 2e-16 ***
color.Q         513.842     31.643  16.239  < 2e-16 ***
color.C        -119.817     29.087  -4.119 3.81e-05 ***
color^4          69.759     26.171   2.665  0.00769 ** 
color^5         250.440     24.647  10.161  < 2e-16 ***
color^6         125.580     22.195   5.658 1.54e-08 ***
clarity.L      4113.880     28.459 144.556  < 2e-16 ***
clarity.Q     -1981.152     26.554 -74.607  < 2e-16 ***
clarity.C       991.093     22.698  43.665  < 2e-16 ***
clarity^4      -353.021     18.117 -19.485  < 2e-16 ***
clarity^5       208.673     14.801  14.099  < 2e-16 ***
clarity^6        26.858     12.884   2.085  0.03711 *  
clarity^7       107.303     11.372   9.436  < 2e-16 ***
carat:color.L -2454.649     34.150 -71.878  < 2e-16 ***
carat:color.Q -1020.521     31.031 -32.887  < 2e-16 ***
carat:color.C   212.081     29.261   7.248 4.29e-13 ***
carat:color^4    19.164     27.152   0.706  0.48031    
carat:color^5  -396.193     26.071 -15.196  < 2e-16 ***
carat:color^6  -192.062     24.204  -7.935 2.14e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1061 on 53912 degrees of freedom
Multiple R-squared:  0.9293,    Adjusted R-squared:  0.9292 
F-statistic: 2.623e+04 on 27 and 53912 DF,  p-value: < 2.2e-16

## Candidate model
mod5 <- lm(formula = price ~ carat + depth + table + x + cut + color + clarity + carat:color + clarity:carat, data = diamonds)
summary(mod5)

Call:
lm(formula = price ~ carat + depth + table + x + cut + color + 
    clarity + carat:color + clarity:carat, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-22707.3   -301.6     -0.9    271.9   8784.4 

Coefficients:
                 Estimate Std. Error  t value Pr(>|t|)    
(Intercept)     21244.877    290.112   73.230  < 2e-16 ***
carat           16453.723     45.710  359.960  < 2e-16 ***
depth            -163.331      3.068  -53.232  < 2e-16 ***
table             -39.541      2.158  -18.324  < 2e-16 ***
x               -3095.798     18.360 -168.620  < 2e-16 ***
cut.L             514.155     16.659   30.863  < 2e-16 ***
cut.Q            -163.316     13.343  -12.239  < 2e-16 ***
cut.C              71.287     11.459    6.221 4.98e-10 ***
cut^4               7.700      9.160    0.841  0.40059    
color.L           667.031     27.200   24.523  < 2e-16 ***
color.Q           367.411     25.005   14.694  < 2e-16 ***
color.C            15.497     22.987    0.674  0.50022    
color^4           104.494     20.654    5.059 4.22e-07 ***
color^5           200.370     19.445   10.305  < 2e-16 ***
color^6            24.333     17.523    1.389  0.16495    
clarity.L       -3011.134     48.745  -61.774  < 2e-16 ***
clarity.Q        1767.264     44.991   39.280  < 2e-16 ***
clarity.C       -1411.691     37.984  -37.166  < 2e-16 ***
clarity^4         863.630     30.089   28.703  < 2e-16 ***
clarity^5        -454.165     24.029  -18.901  < 2e-16 ***
clarity^6         112.514     20.512    5.485 4.15e-08 ***
clarity^7         -91.019     17.879   -5.091 3.58e-07 ***
carat:color.L   -2936.145     27.224 -107.853  < 2e-16 ***
carat:color.Q    -861.382     24.531  -35.114  < 2e-16 ***
carat:color.C      33.212     23.141    1.435  0.15124    
carat:color^4     -64.953     21.442   -3.029  0.00245 ** 
carat:color^5    -293.108     20.575  -14.246  < 2e-16 ***
carat:color^6     -24.459     19.136   -1.278  0.20119    
carat:clarity.L  7881.781     51.240  153.820  < 2e-16 ***
carat:clarity.Q -2165.582     45.783  -47.301  < 2e-16 ***
carat:clarity.C  1714.611     41.112   41.706  < 2e-16 ***
carat:clarity^4  -945.245     36.053  -26.218  < 2e-16 ***
carat:clarity^5   488.264     31.404   15.548  < 2e-16 ***
carat:clarity^6    28.838     27.156    1.062  0.28827    
carat:clarity^7   279.667     21.976   12.726  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 837 on 53905 degrees of freedom
Multiple R-squared:  0.956, Adjusted R-squared:  0.956 
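F-statistic: 3.445e+04 on 34 and 53905 DF,  p-value: < 2.2e-16

# Linearity: residuals vs. fitted
plot(mod5, which = 1)

# Independence: cannot be checked with a plot; justify from how the data were collected

# Normal distribution: Q-Q plot of the residuals
plot(mod5, which = 2)

# Equal variance: residuals vs. fitted again, this time looking at the spread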
plot(mod5, which = 1)

In-Class Exercise - Adventure Two - Will revisit in Lesson 26

library(tidyverse)
diamonds
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

  1. In addition to carat size, what other variables might be associated with the price of a diamond?

  2. Create a comparative box plot of price (response variable on the y-axis) by cut (explanatory variable on the x-axis).
    Which cut category tends to have higher prices? Is this what you would expect?

  3. Fit a simple linear model for price using carat as the explanatory variable and price as the response variable.

    a. Write out the regression equation with intercept, coefficients, and variable names.
    b. Interpret the coefficient of carat in the context of this model.
    c. What is the strength of evidence that carat is related to price?

  4. Create a scatterplot of price versus carat, colored by cut.
    Do higher-quality cuts tend to cluster at different price or carat ranges?
    How might this influence your interpretation of the relationship between carat and price?

  5. Fit a multiple regression model for price using both carat and cut as explanatory variables.

    a. Write out the regression equation with intercept, coefficients, and variable names.
    b. Interpret the coefficient for cut while controlling for carat.

  6. How much total variation in diamond price is explained by this model (i.e., \(R^2\))?

  7. The model in #5 assumes that the effect of carat on price is the same across all cuts.
    How can we check whether this assumption is valid?

  8. Fit a multiple regression model for price using both carat and cut, including an interaction between them.
    Write out the regression equation with intercept, coefficients, and variable names.

  9. Among the Ideal cut diamonds, how much does price increase for a one-unit increase in carat?

  10. Among the Fair cut diamonds, how much does price increase for a one-unit increase in carat?

  11. Is the interaction between carat and cut statistically significant?
    State your null and alternative hypotheses for the interaction term and report the test statistic and p-value.

  12. To what population are you willing to generalize your results?
    Can you draw a cause-and-effect conclusion about carat size and diamond price? Why or why not?

Partial Solution

Ask a Research Question

  1. In addition to carat size, what other variables might be associated with the price of a diamond?

Design a Study and Explore the Data

The diamonds dataset in R contains information about 53,940 diamonds, including their price (in U.S. dollars), carat, cut, color, clarity, and several physical measurements (x, y, z, depth, table).

We’ll examine how carat size and quality characteristics relate to diamond price.

  2. Use R to create a comparative box plot of price (Response Variable on the y-axis) by cut (Explanatory Variable on the x-axis).
    • Which cut category tends to have higher prices?
    • Is this what you would expect?

ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  labs(title = "Diamond Price by Cut Quality")
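A numeric summary can back up what the box plot shows. The sketch below computes the median price within each cut using base R; the object name med_price is an assumption for illustration.

```r
library(ggplot2)  # provides the diamonds data (also loaded by tidyverse)

# Median price within each cut category, in the factor's level order
med_price <- tapply(diamonds$price, diamonds$cut, median)
med_price
```

Comparing these medians against the box plot is a useful check before deciding whether the pattern matches expectations, keeping in mind that this summary ignores carat.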

  3. Create a simple linear model for price using carat as the Explanatory Variable and price as the Response Variable.

mod1 <- lm(price ~ carat, data = diamonds)
summary(mod1)

Call:
lm(formula = price ~ carat, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-18585.3   -804.8    -18.9    537.4  12731.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2256.36      13.06  -172.8   <2e-16 ***
carat        7756.43      14.07   551.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1549 on 53938 degrees of freedom
Multiple R-squared:  0.8493,    Adjusted R-squared:  0.8493 
F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

    a. Write out the simple linear regression equation with intercept, coefficients, and variable names.

\[ \widehat{price} = b_0 + b_1(\text{carat}) \]
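To fill in \(b_0\) and \(b_1\), the estimates can be read from the Estimate column of the summary output or pulled from the model with coef(). The sketch below builds the equation as a text string; the name eq and the "price-hat" label are assumptions for illustration.

```r
library(ggplot2)  # provides the diamonds data (also loaded by tidyverse)

# Refit the simple linear model so this block runs on its own
mod1 <- lm(price ~ carat, data = diamonds)
b <- coef(mod1)

# Paste the rounded estimates into the general form of the equation
eq <- sprintf("price-hat = %.2f + %.2f(carat)", b[["(Intercept)"]], b[["carat"]])
eq
```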

    b. Interpret the coefficient of carat in the context of this model.
    c. Based on the p-value for the slope, what is the strength of evidence that carat is related to price?

  4. Generate a scatterplot of price (y-axis) versus carat (x-axis), colored by cut.

ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.5) +
  labs(title = "Diamond Price vs. Carat, by Cut")

  • Do higher-quality cuts tend to cluster at different price or carat ranges?
  • How might this influence your interpretation of the relationship between carat and price?
  5. Fit a multiple regression model for price using both carat and cut as Explanatory Variables.

mod2 <- lm(price ~ carat + cut, data = diamonds)
summary(mod2)

Call:
lm(formula = price ~ carat + cut, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-17540.7   -791.6    -37.6    522.1  12721.4 

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -2701.38      15.43 -175.061  < 2e-16 ***
carat        7871.08      13.98  563.040  < 2e-16 ***
cut.L        1239.80      26.10   47.502  < 2e-16 ***
cut.Q        -528.60      23.13  -22.851  < 2e-16 ***
cut.C         367.91      20.21   18.201  < 2e-16 ***
cut^4          74.59      16.24    4.593 4.37e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1511 on 53934 degrees of freedom
Multiple R-squared:  0.8565,    Adjusted R-squared:  0.8565 
F-statistic: 6.437e+04 on 5 and 53934 DF,  p-value: < 2.2e-16
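The rows cut.L, cut.Q, cut.C, and cut^4 appear because cut is stored as an ordered factor, so R codes it with polynomial contrasts rather than one indicator per category. One hedged way to get coefficients that read as "this cut compared with Fair" is to refit with an unordered copy of cut. The names d, cut_f, and mod2_trt below are assumptions for illustration; the fit itself is the same model, only parameterized differently.

```r
library(ggplot2)  # provides the diamonds data (also loaded by tidyverse)

# Copy the data and add an unordered version of cut (Fair is the reference level)
d <- as.data.frame(diamonds)
d$cut_f <- factor(d$cut, ordered = FALSE)

# Same predictors as mod2, but cut now enters as indicator variables
mod2_trt <- lm(price ~ carat + cut_f, data = d)
coef(mod2_trt)
```

Each cut_f coefficient is then the estimated difference in price between that cut and a Fair cut diamond of the same carat.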
    a. Write out the multiple regression equation with intercept, coefficients, and variable names.
    b. Interpret the coefficient for cut while controlling for carat.

  6. How much total variation in diamond price is explained by this model (i.e., \(R^2\))?

  7. This model assumes that the effect of carat on price is the same across all cuts.

    • How can we check whether this assumption is valid?

  8. Create a multiple regression model for price using both carat and cut, including an interaction between them.

mod3 <- lm(price ~ carat * cut, data = diamonds)
summary(mod3)

Call:
lm(formula = price ~ carat * cut, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-14878.3   -793.0    -23.0    546.3  12706.2 

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -2271.95      20.94 -108.513  < 2e-16 ***
carat        7468.05      19.49  383.200  < 2e-16 ***
cut.L        -278.21      57.17   -4.866 1.14e-06 ***
cut.Q         363.22      50.51    7.191 6.50e-13 ***
cut.C        -172.96      42.81   -4.041 5.34e-05 ***
cut^4          67.55      33.40    2.022   0.0431 *  
carat:cut.L  1538.10      50.96   30.183  < 2e-16 ***
carat:cut.Q  -781.89      45.89  -17.037  < 2e-16 ***
carat:cut.C   509.65      41.36   12.321  < 2e-16 ***
carat:cut^4    69.70      34.38    2.027   0.0426 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1498 on 53930 degrees of freedom
Multiple R-squared:  0.8591,    Adjusted R-squared:  0.859 
F-statistic: 3.653e+04 on 9 and 53930 DF,  p-value: < 2.2e-16

Write out the regression equation with intercept, coefficients, and variable names.

\[ \widehat{price} = b_0 + b_1(\text{carat}) + b_2(\text{cut}) + b_3(\text{carat} \times \text{cut}) \]
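In the fitted output, the \(b_2\) and \(b_3\) terms each stand for four contrast terms, because cut is an ordered factor with five levels. Since carat * cut lets every cut have its own intercept and its own carat slope, the per-cut slopes match what separate simple regressions within each cut would give. The sketch below uses that equivalence; the names by_cut and carat_slopes are assumptions for illustration.

```r
library(ggplot2)  # provides the diamonds data (also loaded by tidyverse)

# Split the data by cut and fit price ~ carat within each group
by_cut <- split(diamonds, diamonds$cut)
carat_slopes <- sapply(by_cut, function(d) {
  coef(lm(price ~ carat, data = d))[["carat"]]
})
round(carat_slopes, 2)

# The unweighted mean of these slopes matches the carat coefficient in mod3,
# because polynomial contrasts sum to zero across the levels of cut
mean(carat_slopes)
```

The Ideal and Fair entries of carat_slopes are the quantities the next two questions ask about.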

  9. Among the Ideal cut diamonds, how much does price increase for a one-unit increase in carat?

  10. Among the Fair cut diamonds, how much does price increase for a one-unit increase in carat?

  11. Is the interaction between carat and cut statistically significant?

    • State your null and alternative hypotheses for the interaction term.
    • Report the test statistic and p-value.

  12. To what population are you willing to generalize your results?

    • Can you draw a cause-and-effect conclusion about carat size and diamond price?
    • Why or why not?

  13. Check each of the four Validity Conditions for the multiple regression you ran in #8.

    • Include all three validity plots for your regression model.
    • Justify each of the four conditions: Linearity, Independence, Normality, and Equal Variance.

Linearity

plot(mod3, which = 1)

Independence

Must be justified based on data collection (not tested statistically).

Normal Distribution

plot(mod3, which = 2)

Equal Variance

plot(mod3, which = 1)

Before you leave

Today:

  • Any questions for me?