Exploration Exercise 3.2

Overview

This exercise uses data from the 2024 General Social Survey (GSS), in which 2,015 adult Americans were surveyed and 31.1% of them reported living in a different state from where they were born.
We treat the response as a binary categorical variable (“different state” vs. “same state”) and analyze it with one-proportion z-tests and confidence intervals based on the normal approximation to the binomial.


Question 1 – Identify Population and Sample

Q1. Identify the population and sample in this survey.

Population: All adult Americans aged 18 and older.
Sample: The 2,015 adults surveyed in the 2024 GSS.


Question 2 – Representativeness

Q2. Is it reasonable to believe that the sample of 2,015 adult Americans is representative of the population of all adult Americans? Justify your answer in terms of how the data were collected.

Yes, it is reasonable, because representativeness depends on random selection and the GSS employs probability-based sampling across U.S. households rather than convenience or self-selected samples.
Random selection supports generalizing from the 2,015 respondents to the broader adult U.S. population.


Question 3 – Statistic or Parameter

Q3. Is the value 31.1% a statistic or a parameter? Which symbol is typically used to represent this quantity?

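  • 31.1% (0.311) is a sample statistic, not a parameter, because it is computed from the 2,015 sampled adults rather than from the whole population.
  • The symbol for a sample proportion is \(\hat{p}\), so \(\hat{p} = 0.311\).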

Question 4 – Population Parameter

Q4. Identify (in words) the population parameter that the General Social Survey is attempting to estimate.

The population parameter is \(\pi\), the true proportion of all adult Americans who currently live in a different state from where they were born.


Question 5 – Interpreting the Estimate

Q5. Is it reasonable to conclude that exactly 31.1% of all adult Americans currently live in a different state from where they were born? Explain why or why not.

It is not reasonable to conclude that exactly 31.1% of all adult Americans live in a different state from where they were born.
Because of sampling variation, another random sample of 2,015 adults would likely produce a somewhat different \(\hat{p}\).
Instead, we use a confidence interval to estimate a range of plausible values for \(\pi\).


Question 6 – Hypothesis Test for \(\pi = 0.362\)

Q6. Although we expect \(\pi\) to be close to 0.311, suppose we test
\(H_0: \pi = 0.362\) vs. \(H_A: \pi \neq 0.362\).
Use R to calculate the standardized test statistic and the two-sided p-value. Based on your result, would you reject or fail to reject the null hypothesis at \(\alpha = 0.05\)?

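# Given values
n <- 2015        # sample size
phat <- 0.311    # observed sample proportion
null <- 0.362    # hypothesized value of pi under H0

# Standard deviation of p-hat, assuming H0 is true
SD_pi <- sqrt(null * (1 - null) / n)
# Standardized statistic: distance from p-hat to the null value, in SD units
z_stat <- (phat - null) / SD_pi
# Two-sided p-value from the standard normal distribution
p_value <- 2 * (1 - pnorm(abs(z_stat)))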

SD_pi
[1] 0.010706
z_stat
[1] -4.763685
p_value
[1] 1.900887e-06

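Interpretation: The z-statistic measures how many standard deviations \(\hat{p}\) is from the null value. Here \(\hat{p} = 0.311\) lies about 4.76 standard deviations below 0.362.
The two-sided p-value (about \(1.9 \times 10^{-6}\)) is far below \(\alpha = 0.05\), so we reject \(H_0\).
A sample proportion this far from 0.362 would be very unlikely if \(\pi\) were truly 0.362, so 0.362 is not a plausible value for \(\pi\).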


Question 7 – Interpret SD under Null

Q7. Interpret the standard deviation under the null hypothesis that you found in Question #6. Explain, in context, what this value tells you.

\(SD_{\pi}\) (SD_pi in the R code, about 0.0107) is the expected variability in \(\hat{p}\) if the population proportion were truly 0.362.
In context: if \(H_0\) were true, the sample proportions from repeated random samples of 2,015 adults would typically fall about 0.0107 (roughly 1.1 percentage points) from 0.362.
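
The "repeated samples" idea can be checked with a short simulation. The sketch below is an illustration rather than part of the original exercise: the seed and the number of simulated samples (n_sims) are arbitrary choices, and each simulated sample is drawn as a binomial count with \(\pi = 0.362\).

```r
# Simulate the sampling distribution of p-hat under H0: pi = 0.362
set.seed(1)        # arbitrary seed, used only for reproducibility
n <- 2015
null <- 0.362
n_sims <- 10000    # hypothetical number of repeated samples

# Each draw is the number of "different state" responses in one sample of size n
sim_counts <- rbinom(n_sims, size = n, prob = null)
sim_phat <- sim_counts / n

sd_simulated <- sd(sim_phat)                 # spread of the simulated p-hats
sd_formula <- sqrt(null * (1 - null) / n)    # theoretical SD under H0

c(simulated = sd_simulated, formula = sd_formula)
```

The two values should agree closely, which is what the formula \(\sqrt{\pi(1-\pi)/n}\) describes: the spread of \(\hat{p}\) across many samples of the same size.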


Question 8 – Test for \(\pi = 0.50\)

Q8. Now consider \(\pi = 0.50\). Is this a plausible value for the population proportion \(\pi\)?
Test \(H_0: \pi = 0.50\) vs. \(H_A: \pi \neq 0.50\).
Report your test statistic, p-value, and conclusion given \(\alpha = 0.05\).

null2 <- 0.50
SD_pi2 <- sqrt(null2 * (1 - null2) / n)
z_stat2 <- (phat - null2) / SD_pi2
p_value2 <- 2 * (1 - pnorm(abs(z_stat2)))

SD_pi2
[1] 0.01113865
z_stat2
[1] -16.96795
p_value2
[1] 0

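Conclusion: The observed \(\hat{p}\) is almost 17 standard deviations below 0.50, and the p-value is far below \(\alpha = 0.05\), so we reject \(H_0\).
A true proportion of 0.50 is not plausible.
Note: the \(p\)-value is not literally zero. It is so small that 1 - pnorm(abs(z_stat2)) rounds to 0 in double-precision arithmetic.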
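
To see a nonzero p-value, the upper-tail area can be computed directly with the lower.tail argument of pnorm instead of subtracting from 1. The sketch below wraps the same calculation in a helper; the name z_test_prop is hypothetical and not part of the original exercise.

```r
n <- 2015
phat <- 0.311

# Hypothetical helper: two-sided one-proportion z-test
z_test_prop <- function(phat, null, n) {
  sd_null <- sqrt(null * (1 - null) / n)
  z <- (phat - null) / sd_null
  # Upper-tail area computed directly, avoiding the subtraction 1 - pnorm(...)
  p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
  c(z = z, p_value = p_value)
}

res_0362 <- z_test_prop(phat, 0.362, n)   # test from Question 6
res_050 <- z_test_prop(phat, 0.50, n)     # test from Question 8
res_0362
res_050
```

For the 0.362 test this reproduces the earlier z-statistic and p-value. For the 0.50 test it returns a tiny but strictly positive p-value, which does not change the conclusion.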


Question 9 – Calculate Standard Error

Q9. Calculate the standard error (SE\(_{\hat{p}}\)) for this study.
How does it compare to the SD\(_\pi\) values from Questions #7 and #8? Explain why they differ.

SE_phat <- sqrt(phat * (1 - phat) / n)
SE_phat
[1] 0.01031222

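SE_phat differs from SD_pi because it substitutes \(\hat{p}\) for \(\pi\): a confidence interval assumes no null value, so the variability has to be estimated from the sample itself.
It is slightly smaller (0.0103 vs. 0.0107 and 0.0111) because \(p(1-p)\) is largest at \(p = 0.50\) and shrinks as \(p\) moves away from 0.50, and \(\hat{p}=0.311\) is farther from 0.50 than 0.362 is.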
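
As a quick check of that ordering, the same formula can be evaluated at all three proportions at once. This is a minimal sketch; the vector names (phat, null_q6, null_q8) are labels chosen here for illustration.

```r
n <- 2015
props <- c(phat = 0.311, null_q6 = 0.362, null_q8 = 0.50)

# sqrt(p * (1 - p) / n) evaluated at each proportion
sds <- sqrt(props * (1 - props) / n)
sds
```

The values increase from left to right, matching SE_phat, SD_pi, and SD_pi2 above.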


Question 10 – 95% Confidence Interval

Q10. Calculate and interpret a 95% confidence interval for \(\pi\).
Explain what the interval means in the context of this study.

confidence <- 0.95
alpha <- 1 - confidence
z_crit_95 <- qnorm(1 - alpha / 2)
ci_lower_95 <- phat - z_crit_95 * SE_phat
ci_upper_95 <- phat + z_crit_95 * SE_phat
c(ci_lower_95, ci_upper_95)
[1] 0.2907884 0.3312116

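Interpretation: We are 95% confident that between 29.1% and 33.1% of all adult Americans currently live in a different state from where they were born.
"95% confident" describes the method: about 95% of intervals constructed this way from repeated random samples of 2,015 would contain the true \(\pi\).
Both 0.362 and 0.50 fall outside this interval, which agrees with rejecting those values in Questions 6 and 8.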
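
The repeated-sampling reading of "95% confident" can be illustrated by simulation. The sketch below is not part of the original exercise: it assumes a hypothetical true proportion (true_pi, set to 0.311 only for illustration), draws many samples, builds an interval from each, and records how often the interval contains true_pi.

```r
set.seed(1)         # arbitrary seed, used only for reproducibility
n <- 2015
true_pi <- 0.311    # hypothetical "true" proportion, assumed for this simulation
n_sims <- 10000     # hypothetical number of repeated samples
z_crit <- qnorm(0.975)

# One sample proportion and one standard error per simulated sample
sim_phat <- rbinom(n_sims, size = n, prob = true_pi) / n
sim_se <- sqrt(sim_phat * (1 - sim_phat) / n)

# Does each interval contain the assumed true proportion?
covered <- (sim_phat - z_crit * sim_se <= true_pi) &
  (true_pi <= sim_phat + z_crit * sim_se)
coverage <- mean(covered)
coverage
```

The resulting coverage should be close to 0.95. In a real study only one interval is computed, and \(\pi\) is unknown.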


Question 11 – 99% Confidence Interval

Q11. Calculate a 99% confidence interval for \(\pi\).
Compare it to your 95% interval. How do the midpoint and margin of error change?

confidence <- 0.99
alpha <- 1 - confidence
z_crit_99 <- qnorm(1 - alpha / 2)
ci_lower_99 <- phat - z_crit_99 * SE_phat
ci_upper_99 <- phat + z_crit_99 * SE_phat
c(ci_lower_99, ci_upper_99)
[1] 0.2844375 0.3375625

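Comparison: The midpoint (\(\hat{p} = 0.311\)) stays the same, but the interval widens.
A higher confidence level uses a larger critical value (about 2.576 instead of 1.96), so the margin of error grows from about 0.020 to about 0.027.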
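
The comparison can be laid out side by side with a small helper that returns the midpoint and margin of error for any confidence level. This is a sketch under the same inputs as above; the function name ci_summary is hypothetical.

```r
n <- 2015
phat <- 0.311
se_phat <- sqrt(phat * (1 - phat) / n)

# Hypothetical helper: midpoint, margin of error, and bounds for one confidence level
ci_summary <- function(conf_level) {
  z_crit <- qnorm(1 - (1 - conf_level) / 2)
  margin <- z_crit * se_phat
  c(midpoint = phat, margin = margin,
    lower = phat - margin, upper = phat + margin)
}

# One column per confidence level
ci_table <- sapply(c("95%" = 0.95, "99%" = 0.99), ci_summary)
ci_table
```

The midpoint row is identical in both columns, while the margin row is larger for 99%.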


Question 12 – Effect of Smaller Sample Size (n = 215)

Q12. Suppose that the GSS had only taken a sample size of \(n = 215\).
How would this change your confidence interval?

n_small <- 215
SE_small <- sqrt(phat * (1 - phat) / n_small)
z_star <- 1.96
lower_small <- phat - z_star * SE_small
upper_small <- phat + z_star * SE_small
c(lower_small, upper_small)
[1] 0.2491234 0.3728766

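Interpretation: A smaller \(n\) increases the standard error, leading to a wider confidence interval (about 0.249 to 0.373) and less precise estimation. The margin of error grows from about 0.020 with \(n = 2{,}015\) to about 0.062 with \(n = 215\).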
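
Because the standard error is proportional to \(1/\sqrt{n}\), the margin of error scales by \(\sqrt{2015/215}\) when the sample shrinks from 2,015 to 215. The sketch below checks this, using the same \(\hat{p}\) and the rounded critical value 1.96 from the code above.

```r
phat <- 0.311
z_star <- 1.96
sample_sizes <- c(215, 2015)

# Margin of error at each sample size
margins <- z_star * sqrt(phat * (1 - phat) / sample_sizes)
names(margins) <- paste0("n = ", sample_sizes)
margins

# Ratio of the two margins compared with sqrt(2015 / 215)
ratio <- margins[[1]] / margins[[2]]
c(ratio = ratio, expected = sqrt(2015 / 215))
```

The ratio is a little over 3, so dividing the sample size by roughly 9.4 makes the interval about three times as wide.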


Summary of Key Concepts

Concept | Formula | When Used
--- | --- | ---
Standard deviation under null | \(\sqrt{\pi(1-\pi)/n}\) | Hypothesis tests (\(H_0\) assumed true)
Standard error (estimated) | \(\sqrt{\hat{p}(1-\hat{p})/n}\) | Confidence intervals (sample-based)
z-statistic | \((\hat{p}-\pi_0)/SD_\pi\) | Tests population proportion hypotheses
Confidence interval | \(\hat{p} \pm z^* SE_{\hat{p}}\) | Estimates plausible range for \(\pi\) from data