Lesson 2: Sampling & Study Design

Cal

Reese

DMath Basketball!!

Math vs TBD

Previously 6-1

6-2

Admin

Don’t Forget

Mandatory and Important Briefings

A Note on AI

AI is welcome and encouraged on WebAssign. Screenshot the question… maybe.
But when WPR time comes - no AI. So use it as a tool for learning, not a crutch.
WebAssign is 15% of the course. You should get most/all those points.
Don’t earn those points at the peril of your exam grades - WPR I, WPR II, and the TEE together are 65% of the course.
WebAssign is forgiving: 5 tries per sub-question.
Don’t forget documentation. Just tell me what you did.

Graded Assignments

Assignment	Points
WebAssign Homework	150
WPR I	175
WPR II	175
Exploratory Data Analysis	25
Tech Report	125
Project Presentation	50
TEE	300
Total	1000

Request a Vantage Account

Action Required

Before we continue, everyone needs to request an account on Vantage - the Army’s data science platform where we’ll be using R this semester.

Go to https://vantage.army.mil/
Click Request Account
Fill out the required information
For Commander email, use: jonathan.l.day3.mil@army.mil
Wait for approval (may take 1-2 days)

We’ll be using Vantage for the project throughout the course.

Lesson 1 Review

The Big Picture

Why do we collect data? Because we want to learn about something bigger than what we can directly observe.

Population: The entire group we want to learn about
Sample: The subset we actually observe
Process: An ongoing mechanism that generates data over time

The whole course builds on this: we use samples to make claims about populations. That’s inference.

Parameters vs Statistics

	Population	Sample
What we have	Usually unknown	Observable data
Notation	Greek letters (\(\mu\), \(\sigma\), \(\pi\))	Latin letters (\(\bar{x}\), \(s\), \(\hat{p}\))

A statistic estimates a parameter. This is the foundation of Blocks 2 and 3.

Types of Variables

Categorical: Labels or categories (e.g., branch, major, yes/no)
Quantitative: Numbers with meaningful arithmetic (e.g., height, GPA, time)

Why does this matter? The type of variable determines:

How you summarize it (proportions vs means)
How you visualize it (bar charts vs histograms)
Which inference method you use (z-test for proportions vs t-test for means)

Lesson 2 Content

Objectives

Construct and interpret stem-and-leaf displays
Create and interpret dotplots
Build and analyze histograms and frequency distributions
Describe distributions in terms of shape, center, spread, and outliers

Required Reading

Devore, Section 1.2: Pictorial and Tabular Methods in Descriptive Statistics

Why Visualize Data?

Before calculating any numbers, we should look at the data. Visual displays help us:

Identify the shape of the distribution
Find a typical value (center)
See how much variability exists (spread)
Spot outliers and gaps
Check for symmetry or skewness

Four Types of Displays

Comparing the Four Display Types

Display	Description	Best Sample Size	Preserves Exact Values?
Stem-and-leaf	Split each number into stem (leading digits) and leaf (trailing digit)	Small (n < 50)	Yes
Dotplot	Place a dot for each observation on a number line; stack repeated values	Small (n < 30)	Yes
Histogram (Frequency)	Divide range into bins; bar height = count in each bin	Any size	No
Histogram (Relative Freq)	Divide range into bins; bar height = proportion in each bin	Any size	No

What Can You Learn From Each?

Display	Shape	Center	Spread	Outliers	Exact Values	Compare Groups
Stem-and-leaf	Yes	Yes	Yes	Yes	Yes	Limited
Dotplot	Yes	Yes	Yes	Yes	Yes	Good
Histogram (Frequency)	Yes	Approximate	Yes	Yes	No	Difficult
Histogram (Relative Freq)	Yes	Approximate	Yes	Yes	No	Best

When to Use Each Display

Display	Use When…	Example
Stem-and-leaf	Small dataset, want to preserve exact values	Quiz scores for your 18-person section
Dotplot	Small dataset with repeated values	Number of absences per cadet
Histogram (Frequency)	Larger dataset, want raw counts per bin	APFT scores for an entire battalion
Histogram (Relative Freq)	Comparing groups of different sizes	Run times: Company A (120 soldiers) vs Company B (85 soldiers)

Describing Distributions

The Four Key Features

When describing any distribution, always address:

Shape: Symmetric, skewed left, skewed right, unimodal, bimodal?
Center: Where is the “typical” value?
Spread: How much variability is there?
Outliers: Any unusual observations?

Memory Aid

S-C-S-O: Shape, Center, Spread, Outliers

Shape: Symmetry and Skewness

Shape	Description	Relationship
Symmetric	Left and right sides are mirror images	Mean ≈ Median
Skewed Right	Long tail extends to the right	Mean > Median
Skewed Left	Long tail extends to the left	Mean < Median

Examples:

Skewed right: Income, home prices
Skewed left: Age at retirement, exam scores with a ceiling
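
The mean-versus-median relationship in the table above can be checked numerically. The sketch below uses Python purely for illustration, and both datasets are hypothetical values invented to show the pattern; they are not course data.

```python
from statistics import mean, median

# Hypothetical values, invented only to illustrate the pattern.
incomes = [30, 32, 35, 38, 40, 45, 120]      # thousands of dollars; long right tail
exam_scores = [55, 80, 85, 88, 90, 92, 95]   # scores near a ceiling; long left tail


def describe_skew(values):
    """Return (mean, median, label) using the mean-vs-median rule of thumb."""
    m, med = mean(values), median(values)
    if m > med:
        label = "mean > median (suggests right skew)"
    elif m < med:
        label = "mean < median (suggests left skew)"
    else:
        label = "mean = median (consistent with symmetry)"
    return m, med, label


for name, data in [("incomes", incomes), ("exam_scores", exam_scores)]:
    m, med, label = describe_skew(data)
    print(f"{name}: mean = {m:.2f}, median = {med}, {label}")
```

The comparison is a rule of thumb, not a proof of skewness: always confirm the shape with a picture of the data.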

Shape: Modality

Unimodal: One peak (most common)
Bimodal: Two distinct peaks (may indicate two subgroups)
Multimodal: Multiple peaks
Uniform: Roughly flat, no clear peak

Identifying Outliers

Outlier

An observation that falls far from the rest of the data. Could indicate:

Data entry error
Measurement error
A genuinely unusual observation
A different population

Always investigate outliers - don’t automatically remove them!

Example: Cadet Heights

Let’s use simulated height data for 30 cadets to demonstrate all four display types.

Heights (inches): 65, 66, 67, 67, 68, 68, 69, 69, 70, 70, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 73, 73, 73, 73, 74, 74, 75

Stem-and-Leaf Display

Stem | Leaves
-----+--------
   6 | 5 6 7 7 8 8 9 9
   7 | 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 3 3 3 3 4 4 5

Reading it: The row “6 | 5 6 7 7 8 8 9 9” represents heights of 65, 66, 67, 67, 68, 68, 69, 69 inches.
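
As a minimal sketch of how such a display is built, the Python snippet below (Python is used here only for illustration) splits each height into a tens-digit stem and a ones-digit leaf. The function name stem_and_leaf is a hypothetical helper, not a library call.

```python
from collections import defaultdict

heights = [65, 66, 67, 67, 68, 68, 69, 69,
           70, 70, 70, 70, 70, 70, 70, 70, 70,
           71, 71, 71, 71, 71, 71,
           73, 73, 73, 73, 74, 74, 75]


def stem_and_leaf(values):
    """Group non-negative integers by tens digit (stem); the ones digit is the leaf."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    return dict(rows)


display = stem_and_leaf(heights)
for stem, leaves in sorted(display.items()):
    print(f"{stem:>4} | {' '.join(str(leaf) for leaf in leaves)}")
```

Sorting before splitting keeps the leaves in increasing order on each stem, which is what makes the display easy to read.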

Interpreting with S-C-S-O:

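Shape: Unimodal and roughly symmetric around 70 inches. With only two stems the display is coarse (8 leaves versus 22), so read the shape from the leaves: they pile up at 70-71 and thin out toward 65 and 75
Center: Most leaves on the 7 stem are 0s and 1s, so the center is around 70 inches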
Spread: Values range from 65 to 75 inches (range = 10 inches)
Outliers: No values stand apart from the rest

Unique advantage: We can recover every exact data value! We know there are exactly two cadets who are 67 inches tall.

Dotplot

Interpreting with S-C-S-O:

Shape: Roughly symmetric with one clear peak (unimodal); dots cluster in the middle
Center: The tallest stack of dots is at 70 inches - this is our center
Spread: Dots extend from about 65 to 75 inches
Outliers: No dots are isolated far from the others

Unique advantage: Easy to see exact values AND repeated values (stacked dots). The height of 70 inches appears 9 times - we can count the dots!
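
A rough text version of the dotplot can be sketched in Python (used here only for illustration). The helper text_dotplot is hypothetical; it prints one * per observation, stacked above each height.

```python
from collections import Counter

heights = [65, 66, 67, 67, 68, 68, 69, 69,
           70, 70, 70, 70, 70, 70, 70, 70, 70,
           71, 71, 71, 71, 71, 71,
           73, 73, 73, 73, 74, 74, 75]


def text_dotplot(values):
    """Return the lines of a stacked dotplot: one '*' per observation."""
    counts = Counter(values)
    axis_values = range(min(values), max(values) + 1)
    lines = []
    # Build rows from the top of the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(" * " if counts[v] >= level else "   " for v in axis_values)
        lines.append(row.rstrip())
    # Final line is the number line itself.
    lines.append("".join(f"{v:^3}" for v in axis_values).rstrip())
    return lines


plot = text_dotplot(heights)
print("\n".join(plot))
```

The empty column at 72 shows up as a gap in the dots, which is the kind of feature a dotplot makes easy to spot.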

Histogram (Frequency)

Interpreting with S-C-S-O:

Shape: Roughly symmetric (bars rise then fall), unimodal (one peak)
Center: Tallest bar is at 70-72 inches, so center is around 70-71
Spread: Data spans from about 64 to 76 inches
Outliers: No bars are isolated from the main distribution

Unique advantage: Shows counts - we can say “15 cadets are at least 70 but under 72 inches tall.” Good for understanding raw numbers.
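
The bin counts behind a frequency histogram can be sketched as follows (Python is used only for illustration). The bin edges 64, 66, ..., 76 and the left-inclusive convention are assumptions chosen to match the count of 15 quoted above; bin_counts is a hypothetical helper.

```python
heights = [65, 66, 67, 67, 68, 68, 69, 69,
           70, 70, 70, 70, 70, 70, 70, 70, 70,
           71, 71, 71, 71, 71, 71,
           73, 73, 73, 73, 74, 74, 75]


def bin_counts(values, start, stop, width):
    """Count values in left-inclusive bins [start, start + width), ... up to stop."""
    counts = {(lo, lo + width): 0 for lo in range(start, stop, width)}
    for v in values:
        # Find the lower edge of the bin that contains v.
        lo = start + ((v - start) // width) * width
        counts[(lo, lo + width)] += 1
    return counts


freq = bin_counts(heights, 64, 76, 2)
for (lo, hi), n in freq.items():
    print(f"[{lo}, {hi}): {'#' * n} {n}")
```

Different bin edges or widths would give a different picture of the same data, so it is worth stating the bins whenever a histogram is reported.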

Histogram (Relative Frequency)

Interpreting with S-C-S-O:

Shape: Same as frequency histogram - symmetric, unimodal
Center: Peak is at 70-72 inches
Spread: Same range, 64 to 76 inches
Outliers: None visible

Unique advantage: Shows proportions - we can say “50% of cadets (15 of 30) are at least 70 but under 72 inches tall.” Essential for comparing groups of different sizes (e.g., comparing this section of 30 to another section of 18).
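
Relative frequencies are just bin counts divided by the sample size. A minimal Python sketch (illustration only), assuming the same left-inclusive bins of width 2 from 64 to 76; relative_frequencies is a hypothetical helper.

```python
heights = [65, 66, 67, 67, 68, 68, 69, 69,
           70, 70, 70, 70, 70, 70, 70, 70, 70,
           71, 71, 71, 71, 71, 71,
           73, 73, 73, 73, 74, 74, 75]


def relative_frequencies(values, start, stop, width):
    """Proportion of values in each left-inclusive bin [lo, lo + width)."""
    n = len(values)
    rel = {}
    for lo in range(start, stop, width):
        hi = lo + width
        rel[(lo, hi)] = sum(lo <= v < hi for v in values) / n
    return rel


rel = relative_frequencies(heights, 64, 76, 2)
for (lo, hi), p in rel.items():
    print(f"[{lo}, {hi}): {p:.3f}")
print("total:", round(sum(rel.values()), 3))
```

Because every bar is a proportion, the bars sum to 1 regardless of sample size, which is what makes sections of 30 and 18 directly comparable.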

Side-by-Side Comparison

Describing the Cadet Heights Distribution

Using S-C-S-O:

Shape: Roughly symmetric, unimodal
Center: Around 70 inches
Spread: Values range from 65 to 75 inches (range = 10 inches); the histogram bins span 64 to 76
Outliers: None obvious
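
The visual read of center and spread can be cross-checked with a few numbers. This Python sketch (illustration only) uses the standard-library statistics module; mean, median, and mode are covered properly in Lesson 3, so treat this as a preview.

```python
from statistics import mean, median, mode

heights = [65, 66, 67, 67, 68, 68, 69, 69,
           70, 70, 70, 70, 70, 70, 70, 70, 70,
           71, 71, 71, 71, 71, 71,
           73, 73, 73, 73, 74, 74, 75]

center_mean = mean(heights)                # arithmetic average
center_median = median(heights)            # middle value of the sorted data
center_mode = mode(heights)                # most frequent value
data_range = max(heights) - min(heights)   # simplest measure of spread

print(f"mean = {center_mean:.2f}, median = {center_median}, mode = {center_mode}")
print(f"min = {min(heights)}, max = {max(heights)}, range = {data_range}")
```

The mean and median land close together, which is consistent with the roughly symmetric shape seen in the displays.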

Practice Problem

The following data represents the number of hours cadets studied for a WPR:

3, 5, 4, 8, 6, 5, 12, 4, 5, 7, 6, 5, 4, 6, 5, 3, 5, 6, 4, 5

Questions

Create a stem-and-leaf display for this data
Describe the distribution (shape, center, spread, outliers)
What is the relative frequency of cadets who studied 5 or more hours?

Answers

Stem-and-leaf display:

Stem | Leaves
   0 | 3 3 4 4 4 4 5 5 5 5 5 5 5 6 6 6 6 7 8
   1 | 2

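Shape: Unimodal; roughly symmetric around 5 hours apart from one high value (12), which gives a slight right skew
Center: Around 5 hours
Spread: 3 to 12 hours (range of 9)
Outliers: 12 hours appears to be an outlier

Relative frequency of 5 or more hours: 14 of the 20 values are ≥ 5 (seven 5s, four 6s, and one each of 7, 8, and 12), so the relative frequency is 14/20 = 0.70, or 70%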
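
Counts like these are easy to get wrong by eye, so it helps to tally the data directly. A minimal Python sketch (illustration only) that counts each value and computes the relative frequency of 5 or more hours:

```python
from collections import Counter

hours = [3, 5, 4, 8, 6, 5, 12, 4, 5, 7, 6, 5, 4, 6, 5, 3, 5, 6, 4, 5]

# Frequency of each distinct value, printed in increasing order.
tally = Counter(hours)
for value in sorted(tally):
    print(f"{value:>2} hours: {tally[value]}")

# Relative frequency = (number of values meeting the condition) / (sample size).
at_least_5 = sum(1 for h in hours if h >= 5)
rel_freq = at_least_5 / len(hours)
print(f"{at_least_5} of {len(hours)} studied 5 or more hours: {rel_freq:.2f}")
```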

Before You Leave

Today

Stem-and-leaf displays: preserve data while showing shape
Dotplots: simple visual for small datasets
Histograms: frequency distributions for larger datasets
Describing distributions: Shape, Center, Spread, Outliers (S-C-S-O)

Any questions?

Next Lesson

Lesson 3: Measures of Location

Mean, median, and mode
Percentiles and quartiles
Comparing measures of center for different distributions

Upcoming Graded Events

WebAssign 1.3 - Due before Lesson 3
Exploratory Data Analysis - Due Lesson 9
WPR I - Lesson 16