Descriptive Statistics: Summarizing and Visualizing Data
Introduction to Descriptive Statistics
Descriptive statistics provides the tools and techniques for summarizing, organizing, and presenting data in meaningful ways. Before we can make inferences or draw conclusions about populations, we must first understand what our data tells us through careful description and visualization.
Descriptive statistics transforms raw data into comprehensible information, revealing patterns, trends, and characteristics that might otherwise remain hidden in long lists of numbers.
Purpose of Descriptive Statistics
═══════════════════════════════
Primary Goals:
• Summarize large datasets with key measures
• Identify patterns and trends in data
• Detect outliers and unusual observations
• Communicate findings clearly and effectively
• Prepare data for further statistical analysis
Key Questions Answered:
• What is the typical or central value?
• How spread out or variable are the data?
• What is the shape of the data distribution?
• Are there any unusual or extreme values?
• How do different groups or variables compare?
Tools and Techniques:
• Measures of central tendency (mean, median, mode)
• Measures of variability (range, variance, standard deviation)
• Measures of position (percentiles, quartiles, z-scores)
• Data visualization (histograms, box plots, scatter plots)
• Summary tables and frequency distributions
Applications:
• Business reporting and dashboards
• Scientific data analysis and presentation
• Quality control and process monitoring
• Market research and consumer analysis
• Educational assessment and evaluation
Organizing Data
Frequency Distributions
Frequency distributions organize data by showing how often each value or range of values occurs.
Types of Frequency Distributions
══════════════════════════════
For Qualitative Data:
Simple frequency table showing categories and their counts
Example: Student Majors
Major | Frequency | Relative Frequency | Percentage
---------------|-----------|-------------------|------------
Computer Sci | 45 | 0.30 | 30%
Mathematics | 30 | 0.20 | 20%
Engineering | 38 | 0.25 | 25%
Business | 22 | 0.15 | 15%
Other | 15 | 0.10 | 10%
Total | 150 | 1.00 | 100%
For Quantitative Data:
Grouped frequency distribution using class intervals
Example: Test Scores (0-100)
Class Interval | Frequency | Relative Frequency | Cumulative Frequency
---------------|-----------|-------------------|--------------------
60-69 | 5 | 0.10 | 5
70-79 | 12 | 0.24 | 17
80-89 | 18 | 0.36 | 35
90-99 | 15 | 0.30 | 50
Total | 50 | 1.00 | 50
Key Concepts:
• Frequency: Number of observations in each category/class
• Relative frequency: Proportion of total (frequency ÷ total)
• Cumulative frequency: Running total of frequencies
• Class width: Size of each interval (should be equal)
• Class midpoint: Middle value of each interval
Guidelines for Creating Classes:
• Use 5-20 classes (typically 5-15)
• Make class widths equal when possible
• Avoid overlapping classes
• Include all data points
• Use convenient class boundaries
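The relative and cumulative frequency columns follow mechanically from the raw frequencies. As a minimal sketch in Python (the variable names are illustrative, and the class counts are copied from the test-score table above):

```python
from itertools import accumulate

# Class intervals and frequencies copied from the test-score table above.
classes = ["60-69", "70-79", "80-89", "90-99"]
frequencies = [5, 12, 18, 15]

total = sum(frequencies)                      # number of observations
relative = [f / total for f in frequencies]   # frequency ÷ total
cumulative = list(accumulate(frequencies))    # running total of frequencies

print(f"{'Class':<8}{'Freq':>6}{'Rel':>8}{'Cum':>6}")
for label, f, r, c in zip(classes, frequencies, relative, cumulative):
    print(f"{label:<8}{f:>6}{r:>8.2f}{c:>6}")
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the table.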
Stem-and-Leaf Plots
Stem-and-Leaf Displays
═════════════════════
Purpose:
• Show distribution shape while preserving actual data values
• Quick way to organize and visualize small to moderate datasets
• Useful for identifying outliers and gaps
Construction:
1. Separate each number into stem (leading digits) and leaf (trailing digit)
2. List stems vertically in order
3. List leaves horizontally for each stem in order
4. Include key explaining the format
Example: Test Scores
Data: 67, 72, 73, 75, 78, 81, 83, 85, 87, 89, 91, 93, 95, 97
Stem | Leaf
-----|----------
6 | 7
7 | 2 3 5 8
8 | 1 3 5 7 9
9 | 1 3 5 7
Key: 6|7 = 67
Advantages:
• Preserves original data values
• Shows distribution shape
• Easy to construct by hand
• Identifies outliers clearly
Disadvantages:
• Limited to relatively small datasets
• Not suitable for very large or very small numbers
• Less flexible than histograms
Variations:
• Split stems: Use 6* for 60-64, 6• for 65-69
• Back-to-back: Compare two distributions
• Truncated: Drop decimal places for simplicity
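The construction steps can be sketched in a few lines of Python. This is an illustrative helper (the name stem_and_leaf is not from any library) and it assumes non-negative integers whose last digit is the leaf:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted list of leaves (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # 67 -> stem 6, leaf 7
        plot[stem].append(leaf)
    return dict(plot)

scores = [67, 72, 73, 75, 78, 81, 83, 85, 87, 89, 91, 93, 95, 97]
plot = stem_and_leaf(scores)

# Stems with no data are simply omitted in this sketch.
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
print("Key: 6|7 = 67")
```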
Measures of Central Tendency
The Mean
The arithmetic mean is the most commonly used measure of central tendency.
Arithmetic Mean
══════════════
Population Mean: μ = (Σx)/N = (x₁ + x₂ + ... + x_N)/N
Sample Mean: x̄ = (Σx)/n = (x₁ + x₂ + ... + xₙ)/n
Where:
• Σx = sum of all values
• N = population size
• n = sample size
Example: Test Scores
Data: 85, 92, 78, 88, 95, 82, 90
x̄ = (85 + 92 + 78 + 88 + 95 + 82 + 90)/7 = 610/7 = 87.14
Properties of the Mean:
• Uses all data values in calculation
• Unique value for any dataset
• Affected by extreme values (outliers)
• Sum of deviations from mean equals zero: Σ(x - x̄) = 0
• Minimizes sum of squared deviations: Σ(x - c)² is smallest when c = x̄
When to Use:
✓ Data is roughly symmetric
✓ No extreme outliers present
✓ Interval or ratio level data
✓ Further statistical analysis planned
When Not to Use:
✗ Highly skewed distributions
✗ Presence of extreme outliers
✗ Ordinal data
✗ Open-ended classes in grouped data
Weighted Mean:
When observations have different importance or frequency
x̄w = (Σwᵢxᵢ)/(Σwᵢ)
Example: Course Grade Calculation
Component | Score | Weight | Weighted Score
-------------|-------|--------|---------------
Homework | 85 | 0.20 | 17.0
Midterm | 78 | 0.30 | 23.4
Final Exam | 92 | 0.50 | 46.0
Total | | 1.00 | 86.4
Weighted mean = 86.4
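Both formulas translate directly into code. The following is a small sketch using only built-in Python; the function names are illustrative, and the data are the test-score and course-grade examples above:

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Weighted mean: sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

scores = [85, 92, 78, 88, 95, 82, 90]
x_bar = mean(scores)

# Homework, midterm, final exam with weights 0.20, 0.30, 0.50.
course_grade = weighted_mean([85, 78, 92], [0.20, 0.30, 0.50])

print(round(x_bar, 2))         # 87.14
print(round(course_grade, 1))  # 86.4
```

Dividing by the sum of the weights means the weights do not have to add up to 1; raw frequencies or credit hours work equally well.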
The Median
The median is the middle value when data is arranged in order.
Median Calculation
═════════════════
For Odd Number of Values:
Median = middle value when arranged in order
Example: 12, 15, 18, 22, 25, 28, 30
Median = 22 (4th value out of 7)
For Even Number of Values:
Median = average of two middle values
Example: 12, 15, 18, 22, 25, 28
Median = (18 + 22)/2 = 20
Position Formula:
Position of median = (n + 1)/2
For n = 7: Position = (7 + 1)/2 = 4th value
For n = 8: Position = (8 + 1)/2 = 4.5 (average of 4th and 5th values)
Properties of the Median:
• Not affected by extreme values (robust)
• Divides data into two equal halves
• Unique value for any dataset
• Can be used with ordinal data
• May not use all data values
When to Use:
✓ Skewed distributions
✓ Presence of outliers
✓ Ordinal level data
✓ Open-ended distributions
✓ Income and housing price data
Example: Income Data
$25,000, $28,000, $32,000, $35,000, $38,000, $42,000, $250,000
Mean = $64,286 (pulled up by high income)
Median = $35,000 (better represents typical income)
Quartiles and Percentiles:
• Q₁ (25th percentile): 25% of data below this value
• Q₂ (50th percentile): Median
• Q₃ (75th percentile): 75% of data below this value
Finding Quartiles:
1. Find median (Q₂)
2. Q₁ = median of lower half
3. Q₃ = median of upper half
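The median and the median-of-halves quartile procedure can be sketched as follows. The function names are illustrative, and one assumption is made explicit: when n is odd, the middle value is left out of both halves. Other textbooks and software packages use slightly different quartile conventions and may give slightly different answers.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves.

    Assumption: for odd n the overall median is excluded from both halves.
    """
    s = sorted(values)
    if len(s) < 2:
        raise ValueError("need at least two values")
    half = len(s) // 2
    return median(s[:half]), median(s), median(s[len(s) - half:])

incomes = [25_000, 28_000, 32_000, 35_000, 38_000, 42_000, 250_000]
print(round(sum(incomes) / len(incomes)))  # mean, pulled up by the one high income
print(median(incomes))                     # 35000
print(quartiles(incomes))                  # (28000, 35000, 42000)
```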
The Mode
The mode is the most frequently occurring value in a dataset.
Mode Identification
══════════════════
Definition:
Value that appears most frequently in the dataset
Examples:
Unimodal (One Mode):
Data: 2, 3, 4, 4, 4, 5, 6, 7
Mode = 4 (appears 3 times)
Bimodal (Two Modes):
Data: 1, 2, 2, 3, 4, 5, 5, 6
Modes = 2 and 5 (each appears twice)
Multimodal (Multiple Modes):
Data: 1, 1, 2, 2, 3, 3, 4
Modes = 1, 2, and 3 (each appears twice)
No Mode:
Data: 1, 2, 3, 4, 5, 6, 7
No mode (all values appear once)
For Grouped Data:
Modal class = class interval with highest frequency
Example: Test Scores
Class Interval | Frequency
---------------|----------
60-69 | 3
70-79 | 8
80-89 | 12 ← Modal class
90-99 | 7
Properties of the Mode:
• Can be used with any level of data
• Not affected by extreme values
• May not exist or may not be unique
• Doesn't use all data values
• Easy to identify in frequency distributions
When to Use:
✓ Nominal (categorical) data
✓ Finding most popular item
✓ Highly skewed distributions
✓ Quick rough estimate needed
Applications:
• Most popular product size
• Most common defect type
• Peak hours for service
• Most frequent customer complaint
Relationship Between Mean, Median, and Mode:
Symmetric, Unimodal Distribution:
Mean = Median = Mode
Right-Skewed (Positively Skewed):
Mode < Median < Mean
Left-Skewed (Negatively Skewed):
Mean < Median < Mode
These orderings are rules of thumb for unimodal distributions. They help identify distribution shape, but individual datasets can violate them.
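Finding the mode is a counting exercise. The sketch below uses collections.Counter from the Python standard library; the helper name modes is illustrative, and it follows the convention used above that a dataset has no mode when every value appears only once.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    counts = Counter(values)          # assumes at least one value
    top = max(counts.values())
    if top == 1:
        return []                     # every value appears once: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 3, 4, 4, 4, 5, 6, 7]))   # [4]
print(modes([1, 2, 2, 3, 4, 5, 5, 6]))   # [2, 5]
print(modes([1, 1, 2, 2, 3, 3, 4]))      # [1, 2, 3]
print(modes([1, 2, 3, 4, 5, 6, 7]))      # []

# Grouped data: the modal class is the interval with the highest frequency.
class_freq = {"60-69": 3, "70-79": 8, "80-89": 12, "90-99": 7}
modal_class = max(class_freq, key=class_freq.get)
print(modal_class)                       # 80-89
```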
Measures of Variability
Range and Interquartile Range
Measures of variability describe how spread out the data values are.
Range Measures
═════════════
Range:
Difference between largest and smallest values
Range = Maximum - Minimum
Example: Test Scores
Data: 65, 72, 78, 85, 88, 92, 95
Range = 95 - 65 = 30
Properties:
• Simple to calculate and understand
• Uses only two values (extremes)
• Heavily influenced by outliers
• Doesn't describe internal variability
Interquartile Range (IQR):
Difference between third and first quartiles
IQR = Q₃ - Q₁
Advantages:
• Not affected by outliers
• Describes spread of middle 50% of data
• Useful for identifying outliers
Outlier Detection Rule:
• Lower outlier: Below Q₁ - 1.5(IQR)
• Upper outlier: Above Q₃ + 1.5(IQR)
Example: Calculating IQR
Data: 12, 15, 18, 22, 25, 28, 30, 35, 40, 45, 50
Step 1: Find quartiles
Q₁ = 18 (25th percentile)
Q₂ = 28 (median)
Q₃ = 40 (75th percentile)
Step 2: Calculate IQR
IQR = Q₃ - Q₁ = 40 - 18 = 22
Step 3: Check for outliers
Lower fence: 18 - 1.5(22) = 18 - 33 = -15
Upper fence: 40 + 1.5(22) = 40 + 33 = 73
No outliers in this dataset
Semi-Interquartile Range:
Also called quartile deviation
Semi-IQR = (Q₃ - Q₁)/2 = IQR/2
Used when median is reported as central tendency measure
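The IQR and the 1.5(IQR) outlier rule can be combined into one helper. This is a sketch, not a library function: iqr_fences is an illustrative name, and quartiles are taken as medians of the lower and upper halves with the middle value excluded when n is odd, which matches the worked example above.

```python
from statistics import median

def iqr_fences(values):
    """Return (Q1, Q3, IQR, lower fence, upper fence, outliers)."""
    s = sorted(values)
    half = len(s) // 2                      # assumes at least two values
    q1 = median(s[:half])                   # median of lower half
    q3 = median(s[len(s) - half:])          # median of upper half
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in s if x < lower or x > upper]
    return q1, q3, iqr, lower, upper, outliers

data = [12, 15, 18, 22, 25, 28, 30, 35, 40, 45, 50]
q1, q3, iqr, lower, upper, outliers = iqr_fences(data)
print(q1, q3, iqr)      # 18 40 22
print(lower, upper)     # -15.0 73.0
print(outliers)         # []
print(iqr / 2)          # semi-interquartile range: 11.0
```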
Variance and Standard Deviation
Variance and Standard Deviation
═════════════════════════════
Population Variance:
σ² = Σ(x - μ)²/N
Population Standard Deviation:
σ = √[Σ(x - μ)²/N]
Sample Variance:
s² = Σ(x - x̄)²/(n - 1)
Sample Standard Deviation:
s = √[Σ(x - x̄)²/(n - 1)]
Note: Sample formulas divide by (n - 1) rather than n so that s² is an unbiased estimate of σ²
Calculation Example:
Data: 2, 4, 6, 8, 10
Step 1: Calculate mean
x̄ = (2 + 4 + 6 + 8 + 10)/5 = 30/5 = 6
Step 2: Calculate deviations and squared deviations
x | (x - x̄) | (x - x̄)²
---|---------|----------
2 | -4 | 16
4 | -2 | 4
6 | 0 | 0
8 | 2 | 4
10 | 4 | 16
| 0 | 40
Step 3: Calculate variance
s² = 40/(5-1) = 40/4 = 10
Step 4: Calculate standard deviation
s = √10 = 3.16
Properties:
• Uses all data values
• Measures typical distance of values from the mean
• Same units as original data (for standard deviation)
• Larger values indicate more variability
• Always non-negative
Computational Formula (easier for hand calculation):
s² = [Σx² - (Σx)²/n]/(n - 1)
Using previous example:
Σx = 30, Σx² = 220, n = 5
s² = [220 - (30)²/5]/(5-1) = [220 - 180]/4 = 40/4 = 10
Interpretation:
• About 68% of data within 1 standard deviation of mean
• About 95% of data within 2 standard deviations of mean
• About 99.7% of data within 3 standard deviations of mean
(For approximately normal distributions)
When to Use:
✓ Interval or ratio level data
✓ Roughly symmetric distributions
✓ Further statistical analysis planned
✓ Comparing variability between groups
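The definitional and computational formulas can be checked against each other in code. The two functions below are illustrative sketches; the cross-check uses variance and pvariance from Python's standard statistics module.

```python
import statistics
from math import sqrt

def sample_variance(values):
    """Definitional formula: sum of squared deviations divided by (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Computational formula: [Σx² - (Σx)²/n] / (n - 1)."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total ** 2 / n) / (n - 1)

data = [2, 4, 6, 8, 10]
s2 = sample_variance(data)
s = sqrt(s2)

print(s2, sample_variance_shortcut(data))   # 10.0 10.0
print(round(s, 2))                          # 3.16
print(statistics.variance(data))            # sample variance (divides by n - 1)
print(statistics.pvariance(data))           # population variance (divides by N)
```

The shortcut formula is convenient by hand, but with floating-point numbers it can lose precision when the values are large relative to their spread, so the definitional form is usually the safer choice in software.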
Coefficient of Variation
Coefficient of Variation
══════════════════════
Definition:
Relative measure of variability expressed as percentage
CV = (s/x̄) × 100%
Purpose:
• Compare variability between datasets with different units
• Compare variability between datasets with different means
• Determine relative consistency
Example 1: Comparing Test Scores
Class A: x̄ = 85, s = 10
Class B: x̄ = 75, s = 8
CV_A = (10/85) × 100% = 11.8%
CV_B = (8/75) × 100% = 10.7%
Class B has slightly less relative variability (10.7% vs. 11.8%), in addition to a smaller standard deviation (8 vs. 10).
Example 2: Comparing Different Measurements
Height: x̄ = 68 inches, s = 3 inches
Weight: x̄ = 150 pounds, s = 20 pounds
CV_height = (3/68) × 100% = 4.4%
CV_weight = (20/150) × 100% = 13.3%
Weight shows more relative variability than height.
Interpretation Guidelines (rough rule of thumb; thresholds vary by field):
• CV < 15%: Low variability
• 15% ≤ CV < 35%: Moderate variability
• CV ≥ 35%: High variability
When to Use:
✓ Comparing datasets with different units
✓ Comparing datasets with very different means
✓ Quality control applications
✓ Financial risk assessment
Limitations:
• Not meaningful when mean is close to zero
• Can be misleading with negative values
• Assumes ratio-level data
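As a brief sketch, the CV is a one-line calculation. The function name is illustrative, and the guard for a zero mean reflects the first limitation listed above.

```python
def coefficient_of_variation(mean, std_dev):
    """CV as a percentage: (standard deviation / mean) × 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100

# Example 1: two classes measured on the same scale.
print(round(coefficient_of_variation(85, 10), 1))   # 11.8
print(round(coefficient_of_variation(75, 8), 1))    # 10.7

# Example 2: different units (inches vs. pounds) become comparable.
print(round(coefficient_of_variation(68, 3), 1))    # 4.4
print(round(coefficient_of_variation(150, 20), 1))  # 13.3
```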
Measures of Position
Percentiles and Quartiles
Percentiles
══════════
Definition:
The kth percentile is the value below which k% of the data falls
Common Percentiles:
• 25th percentile (Q₁): First quartile
• 50th percentile (Q₂): Median
• 75th percentile (Q₃): Third quartile
• 90th percentile: Commonly used in testing
• 95th percentile: Often used as a cutoff point
Calculating Percentiles:
1. Arrange data in ascending order
2. Find position: L = (k/100) × n
3. If L is whole number, percentile = average of values at positions L and L+1
4. If L is not whole number, round up to next integer position
Example: Finding 30th Percentile
Data: 12, 15, 18, 22, 25, 28, 30, 35, 40 (n = 9)
L = (30/100) × 9 = 2.7
Round up to position 3
30th percentile = 18
Five-Number Summary:
1. Minimum value
2. First quartile (Q₁)
3. Median (Q₂)
4. Third quartile (Q₃)
5. Maximum value
Example:
Data: 5, 8, 12, 15, 18, 22, 25, 28, 30, 35, 40
Five-number summary:
Min = 5
Q₁ = 12
Q₂ = 22
Q₃ = 30
Max = 40
Applications:
• Standardized test scores (SAT, GRE)
• Growth charts for children
• Income distribution analysis
• Performance benchmarking
• Quality control limits
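The percentile procedure above (position L = (k/100) × n, average two values when L is whole, otherwise round up) can be sketched as follows. The function name is illustrative, and k is assumed to be an integer between 1 and 99 so that the whole-number test can be done exactly with integer arithmetic. Statistical software often uses interpolation rules that give slightly different values.

```python
def percentile(values, k):
    """k-th percentile using the position rule L = (k/100) × n."""
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99")
    s = sorted(values)
    # whole = integer part of L; remainder == 0 means L is a whole number.
    whole, remainder = divmod(k * len(s), 100)
    if remainder == 0:
        # Average the values at 1-based positions L and L + 1.
        return (s[whole - 1] + s[whole]) / 2
    # Round L up: 1-based position whole + 1 is 0-based index whole.
    return s[whole]

data = [12, 15, 18, 22, 25, 28, 30, 35, 40]
print(percentile(data, 30))   # 18

# Five-number summary for the second example.
data2 = [5, 8, 12, 15, 18, 22, 25, 28, 30, 35, 40]
summary = (min(data2), percentile(data2, 25), percentile(data2, 50),
           percentile(data2, 75), max(data2))
print(summary)                # (5, 12, 22, 30, 40)
```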
Z-Scores (Standard Scores)
Z-Scores
═══════
Definition:
Number of standard deviations a value is from the mean
z = (x - μ)/σ (population)
z = (x - x̄)/s (sample)
Interpretation:
• z = 0: Value equals the mean
• z > 0: Value is above the mean
• z < 0: Value is below the mean
• |z| = 1: Value is 1 standard deviation from mean
Example: Test Scores
Class average: x̄ = 75, standard deviation: s = 10
Student score: x = 85
z = (85 - 75)/10 = 1.0
The student scored 1 standard deviation above the mean.
Properties of Z-Scores:
• Mean of z-scores = 0
• Standard deviation of z-scores = 1
• Shape of distribution unchanged
• Unitless (standardized)
Uses:
• Compare scores from different distributions
• Identify outliers (|z| > 2 or 3)
• Calculate probabilities with normal distribution
• Standardize data for analysis
Example: Comparing Performance
Math test: x = 85, x̄ = 75, s = 10
z_math = (85 - 75)/10 = 1.0
English test: x = 92, x̄ = 88, s = 6
z_english = (92 - 88)/6 = 0.67
Better relative performance in math despite lower absolute score.
Outlier Detection:
• Mild outliers: 2 < |z| ≤ 3
• Extreme outliers: |z| > 3
Modified Z-Score (using median):
More robust for skewed data
Modified z = 0.6745(x - median)/MAD
where MAD = median absolute deviation
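Both the ordinary and the modified z-score are short calculations. The sketch below uses illustrative function names; the income figures are reused from the median example earlier in the chapter to show how the modified score flags an extreme value. It assumes the MAD is not zero.

```python
from statistics import median

def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    med = median(values)
    return median([abs(x - med) for x in values])

def modified_z_scores(values):
    """0.6745 × (x - median) / MAD for each value (assumes MAD > 0)."""
    med = median(values)
    mad = median_absolute_deviation(values)
    return [0.6745 * (x - med) / mad for x in values]

print(z_score(85, 75, 10))            # math test: 1.0
print(round(z_score(92, 88, 6), 2))   # English test: 0.67

incomes = [25_000, 28_000, 32_000, 35_000, 38_000, 42_000, 250_000]
print(median_absolute_deviation(incomes))            # 7000
print([round(m, 2) for m in modified_z_scores(incomes)])
```

The last modified z-score is far larger than the others, so the $250,000 income stands out clearly even though it inflates the ordinary mean and standard deviation.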
Data Visualization
Histograms
Histogram Construction
════════════════════
Purpose:
• Show distribution shape
• Identify patterns and outliers
• Compare distributions
• Estimate probabilities
Construction Steps:
1. Determine number of classes (5-20, typically)
2. Calculate class width = (max - min)/number of classes, rounded up to a convenient value
3. Create class boundaries
4. Count frequencies for each class
5. Draw bars with heights equal to frequencies
Example: Test Scores
Data: 65, 68, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95
Classes:
60-69: 2 students
70-79: 3 students
80-89: 4 students
90-99: 3 students
Histogram Features:
• Bars touch each other (continuous data)
• Height represents frequency
• Area is proportional to frequency when class widths are equal
• No gaps between bars (unless no data)
Distribution Shapes:
• Normal (bell-shaped): Symmetric, single peak
• Right-skewed: Tail extends to right
• Left-skewed: Tail extends to left
• Uniform: All bars approximately same height
• Bimodal: Two distinct peaks
Guidelines:
• Use equal class widths when possible
• Avoid too few or too many classes
• Label axes clearly
• Include title and sample size
• Consider relative frequency for comparisons
Variations:
• Relative frequency histogram (proportions)
• Density histogram (area = 1)
• Cumulative frequency histogram
• Back-to-back histogram (comparing groups)
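The counting step of histogram construction can be sketched without any plotting library. The helper below is illustrative: it assumes equal-width classes starting at a chosen lower limit and ignores values that fall outside the classes. A row of # characters stands in for each bar.

```python
def bin_counts(values, start, width, n_classes):
    """Count how many values fall in each of n_classes equal-width classes."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_classes:      # values outside the classes are ignored
            counts[index] += 1
    return counts

scores = [65, 68, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95]
counts = bin_counts(scores, start=60, width=10, n_classes=4)

# Text histogram: one row per class, bar length equal to the frequency.
for i, c in enumerate(counts):
    low = 60 + 10 * i
    print(f"{low}-{low + 9}: {'#' * c} ({c})")
```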
Box Plots
Box Plot Construction
═══════════════════
Components:
• Box: Extends from Q₁ to Q₃ (contains middle 50%)
• Median line: Line inside box at Q₂
• Whiskers: Lines extending to furthest non-outlier points
• Outliers: Points beyond whiskers (circles or asterisks)
Construction Steps:
1. Calculate five-number summary
2. Identify outliers using IQR rule
3. Draw box from Q₁ to Q₃
4. Draw median line at Q₂
5. Extend whiskers to furthest non-outlier points
6. Plot outliers as individual points
Example:
Data: 12, 15, 18, 22, 25, 28, 30, 35, 40, 45, 60
Five-number summary:
Min = 12, Q₁ = 18, Q₂ = 28, Q₃ = 40, Max = 60
IQR = 40 - 18 = 22
Lower fence: 18 - 1.5(22) = -15
Upper fence: 40 + 1.5(22) = 73
No outliers, so whiskers extend to 12 and 60.
Advantages:
• Shows distribution shape
• Identifies outliers clearly
• Compact display
• Good for comparing groups
• Shows median and quartiles
Disadvantages:
• Less detail than histogram
• Doesn't show sample size
• May hide multimodal distributions
• Requires understanding of quartiles
Variations:
• Notched box plot: Shows confidence interval for median
• Variable width: Box width proportional to sample size
• Violin plot: Combines box plot with density curve
Interpretation:
• Symmetric: Median near center of box, equal whiskers
• Right-skewed: Median toward left side, longer right whisker
• Left-skewed: Median toward right side, longer left whisker
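The numbers needed to draw a box plot can be computed before any plotting. The following sketch uses an illustrative function name and takes quartiles as medians of the lower and upper halves, with the middle value excluded when n is odd, as in the example above. The second dataset is a hypothetical variation (60 replaced by 80) added only to show how an outlier shortens the whisker.

```python
from statistics import median

def box_plot_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5(IQR) rule."""
    s = sorted(values)
    half = len(s) // 2                       # assumes at least two values
    q1, q2, q3 = median(s[:half]), median(s), median(s[len(s) - half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in s if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": q2, "q3": q3,
        "whisker_low": inside[0],            # furthest non-outlier below the box
        "whisker_high": inside[-1],          # furthest non-outlier above the box
        "outliers": [x for x in s if x < low_fence or x > high_fence],
    }

data = [12, 15, 18, 22, 25, 28, 30, 35, 40, 45, 60]
stats = box_plot_stats(data)
print(stats)

# Hypothetical variation: 80 lies above the upper fence of 73.
stats_outlier = box_plot_stats(data[:-1] + [80])
print(stats_outlier["whisker_high"], stats_outlier["outliers"])   # 45 [80]
```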
Scatter Plots
Scatter Plot Analysis
═══════════════════
Purpose:
• Show relationship between two quantitative variables
• Identify correlation patterns
• Detect outliers and unusual points
• Assess linearity of relationships
Construction:
• x-axis: Independent (explanatory) variable
• y-axis: Dependent (response) variable
• Each point represents one observation
• Plot all (x, y) pairs
Relationship Patterns:
Positive Linear:
• Points form upward-sloping pattern
• As x increases, y tends to increase
• Example: Height vs. weight
Negative Linear:
• Points form downward-sloping pattern
• As x increases, y tends to decrease
• Example: Price vs. demand
No Relationship:
• Points scattered randomly
• No clear pattern
• Example: Shoe size vs. GPA
Nonlinear:
• Points form curved pattern
• May be exponential, quadratic, etc.
• Example: Age vs. reaction time
Strength Assessment:
• Strong: Points close to pattern line
• Moderate: Points somewhat scattered around pattern
• Weak: Points widely scattered, pattern unclear
Outliers:
• Points that don't fit the general pattern
• May indicate data errors or special cases
• Can strongly influence correlation
Example Analysis:
Study time (hours) vs. Test score
Observations:
• Positive relationship: More study time → higher scores
• Moderately strong: Points fairly close to line
• One outlier: Student with high study time but low score
• Generally linear relationship
Enhancements:
• Add trend line to show relationship
• Use different colors/symbols for groups
• Add marginal histograms
• Include correlation coefficient
• Size points by third variable (bubble plot)
Summary Statistics and Reports
Creating Effective Summaries
Statistical Summary Reports
═════════════════════════
Essential Components:
• Sample size (n)
• Measures of central tendency
• Measures of variability
• Distribution shape indicators
• Outlier identification
• Confidence intervals (when appropriate)
Standard Summary Format:
Variable: Test Scores
n = 50
Mean = 82.4
Median = 84.0
Mode = 85
Standard Deviation = 8.2
Range = 35 (65 to 100)
IQR = 12 (Q₁ = 78, Q₃ = 90)
Outliers: None by the 1.5(IQR) rule (fences at 60 and 108)
Choosing Appropriate Statistics:
For Symmetric Distributions:
• Central tendency: Mean
• Variability: Standard deviation
• Position: Z-scores
For Skewed Distributions:
• Central tendency: Median
• Variability: IQR
• Position: Percentiles
For Categorical Data:
• Central tendency: Mode
• Variability: Not applicable
• Position: Frequencies and percentages
Comparative Summaries:
When comparing groups, include:
• Side-by-side statistics
• Relative measures (CV, percentages)
• Visual comparisons (box plots)
• Effect size measures
Example: Comparing Two Classes
Statistic | Class A | Class B
----------|---------|--------
n | 25 | 30
Mean | 85.2 | 78.6
Median | 86.0 | 80.0
Std Dev | 6.8 | 9.2
Range | 28 | 35
IQR | 10 | 14
Interpretation: Class A scored higher on average (mean 85.2 vs. 78.6) and was less variable (standard deviation 6.8 vs. 9.2, IQR 10 vs. 14).
Report Writing Guidelines:
• Start with context and data source
• Present statistics in logical order
• Use appropriate precision (2-3 significant digits)
• Include units of measurement
• Highlight key findings
• Discuss limitations and assumptions
• Use tables and graphs effectively
Summary and Key Concepts
Descriptive statistics provides the foundation for understanding and communicating what data reveals, serving as the essential first step in any statistical analysis.
Chapter Summary
══════════════
Essential Skills Mastered:
✓ Organizing data with frequency distributions and displays
✓ Calculating measures of central tendency (mean, median, mode)
✓ Computing measures of variability (range, variance, standard deviation)
✓ Finding measures of position (percentiles, quartiles, z-scores)
✓ Creating and interpreting data visualizations
✓ Writing effective statistical summaries
Key Concepts:
• Central tendency describes typical values
• Variability measures spread of data
• Position measures locate individual values
• Distribution shape affects choice of statistics
• Outliers can significantly impact results
• Visualization reveals patterns not apparent in numbers
Fundamental Measures:
• Mean: Arithmetic average, sensitive to outliers
• Median: Middle value, robust to outliers
• Mode: Most frequent value, useful for categories
• Standard deviation: Typical distance from mean (square root of the variance)
• IQR: Spread of middle 50% of data
• Percentiles: Position relative to other values
Problem-Solving Framework:
• Examine data type and distribution shape
• Choose appropriate measures of center and spread
• Identify and investigate outliers
• Create meaningful visualizations
• Summarize findings clearly and accurately
Visualization Tools:
• Histograms: Show distribution shape and frequency
• Box plots: Display five-number summary and outliers
• Scatter plots: Reveal relationships between variables
• Stem-and-leaf: Preserve actual data values
• Frequency tables: Organize categorical data
Next Steps:
Descriptive statistics skills prepare you for:
• Probability theory and distributions
• Inferential statistics and hypothesis testing
• Regression analysis and correlation
• Quality control and process improvement
• Data analysis in research and business
Descriptive statistics represents the essential foundation of statistical analysis, providing the tools to transform raw data into meaningful information. The techniques covered in this chapter—from basic measures of center and spread to sophisticated visualization methods—enable you to explore data systematically, identify important patterns, and communicate findings effectively.
Understanding descriptive statistics is crucial not only for further statistical study but also for making informed decisions in any field that involves data. Whether you’re analyzing business performance, evaluating research results, or simply trying to make sense of information in daily life, these descriptive tools provide the framework for clear, accurate, and insightful data analysis.
The skills developed in this chapter form the building blocks for more advanced statistical methods. As you progress to inferential statistics, you’ll use these descriptive techniques to understand sample data before making generalizations about populations. The visualization and summary skills you’ve learned will remain essential throughout your statistical journey, helping you communicate results and validate assumptions in more complex analyses.