Q&A 21 How do you summarize numerical and categorical variables?
21.1 Explanation
Before visualizing or modeling, it’s important to summarize the dataset:
- Numerical variables: Check distributions, ranges, and outliers
→ Common summaries: mean, median, min, max, standard deviation
- Categorical variables: Understand group frequencies
→ Common summaries: counts, proportions, levels
These summaries help identify anomalies, guide data transformations, and inform appropriate plots.
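The summaries listed above can be sketched on a small example. The toy data frame and its column names (`length`, `group`) are hypothetical and stand in for any dataset with one numerical and one categorical column; this is a minimal illustration, not the only way to compute these values.

```python
import pandas as pd

# Hypothetical toy data (not the iris file used in this section)
toy = pd.DataFrame({
    "length": [4.9, 5.1, 6.3, 5.8, 7.1, 6.0],
    "group": pd.Categorical(["a", "a", "b", "b", "c", "b"]),
})

# Numerical variable: the common summaries in a single call
num_summary = toy["length"].agg(["mean", "median", "min", "max", "std"])
print(num_summary)

# Categorical variable: counts, proportions, and levels
counts = toy["group"].value_counts()
props = toy["group"].value_counts(normalize=True)
levels = list(toy["group"].cat.categories)
print(counts)
print(props)
print(levels)
```

Passing `normalize=True` to `value_counts()` returns proportions instead of raw counts, which is often easier to compare across groups of different sizes.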
21.2 Python Code
# ✅ Import libraries
import pandas as pd
# Load dataset
df = pd.read_csv("data/iris.csv")
# Ensure correct types
df["species"] = df["species"].astype("category")
# Summary of numerical variables
print(df.describe())
# Summary of categorical variable
print(df["species"].value_counts())
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
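The quartiles that `describe()` reports (the 25% and 75% rows) can also feed a simple outlier check. The sketch below uses the common 1.5 × IQR rule of thumb, which is an assumption here rather than something the section prescribes, and the values are hypothetical, with the last one deliberately extreme.

```python
import pandas as pd

# Hypothetical values; the last one is deliberately extreme
values = pd.Series([2.8, 3.0, 3.1, 3.2, 3.3, 3.4, 6.0])

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Flag points beyond 1.5 * IQR from the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Points flagged this way are candidates for a closer look, not automatic errors; whether to keep, transform, or drop them depends on the data.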
21.3 R Code
# ✅ Load modern tools
library(tidyverse)
# Load dataset
df <- read_csv("data/iris.csv", show_col_types = FALSE) %>%
  mutate(species = as.factor(species))
# Summary of numerical variables
summary(select(df, -species))
  sepal_length    sepal_width     petal_length    petal_width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
# Summary of categorical variable
df %>% count(species)
# A tibble: 3 × 2
  species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50