Q&A 21 How do you summarize numerical and categorical variables?

21.1 Explanation

Before visualizing or modeling, it’s important to summarize the dataset:

  • Numerical variables: Check distributions, ranges, and outliers
    → Common summaries: mean, median, min, max, standard deviation
  • Categorical variables: Understand group frequencies
    → Common summaries: counts, proportions, levels

These summaries help you spot anomalies, decide which transformations are needed, and choose appropriate plots.
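As a minimal sketch of the numerical side, the snippet below computes the summaries listed above and flags possible outliers. The `toy` data frame and its `height_cm` column are hypothetical and stand in for any numeric variable. The 1.5 × IQR rule is an assumption here: it is one common heuristic for outliers, not the only one.

```python
import pandas as pd

# Hypothetical toy data (not the iris file used later in this section)
toy = pd.DataFrame({"height_cm": [150.0, 160.0, 165.0, 170.0, 175.0, 250.0]})

# Common numerical summaries: mean, median, min, max, standard deviation
num_summary = toy["height_cm"].agg(["mean", "median", "min", "max", "std"])
print(num_summary)

# Assumed outlier heuristic: values beyond 1.5 * IQR from the quartiles
q1 = toy["height_cm"].quantile(0.25)
q3 = toy["height_cm"].quantile(0.75)
iqr = q3 - q1
is_outlier = (toy["height_cm"] < q1 - 1.5 * iqr) | (toy["height_cm"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]
print(outliers)
```

A large gap between the mean and the median, as in this toy column, is often a first hint that extreme values are present.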

21.2 Python Code

# ✅ Import libraries
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Ensure correct types
df["species"] = df["species"].astype("category")

# Summary of numerical variables (describe() skips non-numeric columns by default)
print(df.describe())

# Summary of categorical variable
print(df["species"].value_counts())

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
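The counts above can be complemented with proportions and levels, the other two categorical summaries mentioned in the explanation. This is a small sketch on a hypothetical four-row `species` series built inline, so it runs without `data/iris.csv`. With the real data you would call the same methods on `df["species"]`.

```python
import pandas as pd

# Hypothetical stand-in for the species column (not read from data/iris.csv)
species = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica"], name="species"
).astype("category")

# Counts per level
counts = species.value_counts()
print(counts)

# Proportions per level (normalize=True divides each count by the total)
proportions = species.value_counts(normalize=True)
print(proportions)

# Levels (categories) of the variable
levels = list(species.cat.categories)
print(levels)
```

Proportions are easier to compare across datasets of different sizes, while raw counts show whether any level is too sparse to analyze.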

21.3 R Code

# ✅ Load modern tools
library(tidyverse)

# Load dataset
df <- read_csv("data/iris.csv", show_col_types = FALSE) %>%
  mutate(species = as.factor(species))

# Summary of numerical variables
summary(select(df, -species))

  sepal_length    sepal_width     petal_length    petal_width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

# Summary of categorical variable
count(df, species)

# A tibble: 3 × 2
  species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50