Q&A 27 How do you use boxplots to compare groups across a categorical variable?

27.1 Explanation

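Boxplots are well suited to comparing the distribution of a numerical variable across groups. Each box spans the interquartile range (IQR) with a line at the median. By default, the whiskers extend to the most extreme points within 1.5 × IQR of the box, and anything beyond them is drawn individually as a potential outlier.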

They help answer questions like:

- Are group medians different?
- Is one group more variable than others?
- Are there any outliers?
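
The same questions can also be checked numerically. The following is a minimal sketch, not part of the plotting code in this section: `box_stats` is a hypothetical helper, the two groups are made-up toy values, and the quartiles use linear interpolation with the 1.5 × IQR rule, which is assumed here to match the plotting defaults.

```python
from statistics import median, quantiles


def box_stats(values):
    """Return the summary statistics a boxplot draws for one group."""
    # Quartiles with linear interpolation between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box are flagged as potential outliers
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"median": median(values), "q1": q1, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical toy measurements for two groups (illustration only)
groups = {
    "a": [48, 50, 51, 52, 54],
    "b": [57, 59, 60, 61, 90],
}

for name, values in groups.items():
    print(name, box_stats(values))
```

Comparing the `median` values addresses the first question, the `iqr` values the second, and a non-empty `outliers` list the third.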

Boxplots are most effective when comparing a handful of groups and when summary statistics (median, spread, outliers) matter more than the detailed shape of each distribution.

27.2 Python Code

# ✅ Load libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv("data/iris.csv")

# Boxplot: Sepal length by species
plt.figure(figsize=(6, 5))
# hue repeats the x variable only to apply the palette; legend=False hides the redundant legend
sns.boxplot(data=df, x="species", y="sepal_length", hue="species", palette="viridis", legend=False)
plt.title("Sepal Length by Species")
plt.xlabel("Species")
plt.ylabel("Sepal Length")
plt.tight_layout()
plt.show()

27.3 R Code

# ✅ Load libraries
library(tidyverse)

# Load dataset
df <- read_csv("data/iris.csv", show_col_types = FALSE)

# Boxplot: Sepal length by species
ggplot(df, aes(x = species, y = sepal_length, fill = species)) +
  geom_boxplot(show.legend = FALSE) +  # the x-axis already labels each species
  labs(title = "Sepal Length by Species", x = "Species", y = "Sepal Length") +
  theme_minimal()