Q&A 10 How do you get summary statistics for numeric variables in Python and R?
10.1 Explanation
Summary statistics provide a quick overview of your numeric data. They help you understand:
- Central tendency (mean, median)
- Spread (min, max, standard deviation, quartiles)
- Distribution shape and potential outliers
Both Python (through the pandas library) and R (through the base summary() function) can compute summary statistics for every column in a dataset with a single call. These summaries are essential when assessing data quality and preparing for visualization or modeling.
10.2 Python Code
import pandas as pd
# Load the dataset
df = pd.read_csv("data/iris.csv")
# Get summary statistics for all numeric columns
summary = df.describe()
print(summary)
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
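💡 df.describe() returns count, mean, std, min, 25%, 50% (median), 75%, and max for each numeric column. Non-numeric columns such as species are skipped by default, which is why species does not appear in the output above.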
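The default call can be adjusted when you also want non-numeric columns or different percentiles. The snippet below is a minimal sketch of that: the small DataFrame is made up for illustration (it stands in for data/iris.csv and does not reproduce the real values), while include and percentiles are standard parameters of describe().

```python
import pandas as pd

# Made-up sample standing in for data/iris.csv (illustrative values only)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1],
    "petal_length": [1.4, 1.4, 6.0, 5.1, 5.9],
    "species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
})

# Default behavior: only the numeric columns are summarized
numeric_summary = df.describe()

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
full_summary = df.describe(include="all")

# Request specific percentiles instead of the default 25%, 50%, 75%
tail_summary = df.describe(percentiles=[0.05, 0.5, 0.95])

print(numeric_summary)
print(full_summary)
print(tail_summary)
```

With include="all", statistics that do not apply to a column (for example, mean for species) are shown as NaN.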
10.3 R Code
library(readr)
# Load the dataset
df <- read_csv("data/iris.csv")
# Get summary statistics
summary(df)
  sepal_length    sepal_width     petal_length    petal_width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
   species
 Length:150
 Class :character
 Mode  :character
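💡 summary() in R returns min, 1st quartile, median, mean, 3rd quartile, and max for each numeric column. For character columns such as species, it reports only length, class, and mode.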
✅ These summaries give you a solid first look at the data distribution and can guide further steps like filtering, normalization, or visualization.
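As one example of how the quartiles can guide filtering, the sketch below flags potential outliers with the common 1.5 × IQR rule of thumb and checks distribution shape with skewness, which describe() does not report. The values are made up for illustration, with 9.9 planted as an obvious outlier; the 1.5 multiplier is a convention, not something the summary itself prescribes.

```python
import pandas as pd

# Made-up values for illustration; 9.9 is planted as an obvious outlier
values = pd.Series([2.8, 3.0, 3.1, 3.2, 3.0, 2.9, 3.3, 9.9], name="sepal_width")

# Pull the quartiles straight out of the describe() result
stats = values.describe()
q1, q3 = stats["25%"], stats["75%"]
iqr = q3 - q1

# Rule of thumb: flag points more than 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

# Skewness is not part of describe(); a positive value means a longer right tail
skewness = values.skew()

print(outliers)
print(f"skewness: {skewness:.2f}")
```

Flagged points are candidates for inspection rather than automatic removal; whether to drop, cap, or keep them depends on the dataset.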