Q&A 14 How do you detect and remove duplicate rows in Python and R?
14.1 Explanation
Duplicate rows can arise from data entry errors, merging datasets, or exporting data multiple times. Identifying and removing them is an important data-cleaning step, so that counts, summary statistics, and model estimates are not inflated by repeated observations.
In both Python and R, we can:
- Detect duplicates
- Count them
- Drop them if needed
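A toy example makes the default convention concrete: the first occurrence of a repeated row is kept, and only the later copies are flagged. A minimal Python sketch (the tiny DataFrame here is invented purely for illustration):
import pandas as pd
# Rows 0 and 1 are identical; only the second copy is flagged
toy = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})
print(toy.duplicated().tolist())  # [False, True, False]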
14.2 Python Code
import pandas as pd
# Load dataset
df = pd.read_csv("data/iris.csv")
# Check for duplicate rows (True marks all but the first occurrence)
duplicates = df.duplicated()
print("Any duplicates?", duplicates.any())
# Count duplicate rows
print("Number of duplicates:", duplicates.sum())
# Remove duplicates
df_cleaned = df.drop_duplicates()
# Confirm removal
print("New shape:", df_cleaned.shape)
Any duplicates? True
Number of duplicates: 1
New shape: (149, 5)
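In practice it often helps to inspect the duplicated rows before dropping them, and sometimes duplicates should be defined by a subset of columns rather than the full row. A minimal sketch of both, assuming the df loaded above and a species column literally named "species" (adjust to the column names in your CSV):
# Show every occurrence of a duplicated row, not just the later copies
# (keep=False marks all members of each duplicate group as True)
print(df[df.duplicated(keep=False)])
# Treat rows as duplicates based on selected columns only;
# "species" is an assumed column name
df_subset = df.drop_duplicates(subset=["species"], keep="first")
print("Rows after column-based dedup:", df_subset.shape[0])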
14.3 R Code
library(readr)
library(dplyr)
# Load dataset
df <- read_csv("data/iris.csv")
# Check for duplicate rows
duplicates <- duplicated(df)
cat("Any duplicates?", any(duplicates), "\n")
Any duplicates? TRUE
Number of duplicates: 1
# Remove duplicates
df_cleaned <- df %>%
  distinct()
# Confirm new size
cat("New number of rows:", nrow(df_cleaned), "\n")
New number of rows: 149
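The same refinements carry over to R. A minimal sketch, again assuming a species column literally named species (adjust to your CSV):
# Show every occurrence of a duplicated row
# (fromLast = TRUE also catches the earlier copies)
print(df[duplicated(df) | duplicated(df, fromLast = TRUE), ])
# Treat rows as duplicates based on selected columns only;
# .keep_all = TRUE retains the remaining columns of the first match
df_subset <- df %>%
  distinct(species, .keep_all = TRUE)
cat("Rows after column-based dedup:", nrow(df_subset), "\n")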
✅ Removing duplicates ensures your results reflect unique observations rather than repeated data points.