Q&A 14 How do you detect and remove duplicate rows in Python and R?
14.1 Explanation
Duplicate rows can arise from data entry errors, merging datasets, or exporting data multiple times. Identifying and removing them is an important data-cleaning step, so that counts, summary statistics, and model estimates are not inflated by repeated observations.
In both Python and R, we can:
- Detect duplicates
- Count them
- Drop them if needed
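A toy example makes the default convention concrete: the first occurrence of a repeated row is kept, and only the later copies are flagged. A minimal Python sketch (the tiny DataFrame here is invented purely for illustration):
import pandas as pd
# Rows 0 and 1 are identical; only the second copy is flagged
toy = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})
print(toy.duplicated().tolist())  # [False, True, False]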
14.2 Python Code
import pandas as pd
# Load dataset
df = pd.read_csv("data/iris.csv")
# Check for duplicate rows (True marks all but the first occurrence)
duplicates = df.duplicated()
print("Any duplicates?", duplicates.any())
# Count duplicate rows
print("Number of duplicates:", duplicates.sum())
# Remove duplicates
df_cleaned = df.drop_duplicates()
# Confirm removal
print("New shape:", df_cleaned.shape)
Any duplicates? True
Number of duplicates: 1
New shape: (149, 5)
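In practice it often helps to inspect the duplicated rows before dropping them, and sometimes duplicates should be defined by a subset of columns rather than the full row. A minimal sketch of both, assuming the df loaded above and a species column literally named "species" (adjust to the column names in your CSV):
# Show every occurrence of a duplicated row, not just the later copies
# (keep=False marks all members of each duplicate group as True)
print(df[df.duplicated(keep=False)])
# Treat rows as duplicates based on selected columns only;
# "species" is an assumed column name
df_subset = df.drop_duplicates(subset=["species"], keep="first")
print("Rows after column-based dedup:", df_subset.shape[0])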
14.3 R Code
library(readr)
library(dplyr)
# Load dataset
df <- read_csv("data/iris.csv")
# Check for duplicate rows
duplicates <- duplicated(df)
cat("Any duplicates?", any(duplicates), "\n")
Any duplicates? TRUE
Number of duplicates: 1
# Remove duplicates
df_cleaned <- df %>%
  distinct()
# Confirm new size
cat("New number of rows:", nrow(df_cleaned), "\n")
New number of rows: 149
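The same refinements carry over to R. A minimal sketch, again assuming a species column literally named species (adjust to your CSV):
# Show every occurrence of a duplicated row
# (fromLast = TRUE also catches the earlier copies)
print(df[duplicated(df) | duplicated(df, fromLast = TRUE), ])
# Treat rows as duplicates based on selected columns only;
# .keep_all = TRUE retains the remaining columns of the first match
df_subset <- df %>%
  distinct(species, .keep_all = TRUE)
cat("Rows after column-based dedup:", nrow(df_subset), "\n")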
✅ Removing duplicates ensures your results reflect unique observations rather than repeated data points.