Q&A 20 How do you convert variable types in a dataset?

20.1 Explanation

In earlier steps, we created a small test dataset to explore variable types. Now, let’s switch to a more realistic dataset: data/iris.csv, which was loaded and inspected in the Exploratory Data Analysis (EDA) section.

This dataset contains four numerical features (sepal and petal length and width) and a categorical target (species). We’ll demonstrate how to:

  • Convert the species column to a categorical (pandas) or factor (R) type
  • Confirm numeric columns are correctly typed
  • Prepare variables for modeling and visualization

20.2 Python Code

# ✅ Import libraries
import pandas as pd

# Load the dataset
df = pd.read_csv("data/iris.csv")

# Convert species to a categorical variable
df["species"] = df["species"].astype("category")

# Confirm types
print("\nVariable/Feature type\n", df.dtypes)
print("\nSpecies type\n", df["species"].cat.categories)
Variable/Feature type
 sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object

Species type
 Index(['setosa', 'versicolor', 'virginica'], dtype='object')
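
The iris file loads with clean float64 columns, so the dtypes printout is enough to confirm the numeric features. Real files are often less tidy. The sketch below shows one way the "confirm numeric columns" and "prepare for modeling" steps might look when a measurement arrives as text. It uses a hypothetical in-memory sample rather than data/iris.csv, and the deliberately dirty petal_length values and the species_code column name are assumptions made for illustration only.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical in-memory sample with the same column names as data/iris.csv.
# petal_length is deliberately stored as text to imitate a dirty file.
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": ["1.4", "4.7", "n/a"],  # strings, so the dtype is not numeric
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# List the feature columns that are not numeric yet
non_numeric = [col for col in feature_cols if not is_numeric_dtype(sample[col])]
print("Non-numeric features:", non_numeric)

# Coerce them to numbers; values that cannot be parsed become NaN
for col in non_numeric:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

# Convert the target to a categorical and derive integer codes for modeling
sample["species"] = sample["species"].astype("category")
sample["species_code"] = sample["species"].cat.codes  # hypothetical column name

print(sample.dtypes)
```

Using errors="coerce" turns unparseable entries into NaN instead of raising an error, so it is worth counting missing values afterwards. The integer codes follow the sorted order of the categories; some modeling libraries accept the category dtype directly, in which case the extra column is unnecessary.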

20.3 R Code

# ✅ Load the tidyverse (provides read_csv, mutate, and glimpse)
library(tidyverse)

# Load iris dataset
df <- read_csv("data/iris.csv", show_col_types = FALSE)

# Convert species to factor
df <- df %>%
  mutate(species = as.factor(species))

# Inspect types
glimpse(df)
Rows: 150
Columns: 5
$ sepal_length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ sepal_width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ petal_length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ petal_width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
levels(df$species)
[1] "setosa"     "versicolor" "virginica"