Q&A 8 How do you examine the structure and types of variables in Python and R?

8.1 Explanation

Understanding the structure of your dataset, including the data type of each column, is a key step in exploratory data analysis. It helps you:

- Know which transformations are needed
- Identify categorical vs. numerical variables
- Prepare your data for modeling or visualization

Each column in your dataset has a specific data type. These types influence how operations behave, how memory is allocated, and how functions treat your data.
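As a small illustration of how a data type changes both behavior and memory use, the sketch below compares the same values stored as text and as numbers, then compares a repeated label stored as plain text and as a `category`. The series names and toy values are hypothetical and are not part of the iris dataset used later.

```python
import pandas as pd

# Hypothetical toy data: the same values stored as text and as numbers
as_text = pd.Series(["1", "2", "3"])
as_numbers = pd.Series([1, 2, 3])

# "+" concatenates text but adds numbers
text_result = (as_text + as_text).tolist()
number_result = (as_numbers + as_numbers).tolist()
print(text_result)    # ['11', '22', '33']
print(number_result)  # [2, 4, 6]

# Memory: a repeated label stored as text vs. as a category
labels = pd.Series(["setosa", "versicolor", "virginica"] * 1000)
text_bytes = labels.memory_usage(deep=True)
category_bytes = labels.astype("category").memory_usage(deep=True)
print("text:", text_bytes, "bytes; category:", category_bytes, "bytes")
```

The exact byte counts depend on the pandas version, but the categorical copy should be noticeably smaller because each distinct label is stored only once.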

8.1.1 ✅ Common Data Types in Python and R

Concept	Python / `pandas`	R (`base`)	Notes
Integer	`int64` (`int` in plain Python)	`integer`	Use `astype(int)` or `as.integer()`
Decimal Number	`float64` (`float` in plain Python)	`numeric`, `double`	`numeric` in R defaults to `double`
Text / String	`str`, `object`	`character`	Use `astype(str)` or `as.character()`
Logical / Boolean	`bool`	`logical`	`True`/`False` in Python, `TRUE`/`FALSE` in R
Date / Time	`datetime64[ns]`	`Date`, `POSIXct`	Use `pd.to_datetime()` or `as.Date()`
Category	`category`	`factor`	Useful for grouping and modeling
Missing Values	`NaN` (`numpy`)	`NA`	Use `pd.isna()` or `is.na()`
Complex Numbers	`complex`	`complex`	Rare in typical EDA workflows
List	`list`	`list`	R lists allow mixed data types
Dictionary	`dict`	`named list`	R lists with names can mimic Python dictionaries
Tuple	`tuple`	`c()`, `list()`	No direct equivalent; use vectors or lists in R
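The conversion functions in the Notes column can be tried on a small example. The sketch below assumes a hypothetical table in which every column was read as text; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data in which every column arrived as text
raw = pd.DataFrame({
    "count": ["1", "2", "3"],
    "price": ["1.5", "2.0", "3.25"],
    "group": ["a", "b", "a"],
    "day": ["2024-01-01", "2024-01-02", "2024-01-03"],
})

# Convert each column to a more suitable type
typed = pd.DataFrame({
    "count": raw["count"].astype("int64"),     # text -> integer
    "price": raw["price"].astype("float64"),   # text -> decimal number
    "group": raw["group"].astype("category"),  # text -> category
    "day": pd.to_datetime(raw["day"]),         # text -> date/time
})
print(typed.dtypes)

# Missing values: None becomes NaN in a float column
with_gap = pd.Series([1.0, None])
missing_flags = pd.isna(with_gap).tolist()
print(missing_flags)  # [False, True]
```

Printing `typed.dtypes` before and after a conversion is a quick way to confirm that it had the intended effect.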

8.2 Python Code

import pandas as pd

# Load the standardized dataset
df = pd.read_csv("data/iris.csv")

# View column names
print("Column names:", df.columns.tolist())

# Check data types
print("\nData types:")
print(df.dtypes)

# Optional: Use .info() for a more detailed summary
print("\nStructure info:")
df.info()

Column names: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

Structure info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
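Once the types are known, one possible next step is to separate numerical columns from the rest. The sketch below uses a hypothetical three-row stand-in for the iris data so that it runs without the CSV file; with the real dataset, the same calls could be applied to `df`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the iris data
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# Split column names by type
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)  # ['sepal_length', 'petal_length']
print(other_cols)    # ['species']

# Treat the text column as categorical and list its levels
sample["species"] = sample["species"].astype("category")
species_levels = sample["species"].cat.categories.tolist()
print(species_levels)  # ['setosa', 'versicolor', 'virginica']
```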

8.3 R Code

library(readr)

# Load the standardized dataset
df <- read_csv("data/iris.csv")

# View column names
names(df)

[1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"

# Check data types (structure)
str(df)

spc_tbl_ [150 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ sepal_length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ sepal_width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ petal_length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ petal_width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ species     : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
 - attr(*, "spec")=
  .. cols(
  ..   sepal_length = col_double(),
  ..   sepal_width = col_double(),
  ..   petal_length = col_double(),
  ..   petal_width = col_double(),
  ..   species = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

# Optionally print class of each variable
sapply(df, class)

sepal_length  sepal_width petal_length  petal_width      species 
   "numeric"    "numeric"    "numeric"    "numeric"  "character"

✅ Once you’re familiar with variable types, you can decide how to clean, filter, or transform your data — and which variables are ready for plotting or modeling.