Q&A 4 What are Common Sources of Datasets for Python and R?

4.1 Explanation

Before you can analyze data, you need to get it. Python and R both provide built-in datasets and offer access to many high-quality public data sources online. These datasets are used for practice, learning, benchmarking, and real-world analysis.

In this question, we’ll look at:

Built-in datasets available through standard libraries
Trusted online sources for downloading CSV or Excel files
How to access and load example datasets directly from Python or R

4.2 Built-in or Package-Based Datasets

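These datasets ship with, or are fetched by, common libraries, so you can load them with a single call instead of downloading files by hand. Note that seaborn's load_dataset() and statsmodels' get_rdataset() retrieve the data over the internet (seaborn caches it locally after the first call), whereas scikit-learn's load_iris() and the R examples below read data bundled with the installed package.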

4.3 Python

Seaborn:

import seaborn as sns
# Downloads the data on first use, caches it locally, and returns a pandas DataFrame
df = sns.load_dataset("iris")

Scikit-learn:

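from sklearn import datasets
# load_iris() returns a Bunch object: features in .data, class labels in .target
iris = datasets.load_iris()
print(iris.data[:5])  # first five rows of the feature matrix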

Statsmodels:

import statsmodels.api as sm
# Fetches the "Guerry" dataset of the R package HistData; requires internet access
df = sm.datasets.get_rdataset("Guerry", "HistData").data

4.4 R

datasets package:

df <- datasets::iris
head(df)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

ggplot2:

data("diamonds", package = "ggplot2")
head(diamonds)

  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

palmerpenguins (if installed):

library(palmerpenguins)
head(penguins)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

4.5 Online Public Data Sources

Source	Link
UCI Machine Learning Repo	https://archive.ics.uci.edu/ml/
Kaggle Datasets	https://www.kaggle.com/datasets
data.gov (US Government)	https://www.data.gov
Awesome Public Datasets	https://github.com/awesomedata/awesome-public-datasets
World Bank Open Data	https://data.worldbank.org/

💡 Tip: Always save downloaded datasets in your data/ folder and reference them using relative paths like data/filename.csv.
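As a rough illustration of this tip, the sketch below downloads a file into a data/ folder and returns its relative path. It uses only the Python standard library. The helper name download_dataset, the DATA_DIR constant, and the skip-if-present behavior are assumptions made for this example, not part of any library.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: the project keeps its datasets in a data/ folder
# relative to the directory you run your scripts from.
DATA_DIR = Path("data")


def download_dataset(url: str, filename: str, data_dir: Path = DATA_DIR) -> Path:
    """Save the file at `url` as data_dir/filename and return that path.

    Hypothetical helper: if the file already exists locally, the download
    is skipped, so repeated runs do not fetch it again.
    """
    data_dir.mkdir(parents=True, exist_ok=True)  # create data/ on first use
    target = data_dir / filename
    if not target.exists():
        # Stream the response to disk instead of holding it all in memory
        with urllib.request.urlopen(url) as response, target.open("wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

Calling download_dataset(url, "filename.csv") with a direct link to a CSV file from one of the sources above should leave you with data/filename.csv, which you can then pass to whatever loading function you prefer. Delete the local copy if you need to fetch a fresh version.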

✅ Now that you know where to find data, let’s learn how to load and preview it in your Python or R environment.
