Q&A 4 What are Common Sources of Datasets for Python and R?

4.1 Explanation

Before you can analyze data, you need to get it. Python and R both provide built-in datasets and offer access to many high-quality public data sources online. These datasets are used for practice, learning, benchmarking, and real-world analysis.

In this question, we’ll look at:

  • Built-in datasets available through standard libraries
  • Trusted online sources for downloading CSV or Excel files
  • How to access and load example datasets directly from Python or R

4.2 Built-in or Package-Based Datasets

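These datasets come with common libraries, so you can load them with a single function call instead of locating files by hand. Most ship inside the package itself. A few loaders, such as seaborn's load_dataset() and statsmodels' get_rdataset(), fetch the data from an online repository and therefore need an internet connection.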

4.3 Python

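  • Seaborn:

    import seaborn as sns
    df = sns.load_dataset("iris")  # returns a pandas DataFrame
  • Scikit-learn:

    from sklearn import datasets
    iris = datasets.load_iris()  # Bunch object: features in .data, labels in .target
    print(iris.data[:5])  # first five rows of the feature array
  • Statsmodels:

    import statsmodels.api as sm
    # Fetches the "Guerry" dataset from the R package "HistData"
    df = sm.datasets.get_rdataset("Guerry", "HistData").data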
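
Each loader above only works if its library is installed. As a minimal sketch, the snippet below checks which of the three packages are available before you try to load anything. The helper name `available` is hypothetical and used only for illustration. If scikit-learn is present, it also prints the feature names and shape of the bundled iris data.

```python
import importlib.util


def available(packages):
    """Map each import name to True if the package can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


# Note: scikit-learn is imported under the name "sklearn"
status = available(["seaborn", "sklearn", "statsmodels"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")

if status["sklearn"]:
    from sklearn import datasets

    iris = datasets.load_iris()   # bundled with scikit-learn, no download
    print(iris.feature_names)     # column names for iris.data
    print(iris.data.shape)        # (rows, columns)
```

A missing package can usually be added with pip or conda before you rerun the loader.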

4.4 R

  • datasets package:

    df <- datasets::iris
    head(df)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa
  • ggplot2:

    data("diamonds", package = "ggplot2")
    head(diamonds)
      carat       cut color clarity depth table price    x    y    z
    1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
    2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
    3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
    4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
    5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
    6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
  • palmerpenguins (if installed):

    library(palmerpenguins)
    head(penguins)
    # A tibble: 6 × 8
      species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
      <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
    1 Adelie  Torgersen           39.1          18.7               181        3750
    2 Adelie  Torgersen           39.5          17.4               186        3800
    3 Adelie  Torgersen           40.3          18                 195        3250
    4 Adelie  Torgersen           NA            NA                  NA          NA
    5 Adelie  Torgersen           36.7          19.3               193        3450
    6 Adelie  Torgersen           39.3          20.6               190        3650
    # ℹ 2 more variables: sex <fct>, year <int>

4.5 Online Public Data Sources

Source                       Link
UCI Machine Learning Repo    https://archive.ics.uci.edu/ml/
Kaggle Datasets              https://www.kaggle.com/datasets
data.gov (US Government)     https://www.data.gov
Awesome Public Datasets      https://github.com/awesomedata/awesome-public-datasets
World Bank Open Data         https://data.worldbank.org/

💡 Tip: Always save downloaded datasets in your data/ folder and reference them using relative paths like data/filename.csv.
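
The tip above can be sketched with the Python standard library alone. The helpers `save_to_data` and `preview_csv` and the file name `sample_iris.csv` are hypothetical names chosen for this illustration. So that the example runs offline, a small temporary file stands in for a real download URL. In practice you would pass the CSV link from one of the sources listed above.

```python
import csv
import shutil
import tempfile
from itertools import islice
from pathlib import Path
from urllib.request import urlopen

DATA_DIR = Path("data")  # relative to the project's working directory


def save_to_data(url, filename, data_dir=DATA_DIR):
    """Download `url` into data_dir/filename and return the relative path."""
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename
    with urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target


def preview_csv(path, n=5):
    """Return the first n rows of a CSV file as dictionaries."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(islice(csv.DictReader(f), n))


# Assumption: a local file URI stands in for a real download link
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("species,sepal_length\nsetosa,5.1\nsetosa,4.9\n", encoding="utf-8")
    csv_path = save_to_data(source.as_uri(), "sample_iris.csv")

rows = preview_csv(csv_path)
print(csv_path.as_posix())  # relative path you can reuse in later scripts
print(rows)
```

Keeping the path relative means the same script works on any machine where the project folder has the same layout.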


✅ Now that you know where to find data, let’s learn how to load and preview it in your Python or R environment.

4.6 Python Code