A Convenient Python Package to Download Open Datasets

Four concrete examples show how to use a Python package to download open datasets from GitHub, Google Drive, Kaggle and other online sources

In several previous posts, I showed how to download datasets from different sources, such as GitHub repositories, Google Drive, and Kaggle. I suggest reading those posts to see the code if you are interested in programming. In this post, I introduce a Python package that lets you write less code and conveniently download datasets from different online sources: GitHub, Google Drive, Kaggle and other websites.

1. Install the Package

The package is called opendatasets, and you can find it on PyPI and GitHub. First, open your preferred command-line shell, such as Windows Command Prompt, PowerShell, or Terminal, and then install the package using pip as follows:

pip install opendatasets --upgrade

The installation process looks as follows:

Fig. 1. Installation process of the Python opendatasets package

From the last line of the above screenshot, we can see that version 0.1.22 of the package was installed successfully.
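If you prefer to confirm the installation from Python rather than from the pip output, the following minimal sketch uses the standard library's importlib.metadata. The helper name installed_version is my own and is not part of opendatasets.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the installed version of opendatasets, or None if it is not installed.
print(installed_version("opendatasets"))
```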

2. Applications of the Package

Now, let’s see how to use it with 4 concrete examples.

(1) Download datasets from GitHub

Just go to GitHub and search for the dataset you need. Here, as an example, I use a dataset from my GitHub repository: the historical daily USD to CNY exchange rate from September 24, 2012 to September 24, 2023. The URL of the dataset is:

https://github.com/shoukewei/data/blob/main/data-wpt/USD_CNY%20Historical%20Data.csv

Note that we have to use the link to the raw dataset, which was introduced in a previous post. If you are not familiar with how to find the URL of a raw dataset, it is best to read that post first.

https://raw.githubusercontent.com/shoukewei/data/main/data-wpt/USD_CNY%20Historical%20Data.csv
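Comparing the two URLs above, the raw link uses the host raw.githubusercontent.com and drops the blob part of the path. As a rough sketch of that conversion (the function name github_blob_to_raw is hypothetical, and it assumes the usual github.com/owner/repo/blob/branch/path layout):

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(blob_url):
    """Convert a github.com 'blob' page URL into its raw.githubusercontent.com URL."""
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: owner / repo / "blob" / branch / path to the file
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"Not a GitHub blob URL: {blob_url}")
    owner, repo, _blob, *rest = segments
    raw_path = "/" + "/".join([owner, repo, *rest])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


blob_url = "https://github.com/shoukewei/data/blob/main/data-wpt/USD_CNY%20Historical%20Data.csv"
print(github_blob_to_raw(blob_url))
```

The percent-encoded spaces (%20) in the file name are kept unchanged, because urlsplit does not decode the path.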

Then, let’s see how to download this dataset using the package.

First, import the package:

import opendatasets as od

Then, download the data using the following code.

dataset_url = 'https://raw.githubusercontent.com/shoukewei/data/main/data-wpt/USD_CNY%20Historical%20Data.csv'
od.download(dataset_url)

You will then see the following message, which reports the download speed and tells you that USD_CNY%20Historical%20Data.csv was downloaded into your current working directory.

running results

(2) Download datasets from Google Drive

Google Drive is not a dataset source, but it is a convenient place to store and share datasets. We can store datasets there and download them or share them with others when needed. To download a dataset, you have to create a share link for your own file, or get a shared link from someone else to download theirs. If you are not familiar with how to create a share link, please read this post first.

For example, I have global temperature data in my Google Drive with the following shared link:

https://drive.google.com/file/d/17LjfBwzFGcv7IHQ0KCLYQavLsR6aEert/view?usp=share_link
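In a shared link of this form, the long string between /file/d/ and /view identifies the file. The package takes the whole link, so you do not need to extract the ID yourself. If you want to check that a link has the expected shape before downloading, a small sketch could look like this (drive_file_id is a hypothetical helper, not part of opendatasets):

```python
import re


def drive_file_id(shared_link):
    """Return the file ID from a Google Drive '/file/d/<id>/...' link, or None."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", shared_link)
    return match.group(1) if match else None


link = "https://drive.google.com/file/d/17LjfBwzFGcv7IHQ0KCLYQavLsR6aEert/view?usp=share_link"
print(drive_file_id(link))
```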

Let’s use the package to download it into your current working directory.

# import the package
import opendatasets as od
# shared link of a dataset
dataset_url = 'https://drive.google.com/file/d/17LjfBwzFGcv7IHQ0KCLYQavLsR6aEert/view?usp=share_link'
# download the dataset
od.download(dataset_url)

You will see a message like the following, and the dataset, named global_tempeture.csv, is saved into your current working directory.

running results

(3) Download datasets from Kaggle

To download datasets from Kaggle, we have to get a Kaggle API token, search for the dataset we need on Kaggle, and then download it with this package using the dataset link.

(i) Get the Kaggle API Token

This package uses Kaggle’s public API to download Kaggle datasets, so you have to register a Kaggle account and get an API token first. The method for getting the Kaggle API token is introduced in this post. The main difference between using this package and the method introduced in that post is that you need to put the downloaded kaggle.json in your current working directory rather than in the folder that method requires.
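Before running the download, it can help to confirm that kaggle.json really is in the current working directory and looks like a valid token. The sketch below assumes the usual token layout with a "username" field and a "key" field; has_kaggle_token is a hypothetical helper name.

```python
import json
from pathlib import Path


def has_kaggle_token(directory="."):
    """Return True if directory holds a kaggle.json with non-empty username and key."""
    token_path = Path(directory) / "kaggle.json"
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(token, dict) and bool(token.get("username")) and bool(token.get("key"))


# Check the current working directory, where the token is expected to be.
print(has_kaggle_token("."))
```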

(ii) Search for the dataset

For example, suppose we want an economic dataset for modelling or just for testing a method. We search for it on Kaggle, and the search results look as follows:

Fig. 2. Example of search results for economic datasets

Let’s download the Worldbank Economics/Demographics Data (2020) dataset. Just click the dataset to get its link.

Fig. 3. An example of the link of a Kaggle dataset

(iii) Download the dataset

Now, let’s download the dataset using the package.

import opendatasets as od

dataset_url = 'https://www.kaggle.com/datasets/sagarnildass/worldbank-economicsdemographics-data'
od.download(dataset_url)

After running the code, you will see the following result, which shows that the zipped dataset was downloaded first and then unzipped into the current working directory.

running results
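To see what was extracted, you can list the files in the download folder. The sketch below assumes the files land in a folder named after the last part of the dataset URL (here worldbank-economicsdemographics-data); check the message printed by the download for the actual location. The helper list_downloaded_files is my own name, not part of the package.

```python
from pathlib import Path


def list_downloaded_files(folder):
    """Return the sorted relative paths of all files under folder (empty if missing)."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


# Assumed folder name, taken from the last part of the dataset URL.
for name in list_downloaded_files("worldbank-economicsdemographics-data"):
    print(name)
```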

(4) Download datasets from an online link

For example, suppose we want to download the dataset New_York_City_Population_By_Community_Districts.csv, whose URL is:

https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv?accessType=DOWNLOAD

import opendatasets as od

dataset_url = 'https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv?accessType=DOWNLOAD'
od.download(dataset_url)

Similarly, you will see the running result as follows, and the dataset is saved in your working directory.

running results

So you can see that this package makes it very easy to download datasets from different online sources.

