Four concrete examples show how to use a Python package to download open datasets from GitHub, Google Drive, Kaggle, and other online sources
In several previous posts, I showed how to download datasets from different sources, such as a GitHub repository, Google Drive, and Kaggle. I suggest reading those posts to see the code if you are interested in programming. In this post, I introduce a Python package that lets you write less code and conveniently download datasets from different online sources: GitHub, Google Drive, Kaggle, and other websites.
1. Install the Package
The package is called opendatasets, and you can find it on PyPI and GitHub. First, open your favorite command-line shell (Windows Command Prompt, PowerShell, or Terminal), and then install it using pip as follows:
pip install opendatasets --upgrade
The installation process looks as follows:
From the last line of the above screenshot, we can see that version 0.1.22 of the package has been installed successfully.
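If the pip output has already scrolled away, the installed version can also be checked from Python. The following is a minimal sketch using the standard library; installed_version is a hypothetical helper name, not part of opendatasets, and the version printed on your machine may differ from the one mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if opendatasets is installed, otherwise None
print(installed_version("opendatasets"))
```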
2. Applications of the Package
Now, let’s see how to use it with 4 concrete examples.
(1) Download datasets from GitHub
Just go to GitHub and search for the datasets you need. Here, I use a dataset from my own GitHub repository as an example: the daily historical USD to CNY exchange rate from September 24, 2012 to September 24, 2023. The URL of the dataset is:
https://github.com/shoukewei/data/blob/main/data-wpt/USD_CNY%20Historical%20Data.csv
Note that we have to use the link to the raw dataset, which was introduced in an earlier post. If you are not familiar with how to find the URL of the raw dataset, it is best to read that post first. The raw URL is:
https://raw.githubusercontent.com/shoukewei/data/main/data-wpt/USD_CNY%20Historical%20Data.csv
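Comparing the two URLs above shows the pattern: the host changes to raw.githubusercontent.com and the blob segment is dropped. The sketch below applies that pattern to a GitHub file link; github_blob_to_raw is a hypothetical helper written for this post, not a function of opendatasets, and it only handles links of the form github.com/user/repo/blob/branch/path.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url):
    """Convert a github.com '/blob/' file link into its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    segments = parts.path.lstrip("/").split("/")
    # Expected layout: user / repo / "blob" / branch / path...
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a GitHub blob URL: " + url)
    user, repo, _blob, *rest = segments
    raw_path = "/" + "/".join([user, repo, *rest])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


page_url = 'https://github.com/shoukewei/data/blob/main/data-wpt/USD_CNY%20Historical%20Data.csv'
print(github_blob_to_raw(page_url))
```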
Then, let’s see how to download this dataset using the package.
First, import the package:
import opendatasets as od
Then, download the data using the following code.
dataset_url = 'https://raw.githubusercontent.com/shoukewei/data/main/data-wpt/USD_CNY%20Historical%20Data.csv'
od.download(dataset_url)
You will then see the following message, which reports the download speed and tells you that USD_CNY%20Historical%20Data.csv was downloaded into your current working directory.
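As a quick check after the download, the sketch below derives the file name from the URL and tests whether a file with that name exists in the working directory. It assumes the file is saved under the last segment of the URL, as the message above suggests; local_name_from_url is a hypothetical helper, not part of opendatasets.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def local_name_from_url(url):
    """Return the last path segment of a URL (assumed here to be the saved file name)."""
    return urlsplit(url).path.rsplit("/", 1)[-1]


dataset_url = 'https://raw.githubusercontent.com/shoukewei/data/main/data-wpt/USD_CNY%20Historical%20Data.csv'
file_name = local_name_from_url(dataset_url)

print(file_name)                 # the name as it appears in the URL
print(unquote(file_name))        # the same name with %20 decoded to spaces
print(Path(file_name).exists())  # True only if the file is in the working directory
```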
(2) Download datasets from Google Drive
Google Drive is not a dataset source, but it is a very convenient place to store and share datasets. We can store datasets in it, and download them or share them with others when needed. To download a dataset, you need a share link: either create one for your own file, or get one from the person who is sharing their dataset with you. If you are not familiar with how to create a share link, please read this post first.
For example, I have global temperature data in my Google Drive with the following shared link:
https://drive.google.com/file/d/17LjfBwzFGcv7IHQ0KCLYQavLsR6aEert/view?usp=share_link
Let’s use the package to download it into your current working directory.
# import the package
import opendatasets as od
# shared link of a dataset
dataset_url = 'https://drive.google.com/file/d/17LjfBwzFGcv7IHQ0KCLYQavLsR6aEert/view?usp=share_link'
# download the dataset
od.download(dataset_url)
You will see a message like the following, and the dataset, named global_tempeture.csv, is saved into your current working directory.
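A share link for a single file, like the one used above, carries the file ID between /file/d/ and /view. Before passing a link someone sent you to the package, it can help to confirm that it really has this form. The sketch below is one way to do that; drive_file_id is a hypothetical helper, not part of opendatasets, and it only recognizes links of the /file/d/<id> form.

```python
import re


def drive_file_id(share_link):
    """Return the file ID from a Google Drive '/file/d/<id>' share link, or None."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_link)
    return match.group(1) if match else None


share_link = 'https://drive.google.com/file/d/17LjfBwzFGcv7IHQ0KCLYQavLsR6aEert/view?usp=share_link'
print(drive_file_id(share_link))
```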
(3) Download datasets from Kaggle
To download datasets from Kaggle, we have to get a Kaggle API token, search for the dataset we need on Kaggle, and then download it with this package using the dataset link.
(i) Get the Kaggle API Token
This package uses Kaggle’s public API to download Kaggle datasets, so you have to register a Kaggle account and get an API token first. The method to get the Kaggle API token is introduced in this post. The main difference between using this package and the method introduced in that post is that you need to put the downloaded kaggle.json in your current working directory rather than in the folder the Kaggle API client normally requires.
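Before starting a Kaggle download, it can save time to confirm that kaggle.json is where the step above says it should be. The sketch below checks that the file exists in the working directory and contains the username and key fields that a Kaggle token file holds. check_kaggle_token is a hypothetical helper written for this post, not part of opendatasets or the Kaggle API.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return a list of problems found with the token file; an empty list means it looks fine."""
    token_file = Path(path)
    if not token_file.is_file():
        return [f"{token_file} not found"]
    try:
        data = json.loads(token_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return [f"{token_file} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"{token_file} does not contain a JSON object"]
    # A Kaggle token file is expected to hold these two fields
    return [f"missing field: {field}" for field in ("username", "key") if not data.get(field)]


# Checks kaggle.json in the current working directory
print(check_kaggle_token())
```

An empty list means the file is present and has both fields; it does not verify that the token itself is still valid on Kaggle.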
(ii) Search the dataset
For example, suppose we want an economic dataset for modelling or just for testing a method. We search for it on Kaggle, and the search results look as follows:
Let’s download the Worldbank Economics/Demographics Data (2020). Just click the dataset to get the download link.
(iii) Download the dataset
Now, let’s download the dataset using the package.
import opendatasets as od
dataset_url = 'https://www.kaggle.com/datasets/sagarnildass/worldbank-economicsdemographics-data'
od.download(dataset_url)
After running the code, you will see the following result, which shows that the zipped dataset was downloaded first and then unzipped into the current working directory.
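To see what was unzipped, you can list the files afterwards. The sketch below assumes the files end up in a folder named after the last part of the dataset URL (here worldbank-economicsdemographics-data); check the download message if your folder is named differently. Both helper functions are hypothetical and not part of opendatasets.

```python
from pathlib import Path
from urllib.parse import urlsplit


def kaggle_dataset_slug(url):
    """Return the last non-empty path segment of a Kaggle dataset URL."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[-1]


def list_downloaded_files(folder):
    """Return the sorted relative paths of all files under a folder, or [] if it does not exist."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(p.relative_to(folder).as_posix() for p in folder.rglob("*") if p.is_file())


dataset_url = 'https://www.kaggle.com/datasets/sagarnildass/worldbank-economicsdemographics-data'
folder_name = kaggle_dataset_slug(dataset_url)  # assumed folder name
print(folder_name)
print(list_downloaded_files(folder_name))  # empty list if nothing was downloaded yet
```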
(4) Download datasets from an online link
For example, we want to download the dataset New_York_City_Population_By_Community_Districts.csv, and its URL is:
https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv?accessType=DOWNLOAD
import opendatasets as od
dataset_url = 'https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv?accessType=DOWNLOAD'
od.download(dataset_url)
Similarly, you will see the following result, and the dataset is saved in your working directory.
As you can see, this package makes it easy to download datasets from different online sources.
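Since every example uses the same od.download call, the four downloads can be run in one loop. The following is a minimal sketch: download_all is a hypothetical helper, not part of opendatasets, and it assumes that a failed download raises an ordinary exception, which may not hold for every failure. The downloader argument exists only so the helper can be tried without a network connection.

```python
def download_all(urls, downloader=None):
    """Try each URL in turn; return a dict mapping URL to None (success) or an error message."""
    if downloader is None:
        # Imported here so the helper can be defined even without the package installed
        import opendatasets as od
        downloader = od.download
    results = {}
    for url in urls:
        try:
            downloader(url)
            results[url] = None
        except Exception as exc:  # keep going if one source fails
            results[url] = str(exc)
    return results


# The four URLs used in this post
dataset_urls = [
    'https://raw.githubusercontent.com/shoukewei/data/main/data-wpt/USD_CNY%20Historical%20Data.csv',
    'https://drive.google.com/file/d/17LjfBwzFGcv7IHQ0KCLYQavLsR6aEert/view?usp=share_link',
    'https://www.kaggle.com/datasets/sagarnildass/worldbank-economicsdemographics-data',
    'https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv?accessType=DOWNLOAD',
]
```

Calling download_all(dataset_urls) then attempts all four downloads and reports which ones failed, instead of stopping at the first error.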