How to Read Dataset from GitHub and Save it with Python Pandas

A demonstration of how easy and convenient it is to read a dataset from GitHub into a Pandas DataFrame and save it on your local computer

GitHub is a good source of data, and I usually store my projects and datasets on GitHub. In this article, I show how easy and convenient it is to read a dataset from GitHub into a Pandas DataFrame and save it as a .CSV file on your computer. This example uses Jupyter Notebook alongside Pandas, but you can use JupyterLab or any other Python IDE.

With this article, I am starting a series on data analysis using one real-world dataset. This first part walks through the following steps:

1. Install Packages

If you use Anaconda, Pandas and Jupyter Notebook/Lab are preinstalled. If you have not installed them, just use your favorite command-line shell to install them as follows:

pip install notebook
pip install pandas

2. Import required package

Since we use Jupyter Notebook, start it, then create a new notebook or open an existing one, and import Pandas:

import pandas as pd
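
To confirm that the import works, you can also print the installed Pandas version:

# confirm the import and check the installed version
print(pd.__version__)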

3. Read data

In this example, let’s read a real-world dataset, Chinese GDP, directly from a GitHub repository of mine. Because the dataset is a CSV file, we use the Pandas pd.read_csv() function.

There are a number of Pandas functions to read and write other data formats, such as:

pd.read_excel('filename.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
pd.read_stata('filename.dta')
pd.read_sas('filename.sas7bdat')
pd.read_hdf('filename.h5','df')

...

Note: The above functions have many optional arguments to fine-tune the data import process. More information can be found in the Pandas IO documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
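
For example, a few common optional arguments of pd.read_csv() look like this (the file name here is just a hypothetical placeholder):

df = pd.read_csv('filename.csv',    # hypothetical local file
                 sep=',',           # column separator
                 header=0,          # row number to use as the column names
                 index_col=None,    # do not use any column as the index
                 na_values=['NA'])  # extra strings to treat as missing values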

(1) An error when reading the dataset

For datasets in a GitHub repository, we cannot use the direct URL of the dataset, https://github.com/shoukewei/data/blob/main/data-pydm/gdp_china_clean.csv in this example, or it will cause an error:

url ='https://github.com/shoukewei/data/blob/main/data-pydm/gdp_china_clean.csv'
df = pd.read_csv(url)

The error message looks like this:

---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_7344\973812801.py in <module>
1 url ='https://github.com/shoukewei/data/blob/main/data-pydm/gdp_china_clean.csv'
----> 2 df = pd.read_csv(url)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper

.
.
.
C:\ProgramData\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 28, saw 367

(2) Correct URL to read the dataset

We should use the URL of the raw dataset rather than the direct dataset URL, because the direct URL points to an HTML page that displays the file, not to the CSV data itself. Click the Raw button on the dataset page to go to the raw data page.

Then copy the URL of that page and use it in the code snippet below.
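
Alternatively, if you prefer to build the raw URL in code, a small helper like the one below works (a sketch of my own; raw_github_url is not a Pandas or GitHub function):

def raw_github_url(blob_url):
    # rewrite https://github.com/<user>/<repo>/blob/<branch>/<path>
    # as      https://raw.githubusercontent.com/<user>/<repo>/<branch>/<path>
    return (blob_url
            .replace('https://github.com/', 'https://raw.githubusercontent.com/')
            .replace('/blob/', '/', 1))

url = raw_github_url('https://github.com/shoukewei/data/blob/main/data-pydm/gdp_china_clean.csv')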

(3) Read the dataset

url ='https://raw.githubusercontent.com/shoukewei/data/main/data-pydm/gdp_china_clean.csv'
df = pd.read_csv(url)

(4) Display the first five rows

df.head()

The first five rows of the output show that there is a table caption at the beginning of the file. You can confirm this by going to the online dataset in my GitHub repository.

(5) Display the last ten rows

df.tail(10)

The last ten rows of the output show that there is a source description of the dataset at the end. You can see this clearly by going to the online data table on GitHub.

4. Skip some lines

However, we only need the dataset itself for analysis, without the caption and source. Thus, while reading the dataset, we can skip these text lines: one caption line and one blank line at the beginning, and one blank line plus six source-description lines at the end.
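
If you want to see exactly which lines to skip, you can peek at the first few raw lines before parsing; here is a minimal sketch using urllib from the standard library:

from urllib.request import urlopen

# print the first five raw lines so the caption and blank line are visible
with urlopen(url) as response:
    for line in response.read().decode('utf-8').splitlines()[:5]:
        print(repr(line))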

(1) Skip rows from top

If you want to skip some rows from the top, you can use skiprows=<number of rows>, say skiprows=2 in our example. We also pass engine='python' here for consistency with the skipfooter examples below; skiprows itself also works with the default engine.

df = pd.read_csv(url, skiprows=2, engine='python')

(2) Skip rows from footer

Similarly, if we want to skip some rows from the bottom (the footer), we can use skipfooter=<number of rows>, say skipfooter=7 in our example. Note that skipfooter is only supported by the Python engine, so engine='python' is required here.

df = pd.read_csv(url, skipfooter=7, engine='python')

(3) Skip rows from both top and footer

In our case, we want to skip both the first two rows and the last seven rows, which we can do with the following code.

df = pd.read_csv(url, skiprows=2, skipfooter=7, engine='python')

(4) Check the imported dataset

You can check the first and last few rows again to see if the text lines have been removed.

# display the first three rows
df.head(3)

# display the last three rows
df.tail(3)
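
As an additional sanity check, you can inspect the overall shape and the column names:

# confirm the number of rows and columns, and the column names
print(df.shape)
print(df.columns.tolist())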

5. Save the dataset

Lastly, let’s save the dataset in the data folder of the working directory. We use index=False to save the data without the index.
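
Note that to_csv() will fail if the data folder does not exist yet. If needed, you can create it first (a small sketch using the standard library):

import os

# make sure the target folder exists before saving
os.makedirs('./data', exist_ok=True)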

df.to_csv('./data/gdp_china_clean.csv', index=False)

6. Online course

If you are interested in learning Python data analysis in detail, you are welcome to enroll in one of my courses:

Master Python Data Analysis and Modelling Essentials
