Real-world data and concrete examples to display the methods to calculate aggregate statistics
This article will display how to easily calculate aggregate statistics of DataFrame with a real-world dataset using Python Pandas.
First, I would like to explain briefly, “What is summary statistics?” and “What is aggregation statistics?”
1. Summary statistics vs. Aggregation statistics
- Summary statistics: provide a quick and simple description of the data, including measures of data location, or central tendency of the data, such as mean, median, mode, minimum value, maximum value, range, standard deviation, etc.
- Data aggregation: the process where raw data is sliced, grouped, ordered into subsets and expressed in a summary form in order to further analyze.
- Aggregation statistics: the process of compute summary statistics of the aggregated data.
Since a DateFrame usually contains more than one column, and we usually slice, order, or group the dataset into subsets for statistical analysis, the calculations of summary statistics actually refer to aggregation statistics.
2. Built-in Pandas aggregations
The built-in aggregations in Python Pandas are summarized in the following table.
3. read the data
We will continue using the same dataset in the last few articles, please read how to download the dataset in one last article.
# Load the required packages
import pandas as pd
# Read the data
df = pd.read_csv('./data/gdp_china_renamed.csv')
# diplay the first 5 rows
2. Aggregating statistics
(1) One column
For example, let’s calculate the mean of
(2) Multiple columns
pop columns to take their median for instance.
3. Describe() method
(1) Whole DataFrame
We have already discussed this method in the previous article, Different Methods to Access General Information of Dataset with Python Pandas.
(2) Multiple columns
Actually, we can also use it for one or more columns, for example, for
pop two columns.
4. agg() method
agg() function provide specific combinations of aggregating statistics for given columns.
agg is an alias for
aggregate, and they have no difference, but
agg is simple.
(1) Aggregate statistic functions over all the rows
For example, calculate max and min of all rows as follows/
(2) Aggregate over the columns.
When it is applied for columns, it needs to drop the string column(s), or it causes a future warning.
Thus, we have to remove the string columns first.
df_new = df.drop(['prov','gdpr'],axis=1)
(3) Different aggregations per column
We can also calculate different aggregations for the selected columns.
(4) Different aggregations over the columns with renamed results
Let’s see how to calculate different aggregations over the columns and also change the names of results.
df.agg(x=('gdp', max), y=('pop', 'min'), z=('trade', 'mean'))
5. Aggregating statistics grouped by category
(1) Average GDP for each of the 5 provinces
(2) Average of each column for each of the 5 provinces
6. Count total numbers of column items by category
7. Online Course
If you are interested in learning Python data analysis in details, you are welcome to enroll one of my course: