Counting City Appearances in a Pandas DataFrame by Year: A Step-by-Step Guide

Problem Statement and Background

In this article, we will explore how to count the number of times a city appears in a pandas DataFrame per year. This is a common task in data analysis and visualization, where we want to understand the distribution of cities over time.

We are given a sample DataFrame df with two columns: ‘City’ and ‘Year’. The ‘City’ column contains the names of cities, while the ‘Year’ column contains the corresponding years. We need to create a new DataFrame that displays how many times each city has appeared in the original data per year.

The desired output is a DataFrame df_out_intended with one row per year and one column per city, where each cell holds the number of times that city appears in that year. For example:

Year	PARIS	MADRI	RIO	LISBOA
2015	0	1	0	0
2016	0	0	0	0
2017	1	0	0	0
…	…	…	…	…

Introduction to Pandas

Pandas is a popular Python library for data manipulation and analysis. It provides an efficient and flexible way to work with structured data, including tabular data like our ‘City’ and ‘Year’ columns.

To get started with pandas, we need to import the necessary libraries and create a sample DataFrame:

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'City': ['PARIS', 'MADRI', 'RIO', 'RIO',
                            'PARIS', 'RIO', 'LISBOA', 'RIO'],
                   'Year': [2018, 2015, 2020, 2020,
                            2017, 2021, 2022, 2022]})

print(df)

This will output the sample DataFrame:

City	Year
PARIS	2018
MADRI	2015
RIO	2020
RIO	2020
PARIS	2017
RIO	2021
LISBOA	2022
RIO	2022

GroupBy and Aggregation

In pandas, we can use the groupby function to group our data by one or more columns. The groupby function returns a DataFrameGroupBy object, which allows us to perform aggregation operations on each group.
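As a small illustration of grouping on a single column, the sketch below counts rows per city, ignoring the year. It rebuilds the article's sample data so that it runs on its own; the variable name rows_per_city is only illustrative.

```python
import pandas as pd

# Same sample data as in the article.
df = pd.DataFrame({'City': ['PARIS', 'MADRI', 'RIO', 'RIO',
                            'PARIS', 'RIO', 'LISBOA', 'RIO'],
                   'Year': [2018, 2015, 2020, 2020,
                            2017, 2021, 2022, 2022]})

# Group by one column and count the rows in each group.
# The result is a Series indexed by city, sorted by group key.
rows_per_city = df.groupby('City').size()

print(rows_per_city)
```

Grouping by a list of columns, such as ['City', 'Year'], works the same way, except that each group is then identified by a pair of values.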

To count the number of times each city appears in the original data per year, we can use the size method on the resulting DataFrameGroupBy object:

# Group by City and Year, then count the number of occurrences for each group
df_test = df.groupby(['City', 'Year']).size().reset_index(name='Count')

print(df_test)

This will output the DataFrame:

City	Year	Count
LISBOA	2022	1
MADRI	2015	1
PARIS	2017	1
PARIS	2018	1
RIO	2020	2
RIO	2021	1
RIO	2022	1

Missing Results and Expected Output

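However, this is not the layout we want. The size method does count the rows in each (City, Year) group correctly, but it returns the result in long format, with one row per combination that actually occurs. Combinations with a count of zero are therefore missing, and the cities are not spread across columns.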
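For comparison, the long-format counts can be reshaped by hand into one column per city. The following is a minimal sketch, assuming the same sample data, that uses unstack with fill_value=0 to move the city level into the columns and fill combinations that never occur with zero. The names long_counts and wide_counts are only illustrative.

```python
import pandas as pd

# Same sample data as in the article.
df = pd.DataFrame({'City': ['PARIS', 'MADRI', 'RIO', 'RIO',
                            'PARIS', 'RIO', 'LISBOA', 'RIO'],
                   'Year': [2018, 2015, 2020, 2020,
                            2017, 2021, 2022, 2022]})

# Long format: a Series indexed by (Year, City) with one entry per
# combination that actually occurs in the data.
long_counts = df.groupby(['Year', 'City']).size()

# Wide format: move the City level into the columns and use 0 for
# (Year, City) pairs that never occur.
wide_counts = long_counts.unstack(fill_value=0)

print(wide_counts)
```

This takes two steps (count, then reshape), and only years present in the data get a row.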

To get one row per year and one column per city directly, we can use the crosstab function from pandas. It builds a contingency table (returned as a DataFrame) of counts for each combination of the two columns and fills combinations that never occur with 0.

Crosstab Function

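The crosstab function takes at least two array-like arguments: the first (index) supplies the values that become the rows, and the second (columns) supplies the values that become the columns. By default, each cell holds the number of rows in which that pair of values occurs: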

# Use crosstab to count city appearances per year
df_test = pd.crosstab(df['Year'], df['City'])

print(df_test)

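This will output the counts with one row per year present in the data and one column per city, with the city columns sorted alphabetically:

Year	LISBOA	MADRI	PARIS	RIO
2015	0	1	0	0
2017	0	0	1	0
2018	0	0	1	0
2020	0	0	0	2
2021	0	0	0	1
2022	1	0	0	1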
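Note that pd.crosstab only creates rows for years that occur in the data, so a year such as 2016, which appears with all zeros in the desired output, would be missing. One possible way to add such rows is to reindex the result over a full range of years. The sketch below assumes that every calendar year between the first and last observed year should appear; the names all_years and counts_full are only illustrative.

```python
import pandas as pd

# Same sample data as in the article.
df = pd.DataFrame({'City': ['PARIS', 'MADRI', 'RIO', 'RIO',
                            'PARIS', 'RIO', 'LISBOA', 'RIO'],
                   'Year': [2018, 2015, 2020, 2020,
                            2017, 2021, 2022, 2022]})

# Count city appearances per year (only years present in the data).
counts = pd.crosstab(df['Year'], df['City'])

# Assumption: every year from the earliest to the latest should be listed.
all_years = pd.RangeIndex(int(df['Year'].min()),
                          int(df['Year'].max()) + 1,
                          name='Year')

# Add the missing years as rows filled with zeros.
counts_full = counts.reindex(all_years, fill_value=0)

print(counts_full)
```

If a different set of years is wanted, for example a fixed reporting period, the range passed to reindex can be replaced with that list.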

Conclusion

In this article, we explored how to count city appearances in a pandas DataFrame per year. We used the crosstab function from pandas to achieve the desired output.

By understanding how to work with grouped data and aggregation functions in pandas, you can efficiently analyze and visualize your data to gain insights into patterns and trends.

Last modified on 2024-04-19