Counting City Appearances in a Pandas DataFrame by Year: A Step-by-Step Guide

Problem Statement and Background

In this article, we will explore how to count the number of times a city appears in a pandas DataFrame per year. This is a common task in data analysis and visualization, where we want to understand the distribution of cities over time.

We are given a sample DataFrame df with two columns: ‘City’ and ‘Year’. The ‘City’ column contains the names of cities, while the ‘Year’ column contains the corresponding years. We need to create a new DataFrame that displays how many times each city has appeared in the original data per year.

The desired output is a DataFrame df_out_intended with one row per year and one column per city, where each cell holds the number of times that city appears in that year. For example (first rows only):

Year  PARIS  MADRI  RIO  LISBOA
2015      0      1    0       0
2016      0      0    0       0
2017      1      0    0       0

Introduction to Pandas

Pandas is a popular Python library for data manipulation and analysis. It provides an efficient and flexible way to work with structured data, including tabular data like our ‘City’ and ‘Year’ columns.

To get started with pandas, we import the library and create a sample DataFrame:

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'City': ['PARIS', 'MADRI', 'RIO', 'RIO',
                            'PARIS', 'RIO', 'LISBOA', 'RIO'],
                   'Year': [2018, 2015, 2020, 2020,
                            2017, 2021, 2022, 2022]})

print(df)

This will output the sample DataFrame:

     City  Year
0   PARIS  2018
1   MADRI  2015
2     RIO  2020
3     RIO  2020
4   PARIS  2017
5     RIO  2021
6  LISBOA  2022
7     RIO  2022

GroupBy and Aggregation

In pandas, we can use the groupby method to group our data by one or more columns. Called on a DataFrame, groupby returns a DataFrameGroupBy object, which allows us to perform aggregation operations on each group.

To count the number of times each city appears in the original data per year, we can use the size method on the resulting DataFrameGroupBy object:

# Group by City and Year, then count the number of occurrences for each group
df_test = df.groupby(['City', 'Year']).size().reset_index(name='Count')

print(df_test)

This will output the DataFrame:

     City  Year  Count
0  LISBOA  2022      1
1   MADRI  2015      1
2   PARIS  2017      1
3   PARIS  2018      1
4     RIO  2020      2
5     RIO  2021      1
6     RIO  2022      1

Missing Results and Expected Output

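However, the output is not in the shape we wanted. The size method does count the rows for each (‘City’, ‘Year’) pair correctly, but it returns them in long format: one row per combination that actually occurs, and no entry at all for combinations that never occur, such as PARIS in 2015. The desired output is a wide table with one column per city and zeros where a city does not appear.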
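The long result can be pivoted into a wide layout without leaving the groupby approach. The following is a minimal sketch, not the only way to do it: it groups by ‘Year’ first so that years become the rows, then calls unstack with fill_value=0 to turn the ‘City’ level into columns and fill absent combinations with zero. The variable names counts and df_wide are illustrative and do not appear elsewhere in this article.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'City': ['PARIS', 'MADRI', 'RIO', 'RIO',
                            'PARIS', 'RIO', 'LISBOA', 'RIO'],
                   'Year': [2018, 2015, 2020, 2020,
                            2017, 2021, 2022, 2022]})

# Count rows per (Year, City) pair; the result is a Series with a two-level index
counts = df.groupby(['Year', 'City']).size()

# Move the 'City' level into the columns; pairs that never occur become 0
df_wide = counts.unstack(fill_value=0)

print(df_wide)
```

Without fill_value=0, the missing combinations would appear as NaN and the counts would be converted to floats.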

A more direct route to the desired output is the crosstab function from pandas, which builds a contingency table: a DataFrame of counts with one row for each distinct value of its first argument and one column for each distinct value of its second.

Crosstab Function

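The crosstab function takes two array-like arguments (here, two columns of df): the first, index, supplies the values that become the rows, and the second, columns, supplies the values that become the columns. By default, each cell holds the number of rows in which that pair of values occurs together: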

# Use crosstab to count city appearances per year
df_test = pd.crosstab(df['Year'], df['City'])

print(df_test)

This will output the desired DataFrame:

City  LISBOA  MADRI  PARIS  RIO
Year                           
2015       0      1      0    0
2017       0      0      1    0
2018       0      0      1    0
2020       0      0      0    2
2021       0      0      0    1
2022       1      0      0    1
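Note that crosstab only produces rows for years that occur in the data, so a year such as 2016 or 2019 is absent rather than shown as a row of zeros. If every year in the range should appear, one possible approach is to reindex the table over the full range of years. This is a sketch under the assumption that ‘Year’ holds integers and that the range of interest runs from the smallest to the largest year in the data; the names table, all_years and table_full are illustrative.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'City': ['PARIS', 'MADRI', 'RIO', 'RIO',
                            'PARIS', 'RIO', 'LISBOA', 'RIO'],
                   'Year': [2018, 2015, 2020, 2020,
                            2017, 2021, 2022, 2022]})

# Wide table of counts: one row per observed year, one column per city
table = pd.crosstab(df['Year'], df['City'])

# Assumption: we want every year between the first and last observed year
all_years = range(int(df['Year'].min()), int(df['Year'].max()) + 1)

# Add the missing years as rows of zeros and keep the index labelled 'Year'
table_full = table.reindex(all_years, fill_value=0).rename_axis('Year')

print(table_full)
```

The rows for the years already present are unchanged; only the missing years (2016 and 2019 in this sample) are added.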

Conclusion

In this article, we explored how to count city appearances in a pandas DataFrame per year. We used the crosstab function from pandas to achieve the desired output.

By understanding how to work with grouped data and aggregation functions in pandas, you can efficiently analyze and visualize your data to gain insights into patterns and trends.


Last modified on 2024-04-19