Pandas

Pandas is one of the most widely used Python libraries for data analysis and manipulation. Built on top of NumPy, it provides fast and flexible data structures — primarily DataFrame and Series — that make working with structured data intuitive and efficient. This post serves as a quick reference for common operations.

Loading Data

Pandas supports reading data from a variety of file formats, including CSV, Excel, and JSON:

import pandas as pd

# Load from CSV
df = pd.read_csv('pokemon_data.csv')

# Load from Excel
excel_data = pd.read_excel('pokemon_data.xlsx')

# Load from JSON
json_data = pd.read_json('pokemon_data.json')

# Preview the data
print(df)

You can also preview just the first or last few rows, which is useful for quickly inspecting the structure:

df.head()     # First 5 rows (the default)
df.head(30)   # First 30 rows
df.tail(30)   # Last 30 rows

To get a quick overview of the DataFrame’s structure, including column types and non-null counts:

df.info()
df.shape      # Returns (rows, columns)
df.dtypes     # Data type of each column

Handling Large Datasets

For large files that don’t fit in memory, you can read the data in chunks and process each one individually:

for chunk in pd.read_csv('pokemon_data.csv', chunksize=5):
	# Process each chunk
	pass
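As a concrete sketch of what "process each chunk" might look like, here is one way to accumulate per-category counts across chunks. The tiny in-memory CSV below is a made-up stand-in for a large file on disk; its columns mirror the Pokémon data used throughout this post:

```python
import io
import pandas as pd

# A small in-memory CSV stands in for a large file on disk
raw = io.StringIO(
    "Name,Type 1,HP\n"
    "Bulbasaur,Grass,45\n"
    "Charmander,Fire,39\n"
    "Squirtle,Water,44\n"
    "Ivysaur,Grass,60\n"
)

# Accumulate per-type counts one chunk at a time
counts = pd.Series(dtype=int)
for chunk in pd.read_csv(raw, chunksize=2):
    counts = counts.add(chunk['Type 1'].value_counts(), fill_value=0)

counts = counts.astype(int)  # Grass: 2, Fire: 1, Water: 1
```

Because each chunk is an ordinary DataFrame, any of the operations below can be applied inside the loop and combined afterwards.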

Reading Data

Once your data is loaded into a DataFrame, you’ll need ways to access specific portions of it. Pandas offers two main indexing mechanisms: iloc for integer-based positional access (like array indices), and loc for label-based access using the actual index values and column names. Understanding the difference between these two is essential for avoiding off-by-one errors and unexpected results.

# List all column names
df.columns

# Select specific columns by name
df[['Name', 'Type 1']]

# Select specific rows by positional index
df.iloc[2:30]

# Access a single value by row and column position
df.iloc[4, 2]

# Select rows by label-based index
df.loc[0:5, 'Name']
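The practical difference between the two: an iloc slice excludes the end position (like a Python list), while a loc slice includes the end label. A tiny made-up DataFrame makes this visible:

```python
import pandas as pd

demo = pd.DataFrame({'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Charmander']})

# iloc slices like a Python list: the end position is excluded
first_two = demo.iloc[0:2]   # 2 rows (positions 0 and 1)

# loc slices by label: the end label is included
first_three = demo.loc[0:2]  # 3 rows (labels 0, 1 and 2)
```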

Iterating Over Rows

Sometimes you need to process rows one at a time. The iterrows() method returns each row as an index-value pair, making it easy to loop through a DataFrame. However, it’s important to know that iterrows() is slow for large datasets because it converts each row into a Series object. Whenever possible, prefer vectorized operations or apply(), which run significantly faster by leveraging NumPy’s optimized C-based internals under the hood.

for idx, row in df.iterrows():
	print(idx, row['Name'])

Better alternatives for performance-sensitive code:

# Using apply for row-wise operations
df['name_length'] = df['Name'].apply(len)

# Vectorized operations are the fastest approach
df['Total'] = df['HP'] + df['Attack'] + df['Defense']
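To confirm the two approaches agree, here is a small sketch with made-up stat values, where the looped and vectorized totals come out identical:

```python
import pandas as pd

# Made-up stat columns mirroring the examples above
mini = pd.DataFrame({'HP': [45, 39, 44],
                     'Attack': [49, 52, 48],
                     'Defense': [49, 43, 65]})

# Row-by-row loop (slow)
loop_total = [row['HP'] + row['Attack'] + row['Defense']
              for _, row in mini.iterrows()]

# Vectorized equivalent (fast), producing the same numbers
mini['Total'] = mini['HP'] + mini['Attack'] + mini['Defense']
```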

Querying

Before diving into filtering and transformation, it’s often useful to get a high-level understanding of your data. Pandas provides several methods for quick exploratory analysis. describe() gives you summary statistics (mean, std, min, max, quartiles) for all numeric columns. unique() and value_counts() help you understand the distribution of categorical data.

# Get summary statistics for all numeric columns
df.describe()

# Get unique values in a column
df['Type 1'].unique()

# Count unique values
df['Type 1'].nunique()

# Value counts — shows how many times each value appears
df['Type 1'].value_counts()

Filtering

Filtering is one of the most frequently used operations in data analysis. It lets you select a subset of rows based on conditions. Pandas uses boolean indexing: you write a condition that produces a True/False mask, and only the rows where the condition is True are returned. You can combine multiple conditions using & (AND), | (OR), and ~ (NOT). Note that each condition must be wrapped in parentheses when combining them.

# Filter rows by column value
df.loc[df['Type 1'] == 'Fire']

# Exclude rows matching a pattern (NOT)
df.loc[~df['Name'].str.contains('Mega')]

# Case-insensitive regex matching (note the import needed for the flag)
import re
df.loc[df['Type 1'].str.contains('fire|grass', flags=re.I, regex=True)]

# Combine conditions (AND)
new_df = df.loc[(df['Type 1'] == 'Fire') & (df['Type 2'] != 'Fire')]
new_df = new_df.reset_index(drop=True)  # drop=True discards the old index

# Combine conditions (OR)
df.loc[(df['Type 1'] == 'Fire') | (df['Type 1'] == 'Water')]

# Filter using isin — cleaner than chaining multiple OR conditions
df.loc[df['Type 1'].isin(['Fire', 'Water', 'Grass'])]

Sorting

Sorting allows you to reorder rows based on one or more columns. By default, sort_values() sorts in ascending order. You can pass a list of columns and a corresponding list of ascending/descending flags to sort by multiple criteria — for example, sorting by name alphabetically and then by HP in descending order within each name group.

# Sort by a single column
df.sort_values('Name')

# Sort by multiple columns with different directions
df.sort_values(['Name', 'HP'], ascending=[True, False])

# Sort by the DataFrame's index
df.sort_index()
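As a quick sanity check on the multi-column form, here is a tiny made-up example where a duplicated Name shows the tie-break in action:

```python
import pandas as pd

# Made-up rows with a duplicated Name to show the tie-break
dupes = pd.DataFrame({'Name': ['Mew', 'Abra', 'Mew'],
                      'HP': [100, 25, 50]})

# Alphabetical by Name, then HP descending within each name
ordered = dupes.sort_values(['Name', 'HP'], ascending=[True, False])
```

Abra comes first alphabetically, and the two Mew rows are ordered by HP from highest to lowest.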

Aggregate Statistics

Grouping and aggregating is how you summarize data across categories. The groupby() method splits the DataFrame into groups based on one or more columns, then applies an aggregate function (like mean, sum, or count) to each group. This is conceptually similar to SQL’s GROUP BY clause. You can apply multiple aggregations at once using agg(), which accepts a dictionary mapping column names to the functions you want to compute.

# Calculate the mean of the numeric columns, grouped by type
# (numeric_only=True avoids errors from non-numeric columns in recent pandas)
df.groupby(['Type 1']).mean(numeric_only=True)

# Apply multiple aggregations to different columns
df.groupby(['Type 1']).agg({
    'HP': ['mean', 'max', 'min'],
    'Attack': 'mean'
})

# Count entries per group
df.groupby(['Type 1']).count()
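A common follow-up is ranking the groups by their aggregate. A sketch with made-up data, sorting mean HP from highest to lowest:

```python
import pandas as pd

poke = pd.DataFrame({'Type 1': ['Fire', 'Fire', 'Water', 'Water', 'Grass'],
                     'HP': [39, 58, 44, 59, 45]})

# Mean HP per type, ranked from highest to lowest
mean_hp = poke.groupby('Type 1')['HP'].mean().sort_values(ascending=False)
```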

Modifying Columns

DataFrames are mutable, so you can add, remove, rename, and retype columns as needed. Adding a new column is as simple as assigning a value to a new column name. When computing a column from multiple other columns, you can either list them explicitly or use iloc with sum(axis=1) to sum across a range of columns. Dropping columns removes them from the DataFrame, and rename() lets you change column names without modifying the underlying data.

# Add a new column by combining existing ones
df['Total'] = df['HP'] + df['Defense'] + df['Attack']
df['Total'] = df.iloc[:, 4:10].sum(axis=1)

# Delete a column
df = df.drop(columns=['Total'])

# Rename columns
df = df.rename(columns={'Type 1': 'Primary_Type', 'Type 2': 'Secondary_Type'})

# Change a column's data type
df['HP'] = df['HP'].astype(float)

Modifying Data

Beyond structural changes to columns, you’ll often need to clean and transform the data itself. This includes updating values based on conditions, replacing specific values, handling missing data, and removing duplicates. Missing values (represented as NaN in Pandas) are a common issue in real-world datasets. fillna() lets you replace them with a default value, while dropna() removes any row that contains at least one missing value.

# Update values conditionally
df.loc[df['Type 1'] == 'Fire', 'Type 1'] = 'Flamer'

# Replace specific values across a column
df['Type 1'] = df['Type 1'].replace('Fire', 'Flamer')

# Fill missing values with a default
df['Type 2'] = df['Type 2'].fillna('None')

# Drop rows that contain any missing values
df = df.dropna()

# Remove duplicate rows
df = df.drop_duplicates()

Merging and Joining

When your data is spread across multiple DataFrames, you’ll need to combine them. Pandas provides merge() for SQL-style joins based on shared columns, and concat() for stacking DataFrames together. The how parameter in merge() controls the join type: inner keeps only matching rows, outer keeps all rows from both sides, and left/right keep all rows from one side while matching from the other.

# Merge two DataFrames on a common column
merged = pd.merge(df1, df2, on='Name', how='inner')  # inner, outer, left, right

# Concatenate DataFrames vertically (stack rows)
combined = pd.concat([df1, df2], ignore_index=True)

# Concatenate horizontally (add columns side by side)
combined = pd.concat([df1, df2], axis=1)
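To make the join types concrete, here is a sketch with two tiny made-up DataFrames that share only one Name:

```python
import pandas as pd

stats = pd.DataFrame({'Name': ['Bulbasaur', 'Charmander'], 'HP': [45, 39]})
types = pd.DataFrame({'Name': ['Bulbasaur', 'Squirtle'], 'Type 1': ['Grass', 'Water']})

# inner: only the shared Name survives
inner = pd.merge(stats, types, on='Name', how='inner')

# outer: all three Names survive; unmatched cells become NaN
outer = pd.merge(stats, types, on='Name', how='outer')
```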

Pivot Tables

Pivot tables let you reshape your data by turning unique values from one column into new columns, while aggregating the values from another column. This is useful for creating summary views — for example, seeing the average HP for every combination of primary and secondary type. If you’re familiar with Excel pivot tables, the concept is the same.

# Create a pivot table showing average HP by type combination
pivot = df.pivot_table(
    values='HP',
    index='Type 1',
    columns='Type 2',
    aggfunc='mean'
)
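A runnable sketch with made-up data; the optional fill_value argument replaces type combinations that never occur with 0 instead of NaN:

```python
import pandas as pd

tbl = pd.DataFrame({'Type 1': ['Fire', 'Fire', 'Grass'],
                    'Type 2': ['Flying', 'Dragon', 'Poison'],
                    'HP': [78, 78, 45]})

# fill_value replaces missing combinations with 0 instead of NaN
pivot = tbl.pivot_table(values='HP', index='Type 1',
                        columns='Type 2', aggfunc='mean', fill_value=0)
```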

Saving Data

Once you’ve finished processing your data, you can export it back to a file. Pandas supports writing to all the same formats it can read. Passing index=False prevents Pandas from writing the DataFrame’s index as an extra column in the output file, which is usually what you want.

# Save to CSV (without the index column)
df.to_csv('output.csv', index=False)

# Save to Excel
df.to_excel('output.xlsx', index=False)

# Save to JSON
df.to_json('output.json')
