Python for Fraud Analysts

Pandas is the Python library most fraud analysts use to load, clean, and analyze tabular data that is too large or too messy for Excel and too exploratory for a single SQL query. It works with millions of rows in memory, transforms inconsistent formats, groups and aggregates flexibly, and plots results inline. This article walks through the transition from SQL to pandas: the syntax map, the common investigation workflows, and when to reach for each tool.

The Pivot Table That Broke Excel

Kwame had been a fraud analyst for three years, and SQL was his native language. He could write five-table joins in his sleep. He could spot a suspicious velocity pattern in a query result set before his coffee got cold.

Then his manager dropped a CSV file on his desk. Fourteen million rows. Transaction data from a partner bank. "We need to find coordinated refund abuse across both platforms. Cross-reference our internal data with theirs and build a report."

Kwame opened the file in Excel. It crashed. He tried again with a filtered import. Excel loaded the first million rows and silently dropped the rest.

He spent the next hour writing SQL to import the CSV into a temp table, but the partner's column formats didn't match his schema. Dates were strings. Amounts had currency symbols embedded. Half the ZIP codes had leading zeros stripped. He was three hours in and hadn't started the actual analysis.

His teammate Priya walked over, opened her laptop, and typed four lines of code:

import pandas as pd
df = pd.read_csv("partner_transactions.csv", dtype=str)
df["amount"] = pd.to_numeric(
    df["amount"].str.replace(r"[\$,]", "", regex=True),
    errors="coerce",
)
df.head()

pd.to_numeric(..., errors="coerce") turns anything it can't parse (like "N/A" or "-") into NaN instead of crashing, which is what you want on the messy data you actually get from partner systems.

The file loaded in under a minute. Dirty data handled inline. "Want me to show you the rest?" she asked.

By the end of the week, Kwame had found the refund abuse ring. But the real discovery was that Python let him do in an afternoon what would have taken him days in SQL alone.

This story is fictional, but the patterns are real.

Why This Matters

You've built strong SQL skills through the exercises in this platform and the SQL crash course in fraud-basics. SQL is excellent for querying databases: filtering transactions, joining tables, aggregating data. It's the backbone of fraud investigation.

But SQL has limits. It doesn't handle messy data well. It can't produce visualizations. It struggles with ad-hoc analysis where you're exploring data without knowing exactly what you're looking for. And it can't run machine learning models.

Python with the pandas library↗^[1] fills those gaps. It's not a replacement for SQL. It's the tool you reach for when SQL isn't enough. This article covers the transition: how to map your SQL thinking to pandas, and when to use which tool.

Your First Pandas Workflow

DataFrames: Tables You Can Manipulate

If you think in SQL, you already think in DataFrames. A DataFrame is a table. Rows and columns. That's it.

SQL Concept	Pandas Equivalent
Table	DataFrame
Row	Row (accessed by index)
Column	Series
`SELECT *`	`df`
`SELECT col1, col2`	`df[["col1", "col2"]]`
`WHERE condition`	`df[df["col"] > value]`
`ORDER BY col`	`df.sort_values("col")`
`LIMIT 10`	`df.head(10)`
`COUNT(*)`	`len(df)`
`GROUP BY`	`df.groupby("col")`

Loading Data

SQL queries a database. Pandas reads files directly via pd.read_csv↗^[2] and its siblings.

import pandas as pd

# Read a CSV (the most common format you'll encounter)
df = pd.read_csv("transactions.csv")

# Read an Excel file
df = pd.read_excel("report.xlsx")

# Read JSON (common from API responses)
df = pd.read_json("api_response.json")

# Read directly from a SQL database
df = pd.read_sql("SELECT * FROM transactions WHERE date > '2025-01-01'", connection)

That last one is the bridge between worlds. You can pull data from your SQL database directly into pandas for further analysis.

Filtering: The WHERE Clause Equivalent

In SQL, you write WHERE amount > 1000. In pandas:

# SQL: SELECT * FROM transactions WHERE amount > 1000
high_value = df[df["amount"] > 1000]

# SQL: SELECT * FROM transactions WHERE status = 'refund' AND amount > 500
suspicious = df[(df["status"] == "refund") & (df["amount"] > 500)]

# SQL: SELECT * FROM transactions WHERE merchant IN ('Store A', 'Store B')
target_merchants = df[df["merchant"].isin(["Store A", "Store B"])]

The syntax is different, but the logic is identical. You're filtering rows based on conditions.

Grouping and Aggregation

This is where pandas starts to shine. SQL's GROUP BY is powerful but rigid. Pandas groupby↗^[3] gives you more flexibility.

# SQL: SELECT customer_id, COUNT(*), SUM(amount)
#      FROM transactions GROUP BY customer_id
summary = df.groupby("customer_id").agg(
    transaction_count=("amount", "count"),
    total_spent=("amount", "sum")
)

# SQL: SELECT customer_id, COUNT(*) FROM transactions
#      WHERE status = 'chargeback' GROUP BY customer_id HAVING COUNT(*) > 3
repeat_chargebacks = (
    df[df["status"] == "chargeback"]
    .groupby("customer_id")
    .size()
    .reset_index(name="chargeback_count")
    .query("chargeback_count > 3")
)

Adding Calculated Columns

In SQL, you'd use a CASE statement or a computed column. In pandas, you operate directly on columns:

# Flag transactions over $10,000 as a high-value review trigger.
# Note: BSA's CTR rule (31 CFR 1010.311) applies only to currency (cash)
# transactions; if your dataframe mixes wires, ACH, and cards, you'd
# narrow this to cash before reaching for "CTR." SAR filing is a
# separate, suspicion-based regime with its own $5,000 threshold.
df["high_value"] = df["amount"] > 10000

# Convert the timestamp column from string to datetime first
# (CSVs load everything as strings; .dt accessors and .diff() need real datetimes)
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Calculate time between transactions per customer
df = df.sort_values(["customer_id", "timestamp"])
df["time_since_last"] = df.groupby("customer_id")["timestamp"].diff()

# Once converted, you can pull out parts of the timestamp
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.day_name()

# Create a risk score based on multiple factors
df["risk_score"] = (
    (df["amount"] > 5000).astype(int) * 2 +
    (df["is_new_device"]).astype(int) * 3 +
    (df["country"] != df["home_country"]).astype(int) * 2
)

That last example would be awkward in SQL. In pandas, you're combining boolean conditions into a weighted score in three lines. The high_value flag in the example above is the natural pandas hook for downstream review queues. For cash transactions specifically, the same $10,000 cutoff is the CTR reporting threshold, and unusual patterns above it (or structured below it) feed SAR filing workflows.

Working in Jupyter Notebooks

Why do notebooks matter for fraud work?

A Jupyter↗^[4] notebook is an interactive environment where you write code in cells and see results immediately. It's like a scratchpad that shows your work.

For fraud analysis, this matters because investigations are iterative. You don't know the answer before you start. You explore, filter, plot, refine. Notebooks let you build an investigation step by step, keeping every intermediate result visible.

The Investigation Workflow

A typical fraud investigation in a notebook follows a pattern:

Cell 1: Load and inspect the data

import pandas as pd

df = pd.read_csv("flagged_transactions.csv")
print(f"Loaded {len(df):,} transactions")
df.head()

Cell 2: Filter to the suspicious subset

suspicious = df[
    df["velocity_flag"] &
    (df["amount"] > 1000)
]
print(f"{len(suspicious):,} flagged transactions")

Cell 3: Look for patterns

suspicious.groupby("device_fingerprint").agg(
    accounts=("customer_id", "nunique"),
    total_amount=("amount", "sum"),
    transaction_count=("amount", "count")
).sort_values("accounts", ascending=False).head(20)

Cell 4: Visualize what you found

suspicious.groupby("hour")["amount"].count().plot(
    kind="bar", title="Suspicious Transactions by Hour"
)

Each cell builds on the last. You can go back and modify earlier cells, re-run them, and see how the results change. The notebook is both your analysis tool and your documentation, and it pairs naturally with the kind of pattern-based investigation that turns up business-logic abuse: load the logs, filter to the suspicious subset, group on shared infrastructure, plot the time distribution.

Notebooks as evidence: A well-structured notebook documents your entire investigation process: what data you looked at, what filters you applied, what patterns you found. This can be exported as a PDF and included in case files.

Visualization: Seeing the Pattern

Fraud patterns are often invisible in tables but obvious in charts. A column of numbers showing login attempts per hour looks unremarkable. The same data as a bar chart reveals a spike at 3 AM that screams "bot."

Essential Charts for Fraud Analysis

Time-series plots show activity over time. Useful for spotting spikes, unusual hours, or coordinated timing.

import matplotlib.pyplot as plt  # the standard Python plotting library

# Transactions per hour
df.groupby(df["timestamp"].dt.hour)["amount"].count().plot(
    kind="line", title="Transactions by Hour of Day"
)
plt.xlabel("Hour")
plt.ylabel("Count")
plt.show()

matplotlib↗^[5] handles the plotting; pandas wires its DataFrames into it.

Histograms show the distribution of values. Useful for spotting amount clustering (e.g., many transactions at exactly $9,999 to stay under reporting thresholds).

df["amount"].hist(bins=50)
plt.title("Transaction Amount Distribution")
plt.xlabel("Amount ($)")
plt.show()

Scatter plots reveal relationships between two variables. Useful for spotting outliers.

df.plot.scatter(x="account_age_days", y="transaction_count", alpha=0.3)
plt.title("Account Age vs. Transaction Volume")
plt.show()

Heatmaps show patterns across two dimensions. Useful for spotting coordinated behavior.

import seaborn as sns  # statistical visualization built on matplotlib

pivot = df.pivot_table(
    values="amount", index="day_of_week",
    columns="hour", aggfunc="count"
)
sns.heatmap(pivot, cmap="YlOrRd")
plt.title("Transaction Frequency: Day vs. Hour")
plt.show()

seaborn↗^[6] sits on top of matplotlib and adds higher-level statistical charts (heatmaps, KDE plots, joint distributions) that are useful for spotting coordinated behavior across two variables.

When should you use SQL versus pandas?

Neither tool is universally better. They serve different purposes.

Use SQL When	Use Pandas When
Querying a production database	Working with files (CSV, Excel, JSON)
Pulling large datasets with complex joins	Exploring and transforming messy data
Scheduled reports that run automatically	Ad-hoc investigation with iterative refinement
Other people need to run the same query	You need visualizations or statistical analysis
Performance on very large datasets matters	Combining data from multiple sources

The most effective fraud analysts use both. Pull the data you need with SQL, then switch to pandas for exploration, cleaning, and visualization. The same hybrid pattern shows up in operational fraud teams: investigation techniques in money-movement often start as a SQL query and end as a pandas notebook.

Common Pitfalls

Loading everything into memory. Pandas loads entire files into RAM. A 10 GB file will crash your machine if you only have 16 GB of RAM. Filter in SQL first, then analyze the subset in pandas.

Chained indexing. Writing df[df["col"] > 5]["other_col"] = "flag" looks right but doesn't always work. Use .loc instead: df.loc[df["col"] > 5, "other_col"] = "flag".

Forgetting about data types. CSV files load everything as strings by default. Amounts stored as "$1,234.56" won't sort numerically until you convert them with Series.str.replace↗^[7] (note: regex=True is required because the default is regex=False, the opposite of what you'd guess) followed by .astype(float). Timestamps stay as plain text until you run pd.to_datetime(df["timestamp"]). Without that conversion, .dt.hour, .diff(), and date-based comparisons all fail. Always check df.dtypes after loading.

Not handling missing data. Real-world data has gaps. df["amount"].mean() will silently ignore NaN values. That might be what you want, or it might skew your results. Always check df.isnull().sum() early in your analysis.

Key Takeaways

Pandas DataFrames are tables. If you can write SQL, you can learn pandas. The concepts map directly: SELECT is column selection, WHERE is boolean filtering, GROUP BY is .groupby().
Notebooks support iterative investigation. Unlike SQL scripts, Jupyter notebooks let you explore data step by step, see intermediate results, and document your reasoning as you go.
Visualization exposes patterns that tables hide. Time-series plots, histograms, and heatmaps make fraud patterns visible that would be easy to miss in raw numbers.
Use both tools together. SQL is best for querying databases and pulling large datasets. Pandas is best for exploration, transformation, and visualization. The strongest analysts combine them.
Watch out for memory and data types. Pandas loads data into RAM, and CSV files default to strings. Check your data types and filter before loading when working with large datasets.

What's next: ML for Fraud Detection covers how machine learning models use the patterns you find in data to automatically score transactions for fraud risk. For broader framing on where pandas fits in a fraud team's tooling, see Fraud 101.

References

1. pandas — User guide↗ - Official documentation for the DataFrame library.

2. pandas — pd.read_csv API reference↗ - The canonical CSV loader; covers dtypes, chunking, and encoding options.

3. pandas — Group by user guide↗ - Split-apply-combine patterns; the pandas analog of SQL GROUP BY.

4. Project Jupyter↗ - The official site for Jupyter notebooks and JupyterLab.

5. matplotlib — User guide↗ - The standard Python plotting library that pandas plotting wires into.

6. seaborn↗ - Higher-level statistical visualizations built on top of matplotlib.

7. pandas — Series.str.replace API reference↗ - Note regex=False is the default in current pandas; pass regex=True for regex pattern matching.

Key Terms

Term	Definition
pandas	A Python library for data manipulation and analysis, built around the DataFrame data structure
DataFrame	A two-dimensional data structure (rows and columns) similar to a SQL table or spreadsheet
Series	A single column from a DataFrame
Jupyter notebook	An interactive coding environment where you write and execute code in cells, seeing results inline
matplotlib	A Python library for creating charts and visualizations
seaborn	A Python library built on matplotlib that provides higher-level statistical visualizations
NaN	"Not a Number," used in pandas to represent missing or null values
`.groupby()`	The pandas method for grouping rows by column values, equivalent to SQL's GROUP BY
`.agg()`	The pandas method for applying aggregation functions (sum, count, mean) to grouped data

All Categories

The Pivot Table That Broke Excel

Why This Matters

Your First Pandas Workflow

DataFrames: Tables You Can Manipulate

Loading Data

Filtering: The WHERE Clause Equivalent

Grouping and Aggregation

Adding Calculated Columns

Working in Jupyter Notebooks

Why do notebooks matter for fraud work?

The Investigation Workflow

Visualization: Seeing the Pattern

Essential Charts for Fraud Analysis

When should you use SQL versus pandas?

Common Pitfalls

Key Takeaways

References

Key Terms

Test Your Knowledge

All Categories

The Pivot Table That Broke Excel

Why This Matters

Your First Pandas Workflow

DataFrames: Tables You Can Manipulate

Loading Data

Filtering: The WHERE Clause Equivalent

Grouping and Aggregation

Adding Calculated Columns

Working in Jupyter Notebooks

Why do notebooks matter for fraud work?

The Investigation Workflow

Visualization: Seeing the Pattern

Essential Charts for Fraud Analysis

When should you use SQL versus pandas?

Common Pitfalls

Key Takeaways

References

Key Terms

Test Your Knowledge

Continue learning