Python for Fraud Analysts
Transition from SQL to pandas for data exploration, visualization, and ad-hoc fraud investigation
By Benjamin, Fraud Attacks · Updated
Pandas is the Python library most fraud analysts use to load, clean, and analyze tabular data that is too large or too messy for Excel and too exploratory for a single SQL query. It works with millions of rows in memory, transforms inconsistent formats, groups and aggregates flexibly, and plots results inline. This article walks through the transition from SQL to pandas: the syntax map, the common investigation workflows, and when to reach for each tool.
The Pivot Table That Broke Excel
Kwame had been a fraud analyst for three years, and SQL was his native language. He could write five-table joins in his sleep. He could spot a suspicious velocity pattern in a query result set before his coffee got cold.
Then his manager dropped a CSV file on his desk. Fourteen million rows. Transaction data from a partner bank. "We need to find coordinated refund abuse across both platforms. Cross-reference our internal data with theirs and build a report."
Kwame opened the file in Excel. It crashed. He tried again with a filtered import. Excel loaded the first million rows and silently dropped the rest.
He spent the next hour writing SQL to import the CSV into a temp table, but the partner's column formats didn't match his schema. Dates were strings. Amounts had currency symbols embedded. Half the ZIP codes had leading zeros stripped. He was three hours in and hadn't started the actual analysis.
His teammate Priya walked over, opened her laptop, and typed four lines of code:
import pandas as pd
df = pd.read_csv("partner_transactions.csv", dtype=str)
df["amount"] = pd.to_numeric(
df["amount"].str.replace(r"[\$,]", "", regex=True),
errors="coerce",
)
df.head()pd.to_numeric(..., errors="coerce") turns anything it can't parse (like "N/A" or "-") into NaN instead of crashing, which is what you want on the messy data you actually get from partner systems.
The file loaded in under a minute. Dirty data handled inline. "Want me to show you the rest?" she asked.
By the end of the week, Kwame had found the refund abuse ring. But the real discovery was that Python let him do in an afternoon what would have taken him days in SQL alone.
This story is fictional, but the patterns are real.
Why This Matters
You've built strong SQL skills through the exercises in this platform and the SQL crash course in fraud-basics. SQL is excellent for querying databases: filtering transactions, joining tables, aggregating data. It's the backbone of fraud investigation.
But SQL has limits. It doesn't handle messy data well. It can't produce visualizations. It struggles with ad-hoc analysis where you're exploring data without knowing exactly what you're looking for. And it can't run machine learning models.
Python with the pandas library↗[1] fills those gaps. It's not a replacement for SQL. It's the tool you reach for when SQL isn't enough. This article covers the transition: how to map your SQL thinking to pandas, and when to use which tool.
Your First Pandas Workflow
DataFrames: Tables You Can Manipulate
If you think in SQL, you already think in DataFrames. A DataFrame is a table. Rows and columns. That's it.
| SQL Concept | Pandas Equivalent |
|---|---|
| Table | DataFrame |
| Row | Row (accessed by index) |
| Column | Series |
SELECT * | df |
SELECT col1, col2 | df[["col1", "col2"]] |
WHERE condition | df[df["col"] > value] |
ORDER BY col | df.sort_values("col") |
LIMIT 10 | df.head(10) |
COUNT(*) | len(df) |
GROUP BY | df.groupby("col") |
Loading Data
SQL queries a database. Pandas reads files directly via pd.read_csv↗[2] and its siblings.
import pandas as pd
# Read a CSV (the most common format you'll encounter)
df = pd.read_csv("transactions.csv")
# Read an Excel file
df = pd.read_excel("report.xlsx")
# Read JSON (common from API responses)
df = pd.read_json("api_response.json")
# Read directly from a SQL database
df = pd.read_sql("SELECT * FROM transactions WHERE date > '2025-01-01'", connection)That last one is the bridge between worlds. You can pull data from your SQL database directly into pandas for further analysis.
Filtering: The WHERE Clause Equivalent
In SQL, you write WHERE amount > 1000. In pandas:
# SQL: SELECT * FROM transactions WHERE amount > 1000
high_value = df[df["amount"] > 1000]
# SQL: SELECT * FROM transactions WHERE status = 'refund' AND amount > 500
suspicious = df[(df["status"] == "refund") & (df["amount"] > 500)]
# SQL: SELECT * FROM transactions WHERE merchant IN ('Store A', 'Store B')
target_merchants = df[df["merchant"].isin(["Store A", "Store B"])]The syntax is different, but the logic is identical. You're filtering rows based on conditions.
Grouping and Aggregation
This is where pandas starts to shine. SQL's GROUP BY is powerful but rigid. Pandas groupby↗[3] gives you more flexibility.
# SQL: SELECT customer_id, COUNT(*), SUM(amount)
# FROM transactions GROUP BY customer_id
summary = df.groupby("customer_id").agg(
transaction_count=("amount", "count"),
total_spent=("amount", "sum")
)
# SQL: SELECT customer_id, COUNT(*) FROM transactions
# WHERE status = 'chargeback' GROUP BY customer_id HAVING COUNT(*) > 3
repeat_chargebacks = (
df[df["status"] == "chargeback"]
.groupby("customer_id")
.size()
.reset_index(name="chargeback_count")
.query("chargeback_count > 3")
)Adding Calculated Columns
In SQL, you'd use a CASE statement or a computed column. In pandas, you operate directly on columns:
# Flag transactions over $10,000 as a high-value review trigger.
# Note: BSA's CTR rule (31 CFR 1010.311) applies only to currency (cash)
# transactions; if your dataframe mixes wires, ACH, and cards, you'd
# narrow this to cash before reaching for "CTR." SAR filing is a
# separate, suspicion-based regime with its own $5,000 threshold.
df["high_value"] = df["amount"] > 10000
# Convert the timestamp column from string to datetime first
# (CSVs load everything as strings; .dt accessors and .diff() need real datetimes)
df["timestamp"] = pd.to_datetime(df["timestamp"])
# Calculate time between transactions per customer
df = df.sort_values(["customer_id", "timestamp"])
df["time_since_last"] = df.groupby("customer_id")["timestamp"].diff()
# Once converted, you can pull out parts of the timestamp
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.day_name()
# Create a risk score based on multiple factors
df["risk_score"] = (
(df["amount"] > 5000).astype(int) * 2 +
(df["is_new_device"]).astype(int) * 3 +
(df["country"] != df["home_country"]).astype(int) * 2
)That last example would be awkward in SQL. In pandas, you're combining boolean conditions into a weighted score in three lines. The high_value flag in the example above is the natural pandas hook for downstream review queues. For cash transactions specifically, the same $10,000 cutoff is the CTR reporting threshold, and unusual patterns above it (or structured below it) feed SAR filing workflows.
Working in Jupyter Notebooks
Why do notebooks matter for fraud work?
A Jupyter↗[4] notebook is an interactive environment where you write code in cells and see results immediately. It's like a scratchpad that shows your work.
For fraud analysis, this matters because investigations are iterative. You don't know the answer before you start. You explore, filter, plot, refine. Notebooks let you build an investigation step by step, keeping every intermediate result visible.
The Investigation Workflow
A typical fraud investigation in a notebook follows a pattern:
Cell 1: Load and inspect the data
import pandas as pd
df = pd.read_csv("flagged_transactions.csv")
print(f"Loaded {len(df):,} transactions")
df.head()Cell 2: Filter to the suspicious subset
suspicious = df[
df["velocity_flag"] &
(df["amount"] > 1000)
]
print(f"{len(suspicious):,} flagged transactions")Cell 3: Look for patterns
suspicious.groupby("device_fingerprint").agg(
accounts=("customer_id", "nunique"),
total_amount=("amount", "sum"),
transaction_count=("amount", "count")
).sort_values("accounts", ascending=False).head(20)Cell 4: Visualize what you found
suspicious.groupby("hour")["amount"].count().plot(
kind="bar", title="Suspicious Transactions by Hour"
)Each cell builds on the last. You can go back and modify earlier cells, re-run them, and see how the results change. The notebook is both your analysis tool and your documentation, and it pairs naturally with the kind of pattern-based investigation that turns up business-logic abuse: load the logs, filter to the suspicious subset, group on shared infrastructure, plot the time distribution.
Notebooks as evidence: A well-structured notebook documents your entire investigation process: what data you looked at, what filters you applied, what patterns you found. This can be exported as a PDF and included in case files.
Visualization: Seeing the Pattern
Fraud patterns are often invisible in tables but obvious in charts. A column of numbers showing login attempts per hour looks unremarkable. The same data as a bar chart reveals a spike at 3 AM that screams "bot."
Essential Charts for Fraud Analysis
Time-series plots show activity over time. Useful for spotting spikes, unusual hours, or coordinated timing.
import matplotlib.pyplot as plt # the standard Python plotting library
# Transactions per hour
df.groupby(df["timestamp"].dt.hour)["amount"].count().plot(
kind="line", title="Transactions by Hour of Day"
)
plt.xlabel("Hour")
plt.ylabel("Count")
plt.show()matplotlib↗[5] handles the plotting; pandas wires its DataFrames into it.
Histograms show the distribution of values. Useful for spotting amount clustering (e.g., many transactions at exactly $9,999 to stay under reporting thresholds).
df["amount"].hist(bins=50)
plt.title("Transaction Amount Distribution")
plt.xlabel("Amount ($)")
plt.show()Scatter plots reveal relationships between two variables. Useful for spotting outliers.
df.plot.scatter(x="account_age_days", y="transaction_count", alpha=0.3)
plt.title("Account Age vs. Transaction Volume")
plt.show()Heatmaps show patterns across two dimensions. Useful for spotting coordinated behavior.
import seaborn as sns # statistical visualization built on matplotlib
pivot = df.pivot_table(
values="amount", index="day_of_week",
columns="hour", aggfunc="count"
)
sns.heatmap(pivot, cmap="YlOrRd")
plt.title("Transaction Frequency: Day vs. Hour")
plt.show()seaborn↗[6] sits on top of matplotlib and adds higher-level statistical charts (heatmaps, KDE plots, joint distributions) that are useful for spotting coordinated behavior across two variables.
When should you use SQL versus pandas?
Neither tool is universally better. They serve different purposes.
| Use SQL When | Use Pandas When |
|---|---|
| Querying a production database | Working with files (CSV, Excel, JSON) |
| Pulling large datasets with complex joins | Exploring and transforming messy data |
| Scheduled reports that run automatically | Ad-hoc investigation with iterative refinement |
| Other people need to run the same query | You need visualizations or statistical analysis |
| Performance on very large datasets matters | Combining data from multiple sources |
The most effective fraud analysts use both. Pull the data you need with SQL, then switch to pandas for exploration, cleaning, and visualization. The same hybrid pattern shows up in operational fraud teams: investigation techniques in money-movement often start as a SQL query and end as a pandas notebook.
Common Pitfalls
Loading everything into memory. Pandas loads entire files into RAM. A 10 GB file will crash your machine if you only have 16 GB of RAM. Filter in SQL first, then analyze the subset in pandas.
Chained indexing. Writing df[df["col"] > 5]["other_col"] = "flag" looks right but doesn't always work. Use .loc instead: df.loc[df["col"] > 5, "other_col"] = "flag".
Forgetting about data types. CSV files load everything as strings by default. Amounts stored as "$1,234.56" won't sort numerically until you convert them with Series.str.replace↗[7] (note: regex=True is required because the default is regex=False, the opposite of what you'd guess) followed by .astype(float). Timestamps stay as plain text until you run pd.to_datetime(df["timestamp"]). Without that conversion, .dt.hour, .diff(), and date-based comparisons all fail. Always check df.dtypes after loading.
Not handling missing data. Real-world data has gaps. df["amount"].mean() will silently ignore NaN values. That might be what you want, or it might skew your results. Always check df.isnull().sum() early in your analysis.
Key Takeaways
- Pandas DataFrames are tables. If you can write SQL, you can learn pandas. The concepts map directly: SELECT is column selection, WHERE is boolean filtering, GROUP BY is
.groupby(). - Notebooks support iterative investigation. Unlike SQL scripts, Jupyter notebooks let you explore data step by step, see intermediate results, and document your reasoning as you go.
- Visualization exposes patterns that tables hide. Time-series plots, histograms, and heatmaps make fraud patterns visible that would be easy to miss in raw numbers.
- Use both tools together. SQL is best for querying databases and pulling large datasets. Pandas is best for exploration, transformation, and visualization. The strongest analysts combine them.
- Watch out for memory and data types. Pandas loads data into RAM, and CSV files default to strings. Check your data types and filter before loading when working with large datasets.
What's next: ML for Fraud Detection covers how machine learning models use the patterns you find in data to automatically score transactions for fraud risk. For broader framing on where pandas fits in a fraud team's tooling, see Fraud 101.
References
1. pandas — User guide↗ - Official documentation for the DataFrame library.
2. pandas — pd.read_csv API reference↗ - The canonical CSV loader; covers dtypes, chunking, and encoding options.
3. pandas — Group by user guide↗ - Split-apply-combine patterns; the pandas analog of SQL GROUP BY.
4. Project Jupyter↗ - The official site for Jupyter notebooks and JupyterLab.
5. matplotlib — User guide↗ - The standard Python plotting library that pandas plotting wires into.
6. seaborn↗ - Higher-level statistical visualizations built on top of matplotlib.
7. pandas — Series.str.replace API reference↗ - Note regex=False is the default in current pandas; pass regex=True for regex pattern matching.
Key Terms
| Term | Definition |
|---|---|
| pandas | A Python library for data manipulation and analysis, built around the DataFrame data structure |
| DataFrame | A two-dimensional data structure (rows and columns) similar to a SQL table or spreadsheet |
| Series | A single column from a DataFrame |
| Jupyter notebook | An interactive coding environment where you write and execute code in cells, seeing results inline |
| matplotlib | A Python library for creating charts and visualizations |
| seaborn | A Python library built on matplotlib that provides higher-level statistical visualizations |
| NaN | "Not a Number," used in pandas to represent missing or null values |
.groupby() | The pandas method for grouping rows by column values, equivalent to SQL's GROUP BY |
.agg() | The pandas method for applying aggregation functions (sum, count, mean) to grouped data |
Test Your Knowledge
Ready to test what you've learned? Take the quiz to reinforce your understanding.
Continue learning
- Data Science for FraudML for Fraud DetectionHow machine learning models score transactions, the precision-recall tradeoff, and working with your data science team
- More from Fraud BasicsFraud 101: What Is Fraud?Absolute basics for someone who has never looked at fraud: what is fraud, how is it different from other crimes, and why does it matter
- More from Money Movement & Transaction FraudPayment Systems 101: How Money Really MovesEssential foundation for understanding how ACH, wire transfers, card payments, and digital payments actually work - and why criminals target them
- More from Account TakeoverATO FundamentalsEssential foundation every fraud professional needs to know about account takeover attacks