ML for Fraud Detection
How machine learning models score transactions, the precision-recall tradeoff, and working with your data science team
By Benjamin, Fraud Attacks
Machine learning for fraud detection is the use of trained statistical models to score whether a transaction looks like fraud, based on patterns learned from labeled historical examples. The model doesn't make the final decision; it ranks risk so analysts and rules engines spend their time on the cases most likely to be real. This article covers what the model sees, how to evaluate it, and what fraud analysts need to know to work with one in production.
The Model That Cried Wolf
Amara's team had just deployed their new machine learning model. It was supposed to catch account takeover attacks before they caused damage. The first week, it flagged 12,000 transactions.
They reviewed a sample. False positive after false positive. A grandmother buying plane tickets for her grandkids. A sales rep logging in from a hotel in a city she'd never visited before. A college student buying textbooks at 2 AM. All flagged as "likely fraud."
Buried in the noise were 23 actual account takeovers, including one where a criminal had been siphoning funds from a small business for three days. Nobody caught it because the analyst reviewing the queue had given up scrolling through false alerts.
Amara pulled the model's feature weights. It had learned that "unusual login location" was the strongest fraud signal. Technically true: criminals often log in from unexpected places. But so does anyone who travels. The model had learned a real pattern and applied it so aggressively that it was useless.
She adjusted the feature weights, added contextual signals (does this user travel frequently? is the device recognized?), and retrained. The second version flagged 400 transactions per week instead of 12,000. The catch rate roughly doubled, because analysts could now actually review every alert.
The machine learning model didn't replace the analysts. It prioritized their attention. And getting that prioritization right was the hard part.
This story is fictional, but the patterns are real.
Why This Matters
In Python for Fraud Analysts, you learned how to use pandas to explore data, find patterns, and visualize suspicious activity. That's manual analysis. It works well when you have a specific question ("show me all refunds over $500 from new accounts") but it doesn't scale when you need to score every transaction in real time. This article connects that pandas work to the broader fraud fundamentals on the operational side.
Machine learning automates the pattern recognition. Instead of writing rules by hand ("flag transactions over $5,000 from new devices"), you show the model thousands of examples of fraud and not-fraud, and it learns the distinguishing patterns on its own. The standard Python toolkit for this is scikit-learn↗[1]; production fraud platforms layer gradient-boosted trees, neural networks, or ensembles on top.
This article isn't about building models. It's about understanding them well enough to work with your data science team, interpret model output, and avoid the mistakes that make models useless in production.
What does a machine learning model actually do?
At its core, machine learning is pattern matching at scale.
You give a model a large collection of historical transactions, each labeled as "fraud" or "not fraud." The model analyzes every feature of those transactions (amount, time of day, device type, location, account age, transaction velocity, and dozens more) and learns which combinations of features tend to predict fraud.
Once trained, the model takes a new transaction it's never seen before, examines the same features, and produces a score: how likely is this transaction to be fraudulent?
That score doesn't decide anything by itself. Your fraud rules, your analysts, and your business logic determine what happens next. The model's job is to rank-order risk so humans spend their time on the cases most likely to be real.
The Analogy
Think of it like hiring an experienced fraud analyst who has reviewed five million historical cases. They've internalized patterns they can't always articulate: certain combinations of device age, transaction velocity, and login timing that "feel" like fraud. A machine learning model does the same thing, but it processes those patterns mathematically and can apply them to thousands of transactions per second.
The difference is that the human analyst can explain their reasoning. The model often can't (at least not in a way that satisfies regulators). That tradeoff shapes how ML is used in fraud operations, and it sharpens with newer approaches that put LLMs in the analyst role: explainability gets harder as the model gets bigger.
Features: What the Model Sees
A feature is any piece of information about a transaction that might help predict whether it's fraudulent. Choosing the right features is often more important than choosing the right algorithm.
Common Feature Categories
| Category | Example Features | Why They Matter |
|---|---|---|
| Transaction | Amount, currency, payment method, merchant category | Unusual amounts or payment types for a given customer |
| Velocity | Transactions per hour, total spend today, distinct merchants today | Rapid activity suggests automation or urgency |
| Device | Device fingerprint, IP address, browser type, screen resolution | New or mismatched devices signal potential account takeover |
| Behavioral | Time of day, typical spending pattern, average transaction size | Deviations from personal baselines are suspicious |
| Account | Account age, verification level, prior fraud history | New or unverified accounts carry higher risk |
| Geographic | Login location, IP geolocation, shipping vs. billing address mismatch | Impossible travel or location inconsistencies |
| Network | Shared devices, shared addresses, shared payment methods | Connections to known fraudulent accounts |
Feature Engineering
Raw data often isn't useful to a model until you transform it. "Feature engineering" is the process of creating derived features from raw data.
For example, a transaction timestamp by itself isn't very informative. But these derived features tell a story:
- Hour of day: Is this transaction happening at 3 AM when the customer usually shops at noon?
- Days since last transaction: Has this dormant account suddenly come alive?
- Transaction count in last hour: Five purchases in sixty minutes from an account that normally makes one per week.
- Amount deviation: This transaction is four standard deviations above the customer's average.
Good feature engineering is where domain expertise meets data science. You understand fraud patterns. Data scientists understand algorithms. The best features come from combining both.
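The four derived features above can be sketched in pandas. This is a minimal illustration on made-up data, not a production pipeline: the column names (customer_id, timestamp, amount) and the choice to measure amount deviation against the customer's earlier transactions only are assumptions for the example.

```python
import pandas as pd

# Hypothetical toy data: one customer's transaction history.
tx = pd.DataFrame(
    {
        "customer_id": ["c1"] * 5,
        "timestamp": pd.to_datetime(
            [
                "2024-03-01 12:05",
                "2024-03-08 12:30",
                "2024-03-15 03:02",
                "2024-03-15 03:10",
                "2024-03-15 03:40",
            ]
        ),
        "amount": [40.0, 55.0, 400.0, 420.0, 450.0],
    }
).sort_values(["customer_id", "timestamp"]).reset_index(drop=True)

# Hour of day: lets the model compare 3 AM activity to a noon habit.
tx["hour_of_day"] = tx["timestamp"].dt.hour

# Days since the customer's previous transaction (NaN for their first one).
tx["days_since_last"] = (
    tx.groupby("customer_id")["timestamp"].diff().dt.total_seconds() / 86400
)

# Transaction count in the trailing 60 minutes, including the current one.
# Rows are already sorted by customer and time, so the result lines up with tx.
counts = (
    tx.set_index("timestamp")
    .groupby("customer_id")["amount"]
    .rolling("60min")
    .count()
)
tx["tx_count_1h"] = counts.to_numpy().astype(int)

# Amount deviation: z-score against the customer's *earlier* transactions only,
# so the current amount never leaks into its own baseline.
by_customer = tx.groupby("customer_id")["amount"]
prior_mean = by_customer.transform(lambda s: s.shift().expanding().mean())
prior_std = by_customer.transform(lambda s: s.shift().expanding().std())
tx["amount_z"] = (tx["amount"] - prior_mean) / prior_std

print(tx[["hour_of_day", "days_since_last", "tx_count_1h", "amount_z"]])
```

In this toy history the third transaction stands out on every derived feature at once: it happens at 3 AM, and its amount is far above the customer's earlier baseline.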
How are supervised and unsupervised learning different?
Supervised Learning: Learning from Labels
Supervised models learn from examples where you already know the answer. You provide thousands of transactions labeled "fraud" or "not fraud," and the model learns to distinguish between them.
How it works:
- Collect historical transactions with known outcomes
- Split the data into training, validation, and test sets (a common split is 60/20/20, with the test set held back until the end)
- Train the model on the training set
- Tune thresholds and hyperparameters against the validation set, or use k-fold cross-validation if data is limited
- Evaluate final performance on the test set (data the model hasn't seen)
Split by time, not at random. Fraud is a moving target. If you shuffle transactions and pick a random 20% for testing, you leak future patterns into your training data: the model trains on March behavior while "predicting" January, which inflates its scores. Use a time-based (or "walk-forward") split instead. Train on, say, January through August, validate on September, and test on October through December. That mirrors how the model will be used in production, where it always predicts on data newer than what it was trained on.
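A time-based split can be sketched as follows. The helper name time_split, the column names, and the one-row-per-day toy data are assumptions for illustration; the cutoffs mirror the January through August, September, and October through December example.

```python
import pandas as pd


def time_split(df, time_col, train_end, valid_end):
    """Split chronologically: train < train_end <= validation < valid_end <= test."""
    df = df.sort_values(time_col)
    train = df[df[time_col] < train_end]
    valid = df[(df[time_col] >= train_end) & (df[time_col] < valid_end)]
    test = df[df[time_col] >= valid_end]
    return train, valid, test


# Hypothetical data: one transaction per day across a single year.
tx = pd.DataFrame({"timestamp": pd.date_range("2024-01-01", "2024-12-31", freq="D")})
tx["is_fraud"] = 0  # placeholder label column

train, valid, test = time_split(
    tx,
    "timestamp",
    train_end=pd.Timestamp("2024-09-01"),
    valid_end=pd.Timestamp("2024-10-01"),
)
print(len(train), len(valid), len(test))
```

Every training row is older than every validation row, and every validation row is older than every test row, which is the property a random shuffle destroys.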
Common algorithms for fraud:
| Algorithm | Strengths | Typical Use |
|---|---|---|
| Logistic regression | Simple, interpretable, fast | Baseline model, regulatory-friendly |
| Random forest | Handles complex patterns, resistant to overfitting | General fraud scoring |
| Gradient boosted trees (XGBoost, LightGBM) | High accuracy, handles feature interactions | Production fraud models |
| Neural networks | Can learn extremely complex patterns | Large-scale systems with abundant data |
The catch: You need labeled data. Someone has to review past transactions and confirm which ones were actually fraud. Labels are expensive to produce and often incomplete. Fraud that was never detected doesn't appear in your training data, which means the model can only learn to catch fraud types you've already identified.
Unsupervised Learning: Finding Anomalies
Unsupervised models don't use labels. Instead, they learn what "normal" looks like and flag anything that deviates significantly.
How it works:
- Feed the model a large dataset of transactions (no labels needed)
- The model builds a statistical profile of normal behavior
- New transactions that deviate from "normal" get flagged as anomalies
Common approaches:
- Clustering: Group similar transactions together. Transactions that don't fit any cluster are suspicious.
- Isolation forests: Randomly partition data and measure how quickly each data point gets isolated. Anomalies get isolated faster.
- Autoencoders: Neural networks that compress and reconstruct data. Transactions the autoencoder can't accurately reconstruct are unusual.
The advantage: You don't need labeled data, and the model can find fraud patterns you've never seen before.
The disadvantage: "Unusual" doesn't mean "fraudulent." A customer making their first international purchase is anomalous but perfectly legitimate. Unsupervised models tend to produce more false positives.
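As a rough sketch of the isolation forest approach, the example below uses scikit-learn's IsolationForest on synthetic data. The two features (amount and transactions in the last hour), the distributions, and the single extreme transaction are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical "normal" behavior: amounts around 50, about one transaction per hour.
normal = np.column_stack([rng.normal(50, 10, 1000), rng.poisson(1, 1000)])

# One made-up transaction that looks nothing like the rest.
odd = np.array([[5000.0, 30.0]])

# No labels are passed to fit(): the model only learns what typical rows look like.
model = IsolationForest(n_estimators=100, random_state=0).fit(normal)

# score_samples returns lower values for rows that are easier to isolate.
normal_scores = model.score_samples(normal)
odd_score = model.score_samples(odd)[0]

print(f"median score of normal rows: {np.median(normal_scores):.3f}")
print(f"score of the odd row:        {odd_score:.3f}")
```

A low score only says the row is unusual. Whether it is fraud is still a question for a supervised model, a rule, or an analyst.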
In Practice: Both Together
Most production fraud systems use both approaches. A supervised model scores every transaction based on known fraud patterns. An unsupervised model watches for anomalies that don't match any known pattern. The combination catches more fraud than either alone.
Model Evaluation: The Numbers That Matter
Class Imbalance: Why Accuracy Lies
Fraud is the canonical imbalanced classification problem. In most real datasets, fewer than 1% of transactions are fraudulent, and 0.1% is common. The widely-used Kaggle Credit Card Fraud Detection benchmark↗[2] sits at 0.17% (492 fraud cases in 284,807 transactions). That single fact warps every metric you might reach for.
Consider a model that flags nothing as fraud, ever. If 0.5% of transactions are actually fraudulent, that model is 99.5% accurate. It also catches zero fraud. Accuracy is essentially useless on this kind of data, which is why fraud teams talk about precision, recall, and false positive rate instead.
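The arithmetic behind that claim, as a small sketch. The volume of 100,000 transactions is an assumed round number; the 0.5% fraud rate comes from the example above.

```python
total = 100_000          # assumed number of transactions
fraud = total // 200     # 0.5% fraud rate -> 500 fraud cases

# A "model" that never flags anything: every prediction is "legitimate".
tp, fp = 0, 0
fn, tn = fraud, total - fraud

accuracy = (tp + tn) / total
recall = tp / (tp + fn)

print(f"accuracy: {accuracy:.3f}")
print(f"recall:   {recall:.3f}")
```

Accuracy rewards the model for the 99,500 legitimate transactions it waved through and says nothing about the 500 fraud cases it missed. Recall exposes the problem immediately.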
Imbalance also makes models lazy during training. Most algorithms minimize total error, so a default-trained model will happily learn "predict legitimate" because it's almost always right. Three common levers for fixing this:
- Class weighting. Tell the algorithm that misclassifying a fraud case costs (say) 100 times more than misclassifying a legitimate one. Most libraries support this through a class_weight or scale_pos_weight parameter. Cheap, no data changes required, usually the first thing to try.
- Oversampling fraud (SMOTE↗[3] and variants). Synthetically generate new fraud examples by interpolating between real ones, so the training set looks more balanced. Helps the model see more fraud during training, but can amplify noise and rare-pattern overfitting if used carelessly. Two practical guardrails: apply SMOTE only inside training folds (never on the full dataset before the train/test split, or you'll leak synthetic patterns into the evaluation set), and remember that fraud is often heterogeneous, so interpolating between two very different fraud cases can produce unrealistic synthetics. The imbalanced-learn↗[4] library is the standard Python implementation.
- Undersampling legitimate. Throw away most of the legitimate transactions in your training data to match the fraud count. Simple, fast, but you lose information about what normal behavior looks like.
None of these change reality: the underlying fraud rate stays the same in production. They just shape what the model pays attention to during training. Whichever you use, always evaluate on the original, imbalanced distribution, not the rebalanced training set.
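Class weighting is the easiest of the three levers to try. The sketch below uses scikit-learn's LogisticRegression with class_weight on synthetic data from make_classification; the dataset, the 1:100 weights, and the stratified random split are assumptions for illustration. Synthetic rows have no timestamps, so the random split stands in for the time-based split you would use on real transactions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for transaction data: roughly 1% of rows are "fraud" (label 1).
X, y = make_classification(
    n_samples=20_000,
    n_features=10,
    n_informative=5,
    weights=[0.99, 0.01],
    flip_y=0.0,
    random_state=0,
)

# The test set keeps the original imbalance; only training is reweighted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Default training: every mistake costs the same.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Weighted training: a missed fraud case costs 100x a misclassified legitimate one.
weighted = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 100}).fit(
    X_train, y_train
)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))

print(f"recall without weighting: {recall_plain:.2f}")
print(f"recall with 1:100 weights: {recall_weighted:.2f}")
```

Higher recall from weighting usually comes with more false positives, so check precision on the same untouched test set before adopting the weighted model.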
The Confusion Matrix
Every prediction a model makes falls into one of four categories:
| | Actually Fraud | Actually Legitimate |
|---|---|---|
| Model says fraud | True Positive (caught it) | False Positive (false alarm) |
| Model says legitimate | False Negative (missed it) | True Negative (correctly approved) |
The tension in fraud detection is always between false positives and false negatives.
Too many false positives: Legitimate customers get blocked. Analysts waste time reviewing good transactions. Customer experience suffers. Revenue drops.
Too many false negatives: Fraud gets through. The company eats losses. Customers lose money and trust. Regulators ask questions.
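To see the four cells computed from real predictions, here is a minimal sketch using scikit-learn's confusion_matrix on ten made-up transactions (1 = fraud, 0 = legitimate). Note that scikit-learn lays the matrix out with actual classes as rows and predicted classes as columns, which is transposed relative to the table above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical outcomes for ten transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # what actually happened
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # what the model said

# With labels=[0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"true positives (caught it):           {tp}")
print(f"false positives (false alarm):        {fp}")
print(f"false negatives (missed it):          {fn}")
print(f"true negatives (correctly approved):  {tn}")
```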
Key Metrics
Precision: Of everything the model flagged as fraud, what percentage actually was fraud?
- Low precision = too many false positives
- A model with 10% precision means 90% of flagged transactions are legitimate
Recall (sensitivity): Of all the actual fraud, what percentage did the model catch?
- Low recall = too much fraud getting through
- A model with 70% recall misses 30% of fraud
False positive rate: Of all legitimate transactions, what percentage got incorrectly flagged?
- Even a tiny percentage creates huge volumes. If you process 10 million transactions per day and your false positive rate is 1%, that's 100,000 false alerts daily.
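The three metrics are simple ratios over the confusion matrix counts. The worked example below reuses the 10 million transactions per day, 1% false positive rate, and 70% recall from the text, and adds an assumed 0.1% fraud rate so the counts can be filled in.

```python
def precision(tp, fp):
    """Share of flagged transactions that were actually fraud."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual fraud that was flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp, tn):
    """Share of legitimate transactions that were flagged."""
    return fp / (fp + tn) if (fp + tn) else 0.0


daily_volume = 10_000_000
fraud = daily_volume // 1000   # assumed 0.1% fraud rate -> 10,000 fraud cases
legit = daily_volume - fraud   # 9,990,000 legitimate transactions

tp = fraud * 70 // 100         # 70% recall -> 7,000 caught
fn = fraud - tp                # 3,000 missed
fp = legit // 100              # 1% false positive rate -> 99,900 false alerts
tn = legit - fp

print(f"false alerts per day: {fp:,}")
print(f"precision: {precision(tp, fp):.4f}")
print(f"recall: {recall(tp, fn):.2f}")
print(f"false positive rate: {false_positive_rate(fp, tn):.2f}")
```

Under these assumptions a 1% false positive rate sounds small, yet only about 6.5% of alerts are real fraud, because legitimate transactions outnumber fraud a thousand to one.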
ROC-AUC and PR-AUC: Precision and recall depend on the score threshold you pick. Sweep the threshold from 0 to 1 and you get curves instead of points. The ROC curve plots true positive rate against false positive rate; the area under it (ROC-AUC) summarizes the model's ranking ability in a single number, where 0.5 is random and 1.0 is perfect. The precision-recall curve plots precision against recall, and its area (PR-AUC) is the more honest summary on heavily imbalanced data like fraud, because ROC-AUC can look optimistic when negatives outnumber positives a hundred to one. Most fraud teams quote both, but treat PR-AUC as the headline number. scikit-learn's model evaluation user guide↗[5] covers the implementations.
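A quick way to see the gap between the two summaries is to compute both on the same scores. The sketch below uses scikit-learn's roc_auc_score and average_precision_score (average precision is scikit-learn's usual summary of the precision-recall curve). The score distributions and the 0.1% fraud rate are invented for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)
n_legit, n_fraud = 100_000, 100  # hypothetical 0.1% fraud rate

# Hypothetical model scores in [0, 1]: fraud tends to score high, legitimate low,
# with some overlap between the two groups.
scores = np.concatenate([rng.beta(2, 5, n_legit), rng.beta(5, 2, n_fraud)])
labels = np.concatenate(
    [np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)]
)

roc_auc = roc_auc_score(labels, scores)
pr_auc = average_precision_score(labels, scores)

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC (average precision): {pr_auc:.3f}")
```

The same ranking produces a flattering ROC-AUC and a much more sober PR-AUC, because the small fraction of legitimate transactions that score high still outnumbers the fraud cases.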
The precision-recall tradeoff: You can always increase recall by lowering the model's threshold (flag more transactions). But that also increases false positives. You can increase precision by raising the threshold (only flag the most suspicious cases). But that means missing more fraud. There's no free lunch. The right balance depends on your fraud rate, your review capacity, and the cost of false negatives vs. false positives.
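The tradeoff is easy to see on a handful of scored transactions. This is a minimal sketch with made-up scores and labels; the helper precision_recall_at is a hypothetical name, and returning 0.0 precision when nothing is flagged is a convention chosen for the example.

```python
def precision_recall_at(scores, labels, threshold):
    """Flag every score at or above the threshold, then measure the result."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    tp = sum(flagged)
    total_fraud = sum(labels)
    # Precision is undefined when nothing is flagged; 0.0 is used as a convention.
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / total_fraud if total_fraud else 0.0
    return precision, recall


# Hypothetical model scores and confirmed outcomes (1 = fraud, 0 = legitimate).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

results = {t: precision_recall_at(scores, labels, t) for t in (0.85, 0.65, 0.35)}
for threshold, (p, r) in results.items():
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

Lowering the threshold from 0.85 to 0.35 lifts recall from 50% to 100% while precision falls from 100% to about 57%. Where to stop depends on how many alerts the review team can absorb.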
What "Good" Looks Like
There's no universal benchmark. The right precision and recall depend on fraud type, transaction volume, the cost of a missed case versus the cost of blocking a good customer, and how much human review capacity you have. A wire fraud model can justify aggressive recall because missing a single case is catastrophic; a credit card model has to keep false positives low because volume is so high; a refund-abuse model leans on precision because each case eats analyst time. Calibrate against your own historical losses and review capacity rather than chasing externally published numbers.
How do ML models work in production?
Real-Time Scoring
In a production fraud system, the ML model sits in the transaction flow. Every transaction gets scored before it's approved or declined.
Customer initiates transaction
↓
System collects features (device, location, amount, velocity, etc.)
↓
ML model scores the transaction (0.0 to 1.0 risk)
↓
Rules engine applies thresholds:
- Score < 0.3 → Approve automatically
- Score 0.3-0.7 → Queue for analyst review
- Score > 0.7 → Decline or step-up authentication
↓
Analyst reviews queued cases → Approves or declines
↓
Outcome feeds back into training data
The model doesn't make the final decision. It triages. Low-risk transactions sail through. High-risk transactions get blocked. The middle band goes to human reviewers who make the judgment call. Anomaly detection on API logs is the API-side analog of this transaction-scoring loop: same triage shape, different evidence layer.
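The threshold step in that flow can be sketched as a small function. The 0.3 and 0.7 cutoffs come from the example above; the function name, the return labels, and the handling of scores exactly on a boundary (sent to review) are assumptions for illustration.

```python
def triage(score, approve_below=0.3, decline_above=0.7):
    """Map a model risk score in [0.0, 1.0] to a next step.

    Scores exactly on a boundary fall into the review band.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score < approve_below:
        return "approve"
    if score > decline_above:
        return "decline_or_step_up"
    return "review"


for s in (0.05, 0.30, 0.55, 0.70, 0.92):
    print(f"{s:.2f} -> {triage(s)}")
```

Keeping the cutoffs as parameters makes the business tradeoff explicit: widening the review band sends more cases to analysts, narrowing it automates more decisions.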
The Feedback Loop
The model's predictions generate outcomes. Those outcomes become training data for the next version of the model. This creates a feedback loop that can be virtuous or vicious.
Virtuous loop: Model catches fraud. Analyst confirms it. Confirmed fraud case improves the next model. Catch rate goes up.
Vicious loop: Model blocks certain customer profiles. Those customers never transact, so there's no fraud data from them. The model "learns" that blocking them was the right call. Bias gets reinforced.
This is why human review in the middle band matters. Analysts aren't just making case-by-case decisions. They're generating the labeled data that trains the next model.
Model Drift
Fraud patterns change. Criminals adapt. A model trained on last year's fraud will miss this year's attacks.
Common causes of drift:
- Criminals change tactics to avoid detection
- Customer behavior shifts (e.g., more mobile payments, new product launches)
- Seasonal patterns (holiday shopping changes baseline behavior)
- New features become available that the old model didn't use
Production fraud models typically need to be retrained every few months. Monitoring for drift (declining precision, rising false negative rate) is a continuous process. The NIST AI Risk Management Framework (AI RMF 1.0)↗[6], released January 26, 2023, is the canonical reference for governing this kind of ongoing model risk in regulated environments.
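One simple way to watch for declining precision is to track, week by week, what share of reviewed alerts analysts confirmed as fraud. The sketch below is an illustration only: the alert data is made up, and the rule "raise a flag when weekly precision drops below half of the first week's value" is an arbitrary assumption, not a recommended threshold.

```python
import pandas as pd

# Hypothetical analyst outcomes for alerts the model raised (1 = confirmed fraud).
alerts = pd.DataFrame(
    {
        "week": ["2024-W01"] * 4 + ["2024-W02"] * 4 + ["2024-W03"] * 4,
        "confirmed_fraud": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    }
)

# Every row is a flagged transaction, so the mean of confirmed_fraud per week
# is that week's precision.
weekly_precision = alerts.groupby("week")["confirmed_fraud"].mean()

# Assumed alert rule: precision below half of the first week's baseline.
baseline = weekly_precision.iloc[0]
drifting = weekly_precision[weekly_precision < 0.5 * baseline]

print(weekly_precision)
print("weeks needing a closer look:", list(drifting.index))
```

Precision is convenient to monitor because it only needs outcomes for alerts that were already reviewed. Tracking the false negative rate takes longer, since missed fraud often surfaces weeks later through chargebacks or customer reports.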
What Fraud Analysts Need to Know
You don't need to build models. But you need to work with them effectively.
Understand what the model sees. Ask your data science team which features the model uses. If the top feature is "login from new device," you'll know why traveling customers keep getting flagged.
Provide feedback consistently. When you review cases, your fraud/not-fraud decisions become training data. Inconsistent or sloppy reviews corrupt the model. If you mark a transaction as "not fraud" because you're too busy to investigate, the model learns the wrong lesson.
Question the scores. A high fraud score doesn't mean a case is definitely fraud. A low score doesn't mean it's safe. The model is estimating probability based on patterns, not making a definitive judgment.
Report new fraud patterns. If you spot a fraud type the model isn't catching, tell your data science team. They can add features or relabel training data to address the gap. You're the model's eyes on the ground, and your operational investigation work is one of the highest-quality sources of labeled data the team has.
Understand the limitations. Models are trained on historical data. They catch fraud that resembles past fraud. Completely novel attacks will get low scores until enough examples accumulate. Rules and human judgment remain essential for catching the unexpected.
Key Takeaways
- ML models score risk; they don't decide. The model estimates how likely a transaction is to be fraudulent. Rules, thresholds, and human reviewers determine what happens next.
- Features matter more than algorithms. The right inputs (device data, velocity metrics, behavioral baselines) are more important than which algorithm you choose.
- False positives are the real battlefield. Catching fraud is relatively easy. Catching fraud without blocking legitimate customers is the hard problem.
- Analyst feedback trains the model. Your case reviews become the labeled data that improves future models. Consistent, accurate reviews compound over time.
- Models drift and need maintenance. Criminals adapt, customer behavior shifts, and models lose accuracy. Continuous monitoring and periodic retraining are essential.
What's next: Review Python for Fraud Analysts if you haven't already, and explore the Data Science Exercises to practice building features and evaluating model output with real fraud data.
References
1. scikit-learn — User guide↗ - The standard Python ML library; covers classification, regression, model selection, and metrics.
2. Kaggle — Credit Card Fraud Detection dataset↗ - The canonical class-imbalance benchmark: 492 fraud cases in 284,807 transactions (0.17%).
3. Chawla et al. — SMOTE: Synthetic Minority Over-sampling Technique (Journal of Artificial Intelligence Research, 2002)↗ - The original SMOTE paper.
4. imbalanced-learn↗ - Python library implementing SMOTE, random under/over-samplers, and related class-imbalance utilities, compatible with scikit-learn.
5. scikit-learn — Metrics and scoring user guide↗ - Precision, recall, F1, confusion matrix, ROC-AUC, PR-AUC, and the threshold-sweep mechanics behind each.
6. NIST AI 100-1 — Artificial Intelligence Risk Management Framework (AI RMF 1.0, January 26, 2023)↗ - The canonical governance framework for trustworthy AI; covers measurement, drift monitoring, and incident-response considerations for ML systems.
Key Terms
| Term | Definition |
|---|---|
| Feature | A measurable characteristic of a transaction used as input to a machine learning model |
| Feature engineering | Creating derived features from raw data to improve model performance |
| Supervised learning | Training a model on labeled examples (known fraud and non-fraud) |
| Unsupervised learning | Training a model to detect anomalies without labeled examples |
| True positive | A fraudulent transaction correctly identified by the model |
| False positive | A legitimate transaction incorrectly flagged as fraud |
| False negative | A fraudulent transaction the model failed to catch |
| Precision | The percentage of flagged transactions that are actually fraud |
| Recall | The percentage of actual fraud that the model successfully catches |
| Model drift | The decline in model accuracy over time as fraud patterns and customer behavior change |
| Feedback loop | The cycle where model predictions generate outcomes that become training data for future models |