Credit card fraud detection

A fraud detection pipeline using XGBoost with SMOTE oversampling and SHAP explanations, trained on 10K synthetic transactions with a 2% fraud rate.

Fraud detection · XGBoost · SMOTE · SHAP · Class imbalance · Cost optimization

0.97

AUC-ROC, best model (XGBoost)

Interactive dashboard

Four-page Streamlit application for transaction scoring and model analysis

Transaction scoring

Real-time fraud probability from feature inputs
Risk gauge with approve/block decision
Adjustable transaction parameters

Model comparison

Side-by-side AUC-ROC, PR-AUC, precision, recall, F1
ROC curves for all four classifiers
Precision-recall curves

Threshold tuning

Adjustable FP and FN cost parameters
Total cost curve across thresholds
Recall vs false positive rate trade-off

SHAP analysis

Global feature importance bar chart
Per-transaction waterfall explanations
Feature value and contribution table

$ pip install -r requirements.txt && streamlit run app.py

Key results

XGBoost with SMOTE oversampling on 10K synthetic transactions

0.97

AUC-ROC

94%

Recall (fraud caught)

False positive rate

0.82

PR-AUC

Methodology

The pipeline uses synthetic transaction data (10K records, 2% fraud) with features capturing amount, timing, geographic distance, velocity, and merchant category. SMOTE oversampling balances the training set. Four classifiers (Logistic Regression, Random Forest, XGBoost, LightGBM) are trained with 5-fold stratified cross-validation. SHAP TreeExplainer provides per-transaction explanations. A cost-based threshold optimizer weighs missed fraud ($500 per false negative) against false alarms ($25 per false positive) to find the decision threshold that minimizes total business cost.
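The oversampling step can be illustrated with a from-scratch sketch. The project's own code is not shown on this page and more likely calls a library implementation (for example imbalanced-learn's SMOTE), so the function name `smote_sketch`, the neighbor count `k=5`, and the toy data below are assumptions for illustration only. The idea is that each synthetic fraud sample is a random point on the line segment between a real fraud sample and one of its nearest fraud neighbors.

```python
import numpy as np


def smote_sketch(X_minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority rows by interpolating toward near neighbors.

    Hypothetical helper, not the project's implementation.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances among minority samples only.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample, one of its k neighbors, and a random step in [0, 1).
    base = rng.integers(0, n, size=n_synthetic)
    chosen = neighbors[base, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))
    return X[base] + gap * (X[chosen] - X[base])


# Toy training split (assumed): 980 legitimate and 20 fraudulent rows, 3 features.
rng = np.random.default_rng(1)
legit_train = rng.normal(0.0, 1.0, size=(980, 3))
fraud_train = rng.normal(2.0, 1.0, size=(20, 3))

n_needed = len(legit_train) - len(fraud_train)
synthetic_fraud = smote_sketch(fraud_train, n_synthetic=n_needed)
balanced_fraud = np.vstack([fraud_train, synthetic_fraud])
print(len(legit_train), len(balanced_fraud))  # both classes now have 980 rows
```

When oversampling is combined with cross-validation, a common precaution is to oversample only the training portion of each fold, so that validation metrics are measured at the original fraud rate.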

Data + SMOTE

10K transactions, 2% fraud

Model training

LR, RF, XGB, LGBM

SHAP analysis

Global + per-transaction

Threshold tuning

Cost-based optimization
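The cost-based optimization step can be sketched as a sweep over candidate thresholds, keeping the one with the lowest total cost. The $500 and $25 costs come from the methodology text above; the function names, the threshold grid, and the toy labels and scores are assumptions for illustration, not the project's code.

```python
import numpy as np

FN_COST = 500.0  # cost of one missed fraud (from the methodology text)
FP_COST = 25.0   # cost of one false alarm (from the methodology text)


def total_cost(y_true, scores, threshold, fn_cost=FN_COST, fp_cost=FP_COST):
    """Total cost when every score >= threshold is flagged as fraud."""
    y_true = np.asarray(y_true).astype(bool)
    flagged = np.asarray(scores) >= threshold
    false_negatives = np.sum(y_true & ~flagged)
    false_positives = np.sum(~y_true & flagged)
    return float(false_negatives * fn_cost + false_positives * fp_cost)


def best_threshold(y_true, scores, thresholds=None):
    """Return (threshold, cost) with the lowest total cost on the grid."""
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)  # assumed grid
    costs = [total_cost(y_true, scores, t) for t in thresholds]
    i = int(np.argmin(costs))  # first minimum if several thresholds tie
    return float(thresholds[i]), costs[i]


# Toy example (assumed data): three legitimate and two fraudulent transactions.
y_true = [0, 0, 0, 1, 1]
scores = [0.10, 0.25, 0.60, 0.45, 0.90]

threshold, cost = best_threshold(y_true, scores)
print(f"threshold={threshold:.2f}, cost=${cost:.0f}")
```

Because a missed fraud costs 20 times as much as a false alarm, the sweep tends to settle on a threshold well below 0.5. For well-calibrated probabilities, flagging breaks even where 500p = 25(1 - p), that is p ≈ 0.048; an empirical sweep like this one avoids relying on calibration.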

How to run

$ git clone https://github.com/guydev42/fraud-detection.git
$ cd fraud-detection
$ pip install -r requirements.txt
$ python data/generate_data.py
$ streamlit run app.py

Data source

The transaction data is synthetic, generated from distributions modeled on real-world fraud patterns. Legitimate transactions follow log-normal amount distributions centered on typical retail spending, while fraudulent transactions exhibit higher amounts, greater geographic distances from the cardholder's home, and concentration during night hours. The 2% fraud rate matches industry baselines for credit card fraud incidence.
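A generator along these lines might look like the sketch below. The contents of `data/generate_data.py` are not shown on this page, so every name and parameter here (`generate_transactions`, the log-normal means and spreads, the distance scales, the 60% night-hour share) is an illustrative assumption; only the 10K size, the 2% fraud rate, and the qualitative patterns come from the description above.

```python
import numpy as np


def generate_transactions(n=10_000, fraud_rate=0.02, seed=42):
    """Hypothetical generator; all distribution parameters are assumptions."""
    rng = np.random.default_rng(seed)
    is_fraud = rng.random(n) < fraud_rate

    # Log-normal amounts, with fraud shifted toward higher values.
    amount = np.where(
        is_fraud,
        rng.lognormal(mean=5.5, sigma=1.0, size=n),
        rng.lognormal(mean=3.5, sigma=0.8, size=n),
    )

    # Distance from the cardholder's home in km, much larger for fraud.
    distance_km = np.where(
        is_fraud,
        rng.exponential(scale=200.0, size=n),
        rng.exponential(scale=15.0, size=n),
    )

    # Hour of day: a share of fraud is forced into night hours (00:00-05:59).
    night_hours = np.arange(0, 6)
    at_night = is_fraud & (rng.random(n) < 0.6)
    hour = np.where(
        at_night,
        rng.choice(night_hours, size=n),
        rng.integers(0, 24, size=n),
    )

    return {
        "amount": amount,
        "distance_km": distance_km,
        "hour": hour,
        "is_fraud": is_fraud,
    }


data = generate_transactions()
fraud = data["is_fraud"]
print(f"fraud rate: {fraud.mean():.3f}")
print(f"median amount, fraud vs legitimate: "
      f"{np.median(data['amount'][fraud]):.0f} vs "
      f"{np.median(data['amount'][~fraud]):.0f}")
```

Drawing the label first and then conditioning each feature on it keeps the fraud rate close to the target while letting each pattern be tuned independently.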

Links

View code on GitHub

Source, notebooks, and transaction data

View notebooks

EDA, feature engineering, modeling, evaluation

Calgary data portfolio

Full project collection