17

Credit card fraud detection

A fraud detection pipeline using XGBoost with SMOTE oversampling and SHAP explanations, trained on 10K synthetic transactions with a 2% fraud rate.

Fraud detection, XGBoost, SMOTE, SHAP, Class imbalance, Cost optimization
AUC-ROC, best model (XGBoost): 0.97

Interactive dashboard

Four-page Streamlit application for transaction scoring and model analysis

Transaction scoring
  • Real-time fraud probability from feature inputs
  • Risk gauge with approve/block decision
  • Adjustable transaction parameters
Model comparison
  • Side-by-side AUC-ROC, PR-AUC, precision, recall, F1
  • ROC curves for all four classifiers
  • Precision-recall curves
Threshold tuning
  • Adjustable FP and FN cost parameters
  • Total cost curve across thresholds
  • Recall vs false positive rate trade-off
SHAP analysis
  • Global feature importance bar chart
  • Per-transaction waterfall explanations
  • Feature value and contribution table
$ pip install -r requirements.txt && streamlit run app.py

Key results

XGBoost with SMOTE oversampling on 10K synthetic transactions

  • AUC-ROC: 0.97
  • Recall (fraud caught): 94%
  • False positive rate: 3%
  • PR-AUC: 0.82
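
To make these rates concrete, the short calculation below converts them into transaction counts. It is illustrative arithmetic only: it assumes the reported recall and false positive rate apply to the full 10K-record dataset, whereas the actual figures may have been measured on a held-out split of a different size.

```python
# Illustrative arithmetic only. Assumption: the reported rates are applied
# to all 10,000 records; the real evaluation split may be smaller.
n_total = 10_000
fraud_rate = 0.02
recall = 0.94
false_positive_rate = 0.03

n_fraud = round(n_total * fraud_rate)      # fraudulent transactions
n_legit = n_total - n_fraud                # legitimate transactions

true_positives = round(n_fraud * recall)                 # fraud caught
false_negatives = n_fraud - true_positives               # fraud missed
false_positives = round(n_legit * false_positive_rate)   # legitimate but flagged

# Share of flagged transactions that are actually fraud.
precision = true_positives / (true_positives + false_positives)

print(f"fraud caught: {true_positives}, fraud missed: {false_negatives}")
print(f"false alarms: {false_positives}")
print(f"implied precision: {precision:.2f}")
```

Under that assumption, false alarms outnumber caught fraud even at a 3% false positive rate, because legitimate transactions make up 98% of the data. This is one reason PR-AUC is worth reading alongside AUC-ROC on imbalanced data.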

Methodology

The pipeline starts from synthetic transaction data (10K records, 2% fraud) with features capturing amount, timing, geographic distance, velocity, and merchant category. SMOTE oversampling balances the training set. Four classifiers (Logistic Regression, Random Forest, XGBoost, LightGBM) are trained with 5-fold stratified cross-validation. SHAP TreeExplainer provides per-transaction explanations. A cost-based threshold optimizer weighs missed fraud ($500/FN) against false alarms ($25/FP) to find the decision threshold that minimizes total business cost.
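
The idea behind SMOTE can be sketched in a few lines of NumPy: each synthetic fraud row is placed at a random point on the segment between a real fraud row and one of its nearest fraud neighbours. The function name `smote_sketch`, its parameters, and the toy data are hypothetical, and this is not the project's code; libraries such as imbalanced-learn provide a full implementation.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    sampled row and one of its k nearest minority-class neighbours.
    Hypothetical helper for illustration; needs at least two rows."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)            # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                      # starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Toy data (assumed, not the project's features): 980 legitimate, 20 fraud rows.
rng = np.random.default_rng(42)
X_legit = rng.normal(0.0, 1.0, size=(980, 3))
X_fraud = rng.normal(3.0, 1.0, size=(20, 3))

X_synth = smote_sketch(X_fraud, n_new=len(X_legit) - len(X_fraud))
X_fraud_balanced = np.vstack([X_fraud, X_synth])
print(X_fraud_balanced.shape)  # (980, 3)
```

When oversampling is combined with cross-validation, it is usually applied to the training portion of each fold only, so that validation folds keep the original 2% fraud rate.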

  • Data + SMOTE: 10K transactions, 2% fraud
  • Model training: LR, RF, XGB, LGBM
  • SHAP analysis: Global + per-transaction
  • Threshold tuning: Cost-based optimization
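
The threshold tuning step can be sketched as a grid search over candidate thresholds, scoring each one by its total cost. The $500 and $25 costs come from the methodology above; the function names, the threshold grid, and the simulated scores are assumptions made for illustration, not the project's implementation.

```python
import numpy as np

COST_FN = 500.0  # cost of one missed fraud (from the methodology)
COST_FP = 25.0   # cost of one false alarm (from the methodology)

def total_cost(y_true, scores, threshold):
    """Total business cost when every score >= threshold is blocked."""
    y_true = np.asarray(y_true).astype(bool)
    flagged = np.asarray(scores) >= threshold
    false_negatives = np.sum(y_true & ~flagged)
    false_positives = np.sum(~y_true & flagged)
    return COST_FN * false_negatives + COST_FP * false_positives

def best_threshold(y_true, scores, grid=None):
    """Return (threshold, cost) for the cheapest threshold on the grid."""
    grid = np.linspace(0.01, 0.99, 99) if grid is None else np.asarray(grid)
    costs = np.array([total_cost(y_true, scores, t) for t in grid])
    best = int(np.argmin(costs))
    return float(grid[best]), float(costs[best])

# Simulated labels and model scores (hypothetical, 2% fraud).
rng = np.random.default_rng(0)
y_true = rng.random(5000) < 0.02
scores = np.where(y_true, rng.beta(5, 2, 5000), rng.beta(1, 8, 5000))

threshold, cost = best_threshold(y_true, scores)
print(f"threshold={threshold:.2f}, total cost=${cost:,.0f}")
```

Because a missed fraud costs 20 times as much as a false alarm, the cheapest threshold tends to sit well below 0.5. For well-calibrated probabilities the break-even point would be 25 / (25 + 500) ≈ 0.048; an empirical search like this one does not rely on calibration.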

How to run

$ git clone https://github.com/guydev42/fraud-detection.git
$ cd calgary-data-portfolio/project_17_fraud_detection
$ pip install -r requirements.txt
$ python data/generate_data.py
$ streamlit run app.py

Data source

The transaction data is synthetic, built from distributions modeled on real-world fraud patterns. Legitimate transactions follow log-normal amount distributions centered on typical retail spending, while fraudulent transactions exhibit higher amounts, greater geographic distances from the cardholder's home, and concentration during night hours. The 2% fraud rate reflects the heavy class imbalance typical of credit card fraud.
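
A generator with these properties might look like the sketch below. It is not the project's `data/generate_data.py`: the function name, the three features shown, and every distribution parameter (log-normal means, distance scales, the share of night-time fraud) are assumptions chosen only to reproduce the qualitative patterns described.

```python
import numpy as np

def generate_transactions(n=10_000, fraud_rate=0.02, seed=0):
    """Hypothetical generator; all distribution parameters are assumed."""
    rng = np.random.default_rng(seed)

    # Mark exactly n * fraud_rate rows as fraud.
    is_fraud = np.zeros(n, dtype=bool)
    fraud_idx = rng.choice(n, size=round(n * fraud_rate), replace=False)
    is_fraud[fraud_idx] = True
    n_fraud = int(is_fraud.sum())
    n_legit = n - n_fraud

    # Amounts: log-normal for both classes, shifted higher for fraud.
    amount = np.empty(n)
    amount[~is_fraud] = rng.lognormal(mean=3.5, sigma=0.8, size=n_legit)
    amount[is_fraud] = rng.lognormal(mean=5.0, sigma=1.0, size=n_fraud)

    # Distance from the cardholder's home: much larger scale for fraud.
    distance_km = np.empty(n)
    distance_km[~is_fraud] = rng.exponential(scale=10.0, size=n_legit)
    distance_km[is_fraud] = rng.exponential(scale=150.0, size=n_fraud)

    # Hour of day: legitimate in 07:00-22:59; about 70% of fraud is forced
    # into 00:00-05:59 and the rest is uniform over the whole day.
    hour = np.empty(n, dtype=int)
    hour[~is_fraud] = rng.integers(7, 23, size=n_legit)
    night = rng.random(n_fraud) < 0.7
    hour[is_fraud] = np.where(
        night,
        rng.integers(0, 6, size=n_fraud),
        rng.integers(0, 24, size=n_fraud),
    )

    return {"amount": amount, "distance_km": distance_km,
            "hour": hour, "is_fraud": is_fraud}

data = generate_transactions()
print(f"fraud rate: {data['is_fraud'].mean():.1%}")  # fraud rate: 2.0%
```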

Links