A fraud detection pipeline using XGBoost with SMOTE oversampling and SHAP explanations, trained on 10K synthetic transactions with a 2% fraud rate.
Four-page Streamlit application for transaction scoring and model analysis
XGBoost with SMOTE oversampling on 10K synthetic transactions
The pipeline trains on synthetic transaction data (10K records, 2% fraud) with features capturing amount, timing, geographic distance, transaction velocity, and merchant category. SMOTE oversampling balances the training set. Four classifiers (Logistic Regression, Random Forest, XGBoost, LightGBM) are trained with 5-fold stratified cross-validation. SHAP TreeExplainer provides per-transaction explanations. A cost-based threshold optimizer weighs missed fraud ($500 per false negative) against false alarms ($25 per false positive) and selects the decision threshold that minimizes total business cost.
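The cost-based threshold search can be illustrated with a minimal sketch. It assumes a simple grid search over candidate thresholds. The function names (`total_cost`, `best_threshold`), the grid resolution, and the toy labels and scores are hypothetical and not taken from the project. Only the $500 and $25 costs come from the description above.

```python
def total_cost(y_true, scores, threshold, fn_cost=500.0, fp_cost=25.0):
    """Business cost when every transaction with score >= threshold is flagged."""
    # Missed fraud: true label 1, score below the threshold.
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    # False alarm: true label 0, score at or above the threshold.
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * fn_cost + fp * fp_cost


def best_threshold(y_true, scores, fn_cost=500.0, fp_cost=25.0, steps=99):
    """Grid-search evenly spaced thresholds in (0, 1); return (threshold, cost)."""
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(
        ((t, total_cost(y_true, scores, t, fn_cost, fp_cost)) for t in candidates),
        key=lambda pair: pair[1],  # ties resolve to the lowest threshold
    )


# Toy data (hypothetical): four legitimate and two fraudulent transactions.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.30, 0.60, 0.40, 0.90]

threshold, cost = best_threshold(y_true, scores)
default_cost = total_cost(y_true, scores, 0.5)
print(threshold, cost, default_cost)
```

On this toy data the default 0.5 cutoff misses the fraud scored 0.40 and costs $525, while a threshold just above 0.30 catches both frauds at the price of one $25 false alarm. Because a false negative costs 20 times as much as a false positive, the cheapest threshold tends to sit below 0.5.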
The synthetic transaction data is drawn from distributions modeled on real-world fraud patterns. Legitimate transactions follow log-normal amount distributions centered on typical retail spending, while fraudulent transactions exhibit higher amounts, greater geographic distances from the cardholder's home, and concentration during night hours. The 2% fraud rate is set to reflect the severe class imbalance typical of credit card fraud.
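A generator along these lines might look like the sketch below. Every distribution parameter in it (the median amounts, the distance scales, the 60% night bias, the definition of night as 00:00 to 05:59) is an illustrative assumption, as are the names `make_transaction` and `make_dataset`. Only the log-normal amounts, the 10K record count, and the 2% fraud rate come from the description.

```python
import math
import random
import statistics

NIGHT_HOURS = range(0, 6)  # assumption: "night" means 00:00-05:59


def make_transaction(rng, fraud_rate=0.02):
    """Draw one synthetic transaction; all parameters are illustrative."""
    is_fraud = rng.random() < fraud_rate
    if is_fraud:
        # Higher amounts: log-normal with a median of about $300 (assumed).
        amount = rng.lognormvariate(math.log(300.0), 0.8)
        # Farther from home: exponential with a mean of 400 km (assumed).
        distance_km = rng.expovariate(1 / 400.0)
        # Night-heavy timing: 60% forced into night hours (assumed).
        if rng.random() < 0.6:
            hour = rng.choice(NIGHT_HOURS)
        else:
            hour = rng.randrange(24)
    else:
        # Typical retail spend: log-normal with a median of about $40 (assumed).
        amount = rng.lognormvariate(math.log(40.0), 0.8)
        distance_km = rng.expovariate(1 / 15.0)
        hour = rng.randrange(24)
    return {
        "amount": amount,
        "distance_km": distance_km,
        "hour": hour,
        "is_fraud": is_fraud,
    }


def make_dataset(n=10_000, seed=42):
    """Generate n transactions from a seeded, reproducible random stream."""
    rng = random.Random(seed)
    return [make_transaction(rng) for _ in range(n)]


data = make_dataset()
fraud = [t for t in data if t["is_fraud"]]
legit = [t for t in data if not t["is_fraud"]]

fraud_rate = len(fraud) / len(data)
median_fraud_amount = statistics.median(t["amount"] for t in fraud)
median_legit_amount = statistics.median(t["amount"] for t in legit)
print(f"fraud rate: {fraud_rate:.3f}")
print(f"median amount: fraud {median_fraud_amount:.2f}, legit {median_legit_amount:.2f}")
```

Drawing the label first and then sampling the features from class-specific distributions keeps each fraud signal easy to tune on its own. Velocity and merchant category are omitted here for brevity.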