Sepsis Prediction Pipeline

Tags: Machine Learning, Healthcare AI, Data Science, Python
Date: Oct 15, 2024

> Overview

The pipeline detects sepsis from patient-level data in five stages:

  • Data Handling: Missing values are imputed with the MICE algorithm; categorical encoding and robust scaling are then applied.
  • Feature Engineering: Automated column dropping, log transformation, and feature interaction analysis.
  • Model Training: Random Forest, XGBoost, and Logistic Regression models, with hyperparameters optimized using Optuna.
  • Evaluation: AUROC, Precision, Recall, and F1 Score are logged alongside custom visualization reports.
  • Deployment: A model registry supports versioning and metadata storage, enabling reproducibility.
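The data-handling stage can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: the column names and toy values are hypothetical, and scikit-learn's IterativeImputer is used as a MICE-style imputer (it is inspired by MICE but returns a single imputation). The imputer is floored at zero here so the log transform stays defined.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, RobustScaler

# Hypothetical column names, chosen only for illustration.
NUMERIC_COLS = ["heart_rate", "lactate", "wbc"]
CATEGORICAL_COLS = ["unit"]

# Numeric branch: iterative imputation -> log1p -> robust scaling.
numeric_steps = Pipeline([
    ("impute", IterativeImputer(max_iter=10, min_value=0.0, random_state=0)),
    ("log", FunctionTransformer(np.log1p)),
    ("scale", RobustScaler()),
])

# Categorical branch: fill with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, NUMERIC_COLS),
        ("cat", categorical_steps, CATEGORICAL_COLS),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Toy records with missing entries (made-up values).
df = pd.DataFrame({
    "heart_rate": [88.0, 120.0, np.nan, 95.0, 130.0, 76.0],
    "lactate": [1.1, 4.2, 2.0, np.nan, 5.5, 0.9],
    "wbc": [7.0, 15.5, 9.1, 8.3, np.nan, 6.4],
    "unit": ["ICU", "ED", "ICU", np.nan, "ED", "ICU"],
})

X = preprocess.fit_transform(df)
print(X.shape)
```

In a real run the transformer would be fitted on the training split only and then applied to the held-out split, so that imputation and scaling statistics do not leak across the split.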

> Technologies

Python
Scikit-learn
TensorFlow
Pandas
NumPy
Database Management

> Key Features

  • Patient-level data splitting to ensure no data leakage.
  • Comprehensive data preprocessing pipeline with iterative imputation (MICE), log transformation, and robust scaling.
  • Automated feature engineering with redundant column removal, categorical encoding, and scaling.
  • Advanced model evaluation with metrics logging, calibration plots, and feature importance analysis.
  • Automated model registry with versioning, hyperparameter tracking, and artifact storage.
  • Dynamic report generation with comprehensive visualizations (e.g., ROC, PR curves).
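The patient-level split in the first bullet can be sketched with scikit-learn's GroupShuffleSplit, which keeps all rows sharing a group key on the same side of the split. The column names (`patient_id`, `hour`, `sepsis_label`) and the helper `split_by_patient` are assumptions made for this example, not names taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_patient(df, group_col="patient_id", test_size=0.2, seed=0):
    """Split rows so that no patient appears in both train and test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    return df.iloc[train_idx], df.iloc[test_idx]


# Toy data: 10 patients with 5 hourly rows each (made-up values).
records = pd.DataFrame({
    "patient_id": np.repeat(np.arange(10), 5),
    "hour": np.tile(np.arange(5), 10),
    "sepsis_label": 0,
})

train_df, test_df = split_by_patient(records)
shared = set(train_df["patient_id"]) & set(test_df["patient_id"])
print(len(train_df), len(test_df), shared)
```

A plain row-wise shuffle would scatter one patient's hourly records across both sets, letting the model see near-duplicates of its test rows during training.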

> Performance Metrics

Model                 AUROC    F1       Precision  Recall
Random Forest         0.9760   0.5594   0.5280     0.5948
XGBoost               0.9998   0.2591   0.2399     0.8721
Logistic Regression   0.8955   0.7830   0.7164     0.8858
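For reference, the four metrics reported here can be computed with scikit-learn as in the sketch below. The helper name `evaluate`, the 0.5 threshold, and the toy labels and scores are assumptions for illustration; the project's own thresholds and evaluation code are not shown in the source. Note that AUROC is computed from the raw scores and does not depend on a threshold, whereas precision, recall, and F1 depend on the chosen cut-off, which is one reason a model can pair a high AUROC with a modest F1.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score


def evaluate(y_true, y_score, threshold=0.5):
    """Return threshold-free AUROC plus precision/recall/F1 at one threshold."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "auroc": roc_auc_score(y_true, y_score),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Toy labels and predicted probabilities (made-up values).
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.45, 0.2, 0.7]

metrics = evaluate(y_true, y_score)
print(metrics)
```

Calling `evaluate` again with a different `threshold` changes precision, recall, and F1 while leaving AUROC unchanged.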

> Visualizations

ROC Curve - XGBoost

Receiver Operating Characteristic (ROC) curve showing near-perfect separation.

Precision-Recall Curve - XGBoost

Precision-Recall curve for tuned XGBoost model.

ROC Curve - Random Forest

Receiver Operating Characteristic (ROC) curve for tuned Random Forest model.

Precision-Recall Curve - Random Forest

Precision-Recall curve for tuned Random Forest model.

ROC Curve - Logistic Regression

Receiver Operating Characteristic (ROC) curve for tuned Logistic Regression model.

Precision-Recall Curve - Logistic Regression

Precision-Recall curve for tuned Logistic Regression model.

> Key Learnings

  • Handling class imbalance with advanced techniques like SMOTEENN.
  • Optimizing hyperparameters effectively using Optuna.
  • Understanding the trade-offs between interpretability and performance in models.

> Team

Jeremy Cleland

Graduate Student