ICU Time-Series Prediction Benchmark

Comprehensive comparison of LLMs and baseline models on ICU prediction tasks

Overall Performance Across All Tasks

Filter models by type: LLMs (large language models) or traditional baseline models (CNN, LSTM, XGBoost, etc.).
Select the primary metric for sorting and evaluation. AUROC and AUPRC are the most commonly used metrics for medical prediction tasks. Note: MCC values are normalized to the 0-1 range.
For LLMs, show either only the best-performing prompting approach or all tested approaches for each model.
Select specific models to display, either individually or all at once.
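The note above says MCC is reported on a 0-1 scale but does not say how it is rescaled. A minimal sketch, assuming the usual linear map from the native [-1, 1] range (the function name `normalize_mcc` is hypothetical, and the benchmark may use a different transformation):

```python
def normalize_mcc(mcc: float) -> float:
    """Map an MCC value from [-1, 1] onto [0, 1].

    Assumption: a plain linear rescale, (mcc + 1) / 2. The dashboard only
    states that MCC is normalized to 0-1, not which formula it uses.
    """
    if not -1.0 <= mcc <= 1.0:
        raise ValueError(f"MCC must lie in [-1, 1], got {mcc}")
    return (mcc + 1.0) / 2.0


# Under this rescale a chance-level classifier (MCC = 0) sits at the midpoint.
for raw in (-1.0, 0.0, 1.0):
    print(raw, "->", normalize_mcc(raw))
```

Under this assumption, a normalized MCC below 0.5 would indicate worse-than-chance agreement, which is worth keeping in mind when comparing it with AUROC.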

Overall Model Performance Comparison

Overall Detailed Results Table

Model | PULSE Score | AUROC | AUPRC | MCC | Norm AUPRC | Specificity | F1 Score | Accuracy | Bal Accuracy | Precision | Recall | Kappa | MinPSE
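Several of the table columns (Specificity, F1 Score, Accuracy, Bal Accuracy, Precision, Recall, Kappa, MCC) are threshold-based metrics that follow from a binary confusion matrix. The sketch below uses their standard textbook definitions; the function name `confusion_metrics` and the zero-division conventions are assumptions, and the benchmark's own pipeline is not shown in the source. PULSE Score, Norm AUPRC, and MinPSE are left out because their exact definitions are not given here.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute threshold-based metrics from binary confusion-matrix counts.

    Assumption: undefined ratios (empty denominators) are reported as 0.0.
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    balanced_accuracy = (recall + specificity) / 2

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = (((tp + fp) / total) * ((tp + fn) / total)
                + ((tn + fn) / total) * ((tn + fp) / total))
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    # Matthews correlation coefficient, in its native [-1, 1] range.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
        "kappa": kappa,
        "mcc": mcc,
    }


# Illustrative counts only, not benchmark results.
print(confusion_metrics(tp=40, fp=10, tn=45, fn=5))
```

AUROC and AUPRC are different in kind: they are computed from ranked prediction scores across all thresholds, so they cannot be recovered from a single confusion matrix.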

Acute Kidney Injury (AKI) Prediction

Filter results by ICU dataset: HiRID (Swiss), eICU (US multi-center), or MIMIC-IV (US single-center).
Filter models by type: LLMs (large language models) or traditional baseline models (CNN, LSTM, XGBoost, etc.).
For LLMs, show either only the best-performing prompting approach or all tested approaches for each model.
Select the primary metric for sorting and evaluation. AUROC and AUPRC are the most commonly used metrics for medical prediction tasks. Note: MCC values are normalized to the 0-1 range.
Select specific models to display, either individually or all at once.
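The dataset, metric, and "best-performing prompting approach" controls described above amount to a filter, a per-model maximum, and a sort. A minimal sketch of that logic, assuming each result row is a dictionary with `model`, `dataset`, `approach`, and metric keys (these field names, the function name, and the sample values are all hypothetical placeholders, not benchmark results):

```python
from typing import Dict, List, Optional


def best_approach_per_model(rows: List[dict],
                            metric: str = "auroc",
                            dataset: Optional[str] = None) -> List[dict]:
    """Keep the highest-scoring row per model, then sort by the metric.

    Assumption: higher is better for the chosen metric.
    """
    best: Dict[str, dict] = {}
    for row in rows:
        if dataset is not None and row["dataset"] != dataset:
            continue  # dataset filter
        current = best.get(row["model"])
        if current is None or row[metric] > current[metric]:
            best[row["model"]] = row
    return sorted(best.values(), key=lambda r: r[metric], reverse=True)


# Placeholder rows for illustration only.
rows = [
    {"model": "llm-a", "dataset": "HiRID", "approach": "approach-x", "auroc": 0.61},
    {"model": "llm-a", "dataset": "HiRID", "approach": "approach-y", "auroc": 0.66},
    {"model": "llm-b", "dataset": "HiRID", "approach": "approach-x", "auroc": 0.70},
    {"model": "llm-b", "dataset": "eICU", "approach": "approach-y", "auroc": 0.75},
]

for row in best_approach_per_model(rows, metric="auroc", dataset="HiRID"):
    print(row["model"], row["approach"], row["auroc"])
```

Selecting the best approach on the same data used for reporting favors models with more tested approaches, so the "all tested approaches" view is the fairer one for comparing LLMs against single-configuration baselines.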

AKI Model Performance Comparison

AKI Detailed Results Table

Model | Dataset | AUROC | AUPRC | Norm AUPRC | MCC | Specificity | F1 Score | Accuracy | Bal Accuracy | Precision | Recall | Kappa | MinPSE | PULSE Score

Mortality Prediction

Filter results by ICU dataset: HiRID (Swiss), eICU (US multi-center), or MIMIC-IV (US single-center).
Filter models by type: LLMs (large language models) or traditional baseline models (CNN, LSTM, XGBoost, etc.).
For LLMs, show either only the best-performing prompting approach or all tested approaches for each model.
Select the primary metric for sorting and evaluation. AUROC and AUPRC are the most commonly used metrics for medical prediction tasks. Note: MCC values are normalized to the 0-1 range.
Select specific models to display, either individually or all at once.

Mortality Model Performance Comparison

Mortality Detailed Results Table

Model | Dataset | AUROC | AUPRC | Norm AUPRC | MCC | Specificity | F1 Score | Accuracy | Bal Accuracy | Precision | Recall | Kappa | MinPSE | PULSE Score

Sepsis Prediction

Filter results by ICU dataset: HiRID (Swiss), eICU (US multi-center), or MIMIC-IV (US single-center).
Filter models by type: LLMs (large language models) or traditional baseline models (CNN, LSTM, XGBoost, etc.).
For LLMs, show either only the best-performing prompting approach or all tested approaches for each model.
Select the primary metric for sorting and evaluation. AUROC and AUPRC are the most commonly used metrics for medical prediction tasks. Note: MCC values are normalized to the 0-1 range.
Select specific models to display, either individually or all at once.

Sepsis Model Performance Comparison

Sepsis Detailed Results Table

Model | Dataset | AUROC | AUPRC | Norm AUPRC | MCC | Specificity | F1 Score | Accuracy | Bal Accuracy | Precision | Recall | Kappa | MinPSE | PULSE Score