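from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# The three model families benchmarked across all three tasks.
# Hyperparameters are elided here.
models = {
    "logistic": LogisticRegression(...),
    "random_forest": RandomForestClassifier(...),
    "xgboost": XGBClassifier(...),
}

Predictive Modeling of U.S. Oral Health Outcomes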
Logistic regression, random forests, and XGBoost on NHANES 2017–2018.
Summary. With a team of three, I led the shallow-learning analysis of NHANES 2017–2018 (n = 5,265 adults), benchmarking logistic regression, random forests, and XGBoost on two binary classification tasks and one regression task. The best models reached a 5-fold cross-validated ROC-AUC of 0.849 for self-rated oral health and 0.844 for clinician-recommended care. A two-stage regression cut the mean absolute error on DMFT (decayed, missing, and filled teeth) from 6.98 to 4.67 teeth, a 33% reduction, using only socioeconomic predictors.
DSAN 5300, Statistical Learning, Spring 2026. I owned the data preprocessing pipeline and co-authored the manuscript.
The setup
[TODO 1 paragraph framing. Why NHANES, why these three tasks, what makes oral-health prediction interesting from a public-health standpoint. The economic angle (predicting need for care from socioeconomic features alone) is the strongest hook.]
Data and preprocessing
[TODO describe the merged NHANES tables (oral exam, demographics, SES), the imputation strategy, and the train/test splitting decisions. If you can render a sample DataFrame here it’s a great signal of the data wrangling work.]
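As a rough illustration of the kind of table merge this section refers to, here is a minimal sketch on synthetic data. NHANES component files share the respondent sequence number `SEQN` as a join key; everything else below (the column names, the values, the median imputation with a missingness flag) is an assumption made for illustration, not the project's actual pipeline.

```python
import pandas as pd

# Synthetic stand-ins for three NHANES component tables (all values are made up).
# SEQN is the respondent sequence number NHANES uses to link files.
demographics = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "age_years": [34, 51, 67, 45],            # hypothetical column name
})
ses = pd.DataFrame({
    "SEQN": [1, 2, 3],
    "income_poverty_ratio": [1.2, None, 3.8],  # hypothetical column name
})
oral_exam = pd.DataFrame({
    "SEQN": [1, 2, 4],
    "dmft": [4, 11, 7],                        # hypothetical outcome column
})

# Start from the exam table so only respondents with an observed outcome remain,
# then attach predictors. validate= raises if a key is unexpectedly duplicated.
merged = (
    oral_exam
    .merge(demographics, on="SEQN", how="inner", validate="one_to_one")
    .merge(ses, on="SEQN", how="left", validate="one_to_one")
)

# Record missingness before imputing so the pattern stays visible downstream.
merged["income_missing"] = merged["income_poverty_ratio"].isna()
median_income = merged["income_poverty_ratio"].median()
merged["income_poverty_ratio"] = merged["income_poverty_ratio"].fillna(median_income)

print(merged)
```

In a real pipeline the imputation statistic would normally be computed on the training fold only, so that no information from held-out rows leaks into the fit.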
Models
[TODO a few sentences on hyperparameter tuning approach (grid versus random versus Bayesian) and any cross-validation specifics.]
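For readers unfamiliar with the metric quoted in the summary, the sketch below shows one common way to compute a 5-fold cross-validated ROC-AUC with scikit-learn. It runs on synthetic data, and the hyperparameters, fold seed, and stratified splitting are placeholder assumptions rather than the settings used in the project. XGBoost is left out only to keep the example dependency-free; an `XGBClassifier` would slot into the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the NHANES features.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Placeholder hyperparameters, not the tuned values from the project.
candidates = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_auc = {}
for name, model in candidates.items():
    # One ROC-AUC per held-out fold; the reported figure is their mean.
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    mean_auc[name] = fold_scores.mean()
    print(f"{name}: mean ROC-AUC = {mean_auc[name]:.3f} over {len(fold_scores)} folds")
```

Wrapping the scaler and the logistic regression in one pipeline means the scaler is refit inside each training fold, which keeps the held-out fold out of the preprocessing step.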
Results
[TODO a results table, ideally rendered from saved CSV so it stays accurate. Highlight the headline numbers, ROC-AUC of 0.849 and 0.844, and the 33% MAE reduction.]
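The 33% figure follows directly from the two MAE values reported in the summary. A quick check of the arithmetic, using only those two numbers (the variable names are illustrative):

```python
baseline_mae = 6.98   # MAE in teeth before the two-stage regression (from the summary)
two_stage_mae = 4.67  # MAE in teeth with the two-stage regression (from the summary)

absolute_drop = baseline_mae - two_stage_mae
relative_drop = absolute_drop / baseline_mae

print(f"absolute reduction: {absolute_drop:.2f} teeth")  # 2.31 teeth
print(f"relative reduction: {relative_drop:.1%}")        # 33.1%
```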
What surprised me
[TODO 1 to 2 specific surprises. Examples to consider include which predictors mattered most, where XGBoost beat or didn’t beat logistic regression, and what the residuals told you about who the model misses.]
Caveats
A model that predicts oral-health outcomes from socioeconomic predictors is also, implicitly, a model of structural inequity. The accuracy is real, and so is the responsibility to think hard about how a result like this gets used.