Predictive Modeling of U.S. Oral Health Outcomes

Logistic regression, random forests, and XGBoost on NHANES 2017–2018.

statistical learning

classification

regression

public health

XGBoost

Published

April 1, 2026

Summary. On a team of three, I led the shallow-learning analysis of NHANES 2017–2018 (n = 5,265 adults), benchmarking logistic regression, random forests, and XGBoost on two binary classification tasks and one regression task. The best models reached a 5-fold cross-validated ROC-AUC of 0.849 for self-rated oral health and 0.844 for clinician-recommended care. A two-stage regression cut the mean absolute error on DMFT (the count of decayed, missing, and filled teeth) from 6.98 to 4.67 teeth, a 33% reduction, using only socioeconomic predictors.

Note

DSAN 5300, Statistical Learning, Spring 2026. I owned the data preprocessing pipeline and co-authored the manuscript.

The setup

[TODO 1 paragraph framing. Why NHANES, why these three tasks, what makes oral-health prediction interesting from a public-health standpoint. The economic angle (predicting need for care from socioeconomic features alone) is the strongest hook.]

Data and preprocessing

[TODO describe the merged NHANES tables (oral exam, demographics, SES), the imputation strategy, and the train/test splitting decisions. If you can render a sample DataFrame here it’s a great signal of the data wrangling work.]
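
The general shape of that wrangling (join the component tables on the respondent ID, split, then impute) can be sketched as below. This is an illustration on made-up toy data, not the project's pipeline: the column names `age`, `income_ratio`, and `needs_care` are hypothetical, and median imputation and an inner join are assumptions chosen for the example. The only detail taken from NHANES itself is `SEQN`, the respondent sequence number that keys its tables.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-ins for three NHANES component tables; every value is made up.
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4, 5, 6],
                     "age": [34, 51, 27, 68, 45, 59]})
ses = pd.DataFrame({"SEQN": [1, 2, 3, 4, 5, 6],
                    "income_ratio": [1.2, np.nan, 3.4, 0.8, np.nan, 2.1]})
oral = pd.DataFrame({"SEQN": [1, 2, 3, 4, 5, 6],
                     "needs_care": [0, 1, 0, 1, 1, 0]})

# Inner-join on the respondent identifier so each row is one participant.
merged = (demo.merge(ses, on="SEQN", how="inner")
              .merge(oral, on="SEQN", how="inner"))

X = merged[["age", "income_ratio"]]
y = merged["needs_care"]

# Split before imputing so test rows never influence the imputation statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, stratify=y, random_state=0
)

# Assumed strategy: fill missing values with the training-set median.
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

print(merged.head())
```

Fitting the imputer on the training rows only, then applying it to the test rows, keeps held-out information out of the preprocessing step.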

Models

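from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# The three model families benchmarked across all three tasks (classifier variants shown).
# `...` marks hyperparameters elided here; fill them in before fitting.
models = {
    "logistic":      LogisticRegression(...),
    "random_forest": RandomForestClassifier(...),
    "xgboost":       XGBClassifier(...),
}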

[TODO a few sentences on hyperparameter tuning approach (grid versus random versus Bayesian) and any cross-validation specifics.]
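
For readers unfamiliar with the metric in the summary, here is a minimal sketch of how a 5-fold cross-validated ROC-AUC can be computed with scikit-learn. It runs on synthetic data from `make_classification`, and the hyperparameters, the stratified shuffled folds, and the scaling step are assumptions for illustration, not the tuned settings or exact protocol from the project. XGBoost is left out so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the NHANES features.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Hypothetical settings chosen for illustration only.
candidates = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratified folds keep the class balance roughly constant across the 5 splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_auc = {}
for name, model in candidates.items():
    # One ROC-AUC per fold; the reported figure is their mean.
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    mean_auc[name] = scores.mean()
    print(f"{name}: mean ROC-AUC = {mean_auc[name]:.3f}")
```

`XGBClassifier` follows the scikit-learn estimator interface, so it could be added to `candidates` and scored through the same loop.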

Results

[TODO a results table, ideally rendered from saved CSV so it stays accurate. Highlight the headline numbers, ROC-AUC of 0.849 and 0.844, and the 33% MAE reduction.]
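
The 33% figure is the relative drop in mean absolute error, which can be checked directly from the two MAE values quoted in the summary. The variable names below are only for illustration.

```python
mae_before = 6.98  # teeth, starting MAE quoted in the summary
mae_after = 4.67   # teeth, two-stage regression MAE quoted in the summary

absolute_drop = mae_before - mae_after
relative_drop = absolute_drop / mae_before

print(f"absolute reduction: {absolute_drop:.2f} teeth")
print(f"relative reduction: {relative_drop:.1%}")
```

That works out to roughly 2.31 teeth, or about 33% of the starting error.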

What surprised me

[TODO 1 to 2 specific surprises. Examples to consider include which predictors mattered most, where XGBoost beat or didn’t beat logistic regression, and what the residuals told you about who the model misses.]

Caveats

A model that predicts oral-health outcomes from socioeconomic predictors is also, implicitly, a model of structural inequity. The accuracy is real, and so is the responsibility to think hard about how a result like this gets used.

Code

Repository on GitHub