Curriculum Learning for Dental Disease Detection

A three-stage YOLOv8 pipeline on the DENTEX 2023 dataset, and an honest negative result.

computer vision
deep learning
medical imaging
YOLOv8
PyTorch
Published

April 15, 2026

Summary. I tested a three-stage curriculum learning framework (quadrant localization, then tooth enumeration, then disease diagnosis) on the DENTEX 2023 panoramic X-ray dataset (2,032 hierarchically labeled images) using YOLOv8m segmentation models. Against a matched single-stage baseline, the curriculum approach achieved mAP@0.5 of 0.394 versus 0.417, a small but real regression of 0.023 absolute (about 5.5% relative). The empirical takeaway is that on a dataset of this size, additional weakly related supervision didn’t help fine-grained detection. Class imbalance, not the training schedule, was the dominant limitation.

Note

This was my final project for DSAN 6600 (Neural Networks & Advanced Deep Learning) at Georgetown, Spring 2026.

The question

Curriculum learning, meaning training a model on easier sub-tasks before harder ones, has strong intuitive appeal, especially for hierarchical labels. Dental panoramic X-rays are a near-perfect test bed because the labels are naturally nested: every tooth lives in a quadrant, has a number, and may or may not have one of several conditions. Does staging the supervision in that order actually help fine-grained disease detection on a small medical dataset?

Approach

[TODO 1 to 2 paragraphs on data prep, augmentation, model config. Pull from the report. Keep it concrete around image sizes, batch size, loss, and schedule.]

# Sketch of the curriculum schedule. Full code in the repo.
stages = [
    {"task": "quadrant_localization", "epochs": 30, "data": "quadrant_labels"},
    {"task": "tooth_enumeration",     "epochs": 40, "data": "tooth_labels"},
    {"task": "disease_diagnosis",     "epochs": 60, "data": "disease_labels"},
]
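
One way the schedule above could be executed is to train the stages in order and hand each stage's weights to the next. The sketch below is a minimal illustration, not the code from the repo: `run_curriculum`, `train_stage`, and the starting checkpoint name `yolov8m-seg.pt` are assumptions, and the stub trainer only records what would be trained. In a real run, `train_stage` would presumably wrap an Ultralytics call such as `YOLO(weights).train(data=..., epochs=...)` and return the path to the resulting checkpoint.

```python
from typing import Callable

# Same schedule as the sketch above.
stages = [
    {"task": "quadrant_localization", "epochs": 30, "data": "quadrant_labels"},
    {"task": "tooth_enumeration",     "epochs": 40, "data": "tooth_labels"},
    {"task": "disease_diagnosis",     "epochs": 60, "data": "disease_labels"},
]


def run_curriculum(
    stages: list[dict],
    train_stage: Callable[[str, dict], str],
    initial_weights: str,
) -> list[tuple[str, str, str]]:
    """Train stages in order; each stage starts from the previous stage's weights.

    Returns one (task, weights_in, weights_out) tuple per stage.
    """
    history = []
    weights = initial_weights
    for stage in stages:
        new_weights = train_stage(weights, stage)
        history.append((stage["task"], weights, new_weights))
        weights = new_weights
    return history


def stub_train_stage(weights: str, stage: dict) -> str:
    """Hypothetical stand-in for real training: returns a made-up checkpoint name."""
    return f"{stage['task']}_e{stage['epochs']}.pt"


# "yolov8m-seg.pt" is an assumed starting checkpoint, not confirmed by the repo.
history = run_curriculum(stages, stub_train_stage, "yolov8m-seg.pt")
for task, w_in, w_out in history:
    print(f"{task}: {w_in} -> {w_out}")
```

Passing the trainer in as a function keeps the hand-off logic testable without a GPU. The matched single-stage baseline would then correspond to calling `run_curriculum` with only the last stage.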

Results

[TODO drop in the table comparing curriculum versus single-stage baseline across mAP@0.5, precision, recall, and per-class F1. If the predictions are saved as CSV, render the table here from a pd.read_csv() cell so it stays in sync with the source data.]
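
Once the metrics are saved, the comparison could be rendered along these lines. This sketch assumes a hypothetical CSV layout with `model` and `map50` columns and reads from an in-memory string instead of a file. It contains only the two mAP@0.5 values quoted in the summary, so precision, recall, and per-class F1 are left out.

```python
import io

import pandas as pd

# In-memory stand-in for a saved metrics file (hypothetical layout).
# Only the two mAP@0.5 values reported in the summary are included.
csv_text = """model,map50
single_stage_baseline,0.417
three_stage_curriculum,0.394
"""

df = pd.read_csv(io.StringIO(csv_text))

# Express each row relative to the single-stage baseline.
baseline = df.loc[df["model"] == "single_stage_baseline", "map50"].iloc[0]
df["delta_vs_baseline"] = (df["map50"] - baseline).round(3)
df["relative_change_pct"] = (100 * df["delta_vs_baseline"] / baseline).round(1)

print(df.to_string(index=False))
```

Swapping `io.StringIO(csv_text)` for a file path would keep the table in sync with whatever the evaluation script writes out.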

What I learned

The interesting part of this project wasn’t the architecture. It was sitting with a result that didn’t go the way I expected and figuring out why. Two things stood out.

  1. The class distribution was doing more work than the schedule. A small handful of disease classes dominated. A curriculum that doesn’t address that imbalance just front-loads the easy stages without solving the actual problem.
  2. “More supervision” is not a free lunch on small datasets. Each curriculum stage adds variance from its own labels. If those labels are only weakly related to the downstream task, you can pay the variance cost without earning the bias reduction.
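
To make the first point concrete, the sketch below shows one way to quantify imbalance and derive inverse-frequency weights. The class names and counts are invented for illustration and are not the DENTEX distribution. Real labels would come from the dataset annotations.

```python
from collections import Counter

# Hypothetical per-instance disease labels (made-up counts, not DENTEX's).
labels = ["class_a"] * 60 + ["class_b"] * 25 + ["class_c"] * 10 + ["class_d"] * 5

counts = Counter(labels)
n_total = len(labels)
n_classes = len(counts)

# Imbalance ratio: most frequent class over least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Inverse-frequency class weights, scaled so that a perfectly balanced
# dataset would give every class a weight of 1.0.
class_weights = {c: n_total / (n_classes * n) for c, n in counts.items()}

# Per-sample weights: rare-class instances get proportionally more weight.
sample_weights = [class_weights[y] for y in labels]

print(f"imbalance ratio: {imbalance_ratio:.1f}")
for c, w in sorted(class_weights.items()):
    print(f"{c}: count={counts[c]}, weight={w:.3f}")
```

Per-sample weights like these are the kind of input that PyTorch's `torch.utils.data.WeightedRandomSampler` accepts, which is one way to rebalance batches without changing the loss.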

What I’d do differently

[TODO for example focal loss or class-rebalanced sampling, pretraining on a related larger dataset, ablating which curriculum stages help versus hurt.]
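
For reference, focal loss scales the usual cross-entropy term by a factor that shrinks for confidently correct predictions, so hard and rare examples contribute more to the gradient: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The sketch below is a plain-Python, single-example binary version for intuition only. It is not what the project used, and the defaults gamma = 2 and alpha = 0.25 are the commonly cited values from the original focal loss paper, not tuned settings for this dataset.

```python
import math
from typing import Optional


def binary_focal_loss(
    p: float, y: int, gamma: float = 2.0, alpha: Optional[float] = 0.25
) -> float:
    """Focal loss for one binary prediction.

    p: predicted probability of the positive class.
    y: ground-truth label, 1 or 0.
    """
    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, 1e-12), 1.0)  # clamp to avoid log(0)
    loss = -((1.0 - p_t) ** gamma) * math.log(p_t)
    if alpha is not None:
        # alpha weights positives; (1 - alpha) weights negatives.
        alpha_t = alpha if y == 1 else 1.0 - alpha
        loss *= alpha_t
    return loss


easy = binary_focal_loss(0.9, 1)  # confident and correct
hard = binary_focal_loss(0.1, 1)  # confident and wrong
print(f"easy example loss: {easy:.6f}")
print(f"hard example loss: {hard:.6f}")
```

With `gamma=0` and `alpha=None` the function reduces to ordinary binary cross-entropy, which makes it easy to check against a known baseline before trying other settings.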

Code

Repository on GitHub