This study compares the performance of two distinct deep learning architectures, DenseNet-121 (a Convolutional Neural Network, or CNN) and ViT-Base/16 (a Vision Transformer), on a multi-label classification task using a 5% random sample of the NIH Chest X-ray dataset. The experiment focuses on the challenge of extreme class imbalance in medical data: "No Finding" (healthy) images dominate the dataset (53.84%), while rare pathologies such as Hernia or Pneumonia appear in only a small fraction of images.
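For context, the sampling and label preparation can be reproduced with a few lines of pandas. This is only a sketch: it assumes the public NIH metadata file ("Data_Entry_2017.csv") with its pipe-separated "Finding Labels" column, and the random seed is illustrative rather than the one used in the study.

```python
import pandas as pd

# Assumed NIH metadata file; each row lists an image and a pipe-separated
# "Finding Labels" string such as "Cardiomegaly|Effusion" or "No Finding".
df = pd.read_csv("Data_Entry_2017.csv")

# Draw a 5% random sample, as in the study (seed chosen here for illustration).
sample = df.sample(frac=0.05, random_state=42)

# Build multi-hot targets: one binary column per label found in the sample.
labels = sorted({l for s in sample["Finding Labels"] for l in s.split("|")})
for lab in labels:
    sample[lab] = sample["Finding Labels"].apply(
        lambda s, lab=lab: int(lab in s.split("|"))
    )

# Class prevalence makes the imbalance visible ("No Finding" dominates).
print(sample[labels].mean().sort_values(ascending=False))
```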

Because standard accuracy is misleading under such imbalance, the analysis prioritizes the Macro-Averaged F1 Score, which weights every class equally, to evaluate the models' actual ability to detect specific diseases rather than simply recognize the majority class.
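A small, hypothetical example shows why the distinction matters. With scikit-learn, a model that only ever predicts the dominant class can still score well on exact-match (subset) accuracy, while macro F1 collapses because the per-class F1 of every minority class is zero:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy multi-label data: 5 samples, 3 classes, class 0 is the majority.
y_true = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
# A degenerate model that always predicts only the majority class.
y_pred = np.array([[1, 0, 0]] * 5)

# Exact-match (subset) accuracy still looks respectable...
print(accuracy_score(y_true, y_pred))  # 0.6
# ...but macro F1 averages per-class F1 equally and exposes the failure.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # 0.25
```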

The results indicate that DenseNet-121 is the stronger architecture in this constrained setting, achieving a Macro F1 Score of 0.0623 versus the Vision Transformer's 0.0472. Although the Transformer achieved a higher "Exact Match Accuracy," it largely failed to detect minority diseases, often defaulting to the majority class because of its weaker inductive bias and higher data requirements.
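For readers who want to reproduce the comparison, both backbones are available off the shelf. The sketch below assumes PyTorch with torchvision for DenseNet-121 and the timm library for ViT-Base/16, with ImageNet pretraining and a 14-class sigmoid head; the study's exact weights, preprocessing, and head configuration may differ.

```python
import torch.nn as nn
import torchvision.models as models
import timm  # assumed source of the ViT-Base/16 backbone

NUM_CLASSES = 14  # NIH pathology labels, excluding "No Finding" (assumption)

# DenseNet-121 with its classifier replaced by a multi-label output layer.
densenet = models.densenet121(weights="IMAGENET1K_V1")
densenet.classifier = nn.Linear(densenet.classifier.in_features, NUM_CLASSES)

# ViT-Base/16 with the same output dimensionality.
vit = timm.create_model("vit_base_patch16_224", pretrained=True,
                        num_classes=NUM_CLASSES)

# Multi-label training uses a sigmoid/BCE objective rather than softmax.
criterion = nn.BCEWithLogitsLoss()
```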

The analysis concludes that the CNN's built-in ability to extract local features (such as edges and textures) makes it more robust on small, imbalanced datasets, whereas the Vision Transformer requires significantly more data to learn equivalent spatial relationships.

Read full (PDF): Scribd - Google Drive