Classification of Combined MLO and CC Mammographic Views Using Vision–Language Models

Beibit Abdikenov 1*, Nurbek Saidnassim 1, Birzhan Ayanbayev 1, Aruzhan Imasheva 1
1 Science and Innovation Center “Artificial Intelligence”, Astana IT University
* Corresponding Author
J CLIN MED KAZ, In press.
OPEN ACCESS

ABSTRACT

Background: Breast cancer remains one of the leading causes of cancer-related deaths among women globally. Early detection through mammographic screening significantly improves survival rates, but the interpretation of mammograms is time-consuming and requires extensive expertise.  
Methods: We used six publicly available datasets, preprocessing paired craniocaudal (CC) and mediolateral oblique (MLO) views into dual-view concatenated images. Three vision-language models (VLMs), a quantized Qwen2-VL-2B, a quantized SmolVLM (Idefics3-based), and MammoCLIP, were evaluated under two adaptation strategies: full supervised fine-tuning (SFT) and linear probing (LP). EfficientNet-B4 served as a convolutional neural network (CNN) baseline.
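
The dual-view concatenation described above can be sketched as follows. This is a minimal illustration and not the authors' exact pipeline; the file paths, grayscale conversion, and target resolution are assumptions made for the example.

# Minimal sketch of dual-view preprocessing: place a CC and an MLO view of the
# same breast side by side in a single image, as described in the Methods.
# Paths and the 384x384 target size are illustrative assumptions.
from PIL import Image

def make_dual_view(cc_path: str, mlo_path: str, size: tuple[int, int] = (384, 384)) -> Image.Image:
    """Resize both views to a common size and concatenate them horizontally."""
    cc = Image.open(cc_path).convert("L").resize(size)
    mlo = Image.open(mlo_path).convert("L").resize(size)
    combined = Image.new("L", (size[0] * 2, size[1]))
    combined.paste(cc, (0, 0))           # left half: craniocaudal view
    combined.paste(mlo, (size[0], 0))    # right half: mediolateral oblique view
    return combined

if __name__ == "__main__":
    # Hypothetical file names; any paired CC/MLO images would work.
    dual = make_dual_view("patient_001_cc.png", "patient_001_mlo.png")
    dual.save("patient_001_dual_view.png")

The resulting single image can then be fed to either the CNN baseline or the VLMs, so all models receive the same paired-view input.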
Results: Experiments show that while EfficientNet-B4 achieved the highest F1-score (0.5810), the VLMs delivered competitive results while additionally offering report generation. MammoCLIP showed the best VLM performance under LP (F1 = 0.4755, ROC-AUC = 0.6906), outperforming the general-purpose VLMs, which struggled with recall despite high precision. SmolVLM achieved balanced performance under full fine-tuning (F1 = 0.5101, ROC-AUC = 0.6304), indicating strong adaptability in resource-efficient setups.
Conclusion: These findings indicate that domain-specific pretraining substantially enhances VLM effectiveness in mammography classification. Beyond classification, VLMs enable structured reporting and interactive decision support, offering promising avenues for clinical integration despite slightly lower predictive performance than specialized CNNs.

CITATION

Abdikenov B, Saidnassim N, Ayanbayev B, Imasheva A. Classification of Combined MLO and CC Mammographic Views Using Vision–Language Models. J Clin Med Kaz. 2025.