Classification of Combined MLO and CC Mammographic Views Using Vision–Language Models

Beibit Abdikenov 1 * , Nurbek Saidnassim 1, Birzhan Ayanbayev 1, Aruzhan Imasheva 1
1 Science and Innovation Center “Artificial Intelligence”, Astana IT University
* Corresponding Author
J CLIN MED KAZ, Volume 22, Issue 6, pp. 89-95. https://doi.org/10.23950/jcmk/17449
Data availability statement: The corresponding author can provide the data supporting the study's conclusions upon request. Due to ethical and privacy constraints, the data are not publicly accessible.

Artificial Intelligence (AI) Disclosure Statement: AI-Unassisted Work.

ABSTRACT

Background: Breast cancer remains one of the leading causes of cancer-related deaths among women globally. Early detection through mammographic screening significantly improves survival rates, but the interpretation of mammograms is time-consuming and requires extensive expertise.  
Methods: We utilized six publicly available datasets, preprocessing paired craniocaudal (CC) and mediolateral oblique (MLO) views into dual-view concatenated images. Three vision-language models (VLMs) were evaluated: Quantized Qwen2-VL-2B, Quantized SmolVLM (Idefics3-based), and MammoCLIP, each adapted with two strategies: full supervised fine-tuning (SFT) and linear probing (LP). EfficientNet-B4 served as a CNN baseline.
Results: Experiments show that while EfficientNet-B4 achieved the highest F1-score (0.5810), VLMs delivered competitive results with additional report generation capability. MammoCLIP exhibited the best VLM performance (F1 = 0.4755, ROC-AUC = 0.6906) under LP, outperforming general-purpose VLMs, which struggled with recall despite high precision. SmolVLM demonstrated balanced performance under full fine-tuning (F1 = 0.5101, ROC-AUC = 0.6304), indicating strong adaptability in resource-efficient setups. 
Conclusion: These findings highlight that domain-specific pretraining significantly enhances VLM effectiveness in mammography classification. Beyond classification, VLMs enable structured reporting and interactive decision support, offering promising avenues for clinical integration despite slightly lower predictive performance than specialized CNNs.
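The dual-view preprocessing and linear-probing setup described in the Methods can be sketched as follows. This is an illustrative sketch only: the resizing scheme, image sizes, embedding dimension, and the random stand-in for frozen VLM features are assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concat_views(cc, mlo, size=(224, 224)):
    """Resize paired CC and MLO views to a common size and concatenate
    them horizontally into one dual-view image (nearest-neighbor resize
    via index sampling, to stay dependency-free)."""
    def resize(img, hw):
        h, w = img.shape
        rows = np.arange(hw[0]) * h // hw[0]
        cols = np.arange(hw[1]) * w // hw[1]
        return img[rows][:, cols]
    return np.concatenate([resize(cc, size), resize(mlo, size)], axis=1)

# Linear probing (LP): the backbone is frozen and only a linear
# classifier is trained on its embeddings. Random vectors stand in
# here for embeddings produced by a frozen VLM image encoder.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 512))       # hypothetical frozen-encoder embeddings
labels = (emb[:, 0] > 0).astype(int)    # toy benign/malignant labels
probe = LogisticRegression(max_iter=1000).fit(emb, labels)
```

Under full SFT, by contrast, the encoder weights would also be updated, which is what allows SmolVLM's balanced performance reported above but at a much higher training cost than LP.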

CITATION

Abdikenov B, Saidnassim N, Ayanbayev B, Imasheva A. Classification of Combined MLO and CC Mammographic Views Using Vision–Language Models. J CLIN MED KAZ. 2025;22(6):89-95. https://doi.org/10.23950/jcmk/17449
