Chastine Fatichah*
Nanik Suciati
*Corresponding author
Abstract
The evaluation of automatically generated radiology reports remains a critical challenge, as conventional metrics fail to capture the semantic, clinical, and contextual correctness required for automated medical analysis. This study proposes RadEval, a semantic-aware evaluation framework for assessing the quality of generated radiology reports. The method integrates domain-specific knowledge and contextual embeddings to score generated reports on a four-level scale. Given a reference report and a report predicted from a radiology image, RadEval first extracts the relevant medical entities using a fine-tuned biomedical NER model. These entities are normalized through ontology mapping to RadLex concept identifiers to resolve lexical variation. Semantically related entities are then clustered using BioBERT contextual embeddings to capture deeper semantic similarity. In addition, predicted abnormality tags are incorporated to weight clinically significant terms during score aggregation. The final semantic score is a weighted combination of exact match, ontology match, and contextual similarity, modulated by tag importance. Experiments were conducted on the MIMIC-CXR dataset, which contains over 200,000 report pairs. Comparative evaluations show that RadEval outperforms traditional metrics, achieving an F1-score of 0.69 versus 0.56 for BERTScore, and captures the clinical interpretation of the predicted report against the reference more precisely. These findings suggest that RadEval provides a more accurate and clinically aligned framework for evaluating medical report generation models.
Keywords
Radiology Report Generation; Semantic Evaluation; Clinical NLP; Medical Ontology; Clustering
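To make the scoring procedure described in the abstract concrete, the following is a minimal Python sketch of a RadEval-style semantic score. It is not the authors' implementation: the RadLex identifiers, embedding vectors, tag weights, and the 0.5/0.3/0.2 mixing weights are illustrative placeholders, and toy dictionaries stand in for the fine-tuned NER model and BioBERT encoder.

import numpy as np

# Toy ontology map: surface form -> RadLex-style concept ID (placeholder values).
RADLEX_ID = {
    "cardiomegaly": "RID1234",
    "enlarged heart": "RID1234",
    "pleural effusion": "RID5678",
    "effusion": "RID5678",
}

# Toy vectors standing in for BioBERT contextual embeddings (placeholder values).
EMBED = {
    "cardiomegaly": np.array([0.90, 0.10, 0.05]),
    "enlarged heart": np.array([0.85, 0.20, 0.05]),
    "pleural effusion": np.array([0.10, 0.90, 0.20]),
    "effusion": np.array([0.15, 0.85, 0.25]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def entity_score(ref, pred, w_exact=0.5, w_onto=0.3, w_ctx=0.2):
    # Weighted combination of exact match, ontology match, and contextual similarity.
    exact = 1.0 if ref == pred else 0.0
    same_concept = RADLEX_ID.get(ref) is not None and RADLEX_ID.get(ref) == RADLEX_ID.get(pred)
    onto = 1.0 if same_concept else 0.0
    ctx = max(0.0, cosine(EMBED[ref], EMBED[pred])) if ref in EMBED and pred in EMBED else 0.0
    return w_exact * exact + w_onto * onto + w_ctx * ctx

def semantic_score(ref_entities, pred_entities, tag_weights=None):
    # Aggregate the best per-entity score, up-weighting clinically significant tags.
    tag_weights = tag_weights or {}
    total, weight_sum = 0.0, 0.0
    for ref in ref_entities:
        w = tag_weights.get(ref, 1.0)  # abnormality-tag importance (default 1.0)
        best = max((entity_score(ref, p) for p in pred_entities), default=0.0)
        total += w * best
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

# Example: entities extracted from a reference report and a generated report.
reference = ["cardiomegaly", "pleural effusion"]
predicted = ["enlarged heart", "effusion"]
print(semantic_score(reference, predicted, tag_weights={"pleural effusion": 2.0}))

In the full framework, the entity lists would come from the fine-tuned NER model, the concept identifiers from RadLex mapping, and the similarity terms from BioBERT-based clustering as described above; the sketch only shows the shape of the aggregation.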
DOI: https://doi.org/10.26555/ijain.v11i4.2151

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.