Abstract
Automated radiology report generation aims to reduce radiologists’ workload while maintaining high standards of accuracy and readability. We propose the Integrated Multimodal Graph-Enhanced Framework (IMGEF), which combines graph-based features with image and text features to generate precise, clinically relevant medical reports. IMGEF uses a Spatial-aware Graph Embedding Module (SGEM) to aggregate features from neighbouring nodes in a graph while preserving their inherent spatial relationships. It also incorporates a Multimodal Attention-Based Feature Fusion Module (MABFFM), which integrates information from three modalities (image features, textual features, and graph-based features) to produce a unified feature representation. Extensive experiments on the IU X-ray and MIMIC-CXR datasets demonstrate the effectiveness of IMGEF, with results highlighting its ability to generate comprehensive, consistent, and accurate reports.
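The two ideas named in the abstract, spatially weighted neighbour aggregation and attention-based fusion of three modalities, can be illustrated with a minimal sketch. This is not the IMGEF implementation: the function names, the distance-based weighting, and the single-query attention are assumptions chosen only to make the general techniques concrete.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_neighbours(features, positions, adjacency, node):
    """Distance-weighted average of a node's neighbour features.

    Hypothetical stand-in for a spatial-aware graph embedding step:
    neighbours that lie closer to `node` receive larger weights.
    """
    neighbours = adjacency[node]
    if not neighbours:
        # An isolated node keeps its own features.
        return list(features[node])
    # Softmax over negative Euclidean distance: nearer means heavier.
    weights = softmax(
        [-math.dist(positions[node], positions[n]) for n in neighbours]
    )
    dim = len(features[node])
    return [
        sum(w * features[n][d] for w, n in zip(weights, neighbours))
        for d in range(dim)
    ]


def fuse_modalities(image_vec, text_vec, graph_vec, query):
    """Attention-weighted sum of three equally sized modality vectors.

    Hypothetical stand-in for attention-based fusion: each modality is
    scored by a scaled dot product with `query`, and the scores are
    normalised into fusion weights.
    """
    modalities = [image_vec, text_vec, graph_vec]
    scale = math.sqrt(len(query))
    scores = [
        sum(q * x for q, x in zip(query, vec)) / scale for vec in modalities
    ]
    weights = softmax(scores)
    fused = [
        sum(w * vec[d] for w, vec in zip(weights, modalities))
        for d in range(len(query))
    ]
    return fused, weights


# Toy example: three graph nodes with 2-D features and 2-D positions.
features = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 0.0]}
positions = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (-1.0, 0.0)}
adjacency = {0: [1, 2], 1: [0], 2: []}

graph_vec = aggregate_neighbours(features, positions, adjacency, 0)
fused, weights = fuse_modalities([1.0, 0.0], [0.0, 1.0], graph_vec, [0.5, 0.5])
print(graph_vec, fused, weights)
```

In a trained model the query and the weighting would be learned parameters rather than fixed inputs; the sketch only shows how the aggregation and fusion steps relate to each other.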





Data availability
The data and source code are available at Usmannooh/IMGEF.
Acknowledgements
This work is supported by a grant from the Natural Science Foundation of China (Grant No. 62072070). All authors have approved the manuscript for publication.
Funding
Funding was provided by the Natural Science Foundation of China (Grant No. 62072070).
Author information
Contributions
MU was responsible for conceptualization, methodology, software, analysis of the results, and writing the original draft. XH contributed to validation, investigation, and writing (review and editing). YG and ZL provided input for the supporting analysis and contributed to the writing process. ZY, as the supervisor, contributed to the methodology design and experiments, offering feedback and suggestions throughout the process.
Ethics declarations
Competing interest
The authors declare no competing interests.
Ethical approval and consent for data
The data for this research project is publicly available and does not require additional consent for its use.
Financial interest
The authors confirm that no financial interest or personal interaction influenced the development or findings of the IMGEF framework presented in the paper.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Usman, M., Hou, X., Guo, Y. et al. IMGEF: integrated multimodal graph-enhanced framework for radiology report generation. Multimedia Systems 31, 275 (2025). https://doi.org/10.1007/s00530-025-01858-7
DOI: https://doi.org/10.1007/s00530-025-01858-7

