IMGEF: integrated multimodal graph-enhanced framework for radiology report generation

  • Regular Paper
Multimedia Systems

Abstract

Automated radiology report generation can substantially reduce radiologists’ workload, provided the generated reports remain accurate and readable. We propose the Integrated Multimodal Graph-Enhanced Framework (IMGEF), which integrates graph-based features with visual and textual features to generate precise, clinically relevant medical reports. IMGEF uses a Spatial-aware Graph Embedding Module (SGEM) to aggregate features from neighbouring nodes in a graph while preserving their inherent spatial relationships. It also incorporates a Multimodal Attention-Based Feature Fusion Module (MABFFM), which integrates information from three modalities (image features, textual features, and graph-based features) to produce a unified feature representation. Extensive experiments on the IU X-ray and MIMIC-CXR datasets demonstrate the effectiveness of IMGEF, with results highlighting its ability to generate comprehensive, consistent, and accurate reports.
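The two ideas named in the abstract, spatially aware aggregation over neighbouring graph nodes and attention-based fusion of three feature vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no equations, so the distance-decay weighting, the scaled dot-product attention, and the names `aggregate_neighbours`, `fuse_modalities`, and `query` are all assumptions chosen for illustration.

```python
import math


def aggregate_neighbours(features, positions, adjacency):
    """Distance-weighted mean of each node's own and neighbouring features.

    Assumption: nearer neighbours get larger weights via exp(-distance).
    This only stands in for the general idea of spatially aware aggregation.
    """
    aggregated = []
    for i in range(len(features)):
        dim = len(features[i])
        total = [0.0] * dim
        weight_sum = 0.0
        # The node itself is included with weight exp(0) = 1.
        for j in [i] + list(adjacency.get(i, [])):
            w = math.exp(-math.dist(positions[i], positions[j]))
            weight_sum += w
            for k in range(dim):
                total[k] += w * features[j][k]
        aggregated.append([t / weight_sum for t in total])
    return aggregated


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def fuse_modalities(image_vec, text_vec, graph_vec, query):
    """Attention-weighted sum of three equal-length modality vectors.

    Assumption: a single (hypothetical) query vector scores each modality
    with a scaled dot product; the softmax of the scores gives the weights.
    """
    modalities = [image_vec, text_vec, graph_vec]
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, vec)) / scale
              for vec in modalities]
    weights = softmax(scores)
    fused = [sum(w * vec[k] for w, vec in zip(weights, modalities))
             for k in range(len(query))]
    return fused, weights


# Toy usage: three graph nodes with 2-D features and 2-D positions.
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
node_positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
node_adjacency = {0: [1, 2], 1: [0], 2: [0]}
graph_nodes = aggregate_neighbours(node_features, node_positions, node_adjacency)
# Mean-pool node embeddings into one graph-level vector (also an assumption).
graph_vec = [sum(col) / len(graph_nodes) for col in zip(*graph_nodes)]
fused, weights = fuse_modalities([0.5, 0.2], [0.1, 0.9], graph_vec, query=[1.0, 1.0])
```

In this sketch the fusion weights always sum to one, so the unified representation is a convex combination of the three modality vectors. A learned model would typically replace the fixed `query` with trained parameters.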

Data availability

The data and source code are available at Usmannooh/IMGEF.

Acknowledgements

This work is supported by a grant from the Natural Science Foundation of China (Grant No. 62072070). All authors have approved the manuscript for publication.

Funding

Funding was provided by the Natural Science Foundation of China (Grant No. 62072070).

Author information

Contributions

MU was responsible for conceptualization, methodology, and software, analyzed the results, and wrote the original draft. XH contributed to validation, investigation, and writing (review and editing). YG and ZL provided input for the supporting analysis and contributed to the writing. ZY, as supervisor, contributed to the methodology design and experiments, offering feedback and suggestions throughout the process.

Corresponding author

Correspondence to Zhang Yijia.

Ethics declarations

Competing interest

The authors declare no competing interests.

Ethical approval and consent for data

The data for this research project is publicly available and does not require additional consent for its use.

Financial interest

The authors confirm that no financial interest or personal interaction influenced the development or findings of the IMGEF framework presented in the paper.

Additional information

Communicated by Bing-kun Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Usman, M., Hou, X., Guo, Y. et al. IMGEF: integrated multimodal graph-enhanced framework for radiology report generation. Multimedia Systems 31, 275 (2025). https://doi.org/10.1007/s00530-025-01858-7

  • DOI: https://doi.org/10.1007/s00530-025-01858-7
