IMGEF: integrated multimodal graph-enhanced framework for radiology report generation

Usman, Muhammad; Hou, Xiaodi; Guo, Yi; Liang, Zonglin; Yijia, Zhang

doi:10.1007/s00530-025-01858-7

Regular Paper
Published: 31 May 2025

Volume 31, article number 275 (2025)

Multimedia Systems

Muhammad Usman¹, Xiaodi Hou², Yi Guo¹, Zonglin Liang¹ & Zhang Yijia¹

Abstract

Automated radiology report generation can significantly reduce radiologists' workload while maintaining high standards of accuracy and readability. We propose an Integrated Multimodal Graph-Enhanced Framework (IMGEF) that uses graph-enhanced multimodal integration to generate precise, clinically relevant medical reports. IMGEF uses the Spatial-aware Graph Embedding Module (SGEM) to aggregate features from neighbouring nodes in a graph while preserving their inherent spatial relationships. It also incorporates the Multimodal Attention-Based Feature Fusion Module (MABFFM), which integrates information from three modalities (image features, textual features, and graph-based features) to produce a unified feature representation. Extensive experiments on the IU X-ray and MIMIC-CXR datasets demonstrate the effectiveness of the IMGEF model, with results highlighting its ability to generate comprehensive, consistent, and accurate reports.
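
The two ideas named in the abstract, neighbour aggregation that respects spatial layout and attention-based fusion of three modalities, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `spatial_aggregate` and `attention_fuse`, the distance-based softmax weighting, the dot-product scoring against a `query` vector, and all toy values are assumptions chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_aggregate(features, positions, edges):
    """Distance-weighted neighbour aggregation (hypothetical stand-in for SGEM).

    Each node's new feature is a weighted mean over itself and its graph
    neighbours, where spatially closer neighbours receive larger weights.
    """
    n, dim = len(features), len(features[0])
    neighbours = {i: [i] for i in range(n)}  # self-loop keeps the node's own feature
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    out = []
    for i in range(n):
        # Softmax of negative Euclidean distance: an assumed weighting scheme.
        weights = softmax([-math.dist(positions[i], positions[j]) for j in neighbours[i]])
        out.append([
            sum(w * features[j][d] for w, j in zip(weights, neighbours[i]))
            for d in range(dim)
        ])
    return out


def attention_fuse(modalities, query):
    """Softmax attention over modality vectors (hypothetical stand-in for MABFFM)."""
    dim = len(query)
    scores = [sum(q * x for q, x in zip(query, vec)) / math.sqrt(dim) for vec in modalities]
    weights = softmax(scores)
    fused = [sum(w * vec[d] for w, vec in zip(weights, modalities)) for d in range(dim)]
    return fused, weights


# Toy inputs (made up): three graph nodes with 2-D features and 2-D positions.
node_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
node_pos = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
graph_nodes = spatial_aggregate(node_feats, node_pos, edges=[(0, 1), (0, 2)])

# Mean-pool the node features into one graph vector, then fuse three modalities.
graph_vec = [sum(col) / len(graph_nodes) for col in zip(*graph_nodes)]
image_vec, text_vec = [0.5, 0.2], [0.1, 0.9]
fused, weights = attention_fuse([image_vec, text_vec, graph_vec], query=[1.0, 1.0])
```

Because both steps use softmax weights, every output is a convex combination of its inputs, so the fused vector stays within the range of the modality vectors. A learned model would replace the fixed distance weighting and the fixed `query` with trainable parameters.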

Data availability

The data and source code are available at Usmannooh/IMGEF.

Acknowledgements

This work is supported by a grant from the Natural Science Foundation of China (Grant No. 62072070). All authors have approved the manuscript for publication.

Funding

Funding was provided by the Natural Science Foundation of China (Grant No. 62072070).

Author information

Authors and Affiliations

¹ School of Information Science and Technology, Dalian Maritime University, Dalian, 116026, Liaoning, China
Muhammad Usman, Yi Guo, Zonglin Liang & Zhang Yijia
² School of Artificial Intelligence, Dalian Maritime University, Dalian, 116026, Liaoning, China
Xiaodi Hou

Contributions

MU was responsible for conceptualization, methodology, software, analysis of the results, and the original draft. XH contributed to validation, investigation, and writing (review and editing). YG and ZL provided input for the supporting analysis and contributed to the writing process. ZY, as the supervisor, contributed to the methodology design and experiments, offering feedback and suggestions throughout the process.

Corresponding author

Correspondence to Zhang Yijia.

Ethics declarations

Competing interest

The authors declare no competing interests.

Ethical approval and consent for data

The data for this research project is publicly available and does not require additional consent for its use.

Financial interest

The authors confirm that no financial interest or personal interaction influenced the development or findings of the IMGEF framework presented in the paper.

Additional information

Communicated by Bing-kun Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Usman, M., Hou, X., Guo, Y. et al. IMGEF: integrated multimodal graph-enhanced framework for radiology report generation. Multimedia Systems 31, 275 (2025). https://doi.org/10.1007/s00530-025-01858-7

Received: 14 February 2025
Accepted: 20 May 2025
Published: 31 May 2025
Version of record: 31 May 2025
DOI: https://doi.org/10.1007/s00530-025-01858-7
