Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers

Zhang, Zhengbo; Xu, Li; Peng, Duo; Rahmani, Hossein; Liu, Jun

doi:10.1007/978-3-031-73390-1_19

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15086)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We introduce Diff-Tracker, a novel approach for the challenging unsupervised visual tracking task leveraging the pre-trained text-to-image diffusion model. Our main idea is to leverage the rich knowledge encapsulated within the pre-trained diffusion model, such as the understanding of image semantics and structural information, to address unsupervised visual tracking. To this end, we design an initial prompt learner to enable the diffusion model to recognize the tracking target by learning a prompt representing the target. Furthermore, to facilitate dynamic adaptation of the prompt to the target’s movements, we propose an online prompt updater. Extensive experiments on five benchmark datasets demonstrate the effectiveness of our proposed method, which also achieves state-of-the-art performance.
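
The abstract names two components, an initial prompt learner and an online prompt updater, but gives no implementation details. The toy sketch below only illustrates the general idea under loud assumptions and is not the authors' method: a plain dot-product response stands in for the diffusion model's attention, synthetic vectors stand in for image features, and the update rule (refit on the prompt's own prediction, then blend with a momentum term) is a guess. All function names and parameters are hypothetical.

```python
import numpy as np


def attention_map(features, prompt):
    """Sigmoid of the dot product between each location's feature and the prompt."""
    return 1.0 / (1.0 + np.exp(-(features @ prompt)))


def learn_prompt(features, mask, prompt=None, steps=200, lr=0.5):
    """Fit a prompt vector by gradient descent on mean binary cross-entropy."""
    n, d = features.shape
    p = np.zeros(d) if prompt is None else prompt.copy()
    for _ in range(steps):
        pred = attention_map(features, p)
        p -= lr * (features.T @ (pred - mask)) / n  # BCE gradient w.r.t. p
    return p


def update_prompt(features, prompt, steps=20, lr=0.5, momentum=0.9):
    """Assumed update rule: refit on the prompt's own prediction, then blend."""
    pseudo_mask = (attention_map(features, prompt) > 0.5).astype(float)
    refined = learn_prompt(features, pseudo_mask, prompt, steps, lr)
    return momentum * prompt + (1.0 - momentum) * refined


def make_frame(rng, target_idx, n=64, d=8, drift=0.0):
    """Synthetic frame (assumption): channel 0 separates target from background."""
    feats = rng.normal(0.0, 0.1, size=(n, d))
    feats[:, 0] -= 1.0              # background signature
    feats[target_idx, 0] += 2.0     # target signature
    feats[target_idx, 1] += drift   # appearance change on a second channel
    mask = np.zeros(n)
    mask[target_idx] = 1.0
    return feats, mask


rng = np.random.default_rng(0)

# First frame: the target location is given, so the prompt is learned from it.
frame0, mask0 = make_frame(rng, np.arange(8))
prompt = learn_prompt(frame0, mask0)

# Later frame: the target has moved and changed slightly; no mask is given.
frame1, mask1 = make_frame(rng, np.arange(20, 28), drift=0.3)
prompt = update_prompt(frame1, prompt)
predicted = attention_map(frame1, prompt) > 0.5
```

In this toy setting the learned prompt responds to the target's feature signature rather than its position, so `predicted` marks the moved target in the later frame without any label for that frame.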

Acknowledgements

This research is supported by the Ministry of Education, Singapore, under the AcRF Tier 2 Projects (MOE-T2EP20222-0009 and MOE-T2EP20123-0014), and the National Research Foundation Singapore through its AI Singapore Programme (AISG-100E-2023-121).

Author information

Authors and Affiliations

Singapore University of Technology and Design, Singapore, Singapore
Zhengbo Zhang, Li Xu, Duo Peng & Jun Liu
Lancaster University, Lancaster, UK
Hossein Rahmani & Jun Liu

Corresponding author

Correspondence to Jun Liu.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

Electronic supplementary material

Supplementary material 1 (PDF, 908 KB)

Cite this paper

Zhang, Z., Xu, L., Peng, D., Rahmani, H., Liu, J. (2025). Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15086. Springer, Cham. https://doi.org/10.1007/978-3-031-73390-1_19

DOI: https://doi.org/10.1007/978-3-031-73390-1_19
Published: 31 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73389-5
Online ISBN: 978-3-031-73390-1
eBook Packages: Computer Science; Computer Science (R0); Springer Nature Proceedings Computer Science