Unsupervised domain adaptation (UDA) in videos is a challenging task that remains underexplored compared to image-based UDA techniques. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has received little attention. Our key idea is to use transformer layers as a feature encoder and to incorporate spatial and temporal transferability relationships into the attention mechanism. We then develop a Transferable-guided Attention (TransferAttn) framework that exploits the capacity of the transformer to adapt cross-domain knowledge from different backbones. To improve the transferability of ViT, we introduce a novel and effective module named Domain Transferable-guided Attention Block (DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by replacing the self-attention mechanism with a transferability attention mechanism. Extensive experiments on the UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC-Drone datasets with different backbones, such as ResNet101, I3D, and STAM, verify the effectiveness of TransferAttn compared with state-of-the-art approaches. We also demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both the video and image domains.
The TransferAttn architecture utilizes ViT as a feature encoder for domain adaptation. The encoder consists of four transformer blocks, the last of which incorporates our novel attention mechanism (DTAB). Additionally, to study the encoder's capabilities, we use a pretrained, fixed backbone, making feature-space learning the responsibility of the encoder.
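For illustration, the PyTorch sketch below shows one way this layout could be wired together: a frozen backbone provides frame-level features, a four-block transformer encoder refines them, and the final block swaps plain self-attention for a transferability-guided attention that favors domain-confused (high-entropy) tokens. All class names, the entropy-based transferability score, and the dimensions are our own illustrative assumptions, not the official implementation.

```python
# Hedged sketch of the TransferAttn layout (not the authors' code): a frozen
# backbone -> 4-block transformer encoder, with the last block standing in for
# DTAB by reweighting attention toward tokens a domain discriminator cannot
# separate (high-entropy, i.e., more transferable).
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class TransferabilityAttention(nn.Module):
    """Self-attention biased toward domain-transferable tokens (hypothetical)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Token-level source/target discriminator, used only to score tokens.
        self.domain_disc = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                         nn.Linear(dim // 2, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape  # tokens span space and time for video inputs
        q, k, v = self.qkv(x).reshape(B, N, 3, self.num_heads,
                                      self.head_dim).permute(2, 0, 3, 1, 4)
        # Transferability = normalized entropy of the domain prediction:
        # domain-confused tokens (entropy near 1) are treated as more transferable.
        p = F.softmax(self.domain_disc(x), dim=-1)
        transferability = -(p * p.clamp_min(1e-8).log()).sum(-1) / math.log(2.0)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # Bias attention toward transferable (domain-invariant) key tokens.
        attn = (attn + transferability[:, None, None, :].log1p()).softmax(dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(B, N, C))


class EncoderBlock(nn.Module):
    """Pre-norm transformer block; transferable=True swaps in the DTAB-style attention."""

    def __init__(self, dim: int, num_heads: int = 8, transferable: bool = False):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.transferable = transferable
        if transferable:
            self.attn = TransferabilityAttention(dim, num_heads)
        else:
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        if self.transferable:
            x = x + self.attn(h)
        else:
            x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class TransferAttnEncoder(nn.Module):
    """Four transformer blocks on top of frozen backbone features; the last is DTAB-style."""

    def __init__(self, feat_dim: int = 2048, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)  # map backbone features to encoder width
        self.blocks = nn.ModuleList(
            [EncoderBlock(dim, num_heads, transferable=(i == 3)) for i in range(4)]
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, backbone_feats: torch.Tensor) -> torch.Tensor:
        # backbone_feats: (batch, frames, feat_dim) from a frozen 2D/3D backbone.
        x = self.proj(backbone_feats)
        for blk in self.blocks:
            x = blk(x)
        return self.norm(x).mean(dim=1)  # video-level representation


# Usage: 8 frames of frozen ResNet-101 features (2048-d) per video.
feats = torch.randn(4, 8, 2048)
video_repr = TransferAttnEncoder()(feats)  # -> (4, 512)
```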
We report substantial performance improvements across various 2D and 3D backbones. Moreover, using only single-modal data, our model surpasses state-of-the-art methods that rely on multi-modal data (e.g., color and motion).
Although the Kinetics-NEC-Drone setting presents a more pronounced domain shift (YouTube videos vs. top-view drone videos), we still achieve significant performance improvements over state-of-the-art methods.
To validate the capabilities of the novel attention mechanism (DTAB), we explored its application in other transformer-based methods. In UDAVT, we replaced the last transformer block of the backbone with our DTAB block and improved the method's performance.
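As a hedged illustration of this block swap, the snippet below replaces the last block of a timm image ViT with the DTAB-style EncoderBlock from the sketch above; the image ViT is only a convenient stand-in for UDAVT's video transformer backbone, and the module names are our own assumptions.

```python
import timm  # assumes the EncoderBlock class from the previous sketch is in scope

# Stand-in backbone; UDAVT actually uses a video transformer.
vit = timm.create_model("vit_base_patch16_224", pretrained=False)
embed_dim = vit.blocks[-1].attn.qkv.in_features  # 768 for ViT-Base
# Swap the final block for a DTAB-style block (transferable=True enables
# the transferability attention instead of plain self-attention).
vit.blocks[-1] = EncoderBlock(embed_dim, num_heads=12, transferable=True)
```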
DTAB was originally designed for video domain adaptation, given the spatio-temporal nature of the task. To evaluate its generalization capability, we also applied DTAB to state-of-the-art image-based UDA methods and observed significant improvements.
Beyond these performance gains, TransferAttn has a low computational cost and very few parameters, making it an efficient architecture.
@InProceedings{WACV_2025_Sacilotti,
author = {A. {Sacilotti} and S. F. {Santos} and N. {Sebe} and J. {Almeida}},
title = {Transferable-guided Attention Is All You Need for Video Domain Adaptation},
pages = {1--11},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
address = {Tucson, AZ, USA},
month = {February 28 -- March 4},
year = {2025},
publisher = {{IEEE}},
}