Transferable-guided Attention Is All You Need for Video Domain Adaptation

1University of São Paulo, 2Federal University of São Carlos, 3University of Trento
Accepted at WACV 2025

TransferAttn is a framework for unsupervised domain adaptation (UDA) in videos that leverages Vision Transformers (ViT) by incorporating spatial and temporal transferability into the attention mechanism.

Abstract

Unsupervised domain adaptation (UDA) in videos is a challenging task that remains underexplored compared to image-based UDA techniques. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has received little attention. Our key idea is to use the transformer layers as a feature encoder and incorporate spatial and temporal transferability relationships into the attention mechanism. A Transferable-guided Attention (TransferAttn) framework is then developed to exploit the capacity of the transformer to adapt cross-domain knowledge from different backbones. To improve the transferability of ViT, we introduce a novel and effective module named Domain Transferable-guided Attention Block (DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by replacing the self-attention mechanism with a transferability attention mechanism. Extensive experiments on the UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC-Drone benchmarks with different backbones, such as ResNet101, I3D, and STAM, verify the effectiveness of TransferAttn compared with state-of-the-art approaches. We also demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both the video and image domains.
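
To make the idea concrete, below is a minimal PyTorch sketch of one way a per-token transferability score could steer self-attention. The class name, the entropy-based score from a small domain classifier, and all shapes are illustrative assumptions for exposition, not the paper's exact DTAB formulation.

import math
import torch
import torch.nn as nn

class TransferabilityGuidedAttention(nn.Module):
    """Illustrative sketch only: self-attention whose weights are biased by a
    per-token transferability score -- here, the normalized entropy of a small
    source-vs-target domain classifier, so tokens the discriminator cannot
    tell apart are treated as more domain-transferable."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.domain_head = nn.Linear(dim, 2)   # tiny per-token domain classifier

    def forward(self, x):                      # x: (B, N, dim) frame/patch tokens
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, heads, N, head_dim)

        # Transferability: domain-prediction entropy, normalized to [0, 1].
        p = self.domain_head(x).softmax(dim=-1)             # (B, N, 2)
        ent = -(p * p.clamp_min(1e-8).log()).sum(-1)        # (B, N)
        w = (ent / math.log(2.0)).view(B, 1, 1, N)

        attn = (q @ k.transpose(-2, -1)) * self.scale       # (B, heads, N, N)
        attn = (attn + w.log().clamp_min(-10.0)).softmax(dim=-1)  # favor transferable keys
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)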

Overall Architecture

The TransferAttn architecture uses the ViT as a feature encoder for domain adaptation. The encoder consists of four transformer blocks, the last of which incorporates our novel attention mechanism (DTAB). Additionally, to study the encoder's capabilities, we use a pretrained, frozen backbone, making feature-space learning the responsibility of the encoder.
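
A rough sketch of this layout is shown below, under the assumption that the backbone yields per-frame features and that the DTAB block is passed in as a module; the class names and dimensions are placeholders, not the released code.

import torch
import torch.nn as nn

class TransferAttnEncoder(nn.Module):
    """Sketch of the layout described above: a frozen, pretrained backbone
    extracts per-frame features, which pass through four transformer blocks,
    the last being the DTAB. Block classes and sizes are assumptions."""

    def __init__(self, backbone: nn.Module, dtab_block: nn.Module,
                 dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False            # backbone stays fixed
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
             for _ in range(3)]
            + [dtab_block]                     # DTAB as the final block
        )

    def forward(self, frames):                 # frames: (B, T, C, H, W)
        with torch.no_grad():
            x = self.backbone(frames)          # (B, T, dim) frame features
        for blk in self.blocks:                # feature-space learning happens here
            x = blk(x)
        return x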


Experimental Results


UCF101 - HMDB51

We report substantial performance improvements across various 2D and 3D backbones. Moreover, our model, despite using only single-modality data, surpasses state-of-the-art methods that exploit multi-modal data (e.g., color and motion).


Kinetics - NEC-Drone

Although Kinetics and NEC-Drone exhibit a more pronounced domain shift (YouTube videos vs. top-view drone videos), TransferAttn still achieves significant improvements over state-of-the-art methods.

Analysis


DTAB on Other Transformer-based Video Domain Adaptation Methods

To validate the capabilities of the novel attention mechanism (DTAB), we explored its application in other transformer-based methods. In UDAVT, we replaced the last transformer block of the backbone with our DTAB block, which improved the method's performance; a sketch of this swap follows.
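
As an illustration of this kind of swap, assuming the backbone stores its transformer blocks in an nn.ModuleList attribute named blocks (as timm-style ViTs do; UDAVT's actual internals may differ):

import torch.nn as nn

def swap_in_dtab(backbone: nn.Module, dtab_block: nn.Module) -> nn.Module:
    """Replace the last transformer block of a ViT-style backbone with DTAB.
    Assumes the blocks live in an nn.ModuleList attribute named `blocks`."""
    backbone.blocks[-1] = dtab_block
    return backbone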


DTAB on Transformer Image Domain Adaptation Methods

DTAB was originally designed for video domain adaptation, given its spatio-temporal nature. To evaluate the generalization capability of our attention mechanism, we applied DTAB to state-of-the-art image-based UDA methods and observed significant improvements.


Computational Cost

In addition to its performance gains, the TransferAttn architecture incurs a low computational cost and adds very few parameters, making it an efficient architecture.
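
For reference, one quick way to check such numbers for any PyTorch model is shown below; fvcore is one common option, and the input shape is a placeholder for a 16-frame clip.

import torch
from fvcore.nn import FlopCountAnalysis

def report_cost(model: torch.nn.Module, clip_shape=(1, 16, 3, 224, 224)):
    """Print trainable-parameter count and FLOPs for a dummy video clip."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    flops = FlopCountAnalysis(model, torch.randn(*clip_shape)).total()
    print(f"trainable params: {n_params / 1e6:.2f}M | GFLOPs: {flops / 1e9:.2f}")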


BibTeX

@InProceedings{WACV_2025_Sacilotti,
  author    = {A. Sacilotti and S. F. Santos and N. Sebe and J. Almeida},
  title     = {Transferable-guided Attention Is All You Need for Video Domain Adaptation},
  pages     = {1--11},
  booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  address   = {Tucson, AZ, USA},
  month     = {February 28 -- March 4},
  year      = {2025},
  publisher = {IEEE},
}