Which Tokens to Use? Investigating Token Reduction in Vision Transformers
Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, and Thomas B. Moeslund
ICCV 2023 NIVT Workshop
Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs more efficient by removing redundant information in the processed tokens. While different methods have been explored to achieve this goal, we still lack a clear understanding of the resulting reduction patterns and how those patterns differ across token reduction methods and datasets. To close this gap, we set out to understand the reduction patterns of 10 different token reduction methods using four image classification datasets: ImageNet, NABirds, COCO, and NUS-WIDE.
When comparing state-of-the-art token reduction models, we find that pruning-based models in general outperform merging-based approaches. We even find that a static radial pattern outperforms all tested soft merging-based approaches! We also find that the Top-K approach is a very strong baseline, outperformed only by EViT, which modifies the Top-K approach by fusing all pruned tokens into a single extra token.
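To make the Top-K baseline and the EViT-style fusion concrete, here is a minimal PyTorch sketch. It assumes per-token importance scores come from the CLS attention row (as in EViT); the function name topk_prune and its exact signature are illustrative, not code from the paper.

import torch

def topk_prune(tokens, cls_attn, keep_rate, fuse=False):
    # tokens:    (B, 1 + N, D) token sequence, CLS token first.
    # cls_attn:  (B, N) importance score per image token, e.g. the
    #            CLS row of the attention map averaged over heads.
    # keep_rate: fraction of the N image tokens to keep.
    # fuse:      if True, also collapse the pruned tokens into one
    #            extra token weighted by their scores (EViT-style).
    B, N1, D = tokens.shape
    N = N1 - 1
    k = max(1, int(N * keep_rate))

    cls_tok, img_toks = tokens[:, :1], tokens[:, 1:]
    keep_idx = cls_attn.topk(k, dim=1).indices  # (B, k)
    kept = img_toks.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    if not fuse:
        return torch.cat([cls_tok, kept], dim=1)  # plain Top-K pruning

    # EViT-style fusion: zero out the kept tokens' scores, then take a
    # score-weighted average of the pruned tokens as one extra token.
    pruned_mask = torch.ones_like(cls_attn)
    pruned_mask.scatter_(1, keep_idx, 0.0)
    w = (cls_attn * pruned_mask).unsqueeze(-1)  # (B, N, 1)
    fused = (img_toks * w).sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp_min(1e-6)
    return torch.cat([cls_tok, kept, fused], dim=1)

For example, with a ViT-S/16 sequence of 196 patch tokens plus CLS, topk_prune(torch.randn(2, 197, 384), torch.rand(2, 196), keep_rate=0.5, fuse=True) returns a (2, 100, 384) tensor: CLS, 98 kept tokens, and one fused token.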
We conduct an extensive analysis of the reduction patterns and how similar they are when varying the keep rate, backbone capacity, and dataset. Lastly, we find that the similarity of a model's reduction pattern to those of the Top-K and K-Medoids methods is a moderate-to-strong proxy of the model's metric performance. See the paper for more details!
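As one illustration of how two reduction patterns can be compared, the sketch below scores the overlap between two methods' kept-token sets per image with a Jaccard index (IoU). This particular measure and the helper kept_token_iou are assumptions for illustration; the paper details the actual similarity analysis.

import torch

def kept_token_iou(keep_idx_a, keep_idx_b, num_tokens):
    # keep_idx_a, keep_idx_b: (B, k) indices of the tokens each
    # method keeps per image; num_tokens: total image tokens N.
    B = keep_idx_a.shape[0]
    a = torch.zeros(B, num_tokens, dtype=torch.bool)
    b = torch.zeros(B, num_tokens, dtype=torch.bool)
    a.scatter_(1, keep_idx_a, True)
    b.scatter_(1, keep_idx_b, True)
    inter = (a & b).sum(1).float()
    union = (a | b).sum(1).float()
    return (inter / union.clamp_min(1)).mean()  # mean Jaccard over batch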
We hope this research will inspire future work in the token reduction domain and lead to an even better understanding of these models!
Citation
@InProceedings{Haurum_2023_ICCVW,
    author    = {Joakim Bruslund Haurum and Sergio Escalera and Graham W. Taylor and Thomas B. Moeslund},
    title     = {Which Tokens to Use? Investigating Token Reduction in Vision Transformers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
}