ViT Token Reduction

Which Tokens to Use? Investigating Token Reduction in Vision Transformers

Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, and Thomas B. Moeslund
ICCV 2023 NIVT Workshop

Paper (ArXiv) | Code

Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs more efficient by removing redundant information in the processed tokens. While different methods have been explored to achieve this goal, we still lack understanding of the resulting reduction patterns and how those patterns differ across token reduction methods and datasets. To close this gap, we set out to understand the reduction patterns of 10 different token reduction methods using four image classification datasets: ImageNet, NABirds, COCO, and NUS-WIDE.

When comparing state-of-the-art token reduction models, pruning-based models generally outperform merging-based approaches. We even find that a static radial pattern outperforms all tested soft merging-based approaches! We also find that the Top-K approach is a very strong baseline, outperformed only by EViT, which modifies the Top-K approach by fusing all pruned tokens into a single extra token.
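To make the difference between plain Top-K pruning and the EViT-style variant concrete, here is a minimal sketch in plain Python. It is not the code from the paper's repository. The function name `top_k_prune`, the use of a per-token importance score (for example the CLS token's attention to each patch), and the score-weighted average used for fusion are all assumptions made for illustration.

```python
import math


def top_k_prune(tokens, scores, keep_rate, fuse=False):
    """Keep the highest-scoring tokens; optionally fuse the rest into one token.

    tokens:    list of feature vectors (lists of floats), CLS token excluded.
    scores:    one importance score per token (assumed, e.g. CLS attention).
    keep_rate: fraction of tokens to keep, in (0, 1].
    fuse:      if True, append one extra token built from the pruned tokens
               (EViT-style; the score-weighted average is an assumption).
    Returns (kept_tokens, kept_indices).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep_rate <= 1:
        raise ValueError("keep_rate must be in (0, 1]")

    k = max(1, math.ceil(keep_rate * len(tokens)))
    # Rank token indices by descending score (stable for ties).
    order = sorted(range(len(tokens)), key=lambda i: -scores[i])
    keep = sorted(order[:k])  # restore the original spatial order
    drop = sorted(order[k:])
    kept = [tokens[i] for i in keep]

    if fuse and drop:
        total = sum(scores[i] for i in drop)
        if total > 0:
            weights = [scores[i] / total for i in drop]
        else:
            # Fall back to a plain mean if all pruned scores are zero.
            weights = [1.0 / len(drop)] * len(drop)
        dim = len(tokens[0])
        fused = [
            sum(w * tokens[i][d] for w, i in zip(weights, drop))
            for d in range(dim)
        ]
        kept.append(fused)

    return kept, keep


# Hypothetical toy input: four 2-D tokens with made-up importance scores.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
toy_scores = [0.4, 0.1, 0.2, 0.3]
pruned, pruned_idx = top_k_prune(toy_tokens, toy_scores, keep_rate=0.5)
fused, fused_idx = top_k_prune(toy_tokens, toy_scores, keep_rate=0.5, fuse=True)
```

With `fuse=False` the sequence shrinks to the kept tokens only. With `fuse=True` it carries one additional token that summarizes everything that was pruned, which is the single change EViT makes on top of Top-K as described above.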

We conduct an extensive analysis of the reduction patterns and how similar they are when varying the keep rate, the backbone capacity, and the dataset. Lastly, we find that the similarity of a model's reduction pattern to those of the Top-K and K-Medoids methods is a moderate-to-strong proxy of the model's metric performance. See the paper for more details!

Reduction Pattern similarity when varying the keep rate. Measured within method at constant backbone capacity.
Reduction Pattern similarity when varying the backbone capacity. Measured within method at constant keep rates.
Correlation between global pruning-based reduction patterns, measured using Pearson’s Correlation Coefficient. Measured within methods, at constant keep rates and backbone capacities.
Spearman’s correlation between difference in metric performance and CLS Features / Reduction Patterns, when compared to a DeiT / Top-K / K-Medoids model.
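The two correlation measures named in the captions above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: each reduction pattern is assumed to be flattened into a vector with one value per token position (for example, how often that position is kept), and the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical. How the paper aggregates patterns before correlating them is described in the paper, not here.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for j in range(start, end + 1):
            ranks[order[j]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical flattened keep-frequency maps for two models (made-up values).
pattern_a = [0.9, 0.1, 0.5, 0.7]
pattern_b = [0.8, 0.2, 0.4, 0.9]
r = pearson(pattern_a, pattern_b)
rho = spearman(pattern_a, pattern_b)
```

Pearson's coefficient compares the raw values and so is sensitive to how strongly each position is kept, whereas Spearman's coefficient only compares orderings, which makes it the natural choice when relating similarity scores to differences in metric performance.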

We hope this work will inspire future research in the token reduction domain and lead to an even better understanding of these models!


author = {Joakim Bruslund Haurum and Sergio Escalera and Graham W. Taylor and Thomas B. Moeslund}, 
title = {Which Tokens to Use? Investigating Token Reduction in Vision Transformers}, 
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops}, 
month = {October}, 
year = {2023}, }