ViViT: A Video Vision Transformer
Paper: arXiv:2103.15691
ViViT model as introduced in the paper ViViT: A Video Vision Transformer by Arnab et al. and first released in this repository.
Disclaimer: The team releasing ViViT did not write a model card for this model, so this model card has been written by the Hugging Face team.
ViViT is an extension of the Vision Transformer (ViT) to video.
We refer to the paper for details.
The model is intended to be fine-tuned on a downstream task, such as video classification. See the model hub to look for fine-tuned versions on a task that interests you.
For code examples, we refer to the documentation.
@misc{arnab2021vivit,
title={ViViT: A Video Vision Transformer},
author={Anurag Arnab and Mostafa Dehghani and Georg Heigold and Chen Sun and Mario Lučić and Cordelia Schmid},
year={2021},
eprint={2103.15691},
archivePrefix={arXiv},
primaryClass={cs.CV}
}