Cross-ViT: cross-attention vision transformer for image duplicate detection
dc.contributor.author | Chandrasiri, MDN | |
dc.contributor.author | Talagala, PD | |
dc.contributor.editor | Piyatilake, ITS | |
dc.contributor.editor | Thalagala, PD | |
dc.contributor.editor | Ganegoda, GU | |
dc.contributor.editor | Thanuja, ALARR | |
dc.contributor.editor | Dharmarathna, P | |
dc.date.accessioned | 2024-02-06T08:36:41Z | |
dc.date.available | 2024-02-06T08:36:41Z | |
dc.date.issued | 2023-12-07 | |
dc.description.abstract | Duplicate detection in image databases is important across diverse domains, serving either as a standalone process or as a component within broader workflows. This study explores the vision transformer architecture for feature extraction in duplicate image identification. The proposed framework combines the conventional transformer architecture with a cross-attention layer developed specifically for this study. This cross-attention transformer takes pairs of images as input, enabling cross-attention operations that capture the relationships between the distinct features of the two images. Through successive iterations of Cross-ViT, we assess the ranking capability of each version, highlighting the role of the cross-attention layer integrated between transformer blocks. The final recommended model combines higher-dimensional hidden embeddings with mid-size ViT variants to optimize image-pair ranking. The performance of the proposed framework was assessed through a comparative evaluation against baseline CNN models on several benchmark datasets. Notably, the contribution of this study lies not in new feature extraction methods but in a novel cross-attention layer between transformer blocks, grounded in the scaled dot-product attention mechanism. | en_US |
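The record contains no code; as a rough, hypothetical illustration of the scaled dot-product cross-attention the abstract grounds its layer in, the sketch below lets patch embeddings from one image attend to those of a second image. The function name, shapes, and the omission of learned query/key/value projections and multiple heads are simplifications, not the authors' implementation.

```python
import numpy as np

def scaled_dot_product_cross_attention(tokens_a, tokens_b):
    """Cross-attention: tokens of image A attend to tokens of image B.

    tokens_a: (n_a, d) patch embeddings of image A (query side)
    tokens_b: (n_b, d) patch embeddings of image B (key/value side)
    Returns an (n_a, d) array of image-A features re-expressed as
    attention-weighted combinations of image-B tokens.
    """
    d = tokens_a.shape[-1]
    # Scaled dot-product similarity between every A-token and every B-token.
    scores = tokens_a @ tokens_b.T / np.sqrt(d)          # (n_a, n_b)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over B-tokens
    return weights @ tokens_b                            # (n_a, d)

# Usage sketch: two images' patch embeddings of the same width d.
rng = np.random.default_rng(0)
img_a = rng.normal(size=(4, 8))   # 4 patches, 8-dim embeddings
img_b = rng.normal(size=(5, 8))   # 5 patches, 8-dim embeddings
attended = scaled_dot_product_cross_attention(img_a, img_b)
```

In a full Cross-ViT-style block this operation would sit between standard transformer blocks, with each image's token stream alternately serving as the query side so that pairwise feature relationships flow in both directions.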
dc.identifier.conference | 8th International Conference in Information Technology Research 2023 | en_US |
dc.identifier.department | Information Technology Research Unit, Faculty of Information Technology, University of Moratuwa. | en_US |
dc.identifier.email | dncnawodya@gmail.com | en_US |
dc.identifier.email | priyangad@uom.lk | en_US |
dc.identifier.faculty | IT | en_US |
dc.identifier.pgnos | pp. 1-6 | en_US |
dc.identifier.place | Moratuwa, Sri Lanka | en_US |
dc.identifier.proceeding | Proceedings of the 8th International Conference in Information Technology Research 2023 | en_US |
dc.identifier.uri | http://dl.lib.uom.lk/handle/123/22194 | |
dc.identifier.year | 2023 | en_US |
dc.language.iso | en | en_US |
dc.publisher | Information Technology Research Unit, Faculty of Information Technology, University of Moratuwa. | en_US |
dc.subject | Duplicate image detection | en_US |
dc.subject | Vision transformers | en_US |
dc.subject | Attention | en_US |
dc.title | Cross-ViT: cross-attention vision transformer for image duplicate detection | en_US |
dc.type | Conference-Full-text | en_US |