Abstract

In image retrieval, deep local features learned in a data-driven manner have been demonstrated to be effective for improving retrieval performance. To realize efficient retrieval on large image databases, some approaches quantize deep local features with a large codebook and match images with an aggregated match kernel. However, the complexity of these approaches is nontrivial, with a large memory footprint, which limits their capability to jointly perform feature learning and aggregation.


To generate compact global representations while maintaining regional matching capability, we propose a unified framework to jointly learn local feature representation and aggregation. In our framework, we first extract deep local features using CNNs. Then, we design a tokenizer module to aggregate them into a few visual tokens, each corresponding to a specific visual pattern. This helps to remove background noise and capture more discriminative regions in the image. Next, a refinement block is introduced to enhance the visual tokens with self-attention and cross-attention. Finally, different visual tokens are concatenated to generate a compact global representation. The whole framework is trained end-to-end with image-level labels. Extensive experiments are conducted to evaluate our approach, which outperforms the state-of-the-art methods on the Revisited Oxford and Paris datasets. Our code is available at https://github.com/MCC-WH/Token.


Introduction

Given a large image corpus, image retrieval aims to efficiently find target images similar to a given query. It is challenging due to various situations observed in large-scale datasets, e.g., occlusions, background clutter, and dramatic viewpoint changes. In this task, image representation, which describes the content of images to measure their similarities, plays a crucial role. With the introduction of deep learning into computer vision, significant progress has been witnessed in learning image representation for image retrieval in a data-driven paradigm. Generally, there are two main types of representation for image retrieval. One is the global feature, which maps an image to a compact vector, while the other is the local feature, where an image is described with hundreds of short vectors.


Figure 1: Top-5 retrieval results of different methods, including DELG (Cao, Araujo, and Sim 2020), SOLAR (Ng et al. 2020), HOW (Tolias, Jenicek, and Chum 2020), and ours.
The query image is on the left (black outline) with a target object (orange box); on the right are the top-ranking images for the query. Our approach achieves results similar to HOW, which uses a large visual codebook to aggregate local features, with lower memory and latency. Green solid outline: positive images for the query; red solid outline: negative results.

In global feature based image retrieval, although the representation is compact, it usually lacks the capability to retrieve target images with only a partial match. As shown in Fig. 1 (a) and (b), when the query image occupies only a small region in the target images, global features tend to return false positive examples, which are somewhat similar but do not depict the same instance as the query image.


Recently, many studies have demonstrated the effectiveness of combining deep local features with the traditional ASMK aggregation method in dealing with background clutter and occlusion. In those approaches, the framework usually consists of two stages: feature extraction and feature aggregation, where the former extracts discriminative local features, which are further aggregated by the latter for efficient retrieval.


However, they require offline clustering and coding procedures, which lead to a considerable complexity of the whole framework with a high memory footprint and long retrieval latency. Besides, it is difficult to jointly learn local features and aggregation due to the involvement of large visual codebook and hard assignment in quantization.


Some existing works such as NetVLAD try to learn local features and aggregation simultaneously. They aggregate the feature maps output by CNNs into compact global features with a learnable VLAD layer.

Specifically, they discard the original features and adopt the sum of residual vectors of each visual word as the representation of an image. However, considering the large variation and diversity of content in different images, these visual words are too coarse-grained for the features of a particular image. This leads to insufficient discriminative capability of the residual vectors, which further hinders the performance of the aggregated image representation.



To address the above issues, we propose a unified framework to jointly learn and aggregate deep local features. We treat the feature map output by CNNs as the original deep local features. To obtain compact image representations while preserving the regional matching capability, we propose a tokenizer to adaptively divide the local features into groups with spatial attention. These local features are further aggregated to form the corresponding visual tokens. Intuitively, the attention mechanism ensures that each visual token corresponds to some visual pattern and that these patterns are aligned across images. Furthermore, a refinement block is introduced to enhance the obtained visual tokens with self-attention and cross-attention. Finally, the updated attention maps are used to aggregate the original local features for enhancing the existing visual tokens. The whole framework is trained end-to-end with only image-level labels.


Compared with the previous methods, there are two advantages in our approach. First, by expressing an image with a few visual tokens, each corresponding to some visual pattern, we implicitly achieve local pattern alignment with the aggregated global representation. As shown in Fig. 1 (d), our approach performs well in the presence of background clutter and occlusion. Second, the global representation obtained by aggregation is compact with a small memory footprint. These facilitate effective and efficient semantic content matching between images. We conduct comprehensive experiments on the Revisited Oxford and Paris datasets, which are further mixed with one million distractors. Ablation studies demonstrate the effectiveness of the tokenizer and the refinement block. Our approach surpasses the state-of-the-art methods by a considerable margin.


Related Work

In this section, we briefly review the related work including local feature and global feature based image retrieval.

Local feature. Traditionally, local features are extracted using hand-crafted detectors and descriptors. They are first organized in bag-of-words and further enhanced by spatial validation, Hamming embedding, and query expansion. Recently, tremendous advances have been made in learning local features suitable for image retrieval in a data-driven manner. Among these approaches, the state-of-the-art is HOW, which uses attention learning to distinguish deep local features with image-level annotations. During testing, it combines the obtained local features with the traditional ASMK aggregation method. However, HOW cannot jointly learn feature representation and aggregation due to the very large codebook and the hard assignment during the quantization process. Moreover, its complexity is considerable, with a high memory footprint. Our method uses a few visual tokens to effectively represent an image. The feature representation and aggregation are jointly learned.



Global feature. Compact global features reduce the memory footprint and expedite the retrieval process. They simplify image retrieval to a nearest neighbor search and extend the previous query expansion to an efficient exploration of the entire nearest neighbor graph of the dataset by diffusion. Before deep learning, they were mainly developed by aggregating hand-crafted local features, e.g., VLAD, Fisher vectors, and ASMK. Recently, global features have been obtained simply by performing a pooling operation on the feature map of CNNs. Many pooling methods have been explored, e.g., max-pooling, sum-pooling, weighted-sum-pooling, regional max-pooling, generalized mean-pooling, and so on. These networks are trained using ranking or classification losses. Differently, our method tokenizes the feature map into several visual tokens, enhances the visual tokens using the refinement block, concatenates the different visual tokens, and performs dimension reduction. Through these steps, our method generates a compact global representation while maintaining the regional matching capability.
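As one concrete member of the pooling family above, generalized mean (GeM) pooling interpolates between average pooling (p = 1) and max pooling (p → ∞). The sketch below is a minimal numpy illustration of the idea; the feature-map shape and the exponent p = 3 are illustrative assumptions, not values from this paper:

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized mean pooling over the spatial dimensions.

    feature_map: (C, H, W) CNN activations; p=1 recovers average
    pooling, while large p approaches max pooling.
    """
    x = np.clip(feature_map, eps, None)           # GeM assumes non-negative inputs
    pooled = (x ** p).mean(axis=(1, 2)) ** (1.0 / p)
    return pooled / np.linalg.norm(pooled)        # L2-normalize the global descriptor

fmap = np.random.rand(512, 16, 16).astype(np.float32)  # hypothetical CNN feature map
g = gem_pool(fmap)
print(g.shape)  # (512,)
```

Because p is a single scalar, it can be learned by backpropagation in the deep variants of this pooling, which is why GeM is a popular drop-in replacement for max- or sum-pooling.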


Methodology

An overview of our framework is shown in Fig. 2. Given an image, we first obtain the original deep local features through a CNN backbone. These local features are obtained with limited receptive fields covering part of the input image. Thus, we follow prior work and apply the Local Feature Self-Attention (LFSA) operation to obtain context-aware local features. Next, we divide them into L groups with a spatial attention mechanism, and the local features of each group are aggregated to form a visual token.

Furthermore, we introduce a refinement block to update the obtained visual tokens based on the previous local features. Finally, all the visual tokens are concatenated and the dimension is reduced to form the final global descriptor. ArcFace margin loss is used to train the whole network.
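The final aggregation step above can be sketched in a few lines of numpy. The random projection matrix here is only a stand-in for the learned dimension-reduction layer, and the token count and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

L, d_t, d_out = 4, 1024, 1024      # token count / token dim / final dim (illustrative)
tokens = rng.standard_normal((L, d_t)).astype(np.float32)   # refined visual tokens
# stand-in for the learned reduction layer (scaled for stable magnitudes)
W = rng.standard_normal((L * d_t, d_out)).astype(np.float32) / np.sqrt(L * d_t)

concat = tokens.reshape(-1)               # concatenate all visual tokens
global_desc = concat @ W                  # dimension reduction
global_desc /= np.linalg.norm(global_desc)  # L2-normalize the final descriptor
print(global_desc.shape)  # (1024,)
```

During training, this descriptor would be fed to the ArcFace margin loss; at test time it is compared with other descriptors by inner product.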



Figure 2: An overview of our framework. Given an image, we first use a CNN and a Local Feature Self-Attention (LFSA) module to extract local features F_c. Then, they are tokenized into L visual tokens with spatial attention. Further, a refinement block is introduced to enhance the obtained visual tokens with self-attention and cross-attention. Finally, we concatenate all the visual tokens to form a compact global representation f_g and reduce its dimension.

Tokenizer

To effectively cope with the challenging conditions observed in large datasets, such as noisy backgrounds, occlusions, etc., an image representation is expected to find patch-level matches between images. A typical pipeline to tackle these challenges consists of local descriptor extraction, quantization with a large visual codebook usually created by k-means, and descriptor aggregation into a single embedding.

However, due to the offline clustering and hard assignment of local features, it is difficult to optimize feature learning and aggregation simultaneously, which further limits the discriminative power of the image representation.

To alleviate this problem, we here use spatial attention to extract the desired visual tokens. By training, the attention module can adaptively discover discriminative visual patterns.
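The spatial-attention tokenization described above can be illustrated with a small numpy sketch: each token owns one attention map over spatial positions, and the token is the attention-weighted sum of the local features. The projection matrix below is a random stand-in for the learned attention layer, and all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tokenize(local_feats, W_attn):
    """Group local features into visual tokens via spatial attention.

    local_feats: (N, C) flattened CNN feature map (N = H*W positions).
    W_attn:      (C, L) projection producing one attention map per token
                 (a stand-in for the learned attention layer).
    """
    logits = local_feats @ W_attn      # (N, L) per-token attention logits
    attn = softmax(logits, axis=0)     # normalize over spatial positions
    tokens = attn.T @ local_feats      # (L, C) attention-weighted aggregation
    return tokens, attn

rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 128)).astype(np.float32)   # 8x8 map, 128-D features
W = (rng.standard_normal((128, 4)) * 0.1).astype(np.float32)  # L = 4 visual tokens
tokens, attn = tokenize(feats, W)
print(tokens.shape)  # (4, 128)
```

Because the softmax is taken over spatial positions, each attention map sums to one, so every token is a convex combination of local features; training then pushes each map toward a distinct, discriminative visual pattern.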





Experiments

Experimental Setup

Training dataset. The clean version of the Google Landmarks dataset V2 (GLDv2-clean) (Weyand et al. 2020) is used for training. It was first collected by Google and further cleaned by researchers from the Google Landmark Retrieval Competition 2019. It contains a total of 1,580,470 images and 81,313 classes. We randomly divide it into 'train'/'val' subsets with an 80%/20% split. The 'train' split is used for training the model, and the 'val' split is used for validation.


Table 1: mAP comparison against existing methods on the full benchmark. Underline: best previous methods. Black bold: best results.

Evaluation datasets and metrics. Revisited versions of the original Oxford5k (Philbin et al. 2007) and Paris6k (Philbin et al. 2008) datasets are used to evaluate our method, denoted as ROxf and RPar (Radenović et al. 2018) in the following. Both datasets contain 70 query images and additionally include 4,993 and 6,322 database images, respectively. Mean Average Precision (mAP) is used as our evaluation metric on both datasets with the Medium and Hard protocols. Large-scale results are further reported with the R1M dataset, which contains one million distractor images.


Training details. All models are pre-trained on ImageNet.

For image augmentation, a 512 × 512-pixel crop is taken from a randomly resized image and then undergoes random color jittering. We use a batch size of 128 to train our model on 4 NVIDIA RTX 3090 GPUs for 30 epochs, which takes about 3 days. SGD is used to optimize the model, with an initial learning rate of 0.01, a weight decay of 0.0001, and a momentum of 0.9. A linearly decaying scheduler is adopted to gradually decay the learning rate to 0 when the desired number of steps is reached. The dimension d of the global feature is set to 1024. For the ArcFace margin loss, we empirically set the margin m to 0.2 and the scale γ to 32.0. The refinement block number N is set to 2. Test images are resized with the larger dimension equal to 1024 pixels, preserving the aspect ratio. Multiple scales are adopted: L2 normalization is applied for each scale independently, then the three global features are average-pooled, followed by another L2 normalization. We train each model 5 times and evaluate the one with median performance on the validation set.
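The multi-scale test-time procedure above (per-scale L2 normalization, average pooling, then a final L2 normalization) can be sketched directly; the feature dimension and the use of three scales follow the text, while the random features are placeholders:

```python
import numpy as np

def multiscale_descriptor(per_scale_feats):
    """Combine per-scale global features: L2-normalize each scale
    independently, average-pool across scales, then L2-normalize
    the result."""
    normed = [f / np.linalg.norm(f) for f in per_scale_feats]
    avg = np.mean(normed, axis=0)
    return avg / np.linalg.norm(avg)

rng = np.random.default_rng(0)
feats = [rng.standard_normal(1024) for _ in range(3)]  # features from 3 image scales
desc = multiscale_descriptor(feats)
print(desc.shape)  # (1024,)
```

Normalizing before averaging gives each scale equal weight regardless of its raw feature magnitude, which is the usual motivation for this order of operations.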



Results on Image Retrieval

Setting for fair comparison. Commonly, existing methods are compared under different settings, e.g., training set, backbone network, feature dimension, loss function, etc.

This may affect our judgment of the effectiveness of the proposed method. In Tab. 1, we re-train several methods under the same settings (using the GLDv2-clean dataset and ArcFace loss, a 2048-dimensional global feature, and ResNet101 as the backbone), marked with †. Based on this benchmark, we fairly compare the mAP performance of various methods and ours.



Comparison with the state of the art. Tab. 1 compares our approach extensively with the state-of-the-art retrieval methods. Our method achieves the best mAP performance in all settings. We divide the previous methods into three groups:

(1) Local feature aggregation. The current state-of-the-art local aggregation method is R101-HOW. We outperform it in mAP by 1.86% and 4.06% on the ROxf dataset and by 3.91% and 7.80% on the RPar dataset with the Medium and Hard protocols, respectively. For R1M, we also achieve the best performance. The results show that our aggregation method is better than existing local feature aggregation methods based on a large visual codebook.

(2) Global single-pass. When trained with GLDv2-clean, R101-DELG mostly achieves the best performance. When using ResNet101 as the backbone, the comparison between our method and it in mAP is 82.28% vs. 78.24% and 66.57% vs. 60.15% on the ROxf dataset, and 89.34% vs. 88.21% and 78.56% vs. 76.15% on the RPar dataset, with the Medium and Hard protocols, respectively. These results well demonstrate the superiority of our framework.

(3) Global feature followed by local feature re-ranking. We outperform the best two-stage method (R101-DELG+SP) in mAP by 0.50% and 1.80% on the ROxf dataset and by 0.88% and 1.76% on the RPar dataset with the Medium and Hard protocols, respectively. Although two-stage solutions clearly improve upon their single-stage counterparts, our method, which aggregates local features into a compact descriptor, is a better option.





Qualitative results. To explore what the proposed tokenizer has learned, we visualize the spatial attention generated by the cross-attention layer of the last refinement block in Fig. 3 (a). Although there is no direct supervision, different visual tokens are associated with different visual patterns. Most of these patterns focus on the foreground building and remain consistent across images, which implicitly enables pattern alignment; e.g., the 3rd visual token reflects the semantics of "the upper edge of the window".

To further analyze how visual tokens improve the performance, we select the top-2 results of the “hertford” query from the ROxf dataset for the case study. As shown in Fig. 1, when the query object only occupies a small part of the target image, the state-of-the-art methods with global features return false positives which are semantically similar to the query. Our approach uses visual tokens to distinguish different visual patterns, which has the capability of regional matching. In Fig. 3 (b), the 2nd visual token corresponds to the visual pattern described by the query image.


Speed and memory costs. In Tab. 2, we report retrieval latency, feature extraction latency, and memory footprint on R1M for different methods. Compared to the local feature aggregation approaches, most of the global features have a smaller memory footprint. To perform spatial verification, "R101-DELG+SP" needs to store a large number of local features, and thus requires about 485 GB of memory. Our method uses a small number of visual tokens to represent the image, generating a 1024-dimensional global feature with a memory footprint of 3.9 GB. We further compress the memory requirements of global features with PQ quantization. As shown in Tab. 1 and Tab. 2, the compressed features greatly reduce the memory footprint with only a small performance loss. Among these methods, ours appears to be a good solution in the performance-memory trade-off.
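As a back-of-the-envelope check of these memory numbers: 1024-D float32 descriptors for one million images occupy about 4.1 GB (decimal), on the order of the ~3.9 GB reported above, and PQ shrinks this to one byte per sub-vector. The PQ configuration below (128 sub-vectors, 256 centroids each) is an illustrative assumption, not the paper's exact setting:

```python
n_images = 1_000_000
dim = 1024
bytes_per_float32 = 4

# Uncompressed float32 global descriptors
raw_bytes = n_images * dim * bytes_per_float32
raw_gb = raw_bytes / 1e9            # ~4.1 GB (decimal)

# PQ: 128 sub-vectors, 256 centroids each -> one 8-bit code per sub-vector
pq_subvectors = 128
pq_bytes = n_images * pq_subvectors
pq_gb = pq_bytes / 1e9              # ~0.13 GB

print(round(raw_gb, 2), round(pq_gb, 2))  # 4.1 0.13
```

The ~30x reduction comes at the cost of quantization error, which matches the "small performance loss" observed in Tab. 1 and Tab. 2.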


Figure 3: Qualitative examples. (a) Visualization of the attention maps associated with different visual tokens for eight images. #i denotes the i-th visual token. (b) Detailed analysis of the top-2 retrieval results of the "hertford" query in the ROxf dataset. The 2nd visual token focuses on the content of the query image in the target image, which is boxed in red.

The extraction of global features is faster, since the extraction of local features usually requires scaling the image to seven scales, while global features generally use three scales. Our aggregation method requires tokenization and iterative enhancement, which is slightly slower than direct spatial pooling, e.g., 125 ms for ours vs. 109 ms for "R101-DELG". The average retrieval latency of our method on R1M is 0.2871 seconds, which demonstrates the potential of our method for real-time image retrieval.


Ablation Study

Verification of different components. In Tab. 3, we provide experimental results to validate the contribution of the three components in our framework by adding individual components to the baseline framework. When the tokenizer is adopted, there is a significant improvement in overall performance: mAP increases from 77.0% to 79.8% on ROxf-Medium and from 56.0% to 62.5% on ROxf-Hard. This indicates that dividing local features into groups according to visual patterns is more effective than direct global spatial pooling. From the 3rd and last rows, the performance is further enhanced when the refinement block is introduced, which shows that enhancing the visual tokens with the original features makes them more discriminative. There is also a performance improvement when the Local Feature Self-Attention (LFSA) is incorporated.


Table 2: Time and memory measurements. We report extraction time on a single thread GPU (RTX 3090) / CPU (Intel Xeon CPU E5-2640 v4 @ 2.40GHz) and the search time (on a single thread CPU) for the database of ROxf+R1M.
Table 3: Ablation studies of different components. We use R101-SPoC as the baseline and incrementally add tokenizer, Local Feature Self-Attention (LFSA) and refinement block.


Impact of each component in the refinement block. The role of the different components in the refinement block is shown in Tab. 4. By removing individual components, we find that both modeling the relationship between different visual tokens beforehand and further enhancing the visual tokens using the original local features are effective in enhancing the aggregated features.


Table 4: Analysis of components in the refinement block.

Impact of tokenizer type. In Tab. 5, we compare our attention-based tokenizer with two other tokenizers: (1) VQ-based: we directly define the visual tokens as a matrix T, which is randomly initialized and further updated by a moving-average operation within each mini-batch. See the appendix for details. (2) Learned: similar to the VQ-based method, except that T is set as network parameters learned during training. Our method achieves the best performance. We use the attention mechanism to generate visual tokens directly from the original local features. Compared with the other two, our approach obtains more discriminative visual tokens with a better capability to match different images.

Impact of token number. The granularity of visual tokens is influenced by their number. As shown in Tab. 6, as L increases, mAP performance first increases and then decreases, achieving the best at L = 4. This is due to the lack of capability to distinguish local features when the number of visual tokens is small; conversely, when the number is large, they are more fine-grained and noise may be introduced when grouping local features.



Table 5: mAP comparison of different variants of tokenizers.

Table 6: mAP comparison of visual tokens number L.


Conclusion

In this paper, we propose a joint local feature learning and aggregation framework, which generates compact global representations for images while preserving the capability of regional matching. It consists of a tokenizer and a refinement block. The former represents the image with a few visual tokens, which are further enhanced by the latter based on the original local features. By training with image-level labels, our method produces representative aggregated features. Extensive experiments demonstrate that the proposed method achieves superior performance on image retrieval benchmark datasets. In the future, we will extend the proposed aggregation method to a variety of existing local features: instead of directly performing local feature learning and aggregation end-to-end, local features of images are first extracted using existing methods and then aggregated with our method.

Acknowledgements. This work was supported in part by the National Key R&D Program of China under contract 2018YFB1402605, in part by the National Natural Science Foundation of China under Contracts 62102128, 61822208, and 62172381, and in part by the Youth Innovation Promotion Association CAS under Grant 2018497. It was also supported by the GPU cluster built by the MCC Lab of Information Science and Technology Institution, USTC.








