
LET-NET 中英对照

Breaking of brightness consistency in optical flow with a lightweight CNN network

轻量级CNN网络在光流中打破亮度一致性

Yicheng Lin$^{1}$, Shuo Wang$^{1}$, Yunlong Jiang and Bin Han, Member, IEEE

Yicheng Lin$^{1}$、王硕$^{1}$、江云龙和韩斌,IEEE会员

Abstract-The sparse optical flow method is a fundamental task in computer vision. However, its reliance on the assumption of constant environmental brightness constrains its applicability in high dynamic range (HDR) scenes. In this study, we propose a novel approach aimed at transcending image color information by learning a feature map that is robust to illumination changes. This feature map is subsequently structured into a feature pyramid and integrated into sparse Lucas-Kanade (LK) optical flow. By adopting this hybrid optical flow method, we circumvent the limitation imposed by the brightness constant assumption. Specifically, we utilize a lightweight network to extract both the feature map and keypoints from the image. Given the challenge of obtaining reliable keypoints for the shallow network, we employ an additional deep network to support the training process. Both networks are trained using unsupervised methods. The proposed lightweight network achieves a remarkable speed of 190fps on the onboard CPU. To validate our approach, we conduct comparisons of repeatability and matching performance with conventional optical flow methods under dynamic illumination conditions. Furthermore, we demonstrate the effectiveness of our method by integrating it into VINS-Mono, resulting in a significantly reduced translation error of ${93}\%$ on a public HDR dataset. The code implementation is publicly available at https://github.com/linyicheng1/LET-NET.

摘要——稀疏光流方法是计算机视觉中的基本任务。然而,它依赖于恒定环境亮度的假设限制了其在高动态范围(HDR)场景中的应用。在本研究中,我们提出了一种新方法,旨在通过学习对光照变化鲁棒的特征图来超越图像颜色信息。随后,该特征图被结构化为特征金字塔并集成到稀疏的Lucas-Kanade(LK)光流中。通过采用这种混合光流方法,我们规避了亮度恒定假设带来的限制。具体来说,我们利用一个轻量级网络从图像中提取特征图和关键点。鉴于浅层网络获取可靠关键点的挑战,我们采用了一个额外的深度网络来支持训练过程。两个网络都使用无监督方法进行训练。所提出的轻量级网络在机载CPU上实现了惊人的190fps速度。为了验证我们的方法,我们在动态光照条件下与传统光流方法进行了重复性和匹配性能的比较。此外,我们通过将其集成到VINS-Mono中,展示了我们方法的有效性,结果在一个公共HDR数据集上的平移误差显著减少 ${93}\%$。代码实现在https://github.com/linyicheng1/LET-NET 公开可用。

Index Terms - Sparse optical flow, keypoint detection, deep learning

索引术语——稀疏光流,关键点检测,深度学习

I. INTRODUCTION

I. 引言

Optical flow refers to the pixel motion between consecutive frames in an image sequence, and an optical flow method is an algorithm for estimating this motion between consecutive images; such methods are divided into sparse and dense optical flow methods. A sparse optical flow method estimates the pixel motion only at keypoints, while a dense optical flow method estimates it at all positions. Optical flow estimation is a classic problem in computer vision. For instance, VINS-Mono [1] and ${\mathrm{{OV}}}^{2}$SLAM [2] utilize a sparse optical flow method as a key module in visual Simultaneous Localization and Mapping (vSLAM). The performance of optical flow methods in HDR scenes determines their robustness to illumination changes. Therefore, the study of illumination-robust optical flow methods is crucial.

光流是指图像序列中连续帧之间的像素运动,光流方法是一种用于估计连续图像之间像素运动的算法,包括稀疏光流方法和密集光流方法。稀疏光流方法仅估计关键点的像素运动距离,而密集光流方法则在所有位置进行估计。光流估计是计算机视觉中的一个经典问题。例如,VINS-Mono [1] 和 ${\mathrm{{OV}}}^{2}$ SLAM [2] 利用稀疏光流方法作为视觉同时定位与地图构建(vSLAM)中的关键模块。光流方法在HDR场景中的性能决定了它们对光照变化的鲁棒性。因此,研究对光照鲁棒的光流方法至关重要。

The traditional sparse optical flow methods [3], [4] have historically assumed static environmental illumination, which limits their effectiveness in HDR scenarios. Although subsequent research [5] has introduced gradient consistency and other higher-order constraints, such as constancy of the Hessian and of the Laplacian, to improve illumination robustness, these approaches only partially address the issue. Learning-based dense optical flow methods, such as FlowNet [6], offer a solution to this problem. However, they tend to be less efficient and often require GPU support. Thus, there is no sparse optical flow method capable of real-time execution on CPUs without relying on prior assumptions. Consequently, even the most advanced vSLAM systems [2] still struggle to maintain robustness in HDR scenes.

传统的稀疏光流方法 [3], [4] 历来假设静态环境光照,这限制了它们在HDR场景中的有效性。尽管后续研究 [5] 引入了梯度一致性以及其他高阶约束,如Hessian和Laplacian的不变性,以提高光照鲁棒性,但这些方法仅部分解决了这一问题。基于学习的密集光流方法,如FlowNet [6],提供了解决此问题的方法。然而,它们往往效率较低,通常需要GPU支持。因此,目前还没有能够在不依赖先验假设的情况下在CPU上实时执行的稀疏光流方法。因此,即使是最高级的vSLAM [2] 也难以在HDR场景中保持鲁棒性。

Fig. 1. Examples of dynamic lighting scene images. We collected images under different directions of light to demonstrate the robustness of the proposed method to illumination. Among them, forward optical flow refers to extracting keypoints in the first image and tracking them in the second image. The backward optical flow is the opposite.

图1. 动态光照场景图像示例。我们收集了不同光照方向下的图像,以展示所提出方法对光照的鲁棒性。其中,前向光流指的是在第一幅图像中提取关键点并在第二幅图像中跟踪它们。后向光流则相反。

We propose a hybrid optical flow method that combines deep learning with traditional approaches to overcome the limitations of conventional methods in HDR scenes. The difference between the hybrid optical flow method and the LK optical flow method is clearly demonstrated in Fig. 1. Our approach begins by utilizing a lightweight convolutional neural network (CNN) to extract illumination-invariant feature maps and score maps from images. These feature maps go beyond basic RGB information, encompassing higher-level information. Subsequently, we integrate these feature maps with the traditional pyramid LK optical flow method, as shown in Fig. 2, resulting in the hybrid optical flow method. In our approach, the training process involves two key steps. Firstly, we employ a strategy of assisted training using deep networks to enhance the performance of the lightweight network. Secondly, we introduce new loss functions: the mask neural reprojection error (mNRE) loss to learn illumination-invariant features and the line peaky loss to learn keypoint scores. The hybrid optical flow method achieves both efficient and accurate optical flow computation while demonstrating robustness to changes in illumination. In summary, the main contributions of this paper are as follows:

我们提出了一种混合光流方法,该方法结合深度学习与传统方法,以克服传统方法在HDR场景下的局限性。混合光流方法与LK光流方法之间的差异在图1中清晰展示。我们的方法首先利用轻量级卷积神经网络(CNN)从图像中提取光照不变特征图和得分图。这些特征图超越了基本的RGB信息,包含了更高层次的信息。随后,我们将这些特征图与传统的金字塔LK光流方法结合,如图2所示,从而形成了混合光流方法。在我们的方法中,训练过程涉及两个关键步骤。首先,我们采用深度网络辅助训练策略来提升轻量级网络的性能。其次,我们引入了新的损失函数,如掩码神经重投影误差(mNRE)损失来学习光照不变特征,以及线峰值损失来学习关键点得分。混合光流方法在实现高效且准确的光流计算的同时,还表现出对光照变化的鲁棒性。总之,本文的主要贡献如下:


${}^{1}$ These authors contributed equally to this work.

${}^{1}$ 这些作者对本工作的贡献相同。

This work was supported in part by the National Natural Science Foundation of China (52375015) and in part by the Natural Science Foundation of Hubei Province of China (2022CFB239). (Corresponding author: Bin Han)

本工作部分由国家自然科学基金(52375015)和湖北省自然科学基金(2022CFB239)资助。(通讯作者:韩斌)

Y. Lin, S. Wang, Y. Jiang and B. Han are with the State Key Laboratory of Intelligent Manufacturing Equipment and Technology, School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: {yichenglin, shuowang99, jiangyunlong, binhan}@hust.edu.cn).

林毅、王硕、江云龙和韩斌来自华中科技大学机械科学与工程学院,智能制造装备与技术国家重点实验室,武汉430074,中国(电子邮件:{yichenglin, shuowang99, jiangyunlong, binhan}@hust.edu.cn)。


Fig. 2. The pipeline of the proposed hybrid optical flow method. A shared encoder is first used to extract the shared feature map of the image, and the shared feature map is then decoded into a score map $S$ and an illumination-invariant feature map $F$. The score map $S$ is utilized for extracting keypoints, employing non-maximum suppression (NMS) to identify them. The illumination-invariant feature map $F$ is used to construct the feature pyramid for the optical flow method. Following the pyramid LK optical flow method, the hybrid optical flow method begins tracking the extracted keypoints from the highest level of the feature pyramid. The feature map is utilized to locate the positions of these keypoints in the other image. Subsequently, the tracking results from the upper level are utilized as initial values for the tracking computation in the lower level, ultimately yielding sparse optical flow results.

图 2. 提出的混合光流方法的流程。首先使用共享编码器提取图像的共享特征图,然后将共享特征图解码为得分图 $S$ 和光照不变特征图 $F$。得分图 $S$ 用于提取关键点,采用非极大值抑制(NMS)来识别它们。光照不变特征图 $F$ 用于构建金字塔光流方法。按照金字塔 LK 光流方法,混合光流方法从特征金字塔的最高级别开始跟踪提取的关键点。特征图用于在另一幅图像中定位这些关键点的位置。随后,上层的跟踪结果被用作下层跟踪计算的初始值,最终产生稀疏光流结果。

1) We propose a hybrid optical flow method that does not rely on brightness consistency assumptions and can work properly in HDR environments.

1) 我们提出了一种不依赖亮度一致性假设的混合光流方法,并且可以在 HDR 环境中正常工作。

2) We propose a loss function mNRE for extracting light invariants in images, and improve the peaky loss in keypoint extraction.

2) 我们提出了一种用于提取图像中光照不变量的损失函数 mNRE,并改进了关键点提取中的峰值损失。

3) We propose to use a deep network to train and a shallow network to infer, balancing performance and efficiency, reaching ${190}\mathrm{\;{Hz}}$ on the CPU.

3) 我们提出使用深度网络进行训练和浅层网络进行推断,平衡性能和效率,在 CPU 上达到 ${190}\mathrm{\;{Hz}}$。

II. RELATED WORK

II. 相关工作

A. Traditional optical flow method

A. 传统光流方法

Horn and Schunck (HS) [7] proposed the first true optical flow method, which formulated optical flow estimation as an optimisation problem minimising a global energy function. Gaussian filtering has been introduced as a pre-processing operation in variational methods to improve performance in noisy conditions [8]. In contrast to HS, which computes the optical flow field for the entire image, Lucas and Kanade [3] proposed a method to track the optical flow for specific points. To address the issue of large pixel displacements in real scenes, Bouguet [9] proposed a pyramid structure to implement the coarse-to-fine pyramidal Lucas-Kanade (PLK) method. To improve robustness to illumination, [5] proposed gradient consistency and other higher-order consistency constraints, such as constancy of the Hessian and constancy of the Laplacian. Robust Local Optical Flow (RLOF) [10], [11] proposes adjusting the neighbourhood window size to solve the generalised aperture problem. While employing these methods can enhance the robustness of optical flow, they are unable to fundamentally resolve the issue.

Horn 和 Schunck(HS)[7] 提出了第一个真正意义上的光流方法,将光流估计公式化为最小化全局能量函数的优化问题。在变分方法中,高斯滤波被引入作为预处理操作,以在噪声条件下提高性能 [8]。与计算整幅图像光流场的 HS 不同,Lucas 和 Kanade [3] 提出了一种跟踪特定点光流的方法。为了解决真实场景中像素位移过大的问题,Bouguet [9] 提出了一种金字塔结构来实现由粗到细的金字塔 Lucas-Kanade(PLK)方法。为了提高对光照的鲁棒性,[5] 提出了梯度一致性以及其他高阶一致性约束,如 Hessian 的不变性和 Laplacian 的不变性。鲁棒局部光流(RLOF)[10], [11] 提出通过调整邻域窗口大小来解决广义孔径问题。尽管采用这些方法可以增强光流的鲁棒性,但它们无法从根本上解决问题。

B. Learning-based optical flow method

B. 基于学习的光流方法

CNNs were first used to compute optical flow by Dosovitskiy et al. [6]. Two end-to-end networks, FlowNetS and FlowNetC, were proposed to learn optical flow directly from the synthetic annotated Flying Chairs dataset. The accuracy of optical flow estimation was further improved by the addition of subnetworks for small displacements in FlowNet 2.0 [12]. Inspired by iterative updates, Recurrent All-Pairs Field Transforms (RAFT) [13] proposed a new network architecture for optical flow estimation, and [14] further improved its efficiency. Although end-to-end methods are designed to be as lightweight as possible, the complexity of optical flow estimation still limits real-time computation to GPUs.

CNN首次被Dosovitskiy [6]用于计算光流。提出了两个端到端的网络,FlowNetS和FlowNetC,直接从合成注释的Flying Chairs数据集中学习光流。通过在FlowNet [12]中添加用于小位移的子网络,提高了光流估计的准确性。受迭代更新的启发,循环全对场变换(RAFT)[13]提出了一种新的网络架构用于光流估计,[14]进一步提高了其检测效率。尽管端到端方法设计得尽可能轻量级,但光流估计的复杂性仍然限制了实时计算在GPU上。

C. Keypoint detection

C. 关键点检测

Classical Harris [15] keypoint detection uses the autocorrelation matrix to search for keypoints. To enhance the tracking performance of the keypoints, Shi and Tomasi [16] proposed a selection criterion that makes the keypoints more evenly distributed, which has been widely used in optical flow methods. In addition, geometric methods such as SIFT [17] and ORB [18] were proposed to extract keypoints and match them by descriptors.

经典的 Harris [15] 关键点检测使用自相关矩阵来搜索关键点。为了增强关键点的跟踪性能,Shi 和 Tomasi [16] 提出了一种选择标准,使得关键点分布更均匀,这一标准已被广泛应用于光流方法中。此外,SIFT [17]、ORB [18] 等几何方法被提出用于提取关键点并通过描述符进行匹配。

Inspired by handcrafted feature detectors, a common approach for CNN-based detection is to construct response maps to locate interest points in a supervised manner [19]. SuperPoint [20] subsequently suggested self-supervised learning, using a pre-trained model to generate pseudo-ground-truth points. Furthermore, unsupervised training methods, such as Key.Net [21], are used to extract keypoints. To overcome the problem of non-differentiable non-maximum suppression (NMS), ALIKE [22] proposes a differentiable feature point detection module.

受手工特征检测器的启发,基于 CNN 的检测的常见方法是构建响应图以监督方式定位兴趣点 [19]。SuperPoint [20] 随后建议使用预训练模型进行自监督学习以生成伪真实点。此外,还使用无监督训练方法来提取关键点,如 KeyNet [21] 等。为了克服非可微非极大值抑制(NMS)的问题,ALIKE [22] 提出了可微分特征点检测模块。

The proposed method does not belong to either traditional optical flow or end-to-end learning-based optical flow. It combines learning-based approaches with traditional methods, striking a balance between computational efficiency and performance. We refer to it as a hybrid optical flow method.

所提出的方法既不属于传统的光流方法,也不属于端到端学习型光流方法。它结合了基于学习的方法和传统方法,在计算效率和性能之间取得了平衡。我们将其称为混合光流方法。

Fig. 3. The network training process. A shallow network is first used to extract the score map $\mathbf{S}$ and feature map $\mathbf{F}$ . Then,in order to supervise the reliability of the training keypoints, a deep network is used to extract the dense descriptor map D. Finally, we calculate the keypoint loss, feature loss and descriptor loss based on the results of $\left\lbrack {\mathbf{S},\mathbf{F},\mathbf{D}}\right\rbrack$ . Only the shallow network was used for the hybrid optical flow method depicted in Fig. 2, while the deep network was only used for training.

图 3. 网络训练过程。首先使用浅层网络提取分数图 $\mathbf{S}$ 和特征图 $\mathbf{F}$。然后,为了监督训练关键点的可靠性,使用深层网络提取密集描述符图 D。最后,我们基于 $\left\lbrack {\mathbf{S},\mathbf{F},\mathbf{D}}\right\rbrack$ 的结果计算关键点损失、特征损失和描述符损失。如图 2 所示的混合光流方法仅使用了浅层网络,而深层网络仅用于训练。

III. HYBRID OPTICAL FLOW METHOD

III. 混合光流方法

A. Network architecture

A. 网络架构

As illustrated in Fig. 2, the network is designed to be as lightweight as possible and uses only four convolution operations. First, a shared feature map of size $W \times H \times {16}$ is extracted from the input image $\left( {W \times H \times 3}\right)$. The shared feature map is then transformed into an illumination-invariant feature map and a keypoint score map using a $1 \times 1$ convolution kernel. With the lightweight network, the illumination-invariant feature map contains less high-level semantic information while retaining more low-level image information. We consider the low-level information sufficient for our tasks. Therefore, the computational complexity of the designed network is much lower than that of other networks. Each component is explained in more detail as follows:

如图 2 所示,该网络设计得尽可能轻量化,仅使用四个卷积操作。首先,从输入图像 $\left( {W \times H \times 3}\right)$ 中提取大小为 $W \times H \times {16}$ 的共享特征图。然后,使用 $1 \times 1$ 卷积核将共享特征图转换为光照不变特征图和关键点得分图。通过轻量级网络,光照不变特征图包含较少的高级语义信息,同时保留了更多的低级图像信息。我们认为低级信息对于我们的任务已经足够。因此,所设计网络的计算复杂度远低于其他网络。以下将详细解释每一点:

(a) Shared Encoder The image feature encoder converts the input image $I \in W \times H \times 3$ to the size $W \times H \times {16}$ . The first two convolution operations use a $3 \times 3$ convolution kernel and expand the shared feature map to 8 channels. In the last layer,a $1 \times 1$ convolution kernel is used to increase the number of channels up to 16 . The ReLU [23] activation function is used after each convolution. We keep the original image resolution in all convolution operations.

(a) 共享编码器 图像特征编码器将输入图像 $I \in W \times H \times 3$ 转换为大小 $W \times H \times {16}$。前两个卷积操作使用 $3 \times 3$ 卷积核,并将共享特征图扩展到 8 个通道。在最后一层,使用 $1 \times 1$ 卷积核将通道数增加到 16。每个卷积操作后都使用了 ReLU [23] 激活函数。我们在所有卷积操作中保持原始图像分辨率。

(b) Feature and Score Map Decoder The shared feature map is decoded into a score map and an illumination-invariant feature map by the decoding layer. It uses a $1 \times 1$ convolution kernel to reduce the channels of the feature map to 4. The first three channels are the illumination-invariant feature map, and the last channel is the keypoint score map. After the convolution, the score map is activated by the sigmoid function to limit its values to $\left\lbrack {0,1}\right\rbrack$. The convolution feature map is also L2-normalised. So we obtain the final output with a score map of $W \times H \times 1$ and an illumination-invariant feature map of $W \times H \times 3$.

(b) 特征和得分图解码器 共享特征图通过解码层被解码为得分图和光照不变特征图。它使用 $1 \times 1$ 卷积核将特征图的通道数减少到 4。前三个通道是光照不变特征图,最后一个通道是关键点得分图。卷积后,得分图通过 sigmoid 函数激活,将其值限制在 $\left\lbrack {0,1}\right\rbrack$ 之间。卷积特征图也进行了 L2 归一化。因此,我们得到了一个大小为 $W \times H \times 1$ 的得分图和一个大小为 $W \times H \times 3$ 的光照不变特征图。
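For concreteness, the following is a minimal PyTorch sketch of the shallow network described above. Only the layer widths and activations stated in the text (two $3 \times 3$ convolutions to 8 channels, a $1 \times 1$ convolution to 16 channels, a $1 \times 1$ decoding convolution to 4 channels, ReLU, sigmoid, L2 normalisation) come from the paper; everything else, such as the zero padding used to keep the resolution, is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowLETNet(nn.Module):
    """Sketch of the lightweight network in Sec. III-A (not the authors' code)."""

    def __init__(self):
        super().__init__()
        # Shared encoder: two 3x3 convolutions expanding to 8 channels,
        # then a 1x1 convolution raising the channel count to 16.
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 8, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(8, 16, kernel_size=1)
        # Decoder: a single 1x1 convolution producing 4 channels
        # (3 feature channels + 1 score channel).
        self.head = nn.Conv2d(16, 4, kernel_size=1)

    def forward(self, image):                       # image: B x 3 x H x W
        x = F.relu(self.conv1(image))
        x = F.relu(self.conv2(x))
        shared = F.relu(self.conv3(x))              # shared feature map, B x 16 x H x W
        out = self.head(shared)                     # B x 4 x H x W, full resolution
        feat = F.normalize(out[:, :3], p=2, dim=1)  # L2-normalised illumination-invariant feature map
        score = torch.sigmoid(out[:, 3:4])          # keypoint score map in [0, 1]
        return feat, score
```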

B. Optical flow method with illumination-invariant feature maps

B. 基于光照不变特征图的光流方法

The keypoints are first extracted from the score map and then used as the initial positions for the optical flow method. Based on the LK optical flow method, we modify the brightness consistency assumption to an illumination-invariant feature map consistency, i.e. the convolution feature vector at the same keypoint position in different images is the same. Under this assumption, a new hybrid optical flow method is constructed in combination with the pyramidal optical flow method.

关键点首先从得分图中提取。然后,我们将其作为光流法的初始位置。基于LK光流法,我们将亮度一致性假设修改为光照不变特征图一致性,即不同图像中相同关键点位置的卷积特征向量是相同的。基于这一假设,结合金字塔光流法,提出了一种新的混合光流法。

1) Keypoint extraction: The optical flow method expects a more uniform distribution of keypoints to improve the overall tracking robustness. So we use a method similar to the GoodFeaturesToTrack function in OpenCV to extract keypoints. First, the local maximum in the 3x3 neighbourhood is retained by NMS, and then the keypoints with lower scores than the threshold are removed. We then use the maximum interval sampling method to ensure a uniform distribution of the keypoints.

1) 关键点提取:光流法期望关键点分布更均匀以提高整体跟踪鲁棒性。因此,我们使用类似于OpenCV中GoodFeaturesToTrack函数的方法来提取关键点。首先,通过NMS保留3x3邻域内的局部最大值,然后移除得分低于阈值的关键点。接着,我们采用最大间隔采样方法以确保关键点的均匀分布。
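A minimal sketch of this selection step is given below, assuming the score map is a PyTorch tensor. The 3x3 NMS and the score threshold follow the text; the threshold value and point budget are illustrative, and the maximum-interval sampling used to spread the points uniformly is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def extract_keypoints(score_map, threshold=0.1, max_points=300):
    """Sketch: 3x3 non-maximum suppression on the score map, score thresholding,
    then keeping the highest-scoring responses (uniform-sampling step omitted)."""
    # score_map: 1 x 1 x H x W tensor with values in [0, 1]
    local_max = F.max_pool2d(score_map, kernel_size=3, stride=1, padding=1)
    is_peak = (score_map == local_max) & (score_map > threshold)   # local maxima above the threshold
    ys, xs = torch.nonzero(is_peak[0, 0], as_tuple=True)
    scores = score_map[0, 0, ys, xs]
    order = torch.argsort(scores, descending=True)[:max_points]    # strongest responses first
    return torch.stack([xs[order], ys[order]], dim=1)              # N x 2 keypoints as (x, y)
```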

2) Pyramid optical flow method: According to [4], the pyramid optical flow method can be divided into three steps. First, the spatial and temporal derivatives of the illumination-invariant feature map $\mathbf{F}\left( {x,y,t}\right)$, namely ${\mathbf{F}}_{x},{\mathbf{F}}_{y}$ and ${\mathbf{F}}_{t}$, are computed. Then, the derivatives of all keypoints are combined into a coefficient matrix $\mathbf{A}$ and a constant vector $\mathbf{b}$, respectively given by

2) 金字塔光流法:根据[4],金字塔光流法可以分为三个步骤。首先,计算光照不变特征图$\mathbf{F}\left( {x,y,t}\right)$的空间和时间导数,即${\mathbf{F}}_{x},{\mathbf{F}}_{y}$和${\mathbf{F}}_{t}$。然后,将所有关键点的导数分别组合成系数矩阵$\mathbf{A}$和常数向量$\mathbf{b}$,分别由

\[\mathbf{A} = \left\lbrack \begin{array}{l} {\left\lbrack \begin{array}{ll} {\mathbf{F}}_{x} & {\mathbf{F}}_{y} \end{array}\right\rbrack }_{1} \\ \vdots \\ {\left\lbrack \begin{array}{ll} {\mathbf{F}}_{x} & {\mathbf{F}}_{y} \end{array}\right\rbrack }_{k} \end{array}\right\rbrack ,\mathbf{b} = \left\lbrack \begin{matrix} - {\mathbf{F}}_{t1} \\ \vdots \\ - {\mathbf{F}}_{tk} \end{matrix}\right\rbrack , \tag{1}\]

where $\mathrm{k}$ is the number of keypoints. Finally,the optical flow velocity $\mathbf{v}$ is obtained by solving the equation $\mathbf{{Av}} = \mathbf{b}$ . In our method, we use the standard LK optical flow algorithm, but modify the brightness constancy assumption to a convolution feature constancy assumption. Therefore, our method is called the hybrid optical flow method.

其中$\mathrm{k}$是关键点的数量。最后,通过求解方程$\mathbf{{Av}} = \mathbf{b}$得到光流速度$\mathbf{v}$。在我们的方法中,我们使用标准的LK光流算法,但将亮度恒定假设修改为卷积特征恒定假设。因此,我们的方法被称为混合光流法。
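The single-level update behind Eq. (1) can be sketched as one least-squares step on the feature maps; the window size below is illustrative, and the full method iterates this step across the feature pyramid with the upper-level result as initialisation.

```python
import numpy as np

def lk_step(feat_prev, feat_next, pt, win=7):
    """One Lucas-Kanade update on illumination-invariant feature maps (H x W x C
    arrays) instead of grayscale images; a sketch of Eq. (1), not the full
    pyramidal tracker."""
    x, y = int(round(pt[0])), int(round(pt[1]))
    r = win // 2
    patch_prev = feat_prev[y - r:y + r + 1, x - r:x + r + 1]   # win x win x C
    patch_next = feat_next[y - r:y + r + 1, x - r:x + r + 1]
    # Spatial and temporal derivatives of the feature map (per channel).
    Fy, Fx = np.gradient(patch_prev, axis=(0, 1))
    Ft = patch_next - patch_prev
    A = np.stack([Fx.ravel(), Fy.ravel()], axis=1)             # stacked [F_x  F_y] rows
    b = -Ft.ravel()
    # Solve A v = b in the least-squares sense for the flow v.
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.asarray(pt, dtype=float) + v                     # updated keypoint position
```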

Fig. 4. Comparison of the line peaky loss and the peaky loss. In a $5 \times 5$ sized patch, the cyan blocks represent a score of 0.5 while the red block represents a score of 1. The numbers in the blocks represent the derivative of the different losses with respect to the block. It can be seen that the line peaky loss increases the penalty weight at the ends of the lines.

图 4. 线峰值损失和峰值损失的比较。在 $5 \times 5$ 大小的补丁中,青色块代表分数 0.5,而红色代表分数 1。块中的数字表示不同损失相对于块的导数。可以看出,线峰值损失增加了线条末端的惩罚权重。

IV. LEARNING TRACKED KEYPOINTS AND INVARIANT FEATURE

IV. 学习跟踪的关键点和不变特征

Fig. 3 illustrates the training process of the network. For an image pair, a shallow network is used for the extraction of the score map $\mathbf{S}$ and the feature map $\mathbf{F}$ ,and a deep network is used for the extraction of the dense descriptor map D. The shallow network here is consistent with the network structure in Sec. III while the deep network is only used to assist in training. The keypoint loss, the illumination invariant feature loss, and the descriptor loss are used to train the three distinct outputs $\left\lbrack {\mathbf{S},\mathbf{F},\mathbf{D}}\right\rbrack$ ,respectively. The keypoint loss consists of reprojection loss, line peaky loss and reliability loss. The NRE function and the proposed mNRE function are used for descriptor loss and illumination invariant feature loss, respectively.

图 3 展示了网络的训练过程。对于一对图像,使用浅层网络提取分数图 $\mathbf{S}$ 和特征图 $\mathbf{F}$,并使用深层网络提取密集描述符图 D。这里的浅层网络与第 III 节中的网络结构一致,而深层网络仅用于辅助训练。关键点损失、光照不变特征损失和描述符损失分别用于训练三个不同的输出 $\left\lbrack {\mathbf{S},\mathbf{F},\mathbf{D}}\right\rbrack$。关键点损失包括重投影损失、线峰值损失和可靠性损失。NRE 函数和提出的 mNRE 函数分别用于描述符损失和光照不变特征损失。

A. Keypoint loss

A. 关键点损失

As described in previous work [22], a good keypoint should be repeatable, highly accurate, and matchable. Thus, three loss functions are used to train the extraction of keypoints. The reprojection loss function ensures the repeatability of the keypoints. The line peaky loss function is beneficial for improving the accuracy of keypoints. And the reliability loss makes the extracted keypoints easier to match.

如先前的工作 [22] 所述,一个好的关键点应该是可重复的、高度准确的和可匹配的。因此,使用了三个损失函数来训练关键点的提取。重投影损失函数确保了关键点的可重复性。线峰值损失函数有利于提高关键点的准确性。而可靠性损失使得提取的关键点更容易匹配。

1) Reprojection loss: A keypoint should be extracted simultaneously in two images under different conditions. The reprojection error is defined as the distance between the projected point and the extracted point. A point ${\mathbf{p}}_{A}$ in image ${\mathbf{I}}_{A}$ is projected to image ${\mathbf{I}}_{B}$, and the projection point is ${\mathbf{p}}_{AB}$. The single reprojection error is

1) 重投影损失:在不同条件下,两个图像中应同时提取一个关键点。重投影误差定义为投影点与提取点之间的距离。图像 ${\mathbf{I}}_{A}$ 中的点 ${\mathbf{p}}_{A}$ 投影到图像 ${\mathbf{I}}_{B}$,投影点为 ${\mathbf{p}}_{AB}$。单个重投影误差为

\[{\operatorname{dist}}_{AB} = \begin{Vmatrix}{{\mathbf{p}}_{AB} - {\mathbf{p}}_{B}}\end{Vmatrix} \tag{2}\]

where ${\mathbf{p}}_{B}$ is the extracted point in image ${\mathbf{I}}_{B}$ and $\parallel \cdot \parallel$ is the 2-norm of the vector. The reprojection error loss is defined in a symmetrical form

其中 ${\mathbf{p}}_{B}$ 是图像 ${\mathbf{I}}_{B}$ 中的提取点,$\parallel \cdot \parallel$ 是向量的 2-范数。重投影误差损失以对称形式定义

\[{\mathcal{L}}_{rp} = \frac{\mathop{\sum }\limits_{{i = 0}}^{{N - 1}}{\operatorname{dist}}_{AB}^{i}}{N} + \frac{\mathop{\sum }\limits_{{i = 0}}^{{M - 1}}{\operatorname{dist}}_{BA}^{i}}{M}, \tag{3}\]

where $N$ is the number of points in image ${\mathbf{I}}_{A}$ that can be found in image ${\mathbf{I}}_{B}$, ${\operatorname{dist}}_{AB}^{i}$ is the reprojection error of the $i$-th feature out of the $N$ features, $M$ is the number of points in image ${\mathbf{I}}_{B}$ that can be found in image ${\mathbf{I}}_{A}$, and ${\operatorname{dist}}_{BA}^{i}$ is the reprojection error of the $i$-th feature out of the $M$ features.

其中 $N$ 是图像 ${\mathbf{I}}_{A}$ 中在图像 ${\mathbf{I}}_{B}$ 中可找到的点的数量,${\operatorname{dist}}_{AB}^{i}$ 是 $N$ 个特征中第 $i$ 个特征的重投影误差,$M$ 是图像 ${\mathbf{I}}_{B}$ 中在图像 ${\mathbf{I}}_{A}$ 中可找到的点的数量,${\operatorname{dist}}_{BA}^{i}$ 是 $M$ 个特征中第 $i$ 个特征的重投影误差。
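One direction of Eq. (3) can be sketched as follows; the `radius` used to decide whether a projected point is "found" in the other image is an assumed value, not a parameter stated in the paper.

```python
import torch

def reprojection_loss_one_way(proj_pts_AB, kps_B, radius=3.0):
    """Sketch of the A -> B term of Eq. (3). proj_pts_AB holds the image-A
    keypoints projected into image B (N x 2); kps_B holds the keypoints
    extracted in image B (M x 2). Swapping the arguments gives the B -> A term."""
    dists = torch.cdist(proj_pts_AB, kps_B)        # pairwise distances, N x M
    nearest, _ = dists.min(dim=1)                  # dist_AB for each projected point
    found = nearest < radius                       # only points that can be found in image B count
    if not found.any():
        return proj_pts_AB.new_tensor(0.0)
    return nearest[found].mean()
```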

2) Line peaky loss: On the score map, keypoints should take on a peaked shape in their neighborhood. Consider an $N \times N$-sized patch near the keypoint $\mathbf{p}$ on the score map; the distance between each pixel location $\left\lbrack {i,j}\right\rbrack$ and the keypoint is

2) 线峰值损失:在得分图上,关键点在其邻域内应呈现尖峰形状。考虑得分图上关键点 $\mathbf{p}$ 附近的 $N \times N$ 大小的块,每个像素位置 $\left\lbrack {i,j}\right\rbrack$ 与关键点之间的距离为

\[d\left( {\mathbf{p},i,j}\right) = \begin{Vmatrix}{\mathbf{p} - \left\lbrack {i,j}\right\rbrack }\end{Vmatrix} \tag{4}\]

where $\parallel \cdot \parallel$ is the 2-norm of the vector. The peaky loss proposed by [22] is used to reduce the scores of more distant positions within a patch, and is defined as

其中 $\parallel \cdot \parallel$ 是向量的 2-范数。[22] 提出的峰值损失用于降低块内较远位置的得分,定义为

\[{\mathcal{L}}_{pk}\left( \mathbf{p}\right) = \frac{1}{{N}^{2}}\mathop{\sum }\limits_{{0 \leq i,j < N}}d\left( {\mathbf{p},i,j}\right) s\left( {\mathbf{p},i,j}\right) , \tag{5}\]

where $s\left( {\mathbf{p},i,j}\right)$ is the score corresponding to the position $\left\lbrack {i,j}\right\rbrack$ in the patch near the keypoint $\mathbf{p}$. This definition treats all pixels within the patch uniformly, which can leave the score map with locally line-shaped responses during training. Thus, we consider four line patterns, horizontal, vertical, left diagonal, and right diagonal, with increased penalty weights for line shapes. In a patch of size $N \times N$ near the keypoint $\mathbf{p}$, the four line weights ${w}_{1},{w}_{2},{w}_{3},{w}_{4}$ are defined as

其中 $s\left( {\mathbf{p},i,j}\right)$ 是与关键点 $\mathbf{p}$ 附近补丁中位置 $\left\lbrack {i,j}\right\rbrack$ 对应的分数。此定义将补丁内的所有像素均匀看待,使得分数图在训练过程中可能形成局部线状的响应。因此,我们考虑四种线型模式:水平、垂直、左对角线和右对角线,并对线型形状增加惩罚权重。在关键点 $\mathbf{p}$ 附近大小为 $N \times N$ 的补丁中,四个线权重 ${w}_{1},{w}_{2},{w}_{3},{w}_{4}$ 定义为

\[{w}_{1}\left( {\mathbf{p},i,j}\right) = \mathcal{N}\left( \left| {i - {p}_{x}}\right| \right)\] \[{w}_{2}\left( {\mathbf{p},i,j}\right) = \mathcal{N}\left( \left| {j - {p}_{y}}\right| \right) \tag{6}\] \[{w}_{3}\left( {\mathbf{p},i,j}\right) = \mathcal{N}\left( \left| {i + j - {p}_{x} - {p}_{y}}\right| \right)\] \[{w}_{4}\left( {\mathbf{p},i,j}\right) = \mathcal{N}\left( \left| {i - j - {p}_{x} + {p}_{y}}\right| \right)\]
where $\left\lbrack {i,j}\right\rbrack$ is the pixel position within the patch, ${p}_{x},{p}_{y}$ are the coordinates of the keypoint $\mathbf{p}$, $\mathcal{N}$ is the Gaussian distribution and $\left| \cdot \right|$ is the 1-norm. Using these weights, the line peaky loss is defined as

其中 $\left\lbrack {i,j}\right\rbrack$ 是补丁内的像素位置,${p}_{x},{p}_{y}$ 是关键点 $\mathbf{p}$ 的坐标,$\mathcal{N}$ 是高斯分布,$\left| \cdot \right|$ 是 1-范数。通过使用这些权重,线峰值损失定义为

\[{\mathrm{s}}_{k}\left( {\mathbf{p},i,j}\right) = {w}_{k}\left( {\mathbf{p},i,j}\right) d\left( {\mathbf{p},i,j}\right) s\left( {\mathbf{p},i,j}\right)\] \[{\mathcal{L}}_{lpk}\left( \mathbf{p}\right) = \mathop{\max }\limits_{{k = 1\cdots 4}}\left\{ {\frac{1}{{N}^{2}}\mathop{\sum }\limits_{{{\left\lbrack i,j\right\rbrack }^{T} \in \operatorname{patch}\left( \mathbf{p}\right) }}{\mathrm{s}}_{k}\left( {\mathbf{p},i,j}\right) }\right\} \tag{7}\]

where the max function selects the maximum value over the four line patterns. Fig. 4 shows how the derivative of the proposed line peaky loss differs from that of the peaky loss: the line peaky loss increases the penalty at the ends of the line.

其中使用 max 函数从四种线型模式中选择最大值。图 4 显示,峰值损失与所提出的线峰值损失的导数不同之处在于,线峰值损失增加了线条末端的惩罚。
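A sketch of Eq. (4)-(7) for a single patch is given below; the width of the Gaussian line weights (`sigma`) is an assumed value, and the normalisation constant of $\mathcal{N}$ is omitted.

```python
import torch

def line_peaky_loss(score_patch, kp, sigma=1.0):
    """Sketch of the line peaky loss for one N x N score patch around the
    keypoint kp = (px, py), given in patch coordinates."""
    N = score_patch.shape[-1]
    coords = torch.arange(N, dtype=torch.float32)
    i, j = torch.meshgrid(coords, coords, indexing="ij")        # pixel positions [i, j]
    px, py = kp
    d = torch.sqrt((i - px) ** 2 + (j - py) ** 2)               # distance to the keypoint, Eq. (4)
    gauss = lambda x: torch.exp(-x ** 2 / (2 * sigma ** 2))     # unnormalised Gaussian weight
    # Four line weights: horizontal, vertical and the two diagonals, Eq. (6).
    w = torch.stack([gauss((i - px).abs()),
                     gauss((j - py).abs()),
                     gauss((i + j - px - py).abs()),
                     gauss((i - j - px + py).abs())])
    per_line = (w * d * score_patch).mean(dim=(1, 2))           # (1 / N^2) * sum of w_k * d * s
    return per_line.max()                                       # penalise the worst line pattern, Eq. (7)
```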

3) Reliability loss: Accurate and repeatable keypoints are not sufficient. It is also necessary to ensure the matchability of the keypoints [22]. To compute the matchability of the keypoint ${\mathbf{p}}_{A}$ in image ${\mathbf{I}}_{A}$, the vector distance between its corresponding descriptor ${\mathbf{d}}_{{\mathbf{p}}_{A}} \in {\mathbb{R}}^{\text{dim }}$ and the dense descriptor map ${\mathbf{D}}_{B} \in {\mathbb{R}}^{H \times W \times \dim }$ in image ${\mathbf{I}}_{B}$ is computed as

3) 可靠性损失:准确且可重复的关键点还不够,还需要确保关键点的可匹配性 [22]。为了计算图像 ${\mathbf{I}}_{A}$ 中关键点 ${\mathbf{p}}_{A}$ 的可匹配性,计算其对应描述符 ${\mathbf{d}}_{{\mathbf{p}}_{A}} \in {\mathbb{R}}^{\text{dim }}$ 与图像 ${\mathbf{I}}_{B}$ 中密集描述符图 ${\mathbf{D}}_{B} \in {\mathbb{R}}^{H \times W \times \dim }$ 之间的向量距离为

\[{\mathbf{C}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} = {\mathbf{D}}_{B}{\mathbf{d}}_{{\mathbf{p}}_{A}} \tag{8}\]

where ${\mathbf{C}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} \in {\mathbb{R}}^{H \times W}$ is known as the similarity map, representing the similarity between the keypoint ${\mathbf{p}}_{A}$ and each pixel position in the image ${\mathbf{I}}_{B}$. A normalization function then drives the score of positions with high similarity towards 1 and the score of positions with low similarity towards 0; thus the matching probability map ${\widetilde{\mathbf{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} \in {\mathbb{R}}^{H \times W}$ is defined as

其中 ${\mathbf{C}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} \in {\mathbb{R}}^{H \times W}$ 被称为相似度图,表示关键点 ${\mathbf{p}}_{A}$ 与图像 ${\mathbf{I}}_{B}$ 中每个像素位置之间的相似度。归一化函数使高相似度位置的分数趋近于 1,而低相似度位置的分数趋近于 0,因此匹配概率图 ${\widetilde{\mathbf{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} \in {\mathbb{R}}^{H \times W}$ 定义为

\[{\widetilde{\mathbf{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} = \exp \left( \frac{{\mathbf{C}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} - 1}{t}\right) , \tag{9}\]

where $\exp$ is the exponential function used to compose the normalization function, and $t = {0.02}$ controls the shape of the function. For the keypoint ${\mathbf{p}}_{A}$, a higher matching probability at its projected location ${\mathbf{p}}_{AB}$ implies a higher reliability. Therefore the reliability ${r}_{{\mathbf{p}}_{A}}$ of the keypoint ${\mathbf{p}}_{A}$ is defined as

其中 exp 是用于构成归一化函数的指数函数,而 $t = {0.02}$ 控制函数的形状。对于关键点 ${\mathbf{p}}_{A}$,其投影位置 ${\mathbf{p}}_{AB}$ 的匹配概率越高,意味着其可靠性越高。因此,关键点 ${\mathbf{p}}_{A}$ 的可靠性 ${r}_{{\mathbf{p}}_{A}}$ 被定义为

\[{r}_{{\mathbf{p}}_{A}} = \operatorname{bisampling}\left( {{\widetilde{\mathbf{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}},{\mathbf{p}}_{AB}}\right) , \tag{10}\]

where bisampling $\left( {\mathbf{M},\mathbf{p}}\right)$ denotes bilinear sampling of $\mathbf{M} \in {\mathbb{R}}^{H \times W}$ at position $\mathbf{p} \in {\mathbb{R}}^{2}$. Then consider all the keypoints in the image ${\mathbf{I}}_{A}$ and penalize the less reliable keypoint scores among them to get the reliability loss

其中 bisampling $\left( {\mathbf{M},\mathbf{p}}\right)$ 是 $\mathbf{M} \in {\mathbb{R}}^{H \times W}$ 在位置 $\mathbf{p} \in {\mathbb{R}}^{2}$ 的双线性采样函数。然后考虑图像 ${\mathbf{I}}_{A}$ 中的所有关键点,并对其中可靠性较低的关键点分数进行惩罚,以获得可靠性损失

\[\mathbb{S} = \mathop{\sum }\limits_{\substack{{{\mathbf{p}}_{A} \in {\mathbf{I}}_{A}} \\ {{\mathbf{p}}_{AB} \in {\mathbf{I}}_{B}} }}{s}_{{\mathbf{p}}_{A}}{s}_{{\mathbf{p}}_{AB}}\] \[{\mathcal{L}}_{\text{reliability }}^{A} = \frac{1}{{N}_{A}}\mathop{\sum }\limits_{\substack{{{\mathbf{p}}_{A} \in {\mathbf{I}}_{A}} \\ {{\mathbf{p}}_{AB} \in {\mathbf{I}}_{B}} }}\frac{{s}_{{\mathbf{p}}_{A}}{s}_{{\mathbf{p}}_{AB}}}{\mathbb{S}}\left( {1 - {r}_{{\mathbf{p}}_{A}}}\right) \tag{11}\]

where ${N}_{A}$ is the number of all keypoints in the image ${\mathbf{I}}_{A}$, ${s}_{{\mathbf{p}}_{A}}$ is the score of the keypoint ${\mathbf{p}}_{A}$ and ${s}_{{\mathbf{p}}_{AB}}$ is the score at the projection location.

其中 ${N}_{A}$ 是图像 ${\mathbf{I}}_{A}$ 中所有关键点的数量,${s}_{{\mathbf{p}}_{A}}$ 是关键点 ${\mathbf{p}}_{A}$ 的分数,而 ${s}_{{\mathbf{p}}_{AB}}$ 是投影位置的分数。
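The similarity map, its normalisation, and the bilinear sampling of Eq. (8)-(10) can be sketched as below; descriptors are assumed to be L2-normalised and keypoint coordinates are assumed to be in pixels. Weighting these reliabilities by the keypoint scores as in Eq. (11) is then a direct sum.

```python
import torch
import torch.nn.functional as F

def keypoint_reliability(desc_kp_A, dense_desc_B, proj_pts_AB, t=0.02):
    """Sketch of Eq. (8)-(10): per-keypoint similarity maps against the dense
    descriptor map of image B, exponential normalisation, and bilinear sampling
    at the projected locations."""
    dim, H, W = dense_desc_B.shape                                 # desc_kp_A: N x dim, proj_pts_AB: N x 2
    sim = torch.einsum("nd,dhw->nhw", desc_kp_A, dense_desc_B)     # similarity maps C, Eq. (8)
    prob = torch.exp((sim - 1.0) / t)                              # matching probability maps, Eq. (9)
    # Normalise the (x, y) pixel coordinates to [-1, 1] for grid_sample.
    grid = proj_pts_AB.clone().float()
    grid[:, 0] = 2 * grid[:, 0] / (W - 1) - 1
    grid[:, 1] = 2 * grid[:, 1] / (H - 1) - 1
    sampled = F.grid_sample(prob.unsqueeze(1), grid.view(-1, 1, 1, 2),
                            align_corners=True)                    # bilinear sampling, Eq. (10)
    return sampled.view(-1)                                        # reliability r per keypoint
```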

Based on the above three loss functions, we obtain the keypoint loss

基于上述三个损失函数,我们得到了关键点损失

\[{\mathcal{L}}_{\text{keypoint }} = {k}_{1} \cdot {\mathcal{L}}_{rp} +\] \[{k}_{2} \cdot \frac{1}{N + M}\left( {\mathop{\sum }\limits_{{{\mathbf{p}}_{A} \in {\mathbf{I}}_{A}}}{\mathcal{L}}_{lpk}\left( {\mathbf{p}}_{A}\right) + \mathop{\sum }\limits_{{{\mathbf{p}}_{B} \in {\mathbf{I}}_{B}}}{\mathcal{L}}_{lpk}\left( {\mathbf{p}}_{B}\right) }\right) +\] \[{k}_{3} \cdot \frac{1}{2}\left( {{\mathcal{L}}_{\text{reliability }}^{A} + {\mathcal{L}}_{\text{reliability }}^{B}}\right) \tag{12}\]

where ${k}_{1} = 1,{k}_{2} = {0.5},{k}_{3} = 1$ are the weights and $N,M$ are the numbers of keypoints in images ${\mathbf{I}}_{A}$ and ${\mathbf{I}}_{B}$ respectively.

其中 ${k}_{1} = 1,{k}_{2} = {0.5},{k}_{3} = 1$ 是权重,$N,M$ 分别是图像 ${\mathbf{I}}_{A}$ 和 ${\mathbf{I}}_{B}$ 中的关键点数量。

B. Descriptor loss

B. 描述符损失

The NRE [24] function was used to learn the descriptors. Due to its good performance, we adopted it as the descriptor loss. Previous work [22] explains the derivation and definition of the NRE function by the cross-entropy function. We give an explanation from another, more intuitive perspective. Based on the similarity map defined in Eq. 8, a new matching probability map ${\widetilde{\mathbb{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} \in {\mathbb{R}}^{H \times W}$ is defined as

使用了 NRE [24] 函数来学习描述符。由于其良好的性能,我们将其作为描述符损失。先前的工作 [22] 通过交叉熵函数解释了 NRE 函数的推导和定义。我们从另一个更直观的角度进行了解释。基于在公式 8 中定义的相似度图,定义了一个新的匹配概率图 ${\widetilde{\mathbb{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} \in {\mathbb{R}}^{H \times W}$ 为

\[{\widetilde{\mathbb{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} \mathrel{\text{:=}} \operatorname{softmax}\left( \frac{{C}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} - 1}{t}\right) , \tag{13}\]

where the normalization function softmax converts similarity to probability and satisfies that all elements sum to one, i.e., $\mathop{\sum }\limits_{{H \times W}}{\widetilde{\mathbb{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} = 1$. For a good descriptor ${\mathbf{d}}_{{\mathbf{p}}_{A}}$, it should be as similar as possible to the descriptor at the projection position ${\mathbf{p}}_{AB}$ and far away from all other descriptors. Thus, by the same sampling function as in Eq. 10, we get the matching probability ${\mathrm{p}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}}^{{\mathbf{p}}_{AB}}$ as

其中归一化函数 softmax 将相似度转换为概率,并满足所有元素之和为 1,即 $\mathop{\sum }\limits_{{H \times W}}{\widetilde{\mathbb{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} = 1$。对于一个好的描述符 ${\mathbf{d}}_{{\mathbf{p}}_{A}}$,它应尽可能与投影位置 ${\mathbf{p}}_{AB}$ 处的描述符相似,并远离所有其他描述符。因此,通过公式 10 中的相同采样函数,我们得到了匹配概率 ${\mathrm{p}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}}^{{\mathbf{p}}_{AB}}$ 为

\[{\mathrm{p}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}}^{{\mathbf{p}}_{AB}} = \operatorname{bisampling}\left( {{\widetilde{\mathbb{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}},{\mathbf{p}}_{AB}}\right) . \tag{14}\]

Maximizing the matching probability at the projected positions ${\mathrm{p}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}}^{{\mathbf{p}}_{AB}}$ under the constraint that the sum of all the elements is equal to 1 implies minimizing the matching probability at the other positions. The descriptor loss function is then obtained as

在所有元素之和等于 1 的约束下,最大化投影位置 ${\mathrm{p}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}}^{{\mathbf{p}}_{AB}}$ 处的匹配概率,意味着最小化其他位置的匹配概率。然后得到描述符损失函数为

\[{\mathcal{L}}_{\text{desc }} = \frac{1}{{N}_{A} + {N}_{B}} \cdot \left( {-\mathop{\sum }\limits_{{{\mathbf{p}}_{A} \in {\mathbf{I}}_{A}}}\ln \left( {\mathrm{p}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}}^{{\mathbf{p}}_{AB}}\right) - \mathop{\sum }\limits_{{{\mathbf{p}}_{B} \in {\mathbf{I}}_{B}}}\ln \left( {\mathrm{p}}_{{\mathbf{d}}_{{\mathbf{p}}_{B}},{\mathbf{D}}_{A}}^{{\mathbf{p}}_{BA}}\right) }\right) , \tag{15}\]

where ${N}_{A}$ is the number of keypoints in image ${\mathbf{I}}_{A}$, ${N}_{B}$ is the number of keypoints in image ${\mathbf{I}}_{B}$, and the $- \ln \left( \cdot \right)$ function converts the maximization problem into a minimization problem.

其中 ${N}_{A}$ 是图像 ${\mathbf{I}}_{A}$ 中的关键点数量,${N}_{B}$ 是图像 ${\mathbf{I}}_{B}$ 中的关键点数量,且 $- \ln \left( \cdot \right)$ 函数将最大化问题转化为最小化问题。
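A sketch of the descriptor loss for one direction (A to B) follows; sampling the log-probabilities directly is an implementation choice made here for numerical stability rather than something stated in the paper, and the symmetric term simply swaps the two images.

```python
import torch
import torch.nn.functional as F

def nre_loss_one_way(desc_kp_A, dense_desc_B, proj_pts_AB, t=0.02):
    """Sketch of Eq. (13)-(15), A -> B half: softmax-normalised similarity maps,
    bilinear sampling at the projected positions, and the negative log."""
    dim, H, W = dense_desc_B.shape                                  # desc_kp_A: N x dim, proj_pts_AB: N x 2
    sim = torch.einsum("nd,dhw->nhw", desc_kp_A, dense_desc_B)      # similarity maps, Eq. (8)
    log_prob = F.log_softmax(((sim - 1.0) / t).flatten(1), dim=1)   # log of Eq. (13), over all H*W positions
    log_prob = log_prob.view(-1, 1, H, W)
    grid = proj_pts_AB.clone().float()                              # normalise (x, y) to [-1, 1]
    grid[:, 0] = 2 * grid[:, 0] / (W - 1) - 1
    grid[:, 1] = 2 * grid[:, 1] / (H - 1) - 1
    sampled = F.grid_sample(log_prob, grid.view(-1, 1, 1, 2),
                            align_corners=True)                     # ln p at the projected positions, Eq. (14)
    return -sampled.mean()                                          # Eq. (15), A -> B term
```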

C. Illumination-invariant feature loss

C. 光照不变特征损失

The illumination-invariant feature map is similar to the dense descriptor map in that both attempt to extract features of an image that do not vary with viewpoint and illumination. The difference is that descriptors usually have 64 or 128 channels and are produced by many convolutions, whereas the illumination-invariant feature map has only three or one channel and is obtained by four convolutions. It is therefore difficult for the illumination-invariant feature map to learn features that are distinguishable over the entire image. We thus propose to learn locally distinguishable illumination-invariant feature maps using the mNRE loss function. Consider first defining the mask function near the keypoint $\mathbf{p} = {\left\lbrack {p}_{x},{p}_{y}\right\rbrack }^{T}$

光照不变特征图与密集描述符图相似,两者都试图提取图像中不随视点和光照变化的特征。不同之处在于,描述符通常有 64 或 128 个通道,并经过多次卷积,而光照不变特征图只有三个或一个通道,并通过四次卷积获得。因此,光照不变特征图难以学习在整个图像上可区分的特征。因此,我们提出使用 mNRE 损失函数学习局部可区分的光照不变特征图。首先考虑在关键点 $\mathbf{p} = {\left\lbrack {p}_{x},{p}_{y}\right\rbrack }^{T}$ 附近定义掩码函数

\[\operatorname{mask}\left( \mathbf{p}\right) = \left\{ \begin{array}{ll} 1 & \text{ if }\max \left( {\left| {x - {p}_{x}}\right| ,\left| {y - {p}_{y}}\right| }\right) < d \\ 0 & \text{ otherwise } \end{array}\right. , \tag{16}\]

where $d = {80}$ is the mask range and $\left| \cdot \right|$ is the 1-norm. This mask is then added to Eq. 13 of the NRE loss function to obtain the local matching probability map

其中 $d = {80}$ 是掩码范围,$\left| \cdot \right|$ 是 1-范数。然后将此掩码添加到 NRE 损失函数的公式 13 中,以获得局部匹配概率图
\[m{\widetilde{\mathbb{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} \mathrel{\text{:=}} \operatorname{softmax}\left( \frac{\operatorname{mask}\left( {\mathbf{p}}_{A}\right) \cdot {C}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} - 1}{t}\right) , \tag{17}\]

where $m{\widetilde{\mathbb{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} \in {\mathbb{R}}^{H \times W}$ has a value of zero in the region away from ${\mathbf{p}}_{A}$. Then, similarly to the NRE loss, we compute the local matching probability at the projected positions

其中 $m{\widetilde{\mathbb{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}} \in {\mathbb{R}}^{H \times W}$ 在远离 ${\mathbf{p}}_{A}$ 的区域中值为零。然后,类似于 NRE 损失,我们计算投影位置的局部匹配概率

\[m{\mathrm{p}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}}^{{\mathbf{p}}_{AB}} = \operatorname{bisampling}\left( {m{\widetilde{\mathbb{C}}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}},{\mathbf{p}}_{AB}}\right) . \tag{18}\]

Finally, the loss function is computed based on the local matching probabilities of all keypoints in images ${\mathbf{I}}_{A}$ and ${\mathbf{I}}_{B}$ as

最后,基于图像 ${\mathbf{I}}_{A}$ 和 ${\mathbf{I}}_{B}$ 中所有关键点的局部匹配概率计算损失函数

\[{\mathcal{L}}_{\text{feat }} =\] \[\frac{1}{{N}_{A} + {N}_{B}} \cdot \left( {-\mathop{\sum }\limits_{{{\mathbf{p}}_{A} \in {\mathbf{I}}_{A}}}\ln \left( {m{\mathrm{p}}_{{\mathbf{d}}_{{\mathbf{p}}_{A}},{\mathbf{D}}_{B}}^{{\mathbf{p}}_{AB}}}\right) - \mathop{\sum }\limits_{{{\mathbf{p}}_{B} \in {\mathbf{I}}_{B}}}\ln \left( {m{\mathrm{p}}_{{\mathbf{d}}_{{\mathbf{p}}_{B}},{\mathbf{D}}_{A}}^{{\mathbf{p}}_{BA}}}\right) }\right) . \tag{19}\]
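The mask of Eq. (16) turns the NRE sketch above into the mNRE loss with a small change before the softmax; a sketch of the masked probability map of Eq. (17), assuming pixel keypoint coordinates, is:

```python
import torch
import torch.nn.functional as F

def masked_log_prob(sim, kps_A, d=80, t=0.02):
    """Sketch of Eq. (16)-(17): multiply each keypoint's similarity map by a
    square mask of half-width d before the softmax. `sim` is the N x H x W
    similarity tensor of Eq. (8) built from the 3-channel feature map; the rest
    of the mNRE loss follows the NRE sketch above."""
    N, H, W = sim.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=sim.dtype),
                            torch.arange(W, dtype=sim.dtype), indexing="ij")
    dx = (xs[None] - kps_A[:, 0, None, None]).abs()       # |x - p_x| per keypoint
    dy = (ys[None] - kps_A[:, 1, None, None]).abs()       # |y - p_y| per keypoint
    mask = (torch.maximum(dx, dy) < d).to(sim.dtype)      # Eq. (16)
    logits = (mask * sim - 1.0) / t                       # masked similarities, Eq. (17)
    return F.log_softmax(logits.flatten(1), dim=1).view(N, 1, H, W)
```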

V. EXPERIMENTS

V. 实验

A. Training Details

A. 训练细节

MegaDepth [25] is used to train the model. All images in the dataset are scaled to ${480} \times {480}$ during the training process. Training is done using the ADAM [26] optimiser with a learning rate of $3 \times {10}^{-3}$. We set the batch size to one, but accumulate the gradients of 16 batches, and train for 100 epochs. Using the above settings, our model was trained on a 4090 graphics card for approximately one day.

MegaDepth [25] 用于训练模型。数据集中的所有图像在训练过程中都被缩放到 ${480} \times {480}$。训练使用 ADAM [26] 优化器进行,学习率为 $3{e}^{-3}$。我们将批次大小设置为1,但使用16个批次的梯度进行累积,并训练100个周期。使用上述设置,我们的模型在4090显卡上训练了大约一天。
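A sketch of this accumulation schedule is shown below; `model`, `loss_fn` and `loader` are placeholders for the shallow network, the combined losses of Sec. IV and the MegaDepth image pairs, not code from the paper.

```python
import torch

def train(model, loss_fn, loader, epochs=100, accum=16, lr=3e-3):
    """Sketch: batch size 1 with gradients accumulated over 16 batches
    before each ADAM step, as described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        for step, batch in enumerate(loader):
            loss = loss_fn(model, batch) / accum   # scale so 16 accumulated steps act like one batch of 16
            loss.backward()                        # gradients sum into .grad
            if (step + 1) % accum == 0:
                optimizer.step()
                optimizer.zero_grad()
```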

Fig. 5. Optical flow method comparison on image sequences. In the image sequence, an active light source is used to simulate dynamic lighting environments. At the edge of the elliptical spot, the constant brightness assumption no longer holds, so the performance of the optical flow method is challenged. The current keypoints are drawn in green, and the optical flow results within ten frames are drawn in red. Image pairs in the sequence are drawn every ten frames. It can be seen that the traditional LK optical flow method has a large error at the edge of the spot, while the proposed optical flow method has significantly improved performance there.

图5. 图像序列中的光流方法比较。在图像序列中,使用主动光源来模拟动态光照环境。在椭圆形光斑的边缘,恒定亮度假设不再成立,因此光流方法的性能将受到挑战。当前的关键点用绿色绘制,十帧内的光流结果用红色绘制。序列中的图像对每十帧绘制一次。可以看出,传统的LK光流方法在光斑边缘存在较大误差。所提出的光流方法在光斑边缘的性能有了显著提升。

TABLE I

REPEATABILITY COMPARISON

重复性比较

| Method | Repeatability $\uparrow$ (Illumination Scenes) | Repeatability $\uparrow$ (Viewpoint Scenes) | CPU time (ms) $\downarrow$ |
| --- | --- | --- | --- |
| Ours | 0.618 | 0.606 | 5.2 |
| ALike(T) [22] | 0.638 | 0.563 | 84.4 |
| SuperPoint [20] | 0.652 | 0.503 | 93.5 |
| Harris [15] | 0.62 | 0.556 | 4.9 |
| Fast [27] | 0.575 | 0.552 | 0.4 |
| Random | 0.101 | 0.11 | - |

B. HPatches Repeatability

B. HPatches 重复性

In order to verify the performance of the proposed model in keypoint detection, we calculated the repeatability of the keypoints on the HPatches [28] dataset. We compared the proposed method with ALIKE(T) [22], SuperPoint [20], Harris [15], Fast [27], and random sampling. The repeatability is calculated for the extraction of 300 keypoints at a resolution of ${240} \times {320}$. The same NMS is used for all feature point detection methods to suppress the clustering of feature points. Keypoints with a reprojection distance of less than 3 between the two images are considered repeated keypoints, and keypoints above this threshold are not repeatable. The comparison of repeatability in illumination scenes and viewpoint scenes is shown in Table I. We can see that the repeatability of the proposed keypoints is at the state-of-the-art level.

为了验证所提出的模型在关键点检测中的性能,我们在HPatches [28]数据集上计算了关键点的重复性。我们将所提出的方法与ALIKE(T) [22]、SuperPoint [20]、Harris [15]、Fast [27]、Random进行了比较。重复性计算是在分辨率为${240} \times {320}$的情况下提取300个关键点。所有特征点检测方法都使用相同的NMS来抑制特征点聚集现象。在两幅图像中,重投影距离小于3的关键点被认为是重复关键点,超过此阈值的关键点则不重复。表I显示了光照场景和视角场景中的重复性比较。我们可以看到,所提出的关键点重复率达到了最先进的水平。

In terms of computation time, all methods are run and measured on the same onboard CPU (Intel i7-1165G7). We run the non-deep-learning keypoint extraction methods at a size of ${480} \times {640}$, while the deep-learning keypoint score maps are computed at ${240} \times {320}$ and then upsampled to ${480} \times {640}$. From Table I it can be seen that the proposed method greatly improves real-time performance compared to other deep learning solutions, and is even comparable to traditional methods.

在计算时间方面,所有方法都在相同的机载CPU I7-1165G7上运行并进行测量。我们计算了大小为${480} \times {640}$的非深度学习关键点提取方法,并在${240} \times {320}$中计算了深度学习关键点得分图,然后上采样到${480} \times {640}$。从表I可以看出,与其它深度学习解决方案相比,所提出的方法大大提高了实时性能,甚至与传统方法相当。

Fig. 6. Examples of dynamic lighting scene images. The collected data contains four typical scenes: indoor light source changes, active light sources, outdoor sunlight changes and image blur caused by light scattering. The grayscale constancy assumption is not satisfied in any of the image pairs.

图6. 动态光照场景图像示例。收集的数据包含四个典型场景:室内光源变化、主动光源、室外阳光变化以及由光线散射引起的图像模糊。所有图像对都不满足灰度假设。

TABLE II

CORRECT TRACKING RATIO

正确跟踪比率

| Scenes | Indoor | Lighting Source | Outdoor | Image Blur |
| --- | --- | --- | --- | --- |
| Ours | 0.84 | 0.87 | 0.67 | 0.75 |
| LK Optical Flow [4] | 0.58 | 0.37 | 0.45 | 0.19 |
| Census | 0.53 | 0.72 | 0.61 | 0.41 |
| Histogram equalization | 0.51 | 0.50 | 0.52 | 0.49 |
| ORB [18] | 0.12 | 0.40 | 0.23 | 0.32 |
| ALIKE(T) [22] | 0.59 | 0.41 | 0.54 | 0.51 |
| SuperPoint [20] | 0.48 | 0.49 | 0.57 | 0.51 |

C. Keypoint Tracking

C. 关键点跟踪

We collected several sets of image pairs with typical lighting changes and active light sequences. Fig. 6 shows some examples of collected images. These scenes are divided into four typical categories, namely indoor light source changes, outdoor sunlight changes, active light sources and image blur. Each category collects five pairs of images with small angle differences but large illumination changes.

我们收集了几组具有典型光照变化和主动光序列的图像对。图6展示了一些收集到的图像示例。这些场景被分为四个典型类别,即室内光源变化、室外阳光变化、主动光源和图像模糊。每个类别收集了五对图像,这些图像具有小角度差异但光照变化较大。

Fig. 7. Keypoint tracking rejection rate. In the sequence with active light source, the number of outliers in the optical flow matching is counted and divided by the total number of keypoints to obtain the rejection rate. Compared with the original LK optical flow method, the proposed method can effectively reduce the number of outliers, thereby improving the accuracy of optical flow.

图7展示了关键点跟踪的拒绝率。在包含主动光源的序列中,统计了光流匹配中的异常值数量,并将其除以总关键点数以获得拒绝率。与原始LK光流方法相比,所提出的方法能有效减少异常值数量,从而提高光流的准确性。

Using this data, we compare the matching results of the proposed method in Table II. All methods run on images sized ${480} \times {640}$ and are restricted to extracting at most 300 keypoints for matching. Brute-force matching is used to obtain matching results for all keypoints with descriptors, such as ORB [18]. The optical flow method applied after preprocessing the image with the census transform or histogram equalization is compared in the same way. As there is no ground truth, we match each pair of frames using SIFT [17] keypoints and estimate the fundamental matrix with RANSAC [29]. This matrix serves to filter all matches and obtain the number of correct ones. Table II shows the comparison of the correct tracking rate in different scenes. From the table, we can see that the proposed method achieves the highest correct tracking rate in all scenes.

利用这些数据,我们在表II中比较了所提出方法的匹配结果。所有方法都在尺寸为 ${480} \times {640}$ 的图像上运行,并限制最多提取 300 个关键点进行匹配。对于 ORB [18] 等具有描述符的关键点,使用暴力匹配获得匹配结果。同样比较了使用 census 变换和直方图均衡化预处理图像后的光流方法。由于没有真实数据,我们使用 SIFT [17] 关键点对每一组帧进行匹配,并使用 RANSAC [29] 估计基础矩阵。该矩阵用于过滤所有匹配以获得正确匹配的数量。表II展示了不同场景中正确跟踪率的比较。从表中可以看出,所提出的方法在所有场景中实现了最高的正确跟踪率。
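A sketch of this evaluation step using OpenCV is given below; the inlier threshold is an assumed value, and the exact filtering criterion used by the authors may differ.

```python
import cv2
import numpy as np

def correct_tracking_ratio(sift_a, sift_b, track_a, track_b, thresh=3.0):
    """Sketch: estimate a fundamental matrix with RANSAC from SIFT matches
    (sift_a, sift_b), then count a tracked pair (track_a[i], track_b[i]) as
    correct if its epipolar error in image B is below `thresh` pixels."""
    F, _ = cv2.findFundamentalMat(np.float32(sift_a), np.float32(sift_b), cv2.FM_RANSAC)
    if F is None:
        return 0.0
    # Epipolar lines in image B for the tracked points of image A.
    pts_a = np.float32(track_a).reshape(-1, 1, 2)
    lines_b = cv2.computeCorrespondEpilines(pts_a, 1, F).reshape(-1, 3)   # (a, b, c) with a^2 + b^2 = 1
    pts_b = np.float32(track_b)
    # Point-to-line distance |a*x + b*y + c| for each tracked point in image B.
    err = np.abs(lines_b[:, 0] * pts_b[:, 0] + lines_b[:, 1] * pts_b[:, 1] + lines_b[:, 2])
    return float((err < thresh).mean())                                   # correct tracking ratio
```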

For further verification of the performance of the proposed method in dynamic lighting scenes, we also collected a sequence with an active light source. As in the VIO system, we continuously tracked the positions of the extracted keypoints in the sequence. As shown in Fig. 5, the proposed method can continuously track keypoints at the edge of the spot. Furthermore, Fig. 7 shows that the proposed method can continuously track keypoints in scenes with illumination changes. The traditional optical flow method fails due to the interference of the active light source, resulting in a rapid increase in the rejection rate.

为了进一步验证所提方法在动态光照场景下的性能,我们还采集了一个带有主动光源的序列。与VIO系统中一样,我们持续跟踪了序列中提取的关键点的位置。如图5所示,所提方法能够在光斑边缘持续跟踪关键点。此外,图7显示所提方法能够在光照变化的场景中持续跟踪关键点。传统的光流方法由于受到主动光源的干扰,导致拒收率迅速增加。

D. VIO Trajectory Estimation

D. VIO轨迹估计

We embed the proposed optical flow method into a modern VIO system, VINS-Mono [1]. By replacing the original keypoint extraction and optical flow computation, the modified VIO system is obtained. As shown in Fig. 8, a comparison test was performed on the UMA-VI [30] dataset, which contains dynamic lighting data. As the dataset only provides part of the ground truth, we only compute the final trajectory translation error, as shown in Table III. We similarly tested on the widely used EuRoC [31] dataset, as shown in Table IV. The results show that the hybrid optical flow method can also improve accuracy on common datasets.

我们将所提的光流方法嵌入到现代VIO系统VINS-Mono[1]中。通过替换原有的关键点提取算法和光流方法计算方法,得到了修改后的VIO系统。如图8所示,在具有动态光照数据的UMA-VI[30]数据集上进行了对比测试。由于数据集仅提供了部分地面实况,我们仅计算了最终轨迹的平移误差,如表III所示。我们同样在广泛使用的EuRoC[31]数据集上进行了测试,如表IV所示。结果显示,混合光流即使在普通数据集上也能同等提高准确性。

Fig. 8. Sequence 1 trajectory comparison on the UMA-VI [30] dataset. The starting point and end point of the sequence coincide, and it can be seen that the improved method can effectively improve the accuracy of the trajectory in the dynamic lighting scene. The two sets of pictures corresponding to the two parts with rapid accumulation of trajectory error are displayed above the curve, which verifies that the rapid change of lighting is the main source of error.

图8. UMA-VI[30]数据集中序列1的轨迹对比。序列的起始点和结束点重合,可以看出改进的方法能有效提高动态光照场景下轨迹的准确性。两组图片对应于轨迹误差快速积累的两部分,显示在曲线上方,验证了光照的快速变化是误差的主要来源。

TABLE III

TRAJECTORY ERROR COMPARISON IN INDOOR-OUTDOOR DYNAMIC ILLUMINATION CATEGORY

室内外动态光照类别中的轨迹误差对比

| Trajectory Error $\downarrow$ | VINS [1] | VINS (Ours) | ORB-SLAM3 [32] |
| --- | --- | --- | --- |
| two-floors-csc1 | 8.97 | 2.96 | lost |
| two-floors-csc2 | 17.81 | 10.67 | lost |

TABLE IV

COMPARISON OF TRAJECTORY ERRORS ON THE EUROC DATASET

EuRoC数据集上的轨迹误差对比

| Sequence | Method | ${APE}_{rot}$ | ${APE}_{trans}$ | ${RPE}_{rot}$ | ${RPE}_{trans}$ |
| --- | --- | --- | --- | --- | --- |
| V1_02 | Ours | 3.65 | 0.16 | 0.18 | 5.0e-2 |
| | VINS | 3.73 | 0.12 | 0.22 | 4.5e-2 |
| V1_03 | Ours | 2.48 | 0.08 | 0.21 | 5.7e-2 |
| | VINS | 5.56 | 0.15 | 0.25 | 6.9e-2 |
| V1_04 | Ours | 3.36 | 0.13 | 0.19 | 4.0e-2 |
| | VINS | 3.65 | 0.16 | 0.18 | 3.9e-2 |
| MH_03 | Ours | 1.48 | 0.12 | 0.10 | 6.8e-2 |
| | VINS | 1.69 | 0.19 | 0.11 | 5.9e-2 |
| MH_04 | Ours | 1.80 | 0.15 | 0.14 | 6.3e-2 |
| | VINS | 1.87 | 0.18 | 0.16 | 6.2e-2 |

E. Ablation experiment

E. 消融实验

This subsection contains two comparison experiments that validate the line peaky loss and the mNRE loss, respectively. The keypoint repeatability obtained after training with the peaky loss and with the line peaky loss is shown in Table V. During the repeatability calculations, the NMS radius was 2, the number of keypoints was 400, and the calculation was performed at the original image size. Similarly, the correct tracking ratios obtained by training the illumination-invariant feature maps with the mNRE loss and with the NRE loss, respectively, are shown in Table VI. It can be seen that mNRE can effectively improve the quality of optical flow tracking. The experimental parameters are set up in the same way as in Sec. V-C.

本小节包含两个对比实验,分别验证了线峰值损失和mNRE损失。表V展示了使用峰值损失和线峰值损失训练后获得的关键点重复性。在重复性计算中,NMS半径为2,关键点数量为400,并在原始图像尺寸上进行。同样,表VI展示了分别使用mNRE损失和NRE损失训练光照不变特征图后获得的正确跟踪百分比。可以看出,mNRE能有效提高光流跟踪的质量。实验参数设置与V-C部分相同。

TABLE V

REPEATABILITY IN HPATCHES

HPATCHES中的重复性

| Training steps | 10 | 20 | 30 | 40 | 50 |
| --- | --- | --- | --- | --- | --- |
| peaky | 0.20 | 0.22 | 0.23 | 0.22 | 0.23 |
| line peaky | 0.32 | 0.36 | 0.37 | 0.39 | 0.40 |

TABLE VI

CORRECT TRACKING RATIO

正确跟踪比率

| Scenes | Indoor | Lighting Source | Outdoor | Image Blur |
| --- | --- | --- | --- | --- |
| NRE | 0.92 | 0.84 | 0.68 | 0.44 |
| mNRE | 0.96 | 0.91 | 0.71 | 0.74 |

VI. CONCLUSIONS

VI. 结论

In this paper, we propose a hybrid sparse optical flow method that maintains the real-time performance of traditional optical flow methods while improving their robustness in dynamic lighting scenes. The basic idea of this work is that CNNs are suitable for extracting image features, while the traditional LK optical flow method is suitable for optical flow calculation; the combination of the two improves the performance of the optical flow method. To achieve this goal, we propose a lightweight network for extracting keypoints and an illumination-invariant feature map. We also propose a training process for the proposed shallow network that is assisted by a deep network, together with multiple loss functions for training. The repeatability of the proposed method is verified on the HPatches dataset, and its performance in dynamic lighting scenes is verified on multiple dynamic lighting datasets. Finally, it is embedded in a VIO system to verify its effectiveness in practical applications. This work will support the development of illumination-robust visual SLAM, with the hope of achieving robust performance in challenging environments such as caves and tunnels.

本文提出了一种混合稀疏光流方法,该方法在保持传统光流方法实时性能的同时,提高了其在动态光照场景中的鲁棒性。本工作的基本思想是CNN适合提取图像特征,而传统的LK光流方法适合进行光流计算。两者的结合提高了光流方法的性能。为实现这一目标,我们提出了一种用于提取关键点的轻量级网络和一种光照不变特征图。我们还提出了一种由深度网络辅助的浅层网络训练过程,以及用于训练网络的多种损失函数。所提方法的重复性在HPatches数据集上得到验证,其在动态光照场景中的性能在多个动态光照数据集上得到验证。最后,它被嵌入到VIO系统中,以验证其在实际应用中的有效性。这项工作将支持开发具有光照鲁棒性的视觉SLAM,希望在洞穴和隧道等挑战性环境中实现鲁棒性能。

REFERENCES

参考文献

[1] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Trans. Robotics, vol. 34, no. 4, pp. 1004-1020, 2018.

[2] M. Ferrera, A. Eudes, J. Moras, M. Sanfourche, and G. Le Besnerais, “OV2SLAM : A fully online and versatile visual SLAM for real-time applications,” IEEE Robot. Autom. Lett. (RA-L), 2021.

[3] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in Int. Joint Conf. Artif. Intell., vol. 2, 1981, pp. 674-679.

[4] J.-Y. Bouguet et al., “Pyramidal Implementation of the affine Lucas Kanade Feature Tracker Description of the Algorithm,” Intel corporation, vol. 5, no. 1-10, p. 4, 2001.

[5] N. Papenberg, A. Bruhn, T. Brox, S. Didas, and J. Weickert, “Highly accurate optic flow computation with theoretically justified warping,” Int. J. Comput. Vis., vol. 67, pp. 141-158, 2006.

[6] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “FlowNet: Learning Optical Flow with Convolutional Networks,” in IEEE Int. Conf. Comput. Vis. (ICCV), 2015, pp. 2758-2766.

[7] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artif. Intell., vol. 17, no. 1-3, pp. 185-203, 1981.

[8] H. Zimmer, A. Bruhn, and J. Weickert, “Optic flow in harmony,” Int. J. Comput. Vis., vol. 93, pp. 368-388, 2011.

[9] J.-Y. Bouguet et al., “Pyramidal Implementation of the Lucas Kanade Feature Tracker Description of the algorithm,” Intel corporation, vol. 5, no. 1-10, p. 4, 2001.

[10] T. Senst, V. Eiselein, and T. Sikora, “Robust Local Optical Flow for Feature Tracking,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 9, pp. 1377-1387, 2012.

[11] T. Senst, J. Geistert, and T. Sikora, “Robust local optical flow: Long-range motions and varying illuminations,” in IEEE Int. Conf. Image Process., 2016, pp. 4478-4482.

[12] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 2462-2470.

[13] Z. Teed and J. Deng, “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow,” in European Conference on Computer Vision, 2020, pp. 402-419.

[14] H. Xu, J. Zhang, J. Cai, H. Rezatofighi, and D. Tao, “Gmflow: Learning optical flow via global matching,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 8121-8130.

[15] C. Harris, M. Stephens et al., “A combined corner and edge detector,” in Alvey vision conference, vol. 15, no. 50, 1988, pp. 10-5244.

[16] J. Shi et al., “Good features to track,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 1994, pp. 593-600.

[17] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, pp. 91-110, 2004.

[18] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in IEEE Int. Conf. Comput. Vis. (ICCV), 2011, pp. 2564-2571.

[19] X. Zhang, F. X. Yu, S. Karaman, and S.-F. Chang, “Learning discriminative and transformation covariant local feature detectors,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 6818-6826.

[20] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 224-236.

[21] A. Barroso-Laguna, E. Riba, D. Ponsa, and K. Mikolajczyk, “Key. net: Keypoint detection by handcrafted and learned cnn filters,” in IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 5836-5844.

[22] X. Zhao, X. Wu, J. Miao, W. Chen, P. C. Chen, and Z. Li, “Alike: Accurate and lightweight keypoint detection and descriptor extraction,” IEEE Trans. Multimed., 2022.

[23] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 315-323.

[24] H. Germain, V. Lepetit, and G. Bourmaud, “Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 414-423.

[25] Z. Li and N. Snavely, “MegaDepth: Learning Single-View Depth Prediction From Internet Photos,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2018.

[26] D. P. Kingma and J. Ba, “ADAM: A method for stochastic optimization,” preprint arXiv:1412.6980, 2014.

[27] M. Trajković and M. Hedley, “Fast corner detection,” Image and vision computing, vol. 16, no. 2, pp. 75-87, 1998.

[28] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk, “HPatches: A benchmark and evaluation of handcrafted and learned local descriptors,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 5173- 5182.

[29] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.

[30] D. Zuñiga-Noël, A. Jaenal, R. Gomez-Ojeda, and J. Gonzalez-Jimenez, “The UMA-VI dataset: Visual-inertial odometry in low-textured and dynamic illumination environments,” Int. J. Rob. Res., vol. 39, no. 9, pp. 1052-1060, 2020.

[31] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The EuRoC micro aerial vehicle datasets,” Int. J. Rob. Res., vol. 35, no. 10, pp. 1157-1163, 2016.

[32] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap slam,” IEEE Trans. Robotics, vol. 37, no. 6, pp. 1874-1890, 2021.

本文由作者按照 CC BY 4.0 进行授权