Self-Critical Attention Learning for Person Re-ID

Introduction

Most attention modules are trained in a weakly-supervised manner with the final objective, for example, the supervision from the triplet loss or classification loss in the person Re-ID task.

As the supervision is not specifically designed for the attention module, the gradients from this weak supervisory signal may vanish during backpropagation.

The attention maps learned in such a manner are not always "transparent" in their meaning, and lack discriminative ability and robustness.

Redundant and misleading attention maps can hardly be corrected without a direct and appropriate supervisory signal.

The quality of the attention during training can only be evaluated qualitatively by human end-users, examining the attention maps one by one, which is labor-intensive and inefficient.


We learn the attention with a critic that measures the attention quality and provides a powerful supervisory signal to guide the learning process.

Since most effective evaluation indicators are usually non-differentiable, e.g., the gain of the attention model over the basic network, we jointly train our attention agent and critic in a reinforcement-learning manner, where the agent produces the visual attention while the critic analyzes the gain from the attention and guides the agent to maximize this gain.
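This self-critical signal can be illustrated with a minimal toy sketch (the numbers and variable names below are ours, not the paper's): the reward is 1 when attention improves the true-class probability over the basic network, the critic regresses that reward, and the agent is trained to maximize the critic's value.

```python
import numpy as np

# Toy sketch of the self-critical signal (illustrative values, not the
# authors' implementation). For each image we compare the probability the
# network assigns to the true class with and without attention.
p_base = np.array([0.60, 0.30, 0.80])   # true-class prob, basic network
p_attn = np.array([0.75, 0.25, 0.90])   # true-class prob, with attention

# Reward: 1 when attention improves the prediction, else 0.
reward = (p_attn > p_base).astype(np.float32)

# A critic V(X, A) regresses this reward (MSE), while the attention agent
# is trained to maximize the critic's value, i.e. its loss is -V.
value = np.array([0.9, 0.4, 0.7])       # critic's predicted gain (toy)
critic_loss = np.mean((value - reward) ** 2)
agent_loss = -np.mean(value)
```

In practice both losses are minimized jointly by gradient descent; the sketch only shows how the non-differentiable gain becomes a usable training signal.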

We design spatial and channel-wise attention models with our critic module.
Approach
Self-critical Attention Learning
Given the input image $I$ as the state, the feature maps $X$ extracted by the basic network $F$ are
$$
\begin{equation}
X=F(I;\psi)
\end{equation}
$$
where $\psi$ denotes the parameters of the basic network. The attention maps $A$ based on the feature maps $X$ are
$$
\begin{equation}
A=A(X;\theta)
\end{equation}
$$
where $\theta$ denotes the parameters of the attention agent $A$. The critic module $V$ is formulated as
$$
\begin{equation}
V=C(X, A;\phi)
\end{equation}
$$
where $\phi$ defines the parameters of the critic network. The classification reward
$R_c$ is defined as
$$
R_c=
\begin{cases}
1, \quad & y_i^c=y_i^p\\
0, & y_i^c\ne y_i^p
\end{cases}
$$
where $y_i^p$ denotes the label predicted from the attention-based features of person $i$, and $y_i^c$ is the ground-truth classification label. The amelioration reward
$R_a$ is defined as
$$
R_a=
\begin{cases}
1, \quad & p^k(A_i, X_i) > p^k(X_i)\\
0, & p^k(A_i, X_i) \le p^k(X_i)
\end{cases}
$$
where $p^k$ indicates the predicted probability of the ground-truth class. The final reward of the attention model
$R$ is
$R=R_c+R_a$
Attention Agent
Spatial Attention
$A^s=\sigma(W_2^s\max(0, W_1^s\overline X))$
Channel-wise Attention
$A^c=\sigma(W_2^c\max(0, W_1^cX_{pool}))$
Stacked Attention Model
We stack five attention modules on the ResNet-50 network.
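The two attention branches above can be sketched in NumPy; the layer sizes, the pooling choices (mean over channels for $\overline X$, global average pooling for $X_{pool}$), and all variable names are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Feature maps X with shape (C, H, W); r is a bottleneck reduction ratio.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
X = rng.standard_normal((C, H, W))

# Spatial attention A^s = sigmoid(W2 relu(W1 x_bar)):
# pool over channels, then a two-layer bottleneck over spatial positions.
x_bar = X.mean(axis=0).reshape(-1)             # (H*W,)
W1_s = rng.standard_normal((H * W // r, H * W))
W2_s = rng.standard_normal((H * W, H * W // r))
A_s = sigmoid(W2_s @ np.maximum(0.0, W1_s @ x_bar)).reshape(H, W)

# Channel-wise attention A^c = sigmoid(W2 relu(W1 x_pool)):
# global average pooling over space, then a squeeze-and-excitation
# style bottleneck over channels.
x_pool = X.mean(axis=(1, 2))                   # (C,)
W1_c = rng.standard_normal((C // r, C))
W2_c = rng.standard_normal((C, C // r))
A_c = sigmoid(W2_c @ np.maximum(0.0, W1_c @ x_pool))

# The attended features reweight X along both dimensions.
X_att = X * A_s[None, :, :] * A_c[:, None, None]
```

Both attention maps lie in $(0, 1)$ because of the sigmoid, so they act as soft masks over the feature maps.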
Optimization
Triplet loss
$J_{tri}(\psi, \theta)=\frac1N\sum_{i=1}^N[\|f_i-f_i^+\|_2^2-\|f_i-f_i^-\|_2^2+m]_+$
Cross-entropy loss with the label-smoothing regularization [1]
$J_{cls}(\psi, \theta)=-\frac1N\sum_{i=1}^N\sum_{k=1}^K\log(p_i^k)\left((1-\epsilon)y_i^k+\frac{\epsilon}{K}\right)$
Since the classification loss is sensitive to the scales of features, we add a batch-norm (BN) layer before the classification loss to normalize the scales.
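The two supervised losses can be sketched in NumPy (a minimal sketch; the function names, margin $m$, and smoothing factor $\epsilon$ values are illustrative defaults, not values taken from the paper).

```python
import numpy as np

def triplet_loss(f, f_pos, f_neg, m=0.3):
    # Hinge on squared distances: [d(anchor,pos) - d(anchor,neg) + m]_+
    d_pos = np.sum((f - f_pos) ** 2, axis=1)
    d_neg = np.sum((f - f_neg) ** 2, axis=1)
    return np.mean(np.maximum(d_pos - d_neg + m, 0.0))

def smoothed_ce(logits, labels, eps=0.1):
    # Label smoothing: the one-hot target becomes (1 - eps)*y + eps/K.
    K = logits.shape[1]
    z = logits - logits.max(axis=1, keepdims=True)   # stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    y = np.eye(K)[labels]
    target = (1.0 - eps) * y + eps / K
    return -np.mean(np.sum(target * log_p, axis=1))

rng = rng = np.random.default_rng(0)
J_tri = triplet_loss(rng.standard_normal((4, 8)),
                     rng.standard_normal((4, 8)),
                     rng.standard_normal((4, 8)))
J_cls = smoothed_ce(rng.standard_normal((4, 5)), np.array([0, 1, 2, 3]))
```

With $\epsilon = 0$ the smoothed cross-entropy reduces to the ordinary one-hot cross-entropy; the smoothing term spreads $\epsilon$ of the target mass uniformly over all $K$ classes.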
The critic loss
$J_{cri}(\theta)=-V_{\phi}^{A_{\theta}}(X, A)$
The mean squared error (MSE) loss
$J_{mse}(\phi)=\left(V_{\phi}^{A_{\theta}}(X, A)-R\right)^2$
Reference
[1] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016. [link]
If you enjoyed this article, feel free to follow my personal blog.
