Self-Critical Attention Learning for Person Re-ID

Introduction

Most attention modules are trained in a weakly-supervised manner with the final objective, for example, the supervision from the triplet loss or classification loss in the person Re-ID task.

As the supervision is not specifically designed for the attention module, the gradients from this weak supervisory signal may vanish during backpropagation.

The attention maps learned in such a manner are not always "transparent" in their meaning, and lack discriminative ability and robustness.

Redundant and misleading attention maps can hardly be corrected without a direct and appropriate supervisory signal.

The quality of the attention during training can only be evaluated qualitatively by human end-users, examining the attention maps one by one, which is labor-intensive and inefficient.


We learn the attention with a critic that measures the attention quality and provides a powerful supervisory signal to guide the learning process.

Since most effective evaluation indicators are usually non-differentiable, e.g., the gain of the attention model over the basic network, we jointly train our attention agent and critic in a reinforcement-learning manner, where the agent produces the visual attention while the critic analyzes the gain from the attention and guides the agent to maximize this gain.
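This self-critical signal can be illustrated with a minimal toy sketch (the numbers and variable names below are ours, not the paper's): the reward is 1 when attention improves the true-class probability over the basic network, the critic regresses that reward, and the agent is trained to maximize the critic's value.

```python
import numpy as np

# Toy sketch of the self-critical signal (illustrative values, not the
# authors' implementation). For each image we compare the probability the
# network assigns to the true class with and without attention.
p_base = np.array([0.60, 0.30, 0.80])   # true-class prob, basic network
p_attn = np.array([0.75, 0.25, 0.90])   # true-class prob, with attention

# Reward: 1 when attention improves the prediction, else 0.
reward = (p_attn > p_base).astype(np.float32)

# A critic V(X, A) regresses this reward (MSE), while the attention agent
# is trained to maximize the critic's value, i.e. its loss is -V.
value = np.array([0.9, 0.4, 0.7])       # critic's predicted gain (toy)
critic_loss = np.mean((value - reward) ** 2)
agent_loss = -np.mean(value)
```

In practice both losses are minimized jointly by gradient descent; the sketch only shows how the non-differentiable gain becomes a usable training signal.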

We design spatial and channel-wise attention models with our critic module.
Approach
Self-critical Attention Learning
Given the input image $I$ as the state, the feature maps $X$ extracted by the basic network $F$ are
$$
\begin{equation}
X=F(I;\psi)
\end{equation}
$$
where $\psi$ denotes the parameters of the basic network. The attention maps $A$ based on the feature maps $X$ are
$$
\begin{equation}
A=A(X;\theta)
\end{equation}
$$
where $\theta$ denotes the parameters of the attention agent $A$. The critic module $V$ is formulated as
$$
\begin{equation}
V=C(X, A;\phi)
\end{equation}
$$
where $\phi$ defines the parameters of the critic network. The classification reward
$R_c$ is defined as
$$
R_c=
\begin{cases}
1, \quad & y_i^c=y_i^p\\
0, & y_i^c\ne y_i^p
\end{cases}
$$
where $y_i^p$ denotes the label predicted from the attention-based features of person $i$, and $y_i^c$ is the ground-truth classification label. The amelioration reward
$R_a$ is defined as
$$
R_a=
\begin{cases}
1, \quad & p^k(A_i, X_i) > p^k(X_i)\\
0, & p^k(A_i, X_i) \le p^k(X_i)
\end{cases}
$$
where $p^k$ indicates the predicted probability of the ground-truth class. The final reward of the attention model
$R$ is
$R=R_c+R_a$
Attention Agent
Spatial Attention
$A^s=\sigma(W_2^s\max(0, W_1^s\overline X))$
Channel-wise Attention
$A^c=\sigma(W_2^c\max(0, W_1^cX_{pool}))$
Stacked Attention Model
We stack five attention modules on the ResNet-50 network.
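The two attention branches above can be sketched in NumPy; the layer sizes, the pooling choices (mean over channels for $\overline X$, global average pooling for $X_{pool}$), and all variable names are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Feature maps X with shape (C, H, W); r is a bottleneck reduction ratio.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
X = rng.standard_normal((C, H, W))

# Spatial attention A^s = sigmoid(W2 relu(W1 x_bar)):
# pool over channels, then a two-layer bottleneck over spatial positions.
x_bar = X.mean(axis=0).reshape(-1)             # (H*W,)
W1_s = rng.standard_normal((H * W // r, H * W))
W2_s = rng.standard_normal((H * W, H * W // r))
A_s = sigmoid(W2_s @ np.maximum(0.0, W1_s @ x_bar)).reshape(H, W)

# Channel-wise attention A^c = sigmoid(W2 relu(W1 x_pool)):
# global average pooling over space, then a squeeze-and-excitation
# style bottleneck over channels.
x_pool = X.mean(axis=(1, 2))                   # (C,)
W1_c = rng.standard_normal((C // r, C))
W2_c = rng.standard_normal((C, C // r))
A_c = sigmoid(W2_c @ np.maximum(0.0, W1_c @ x_pool))

# The attended features reweight X along both dimensions.
X_att = X * A_s[None, :, :] * A_c[:, None, None]
```

Both attention maps lie in $(0, 1)$ because of the sigmoid, so they act as soft masks over the feature maps.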
Optimization
Triplet loss
$J_{tri}(\psi, \theta)=\frac1N\sum_{i=1}^N[\|f_i-f_i^+\|_2^2-\|f_i-f_i^-\|_2^2+m]_+$
Cross-entropy loss with the label-smoothing regularization [1]
$J_{cls}(\psi, \theta)=-\frac1N\sum_{i=1}^N\sum_{k=1}^K\log(p_i^k)\left((1-\epsilon)y_i^k+\frac{\epsilon}{K}\right)$
Since the classification loss is sensitive to the scales of features, we add a batch-norm (BN) layer before the classification loss to normalize the scales.
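The two supervised losses can be sketched in NumPy (a minimal sketch; the function names, margin $m$, and smoothing factor $\epsilon$ values are illustrative defaults, not values taken from the paper).

```python
import numpy as np

def triplet_loss(f, f_pos, f_neg, m=0.3):
    # Hinge on squared distances: [d(anchor,pos) - d(anchor,neg) + m]_+
    d_pos = np.sum((f - f_pos) ** 2, axis=1)
    d_neg = np.sum((f - f_neg) ** 2, axis=1)
    return np.mean(np.maximum(d_pos - d_neg + m, 0.0))

def smoothed_ce(logits, labels, eps=0.1):
    # Label smoothing: the one-hot target becomes (1 - eps)*y + eps/K.
    K = logits.shape[1]
    z = logits - logits.max(axis=1, keepdims=True)   # stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    y = np.eye(K)[labels]
    target = (1.0 - eps) * y + eps / K
    return -np.mean(np.sum(target * log_p, axis=1))

rng = rng = np.random.default_rng(0)
J_tri = triplet_loss(rng.standard_normal((4, 8)),
                     rng.standard_normal((4, 8)),
                     rng.standard_normal((4, 8)))
J_cls = smoothed_ce(rng.standard_normal((4, 5)), np.array([0, 1, 2, 3]))
```

With $\epsilon = 0$ the smoothed cross-entropy reduces to the ordinary one-hot cross-entropy; the smoothing term spreads $\epsilon$ of the target mass uniformly over all $K$ classes.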
The critic loss
$J_{cri}(\theta)=-V_{\phi}^{A_{\theta}}(X, A)$
The mean squared error (MSE) loss
$J_{mse}(\phi)=\left(V_{\phi}^{A_{\theta}}(X, A)-R\right)^2$
Reference
[1] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016. [link]
If you enjoyed this article, feel free to follow my personal blog.
