Self-Critical Attention Learning For Person Re-ID



  • Self-Critical Attention Learning for Person Re-Identification

    Paper link:
    http://openaccess.thecvf.com/content_ICCV_2019/html/Chen_Self-Critical_Attention_Learning_for_Person_Re-Identification_ICCV_2019_paper.html

    Introduction

    • Most attention modules are trained in a weakly-supervised manner with only the final objective, for example, the supervision from the triplet loss or classification loss in the person ReID task.

      1. As the supervision is not specifically designed for the attention module, the gradients from this weak supervisory signal might vanish during back-propagation.

      2. The attention maps learned in such a manner are not always "transparent" in their meaning, and may lack discriminative ability and robustness.

      3. Redundant and misleading attention maps can hardly be corrected without a direct and appropriate supervisory signal.

      4. The quality of the attention during training can only be evaluated qualitatively by human end-users, examining the attention maps one by one, which is labor-intensive and inefficient.

    1. We learn the attention with a critic which measures the attention quality and provides a powerful supervisory signal to guide the learning process.

    2. Since most effective evaluation indicators are usually non-differentiable, e.g., the gain of the attention model over the basic network, we jointly train our attention agent and critic in a reinforcement learning manner, where the agent produces the visual attention while the critic analyzes the gain from the attention and guides the agent to maximize this gain.

    3. We design spatial- and channel-wise attention models with our critic module.

    Fig1

    Approach

    Self-critical Attention Learning

    • Given the input image $I$ as the state, the feature maps $X$ extracted by the basic network $F$ are

    $$
    \begin{equation}
    X=F(I|\psi)
    \end{equation}
    $$

    where $\psi$ denotes the parameters of the basic network.
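
    A minimal PyTorch sketch of this step, assuming a torchvision ResNet-50 backbone truncated before its pooling and classification layers (the truncation point and input size are illustrative assumptions, not specified in the notes):

    ```python
    import torch
    import torch.nn as nn
    from torchvision import models

    # Basic network F: ResNet-50 without the global pooling / fc head,
    # so that it outputs spatial feature maps X = F(I | psi).
    backbone = models.resnet50(weights=None)   # recent torchvision API
    F = nn.Sequential(*list(backbone.children())[:-2])

    I = torch.randn(4, 3, 256, 128)   # a batch of person images (N, C, H, W)
    X = F(I)                          # feature maps X, shape (4, 2048, 8, 4)
    ```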

    • The attention maps $A$ based on the feature maps $X$ are

    $$
    \begin{equation}
    A=A(X|\theta)
    \end{equation}
    $$

    where $\theta$ denotes the parameters of the attention agent $A$.

    • The value $V$ produced by the critic module is formulated as

    $$
    \begin{equation}
    V=C(X, A|\phi)
    \end{equation}
    $$

    where $\phi$ denotes the parameters of the critic network $C$.
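
    A hedged sketch of the critic: the pooling-plus-MLP architecture and the way $X$ and $A$ are combined here are assumptions made for illustration, not the paper's exact design.

    ```python
    import torch
    import torch.nn as nn

    class Critic(nn.Module):
        """Critic C(X, A | phi): predicts a scalar value V for the attention A."""
        def __init__(self, channels=2048):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.mlp = nn.Sequential(
                nn.Linear(2 * channels, 512),
                nn.ReLU(inplace=True),
                nn.Linear(512, 1),
            )

        def forward(self, X, A):
            # Summarize the plain and the attended feature maps, then regress
            # a value estimating the expected reward of attention A on state X.
            x = self.pool(X).flatten(1)
            xa = self.pool(A * X).flatten(1)
            return self.mlp(torch.cat([x, xa], dim=1)).squeeze(1)
    ```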

    • The classification reward $R_c$

    $$
    \begin{equation}
    R_c=
    \begin{cases}
    1, & y_i^c=y_i^p\\
    0, & y_i^c\ne y_i^p
    \end{cases}
    \end{equation}
    $$

    where $y_i^p$ denotes the label predicted from the attention-based features of person $i$ and $y_i^c$ is the ground-truth classification label.

    • The amelioration reward $R_a$

    $$
    \begin{equation}
    R_a=
    \begin{cases}
    1, & p^k(A_i, X_i) > p^k(X_i)\\
    0, & p^k(A_i, X_i) \le p^k(X_i)
    \end{cases}
    \end{equation}
    $$

    where $p^k$ indicates the predicted probability of the true class.

    • The final reward of the attention model $R$

    $$
    \begin{equation}
    R=R_c+R_a
    \end{equation}
    $$
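
    A small sketch of how the two rewards could be computed from classifier outputs; `logits_att` / `logits_base` (predictions with and without attention) are assumed variable names:

    ```python
    import torch

    def compute_reward(logits_att, logits_base, labels):
        p_att = logits_att.softmax(dim=1)
        p_base = logits_base.softmax(dim=1)
        # R_c: 1 if the attention-based prediction matches the ground truth.
        R_c = (p_att.argmax(dim=1) == labels).float()
        # R_a: 1 if attention increases the probability p^k of the true class
        # compared with the basic network alone.
        p_true_att = p_att.gather(1, labels.unsqueeze(1)).squeeze(1)
        p_true_base = p_base.gather(1, labels.unsqueeze(1)).squeeze(1)
        R_a = (p_true_att > p_true_base).float()
        return R_c + R_a   # final reward R
    ```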

    Fig2

    Attention Agent

    Fig3

    Spatial Attention

    $$
    \begin{equation}
    A^s=\sigma(W_2^s\max(0, W_1^s\overline{X}))
    \end{equation}
    $$
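
    One possible reading of this spatial attention in PyTorch, taking $\overline{X}$ as the channel-wise mean of the feature maps and $W_1^s$, $W_2^s$ as $1\times1$ convolutions (both are assumptions about details the notes do not spell out):

    ```python
    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        def __init__(self, hidden=16):
            super().__init__()
            self.w1 = nn.Conv2d(1, hidden, kernel_size=1)
            self.w2 = nn.Conv2d(hidden, 1, kernel_size=1)

        def forward(self, X):
            x_bar = X.mean(dim=1, keepdim=True)                         # (N, 1, H, W)
            return torch.sigmoid(self.w2(torch.relu(self.w1(x_bar))))  # A^s
    ```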

    Channel-wise Attention

    $$
    \begin{equation}
    A^c=\sigma(W_2^c\max(0, W_1^cX_{pool}))
    \end{equation}
    $$
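
    A sketch of the channel-wise attention, reading $X_{pool}$ as globally average-pooled features and using an SE-style bottleneck; the reduction ratio is an assumed hyper-parameter:

    ```python
    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        def __init__(self, channels=2048, reduction=16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.w1 = nn.Linear(channels, channels // reduction)
            self.w2 = nn.Linear(channels // reduction, channels)

        def forward(self, X):
            x_pool = self.pool(X).flatten(1)                            # (N, C)
            A_c = torch.sigmoid(self.w2(torch.relu(self.w1(x_pool))))
            return A_c.unsqueeze(-1).unsqueeze(-1)                      # (N, C, 1, 1)
    ```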

    Stacked Attention Model

    We stack five attention models on the ResNet-50 network.
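
    A rough sketch of how one stage of the stacked model could be assembled from the spatial and channel attention modules sketched above; which five locations in ResNet-50 receive attention is an assumption made here for illustration:

    ```python
    import torch.nn as nn

    class AttendedStage(nn.Module):
        """Wraps one ResNet stage and re-weights its output by both attentions."""
        def __init__(self, stage, channels):
            super().__init__()
            self.stage = stage
            self.spatial = SpatialAttention()          # from the sketch above
            self.channel = ChannelAttention(channels)  # from the sketch above

        def forward(self, x):
            X = self.stage(x)
            return X * self.spatial(X) * self.channel(X)
    ```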

    Optimization

    Triplet loss

    $$
    \begin{equation}
    J_{tri}(\psi, \theta)=\frac{1}{N}\sum_{i=1}^N\left[\|f_i-f_i^+\|_2^2-\|f_i-f_i^-\|_2^2+m\right]_+
    \end{equation}
    $$
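
    A minimal sketch of this triplet loss; `f`, `f_pos`, `f_neg` are assumed to be anchor / positive / negative embeddings, and the margin value is illustrative:

    ```python
    import torch
    import torch.nn.functional as F

    def triplet_loss(f, f_pos, f_neg, margin=0.3):
        d_pos = (f - f_pos).pow(2).sum(dim=1)          # ||f_i - f_i^+||_2^2
        d_neg = (f - f_neg).pow(2).sum(dim=1)          # ||f_i - f_i^-||_2^2
        return F.relu(d_pos - d_neg + margin).mean()   # [.]_+ averaged over N
    ```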

    Cross-entropy loss with label smoothing regularization [1]

    $$
    \begin{equation}
    J_{cls}(\psi, \theta)=-\frac{1}{N}\sum_{i=1}^N\sum_{k=1}^K\log(p_i^k)\left((1-\epsilon)y_i^k+\frac{\epsilon}{K}\right)
    \end{equation}
    $$

    Since the classification loss is sensitive to the scale of the features, a batch-norm (BN) layer is added before the classification loss to normalize the scales.
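
    A sketch combining the label-smoothed cross-entropy with the BN layer placed before the classifier; the feature dimension, class count, and $\epsilon$ are assumed values:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BNNeckClassifier(nn.Module):
        def __init__(self, feat_dim=2048, num_classes=751, epsilon=0.1):
            super().__init__()
            self.bn = nn.BatchNorm1d(feat_dim)     # normalizes feature scales
            self.fc = nn.Linear(feat_dim, num_classes, bias=False)
            self.epsilon = epsilon
            self.num_classes = num_classes

        def forward(self, feats, labels):
            logits = self.fc(self.bn(feats))
            log_p = F.log_softmax(logits, dim=1)
            # Smoothed targets: (1 - eps) * y_i^k + eps / K.
            target = torch.full_like(log_p, self.epsilon / self.num_classes)
            target.scatter_(1, labels.unsqueeze(1),
                            1 - self.epsilon + self.epsilon / self.num_classes)
            return -(target * log_p).sum(dim=1).mean()
    ```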

    The critic loss

    $$
    \begin{equation}
    J_{cri}(\theta)=-V_{\phi}^{A_{\theta}}(X, A)
    \end{equation}
    $$

    The Mean Square Error (MSE) loss

    $$
    \begin{equation}
    J_{mse}(\phi)=\left(V_{\phi}^{A_{\theta}}(X, A)-R\right)^2
    \end{equation}
    $$
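
    A hedged sketch of the two losses in the actor-critic update, using the Critic sketched earlier; the detach points and the use of batch means are assumptions:

    ```python
    import torch

    def actor_critic_losses(critic, X, A, R):
        # J_mse (updates phi): regress the critic's value onto the observed reward.
        V = critic(X.detach(), A.detach())
        J_mse = (V - R).pow(2).mean()

        # J_cri (updates theta): the agent maximizes the critic's value,
        # i.e. minimizes its negative.
        J_cri = -critic(X, A).mean()
        return J_mse, J_cri
    ```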

    Alg1

    Reference

    [1] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
