Deep Reinforcement Active Learning

  • Deep Reinforcement Active Learning for Human-In-The-Loop Person Re-Identification



    Most existing supervised person Re-ID approaches employ a train-once-and-deploy scheme, i.e., a large amount of pre-labelled data is fed into the training phase all at once.

    However, this assumption rarely holds in practice:

    1. Pairwise pedestrian data is prohibitively expensive to collect, since it is unlikely that a large number of pedestrians will reappear in other camera views.

    2. The increasing number of camera views amplifies the difficulty of searching for the same person across multiple camera views.


    1. Unsupervised learning algorithms

      Unsupervised learning based Re-ID models are inherently weaker compared to supervised learning based models, compromising Re-ID effectiveness in any practical deployment.

    2. Semi-supervised learning scheme

      These models are still based on a strong assumption that parts of the identities (e.g. one third of the training set) are fully labelled for every camera view.

    -> Reinforcement Learning + Active Learning:

    human-in-the-loop (HITL) model learning process [1]

    A step-by-step sequential active learning process is adopted, exploring selective human annotation of a much smaller pool of samples for model learning.

    The data cumulatively labelled by human binary verification is used to update model training, improving Re-ID performance.

    Such an approach to model learning is naturally suited for reinforcement learning together with active learning, the focus of this work.




    Base CNN Network


    A Deep Reinforced Active Learner - An Agent


    As each query instance arrives, we treat its $n_s$-nearest neighbours as the unlabelled gallery pool.


    Each action in the action set selects an instance from the unlabelled gallery pool.


    Once $A_t=g_k$ is performed, the agent is unable to choose it again in subsequent steps.

    The termination criterion of this process depends on a pre-defined $K_{max}$, which restricts the maximum annotation amount for each query anchor.



    At each discrete time step $t$, the environment provides an observation state $S_t$ which reveals the instances' relationships, and receives a response from the agent in the form of a selected action $A_t$:

    • Mahalanobis distance

    $$d(x, y)=\mbox{Mahalanobis}(x, y)=(x-y)^T\Sigma^{-1}(x-y)$$
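A minimal sketch of this distance, assuming a precomputed feature covariance matrix $\Sigma$ is available as an input (note the formula above is the squared form, with no square root):

```python
import numpy as np

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance (x-y)^T Sigma^{-1} (x-y).
    `cov` is an assumed feature-covariance input."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ np.linalg.inv(cov) @ diff)
```

With `cov` equal to the identity matrix, this reduces to the squared Euclidean distance.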

    • The k-reciprocal neighbors

    $$R(n_i, k)=\lbrace n_j\mid(n_i\in N(n_j, k))\land(n_j\in N(n_i, k))\rbrace$$

    where $N(n_i, k)$ is the set of top $k$-nearest neighbours of $n_i$.
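The mutual-neighbour test can be sketched as follows, given a square pairwise distance matrix (the `dist` input and helper name are illustrative, not from the paper):

```python
import numpy as np

def k_reciprocal(dist, i, k):
    """k-reciprocal neighbours of sample i: j qualifies iff i and j
    are each within the other's top-k nearest neighbours, under the
    pairwise distance matrix `dist` (an assumed input)."""
    def topk(a):
        order = np.argsort(dist[a])
        return set(order[order != a][:k].tolist())  # exclude a itself
    return sorted(j for j in topk(i) if i in topk(j))
```

The symmetry requirement makes this set much stricter than plain k-nearest neighbours: a distant outlier may list a cluster member among its neighbours without the reverse holding.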

    • Sparse similarity graph

    The similarity value between every two samples $i, j\ (i\ne j)$:

    $$\mbox{Sim}(i, j)=\begin{cases}1-\frac{d(i, j)}{\max_{i, j\in q, g}d(i, j)},&\mbox{if}\quad j\in R(i, k)\\0,&\mbox{otherwise}\end{cases}$$

    i.e., the similarity of node $n_i$ to its k-reciprocal neighbours is retained, and all other entries are set to zero.
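A sketch of building this sparse graph, assuming the k-reciprocal sets have already been computed (the `reciprocal` dictionary is an illustrative input, not the paper's data structure):

```python
import numpy as np

def sparse_similarity(dist, reciprocal):
    """Sparse similarity graph: Sim(i,j) = 1 - d(i,j)/max d when j is a
    k-reciprocal neighbour of i, else 0. `reciprocal[i]` holds the
    k-reciprocal neighbours of i (an assumed input)."""
    n = dist.shape[0]
    sim = np.zeros((n, n))
    d_max = dist.max()  # global normaliser over query and gallery
    for i in range(n):
        for j in reciprocal[i]:
            sim[i, j] = 1.0 - dist[i, j] / d_max
    return sim
```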

    For a state $S_t$ at time $t$, the optimal action $A_t=g_k$ is selected via the policy network, indicating that the $k$-th instance in the unlabelled gallery pool is to be annotated by the human oracle, who replies with binary feedback (true or false) against the query:

    1. True Match:



    $$\mbox{Sim}(q, g_k)=1$$

    $$\mbox{Sim}(q, g_i)=\frac12[\mbox{Sim}(q, g_i)+\mbox{Sim}(g_k, g_i)]$$

    2. False Match:



    $$\mbox{Sim}(q, g_k)=0$$

    $$\mbox{Sim}(q, g_i)=\begin{cases}\mbox{Sim}(q, g_i)-\mbox{Sim}(g_k, g_i),&\mbox{if}\quad\mbox{Sim}(q, g_i)\gt\mbox{Sim}(g_k, g_i)\\0,&\mbox{otherwise}\end{cases}$$

    These updates pull positive samples closer to the query and push negative samples further away.

    The k-reciprocal operation is then applied again, and a renewed state $S_{t+1}$ is obtained.
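The two feedback updates above can be sketched as follows (a hypothetical helper; setting $\mbox{Sim}(q, g_k)$ itself to 1 or 0 is assumed to happen separately):

```python
import numpy as np

def update_similarity(sim_q, sim_k, is_true_match):
    """Update query similarities after the oracle labels candidate g_k.
    sim_q[i] = Sim(q, g_i); sim_k[i] = Sim(g_k, g_i). Illustrative
    sketch of the true-/false-match update rules."""
    if is_true_match:
        # pull: average the query's similarities with those of the
        # confirmed match g_k
        return 0.5 * (sim_q + sim_k)
    # push: subtract where the query is more similar to g_i than g_k is,
    # zero out the rest
    return np.where(sim_q > sim_k, sim_q - sim_k, 0.0)
```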

    This process repeats until the maximum annotation amount $K_{max}$ for each query is exhausted.


    Loss for policy $\pi$

    We use data uncertainty as the objective of the reinforcement learning policy: higher uncertainty indicates that a sample is harder to distinguish.

    $$R_t=\left[m+y_k^t\left(\max_{x_i\in X_p^t}d(x_i, g_k)-\min_{x_j\in X_n^t}d(x_j, g_k)\right)\right]_+$$
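A sketch of this hinge-style reward, assuming $y_k^t\in\{+1,-1\}$ encodes the oracle's binary feedback and $X_p^t$, $X_n^t$ are the labelled positive and negative sets at step $t$ (the parameter names below are illustrative):

```python
def reward(m, y_k, d_pos, d_neg):
    """Hinge reward R_t = [m + y_k * (max d(x_p, g_k) - min d(x_n, g_k))]_+.
    d_pos / d_neg: distances from g_k to the labelled positives /
    negatives; m: margin; y_k: +1/-1 oracle feedback (assumed coding)."""
    margin = m + y_k * (max(d_pos) - min(d_neg))
    return max(margin, 0.0)  # [.]_+ truncates at zero
```

Under this coding, a larger reward is given to harder (more uncertain) selections, whose distance margin violates the hinge.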

    All the future rewards $(R_{t+1}, R_{t+2}, \cdots)$:

    $$Q^*=\max_{\pi}E[R_t+\gamma R_{t+1}+\gamma^2R_{t+2}+\cdots\mid\pi, S_t, A_t]$$

    CNN Network Updating

    Once enough pairwise labelled data has been obtained, the CNN parameters can be updated via a triplet loss function, which in turn generates a new initial state for incoming data. By iteratively alternating sample selection and CNN refreshing, the proposed algorithm improves quickly.
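A plain-numpy sketch of the standard triplet loss on embedding vectors (the margin value is illustrative; a real update would backpropagate this loss through the CNN):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss [||a-p||^2 - ||a-n||^2 + m]_+ on embedding
    vectors, used here to refresh the CNN from verified pairs."""
    d_ap = np.sum((anchor - positive) ** 2)  # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)  # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)
```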

    This process terminates when every image in the training data pool has been visited once by our DRAL agent.


    • The key task for the model design becomes how to select more informative samples at a fixed annotation cost.

    • The DRAL method removes the restriction of pre-labelling and keeps the model upgrading with progressively collected data.


    [1] Hanxiao Wang, Shaogang Gong, Xiatian Zhu, and Tao Xiang. Human-in-the-loop person re-identification. In ECCV, 2016. [link]




