Deep Reinforcement Active Learning

Deep Reinforcement Active Learning for Human-in-the-Loop Person Re-Identification
Introduction
Most existing supervised person Re-ID approaches employ a train-once-and-deploy scheme, i.e., a large amount of pre-labelled data is fed into the training phase all at once.
However, in practice this assumption rarely holds:

Pairwise pedestrian data is prohibitively expensive to collect, since it is unlikely that a large number of pedestrians reappear in other camera views.

The increasing number of camera views amplifies the difficulty of searching for the same person across multiple camera views.
Solutions:

Unsupervised learning algorithms
Unsupervised Re-ID models are inherently weaker than supervised ones, which compromises Re-ID effectiveness in any practical deployment.

Semi-supervised learning scheme
These models still rest on the strong assumption that part of the identities (e.g. one third of the training set) are fully labelled for every camera view.

Reinforcement Learning + Active Learning:
human-in-the-loop (HITL) model learning process [1]
A step-by-step sequential active learning process is adopted, in which a human selectively annotates a much smaller pool of samples for model learning.
The labelled data accumulated through human binary verification are used to update the model and improve Re-ID performance.
Such an approach to model learning is naturally suited to reinforcement learning combined with active learning, which is the focus of this work.
Methodology
Base CNN Network
$L_{total}=L_{crossentropy}+L_{triplet}$
A Deep Reinforced Active Learner: An Agent
As each query instance arrives, we take its $n_s$ nearest neighbors as the unlabelled gallery pool.
Action
The action set consists of selecting an instance from the unlabelled gallery pool according to the policy $\pi(A_t\mid S_t)$.
Once $A_t=g_k$ is performed, the agent cannot choose it again in subsequent steps.
The process terminates at a predefined $K_{max}$, which caps the number of annotations for each query anchor.
State
At each discrete time step $t$, the environment provides an observation state $S_t$, which reveals the relationships among the instances, and receives a response from the agent in the form of a selected action $A_t$. The state is built from the following:
Mahalanobis distance
$$d(x, y)=\mbox{Mahalanobis}(x, y)=(x-y)^T\Sigma^{-1}(x-y)$$
The k-reciprocal neighbors
$R(n_i, k)=\lbrace n_j\mid(n_i\in N(n_j, k))\land(n_j\in N(n_i, k))\rbrace$
where $N(n_i, k)$ is the set of the top-$k$ nearest neighbors of $n_i$.
Sparse similarity graph
The similarity value between every two samples $i,j\ (i\ne j)$ is
$$
\mbox{Sim}(i, j)=
\begin{cases}
1-\frac{d(i, j)}{\max_{i, j\in q, g}d(i, j)},\quad&\mbox{if}\quad j\in R(i, k)\\
0,&\mbox{otherwise}
\end{cases}
$$
That is, the similarity value of a node $n_i$ is retained for its k-reciprocal neighbors and set to zero otherwise.
For a state $S_t$ at time $t$, the optimal action $A_t=g_k$ is selected via the policy network. It designates the $k$-th instance in the unlabelled gallery pool for annotation by the human oracle, who replies with binary feedback, true or false, against the query:
True Match:
$y_k^t=1$, then
$$\mbox{Sim}(q, g_k)=1$$
$$\mbox{Sim}(q, g_i)=\frac12[\mbox{Sim}(q, g_i)+\mbox{Sim}(g_k, g_i)]$$
False Match:
$y_k^t=-1$, then
$$\mbox{Sim}(q, g_k)=0$$
$$
\mbox{Sim}(q, g_i)=
\begin{cases}
\mbox{Sim}(q, g_i)-\mbox{Sim}(g_k, g_i),\quad&\mbox{if}\quad\mbox{Sim}(q, g_i)\gt\mbox{Sim}(g_k, g_i)\\
0,&\mbox{otherwise}
\end{cases}
$$
These updates shrink the distances among positives and enlarge the distances among negatives.
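The graph construction and the two feedback rules can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the pairwise Mahalanobis distances are taken as a precomputed square matrix `dist`, all function names are hypothetical, and only the query's row of the similarity matrix is updated.

```python
def top_k(dist, i, k):
    """N(i, k): indices of the k nearest neighbours of node i, excluding i."""
    others = (j for j in range(len(dist)) if j != i)
    return set(sorted(others, key=lambda j: dist[i][j])[:k])


def k_reciprocal(dist, i, k):
    """R(i, k): nodes j such that j is in N(i, k) and i is in N(j, k)."""
    return {j for j in top_k(dist, i, k) if i in top_k(dist, j, k)}


def similarity_graph(dist, k):
    """Sim(i, j) = 1 - d(i, j) / max d if j is in R(i, k), otherwise 0."""
    n = len(dist)
    d_max = max(dist[i][j] for i in range(n) for j in range(n) if i != j)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in k_reciprocal(dist, i, k):
            sim[i][j] = 1.0 - dist[i][j] / d_max
    return sim


def apply_feedback(sim, q, g_k, is_match):
    """Return a copy of sim whose row q reflects the oracle's answer on (q, g_k).

    Assumption: only row q is rewritten; mirroring into column q is omitted.
    """
    new = [row[:] for row in sim]
    for g_i in range(len(sim)):
        if g_i in (q, g_k):
            continue
        if is_match:
            # True match (y = 1): average the query's and the match's similarity to g_i.
            new[q][g_i] = 0.5 * (sim[q][g_i] + sim[g_k][g_i])
        elif sim[q][g_i] > sim[g_k][g_i]:
            # False match (y = -1): subtract the negative's similarity to g_i.
            new[q][g_i] = sim[q][g_i] - sim[g_k][g_i]
        else:
            new[q][g_i] = 0.0
    new[q][g_k] = 1.0 if is_match else 0.0
    return new
```

In the method itself the k-reciprocal filtering is applied again after each update; the sketch leaves that step to the caller.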
The k-reciprocal operation is applied again afterwards, yielding a renewed state $S_{t+1}$.
This repeats until the maximum annotation amount $K_{max}$ for each query is exhausted.
Reward
Loss for the policy $\pi$
We use data uncertainty as the objective function of the reinforcement learning policy: higher uncertainty indicates that the sample is harder to distinguish.
$$R_t=[m+y_k^t(\max_{x_i\in X_p^t}d(x_i, g_k)-\min_{x_j\in X_n^t}d(x_j, g_k))]_+$$
All the future rewards $(R_{t+1}, R_{t+2}, \cdots)$ are discounted by a factor $\gamma$, and the optimal action value is
$$Q^*=\max_{\pi}E[R_t+\gamma R_{t+1}+\gamma^2R_{t+2}+\cdots\mid\pi, S_t, A_t]$$
CNN Network Updating
Once enough pairwise labelled data have been collected, the CNN parameters can be updated with the triplet loss, which in turn generates a new initial state for incoming data. By iterating between sample selection and CNN refreshing, the proposed algorithm improves quickly.
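As a rough illustration of the loss driving this update, the sketch below forms triplets from one query and its human-verified positives and negatives, then evaluates a hinge triplet loss on them. The squared Euclidean distance, the margin value, and all function names are assumptions made for the example; the actual update backpropagates this kind of loss through the CNN rather than evaluating it on fixed vectors.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors (an assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss: max(0, margin + d(anchor, positive) - d(anchor, negative))."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, margin + gap)


def build_triplets(query, positives, negatives):
    """Pair every verified positive with every verified negative for one query."""
    return [(query, p, n) for p in positives for n in negatives]


def mean_triplet_loss(triplets, margin=0.3):
    """Average triplet loss over a batch; returns 0.0 for an empty batch."""
    if not triplets:
        return 0.0
    return sum(triplet_loss(a, p, n, margin) for a, p, n in triplets) / len(triplets)
```

A triplet contributes zero loss once its negative is farther from the anchor than its positive by at least the margin, so only the violating triplets influence the update.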
This process terminates when each image in the training data pool has been browsed once by the DRAL agent.
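To make a single query's annotation episode more concrete, here is a minimal sketch of its pieces: choosing gallery items without repetition up to $K_{max}$, the reward $R_t$, and the discounted sum of rewards. It rests on several assumptions that are not from the paper: a fixed score list with greedy selection stands in for the learned policy $\pi(A_t\mid S_t)$, `oracle` is a hypothetical callable playing the human annotator, the positive and negative sets passed to `reward` must be non-empty, and the margin and $\gamma$ defaults are arbitrary.

```python
def select_action(policy_scores, already_chosen):
    """Greedy stand-in for the policy: best-scoring gallery index not yet annotated."""
    candidates = [k for k in range(len(policy_scores)) if k not in already_chosen]
    return max(candidates, key=lambda k: policy_scores[k])


def annotate_query(policy_scores, oracle, k_max):
    """Ask the oracle about at most k_max distinct gallery items.

    Returns the chosen indices and labels y (1 = true match, -1 = false match).
    """
    chosen, labels = [], []
    while len(chosen) < min(k_max, len(policy_scores)):
        k = select_action(policy_scores, set(chosen))
        chosen.append(k)
        labels.append(1 if oracle(k) else -1)
    return chosen, labels


def reward(y, g_k, positives, negatives, dist, margin=0.2):
    """R_t = [m + y * (max_{x in X_p} d(x, g_k) - min_{x in X_n} d(x, g_k))]_+

    Both positives and negatives must be non-empty (an assumption of this sketch).
    """
    d_pos = max(dist(x, g_k) for x in positives)
    d_neg = min(dist(x, g_k) for x in negatives)
    return max(0.0, margin + y * (d_pos - d_neg))


def discounted_return(rewards, gamma=0.9):
    """R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... for a finite episode."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With this reward, a true match that lies far from the known positives and close to the known negatives scores highly, which matches the idea that harder, more uncertain samples are the more informative ones to annotate.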
Conclusion

The key task for the model design becomes how to select more informative samples at a fixed annotation cost.

The DRAL method removes the need for pre-labelling and keeps upgrading the model with progressively collected data.
Reference
[1] Hanxiao Wang, Shaogang Gong, Xiatian Zhu, and Tao Xiang. Human-in-the-loop person re-identification. In ECCV, 2016.
