Unsupervised Graph Association For Person ReID

Unsupervised Graph Association for Person Re-Identification
The source code is here.
Introduction
Challenge One
Deep CNNs trained with supervision are data-driven: they need a large amount of pairwise-labelled data to learn view-invariant representations. Labelling enough pairwise ReID data, however, is expensive and time-consuming. Improving the performance and scalability of deep ReID algorithms without pairwise-labelled data (i.e., with unsupervised learning) is therefore a major challenge in recent person ReID research.
Related Work
A series of unsupervised image-based methods address this problem. They can be roughly divided into three categories:

image-to-image translation
transfer the source-domain images to the target domain with a GAN

domain adaptation
transfer the model trained on the source domain to the target domain in an unsupervised manner

unsupervised clustering
obtain pseudo labels for the target-domain data with an unsupervised clustering algorithm, then fine-tune the source-domain model on the target domain using those pseudo labels
Challenge Two
All of the methods above presuppose that the source domain and the target domain are reasonably similar.
Tracklet Based Methods
Because UTAL [1] and TAUDL [2] match the underlying positive pairs within each mini-batch, both need a large batch size for those pairs to be sampled together.
RACE [3] and BUC [4], which progressively merge the underlying positive pairs during training, are easily degraded when noisy pairs are merged.
Unsupervised Graph Association
The core ideas are to mine cross-view relationships and to reduce the damage caused by noisy associations.
The intra-camera learning stage learns a representation of each person using camera information, which helps reduce false cross-view associations in the inter-camera learning stage.
Intra-camera Learning Stage
Each classifier branch corresponds to one camera’s classification task.
Suppose we have a dataset captured from $T$ cameras. We adopt sparse space-time tracklet (SSTT [2]) sampling to sample the training tracklets $\lbrace s_t^i, y_t^i\rbrace$ from each camera. Denote $s_t^i=\lbrace I_1^{s_t^i}, I_2^{s_t^i}, \dots, I_{N_{s_t^i}}^{s_t^i}\rbrace$, where $I_n^{s_t^i}$ is the $n$-th image of the $i$-th tracklet ($i\in[1,\dots,M_t]$) in the $t$-th camera ($t\in[1,\dots,T]$) and $N_{s_t^i}$ is the number of images in the tracklet. We randomly assign a unique pseudo label $y_t^i$ ($y_t^i\in\lbrace y_t^1,\dots,y_t^{M_t}\rbrace$) to each $s_t^i$. $\phi(\cdot)$ is the backbone function.
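As a rough illustration of the labelling step, the sketch below gives every tracklet of a camera its own pseudo label, drawn as a random permutation of $0,\dots,M_t-1$, so that each camera defines a separate $M_t$-way classification task for its classifier branch. The function names, the file-name strings, and the fixed seed are hypothetical and only serve the example; SSTT sampling itself is not reproduced here.

```python
import random
from typing import Dict, List, Tuple

Tracklet = List[str]  # toy stand-in: a tracklet is a list of image file names


def assign_pseudo_labels(
    tracklets_per_camera: Dict[int, List[Tracklet]], seed: int = 0
) -> Dict[int, List[Tuple[Tracklet, int]]]:
    """Give each tracklet a pseudo label that is unique within its camera."""
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    labelled = {}
    for camera, tracklets in tracklets_per_camera.items():
        labels = list(range(len(tracklets)))  # M_t distinct labels for camera t
        rng.shuffle(labels)                   # random label-to-tracklet assignment
        labelled[camera] = list(zip(tracklets, labels))
    return labelled


def branch_sizes(labelled: Dict[int, List[Tuple[Tracklet, int]]]) -> Dict[int, int]:
    """Number of classes each per-camera classifier branch has to separate."""
    return {camera: len(items) for camera, items in labelled.items()}


# Hypothetical toy data: camera 0 has two tracklets, camera 1 has three.
tracklets = {
    0: [["c0_a_1.jpg", "c0_a_2.jpg"], ["c0_b_1.jpg"]],
    1: [["c1_a_1.jpg"], ["c1_b_1.jpg", "c1_b_2.jpg"], ["c1_c_1.jpg"]],
}
labelled = assign_pseudo_labels(tracklets)
sizes = branch_sizes(labelled)
print(sizes)  # {0: 2, 1: 3}
```

Labels are only meaningful inside one camera: label 0 in camera 0 and label 0 in camera 1 say nothing about whether the two tracklets show the same person.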
Note:

The batch normalization layer helps avoid overfitting and suppresses negative pairs: it lowers the average similarity score of the negative pairs, which makes them easier to distinguish.

Our experiments assume that, after SSTT sampling, each person has only one tracklet per camera.
Inter-camera Learning Stage
Tracklet's representation
$c_t^i=\frac{1}{N_{s_t^i}}\sum_{n=1}^{N_{s_t^i}}\phi(I_n^{s_t^i}),\quad I_n^{s_t^i}\in s_t^i$
Cross-View Graph
KNN set $\lbrace c_t^i\rbrace_K^m$ of $c_t^i$: the $K$ nearest tracklets of $c_t^i$ in camera $m$.
Edge weight:
$$
e(c_t^i, c_m^j)=
\begin{cases}
\cos(c_t^i, c_m^j),\quad &\mbox{if}\quad\cos(c_t^i, c_m^j)> \lambda\quad \&\quad c_m^j\in \lbrace c_t^i\rbrace_K^m\quad \&\quad c_t^i\in \lbrace c_m^j\rbrace_K^t\\
1, &\mbox{if}\quad c_t^i=c_m^j\\
0, &\mbox{otherwise}
\end{cases}
$$
Cross-camera loss
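Since the loss in this section is built on the graph edges, here is a minimal sketch of the tracklet representation and the edge rule $e(\cdot,\cdot)$ defined above: mean-pool the image features of a tracklet, then keep a cross-camera edge only when the two tracklets are mutual $K$-nearest neighbours and their cosine similarity exceeds $\lambda$. The 2-D feature vectors stand in for $\phi(\cdot)$ outputs, the values of $K$ and $\lambda$ are arbitrary, and giving weight 0 to two different tracklets of the same camera is an assumption that follows the one-tracklet-per-person-per-camera note.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Node = Tuple[int, int]  # (camera index t, tracklet index i)


def centroid(features: List[Vector]) -> Vector:
    """Tracklet representation c_t^i: mean of the image features of one tracklet."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def knn_in_camera(node: Node, camera: int, nodes: Dict[Node, Vector], k: int) -> List[Node]:
    """The K tracklets of `camera` that are most similar to `node`."""
    candidates = [other for other in nodes if other[0] == camera and other != node]
    candidates.sort(key=lambda other: cosine(nodes[node], nodes[other]), reverse=True)
    return candidates[:k]


def edge_weight(a: Node, b: Node, nodes: Dict[Node, Vector], k: int, lam: float) -> float:
    """Edge e(a, b): cosine similarity if above lam and mutual KNN, 1 for self, else 0."""
    if a == b:
        return 1.0
    if a[0] == b[0]:
        return 0.0  # assumption: distinct tracklets of one camera are never linked
    similarity = cosine(nodes[a], nodes[b])
    mutual = b in knn_in_camera(a, b[0], nodes, k) and a in knn_in_camera(b, a[0], nodes, k)
    return similarity if (similarity > lam and mutual) else 0.0


# Hypothetical toy features: two cameras with two tracklets each.
nodes = {
    (0, 0): centroid([[1.0, 0.0], [0.8, 0.2]]),
    (0, 1): centroid([[0.0, 1.0]]),
    (1, 0): centroid([[1.0, 0.1]]),
    (1, 1): centroid([[0.1, 1.0]]),
}
for target in [(0, 0), (1, 0), (1, 1)]:
    print(target, round(edge_weight((0, 0), target, nodes, k=1, lam=0.5), 3))
```

The mutual-neighbour test is what filters noisy associations: a tracklet that merely happens to be the closest candidate in another camera is dropped unless the relation also holds in the opposite direction.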
Graph neighbor set $N(s_t^i)$:
$N(s_t^i)=\lbrace (s_m^a, y_m^a)\mid e(c_t^i, c_m^a)\ne 0\rbrace$
The weights of MBC are replaced with the corresponding nodes of CVG so that the CVG can be updated quickly during training:
$l_{ce}(I_n^{s_t^i}, s_m^a)=\sum_{j=1}^{M_m}1(y_m^a==j)\log\left(\frac{\exp((c_m^j)^T\phi(I_n^{s_t^i}))}{\sum_{k=1}^{M_m}\exp((c_m^k)^T\phi(I_n^{s_t^i}))}\right)$
Graph weighted cross-camera loss
$$
\begin{split}
l_{inter}(I_n^{s_t^i})&=\sum_{N(s_t^i)\setminus s_t^i}e(c_t^i, c_m^a)l_{ce}(I_n^{s_t^i}, s_m^a)+\alpha l_{ce}(I_n^{s_t^i}, s_t^i)\\
&=\sum_{N(s_t^i)}e(c_t^i, c_m^a)l_{ce}(I_n^{s_t^i}, s_m^a),\qquad\mbox{where}\quad\alpha=e(c_t^i, c_t^i)
\end{split}
$$
CVG's Updating
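Before the update rule, it may help to see the graph-weighted loss above as code. The sketch below scores one image feature against the nodes of each neighbouring camera with a softmax over dot products and sums the per-neighbour terms weighted by the edge values, with the tracklet's own node included at weight 1. It takes $l_{ce}$ as the negative log-softmax probability of the neighbour node, which is the usual cross-entropy sign convention and an assumption here. All names and the toy numbers are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(feature: Vector, camera_nodes: List[Vector], target: int) -> float:
    """l_ce: negative log-probability of node `target` among one camera's nodes."""
    logits = [dot(node, feature) for node in camera_nodes]
    return -log_softmax(logits)[target]


def inter_camera_loss(
    feature: Vector,
    nodes_by_camera: Dict[int, List[Vector]],
    neighbours: List[Tuple[int, int, float]],
) -> float:
    """Sum of edge-weighted cross-entropy terms over the graph neighbour set.

    `neighbours` holds (camera m, node index a, edge weight e) triples and
    includes the tracklet's own node with weight 1.
    """
    return sum(
        weight * cross_entropy(feature, nodes_by_camera[camera], index)
        for camera, index, weight in neighbours
    )


# Hypothetical toy graph: two cameras, two nodes per camera.
nodes_by_camera = {
    0: [[1.0, 0.0], [0.0, 1.0]],
    1: [[1.0, 0.0], [0.0, 1.0]],
}
feature = [1.0, 0.0]                          # stands in for phi(I)
neighbours = [(0, 0, 1.0), (1, 0, 0.8)]       # own node plus one cross-camera neighbour
loss = inter_camera_loss(feature, nodes_by_camera, neighbours)
print(round(loss, 4))
```

A weak edge contributes little to the loss, so a doubtful association can only pull the feature slightly toward the other camera's node.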
$\frac{\partial l_{inter}}{\partial c_m^a}=\sum_{N_{bs}}err(I_n^{s_t^i})e(c_t^i, c_m^a)\phi(I_n^{s_t^i})$
where
$err(I_n^{s_t^i})=1(y_m^a==j)\frac{\exp((c_m^j)^T\phi(I_n^{s_t^i}))}{\sum_{k=1}^{M_m}\exp((c_m^k)^T\phi(I_n^{s_t^i}))}$
$c_m^a\leftarrow c_m^a+\eta\frac{\partial l_{inter}}{\partial c_m^a}$
The updating of $c_t^i$ makes full use of underlying positive pairs from all camera views.
References
[1] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervised tracklet person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2019.
[2] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervised person re-identification by deep learning tracklet association. In Proceedings of the European Conference on Computer Vision (ECCV), pages 737–753, 2018.
[3] Mang Ye, Andy J Ma, Liang Zheng, Jiawei Li, and Pong C Yuen. Dynamic label graph matching for unsupervised video re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 5142–5150, 2017.
[4] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In AAAI Conference on Artificial Intelligence, volume 2, 2019.
