Unsupervised Graph Association For Person Re-ID

  • Unsupervised Graph Association for Person Re-Identification


    The source code is here.


    Challenge One

    Supervised deep CNNs are data-driven: they need a large amount of pairwise-labelled training data to learn view-invariant representations. However, labelling enough pairwise Re-ID data is expensive and time-consuming. Improving the performance and scalability of deep Re-ID algorithms without pairwise-labelled data (i.e., unsupervised learning) is therefore a major challenge in recent person Re-ID research.

    Related Work

    A series of unsupervised image-based methods address this problem. They fall roughly into three categories:

    1. image-to-image translation

      Transfer the source-domain images to the target domain with a GAN.

    2. domain adaptation

      Transfer the model trained on the source domain to the target domain in an unsupervised manner.

    3. unsupervised clustering

      Obtain pseudo labels for the target-domain data with an unsupervised clustering algorithm, then fine-tune the source-domain model on the target domain with those pseudo labels.

    Challenge Two

    All of the methods above presuppose some similarity between the source domain and the target domain.

    Tracklet Based Methods

    UTAL [1] and TAUDL [2] match the underlying positive pairs within each mini-batch, so both need a large batch size to sample those pairs.

    RACE [3] and BUC [4] progressively merge the underlying positive pairs during training, which makes them vulnerable to merging noisy pairs.

    Unsupervised Graph Association

    The core ideas are to mine cross-view relationships and to reduce the damage caused by noisy associations.

    The intra-camera learning stage learns person representations with respect to camera information, which helps reduce false cross-view associations in the inter-camera learning stage.

    Intra-camera Learning Stage

    Each classifier branch corresponds to one camera’s classification task.


    Suppose we have a dataset captured from $T$ cameras. We adopt sparse space-time tracklet sampling (SSTT [2]) to sample the training tracklets $\lbrace s_t^i, y_t^i\rbrace$ from each camera.

    Denote $s_t^i=\lbrace I_1^{s_t^i}, I_2^{s_t^i}, ..., I_n^{s_t^i}\rbrace$, where $I_n^{s_t^i}$ is the $n$-th image of the $i$-th tracklet ($i \in [1, \ldots, M_t]$) in the $t$-th camera ($t \in [1, \ldots, T]$).

    We randomly assign a unique pseudo label $y_t^i$ ($y_t^i\in \lbrace y_t^1, ..., y_t^{M_t}\rbrace$) to each $s_t^i$.
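    The labelling step can be sketched as follows. This is only an illustration: the `assign_pseudo_labels` helper and the dictionary layout of the tracklets are assumptions, not part of the original method's code.

```python
import random


def assign_pseudo_labels(tracklets_per_camera, seed=0):
    """Give every tracklet a unique pseudo label within its own camera.

    tracklets_per_camera: dict camera_id -> list of tracklets, where each
    tracklet is a list of image identifiers (hypothetical layout).
    Returns dict camera_id -> list of (tracklet, label) pairs.
    """
    rng = random.Random(seed)
    labelled = {}
    for cam, tracklets in tracklets_per_camera.items():
        labels = list(range(len(tracklets)))  # M_t distinct labels for camera t
        rng.shuffle(labels)                   # random, but still one-to-one
        labelled[cam] = list(zip(tracklets, labels))
    return labelled


data = {0: [["a1", "a2"], ["b1"]], 1: [["c1"], ["d1"], ["e1"]]}
labelled = assign_pseudo_labels(data)
```

    Labels are unique only within a camera, so label 0 in camera 0 and label 0 in camera 1 need not be the same person.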

    $\phi(\cdot)$ is the backbone function.


    1. The batch normalization layer helps avoid overfitting and suppress negative pairs: it lowers the average similarity score of negative pairs, making them easier to distinguish.

    2. Our experiments assume that, under SSTT sampling, each person has only one tracklet per camera.

    Inter-camera Learning Stage

    Tracklet's representation

    $$c_t^i=\frac{1}{N_{s_t^i}}\sum_{n=1}^{N_{s_t^i}}\phi(I_n^{s_t^i}),\quad I_n^{s_t^i}\in s_t^i$$
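    As a small sketch of this averaging, assume the per-frame features $\phi(I_n^{s_t^i})$ have already been computed and are stored as plain lists of floats (the `tracklet_representation` name is hypothetical):

```python
def tracklet_representation(frame_features):
    """Average the per-frame feature vectors of one tracklet."""
    n = len(frame_features)
    if n == 0:
        raise ValueError("a tracklet needs at least one frame")
    dim = len(frame_features[0])
    # Component-wise mean over all frames of the tracklet.
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


# Three frames with 2-dimensional features (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
c = tracklet_representation(features)
```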

    Cross-View Graph

    The KNN set $\lbrace c_t^i\rbrace_K^m$ of $c_t^i$ contains the $K$ nearest tracklets to $c_t^i$ in camera $m$.

    $$e(c_t^i, c_m^j)=\begin{cases}
    \cos(c_t^i, c_m^j), &\mbox{if}\quad\cos(c_t^i, c_m^j)> \lambda\quad \&\quad c_m^j\in \lbrace c_t^i\rbrace_K^m\quad \&\quad c_t^i\in \lbrace c_m^j\rbrace_K^t\\
    1, &\mbox{if}\quad c_t^i=c_m^j\\
    0, &\mbox{otherwise}
    \end{cases}$$
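    The edge rule can be illustrated with a short pure-Python sketch. The function names and the `cams` layout are hypothetical, the vectors are assumed to be non-zero, and the sketch assumes that two different tracklets from the same camera are never linked.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn(query, candidates, k):
    """Indices of the k candidates most similar to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda j: cosine(query, candidates[j]), reverse=True)
    return set(order[:k])


def edge_weight(cams, t, i, m, j, k, lam):
    """e(c_t^i, c_m^j); cams maps camera id -> list of tracklet vectors."""
    if t == m:
        # Self edge is 1; other same-camera pairs get 0 (assumption).
        return 1.0 if i == j else 0.0
    sim = cosine(cams[t][i], cams[m][j])
    # Both tracklets must appear in each other's K nearest neighbours.
    mutual = j in knn(cams[t][i], cams[m], k) and i in knn(cams[m][j], cams[t], k)
    return sim if (sim > lam and mutual) else 0.0


cams = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[0.9, 0.1], [0.1, 0.9]]}
strong = edge_weight(cams, 0, 0, 1, 0, k=1, lam=0.5)  # mutual neighbours
weak = edge_weight(cams, 0, 0, 1, 1, k=1, lam=0.5)    # below the threshold
```

    Requiring the neighbour relation in both directions is what filters out one-sided, likely noisy matches.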

    Cross-camera loss

    Graph neighbor set $N(s_t^i)$:

    $$N(s_t^i)=\lbrace (s_m^a, y_m^a)\mid e(c_t^i, c_m^a)\ne 0\rbrace$$
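    A minimal sketch of collecting this neighbor set, assuming the edge weights are already stored in a dictionary keyed by node pairs (the `graph_neighbours` helper and this layout are hypothetical; nodes are written as `(camera, tracklet_index)` tuples rather than tracklet/label pairs):

```python
def graph_neighbours(node, edges):
    """Return the nodes joined to `node` by a non-zero edge.

    node:  (camera, tracklet_index) tuple.
    edges: dict mapping (node_a, node_b) -> edge weight e.
    """
    return sorted(b for (a, b), w in edges.items() if a == node and w != 0)


edges = {
    ((0, 0), (0, 0)): 1.0,   # self edge, so a node is its own neighbour
    ((0, 0), (1, 0)): 0.93,  # associated tracklet in camera 1
    ((0, 0), (1, 1)): 0.0,   # no association
    ((0, 1), (1, 1)): 0.88,  # edge of a different node
}
neighbours = graph_neighbours((0, 0), edges)
```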

    The weights of the MBC are replaced with the corresponding nodes of the CVG so that the CVG can be updated quickly during training:

    $$l_{ce}(I_n^{s_t^i}, s_m^a)=-\sum_{j=1}^{M_m}\log\left(\frac{\exp((c_m^j)^T\phi(I_n^{s_t^i}))}{\sum_{k=1}^{M_m}\exp((c_m^k)^T\phi(I_n^{s_t^i}))}\right)$$

    Graph weighted cross-camera loss

    $$\begin{aligned}
    l_{inter}(I_n^{s_t^i})&=\sum_{N(s_t^i)-s_t^i}e(c_t^i, c_m^a)l_{ce}(I_n^{s_t^i}, s_m^a)+\alpha l_{ce}(I_n^{s_t^i}, s_t^i)\\
    &=\sum_{N(s_t^i)}e(c_t^i, c_m^a)l_{ce}(I_n^{s_t^i}, s_m^a),\qquad\mbox{where}\quad\alpha=e(c_t^i, c_t^i)
    \end{aligned}$$
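    The weighted sum can be sketched as below. Two assumptions are made here: $l_{ce}$ is treated as the usual softmax cross-entropy whose target is node $a$ of camera $m$, and the neighbor list is passed in ready-made as `(camera, index, edge_weight)` triples. All function names are hypothetical.

```python
import math


def cross_entropy(feature, nodes, target):
    """Softmax cross-entropy of `feature` against the nodes of one camera.

    Logits are inner products between each node vector and the feature;
    only the term of the `target` node is kept (assumed reading of l_ce).
    """
    logits = [sum(c * f for c, f in zip(node, feature)) for node in nodes]
    mx = max(logits)  # subtract the max for numerical stability
    log_sum_exp = mx + math.log(sum(math.exp(x - mx) for x in logits))
    return log_sum_exp - logits[target]


def inter_camera_loss(feature, neighbours, cams):
    """Graph-weighted loss: sum of e * l_ce over the graph neighbours.

    neighbours: list of (camera m, node index a, edge weight e), including
    the tracklet itself, whose self edge has weight 1.
    """
    return sum(e * cross_entropy(feature, cams[m], a) for m, a, e in neighbours)


cams = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[1.0, 0.0], [0.0, 1.0]]}
feature = [2.0, 0.0]
neighbours = [(0, 0, 1.0), (1, 0, 0.5)]  # self edge plus one cross-view edge
loss = inter_camera_loss(feature, neighbours, cams)
```

    A weaker association contributes proportionally less to the loss, which limits the damage a wrong edge can do.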

    CVG's Updating

    $$\frac{\partial l_{inter}}{\partial c_m^a}=-\sum_{N_{bs}}err(I_n^{s_t^i})e(c_t^i, c_m^a)\phi(I_n^{s_t^i})$$

    $$c_m^a\leftarrow c_m^a+\eta\frac{\partial l_{inter}}{\partial c_m^a}$$

    The update of $c_t^i$ makes full use of the underlying positive pairs from all camera views.


    [1] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervised tracklet person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2019.

    [2] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervised person re-identification by deep learning tracklet association. In Proceedings of the European Conference on Computer Vision (ECCV), pages 737–753, 2018.

    [3] Mang Ye, Andy J Ma, Liang Zheng, Jiawei Li, and Pong C Yuen. Dynamic label graph matching for unsupervised video re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 5142–5150, 2017.

    [4] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In AAAI Conference on Artificial Intelligence, volume 2, 2019.




