Beyond Human Parts -- Dual Part-Aligned Representations



  • Beyond Human Parts: Dual Part-Aligned Representations for Person Re-Identification

    Full text:
    http://openaccess.thecvf.com/content_ICCV_2019/html/Guo_Beyond_Human_Parts_Dual_Part-Aligned_Representations_for_Person_Re-Identification_ICCV_2019_paper.html

    The source code is here.

    Challenges - Misalignment Problem

    Significant changes in visual appearance are caused by:

    1. human pose variation

    2. lighting conditions

    3. part occlusions

    4. background cluttering

    5. distinct camera viewpoints, and so on

    Related Work

    1. Hand-crafted partitioning

      relies on manually designed splits of the input image or the feature maps into grid cells or horizontal stripes, based on the assumption that the human parts are well-aligned in the RGB color space

    2. The attention mechanism

      tries to learn an attention map over the last output feature map and constructs the aligned part features accordingly

    3. Predicting a set of predefined attributes as useful features to guide the matching process.

    4. Injecting human pose estimation or human parsing results to extract human part-aligned features based on the predicted human key points or semantic human part regions. The success of such approaches depends heavily on the accuracy of the human parsing models or pose estimators.
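To make the hand-crafted partitioning in item 1 concrete, here is a minimal sketch of horizontal-stripe pooling. The function name `stripe_features`, the `(C, H, W)` layout, and the use of average pooling are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def stripe_features(feature_map, num_stripes):
    """Average-pool a (C, H, W) feature map into horizontal stripes.

    Returns an array of shape (num_stripes, C), one descriptor per stripe.
    (Hypothetical helper; average pooling is an assumption.)
    """
    # np.array_split also handles heights not divisible by num_stripes.
    stripes = np.array_split(feature_map, num_stripes, axis=1)
    return np.stack([s.mean(axis=(1, 2)) for s in stripes])

# Toy feature map with C=2 channels, H=6 rows, W=4 columns.
features = np.arange(2 * 6 * 4, dtype=float).reshape(2, 6, 4)
parts = stripe_features(features, num_stripes=3)
print(parts.shape)
```

Each stripe covers a fixed band of rows regardless of where the body parts actually are, which is why this scheme relies on the alignment assumption stated in item 1.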

    Motivation

    Most previous studies focus on learning more accurate human part representations, while neglecting potentially useful contextual cues that can be regarded as “non-human” parts.

    Beyond these predefined part categories, there still exist many objects or parts which could be critical for person re-identification, but tend to be recognized as background by the pre-trained human parsing models.

    Dual Part-Aligned Representation

    Fig2

    Accurate Human Parts:

    • The human parsing model CE2P [1] extracts the human part masks, which are used to compute human part-aligned representations for features from low levels to high levels.

    the predicted label map: $L$, rescaled to the same size as the feature map $X$ ($x_i$ is the representation of pixel $i$, i.e. the $i$-th column of $X$ in the $C\times N$ layout used in the matrix form below, $i=1,2,...,N$)

    $l_i$ is the human part category of pixel $i$ in $L$. It takes one of $K$ values, covering $K-1$ human part categories and one background category.

    $K$ confidence maps: $P_1, P_2, ..., P_K$, where each confidence map $P_k$ is associated with a human part category (or the background category), and $p_{ki}$ is the $i$-th pixel of $P_k$.

    • the representation of the $k$-th human part (element-wise form):

    $$h_k=g\left(\sum_{i=1}^N p_{ki}x_i\right)$$

    • Matrix form:

    $$h_k^{C\times 1}=g\left(\left(M^{C\times 1}P_k^{1\times N}\odot X^{C\times N}\right)Y^{N\times 1}\right)$$

    $M^{C\times 1}$: an all-ones vector that broadcasts $P_k^{1\times N}$ to $C\times N$

    $Y^{N\times 1}$: an all-ones vector that sums over each row
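The two forms of $h_k$ can be checked against each other numerically. The sketch below makes several assumptions: toy sizes, random features, one-hot confidence maps built from a random label map, and an identity placeholder for $g$ (these notes do not specify $g$).

```python
import numpy as np

rng = np.random.default_rng(0)
C, N, K = 4, 6, 3                     # channels, pixels, part categories (toy sizes)
X = rng.normal(size=(C, N))           # feature map; column i is x_i
labels = rng.integers(0, K, size=N)   # flattened label map L
P = np.eye(K)[labels].T               # (K, N) one-hot confidence maps (assumption)

def g(v):
    # Identity placeholder; the real g is not specified in these notes.
    return v

k = 1
# Element-wise form: h_k = g(sum_i p_ki * x_i)
h_sum = g(sum(P[k, i] * X[:, i] for i in range(N)))

# Matrix form: h_k = g(((M P_k) ⊙ X) Y)
M = np.ones((C, 1))                   # broadcasts P_k to C x N
Y = np.ones((N, 1))                   # sums over each row
h_mat = g(((M @ P[k:k + 1, :]) * X) @ Y)   # shape (C, 1)

print(np.allclose(h_sum, h_mat[:, 0]))
```

In practice NumPy broadcasting makes `M` unnecessary (`P[k] * X` already has shape `(C, N)`); it is kept here only to mirror the matrix form.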

    • the human part-aligned feature map $X^{Human}$ (same size as $X$):

    $$X^{Human}=\sum_{k=1}^K h_k^{C\times 1}P_k^{1\times N}$$
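As a rough illustration of the formula for $X^{Human}$, the sketch below computes all $h_k$ at once and scatters them back to the pixels. The function name `human_part_aligned`, the identity default for $g$, and the one-hot masks in the example are assumptions made for illustration.

```python
import numpy as np

def human_part_aligned(X, P, g=lambda v: v):
    """X: (C, N) features, P: (K, N) confidence maps.

    Returns X^Human of shape (C, N). g defaults to the identity,
    which is a placeholder assumption.
    """
    H = g(X @ P.T)   # (C, K); column k is h_k = g(sum_i p_ki x_i)
    return H @ P     # (C, N); equals sum_k h_k P_k

# Toy example: 2 channels, 4 pixels, pixels 0-1 in part 0 and pixels 2-3 in part 1.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
labels = np.array([0, 0, 1, 1])
P = np.eye(2)[labels].T          # (K=2, N=4) one-hot masks (assumption)
X_human = human_part_aligned(X, P)
print(X_human)
```

With one-hot masks, every pixel ends up carrying the representation of the part it belongs to, so pixels in the same part share the same column in `X_human`.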

    Coarse Non-human Parts:

    The latent part branch learns to predict $N$ coarse confidence maps $Q_1, Q_2, ..., Q_N$, one for each of the $N$ pixels, where $q_{ij}$ is the $j$-th pixel of $Q_i$:

    $$q_{ij}=\frac{1}{Z_i}\exp\left(\theta(x_j)^T\phi(x_i)\right)$$

    where

    $$Z_i=\sum_{j=1}^N\exp\left(\theta(x_j)^T\phi(x_i)\right)$$

    • the latent part-aligned feature map $X^{Latent}$:

    $$x_i^{Latent}=\sum_{j=1}^{N}q_{ij}\psi(x_j)$$
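The latent part equations amount to a row-wise softmax over pairwise similarities followed by a weighted sum. The sketch below assumes $\theta$, $\phi$ and $\psi$ are plain linear maps given by hypothetical matrices `W_theta`, `W_phi`, `W_psi`; these notes do not say how the transforms are implemented.

```python
import numpy as np

def latent_part_aligned(X, W_theta, W_phi, W_psi):
    """X: (C, N) features. W_*: (D, C) linear stand-ins for theta, phi, psi.

    Returns (Q, X_latent) with Q of shape (N, N) and X_latent of shape (D, N).
    """
    theta, phi, psi = W_theta @ X, W_phi @ X, W_psi @ X   # each (D, N)
    logits = phi.T @ theta        # entry (i, j) = theta(x_j)^T phi(x_i)
    # Subtracting the row maximum stabilises exp and leaves q_ij unchanged.
    logits = logits - logits.max(axis=1, keepdims=True)
    Q = np.exp(logits)
    Q = Q / Q.sum(axis=1, keepdims=True)   # divide row i by Z_i
    X_latent = psi @ Q.T          # column i = sum_j q_ij psi(x_j)
    return Q, X_latent

rng = np.random.default_rng(0)
C, D, N = 4, 3, 5                 # toy sizes (assumptions)
X = rng.normal(size=(C, N))
W_theta, W_phi, W_psi = (rng.normal(size=(D, C)) for _ in range(3))
Q, X_latent = latent_part_aligned(X, W_theta, W_phi, W_psi)
print(Q.sum(axis=1))              # each row of Q sums to 1 (up to rounding)
```

Row $i$ of `Q` plays the role of the confidence map $Q_i$: it says how strongly every pixel $j$ is grouped with pixel $i$.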

    1. The self-attention mechanism [2, 3] learns to group all the pixels belonging to the same latent part together. We also extract the latent non-human part information on the feature maps from the low-levels to the high-levels.

    2. Self-attention can learn to group similar pixels together without extra supervision (this has also proved useful in segmentation).

    3. The performance gains from the latent part branch, which is in fact a mixture of coarse human and non-human part information, are mainly attributed to capturing non-human parts.

    4. Although the latent part masks are learned from scratch, DPB (latent) achieves results comparable to the human part branch in general, even though the latter carries stronger prior knowledge of human parts. This shows the importance of the non-human part context.

    Conclusion

    • By combining the complementary information from both kinds of parts, the approach learns to augment the representation of each pixel with the representation of the part (human or non-human) that it belongs to.

    • The human part branch and the latent part branch are complementary to each other.

    • Human part masks can eliminate the influence of background regions, while the predicted latent part masks serve as a reliable surrogate for the non-human parts.

    • The human part branch adopts an off-the-shelf human parsing model to inject structural prior information by capturing the predefined semantic human parts for a person, while the latent part branch adopts a self-attention mechanism to help capture detailed part categories beyond the injected prior information.

    References

    [1] Ting Liu, Tao Ruan, Zilong Huang, Yunchao Wei, Shikui Wei, Yao Zhao, and Thomas Huang. Devil in the details: Towards accurate single and multiple human parsing. arXiv:1809.05996, 2018.

    [2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

    [3] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
