Beyond Human Parts: Dual Part-Aligned Representations for Person Re-Identification
The source code is here.
Challenges - Misalignment Problem
The significant visual appearance changes are caused by:
- human pose variation
- distinct camera viewpoints, among other factors
Hand-crafted partition: relies on manually designed splits of the input image or the feature maps into grid cells or horizontal stripes, based on the assumption that human parts are well aligned in the RGB color space.
The attention mechanism
tries to learn an attention map over the last output feature map and constructs the aligned part features accordingly
Attribute prediction: predicts a set of predefined attributes as useful features to guide the matching process.
Pose/parsing injection: injects human pose estimation or human parsing results to extract human part-aligned features based on the predicted human key points or semantic human part regions. However, the success of such approaches heavily depends on the accuracy of the human parsing models or pose estimators.
Most previous studies mainly focus on learning more accurate human part representations, while neglecting the influence of potentially useful contextual cues that could be regarded as "non-human" parts.
Beyond these predefined part categories, there still exist many objects or parts which could be critical for person re-identification, but tend to be recognized as background by the pre-trained human parsing models.
Dual Part-Aligned Representation
Accurate Human Parts:
- The human parsing model CE2P [1] extracts the human part masks, and the human part-aligned representations are computed for features from the low levels to the high levels.
- the predicted label map $L \in \{0, 1, \dots, K\}^{H \times W}$ (rescaled to the same size as the feature map $X \in \mathbb{R}^{N \times C}$, where $x_i$ is the representation of pixel $i$, essentially the $i$-th row of $X$, and $N = H \times W$): $l_i$ represents the human part category of pixel $i$ of $X$, taking one of $K + 1$ different values including $K$ human part categories and one background category.
- confidence maps $M \in \{0, 1\}^{N \times (K + 1)}$: each confidence map $M^k$ is associated with a human part category (or the background category); $M^k_i$, the $i$-th pixel of $M^k$, equals $1$ iff $l_i = k$.
- the representation of the $k$-th human part (component form):

  $$h_k = \frac{\sum_{i=1}^{N} M^k_i\, x_i}{\sum_{i=1}^{N} M^k_i}$$

  In matrix form, the summation and broadcasting are written with all-ones vectors: with $\mathbf{1}_N \in \mathbb{R}^N$ (all elements equal to $1$, used to sum each row over the $N$ pixels), $h_k = \frac{(M^k)^\top X}{(M^k)^\top \mathbf{1}_N}$.
- the human part-aligned feature map $Z$ (same size as $X$): each pixel is assigned the representation of the part it belongs to,

  $$z_i = \sum_{k=0}^{K} M^k_i\, h_k$$
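The masked-average-pooling step above can be sketched in NumPy; the one-hot construction of $M$ from the label map and the array shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def human_part_aligned(X, labels, K):
    """Compute the human part-aligned feature map Z from pixel features X.

    X:      (N, C) array, one C-dim feature per pixel.
    labels: (N,) int array, part category of each pixel in {0, ..., K}
            (K human parts plus one background category).
    """
    M = np.eye(K + 1)[labels]                        # (N, K+1) binary confidence maps
    counts = M.sum(axis=0)                           # pixels per part, (K+1,)
    # h_k = sum_i M_i^k x_i / sum_i M_i^k  (masked average pooling per part)
    H = (M.T @ X) / np.maximum(counts, 1)[:, None]   # (K+1, C)
    # z_i = sum_k M_i^k h_k : each pixel receives its own part's representation
    Z = M @ H                                        # (N, C)
    return Z
```

For example, two pixels labeled part 0 with features `[1, 0]` and `[3, 0]` both end up with the pooled representation `[2, 0]`.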
Coarse Non-human Parts:
- The latent part branch learns to predict $N$ coarse confidence maps, one for each of the $N$ pixels, via self-attention: the $i$-th confidence map is $a_i \in \mathbb{R}^{N}$, and $a_{ij}$, the $j$-th pixel of $a_i$, measures how likely pixels $i$ and $j$ belong to the same latent part:

  $$a_{ij} = \frac{\exp\left(\theta(x_i)^\top \phi(x_j)\right)}{\sum_{j'=1}^{N} \exp\left(\theta(x_i)^\top \phi(x_{j'})\right)}$$

- the latent part-aligned feature map $\hat{Z}$:

  $$\hat{z}_i = \sum_{j=1}^{N} a_{ij}\, g(x_j)$$
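A minimal NumPy sketch of this self-attention computation; the random linear projections standing in for the learned $\theta$, $\phi$, $g$ and the projection dimension `d` are assumptions for illustration:

```python
import numpy as np

def latent_part_aligned(X, d=8, seed=0):
    """Latent part branch via self-attention over pixel features X of shape (N, C).

    Each row of A is one coarse confidence map: a_ij is high when pixels i and j
    likely belong to the same latent part. The random projections Wt/Wp/Wg stand
    in for the learned transforms theta/phi/g.
    """
    rng = np.random.default_rng(seed)
    N, C = X.shape
    Wt, Wp, Wg = (rng.standard_normal((C, d)) / np.sqrt(C) for _ in range(3))
    scores = (X @ Wt) @ (X @ Wp).T               # (N, N) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability for exp
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # softmax over j: confidence maps
    Z_hat = A @ (X @ Wg)                         # z_hat_i = sum_j a_ij g(x_j)
    return Z_hat, A
```

Each attention row sums to 1, so every pixel's output is a convex combination of the (projected) features of the pixels in its latent part.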
The self-attention mechanism [2, 3] learns to group all the pixels belonging to the same latent part together. We also extract the latent non-human part information on the feature maps from the low-levels to the high-levels.
Self-attention can learn to group the similar pixels together without extra supervision (also shown useful in segmentation).
The performance gains from the latent part branch, which is in fact a mixture of coarse human and non-human part information, are mainly attributed to capturing non-human parts.
Although the latent part masks are learned from scratch, DPB (latent) achieves results comparable to the human part branch, which carries much stronger prior knowledge of human parts, showing the importance of the non-human part context.
By combining the complementary advantages of both kinds of information, our approach learns to augment the representation of each pixel with the representation of the part (human or non-human) that it belongs to.
Human part branch and latent part branch are complementary to each other.
Human part masks can eliminate the influence of background regions, while the predicted latent part masks serve as a reliable surrogate for the non-human parts.
The human part branch adopts off-the-shelf human parsing model to inject structural prior information by capturing the predefined semantic human parts for a person, while the latent part branch adopts a self-attention mechanism to help capture the detailed part categories beyond the injected prior information.
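Putting the two branches together, a minimal sketch of the dual part-aligned fusion, assuming residual addition with per-channel linear (1×1-conv-like) projections; the paper's exact fusion scheme may differ:

```python
import numpy as np

def dual_part_block(X, Z_human, Z_latent, seed=0):
    """Augment backbone features with both part-aligned representations.

    X, Z_human, Z_latent: (N, C) arrays from the backbone, the human part
    branch, and the latent part branch. The random projections W_h/W_l and
    the residual sum are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    C = X.shape[1]
    W_h = rng.standard_normal((C, C)) / np.sqrt(C)  # stand-in learned projection
    W_l = rng.standard_normal((C, C)) / np.sqrt(C)  # stand-in learned projection
    # each pixel keeps its own feature and gains both part contexts
    return X + Z_human @ W_h + Z_latent @ W_l
```

With the residual form, zeroing either branch gracefully falls back to the remaining information rather than destroying the backbone feature.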
1. Ting Liu, Tao Ruan, Zilong Huang, Yunchao Wei, Shikui Wei, Yao Zhao, and Thomas Huang. Devil in the details: Towards accurate single and multiple human parsing. arXiv:1809.05996, 2018. [link]
2. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017. [link]
3. Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018. [link]