# Omni-Scale Feature Learning for Person Re-Identification

The source code is here.

## Introduction

### Major Challenges

As an instance-level recognition problem, person ReID faces two major challenges, as illustrated in Figure 1:

1. Large intra-class (instance/identity) variations, typically caused by changes in camera viewing conditions. - hard positives

2. Small inter-class variations - people in public spaces often wear similar clothes, and from a distance, as is typical in surveillance videos, they can look remarkably similar. - hard negatives

### Omni-Scale Feature

To match people and distinguish them from impostors, features corresponding to both small local regions and the global whole body are important.

1. Looking at the global-scale features would narrow down the search to the true match (middle) and an impostor (right).

2. The local-scale features give away the fact that the person on the right is an impostor. (For more challenging cases, richer and more complex features that span multiple scales are required.)

### OSNet

1. enabling omni-scale feature learning;

2. a lightweight network.

benefits:

1. When trained on ReID datasets, which are often of moderate size due to the difficulty of collecting cross-camera matched person images, a lightweight network with few parameters is less prone to overfitting.

2. In a large-scale surveillance application, the most practical way to deploy ReID is to perform feature extraction at the camera end, so that only extracted features, rather than raw videos, need to be sent to a central server.

## Depthwise Separable Convolutions [notes]

### Pointwise Convolution

$1\times 1$ convolution: mixes information across channels at each spatial position.
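To see why factorising a convolution into pointwise and depthwise parts makes the network lightweight, a quick parameter count helps. The sketch below compares a standard $3\times 3$ convolution against a pointwise $1\times 1$ followed by a depthwise $3\times 3$ (the "Lite $3\times 3$" ordering used in OSNet); channel sizes are illustrative, and biases are ignored for simplicity.

```python
def conv_params(c_in, c_out, k):
    # standard k x k convolution: every output channel sees every input channel
    return c_out * c_in * k * k

def lite3x3_params(c_in, c_out):
    # pointwise 1x1: mixes channels (c_out * c_in weights)
    pointwise = c_out * c_in
    # depthwise 3x3: one 3x3 filter per output channel, no cross-channel mixing
    depthwise = c_out * 3 * 3
    return pointwise + depthwise

std = conv_params(256, 256, 3)    # 256 * 256 * 9  = 589824
lite = lite3x3_params(256, 256)   # 65536 + 2304   =  67840
```

With 256 input and output channels the factorised layer uses roughly 8.7x fewer weights than the standard convolution, which is where most of OSNet's parameter savings come from.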

## Omni-Scale Residual Block

### Multi-Scale Feature Learning

$$y=x+\tilde{x}$$

where

$$\tilde{x}=\sum_{t=1}^TF^t(x),\quad\mbox{s.t.}\quad T\ge 1$$
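The residual formulation above can be sketched numerically. In the block, stream $t$ stacks $t$ layers, so a larger $t$ corresponds to a larger receptive field; here each layer is only a stand-in (a channel-wise scaling plus ReLU), not the actual Lite $3\times 3$ convolution, but the aggregation $y = x + \sum_{t=1}^{T} F^t(x)$ is the same.

```python
import numpy as np

def lite_layer(x, w):
    # stand-in for one Lite 3x3 layer: scaling followed by ReLU
    return np.maximum(w * x, 0.0)

def stream(x, t, ws):
    # F^t: stream t stacks t layers -> exponent t controls the scale
    for i in range(t):
        x = lite_layer(x, ws[i])
    return x

def residual_block(x, T, ws):
    # y = x + x_tilde, where x_tilde = sum_{t=1}^{T} F^t(x)
    x_tilde = sum(stream(x, t, ws) for t in range(1, T + 1))
    return x + x_tilde

x = np.ones((4, 8, 8))       # toy feature map, (channels, H, W)
ws = [0.5, 0.5, 0.5, 0.5]    # one stand-in weight per stacked layer
y = residual_block(x, T=4, ws=ws)
# stream t yields 0.5**t, so y = 1 + (0.5 + 0.25 + 0.125 + 0.0625) = 1.9375
```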

### Unified Aggregation Gate

• To learn omni-scale features, we propose to combine the outputs of different streams in a dynamic way, i.e., different weights are assigned to different scales according to the input image, rather than being fixed after training. - a learnable neural network, the aggregation gate (AG)

1. The output of the AG network $G(x^t)$ is a vector rather than a scalar for the $t$-th stream, resulting in a more fine-grained fusion that tunes each feature channel.

2. The weights are dynamically computed by being conditioned on the input data.

$$\tilde{x}=\sum_{t=1}^TG(x^t)\odot x^t,\quad\mbox{where}\quad x^t=F^t(x)$$

• The AG is shared for all feature streams in the same omni-scale residual block.

1. The number of parameters is independent of $T$ (number of streams), thus the model becomes more scalable.

2. The supervision signals from all streams are gathered together to guide the learning of $G$.

$$\frac{\partial L}{\partial G}=\frac{\partial L}{\partial \tilde{x}}\frac{\partial \tilde{x}}{\partial G}=\frac{\partial L}{\partial \tilde{x}}\odot\left(\sum_{t=1}^T x^t\right)$$
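A minimal sketch of the shared gate may make the fusion concrete. The gate structure below (global average pooling, a small bottleneck MLP, then a sigmoid) follows the usual channel-attention recipe; the exact MLP sizes are illustrative assumptions, but the two key properties from the text hold: $G(x^t)$ is a per-channel vector, and the same gate weights are reused for every stream.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ag_gate(xt, w1, w2):
    # G(x^t): global average pool -> bottleneck MLP -> sigmoid
    # the output is a per-channel VECTOR, not a single scalar
    pooled = xt.mean(axis=(1, 2))           # (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)   # ReLU bottleneck, C -> C/r
    return sigmoid(w2 @ hidden)             # (C,), values in (0, 1)

def fuse(streams, w1, w2):
    # x_tilde = sum_t G(x^t) * x^t, with the SAME gate weights for all streams
    return sum(ag_gate(xt, w1, w2)[:, None, None] * xt for xt in streams)

C, r = 8, 4
w1 = rng.standard_normal((C // r, C))   # shared gate weights (bottleneck)
w2 = rng.standard_normal((C, C // r))   # shared gate weights (expansion)
streams = [rng.standard_normal((C, 5, 5)) for _ in range(3)]  # T = 3 streams
x_tilde = fuse(streams, w1, w2)
```

Because `w1` and `w2` are shared, the gate's parameter count does not grow with $T$, and gradients from all streams flow through the same weights, matching the two points above.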

### Differences to Inception and ResNeXt

1. The multi-stream design in OSNet strictly follows the scale-incremental principle dictated by the exponent $T$: different streams have different receptive fields but are built with the same Lite $3\times 3$ layers. Such a design is more effective at capturing a wide range of scales. In contrast, Inception [1] was originally designed to keep computational costs low by sharing computations across multiple streams, so its structure, which mixes convolution and pooling operations, was handcrafted. ResNeXt [2] has multiple equal-scale streams and thus learns representations at a single scale.

2. Inception/ResNeXt aggregate features by concatenation/addition, while OSNet uses a unified AG, which facilitates learning combinations of multi-scale features. Critically, this means the fusion is dynamic and adaptive to each individual input image. OSNet's architecture is therefore fundamentally different from that of Inception/ResNeXt.

3. OSNet uses factorised convolutions, so the building block, and consequently the whole network, is lightweight.

### Differences to SENet

SENet [3] aims to re-calibrate the feature channels by re-scaling the activation values for a single stream, whereas OSNet is designed to selectively fuse multiple feature streams of different receptive field sizes in order to learn omni-scale features.

## References

[1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[2] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

[3] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
