Omni-Scale Feature Learning for Person Re-Identification

Full-text link:
https://arxiv.xilesou.top/abs/1905.00953
The source code is here.
Introduction
Major Challenges
As an instance-level recognition problem, person ReID faces two major challenges, as illustrated in Figure 1:

The intra-class (instance/identity) variations are typically large due to changes in camera viewing conditions. → hard positives

The inter-class variations are small: people in public spaces often wear similar clothes, and from a distance, as is typical in surveillance videos, they can look incredibly similar. → hard negatives
Omni-Scale Features
To match people and distinguish them from impostors, features corresponding to both small local regions and the global whole body are important.

Looking at the global-scale features would narrow down the search to the true match (middle) and an impostor (right).

The local-scale features give away the fact that the person on the right is an impostor. (For more challenging cases, more complicated and richer features spanning multiple scales are required.)
OSNet:

enabling omni-scale feature learning;

a lightweight network.
Benefits:

When trained on ReID datasets, which are often of moderate size due to the difficulty of collecting cross-camera matched person images, a lightweight network with a small number of parameters is less prone to overfitting.

In a large-scale surveillance application, the most practical way to deploy ReID is to perform feature extraction at the camera end, in which case only features need to be sent to a central server instead of the raw videos.

Depthwise Separable Convolutions [notes]
A traditional convolution is factorised into two steps:

a depthwise convolution;

a pointwise convolution, i.e. a $1\times 1$ convolution.

In OSNet the order is reversed: a pointwise $1\times 1$ convolution followed by a $3\times 3$ depthwise convolution, which yields the Lite $3\times 3$ layer.
Omni-Scale Residual Block
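As a minimal sketch of the factorisation above, the Lite $3\times 3$ layer (pointwise first, then depthwise) can be written in plain NumPy; the function names and shapes here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def pointwise_conv(x, w):
    # x: (C_in, H, W); w: (C_out, C_in).  A 1x1 convolution is a
    # per-pixel linear map across channels.
    return np.einsum('oc,chw->ohw', w, x)

def depthwise_conv3x3(x, k):
    # x: (C, H, W); k: (C, 3, 3) -- one 3x3 filter per channel,
    # 'same' padding, no cross-channel mixing.
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j][:, None, None] * xp[:, i:i + H, j:j + W]
    return out

def lite3x3(x, w_pw, k_dw):
    # Lite 3x3: pointwise (1x1) convolution first, then depthwise 3x3.
    return depthwise_conv3x3(pointwise_conv(x, w_pw), k_dw)
```

The parameter saving is visible in the shapes: a standard $3\times 3$ convolution needs $9 \cdot C_{in} \cdot C_{out}$ weights, while the factorised version needs only $C_{in} \cdot C_{out} + 9 \cdot C_{out}$.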
Multi-Scale Feature Learning
$y=x+\tilde{x}$, where
$$\tilde{x}=\sum_{t=1}^T F^t(x),\quad\mbox{s.t.}\quad T\ge 1$$
Unified Aggregation Gate

To learn omni-scale features, we propose to combine the outputs of different streams dynamically, i.e., different weights are assigned to different scales according to the input image, rather than being fixed after training. → a learnable neural network, the aggregation gate (AG)

The output of the AG network, $G(x^t)$, is a vector rather than a scalar for the $t$-th stream, resulting in a more fine-grained fusion that tunes each feature channel.
The weights are dynamically computed, conditioned on the input data.

$$\tilde{x}=\sum_{t=1}^TG(x^t)\odot x^t,\quad\mbox{where}\quad x^t=F^t(x)$$ 
The AG is shared by all feature streams in the same omni-scale residual block.
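The fusion $\tilde{x}=\sum_t G(x^t)\odot x^t$ with a shared gate can be sketched as follows. The gate's internals here (global average pooling into a small bottleneck MLP with a sigmoid) follow a common squeeze-style design and are an assumption; the names and shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def aggregation_gate(xt, w1, w2):
    # Channel-wise gate G(x^t): global average pool -> tiny MLP -> sigmoid.
    # xt: (C, H, W); w1: (C_r, C); w2: (C, C_r).  Returns a C-vector,
    # one weight per feature channel (a vector, not a scalar).
    pooled = xt.mean(axis=(1, 2))          # (C,) global average pooling
    hidden = np.maximum(w1 @ pooled, 0.0)  # ReLU bottleneck
    return sigmoid(w2 @ hidden)            # (C,) gate values in (0, 1)

def fuse_streams(streams, w1, w2):
    # x_tilde = sum_t G(x^t) * x^t, with the SAME gate parameters
    # (w1, w2) shared by every stream in the block.
    return sum(aggregation_gate(xt, w1, w2)[:, None, None] * xt
               for xt in streams)
```

Because (w1, w2) appear once per block rather than once per stream, the gate's parameter count does not grow with the number of streams.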
Advantages:

The number of parameters is independent of $T$ (the number of streams), so the model is more scalable.

The supervision signals from all streams are gathered together to guide the learning of $G$:
$$\frac{\partial L}{\partial G}=\frac{\partial L}{\partial \tilde{x}}\frac{\partial \tilde{x}}{\partial G}=\frac{\partial L}{\partial \tilde{x}}\left(\sum_{t=1}^T x^t\right)$$
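This gradient can be checked numerically: since the same $G$ is shared, the contributions $x^t$ of all $T$ streams accumulate into one gradient. A small finite-difference sketch, with the illustrative simplification of treating $G$ directly as a fixed per-channel weight vector $g$ and taking $L$ as the sum of $\tilde{x}$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, H, W = 4, 6, 3, 3
xs = [rng.standard_normal((C, H, W)) for _ in range(T)]
g = rng.uniform(0.1, 0.9, size=C)  # shared gate output, one weight per channel

def loss(g):
    # L = sum of x_tilde, with x_tilde = sum_t g * x^t (g shared by all t)
    x_tilde = sum(g[:, None, None] * xt for xt in xs)
    return x_tilde.sum()

# Analytic gradient from the formula: dL/dG = dL/dx_tilde * sum_t x^t,
# with dL/dx_tilde = 1 here, reduced over the spatial dimensions.
analytic = sum(xs).sum(axis=(1, 2))

# Central finite differences, one channel of g at a time.
eps = 1e-6
numeric = np.array([
    (loss(g + eps * np.eye(C)[c]) - loss(g - eps * np.eye(C)[c])) / (2 * eps)
    for c in range(C)
])
assert np.allclose(analytic, numeric, atol=1e-4)
```

The key point the check makes concrete: the term $\sum_{t=1}^T x^t$ means every stream's supervision signal flows into the single shared gate.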
Differences to Inception and ResNeXt

The multi-stream design in OSNet strictly follows the scale-incremental principle dictated by the exponent $T$. Specifically, different streams have different receptive fields but are built with the same Lite $3\times 3$ layers. Such a design is more effective at capturing a wide range of scales. In contrast, Inception [1] was originally designed to reduce computational cost by sharing computations across multiple streams, so its structure, which mixes convolution and pooling operations, was handcrafted. ResNeXt [2] has multiple equal-scale streams and thus learns representations at a single scale.
Inception/ResNeXt aggregate features by concatenation/addition, while OSNet uses a unified AG, which facilitates learning combinations of multi-scale features. Critically, this means the fusion is dynamic and adaptive to each individual input image. OSNet's architecture is therefore fundamentally different from that of Inception/ResNeXt.

OSNet uses factorised convolutions, so the building block, and consequently the whole network, is lightweight.
Differences to SENet
SENet [3] aims to recalibrate feature channels by rescaling the activation values of a single stream, whereas OSNet is designed to selectively fuse multiple feature streams with different receptive-field sizes in order to learn omni-scale features.
References
[1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015. [link]
[2] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017. [link]
[3] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018. [link]
If you like my articles, feel free to follow my personal blog.
