Omni-Scale Feature Learning

  • Omni-Scale Feature Learning for Person Re-Identification


    The source code is here.


    Major Challenges


    As an instance-level recognition problem, person ReID faces two major challenges, as illustrated in Figure 1:

    1. Large intra-class (instance/identity) variations, caused by changes in camera viewing conditions. - hard positives

    2. Small inter-class variations: people in public spaces often wear similar clothes, and from a distance, as is typical in surveillance videos, they can look incredibly similar. - hard negatives

    Omni-Scale Feature

    To match people and distinguish them from impostors, features corresponding to both small local regions and global whole-body regions are important.

    1. Looking at the global-scale features would narrow down the search to the true match (middle) and an impostor (right).

    2. The local-scale features give away the fact that the person on the right is an impostor. (For more challenging cases, more complicated and richer features that span multiple scales are required.)


    1. enabling omni-scale feature learning;

    2. a lightweight network.


      1. When trained on ReID datasets, which are often of moderate size because collecting person images matched across cameras is difficult, a lightweight network with a small number of parameters is less prone to overfitting.

      2. In a large-scale surveillance application, the most practical way for ReID is to perform feature extraction at the camera end, in which case only features need to be sent to a central server instead of sending the raw videos.

    Depthwise Separable Convolutions [notes]

    Traditional Convolution


    Depthwise Convolution


    Pointwise Convolution

    $1\times 1$ convolution:


    Lite $3\times 3$ Convolution
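To make the saving from factorisation concrete, the sketch below compares weight counts for a standard $k\times k$ convolution and a factorised one. It assumes the Lite $3\times 3$ layer is a $1\times 1$ pointwise convolution (`c_in` to `c_out` channels) followed by a $k\times k$ depthwise convolution on the `c_out` channels, and it ignores biases and normalisation layers. The function names are illustrative and do not come from the source code.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a 1x1 pointwise conv followed by a k x k depthwise conv.

    Assumed ordering: pointwise (c_in -> c_out), then depthwise on c_out channels.
    """
    pointwise = c_in * c_out   # one 1x1 weight per (input, output) channel pair
    depthwise = k * k * c_out  # one k x k filter per output channel
    return pointwise + depthwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 256, 256
    std = standard_conv_params(k, c_in, c_out)
    sep = separable_conv_params(k, c_in, c_out)
    print(f"standard:  {std}")
    print(f"separable: {sep}")
    print(f"reduction: {std / sep:.2f}x")
```

Under these assumptions the count drops from $k^2 \cdot c \cdot c'$ to $(k^2 + c)\cdot c'$, so the saving grows with the number of channels.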


    Omni-Scale Residual Block

    Multi-Scale Feature Learning




    $$\tilde{x}=\sum_{t=1}^TF^t(x),\quad\mbox{s.t.}\quad T\ge 1$$
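The sum above combines $T$ streams that each see a different spatial extent. As a rough sketch, assume stream $t$ stacks $t$ stride-1 $3\times 3$ layers (an assumption about $F^t$ that these notes do not state explicitly); its receptive field is then $(2t+1)\times(2t+1)$. The helper names below are hypothetical.

```python
def stacked_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Side length of the receptive field of stacked stride-1 convolutions."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    # Each additional stride-1 layer widens the field by (kernel_size - 1).
    return 1 + num_layers * (kernel_size - 1)


def stream_receptive_fields(num_streams: int, kernel_size: int = 3) -> list[int]:
    """Receptive field per stream, assuming stream t stacks t layers."""
    return [stacked_receptive_field(t, kernel_size) for t in range(1, num_streams + 1)]


if __name__ == "__main__":
    for t, rf in enumerate(stream_receptive_fields(4), start=1):
        print(f"stream {t}: {rf}x{rf}")
```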

    Unified Aggregation Gate

    • To learn omni-scale features, we propose to combine the outputs of different streams in a dynamic way, i.e., different weights are assigned to different scales according to the input image, rather than being fixed after training. - learnable neural network AG

      1. The output of the AG network $G(x^t)$ is a vector rather than a scalar for the $t$-th stream, resulting in a more fine-grained fusion that tunes each feature channel.

      2. The weights are dynamically computed by being conditioned on the input data.

    $$\tilde{x}=\sum_{t=1}^TG(x^t)\odot x^t,\quad\mbox{where}\quad x^t=F^t(x)$$

    • The AG is shared for all feature streams in the same omni-scale residual block.


      1. The number of parameters is independent of $T$ (number of streams), thus the model becomes more scalable.

      2. The supervision signals from all streams are gathered together to guide the learning of $G$.

      $$\frac{\partial L}{\partial G}=\frac{\partial L}{\partial \tilde{x}}\frac{\partial \tilde{x}}{\partial G}=\frac{\partial L}{\partial \tilde{x}}\left(\sum_{t=1}^T x^t\right)$$
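The gated fusion $\tilde{x}=\sum_t G(x^t)\odot x^t$ with one shared gate can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes $G$ is global average pooling followed by a small MLP (one ReLU hidden layer) and a sigmoid, and it represents a feature map as a list of channels, each a flat list of spatial activations. All function and parameter names are hypothetical.

```python
import math
import random


def global_avg_pool(x):
    """Mean activation per channel; x is a list of channels (flat lists)."""
    return [sum(channel) / len(channel) for channel in x]


def linear(v, weight, bias):
    """Dense layer: weight has one row per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weight, bias)]


def gate(x, params):
    """G(x): one weight in (0, 1) per channel, computed from the input itself."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, h) for h in linear(global_avg_pool(x), w1, b1)]   # ReLU
    return [1.0 / (1.0 + math.exp(-z)) for z in linear(hidden, w2, b2)]  # sigmoid


def aggregate(streams, params):
    """x_tilde = sum_t G(x^t) * x^t, using the same gate parameters for every stream."""
    channels, positions = len(streams[0]), len(streams[0][0])
    fused = [[0.0] * positions for _ in range(channels)]
    for x_t in streams:
        g = gate(x_t, params)  # a vector: one gate value per channel
        for c in range(channels):
            for p in range(positions):
                fused[c][p] += g[c] * x_t[c][p]
    return fused


def init_params(channels, hidden, seed=0):
    """Random gate parameters; their size does not depend on the number of streams."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return matrix(hidden, channels), [0.0] * hidden, matrix(channels, hidden), [0.0] * channels


if __name__ == "__main__":
    params = init_params(channels=2, hidden=1)
    streams = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    print(aggregate(streams, params))
```

Because `params` is passed unchanged to every stream, adding streams adds no gate parameters, and every stream contributes to the gradient of the same weights.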

    Differences to Inception and ResNeXt

    1. The multi-stream design in OSNet strictly follows the scale-incremental principle dictated by the exponent $T$. Specifically, different streams have different receptive fields but are built with the same Lite $3\times 3$ layers. Such a design is more effective at capturing a wide range of scales. In contrast, Inception [1] was originally designed to have low computational costs by sharing computations with multiple streams. Therefore its structure, which includes mixed operations of convolution and pooling, was handcrafted. ResNeXt [2] has multiple equal-scale streams, and thus learns representations at the same scale.

    2. Inception/ResNeXt aggregates features by concatenation/addition, while OSNet uses a unified AG, which facilitates the learning of combinations of multi-scale features. Critically, it means that the fusion is dynamic and adaptive to each individual input image. Therefore, OSNet's architecture is fundamentally different from that of Inception/ResNeXt.

    3. OSNet uses factorised convolutions, and thus the building block and subsequently the whole network are lightweight.

    Differences to SENet

    SENet [3] aims to re-calibrate the feature channels by re-scaling the activation values for a single stream, whereas OSNet is designed to selectively fuse multiple feature streams of different receptive field sizes in order to learn omni-scale features.


    [1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015. [link]

    [2] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017. [link]

    [3] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018. [link]




Copyright © 2018 All rights reserved.
