Object Detection | CornerNet: Detecting Objects as Paired Keypoints (Paper Notes)



  • Paper link
    Link to my standalone blog, where the LaTeX formulas render properly

    Overview

    ​ CornerNet predicts an object's bounding box as a pair of keypoints: the top-left corner and the bottom-right corner. A single convolutional network produces heatmaps and embedding vectors: one heatmap for the top-left corners of all objects, one for all bottom-right corners, and an embedding vector for each detected corner (the embeddings are used to group the corners belonging to the same object).


    Contributions

    ​ 1. The detection-as-paired-keypoints formulation itself, described in the Overview (the distance between the embeddings of two corners from the same object is small).

    ​ 2. The authors propose corner pooling to help localize corners. For most objects in natural images, the corners of the bounding box do not lie on the object itself. Taking top-left corner pooling as an example: for each channel, take the maximum of the feature map along the horizontal direction (to the right of each location) and along the vertical direction (below each location), then sum the two maxima.


    (A corner cannot be localized based on local evidence.)


    ​ The paper argues that corner pooling is effective because:

    ​ (1) The center of a bounding box is harder to localize, since it depends on all four sides of the box, whereas each corner depends on only two sides, so corners are easier to extract.

    ​ (2) Corners provide a more efficient discretization of the box space: O(wh) corners can represent O($w^2h^2$) possible anchor boxes.

    ​ 3. The model is built on the hourglass architecture and trained with a variant of focal loss.

    Architecture


    The architecture has three parts:

    • Hourglass Network
    • Top-left and bottom-right corner heatmaps
    • Prediction modules

    The hourglass network is a classic architecture for human pose estimation; the paper stacks two hourglass modules. The network then branches into a top-left and a bottom-right prediction module, each with its own corner pooling layer, producing the corresponding heatmaps, embedding vectors, and offsets. The embedding vectors make the distance between the two corners (top-left and bottom-right) of the same object small, and the offsets are used to adjust the corner locations to produce tighter bounding boxes.

    Detecting Corners

    ​ The heatmaps the model produces have C channels (C is the number of object categories; there is no background channel), and each channel is a binary mask indicating the corner locations for that category.

    ​ For each corner there is only one ground-truth location; all other locations are negatives. During training, however, the model reduces the penalty on negative locations that fall within a radius r of each ground-truth corner, because a pair of corner detections within that radius can still generate a bounding box that substantially overlaps the ground truth.

    ​ The radius is set according to the object's size, by ensuring that a pair of points within the radius (points inside the circle, presumably) would generate a box with an IoU of at least t with the ground-truth box. Given the radius, the amount of penalty reduction is given by an unnormalized 2D Gaussian, $e^{-\frac{x^{2}+y^{2}}{2 \sigma^{2}}}$ (x and y are the distances from the ground-truth corner, and σ is 1/3 of the radius).

    (Original: We determine the radius by the size of an object by ensuring that a pair of points within the radius would generate a bounding box with at least t IoU with the ground-truth annotation (we set t to 0.3 in all experiments). Given the radius, the amount of penalty reduction is given by an unnormalized 2D Gaussian, $e^{-\frac{x^{2}+y^{2}}{2 \sigma^{2}}}$, whose center is at the positive location and whose σ is 1/3 of the radius.)
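    The penalty-reduction map can be sketched as follows. This is a minimal numpy illustration of the unnormalized 2D Gaussian above with σ = radius/3; the function name is my own, and the paper's size-dependent radius computation (from the IoU threshold t) is omitted here.

    ```python
    import numpy as np

    def gaussian_penalty(height, width, center, radius):
        """Unnormalized 2D Gaussian penalty-reduction map around a
        ground-truth corner at `center` = (row, col), sigma = radius / 3."""
        cy, cx = center
        sigma = radius / 3.0
        ys = np.arange(height)[:, None]
        xs = np.arange(width)[None, :]
        # x, y are the distances from the positive (ground-truth) location
        return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    ```

    The map equals 1 at the ground-truth corner and decays with distance, so nearby "negatives" are penalized less.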


    ​ Let $p_{cij}$ be the predicted heatmap score for category c at location *(i,j)*, and $y_{cij}$ the ground truth at the same location. The paper's variant of focal loss for corner detection is:
    $$
    L_{det}=\frac{-1}{N} \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W}\begin{cases}\left(1-p_{cij}\right)^{\alpha} \log \left(p_{cij}\right) & \text { if } y_{cij}=1 \\ \left(1-y_{cij}\right)^{\beta}\left(p_{cij}\right)^{\alpha} \log \left(1-p_{cij}\right) & \text { otherwise }\end{cases}
    $$
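    The focal-loss variant can be sketched in numpy as below. This is an illustrative reimplementation (not the authors' code); `gt` is the Gaussian-smoothed ground-truth heatmap, and the paper's defaults α = 2, β = 4 are assumed.

    ```python
    import numpy as np

    def corner_focal_loss(preds, gt, alpha=2.0, beta=4.0, eps=1e-12):
        """Variant focal loss over corner heatmaps.

        preds : predicted probabilities, shape (C, H, W)
        gt    : Gaussian-smoothed ground truth; gt == 1 marks positives
        """
        pos = gt == 1
        neg = ~pos
        pos_loss = ((1 - preds[pos]) ** alpha) * np.log(preds[pos] + eps)
        neg_loss = ((1 - gt[neg]) ** beta) * (preds[neg] ** alpha) \
            * np.log(1 - preds[neg] + eps)
        n = max(pos.sum(), 1)  # N = number of positive corner locations
        return -(pos_loss.sum() + neg_loss.sum()) / n
    ```

    The $(1-y_{cij})^{\beta}$ factor is what implements the Gaussian penalty reduction: negatives close to a ground-truth corner (where $y_{cij}$ is near 1) contribute almost nothing to the loss.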
    ​ Because of downsampling, the heatmaps the model produces have a lower resolution than the input image. The paper therefore adds an offset loss to fine-tune the predicted corners toward the ground truth: before remapping the heatmap locations back to the input image, the model predicts location offsets that slightly adjust the coordinates. The offsets are computed as:
    $$
    \boldsymbol{o}_{k}=\left(\frac{x_{k}}{n}-\left\lfloor\frac{x_{k}}{n}\right\rfloor, \frac{y_{k}}{n}-\left\lfloor\frac{y_{k}}{n}\right\rfloor\right)
    $$

    $$
    L_{off}=\frac{1}{N} \sum_{k=1}^{N} \operatorname{SmoothL1Loss}\left(\boldsymbol{o}_{k}, \hat{\boldsymbol{o}}_{k}\right)
    $$

    (one set of offsets shared by the top-left corners of all categories, and another set shared by the bottom-right corners.)
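    The two formulas above can be sketched as follows; the function names are my own, and `stride` plays the role of the downsampling factor n in the paper.

    ```python
    import math
    import numpy as np

    def corner_offset(x, y, stride):
        """Fractional offset lost when an image-space corner (x, y) is
        mapped onto a heatmap downsampled by `stride` (n in the paper)."""
        return (x / stride - math.floor(x / stride),
                y / stride - math.floor(y / stride))

    def smooth_l1(pred, target):
        """Elementwise smooth-L1 (Huber) loss, summed over elements."""
        d = np.abs(np.asarray(pred, float) - np.asarray(target, float))
        return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()
    ```

    For example, a corner at x = 10 on a stride-4 heatmap falls at heatmap column 2.5, so an offset of 0.5 must be recovered at remapping time.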

    Grouping Corners

    The embedding loss has two terms: $L_{pull}$ groups the two corners of the same object (pulls their embeddings together), while $L_{push}$ separates the corners of different objects:
    $$
    \begin{aligned}
    L_{pull}&=\frac{1}{N} \sum_{k=1}^{N}\left[\left(e_{t_{k}}-e_{k}\right)^{2}+\left(e_{b_{k}}-e_{k}\right)^{2}\right] \\
    L_{push}&=\frac{1}{N(N-1)} \sum_{k=1}^{N} \sum_{\substack{j=1 \\ j \neq k}}^{N} \max \left(0, \Delta-\left|e_{k}-e_{j}\right|\right)
    \end{aligned}
    $$
    where $e_{t_k}$ is the embedding of the top-left corner of object k, $e_{b_k}$ the embedding of its bottom-right corner, $e_k$ the average of $e_{t_k}$ and $e_{b_k}$, and $\Delta = 1$. (In other words, $L_{pull}$ makes the distance between embeddings of the same object as small as possible, while $L_{push}$ pushes the mean embeddings of different objects at least $\Delta = 1$ apart.)
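    A minimal numpy sketch of the two terms, assuming scalar embeddings as in the paper (the function name is my own):

    ```python
    import numpy as np

    def pull_push_losses(e_tl, e_br, delta=1.0):
        """e_tl[k], e_br[k]: embeddings of the top-left / bottom-right
        corners of object k. Returns (L_pull, L_push)."""
        e_tl = np.asarray(e_tl, dtype=float)
        e_br = np.asarray(e_br, dtype=float)
        n = len(e_tl)
        e_k = (e_tl + e_br) / 2.0  # per-object mean embedding
        l_pull = ((e_tl - e_k) ** 2 + (e_br - e_k) ** 2).mean()
        if n < 2:
            return l_pull, 0.0
        # hinge on the pairwise distances between mean embeddings
        diff = np.abs(e_k[:, None] - e_k[None, :])
        push = np.maximum(0.0, delta - diff)
        np.fill_diagonal(push, 0.0)
        l_push = push.sum() / (n * (n - 1))
        return l_pull, l_push
    ```

    Note that only the distances matter, not the absolute embedding values; two objects whose mean embeddings are already more than Δ apart contribute nothing to $L_{push}$.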

    Corner Pooling


    For top-left corner pooling at location (i, j), take the maximum $l_{ij}$ over the row entries at and to the right of (i, j), and the maximum $t_{ij}$ over the column entries at and below (i, j), then add $t_{ij}$ and $l_{ij}$. In practice this is computed efficiently with dynamic programming, as a single right-to-left and bottom-to-top pass of running maxima.

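    The dynamic-programming form can be sketched with cumulative maxima; this is an illustrative single-channel numpy version, not the authors' CUDA implementation.

    ```python
    import numpy as np

    def top_left_corner_pool(feat):
        """Top-left corner pooling on one channel of shape (H, W):
        out[i, j] = max(feat[i:, j]) + max(feat[i, j:]),
        computed with running maxima instead of per-pixel scans."""
        f = np.asarray(feat, dtype=float)
        # t_ij: running max from bottom to top along each column
        t = np.maximum.accumulate(f[::-1, :], axis=0)[::-1, :]
        # l_ij: running max from right to left along each row
        l = np.maximum.accumulate(f[:, ::-1], axis=1)[:, ::-1]
        return t + l
    ```

    Bottom-right corner pooling is symmetric: running maxima top-to-bottom and left-to-right.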

    Hourglass Network


    The backbone consists of two stacked hourglass modules; a number of detail-level modifications are omitted here.

    Experiments

    $$
    L=L_{d e t}+\alpha L_{p u l l}+\beta L_{p u s h}+\gamma L_{o f f}
    $$

    $\alpha$ and $\beta$ are set to 0.1 and $\gamma$ to 1. On a Titan X (PASCAL) GPU, the model averages 244 ms of detection time per image. The paper reports evaluation results on MS COCO.
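    The full objective is a straightforward weighted sum; a trivial sketch with the paper's weights as defaults:

    ```python
    def total_loss(l_det, l_pull, l_push, l_off,
                   alpha=0.1, beta=0.1, gamma=1.0):
        """L = L_det + alpha*L_pull + beta*L_push + gamma*L_off,
        with the paper's weights alpha = beta = 0.1, gamma = 1."""
        return l_det + alpha * l_pull + beta * l_push + gamma * l_off
    ```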


    Error Analysis

    Remaining weaknesses:

    (1) CornerNet predicts heatmaps, offsets, and embeddings simultaneously, and errors in each of these outputs degrade the final detections.

    (2) If either corner is missed, the object cannot be detected.

    (3) Precise offsets are needed to generate tight bounding boxes.

    (4) Incorrect embeddings lead to many false bounding boxes.

    Extensions

    DeNet (Tychsen-Smith and Petersson, 2017a) is a two-stage detector which generates RoIs without using anchor boxes. It first determines how likely each location belongs to either the top-left, top-right, bottom-left or bottom-right corner of a bounding box. It then generates RoIs by enumerating all possible corner combinations, and follows the standard two-stage approach to classify each RoI. Our approach is very different from DeNet.

    • First, DeNet does not identify if two corners are from the same objects and relies on a sub-detection
      network to reject poor RoIs. In contrast, our approach is a one-stage approach which detects and groups the corners using a single ConvNet.
    • Second, DeNet selects features at manually determined locations relative to a region for classification, while our approach does not require any feature selection step.
    • Third, we introduce corner pooling, a novel type of layer to enhance corner detection.

    The paper also contrasts CornerNet with Point Linking Network (PLN), another detector that works without anchor boxes:

    • First, CornerNet groups the corners by predicting embedding vectors, while PLN groups the corner and center by predicting pixel locations.
    • Second, CornerNet uses corner pooling to better localize the corners.

    CornerNet's grouping mechanism is inspired by Associative Embedding, proposed in the context of multi-person pose estimation.

    The authors' idea in fact comes from a multi-person pose-estimation paper [1]. CNN-based 2D multi-person pose estimation generally follows one of two paradigms (bottom-up approaches and top-down approaches):

    ​ (1) Top-down framework: first run a person detector to obtain bounding boxes, then detect human keypoints inside each box and connect them into a per-person pose. Its drawback is a strong dependence on the quality of the person detection boxes; a representative method is RMPE.

    ​ (2) Bottom-up framework: first detect every person's keypoints across the whole image, then assemble the detected body parts into individual poses. Its drawback is that parts belonging to different people may be assembled into one person; the representative method is OpenPose.

    Newell et al. propose an approach that detects and groups human joints in a single network. In their approach each detected human joint has an embedding vector. The joints are grouped based on the distances between their embeddings.

