Faster R-CNN: Region Proposal Networks¶

A Region Proposal Network (RPN) takes an image and outputs a set of object proposals.

it is a convolutional neural network
it takes images of any size

Architecture¶

RPN Model Structure:

        =================================
        =       Input Feature Map       =
        =================================
                        |
                        |
                        |
                        |
                        |
        +-------------------------------+
        |    3×3 Convolutional Layer    |
        |                               |------------------|
        |       (sliding window)        |                  |
        +-------------------------------+                  |
                        |                                  |
                        |                                  |
                        |                                  |
                        |                                  |
                        |                                  |
        +-------------------------------+  +-------------------------------+
        |    1×1 Convolutional Layer    |  |    1×1 Convolutional Layer    |
        |                               |  |                               |
        |       (K×2 Classifiers)       |  |        (K×4 Regressors)       |
        +-------------------------------+  +-------------------------------+
                        |                                  |
                        |                                  |
                        |                                  |
                        |                                  |
                        |                                  |
        +-------------------------------+                  |
        |         Proposal Layer        |------------------|
        +-------------------------------+
                        |
                        |
                        |
                        |
                        |
        =================================
        =        Region Proposals       =
        =================================

The number of filters in the sliding window layer depends upon the backbone (\(256\) for the ZF and \(512\) for the VGG).
The two parallel \(1 \times 1\) convolutional layers are the core of the RPN “attention” mechanism.
\(K\) is the number of anchors. The default value is \(9\)
- three scales: \(128\), \(256\), \(512\)
- three ratios \(2:1\), \(1:1\), \(1:2\)

Efficiency¶

Pyramid of Anchors¶

The RPN has pre-defined anchors based on the size and aspect ratio, and it simply computes features at different anchor levels. This is referred as “pyramid of anchors”. It allows the RPN to efficiently generate region proposals at different scales, without the need for computationally expensive resizing or cropping operations, as the image / feature pyramids design.

Translation Invariant¶

If an object can be captured by a region proposal, it will be captured anywhere in any image. The probable reason is that the anchors are pre-defined and invariant and that the \(K\) regressors are also invariant.

Loss Function¶

Only a small portion of predictions are passed through loss calculation:
- positive, which has high overlap (\(\mathrm{IoU} > 0.7\)) with any ground truth box, or
- negative, which has low overlap (\(\mathrm{IoU} < 0.3\)) with all ground truth boxes
It is a combination of both classification and localization risk, which follows the multi-task loss in Fast R-CNN:
- classification: log loss
- localization: smooth \(L_1\)[1]. It is only activated for positive predictions

Back to Object Detection.