YOLO: Loss Function¶
Notation¶
The YOLO algorithm assumes that the model divides an input image into an \(S \times S\) grid. Each grid cell is responsible for predicting \(B\) bounding boxes, performing both localization and classification over \(K\) classes.
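Therefore the bounding box with index \((i, j)\), where \(i \in \{1, \ldots, S^2\}\) indexes the grid cell and \(j \in \{1, \ldots, B\}\) indexes the box within that cell, is [1]:
\[\mathbf{y}_{ij} = \left[c, x, y, w, h, p_{ij}^{(1)}, \ldots, p_{ij}^{(K)}\right]\]
where: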
\(c\): object confidence
prediction: logit value
ground truth: \(1\) / \(0\) flag
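\(x\), \(y\), \(w\) and \(h\): bounding box coordinates
Since YOLO9000 [11], YOLO has predicted bounding boxes relative to anchor boxes. The predictions correspond to:
\[\begin{split}b_x &= \sigma(x) + c_x \\ b_y &= \sigma(y) + c_y \\ b_w &= p_w e^w \\ b_h &= p_h e^h\end{split}\]
where \((c_x, c_y)\) is the offset of the grid cell from the top-left corner of the image and \((p_w, p_h)\) are the width and height of the anchor box (not to be confused with the confidence \(c\) or the class confidences \(p_{ij}^{(k)}\)).
\(p_{ij}^{(1)}, \ldots, p_{ij}^{(K)}\): confidence of \(K\) classes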
prediction: logit values
ground truth: one-hot values
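As a minimal sketch of the anchor-based decoding equations above, the following Python function maps raw network outputs to a box. The names `decode_box`, `cell_x`, `cell_y`, `prior_w` and `prior_h` are illustrative choices rather than part of any particular implementation, and all quantities are assumed to be expressed in grid-cell units.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, squashing a logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(x, y, w, h, cell_x, cell_y, prior_w, prior_h):
    """Apply b_x = sigma(x) + c_x, b_y = sigma(y) + c_y,
    b_w = p_w * exp(w), b_h = p_h * exp(h)."""
    b_x = sigmoid(x) + cell_x      # centre stays inside the cell
    b_y = sigmoid(y) + cell_y
    b_w = prior_w * math.exp(w)    # anchor width rescaled by exp(w)
    b_h = prior_h * math.exp(h)
    return b_x, b_y, b_w, b_h


# Zero outputs place the centre in the middle of cell (3, 4)
# and leave the anchor size unchanged.
print(decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=4, prior_w=2.0, prior_h=1.5))
# (3.5, 4.5, 2.0, 1.5)
```

Because the sigmoid bounds the centre offset to \((0, 1)\), a cell can only predict boxes whose centre lies inside that cell.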
Mask¶
IoU Mask¶
Given a prediction \(\hat{\mathbf{y}}\) and label \(\mathbf{y}\), the IoU mask is:
where
\(\mathrm{IoU}_{ij}^{\mathrm{max}}\) is the maximum \(\mathrm{IoU}\) value of \(\hat{\mathbf{y}}_{ij}\) compared with all the ground truth bounding boxes;
\(\mathrm{IoU}_0\) is the minimum \(\mathrm{IoU}\) threshold.
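The thresholding described here can be sketched in Python. This sketch assumes the mask is \(1\) when \(\mathrm{IoU}_{ij}^{\mathrm{max}}\) exceeds \(\mathrm{IoU}_0\) and \(0\) otherwise, and it assumes boxes are given as corner tuples `(x_min, y_min, x_max, y_max)` for simplicity; the helper names `iou` and `iou_mask` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_mask(pred_boxes, gt_boxes, iou_0):
    """Assumed rule: 1 where the best IoU against any ground truth box
    exceeds iou_0, else 0."""
    mask = []
    for pred in pred_boxes:
        iou_max = max((iou(pred, gt) for gt in gt_boxes), default=0.0)
        mask.append(1 if iou_max > iou_0 else 0)
    return mask


preds = [(0, 0, 2, 2), (5, 5, 6, 6)]
gts = [(1, 0, 3, 2)]
print(iou_mask(preds, gts, iou_0=0.3))  # [1, 0]
```

The first prediction overlaps the ground truth box with an IoU of \(1/3\), which is above the threshold; the second does not overlap at all.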
Ground Truth Mask¶
The ground truth mask is determined by the ground truth confidence only:
Confidence Mask¶
The confidence mask is determined by the prediction only:
where
\(o_0\) is the minimum object confidence threshold.
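A minimal sketch of this mask, assuming that the predicted confidence logit is passed through a sigmoid and the mask is \(1\) when the result exceeds \(o_0\); the function name `confidence_mask` is hypothetical.

```python
import math


def confidence_mask(conf_logits, o_0):
    """Assumed rule: 1 where sigmoid(logit) > o_0, else 0."""
    mask = []
    for logit in conf_logits:
        confidence = 1.0 / (1.0 + math.exp(-logit))  # logit -> probability
        mask.append(1 if confidence > o_0 else 0)
    return mask


print(confidence_mask([2.0, -1.0, 0.0], o_0=0.5))  # [1, 0, 0]
```

Since the sigmoid is monotonic, the same mask can be obtained by comparing the logit directly with \(\log\left(o_0 / (1 - o_0)\right)\).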
Overall Object / Background Mask¶
The background mask is a combination of \(\mathbb{1}_{ij}^{\mathrm{gt}}\) and \(\mathbb{1}_{ij}^{\mathrm{IoU}}\):
The object mask is \(\mathbb{1}_{ij}^{\mathrm{gt}}\):
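One way to combine the masks is sketched below. It assumes, in the spirit of the YOLOv3 ignore rule, that a box counts as background only when it has no ground truth object and its best IoU does not exceed the threshold; this exact combination is an assumption, and the function name is hypothetical.

```python
def object_background_masks(gt_mask, iou_mask):
    """Return (object_mask, background_mask).

    object_mask     = gt_mask
    background_mask = (1 - gt_mask) * (1 - iou_mask)   # assumed combination
    """
    object_mask = list(gt_mask)
    background_mask = [(1 - g) * (1 - m) for g, m in zip(gt_mask, iou_mask)]
    return object_mask, background_mask


gt = [1, 0, 0]
iou_m = [1, 1, 0]
print(object_background_masks(gt, iou_m))  # ([1, 0, 0], [0, 0, 1])
```

Under this assumption the second box is neither object nor background: it overlaps a ground truth box well enough that it is not penalized as background, yet it is not the box responsible for that object.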
Empirical Risk¶
The YOLO risk is the sum of the following risks.
IoU Risk¶
Some implementations [2] give larger weights to small bounding boxes:
where \(I_w\) and \(I_h\) are the image width and height respectively.
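To illustrate the idea of weighting small boxes more heavily, the sketch below uses the weight \(2 - \frac{w h}{I_w I_h}\), a scale factor seen in some open-source YOLOv3 training code, applied to \(1 - \mathrm{IoU}\). Both the weight and the \(1 - \mathrm{IoU}\) form are assumptions made for illustration and are not necessarily the formulation of [2]; the function names are hypothetical.

```python
def box_weight(w, h, image_w, image_h):
    """Assumed weight: close to 2 for tiny boxes, 1 for a full-image box."""
    return 2.0 - (w * h) / (image_w * image_h)


def iou_risk(ious, gt_sizes, object_mask, image_w, image_h):
    """Sum of weight * (1 - IoU) over boxes selected by the object mask."""
    total = 0.0
    for iou_value, (w, h), m in zip(ious, gt_sizes, object_mask):
        total += m * box_weight(w, h, image_w, image_h) * (1.0 - iou_value)
    return total


# A 10x10 box in a 100x100 image is weighted almost twice as much
# as a box covering the whole image.
print(box_weight(10, 10, 100, 100), box_weight(100, 100, 100, 100))
print(iou_risk([0.5, 0.8], [(10, 10), (100, 100)], [1, 1], 100, 100))
```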
Confidence Risk¶
The confidence risk is computed with binary cross entropy:
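A minimal sketch of such a confidence risk, assuming the target is \(1\) for boxes selected by the object mask and \(0\) for boxes selected by the background mask, with all other boxes ignored. The equal weighting of both terms and the function names are assumptions; the cross entropy is evaluated directly on logits in a numerically stable form.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross entropy between sigmoid(logit) and target,
    written as max(z, 0) - z * t + log(1 + exp(-|z|)) for stability."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def confidence_risk(conf_logits, object_mask, background_mask):
    """Object boxes are pushed toward 1, background boxes toward 0."""
    risk = 0.0
    for z, obj, bg in zip(conf_logits, object_mask, background_mask):
        risk += obj * bce_with_logits(z, 1.0) + bg * bce_with_logits(z, 0.0)
    return risk


# The third box belongs to neither mask, so it contributes nothing.
print(confidence_risk([0.0, 0.0, 5.0], [1, 0, 0], [0, 1, 0]))  # 2 * log(2)
```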
Class Risk¶
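The YOLOv3 class risk is computed with binary cross entropy applied to each class logit independently, rather than with a softmax over all classes. This supports multi-label targets with overlapping classes, and is also reported to handle class imbalance more effectively [12].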
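Written out in the notation above, and assuming (as is common, though not stated here) that only boxes selected by the object mask \(\mathbb{1}_{ij}^{\mathrm{gt}}\) contribute, the class risk could take the form:

\[R_{\mathrm{class}} = -\sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\mathrm{gt}} \sum_{k=1}^{K} \left[ p_{ij}^{(k)} \log \sigma\left(\hat{p}_{ij}^{(k)}\right) + \left(1 - p_{ij}^{(k)}\right) \log\left(1 - \sigma\left(\hat{p}_{ij}^{(k)}\right)\right) \right]\]

Here \(\hat{p}_{ij}^{(k)}\) denotes the predicted class logit, \(p_{ij}^{(k)}\) the one-hot ground truth value and \(\sigma\) the logistic sigmoid; the symbol \(R_{\mathrm{class}}\) is introduced only for illustration.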
Review¶
Why no MSE¶
We adopted the IoU risk instead of the mean squared error (MSE) because IoU is invariant to the scale of the bounding boxes and does not depend directly on their raw coordinates, so it can be more robust to data imbalance and to objects of very different sizes.
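The scale argument can be checked with a small numerical sketch. It compares the same relative misalignment at two scales, using corner-format boxes `(x_min, y_min, x_max, y_max)` and hypothetical helper names; the specific boxes are chosen only for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def mse(box_a, box_b):
    """Mean squared error over the four box coordinates."""
    return sum((a - b) ** 2 for a, b in zip(box_a, box_b)) / len(box_a)


small_pred, small_gt = (1, 1, 11, 11), (0, 0, 10, 10)
# The same pair of boxes, scaled up by a factor of 10.
large_pred, large_gt = (10, 10, 110, 110), (0, 0, 100, 100)

print(mse(small_pred, small_gt), mse(large_pred, large_gt))  # 1.0 100.0
print(iou(small_pred, small_gt), iou(large_pred, large_gt))  # equal values
```

The MSE grows by a factor of 100 for the larger pair even though the relative error is identical, whereas the IoU is the same in both cases.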