1. Abstract

one-stage detector는 빠르지만 성능이 two-stage detector에 비해서 떨어진다.
- 극단적인 foreground-background class 불균형 때문에 발생하는 문제
- region proposal의 대부분이 background(negative) image이므로 굉장히 class가 불균형적임
focal loss
- object detection을 위한 loss function을 새로 정의
- well-classified example(background)의 weight를 낮추고 hard example(object)의 weight는 높이는 방식
- well-classified example(background)이 hard example(object)의 학습을 방해 못하도록 함
RetinaNet
- one-stage detector이지만, two-stage detector의 성능을 보여줌

2. Introduction

fig1-2

two-stage detector
- sparse한(object보다 background가 훨씬 많으므로) object 후보군을 생성하고, 각 후보군에서 CNN으로 object를 분류하는 방식
- SS, RPN과 같은 방식으로 object 후보군을 줄여가는 방식으로 background를 filtering해서 class imbalance를 처리함
one-stage detector
- dense한 object 후보군과 각 후보군의 class를 동시에 예측하는 방식
- object 후보군을 줄이는 방식을 사용하지 않기 때문에 너무 많은 background 때문에 학습이 영향을 받음
focal loss
- class imbalance를 해결하기 위한 loss function
- 동적으로 cross-entropy의 크기를 조절
- 학습하기 쉬운 것들의 loss를 줄이고 학습하기 어려운 object에 집중하도록 설계
RetinaNet
- one-stage detector
- feature pyramid, anchor boxes 사용

Classic Object Detectors
Two-stage Detectors
One-stage Detectors
Class Imbalance
Robust Estimation

4. Focal Loss

cross entropy in binary classification
- $CE(y, p)$ = $-ylog(p) -(1-y)log(1-p)$
rewritten cross entropy in binary classification
- $CE(p_t)$ = $-log(p_t)$

4.1 Balanced Cross Entropy

-balanced cross entropy in binary classification
- $\alpha-CE(p_t)$ = $-\alpha_t log(p_t)$
- $\sum \alpha_t$ = 1
- class의 중요한 정도를 조절
focal loss의 baseline으로 사용

4.2 Focal Loss Definition

object detection의 가장 큰 문제는 쉽게 분류되는 negatives(background)가 너무 많아 gradient에 큰 영향을 미치는 것
- 따라서, class의 중요도 뿐만 아니라 문제의 난이도도 고려를 해야함
- 하지만, $\alpha$ -balancing은 class의 중요도만 고려할 뿐 쉬운 문제와 어려운 문제를 따로 구분하지는 않음

4.2.1 Focal Loss

focal loss in binary classification
- $FL(p_t)$ = $-\alpha_t (1-p_t)^{\gamma} log(p_t)$
- $(1-p_t)^{\gamma}$ = modulating factor
- $\gamma$ = tunable focusing parameter $\in [0,5]$
- 쉬운 문제와 어려운 문제를 구분하여 loss를 적용

= 1 , 1
- modulating factor $\approx$ 0
- 즉, 잘 맞추는 것에 대한 loss가 낮음
= 1 , 0
- modulating factor $\approx$ 1
- 즉, 잘 못맞추는 것에 대한 loss는 거의 변화가 없음
value
- class별로 중요도를 부여
value
- 값이 커질수록 낮은 loss를 적용받는 범위가 넓어짐( $\gamma$ =0일 경우, cross entropy와 동일)
- 즉, 쉽다고 판단되는 문제들이 많아지므로 어려운 문제들에 더욱 집중하게 됨
derivative
- $\frac{dFL}{dx}$ = $y(1-p_t)^{\gamma} (\gamma p_t log(p_t) + p_t -1)$

4.3 Class Imbalance and Model Initialization

class imbalance가 심할 경우, 학습 초기에 dominant class가 loss에 지배적인 영향을 미치게 되므로 학습이 불안정적
따라서, 이럴 경우에는 prior $\pi$ 를 도입하여 model initialization을 해주는 방식으로 안정적이게 학습이 되도록 해야함

4.4 Class Imbalance and Two-stage Detectors

two-stage detector들이 class imbalance를 다루는 방식
- cascade stage : SS 등의 방식으로 region proposal의 수를 줄여나가는 것
- biased mini-batch : mini-batch에서 object와 background를 1:3 비율로 sampling하는 것

5. RetinaNet Detector

5.1 Archtecture

fig3

RetinaNet
- single unified network
- backbone network와 task-specific sub-network로 구성
- backbone network는 전체 image에 대한 convolution을 통해 feature를 추출
- task-specific sub-network는 각각 object classification과 bounding box regression을 수행

5.1.1 Feature Pyramid Network Backbone

Backbone Network
- ResNet + FPN
- ResNet 끝 부분에 FPN을 붙여서 구성
FPN(Feature Pyramid Network)
- 각각의 resolution level마다 object detection을 수행하여 각각의 다른 scale에 대하여 학습가능
- FCN(Fully Convolutional Network) 기반

5.1.2 Anchors

Anchor box
- 각각의 resolution level마다 anchor box 9개 사용
- 각각의 anchor box마다 object class, bounding box coordinate 예측
IoU threshold
- 0.5 이상의 anchor box만 object bounding box로 할당
- 0.5~0.4 사이의 anchor box는 사용하지 않음
- 0.4 이하의 anchor box는 background로 할당

5.1.3 Classification Subnet

classification subnet
- multi-class classification으로, 각 class일 확률을 독립적으로 예측
- multi-class classification이기 때문에 각 class의 최종적인 확률을 구하기 위하여 sigmoid 사용
- FCN으로 구성
- 3x3 filter와 relu 사용
각각의 FPN level의 feature에 대해서 적용하지만, parameter는 모든 level의 subnet이 공유
focal loss 사용

5.1.4 Box Regression Subnet

bounding box subnet
- anchor box의 offset정보 4개를 예측
- 3x3 filter와 relu 사용
- FCN으로 구성
각각의 FPN level의 feature에 대해서 적용하지만, parameter는 모든 level의 subnet이 공유
loss 사용

5.2 Inference and Training

5.2.1 Inference

NMS를 통해 중복 제거

5.2.2 Focal Loss

classification subnet의 결과에 focal loss 적용
sampling을 사용하지 않고, 모든 mini-batch의 anchor box에 대하여 loss 계산
ground-truth bounding box에 배정된 anchor box의 수만큼 나눠주는 방식으로 normalization

5.2.3 Initialization

ImageNet으로 pre-training시킨 backbone network 사용
sub-network는 $\sigma$ = 0.01, $b$ =0 gaussian 분포사용
classification subnet의 마지막 layer만 $b$ = $-log((1-\pi)/\pi)$ 사용

5.2.4 Optimization

SGD
loss = focal loss + smooth L1 loss
decaying learning rate
weight decaying and momentum

6. Experiments

COCO dataset 사용

6.1 Training Dense Detection

table1

6.1.1 Network Initialization

일반적인 cross entropy를 사용했을 경우, prior $\pi$ 을 도입했을 경우에 안정적으로 학습이 되었음
다른 경우에도 prior이 도입되었을 경우가 더 안정적으로 학습이 됨

6.1.2 Balanced Cross Entropy

$\alpha$ 만 사용했을 경우에는 $\alpha$ = 0.75가 가장 좋은 성능

6.1.3 Focal Loss

에 따라서 성능이 좋은 가 다름
- $\gamma$ 이 커질수록 최적의 $\alpha$ 값이 작아지는 경향이 있음
- 이미 $\gamma$ 로 인해 easy negative들이 down-weighted 되었으므로, $\alpha$ 로 positive의 중요성을 조금 낮춰주는 것
$\gamma$ =2, $\alpha$ =0.25 가 가장 좋은 성능

6.1.4 Analysis of the Focal Loss

fig4

positive의 경우에는 $\gamma$ 값에 따라 차이가 별로 없지만, negative는 큰 차이를 보여줌
focal loss가 효과적으로 easy negative를 학습에서 제외시키는 것을 알 수 있음

6.1.5 Online Hard Example Mining (OHEM)

mini-batch에서 positive와 negative를 일정 비율로 sampling하여 학습하는 방식
- imbalance를 해결할 수 있는 방법 중 하나
- sampling을 통해 easy negative를 걸러낼 수 있음
focal loss가 OHEM보다 더 성능이 좋음
- easy negative를 더 잘 구별해냄

6.1.6 Hinge Loss

일정 값 이상의 $p_t$ 에 대해서는 loss를 0으로 하는 방식
성능이 안좋았음

6.2 Model Architecture Design

6.2.1 Anchor Density

다양한 scale, ratio의 anchor box가 있는 것이 더 성능이 좋다.
하지만, 9개 이상의 anchor box를 사용한다고 해서 성능이 더 좋아지지는 않았다.
- two-stage detector는 많은 box를 만들어내는데, 이것은 어쩌면 큰 이점을 주지 못할 수도 있다.

6.2.2 Speed versus Accuracy

backbone network의 크기가 커지면 성능은 좋아지지만, 속도는 느려짐
image의 크기, 즉, 해상도가 높을수록 성능은 좋지만, 속도는 느려짐

6.3 Comparison to State of the Art

table2

7. Conclusion

one-stage object detector에서 class imbalance를 해결하는 것은 아주 중요하며, focal loss는 이 부분에서 성능이 좋다.

8. Appendix

Focal Loss
- $FL\ast$ = $-log(p_t^{\ast})/\gamma$
- $p_t^{\ast}$ = $\sigma(\gamma x_t + \beta)$
- $x_t$ = $yx$
- $y$ = $\pm 1$ = ground truth class
- $x$ = quantity
- 다른 형태의 focal loss이며, focal loss의 형태는 크게 중요하지 않고, imbalance를 해결하는 역할이 중요
- $\frac{dFL\ast}{dx}$ = $y(p_t^{\ast}-1)$

DevelopHyun

RetinaNet[1] Focal Loss for Dense Object Detection(2017) - Review

1. Abstract

2. Introduction

4. Focal Loss

4.1 Balanced Cross Entropy

4.2 Focal Loss Definition

4.2.1 Focal Loss

4.3 Class Imbalance and Model Initialization

4.4 Class Imbalance and Two-stage Detectors

5. RetinaNet Detector

5.1 Archtecture

5.1.1 Feature Pyramid Network Backbone

5.1.2 Anchors

5.1.3 Classification Subnet

5.1.4 Box Regression Subnet

5.2 Inference and Training

5.2.1 Inference

5.2.2 Focal Loss

5.2.3 Initialization

5.2.4 Optimization

6. Experiments

6.1 Training Dense Detection

6.1.1 Network Initialization

6.1.2 Balanced Cross Entropy

6.1.3 Focal Loss

6.1.4 Analysis of the Focal Loss

6.1.5 Online Hard Example Mining (OHEM)

6.1.6 Hinge Loss

6.2 Model Architecture Design

6.2.1 Anchor Density

6.2.2 Speed versus Accuracy

6.3 Comparison to State of the Art

7. Conclusion

8. Appendix

9. Reference

Related Posts

DevelopHyun

1. Abstract

2. Introduction

3. Related Work

4. Focal Loss

4.1 Balanced Cross Entropy

4.2 Focal Loss Definition

4.2.1 Focal Loss

4.3 Class Imbalance and Model Initialization

4.4 Class Imbalance and Two-stage Detectors

5. RetinaNet Detector

5.1 Archtecture

5.1.1 Feature Pyramid Network Backbone

5.1.2 Anchors

5.1.3 Classification Subnet

5.1.4 Box Regression Subnet

5.2 Inference and Training

5.2.1 Inference

5.2.2 Focal Loss

5.2.3 Initialization

5.2.4 Optimization

6. Experiments

6.1 Training Dense Detection

6.1.1 Network Initialization

6.1.2 Balanced Cross Entropy

6.1.3 Focal Loss

6.1.4 Analysis of the Focal Loss

6.1.5 Online Hard Example Mining (OHEM)

6.1.6 Hinge Loss

6.2 Model Architecture Design

6.2.1 Anchor Density

6.2.2 Speed versus Accuracy

6.3 Comparison to State of the Art

7. Conclusion

8. Appendix

9. Reference

Related Posts