1. Abstract

YOLO는 image detection을 bounding box problem과 class probability 문제로 생각
- 과거의 모델은 대부분 classification에서 detection으로 목적을 변경한 모델
YOLO는 하나의 Neural Network로 bounding box와 class probability를 모두 예측하는 모델
- 다른 모델에 비하여 빠르기 때문에 실시간 처리가 가능
- 성능이 조금 뒤쳐지는 단점이 존재하기는 하지만, 배경에 대한 FP(false positive)는 낮음
- 즉, 객체에 대한 general representation을 학습

2. Introduction

fig1

사람은 image를 봤을 때, 빠르고 정확하게 어떤 Object들이 어디에 위치해 있고 어떤 상황인지 한 눈에 파악 가능
R-CNN
- potential bounding box를 생성하고 각각의 bounding box마다 classifier를 수행한 후에 다시 bounding box를 정리하는 방식
- pipeline이 복잡하고 각각의 component가 각각 학습되기 때문에 속도가 느림
YOLO
- unified model
- object detection을 single regression problem으로 생각
- bounding box와 class probability를 동시에 추론
- single convolutional network가 동시에 여러 개의 bounding box와 type을 예측
- full image를 학습
YOLO의 장점
- detection을 single regression problem으로 치환했기 때문에 pipeline이 간단하여 빠름
- full image를 학습하기 때문에 appearance뿐만 아니라 class의 context 정보도 학습가능하여 background에 관한 error가 낮음
- object의 generalizable representation을 더 잘 학습하기 때문에 train data와 test data의 domain이 달라져도 좋은 성능을 보여준다.
YOLO의 단점
- 성능이 다른 모델에 비해 조금 떨어진다.

3. Unified Detection

yolo

object detection의 bounding box와 class probability를 single neural network로 통합
전체 image에서 동시에 bounding box와 object class를 예측
즉, 물체를 각각 찾는 것이 아니라, 전체 image를 고려하여 찾는 것

먼저 image를 $S \times S$ 개의 grid cell로 나눔
각각의 grid cell은 개의 bounding box들을 만들고, 각각의 box에 대하여 confidence score를 계산
- confidence score = $P(Object) \times IoU$
- confidence score는 ‘box 안에 object가 얼마나 있는가?’와 ‘object인지 아닌지를 얼마나 잘 예측했는가?’를 반영
- 따라서, 각 bounding box는 (x,y,w,h)와 confidence score라는 5개의 value로 구성
- (x,y) = bounding box의 중심좌표
각각의 grid cell은 개의 object class에 대하여 조건부 확률을 예측
- 조건부 확률 = $P(class_i \mid Object)$
- box의 개수에 상관없이 각 grid cell에서는 grid cell에 대한 1개의 probability set $(P(class_1 \mid Object), \cdots , P(class_C \mid Object))$ 만 예측
실제 bounding box를 예측할 때는 confidence score와 conditional class probability를 곱해서class-specific confidence score를 계산
- class-specific confidence score = $P(class_i \mid Object) \times P(Object) \times IoU$ = $P(class_i) \times IoU$
- 즉, 각 box마다 ‘각 class에 대한 확률’과 ‘얼마나 box가 잘 만들었는지’를 나타냄
따라서, output은 $S \times S \times (B \times 5 + c)$ 개의 tensor가 됨

3.1 Network Design

fig3

CNN으로 구성
- 24개의 conv layer와 2개의 fc layer
- inception module에 영감을 받아 $1 \times 1$ convolution 사용

3.2 Training

3.2.1 model detail

ImageNet1000 dataset으로 pretraining
pre-trained된 CNN에 4개의 conv layer와 2개의 fc layer를 더 붙여서 fine-tuning
leaky relu 사용

3.2.2 implementation detail

x,y,w,h 값 모두 0~1 사이의 값이 되도록 조정
대부분의 grid cell은 object를 포함하지 않고 있기 때문에 confidence score가 0에 가까워지게 되는데, 만약 object를 포함하지 않았다고 예측한 box가 object를 포함한다면 box의 gradient 값이 튀어서 학습이 불안정해지는 문제가 발생
- parameter $\lambda_{coord}$ , $\lambda_{no\; obj}$ 를 도입하여 bounding box coordinate prediction에 대한 error는 키워주고, bounding box confidence prediction에 대한 error는 줄여주는 방식으로 해결
box의 크기에 따라서 변화하는 정도를 다르게 취급해야하기 때문에 길이의 sqrt 값을 사용
- 큰 box의 변호와 작은 box의 변화가 IoU에 미치는 영향이 다르기 때문

3.2.3 loss detail

multi-part loss
- object를 포함한 bounding box의 위치에 대한 error
- object를 포함한 bounding box의 크기에 대한 error
- object를 포함한 bounding box의 confidence score에 대한 error
- object를 포함하지 않은 bounding box의 confidence score에 대한 error
- object를 포함한 grid cell의 conditional class probability에 대한 error
loss function
- $\theta_{i,j}^{obj}$ = $i$ 번째 grid cell의 $j$ 번째 box에 object가 존재하면 1, 아니면 0
- $\theta_{i,j}^{no\; obj}$ = $i$ 번째 grid cell의 $j$ 번째 box에 object가 존재하면 0, 아니면 0
- $\theta_{i}^{no\; obj}$ = $i$ 번째 grid cell에 ground truth box의 중심점이 존재하면 1, 아니면 0
- $C_{i}$ = confidence score = object가 존재하면 1, 아니면 0
- $C_{i}$ = predicted confidence score

3.3 Inference

YOLO는 단 하나의 network만 사용하므로 속도가 아주 빠름
object당 하나의 bounding box를 가지기 위하여 NMS 적용

3.4 Limitations of YOLO

각각의 grid cell이 하나의 class만을 예측하므로, 작은 object 모여있으면 성능이 떨어진다.
일반적이지 않은(데이터에 없는) ratio를 가지고 있는 object는 잘 예측하지 못한다.
localization error가 크다.

4. Comparison to Other Detection Systems

Deformable parts models(DPM)
R-CNN
Other Fast Detectors
Deep MultiBox
OverFeat
MultiGrasp

5. Experiments

5.1 Comparison to Other Real-Time Systems

table1

5.2 VOC 2007 Error Analysis

fig4

5.3 Combining Fast R-CNN and YOLO

table2

5.4 VOC 2012 Results

table3

5.5 Generalizability: Person Detection in Artwork

fig5

6. Real-Time Detection In The Wild

YOLO는 매우 빨라서 실시간처리에 적합하다.

7. Conclusion

YOLO는 box coordinate와 class를 동시에 예측하는 unified model이다.

DevelopHyun

YOLO[1] You Only Look Once: Unified, Real Time Object Detection(2015) - Review

1. Abstract

2. Introduction

3. Unified Detection

3.1 Network Design

3.2 Training

3.2.1 model detail

3.2.2 implementation detail

3.2.3 loss detail

3.3 Inference

3.4 Limitations of YOLO

4. Comparison to Other Detection Systems

5. Experiments

5.1 Comparison to Other Real-Time Systems

5.2 VOC 2007 Error Analysis

5.3 Combining Fast R-CNN and YOLO

5.4 VOC 2012 Results

5.5 Generalizability: Person Detection in Artwork

6. Real-Time Detection In The Wild

7. Conclusion

8. Reference

Related Posts