DevelopHyun

Data Science & Algorith with Computer Science

Fast R-CNN[1] Fast R CNN(2015) - Review

22 Sep 2018 » deeplearning, cnn, image-detection, r-cnn, paperreview

1. Abstract

R-CNN을 개선한 논문으로 속도와 정확도 모두를 높이려는 시도

2. Introduction

image detection은 image의 위치를 찾고 bounding box까지 찾아내야하므로 단순한 image classification보다 어렵다.
R-CNN은 image detection 분야에 CNN을 도입한 모델이다.

2.1 R-CNN and SPPnet

rcnn-sppnet

2.1.1 R-CNN

R-CNN의 학습과정이 multi-stage pipeline이므로 복잡하다.
SVM과 Bounding-Box Regeressor를 학습시키기 위해서는 CNN의 결과가 모두 disk에 저장되어야 하므로 시간 및 용량이 많이 요구된다.
R-CNN은 모든 proposal에 대하여 각각 CNN연산 및 SVM, Selective Search를 해줘야하므로 Object detection 속도가 느리다.

2.1.2 SPPnet

반면에 SPPnet은 CNN computation을 공유하는 방식으로 R-CNN보다 속도가 빠르다.
R-CNN의 학습과정이 multi-stage pipeline이므로 복잡하다.
feature들이 모두 disk에 저장되어야 하므로 시간 및 용량이 많이 요구된다.
convolution layer가 update되지 않기 때문에 성능의 한계가 존재한다.

2.2 Contributions

fast-rcnn

성능향상
single-stage pipeline with multi-task loss
updating convolution layer by fine tuning
disk를 사용하지 않음

3. Fast R-CNN architecture and training

fast r-cnn

전체 image에 CNN과 Pooling을 적용해 feature를 추출
- 모든 region이 공통적으로 사용
conv feature 위에 projection된 각각의 proposal에 RoI(region of interest) pooling을 적용해 고정된 feature를 추출
fully connected layer 통과
- 각각의 RoI에 대해 적용되며, parameter만 공유
softmax를 통한 label 예측과 bounding box regression을 동시에 진행
- fc layer를 통과한 각각의 RoI에 대하여 적용

3.1 The RoI pooling layer

roi

proposal들을 처리하기 위해서는 일정한 크기로 만들어줘야하는데, R-CNN의 wraping은 정보의 손실이 발생하는 문제가 존재
BoW(Bag of Words)를 적용하게 된다면 다양한 크기로부터 일정한 크기의 feature를 뽑아낼 수 있지만, 위치정보를 모두 잃는다는 단점 존재
따라서, proposal 영역을 여러 구간으로 나눈 후에 각 구간마다 BoW를 적용시켜준다면 지역정보를 유지하는 것이 가능함
- SPPnet의 Spacial Pyramid Pooling의 special case로, SPPnet에서는 각 구간을 다시 여러 구간으로 나누는 방식으로 여러 개의 pooling을 적용한 후 concatenate하여 사용

3.2 Initializing from pre-trained networks

pre-trained된 CNN을 사용

3.3 Fine-tuning for detection

SPPnet에서 학습이 잘 안되는 가장 큰 이유는 각 mini-batch에서 RoI 들이 각각 다른 image에서 올 때 발생
따라서, Fast R-CNN은 hierarchical sampling 방식 사용
- mini-batch의 서로 다른 image $N$ 개 중에서 RoI를 무작위로 sampling하는 것이 아니라, 각각의 image에서 $R/N$ 개를 sampling 하여 학습에 사용
- mini-batch끼리 학습을 공유할 수 있기 때문에 더 빠르게 학습 가능
- 논문에서는 $N$ =2, $R$ =128이 가장 성능이 좋았다고 함

3.3.1 Multi-task loss

loss

=
- $L_{class} (p,u)$ = $-log p_u$
- $L_{loc}(t^u, v)$ = $\sum_{i \in (x,y,w,h)} smooth_{L_1} (t_i^u -v_i)$
- $p$ = $(p_0, p_1, \cdots , p_K)$ = probability for each class $K$ including background
- $p$ = ground truth class
- $t^u$ = $(t_x^u, t_y^u, t_w^u, t_h^u)$ = predicted bounding box
- $p$ = ground truth bounding box

3.3.2 Mini-batch sampling

mini-batch

학습을 시킬 때는 ground truth bounding box와 IoU가 0.5 이상인 RoI들만 사용
IoU가 0.1~0.5 사이인 RoI는 background image로 labeling하여 사용

3.3.3 Back-propagation through RoI pooling layer

=
- $y_{rj}$ = $x_{i \ast (r,j)}$ = layer’s $j$ -th output from the $r$ -th
- $x_i$ = $i$ -th activation input into RoI pooling layer
- $i \ast (r,j)$ = $argmax_{\acute{i}} x_{\acute{i}}$
즉, RoI $r$ 의 pooling layer output $y_{rj}$ 에 대하여 $i$ 가 max pooling에 의해 argmax selected 될 때 $\frac{\partial L}{\partial y_{rj}}$ 가 back-propagated됨

3.3.4 SGD hyper-parameters

decaying learning rate
momentum = 0.9

3.4 Scale invariance

scale invariant object detection하는 방식으로 2개의 방식을 실험
- brute force : proposal image에 대한 scale을 고정해두고, 해당 데이터에 대해서만 학습
- image pyramids : proposal image를 정규화한 후, 임의로 여러 scale을 추출하여 학습

4. Fast R-CNN detection

RoI pooling을 해준 이후, 각각의 proposal $r$ 마다 확률 $P(class=k \mid r)$ = $p_k$ 을 계산
계산된 $p_k$ 를 기반으로 NMS를 수행

4.1 Truncated SVD for faster detection

truncated SVD를 통한 행렬연산을 수행하여 fc layer의 연산속도를 높임
$W \approx U \sum_t V^T \in R^{u \times v}$ 로 factorizing
- $u \times v$ 개의 parameter를 $t \times (u+v)$ 개로 줄일 수 있음
- fully connected layer $W$ 를 fully connected layer $U$ , $\sum_t V^T$ 로 분해하여 사용
- fc layer $\sum_t V^T$ 를 적용한 후, fc layer $U$ 를 적용하는 방식

5. Main results

5.1 Experimental setup

CNN Architecture로 3가지 모델 실험
- AlexNet
- VGG_CNN_M_1024
- VGG16

5.2 VOC 2010 and 2012 results

result1

5.3 VOC 2007 results

result2

5.4 Training and testing time

table4

5.4.1 Truncated SVD

fig2

5.5 Which layers to fine-tune?

table5

SPPnet에서는 conv layer는 고정시키고 fc layer에 대해서만 fine tuning 했지만 좋은 성능을 보여줌.
R-CNN과 같이 깊은 CNN Architecture에서는 conv layer도 많이 tuning되므로 fc layer만 fine tuning하는 것은 성능에 좋지 않음
하지만 모든 conv layer가 tuning이 필요한 것은 아님을 실험을 통해 알 수 있었음

6. Design evaluation

6.1 Does multi-task training help?

table6

multi-task training한 것이 더 좋은 성능을 보여줌
- 각 task의 학습이 공유된 representation에 대해서 학습도 공유하는 효과

6.2 Scale invariance: to brute force or finesse?

scale

single scale을 하는 것보다 multi scale하는 것이 더 좋은 성능을 보여줬지만, 속도와 성능 trade-off에서 single scale이 더 좋다고 판단하여 모델에서는 single scale 사용

6.3 Do we need more training data?

dataset의 크기를 늘릴수록 성능이 좋아짐

6.4 Do SVMs outperform softmax?

table8

softmax가 SVM보다 다소 좋은 성능을 보임

6.5 Are more proposals always better?

fig3

proposal이 많아질수록 성능이 떨어지는 경향을 모여줌

6.6 Preliminary MS COCO results

MS COCO dataset에 대해서도 Fast R-CNN 모델을 실험해봄

7. Conclusion

Fast R-CNN은 기존의 R-CNN과 SPPnet보다 속도가 훨씬 빠르고, 성능도 좋다.

8. Reference

Related Posts

Video Style Transfer[1] Artistic style transfer for videos(2016) - Review (Categories: deeplearning, video, style-transfer, paperreview)
FPN[1] Feature Pyramid Networks for Object Detection(2016) - Review (Categories: deeplearning, cnn, image-detection, fpn, paperreview)
R-FCN[2] R-FCN++: Towards Accurate Region Based Fully Convolutional Networks for Object Detection(2018) - Review (Categories: deeplearning, cnn, image-detection, r-fcn, paperreview)
R-FCN[1] R-FCN: Object Detection via Region based Fully Convolutional Networks(2016) - Review (Categories: deeplearning, cnn, image-detection, r-fcn, paperreview)
Light-Head R-CNN[1] Light-Head R-CNN: In Defense of Two Stage Object Detector(2017) - Review (Categories: deeplearning, cnn, image-detection, r-cnn, paperreview)
DSSD[1] DSSD: Deconvolutional Single Shot Detector(2017) - Review (Categories: deeplearning, cnn, image-detection, dssd, paperreview)

« R-CNN[1] Rich feature hierarchies for accurate object detection and semantic segmentation(2014) - Review Faster R-CNN[1] Faster R CNN: Towards Real Time Object Detection with Region Proposal Networks(2015) - Review »