1. Abstract

R-CNN = Regions with CNN
image detection 분야에 CNN을 적용
CNN을 통해 object의 위치를 찾아내고 분리하기 위하여 bottom-up region proposal 방식 사용
데이터가 부족하면 pre-train한 후에 fine-tuning
당시 성능이 제일 좋던 OverFeat 모델보다 좋은 성능

2. Introduction

r-cnn

기존에 사용되던 SIFT와 HOG 방식은 blockwise orientation histogram으로, 인간 뇌의 시각경로 중 첫 번째 피질영역의 역할과 연관
하지만, 인간이 물체를 인식하는 것은 계층적 프로세스로 진행되기 때문에 추가적인 정보가 필요
image classification과 image detection의 큰 차이 중 하나는 Object의 위치를 찾아내는 것
- ‘recognition using regions’ paradigm으로 해결
- input image에 2000개 정도의 region을 추천하고, 각 추천된 region에 대하여 CNN filter로 feature를 추출하여 SVM으로 분류하는 방식
bounding-box regression 방식으로 mis-localization된 경계선을 학습

3. Object detection with R-CNN

r-cnn

R-CNN은 3개의 모듈로 구성
- category-independent region proposals을 만들어내는 모듈
- 제안된 각 region에서 feature를 추출하는 CNN 모듈
- 추출된 feature를 기반으로 label을 classify하는 linear SVM

3.1 Module design

3.1.1 Region Proposals

Selective Search
- image의 color와 texture뿐만 아니라, 계층적구조(겹쳐있는 상태 등)도 같이 활용
- 크기에 상관없이 대상을 찾아내야 함
- 빠른 속도
Selective Search의 과정
- sub-segmentation
- integration(통합)
- 통합된 영역을 바탕으로 candidate region생성

3.1.2 Feature Extraction

제안된 region을 227x227x3 size로 확대한 후 CNN을 적용
CNN의 결과에 FC를 적용하여 4096-dimension의 fixed feature vector를 추출

3.2 Test-time detection

Selective Search로 2000개의 region proposals 추출
각 region에 CNN과 FC를 적용하여 feature 추출
각 feature에 SVM을 적용하여 score 계산
각 class에 대해서 greedy non-maximum suppression(NMS)을 진행
- 겹쳐있는 bounding box를 하나로 통합하는 과정
- 현재 픽셀의 score를 중심으로 주변 픽셀의 score를 비교하여 현재 픽셀 값이 최대값일 경우는 그대로 두고, 아닐 경우에는 값을 0으로 바꾸어 값을 제거하는 것
- 따라서, 비슷한 곳에 위치한 여러 개의 bounding-box의 선들이 션명한 하나의 선으로 합쳐지게 됨
- IoU의 값이 일정 threshold 이상인 것들에 대해서 수행하기 때문에 동일한 Object를 detection한 bounding box만 합쳐짐
- IoU = $\frac{AreaOfOverlap}{AreaOfUnion}$ = bounding box가 겹치는 비율

3.2.1 Run-time Analysis

CNN filter의 params은 모든 category에 동일하게 공유되므로 효율적
CNN의 결과로 나온 4096개의 feature는 다른 모델의 feature에 비해서 적으므로 효율적
class-specific computation은 SVM, NMS 밖에 없으므로 효율적
따라서, 한 image에 100,000개의 Object가 존재하더라도 연산시간이 10초밖에 걸리지 않음

3.3 Training

3.3.1 Supervised Pre-training

ILSVRC2012 dataset으로 pre-training

3.3.2 Domain-Spcific fine-tuning

CNN에 warped proposal windows를 학습시키기 위한 작업

3.3.3 Object Category Classifiers

IoU값이 threshold 이상인 것들은 NMS 적용
각 class에 하나의 linear SVM 학습

3.4 Results on PASCAL VOC 2010-12

result1

3.5 Results on ILSVRC2013 detection

result2

4. Visualization, ablation, and modes of error

4.1 Visualizing learned features

Network가 image의 어느 부분을 학습했는지 알기위한 시도
각각의 image마다 특정 layer의 결과로 나오는 proposal들에 NMS를 적용한 후, 모든 image의 proposal들을 정렬하여 높은 socre의 region만 확인하는 방법 사용

4.2 Ablation studies

table2

4.2.1 Performance layer-by-layer, without fine-tuning

모델의 마지막 부분인 poo5, fc6, fc7 layer들을 사용한 분석을 진행
fc6, fc7 layer를 제거하고 pool5 layer만 사용해도 좋은 결과를 보임
- CNN이 중요한 역할을 하고있음을 알 수 있음

4.2.2 Performance layer-by-layer, with fine-tuning

fine-tuning을 하는 것이 훨씬 좋은 성능을 보임

4.2.3 Comparison to recent feature learning methods

최근의 DPM 모델보다 훨씬 좋은 성능을 보여줌

4.3 Network architectures

table3

CNN의 Architecture가 R-CNN의 성능에 큰 영향

4.4 Detection error analysis

4.5 Bounding-box regression

bounding-box regression을 사용하여 localization error를 줄이는 것이 없는 것보다 더 좋은 성능을 보여줌
=
- $P$ = $(P_x, P_y, P_w, P_h)$ = proposal boundong box
- $G$ = $(G_x, G_y, G_w, G_h)$ = ground-truth boundong box
- $\phi_{5}(P)$ = proposal P의 pool5 layer output
- $d_{\ast}(P)$ = $w_{\ast}^T \phi_{5}(P)$
- $t_x$ = $(G_x - P_x) / P_w$
- $t_y$ = $(G_y - P_y) / P_h$
- $t_w$ = $log(G_w / P_w)$
- $t_h$ = $log(G_h / P_h)$
- $\ast$ = { $x$ , $y$ , $w$ , $h$ }
regression을 통하여 얻은 을 사용하여 bounding box = 를 G와 비슷하게 만드는 것이 목표
- $\hat{G_x}$ = $P_w d_x (P) + P_x$
- $\hat{G_y}$ = $P_h d_y (P) + P_y$
- $\hat{G_w}$ = $P_w exp(d_w (P))$
- $\hat{G_h}$ = $P_h exp(d_h (P))$

4.6 Qualitative results

result

5. The ILSVRC2013 detection dataset

5.1 Dataset overview

ILSVRC2013 dataset 사용

5.2 Region proposals

위의 PASCAL dataset과 같은 Selective Search 사용

5.3 Training data

dataset을 사용하여 ground-truth bounding box dataset을 생성

5.4 Validation and evaluation

다양한 조건들을 바꿔가며 실험

5.5 Ablation study

table4

5.6 Relationship to OverFeat

OverFeat는 R-CNN의 special case
- R-CNN의 selective search와 per-class bounding을 바꾸면 OverFeat와 비슷한 모델이 됨
- 학습속도는 OverFeat이 더 빠름

6. Semantic segmentation

6.1 CNN features for segmentation

full R-CNN
- 기존의 detection과 같이 region의 크기를 무시하고 wrapped된 image에 대하여 CNN 적용
fg R-CNN
- foreground mask에 대해서만 CNN을 적용
full+fg R-CNN
- full model과 fg model을 concatenate

DevelopHyun

R-CNN[1] Rich feature hierarchies for accurate object detection and semantic segmentation(2014) - Review

1. Abstract

2. Introduction

3. Object detection with R-CNN

3.1 Module design

3.1.1 Region Proposals

3.1.2 Feature Extraction

3.2 Test-time detection

3.2.1 Run-time Analysis

3.3 Training

3.3.1 Supervised Pre-training

3.3.2 Domain-Spcific fine-tuning

3.3.3 Object Category Classifiers

3.4 Results on PASCAL VOC 2010-12

3.5 Results on ILSVRC2013 detection

4. Visualization, ablation, and modes of error

4.1 Visualizing learned features

4.2 Ablation studies

4.2.1 Performance layer-by-layer, without fine-tuning

4.2.2 Performance layer-by-layer, with fine-tuning

4.2.3 Comparison to recent feature learning methods

4.3 Network architectures

4.4 Detection error analysis

4.5 Bounding-box regression

4.6 Qualitative results

5. The ILSVRC2013 detection dataset

5.1 Dataset overview

5.2 Region proposals

5.3 Training data

5.4 Validation and evaluation

5.5 Ablation study

5.6 Relationship to OverFeat

6. Semantic segmentation

6.1 CNN features for segmentation

6.2 Results on VOC 2011

7. Conclusion

8. Reference

Related Posts