1. Abstract

per-region subnetwork는 많은 시간을 소모하므로, R-FCN에서는 모든 region이 연산을 공유
- 연산 공유를 통해 발생하는 object detection의 translation variance(object의 위치가 바뀌면 detection의 결과도 바뀌어야 함)를 해결하기 위하여 position-sensitive score map 적용

2. Introduction

fig1

image classification
- translation invariance
- object의 label만 예측하면 되므로 object의 위치가 바뀌어도 크게 영향을 받지 않기 때문에 FCN을 통하여 모든 연산을 공유해도 큰 영향을 받지 않음
image detection
- translation variance
- object의 label과 position을 예측해야하며, object의 위치가 바뀌면 결과도 바뀌어야 함
- 하지만, FCN은 translation invariance한 특성을 가지고 있기 때문에 detection에 적합하지 않음
- 이를 위하여 per-region computation을 하면, 많은 시간과 연산을 필요로 하기 때문에 비효율적
R-FCN(Region-based Fully Convolutional Network)
- shared fully convolutional architecture
- position-sensitive score map을 통해 FCN에 translation variance한 성질을 통합
- position-sensitive score map은 position 정보를 encoding
- position-sensitive RoI pooling을 통하여 position-sensitive score map의 position정보를 예측에 사용
- detection을 위한 추가적인 fc layer 혹은 conv layer는 없음

3. Our approach

fig2

3.1 Overview

two-stage detector
not per-region computation

3.1.1 R-FCN

feature extraction by convolution
region proposal using feature map of 1 by RPN
produce position-sensitive score map using feature map of 1
- 1x1 convolution
- 각각의 $(C+1)$ 개의 class 마다 $k^2$ 개의 score map 생성
- 즉, $k^2 (C+1)$ channel 생성
- $k$ = spatial grid
- $(C+1)$ = # of class + background
- 예를 들어, $k$ =3이라면 $k^2$ 개의 score map은 각각 top-left, top-center, top-right, mid-left 등 각 grid의 score map을 의미
position-sensitive RoI pooling
- 각각의 RoI마다 RoI pooling을 하여 $k \times k times (C+1)$ tensor 생성
- 각각의 $k \times k$ 개의 RoI grid는 각 위치에 해당하는 $k^2$ 개의 score map를 각각 하나씩 pooling
- 예를 들어, $k$ =3이라면, top-left의 score map은 top-left의 grid에서만 pooling이 되고, top-center의 score map은 top-centor의 grid에서만 pooling이 됨

3.2 Backbone architecture

ResNet-101 사용
- global average pooling과 fc layer는 제거

3.3 Position-sensitive score maps & RoI pooling

Position-sensitive score map을 통하여 $k^2$ 개의 score map 생성(각 score map은 $(C+1)$ 개의 class score를 포함)
Position-sensitive RoI pooling
- $r_c (i,j \mid \theta)$ = $\sum_{(x,y) \in bin(i,j)} z_{i,j,c} (x+x_0, y+y_0 \mid \theta)/n$
- $z_{i,j,c}$ = grid에 해당하는 score map
- $(i,j)$ = grid(bin)의 위치 $bin(i,j)$ = grid(bin)에 속해있는 pixel
- $n$ = grid(bin)에 속해있는 pixel의 수
- $(x_0, y_0)$ = top-left corner of RoI
- $\theta$ = feature map
- $c$ = $c$ -th category
- 즉, 각 grid마다 담당하는 특정 score map에서 각 class score를 average pooling
Voting
- $r_c(\theta)$ = $\sum_{i,j}r_c (i,j \mid \theta)$
- 각 RoI의 class를 결정하기 위하여 각 grid의 class score를 class별로 집계
- 최종적으로 $C+1$ 개의 tensor를 생성
- softamx와 cross-entropy로 loss 계산
Bounding Box regression
- position-sensitive score map의 sibling network를 만들어 최종적으로 $4$$개의 tensor를 생성하는 방식
- position-sensitive score map에서 $4k^2 (C+1)$ channel을 생성하는 방식
- 전자의 경우는 simple하지만, 성능은 후자가 더 좋음

3.4 Training

Loss for each RoI
- $L(s,t_{x,y,w,h})$ = $L_{cls} (s_{c^{\ast}}) + \lambda[c^{\ast} > 0 ] L_{reg}(t, t^{\ast})$
- $L_{cls} (s_{c^{\ast}})$ = $-log(s_{c^{\ast}})$
- $L_{reg}(t, t^{\ast})$ = $smooth_{L1}$
- $\lambda[c^{\ast} > 0 ]$ = positive/negative indicator = positive이면 1, 아니면 0
IoU threshold = 0.5
OHEM
- 모든 RoI에 대해서 loss를 계산한 후, loss가 큰 $B$ 개의 RoI에 대해서만 학습

3.5 Inference

NMS
- IoU threshold=0.3

3.6 À trous and stride

stride를 줄여 feature map의 해상도를 높임
à trous trick 사용

3.7 Visualization

fig4

각각의 grid마다 사용되는 score map이 다른데, object랑 겹치게 되면 대부분의 grid가 높은 score를 기록

R-CNN 계열
SPPnet

5. Experiments

5.1 Experiments on PASCAL VOC

VOC 2007과 VOC 2012을 합쳐서 사용

5.1.1 Comparisons with Other Fully Convolutional Strategies

table2

5.1.2 Comparisons with Faster R-CNN Using ResNet-101

table3-4-5

5.1.3 On the Impact of Depth

table

5.1.4 On the Impact of Region Proposals

table

5.2 Experiments on MS COCO

table6

6. Conclusion and Future Work

image detection에서 FCN에 translation variance한 특성을 더해주는 것이 속도와 성능면에서 좋다.

DevelopHyun

R-FCN[1] R-FCN: Object Detection via Region based Fully Convolutional Networks(2016) - Review

1. Abstract

2. Introduction

3. Our approach

3.1 Overview

3.1.1 R-FCN

3.2 Backbone architecture

3.3 Position-sensitive score maps & RoI pooling

3.4 Training

3.5 Inference

3.6 À trous and stride

3.7 Visualization

5. Experiments

5.1 Experiments on PASCAL VOC

5.1.1 Comparisons with Other Fully Convolutional Strategies

5.1.2 Comparisons with Faster R-CNN Using ResNet-101

5.1.3 On the Impact of Depth

5.1.4 On the Impact of Region Proposals

5.2 Experiments on MS COCO

6. Conclusion and Future Work

7. Reference

Related Posts

DevelopHyun

1. Abstract

2. Introduction

3. Our approach

3.1 Overview

3.1.1 R-FCN

3.2 Backbone architecture

3.3 Position-sensitive score maps & RoI pooling

3.4 Training

3.5 Inference

3.6 À trous and stride

3.7 Visualization

4. Related Work

5. Experiments

5.1 Experiments on PASCAL VOC

5.1.1 Comparisons with Other Fully Convolutional Strategies

5.1.2 Comparisons with Faster R-CNN Using ResNet-101

5.1.3 On the Impact of Depth

5.1.4 On the Impact of Region Proposals

5.2 Experiments on MS COCO

6. Conclusion and Future Work

7. Reference

Related Posts