1. Abstract

기존 RNN Architecture는 previous time step에 대한 의존성 때문에 병렬처리가 불가능하기 때문에 속도가 느렸다. SRU는 연산을 recurrence에 독립적이게 만들어 병렬처리를 통한 학습시간 단축을 가능하게 한다. 연산 속도는 CNN과 비슷하며, LSTM보다는 5~10배정도 빠르다.

2. Introduction

intro

RNN은 sequential symbol들을 처리할 때 previous time step에 의존한다. 따라서 병렬처리가 힘들고 다른 NN에 비하여 속도가 많이 느리다. SRU는 연산량이 큰 작업을 previous time step의 영향을 받지않도록 재구성하여 병렬처리를 가능하게 했다. 또한, 많은 layer를 쌓는 것에 유리하며, layer가 많아질수록 성능이 좋아진다.

3. Method

3.1 SRU(Simple Recurrent Unit)

sru

matrix multiplication은 무거운 연산, point-wise multiplication은 상대적으로 가벼운 연산
- matrix multiplication은 previous time step에 독립적으로 연산해 병렬처리가 가능하도록 재구성
- point-wise multiplication만 previous time step에 의존적이게 하여 sequential data를 처리
Architecture
- $\widetilde{x_{t}}$ = $Wx_{t}$
- $f_{t}$ = $\sigma (W_{f}x + b_{f})$
- $r_{t}$ = $\sigma (W_{r}x + b_{r})$
- $c_{t}$ = $f_{t} \odot c_{t-1} + (1-f_{t}) \odot \widetilde{x_{t}}$
- $h_{t}$ = $r_{t} \odot g(c_{t}) + (1-r_{t}) \odot x_{t}$
- $x_{t}$ = input
- $f_{t}, r_{t}$ = forget gate, reset gate
- $c_{t}, h_{t}$ = internal state, output state

3.2 Relation to Common Architecture

LSTM에서는 gate 를 계산할 때 previous hidden state을 사용
- gate = $\sigma (W_{1}x + W_{2}h_{t-1} + b)$ 형태
- previous time step의 연산이 완전히 끝날 때까지 기다려야하므로 시간이 오래 걸림
SRU에서는 gate 연산에서 previous hidden state을 사용하지 않고, state 연산에서만 사용
- gate = $\sigma (Wx + b)$ 형태
- gate 연산이 previous time step에 독립적으로 처리되므로 병렬처리가 가능하고, 시간이 단축됨
- state 연산은 previous time step에 의존적이지만, 상대적으로 가벼운 연산이므로 오래 걸리지 않음

3.3 CUDA-level Optimization

matrix 연산
- $U^{T}$ = $\begin{bmatrix}W\\ W_{f}\\ W_{r}\end{bmatrix} [x_{1}, \cdots , x_{n}]$
- 모든 matrix연산을 묶어서 진행
element-wise 연산
- 모든 연산을 하나의 kernel 함수로 처리

RCNN
QRNN
KNN

5. Experiments

5.1 Classification

result1

5.2 Question Answering

result2

5.3 Language Modeling

result3

5.4 Machine Translation

result4

층을 많이 쌓을 수 있으며, 많이 쌓을수록 성능이 좋다.

5.5 Speech Recognition

result5

6. Appendix

6.1 Comparison of Model Variants and QRNN(Quasi-RNN)

qrnn

QRNN보다 최소 성능이 비슷하거나 조금 더 좋다.

7. Conclusion

SRU는 LSTM보다 빠르고 성능이 좋으며, layer를 쌓기도 유리하다.

8. Reference

https://arxiv.org/abs/1709.02755
Ybigta deepNLP-study

DevelopHyun

SRU[1] Training RNNs as Fast as CNNs(2017) - Review

1. Abstract

2. Introduction

3. Method

3.1 SRU(Simple Recurrent Unit)

3.2 Relation to Common Architecture

3.3 CUDA-level Optimization

5. Experiments

5.1 Classification

5.2 Question Answering

5.3 Language Modeling

5.4 Machine Translation

5.5 Speech Recognition

6. Appendix

6.1 Comparison of Model Variants and QRNN(Quasi-RNN)

7. Conclusion

8. Reference

Related Posts

DevelopHyun

1. Abstract

2. Introduction

3. Method

3.1 SRU(Simple Recurrent Unit)

3.2 Relation to Common Architecture

3.3 CUDA-level Optimization

4. Related Work

5. Experiments

5.1 Classification

5.2 Question Answering

5.3 Language Modeling

5.4 Machine Translation

5.5 Speech Recognition

6. Appendix

6.1 Comparison of Model Variants and QRNN(Quasi-RNN)

7. Conclusion

8. Reference

Related Posts