1. Abstract

1-D CNN
- 병렬처리가 가능해짐
Encoder위에 Decoder가 올려져있는 형태
- Encoder의 각 time step의 hidden layer가 바로 Decoder에 반영됨
- 따라서, Encoder에서 fixed sized representation으로 만들지 않아도 됨
- 즉, input sequence의 길이에 따라서 representation의 길이가 달라지므로 resolution을 보존할 수 있음
- input/target 사이의 거리가 짧아지므로 정보의 손실도 적으며, propagation에 유리
Dynamic Unfolding을 사용하여 input과 target의 길이에 따라 representation size을 조절
Dilation을 사용하여 receptive field를 넓힘
linear time내에 처리되며, memorization이 필요 없음

2. Introduction

이전 NLP모델들은 input sequence가 길어지면 시간이 super linear하게 증가한다는 단점이 있다.
Mechanism
- Encoder위에 Decoder를 쌓는 방식
- Dynamic Unfolding
- 1-D convolution
- Dilation
- masked CNN
연산들이 병렬적으로 처리됨
모든 input token 과 output token의 distance가 fixed depth이므로 token들이 거리가 짧으며 독립적

3. Neural Translation Model

distribution $p(t \mid s)$ = $\prod_{i=0}^{N} p(t_{i} \mid t_{1}, \cdots , t_{i-1}, s)$ 을 예측
distribution은 $s$ 가 주어졌을 때, $t$ 로 번역할 확률을 의미
$p(t_{i} \mid t_{1}, \cdots , t_{i-1}, s)$ 는 long-term dependency를 포함

3.1 Desiderata(필요조건)

연산시간이 input/target sequence에 대하여 linear해야함
- 병렬처리로 시간을 단축할 수 있다.
source representation의 size가 source sequence에 linear해야함
- 즉, source sequence가 길어지면, source representation도 길어져 정보를 보존할 수 있어야함
- resolution preserving
input과 target의 distance가 짧아야함
- forward/backward signal이 전달되는 경로가 짧아야함
- propagation signal이 더 전달이 잘되므로 학습에 유리
- 정보손실이 적어지므로 long-term dependency 학습에 유리

4. ByteNet

4.1 Encoder-Decoder Stacking

model

Encoder 위에 바로 Decoder가 있는 형태
source sequence가 fixed size representation으로 표현되지 않음
- 각 input sequence를 구성하는 각 element의 encoding결과가 바로 Decoder로 반영되어 target을 출력
- 이전의 Encoder-Decoder에서는 input sequence를 모두 입력받은 후, fixed size vector로 표현되어 Decoder에 반영
Encoder
- 1-D convolutional layer, dilation
- ex) 위의 그림에서는 $1 \times 3$ filter 적용, 1-st layer이후부터는 dilation 적용
- ex) 1st-layer의 경우, $s_{1,i}$ = $s_{i-1} \ast w_{1,1} + s_{i} \ast w_{1,2} + s_{i+1} \ast w_{1,3}$ 의 형식
- ex) 2nd-layer의 경우, $s_{2,i}$ = $s_{1,i-2} \ast w_{2,1} + s_{1,i} \ast w_{2,2} + s_{1,i+2} \ast w_{2,3}$ 의 형식
- ex) 3rd-layer의 경우, $s_{3,i}$ = $s_{2,i-4} \ast w_{3,1} + s_{2,i} \ast w_{3,2} + s_{1,i+4} \ast w_{3,3}$ 의 형식
- ex) $s_{l,i}$ = l-layer에서의 i번째 token, $w_{l,j}$ = l-layer에서의 j번째 weight
Decoder
- masked 1-D convolution layer, dilation
- Encoder와는 다르게 convolution을 할때에는 뒤의 토큰을 반영하지 않음
- 즉, $i$ -th token을 convolution할 때는 $i$ -th token 이전의 element들로만 convolution

4.2 Dynamic Unfolding

dynamic

Desiderata
- 대부분의 경우 target sequence size $\mid t \mid$ 보다 길어야함
- computation이 너무 증가해서는 안됨
- resolution preserving
Dynamic Unfolding
- $\widehat{t}$ = $a \mid s \mid + b$
- source length $\mid s \mid$ 에 linear하게 representation의 길이 $\widehat{t}$ 를 설정
- input sequence의 길이가 달라짐에 따라, source representation의 길이도 조절
- 즉, 정보량에 비례하여 representation 길이를 설정하게 해주기 때문에, 다양한 길이에 대처가 가능
- Decoder에서는 EOS가 나올 때 까지 generating
- 만약 $t$ 가 $\widehat{t}$ 보다 길이가 길 경우, 그 부분과 상응하는 representation은 zero vector로 생성되고, 이전 target의 정보만 사용하여 generating
- 논문에서는 English보다 German이 평균적으로 1.2배 길기 때문에 $a$ =1.2, $b$ =0 사용

4.3 Input Embedding Tensor

Autoregressive하므로 이전 학습의 token을 다음 학습의 input으로 넣어줌
target sequence $t$ = $t_{0}, \cdots, t_{n}$ 개 중에서 $t_{0}, \cdots, t_{n-1}$ 을 look-up table을 통해서 embedding
$t_{1}, \cdots, t_{n}$ 이 prediction의 target
Encoder의 결과와 이전 학습의 target token의 embedding 결과를 concatenate하여 $n \times 2d$ tensor로 만든다.

4.4 Masked One-dimensional Convolutions

decoding과정에서 미래의 정보는 가리고 convolution을 하는 것
- 이미 generating된 것이 무엇인지 파악하고 다음 token을 만들어야하므로 Masked-CNN 사용

4.5 Dilation

dilation

바로 인접한 input에 filter를 적용하는 것이 아니라, 일정한 간격으로 건너뛰며 filter를 적용하는 것
receptive field가 증가, 즉, 데이터를 넓게 반영할 수 있음

4.6 Residual Blocks

residual

1x1 convolution 사용
batch normalization 사용
용도에 따라 구성을 변형
- Machine Translation에는 ReLU 사용
- Language Modeling에는 Multiplicative Unit 사용

5. Model Comparison

5.1 Recurrent ByteNets

recurrent

CNN 대신 RNN을 사용하여 Encoder-Decoder를 구성가능
- Decoder만 RNN으로 바꾸거나, Encoder와 Decoder모두 RNN구조로 바꿀 수 있음

5.2 Comparison of Properties

result1

$Net_{S}, Net_{T}$ = Encoder, Decoder
$RP$ = resolution preserving
$Path_{S}, Path_{T}$ = Encoder, Decoder에서 input token과 output token의 최단거리

6. Character Prediction

result2

7. Character-Level Machine Translation

result3

7.1 Sample

result5

8. Conclusion

byteNet은 linear한 시간내에 연산을 할 수 있으며, memorization이 필요없게 만들었으며, sequence내 token들의 signal propagation path를 짧게 만들었다.

DevelopHyun

byteNet[1] Neural Machine Translation in Linear Time(2017) - Review

1. Abstract

2. Introduction

3. Neural Translation Model

3.1 Desiderata(필요조건)

4. ByteNet

4.1 Encoder-Decoder Stacking

4.2 Dynamic Unfolding

4.3 Input Embedding Tensor

4.4 Masked One-dimensional Convolutions

4.5 Dilation

4.6 Residual Blocks

5. Model Comparison

5.1 Recurrent ByteNets

5.2 Comparison of Properties

6. Character Prediction

7. Character-Level Machine Translation

7.1 Sample

8. Conclusion

9. Reference

Related Posts