1. Abstarct

pre-trained word vectors를 CNN에 사용하여 sentence classification에 적용하면 괜찮은 성능을 보여준다.

2. Introduction

pre-trained 된 word vector를 다양한 channel로 분석하는 architecture 제시
pre-trained vectors는 ‘universal’ feature extractors이므로 다양한 classification 작업에 적용 가능하다.

3. Model

model

input
- $k$ 차원의 vector $n$ 개로 이루어진 sentence = $n*k$ matrix로 표현가능
convolution
- window size를 바꿔가며 여러개의 filter를 생성
- 일반적인 CNN과 다른 것은 filter의 모양, m*m 크기의 filter를 사용하지 않음
- 단어가 k차원의 vector이므로 m*k크기의 filter를 사용, m-gram과 같은 효과
activation
- tanh, relu와 같은 non-linear 함수 사용
pooling
- max-over-time-pooling(max pooling)을 통하여 filter에서 가장 영향력이 큰 값을 추출
- 문장의 길이에 따라 pooling방식을 조절
output
- fully-connected layer로 구성하여 softmax 적용

3.1 Regulation

dropout
- output layer에 적용
- 베르누이분포로 random하게 선택된 unit만 back propagation으로 학습
- $y$ = $w( z \odot r ) + b$
- $r = [0,1,0,1, ...]$ 형태의 vector, 베르누이확률로 학습될 unit 선택
- $z$ = pooling으로 추출된 결과
l2-norm
- norm(w) > s인 경우, norm(w) = s가 되도록 보정

4. Datasets and Experimental Setup

word2vec을 사용하여 pre-trained word vector를 만듦

4.1 Model Variants

CNN-rand : 모든 단어를 random하게 initialize한다. 즉 pre-trained되지 않은 벡터 사용
CNN-static : 모든 단어를 학습한 후 벡터를 구성한다.(word2vec)
CNN-non-static : CNN-static에 사용된 vector가 fine-tunning된다.
CNN-multichannel
- CNN-static과 CNN-non-static을 동시에 적용한다.
- 각각의 vector는 channel이 되고, filter는 같이 사용된다.
- back propagation은 둘 중 하나에만 적용하여 하나는 static, 하나는 fine-tuning이 된다.

5. Results and Discussion

result1

5.1 Multi channel vs. Single Channel

Multi channel이 vector들이 weighted될 때, 원래 값으로부터 멀어지지 않게하여 overfitting을 방지할 것이라고 기대했지만, 실제 결과는 큰 차이가 없었다.

5.2 Static vs Non-static

word2vec은 문맥적인 것을 학습하기 때문에 ‘good’과 ‘bad’가 비슷한 공간에 위치하게 된다. 하지만 Non-static 방식을 쓰면 긍정/부정도 포함되어 vector가 학습되는 것을 알 수 있다. 즉, fine-tuning이 vector를 더 meaningful하게 만들어준다.

6. Conclusion

supervised pre-training of word vectors is an important ingredient in deep learning for NLP.

7. Reference

https://arxiv.org/abs/1408.5882
Ybigta deepNLP-study

DevelopHyun

wordCNN[1] Convolutional Neural Networks for Sentence Classification(2014) - Review