DBLSTM[1] Hybrid Speech Recognition With Deep Bidirectional LSTM(2013) - Review

1. Abstract

“This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system”

2. Introduction

여러 모델을 활용한 hybrid 모델이 acoustic에서 더 좋은 성능을 보여준다.
DBLSTM-HMM의 성능을 검증하는 것이 논문의 목적

3. Network Architecture

DBLSTM

LSTM, BRNN, DNN의 구조를 합쳐 DBLSTM 형성
모델을 어떻게 구성하냐에 따라서 한 cell이 output을 다시 어느 cell에 넣어줄지가 달라짐
- 위의 모델은 forward LSTM의 결과는 forward LSTM에, backward LSTM의 결과는 backward LSTM에 넣어줌
- 실제로 논문에서 사용한 모델에서는 첫 번째 forward, backward LSTM의 결과를 두번째 BLSTM의 forward, backward LSTM에 모두 넣어줌

4. Network Training

hybrid network를 학습시키는 방식으로 진행
hybrid와 차별점
- softmax의 cross-enrotpy error를 줄이지 않고, negative log-probability 를 줄이는 방향으로 학습
- acoustic context window를 사용할 필요가 없음
Gaussian noise를 network weight에 추가하는 모델도 구성해봄
- noise was added once per training sequence, rather than at every time step
- neural network를 simplify하는 역할(전달하는 정보를 조정해 generalizaion)

5. TIMIT Experiment

result1

PER = phoneme error rate
FER = frame error rate
CE = cross-entropy error
이전 모델과 비교하여 parameter가 비교적 안정적
noise를 넣어준 것이 test set에서 더 성능이 좋았다.

6. Wall Street Journal Experiment

result2

7. Discussion

DBLSTM은 주변 정보를 더 잘 이용하기 때문에 성능이 더 좋다.
연속적인 결과를 보여준다.
어느정도 noise로 결과를 좋게 만들 수 있다.
lattice를 decoding하며 얻는 distribution을 제대로 학습할 수가 없었다는 문제점 존재
- output이 여러 cell들의 합으로 나타내지므로 제대로 분포를 학습할 수가 없음
- cross-entropy model이 language model에서 큰 영향을 줄 수 없다.
DBLSTM은 전체 sequence의 context를 접근가능
- training data의 implicit word-level language model까지 학습
- decoding 과정에서 생기는 explicit language model에 간섭 가능하다는 단점이 존재
- TIMIT과 같이 biphone language를 modeling하는 경우 interference를 어느정도 무시
- WSJ같이 큰 corpus의 경우 큰 영향을 미침

8. Conclusion

“We have applied a hybrid HMM-Deep Bidirectional LSTM system to the TIMIT and Wall Street Journal speech databases. We found that the system gave state-of-the-art results on TIMIT, and outperformed GMM and deep network benchmarks on WSJ”

DevelopHyun