3 Chainer Implementation
In this section, the structure of the code will be explained, along with an important tip for implementing the CRF loss layer. The Chainer (version 2.0) implementation source code will be released in the next article.
3.1 Overall Structure
The code mainly includes three parts: initialization, loss computation, and predicting the labels for a sentence. (The full code will be released in the next article.)
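Since the full code is not shown here, the following is a minimal sketch of how those three parts might be organized. All names (`MyCRFLayer`, its methods) are hypothetical, and the loss and prediction bodies are left as stubs:

```python
import numpy as np

class MyCRFLayer:
    """Hypothetical sketch of the three parts: initialization,
    loss computation, and label prediction for one sentence."""

    def __init__(self, n_label):
        # 1) Initialization: build the transition score matrix.
        #    Two extra rows/columns are reserved for START and END.
        self.n_label = n_label
        self.transitions = np.random.uniform(
            -0.1, 0.1, (n_label + 2, n_label + 2))

    def loss(self, emissions, labels):
        # 2) Loss computation:
        #    log(total score of all paths) - score of the real path
        raise NotImplementedError  # covered in the released code

    def predict(self, emissions):
        # 3) Prediction: find the best label sequence for a sentence
        raise NotImplementedError  # covered in the released code
```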
3.2 Add Two Extra Labels (START and END)
As described in 2.2, we added the two labels START and END to the transition score matrix. This affects both the initialization of the transition score matrix and the values of the emission score matrix when we compute the loss for a given sentence.
[Example]
Let’s say that in our dataset we have only one type of named entity, PERSON, so we actually have three labels (not including START and END): B-PERSON, I-PERSON and O.
Transition Score Matrix
After adding the two extra labels (START and END), we initialize the transition score matrix in the init function. The shape of this matrix, called “value” here, is (n_label + 2, n_label + 2), where the 2 accounts for START and END. The entries of “value” are generated randomly.
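A minimal sketch of such an initialization (the variable name “value” follows the text; the uniform range is an assumption, since the released code may use a different random scheme):

```python
import numpy as np

n_label = 3  # B-PERSON, I-PERSON, O

# Two extra rows/columns for the START and END labels.
# Entries are generated randomly, as described above.
value = np.random.uniform(-0.1, 0.1, (n_label + 2, n_label + 2))

print(value.shape)  # (5, 5)
```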
Emission Score Matrix
Recall that the output of the BiLSTM layer is the emission score matrix for a sentence, as described in 2.1. For example, if our sentence has 3 words, the output of the BiLSTM would look like this:
Word\Label | B-PERSON | I-PERSON | O |
---|---|---|---|
w0 | 0.1 | 0.2 | 0.65 |
w1 | 0.33 | 0.18 | 0.99 |
w2 | 0.87 | 0.66 | 0.53 |
After we add the extra START and END labels, the emission score matrix becomes:
Word\Label | B-PERSON | I-PERSON | O | START | END |
---|---|---|---|---|---|
start | -1000 | -1000 | -1000 | 0 | -1000 |
w0 | 0.1 | 0.2 | 0.65 | -1000 | -1000 |
w1 | 0.33 | 0.18 | 0.99 | -1000 | -1000 |
w2 | 0.87 | 0.66 | 0.53 | -1000 | -1000 |
end | -1000 | -1000 | -1000 | -1000 | 0 |
As shown in the table above, we expanded the emission score matrix by adding two words (start and end) and their corresponding labels (START and END). The outputs of the BiLSTM layer do not include emission scores for these newly added words, but we can specify these scores manually (i.e. $x_{start,START}=0$ and $x_{start,\textrm{the other labels}}=-1000$). That is, the emission scores of all other labels on the word “start” should be a small value (e.g. -1000). If you would like to use a different small value, that is totally fine and will not affect the performance of our model.
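This padding step can be sketched with NumPy as follows. The names (`emissions`, `padded`, `SMALL`) are illustrative, and -1000 is just the small value used in the table above:

```python
import numpy as np

SMALL = -1000.0  # any sufficiently small value works

# BiLSTM output for a 3-word sentence with 3 real labels
# (the first table above)
emissions = np.array([[0.10, 0.20, 0.65],
                      [0.33, 0.18, 0.99],
                      [0.87, 0.66, 0.53]])

n_word, n_label = emissions.shape

# Expanded matrix: 2 extra rows (words "start" and "end")
# and 2 extra columns (labels START and END), filled with SMALL
padded = np.full((n_word + 2, n_label + 2), SMALL)
padded[1:-1, :n_label] = emissions   # copy the real emission scores
padded[0, n_label] = 0.0             # x_{start, START} = 0
padded[-1, n_label + 1] = 0.0        # x_{end, END} = 0
```

Running this reproduces the second table above row for row.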
3.3 Updated Overall Structure
Based on the explanation above, here is a more detailed pseudo-code:
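A Python-style sketch of that structure, combining the pieces above (all names are hypothetical, and the loss and prediction bodies are left as stubs to be filled in by the released code):

```python
import numpy as np

SMALL = -1000.0

class MyCRFLayer:
    def __init__(self, n_label):
        # transition matrix covers the real labels plus START and END
        self.n_label = n_label
        self.transitions = np.random.uniform(
            -0.1, 0.1, (n_label + 2, n_label + 2))

    def _pad_emissions(self, emissions):
        # add the start/end words and the START/END label columns
        n_word, n_label = emissions.shape
        padded = np.full((n_word + 2, n_label + 2), SMALL)
        padded[1:-1, :n_label] = emissions
        padded[0, n_label] = 0.0        # x_{start, START} = 0
        padded[-1, n_label + 1] = 0.0   # x_{end, END} = 0
        return padded

    def loss(self, emissions, labels):
        padded = self._pad_emissions(emissions)
        # loss = log(total score of all paths) - score of the real path
        ...

    def predict(self, emissions):
        padded = self._pad_emissions(emissions)
        # Viterbi decoding over padded emissions + self.transitions
        ...
```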
Next
The next article will be the final one of this series. In it, the CRF layer implemented with Chainer 2.0 will be released with detailed comments, and it will also be published on GitHub.
Note
Please note: the WeChat public account is now available! If you found this article useful and would like more information about this series, please subscribe to the public account on WeChat! (2020-04-03)
When you reprint or distribute this article, please include a link to the original.