CRF Layer on the Top of BiLSTM - 7

3 Chainer Implementation

In this section, the structure of the code will be explained. In addition, an important tip for implementing the CRF loss layer will be given. Finally, the Chainer (version 2.0) implementation source code will be released in the next article.

3.1 Overall Structure

As you can see, the code mainly includes three parts: initialization, loss computation, and predicting labels for sentences. (The full code will be released in the next article.)

class My_CRF():
    def __init__():
        # [Initialization]
        '''
        Randomly initialize transition scores
        '''

    def __call__(training_data_set):
        # [Loss Function]
        Total Cost = 0.0
        # Compute CRF Loss
        '''
        for sentence in training_data_set:
            1) Compute the real path score of the current sentence
               according to the true labels
            2) Compute the log total score of all the possible paths
               of the current sentence
            3) Compute the cost of this sentence using the results
               from 1) and 2)
            4) Total Cost += Cost of this sentence
        '''
        return Total Cost

    def argmax(new_sentences):
        # [Prediction]
        '''
        Predict labels for new sentences
        '''

3.2 Add Two Extra Labels (START and END)

As described in section 2.2, we added two extra labels, START and END, to the transition score matrix. This affects the initialization of the transition score matrix and the values of the emission score matrix when we compute the loss of a given sentence.

[Example]
Let’s say that in our dataset we only have one type of named entity, PERSON, so we actually have three labels (not including START and END): B-PERSON, I-PERSON and O.

Transition Score Matrix
After adding the two extra labels (START and END), we initialize the transition scores in the __init__ function like this:

n_label = 3  # B-PERSON, I-PERSON and O
transitions = np.array(value, dtype=np.float32)

The shape of “value” is (n_label + 2, n_label + 2), where the extra 2 accounts for the added START and END labels. The entries of “value” are generated randomly.
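The initialization described above can be sketched in a few lines of NumPy. This is only a sketch of the shape logic, not the released Chainer code; the uniform range (-0.1, 0.1) is an assumption, since the article only says the values are random.

```python
import numpy as np

n_label = 3  # B-PERSON, I-PERSON and O

# Random values for a (n_label + 2, n_label + 2) matrix;
# the extra 2 rows/columns correspond to the added START and END labels.
value = np.random.uniform(-0.1, 0.1, (n_label + 2, n_label + 2))
transitions = np.array(value, dtype=np.float32)
```

Any small random initialization works here, since the transition scores are learned during training.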

Emission Score Matrix
You should know that the output of the BiLSTM layer is the emission score matrix of a sentence, as described in section 2.1. For example, if our sentence has 3 words, the output of the BiLSTM looks like this:

Word\Label    B-Person    I-Person    O
w0            0.1         0.2         0.65
w1            0.33        0.18        0.99
w2            0.87        0.66        0.53

After we add the extra START and END labels, the emission score matrix becomes:

Word\Label    B-Person    I-Person    O        START    END
start         -1000       -1000       -1000    0        -1000
w0            0.1         0.2         0.65     -1000    -1000
w1            0.33        0.18        0.99     -1000    -1000
w2            0.87        0.66        0.53     -1000    -1000
end           -1000       -1000       -1000    -1000    0

As shown in the table above, we expanded the emission score matrix by adding two words (start and end) and their corresponding labels (START and END). The outputs of the BiLSTM layer do not include the emission scores of the newly added words, but we can specify these scores manually (i.e. $x_{start,START}=0$ and $x_{start,\text{the other labels}}=-1000$). The emission scores of the other labels on the word “start” should be a small value (e.g. -1000). If you would like to use another small value, that is totally fine and will not affect the performance of our model.
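The table above can be built mechanically from the BiLSTM output. The following is a minimal NumPy sketch of that extension step; the variable names (`emissions`, `extended`, `small`) are illustrative, not from the released code.

```python
import numpy as np

small = -1000.0  # any sufficiently small score works

# Emission scores from the BiLSTM for a 3-word sentence over 3 labels
# (B-Person, I-Person, O), as in the table above.
emissions = np.array([[0.10, 0.20, 0.65],
                      [0.33, 0.18, 0.99],
                      [0.87, 0.66, 0.53]], dtype=np.float32)

n_words, n_label = emissions.shape

# Extended matrix: two extra rows (start, end) and two extra
# columns (START, END), filled with the small value by default.
extended = np.full((n_words + 2, n_label + 2), small, dtype=np.float32)
extended[1:-1, :n_label] = emissions   # copy the BiLSTM outputs
extended[0, n_label] = 0.0             # x_{start, START} = 0
extended[-1, n_label + 1] = 0.0        # x_{end, END} = 0
```

Starting from a matrix filled with the small value means only the two zero entries need to be set explicitly, which matches the table row by row.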

3.3 Updated Overall Structure

Based on the explanation above, here is a more detailed pseudo-code:

class My_CRF():
    def __init__(n_label):
        # [Initialization]
        '''
        1) Randomly initialize the transition score matrix.
           The shape of this matrix is (n_label + 2, n_label + 2).
           n_label is the number of named entity classes in our dataset
           (e.g. B-Person, I-Person, O), and 2 is the number of added
           labels (i.e. START and END).
        2) We also define the small value (e.g. -1000) here.
        '''

    def __call__(training_data_set):
        # [Loss Function]
        Total Cost = 0.0
        # Compute CRF Loss
        for sentence in training_data_set:
            '''
            1) Extend the emission score matrix by adding two words
               (start and end) and two labels (START and END)
            2) Compute the real path score of the current sentence
               according to the true labels (section 2.4)
            3) Compute the log total score of all the possible paths
               of the current sentence (section 2.5)
            4) Compute the cost of this sentence using the results from
               2) and 3), i.e. -(real_path_score - all_path_score)
               (section 2.5)
            5) Total Cost += the cost of the current sentence
            '''
        return Total Cost

    def argmax(new_sentences):
        # [Prediction]
        for sentence in new_sentences:
            '''
            1) Extend the emission score matrix by adding two words
               (start and end) and two labels (START and END)
            2) Predict the labels for the current new sentence (section 2.6)
            '''
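Steps 2) to 4) of the loss function can be sketched in plain NumPy before turning to the Chainer version. This is a hypothetical helper, not the code that will be released; it ignores the START/END extension for brevity and computes the per-sentence cost -(real_path_score - all_path_score) with the forward algorithm from section 2.5.

```python
import numpy as np

def sentence_cost(emissions, transitions, labels):
    """Negative log-likelihood of one sentence.
    emissions:   (n_words, n_label) scores from the BiLSTM
    transitions: (n_label, n_label) transition score matrix
    labels:      list of true label indices, one per word
    """
    n_words, n_label = emissions.shape

    # 2) Real path score: emission scores along the true labels
    #    plus transition scores between consecutive true labels.
    real_path_score = emissions[np.arange(n_words), labels].sum()
    real_path_score += sum(transitions[labels[i], labels[i + 1]]
                           for i in range(n_words - 1))

    # 3) Log total score of all possible paths (forward algorithm,
    #    with the max trick for numerical stability).
    alpha = emissions[0]
    for i in range(1, n_words):
        scores = alpha[:, None] + transitions + emissions[i][None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    all_path_score = m + np.log(np.exp(alpha - m).sum())

    # 4) Cost = -(real_path_score - all_path_score)
    return all_path_score - real_path_score
```

The total loss over the training set is then just the sum of `sentence_cost` over all sentences, as in step 5) of the pseudo-code above.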

Next

The next article will be the final one. In it, the CRF layer implemented in Chainer 2.0 will be released with detailed comments, and the code will also be published on GitHub.

References

[1] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K. and Dyer, C., 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
https://arxiv.org/abs/1603.01360

Note

Please note that the WeChat Public Account is now available! If you found this article useful and would like to learn more about this series, please subscribe to the public account on WeChat. (2020-04-03)

When you reprint or distribute this article, please include the original link address.