# CRF Layer on the Top of BiLSTM - 8

#### 3.4 Demo

In this section, we will make up two fake sentences, which have only 2 words and 1 word respectively, and randomly generate their true labels. Finally, we will show how to train the CRF layer using Chainer v2.0. All the code, including the CRF layer, is available on GitHub.

First, we import our own CRF layer implementation, ‘MyCRFLayer’.

We assume that our dataset has only 2 labels (e.g. B-Person and O).

The following code block generates 2 sentences, xs = [x1, x2]. Sentence x1 has two words and x2 has only one word.

Note that the elements of x1 and x2 are not word embeddings but the emission scores from the BiLSTM layer, which is not implemented here.

For example, sentence x1 has two words, $w_0$ and $w_1$, so x1 is a matrix of shape (2, 2). The first “2” is the number of words and the second “2” is the number of labels in our dataset, as shown in the table below.

| x1 | B-Person | O |
|----|----------|------|
| $w_0$ | 0.1 | 0.65 |
| $w_1$ | 0.33 | 0.99 |
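As a minimal sketch of what these inputs look like, the two sentences can be written as plain nested lists. The values of x1 come from the table above; the values of x2 and the name `LABELS` are made up here for illustration, and the real demo generates its scores randomly.

```python
# Emission scores for two toy sentences.
# Each row is one word, each column is one label.
LABELS = ["B-Person", "O"]  # label index 0 and label index 1

# x1: values taken from the table above (2 words x 2 labels).
x1 = [
    [0.10, 0.65],  # w0
    [0.33, 0.99],  # w1
]

# x2: a single word; these two scores are invented for illustration.
x2 = [
    [0.20, 0.40],  # w0
]

xs = [x1, x2]

for i, x in enumerate(xs):
    # Report (number of words, number of labels) for each sentence.
    print("sentence", i, "shape:", (len(x), len(x[0])))
```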

Next, we need the true labels for these two sentences.

Here is the randomly generated ground truth.

Ground Truth:
sentence 0: [0 0]
sentence 1: [1]
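Random labels of this kind could be drawn as follows. This is only a sketch: the fixed seed and the variable names are assumptions, so the labels it prints will not necessarily match the ground truth shown above.

```python
import random

N_LABELS = 2               # B-Person -> 0, O -> 1
sentence_lengths = [2, 1]  # x1 has two words, x2 has one

# The fixed seed is an assumption, used only to make the sketch repeatable.
rng = random.Random(0)

# One random label index per word, for every sentence.
ys = [[rng.randrange(N_LABELS) for _ in range(n)] for n in sentence_lengths]

print("Ground Truth:")
for i, y in enumerate(ys):
    print("sentence", i, ":", y)
```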


Although we do not actually have a BiLSTM layer, this does not prevent us from showing how to train a model in Chainer. We have simulated the outputs of the BiLSTM layer and the true answers, so we can use an optimizer to train our CRF layer.

In this article, we use Stochastic Gradient Descent (SGD) to train our model. (If you are not yet familiar with training methods, you can learn about them later.) The optimizer updates the parameters of our CRF layer (i.e. the transition matrix) according to the loss between the predicted labels and the ground-truth labels.
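For reference, the loss and the update can be written compactly. This is the usual CRF negative log-likelihood; the symbols are assumptions chosen here: $S_{\text{real}}$ is the score of the ground-truth path, $S_p$ is the score of a candidate path $p$, $T$ is the transition matrix and $\eta$ is the learning rate.

$$
\text{Loss} = -\log\frac{e^{S_{\text{real}}}}{\sum_{p} e^{S_p}} = \log\sum_{p} e^{S_p} - S_{\text{real}},
\qquad
T \leftarrow T - \eta\,\frac{\partial\,\text{Loss}}{\partial T}
$$

The loss shrinks as the true path takes a larger share of the total score over all paths, which is what each update step pushes the transition scores towards.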

The CRF layer is initialized with the number of labels (NOT including the extra START and END labels).
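To make the role of START and END concrete, here is a rough sketch of how a transition matrix could be allocated from the number of labels. The function name `init_transitions`, the index layout (real labels first, then START, then END) and the initial value range are all assumptions; the actual MyCRFLayer may organize this differently.

```python
import random

def init_transitions(n_labels, seed=0):
    """Return an (n_labels + 2) x (n_labels + 2) matrix of small random scores.

    Assumed layout: indices 0..n_labels-1 are the real labels,
    index n_labels is START and index n_labels + 1 is END.
    """
    rng = random.Random(seed)
    size = n_labels + 2
    return [[rng.uniform(-0.1, 0.1) for _ in range(size)] for _ in range(size)]

n_labels = 2                      # the caller passes only the real labels
transitions = init_transitions(n_labels)
START, END = n_labels, n_labels + 1

# The matrix is two rows and two columns larger than the label set.
print(len(transitions), "x", len(transitions[0]))
```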

Then we can start to train our CRF Layer.

As the output of our code shows, the loss decreases and the CRF layer learns: the predictions become correct.

Predictions:
Epoch 0: (loss=3.06651592255)
sentence 0: [1 1]
sentence 1: [1]
Epoch 50: (loss=1.96822023392)
sentence 0: [1 1]
sentence 1: [1]
Epoch 100: (loss=1.51349794865)
sentence 0: [0 0]
sentence 1: [1]
Epoch 150: (loss=1.27118945122)
sentence 0: [0 0]
sentence 1: [1]
Epoch 200: (loss=1.09977662563)
sentence 0: [0 0]
sentence 1: [1]
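The training run above can be approximated without Chainer. The sketch below is not the MyCRFLayer implementation. It enumerates every label path by brute force instead of using the forward algorithm and Viterbi decoding, and it estimates gradients by finite differences instead of backpropagation. The x2 scores, the seed, the learning rate and all function names are assumptions, so the printed numbers will differ from the output shown above.

```python
import itertools
import math
import random

N_LABELS = 2
START, END = N_LABELS, N_LABELS + 1  # assumed index layout

def path_score(emissions, path, trans):
    """Emission scores along the path plus transition scores, with START and END."""
    score = trans[START][path[0]] + trans[path[-1]][END]
    score += sum(emissions[i][t] for i, t in enumerate(path))
    score += sum(trans[a][b] for a, b in zip(path, path[1:]))
    return score

def all_paths(n_words):
    """Every possible label sequence for a sentence of n_words words."""
    return itertools.product(range(N_LABELS), repeat=n_words)

def loss(xs, ys, trans):
    """Negative log-likelihood of the true paths, summed over sentences."""
    total = 0.0
    for x, y in zip(xs, ys):
        log_z = math.log(sum(math.exp(path_score(x, p, trans))
                             for p in all_paths(len(x))))
        total += log_z - path_score(x, tuple(y), trans)
    return total

def predict(x, trans):
    """Highest-scoring path, found by exhaustive search."""
    return list(max(all_paths(len(x)), key=lambda p: path_score(x, p, trans)))

def gradient_step(xs, ys, trans, lr=0.1, eps=1e-5):
    """One gradient-descent update using central finite differences."""
    size = len(trans)
    grad = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            old = trans[i][j]
            trans[i][j] = old + eps
            up = loss(xs, ys, trans)
            trans[i][j] = old - eps
            down = loss(xs, ys, trans)
            trans[i][j] = old
            grad[i][j] = (up - down) / (2 * eps)
    for i in range(size):
        for j in range(size):
            trans[i][j] -= lr * grad[i][j]

# Simulated emission scores (x2 is made up) and the ground truth from above.
xs = [[[0.10, 0.65], [0.33, 0.99]], [[0.20, 0.40]]]
ys = [[0, 0], [1]]

rng = random.Random(0)
size = N_LABELS + 2
trans = [[rng.uniform(-0.1, 0.1) for _ in range(size)] for _ in range(size)]

initial_loss = loss(xs, ys, trans)
for epoch in range(201):
    gradient_step(xs, ys, trans)
    if epoch % 50 == 0:
        print("Epoch", epoch, "loss =", round(loss(xs, ys, trans), 4),
              "predictions =", [predict(x, trans) for x in xs])
final_loss = loss(xs, ys, trans)
```

Brute-force enumeration is only workable because the sentences have one or two words; the number of paths grows exponentially with sentence length, which is why a real CRF layer uses dynamic programming for both the loss and the decoding.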


#### 3.5 GitHub

The demo and the CRF layer code are available on GitHub. As you will see, the code is not perfect: to keep it easy to understand, some parts are implemented very naively. I believe it could be optimized into a more efficient algorithm.

To end this series, I would like to thank everyone who is reading these articles. I hope you enjoyed the explanation of the CRF layer.

It is probably too early to say, but anyway, HAPPY CHRISTMAS and HAPPY NEW YEAR :)!
