CRF Layer on the Top of BiLSTM - 5

2.5 The total score of all the paths

In the last section, we learned how to calculate the score of a single label path, $e^{S_i}$. One problem remains to be solved: how to obtain the total score of all the paths ($ P_{total} = P_1 + P_2 + … + P_N = e^{S_1} + e^{S_2} + … + e^{S_N} $).

The simplest way to compute the total score is to enumerate all the possible paths and sum their scores. Yes, you can calculate the total score that way. However, it is very inefficient: the number of paths grows exponentially with the length of the sentence, so the training time becomes unbearable.
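To make the cost of enumeration concrete, here is a minimal brute-force sketch. It assumes that a path score $S_i$ is the sum of per-word emission scores and label-to-label transition scores, and it leaves out the START and END labels for brevity. The label set, the `emission` and `transition` tables, and every number in them are made up for illustration.

```python
import itertools
import math

# Hypothetical toy setup: 2 labels and a 3-word sentence. All scores are made up.
labels = ["B-Person", "O"]

# emission[t][k]: score of assigning label k to word t (assumed values)
emission = [
    [1.5, 0.2],
    [0.3, 1.1],
    [0.4, 0.9],
]

# transition[j][k]: score of moving from label j to label k (assumed values)
transition = [
    [0.1, 0.8],
    [0.6, 0.5],
]


def path_score(path):
    """S_i: emission scores plus transition scores along one label path."""
    score = sum(emission[t][k] for t, k in enumerate(path))
    score += sum(transition[j][k] for j, k in zip(path, path[1:]))
    return score


def total_score_brute_force(num_words, num_labels):
    """P_total: sum of e^{S_i} over every possible label path."""
    all_paths = itertools.product(range(num_labels), repeat=num_words)
    return sum(math.exp(path_score(p)) for p in all_paths)


p_total = total_score_brute_force(len(emission), len(labels))
num_paths = len(labels) ** len(emission)
print(f"{num_paths} paths, P_total = {p_total:.4f}")
```

With 2 labels and 3 words there are only $2^3 = 8$ paths, but the count is (number of labels) raised to the (number of words), which is why this approach does not scale to real sentences and label sets.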


CRF Layer on the Top of BiLSTM - 4

2.4 Real path score

In section 2.3, we supposed that every possible path has a score $ P_{i} $ and that there are $ N $ possible paths in total. The total score of all the paths is $ P_{total} = P_1 + P_2 + … + P_N = e^{S_1} + e^{S_2} + … + e^{S_N} $, where $ e $ is the mathematical constant.

Obviously, one of the possible paths must be the real one. For example, the real path of the sentence in section 1.2 is “START B-Person I-Person O B-Organization O END”. The others are incorrect, such as “START B-Person B-Organization O I-Person I-Person B-Person”. $ e^{S_i} $ is the score of the $ i^{th} $ path.

During the training process, the CRF loss function needs only two scores: the score of the real path and the total score of all the possible paths. As training proceeds, the proportion of the real path score among the scores of all the possible paths gradually increases.

The calculation of a real path score, $e^{S_i}$, is very straightforward.


CRF Layer on the Top of BiLSTM - 3

2.3 CRF loss function

The CRF loss function consists of the real path score and the total score of all the possible paths. The real path should have the highest score among all the possible paths.
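One common way to combine these two scores, sketched here under the assumption that the loss is a negative log-likelihood, is to take the share of the real path in the total and minimize its negative logarithm. $ P_{RealPath} $ and $ S_{RealPath} $ are names introduced here for the score of the real path; $ P_1, …, P_N $ are the scores of all $ N $ possible paths.

$$ \text{Loss} = -\log \frac{P_{RealPath}}{P_1 + P_2 + … + P_N} = -\log \frac{e^{S_{RealPath}}}{e^{S_1} + e^{S_2} + … + e^{S_N}} $$

Under this form, the loss gets smaller as the real path takes a larger share of the total score, and it reaches zero only if the real path accounts for all of it.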

For example, if we have these labels in our dataset as shown in the table:

| Label | Index |
|---|---|
| B-Person | 0 |
| I-Person | 1 |
| B-Organization | 2 |
| I-Organization | 3 |
| O | 4 |
| START | 5 |
| END | 6 |

Suppose we also have a sentence of 5 words. The possible paths could be:

  • 1) START B-Person B-Person B-Person B-Person B-Person END
  • 2) START B-Person I-Person B-Person B-Person B-Person END
  • 10) START B-Person I-Person O B-Organization O END
  • N) O O O O O O O
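The paths listed above can be generated mechanically. The sketch below makes one simplifying assumption that the list itself does not state: START and END are pinned to the two ends, and each of the 5 words may take any of the other five labels. The helper `enumerate_paths` and the variable names are hypothetical.

```python
import itertools

# Label table from the example above.
label_to_index = {
    "B-Person": 0,
    "I-Person": 1,
    "B-Organization": 2,
    "I-Organization": 3,
    "O": 4,
    "START": 5,
    "END": 6,
}

# Assumption: only the five non-boundary labels can be assigned to a word.
word_labels = [label for label in label_to_index if label not in ("START", "END")]


def enumerate_paths(num_words):
    """Yield every label path, with START and END fixed at the two ends."""
    for inner in itertools.product(word_labels, repeat=num_words):
        yield ("START", *inner, "END")


paths = list(enumerate_paths(5))
real_path = ("START", "B-Person", "I-Person", "O", "B-Organization", "O", "END")

print(len(paths))                                # number of candidate paths
print(real_path in paths)                        # the real path is one of them
print([label_to_index[l] for l in real_path])    # the real path as label indices
```

Under this assumption a 5-word sentence already has $5^5 = 3125$ candidate paths. If START and END were also allowed at every position, the count would be larger still.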


CRF Layer on the Top of BiLSTM - 2

Review

In the previous section, we learned that the CRF layer can learn constraints from the training dataset to ensure that the final predicted entity label sequences are valid.

The constraints could be:

  • The label of the first word in a sentence should start with “B-” or “O”, not “I-”.
  • In the pattern “B-label1 I-label2 I-label3 I-…”, label1, label2, label3 … should be the same named entity label. For example, “B-Person I-Person” is valid, but “B-Person I-Organization” is invalid.
  • “O I-label” is invalid. The first label of a named entity should start with “B-”, not “I-”. In other words, the valid pattern is “O B-label”.
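The three constraints above can be written down as a small hand-coded check. This is only an illustration of what the rules say: the CRF layer is not given such rules, it learns them from the training data. The function name `is_valid_bio` is hypothetical.

```python
def is_valid_bio(labels):
    """Return True if a label sequence satisfies the three constraints above."""
    # Treat the position before the first word like "O": an "I-" label cannot follow it.
    previous = "O"
    for label in labels:
        if label.startswith("I-"):
            entity = label[2:]
            # "I-X" is only allowed right after "B-X" or "I-X" of the same entity type.
            if previous not in (f"B-{entity}", f"I-{entity}"):
                return False
        previous = label
    return True


print(is_valid_bio(["B-Person", "I-Person", "O", "B-Organization", "O"]))
print(is_valid_bio(["B-Person", "I-Organization"]))
print(is_valid_bio(["O", "I-Person"]))
```

The first sequence passes, while the second and third break the "same entity label" and "O I-label" rules respectively.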

After reading this article, you will know why the CRF layer can learn those constraints.


CRF Layer on the Top of BiLSTM - 1

Outline

The article series will include:

  • Introduction - the general idea of the CRF layer on the top of BiLSTM for named entity recognition tasks
  • A Detailed Example - a toy example to explain how the CRF layer works step by step
  • Chainer Implementation - a Chainer implementation of the CRF layer

Who is this article series for?
This article series is for students and anyone else who is new to natural language processing or other AI-related areas. I hope you can find what you want to know in my articles. Moreover, please feel free to provide any comments or suggestions to improve the series.

Prior Knowledge

The only thing you need to know is what Named Entity Recognition is. If you are not familiar with neural networks, CRFs, or other related topics, please do NOT worry about that. I will explain everything as intuitively as possible.

1. Introduction

For named entity recognition tasks, neural network based methods are very popular and common. For example, this paper[1] proposed a BiLSTM-CRF named entity recognition model that uses word and character embeddings. I will use the model in this paper as an example to explain how the CRF layer works.

If you do not know the details of BiLSTM and CRF, just remember they are two different layers in a named entity recognition model.
