# CRF Layer on the Top of BiLSTM - 5

#### 2.5 The total score of all the paths

In the last section, we learned how to calculate the score of a single label path, $e^{S_i}$. One problem remains: how to obtain the total score of all the paths ($P_{total} = P_1 + P_2 + \dots + P_N = e^{S_1} + e^{S_2} + \dots + e^{S_N}$).

The simplest way to compute the total score is to enumerate all the possible paths and sum their scores. This works, but it is very inefficient: the number of paths grows exponentially with the sentence length, so the training time becomes unbearable.
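To make the cost concrete, here is a minimal brute-force sketch. It assumes, for illustration only, that a path score $S_i$ is the sum of per-word emission scores and label-to-label transition scores. The random `emission` and `transition` values and the helper names are hypothetical, and the START and END labels are left out to keep the example short.

```python
import itertools
import math
import random

def path_score(path, emission, transition):
    """Score S_i of one path: emission plus transition scores (assumed form)."""
    score = sum(emission[t][label] for t, label in enumerate(path))
    score += sum(transition[a][b] for a, b in zip(path, path[1:]))
    return score

def total_score_brute_force(emission, transition):
    """Sum e^{S_i} over every possible label path by explicit enumeration."""
    n_words = len(emission)
    n_labels = len(transition)
    total = 0.0
    n_paths = 0
    for path in itertools.product(range(n_labels), repeat=n_words):
        total += math.exp(path_score(path, emission, transition))
        n_paths += 1
    return total, n_paths

# Hypothetical scores for a 5-word sentence and 5 labels.
random.seed(0)
n_words, n_labels = 5, 5
emission = [[random.uniform(-1, 1) for _ in range(n_labels)] for _ in range(n_words)]
transition = [[random.uniform(-1, 1) for _ in range(n_labels)] for _ in range(n_labels)]

p_total, n_paths = total_score_brute_force(emission, transition)
print(n_paths)  # 3125 paths for 5 words and 5 labels
```

The loop visits `n_labels ** n_words` paths. With 5 labels, a 5-word sentence has 3125 paths, while a 20-word sentence already has about $9.5 \times 10^{13}$.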

# CRF Layer on the Top of BiLSTM - 4

#### 2.4 Real path score

In section 2.3, we supposed that every possible path has a score $P_{i}$ and that there are $N$ possible paths in total. The total score of all the paths is $P_{total} = P_1 + P_2 + \dots + P_N = e^{S_1} + e^{S_2} + \dots + e^{S_N}$, where $e$ is the mathematical constant.

Obviously, one of the possible paths must be the real one. For example, the real path of the sentence in section 1.2 is “START B-Person I-Person O B-Organization O END”. The others, such as “START B-Person B-Organization O I-Person I-Person B-Person”, are incorrect. $e^{S_i}$ is the score of the $i^{th}$ path.

During training, the CRF loss function needs only two scores: the score of the real path and the total score of all the possible paths. As training proceeds, the proportion of the real path score among the scores of all the possible paths gradually increases.
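As a rough illustration of that proportion, the sketch below uses the negative log of $e^{S_{real}} / P_{total}$ as the loss. This is one common choice and an assumption here, not necessarily the exact form used later in this series. The function name and the example scores are hypothetical.

```python
import math

def crf_negative_log_likelihood(real_path_score, all_path_scores):
    """Return -log(e^{S_real} / sum_i e^{S_i}), given the raw scores S."""
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(all_path_scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in all_path_scores))
    return log_total - real_path_score

# Hypothetical scores S_1..S_3; the first entry is the real path.
scores_before = [2.0, 0.5, -1.0]
scores_after = [4.0, 0.5, -1.0]  # the real path score has been pushed up

loss_before = crf_negative_log_likelihood(scores_before[0], scores_before)
loss_after = crf_negative_log_likelihood(scores_after[0], scores_after)
print(loss_after < loss_before)  # True
```

Raising the real path score relative to the others increases its share of the total, which lowers this loss.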

The calculation of a real path score, $e^{S_i}$, is very straightforward.

# CRF Layer on the Top of BiLSTM - 3

#### 2.3 CRF loss function

The CRF loss function consists of the real path score and the total score of all the possible paths. The real path should have the highest score among all the possible paths.

For example, suppose our dataset has the labels shown in the table:

| Label | Index |
| --- | --- |
| B-Person | 0 |
| I-Person | 1 |
| B-Organization | 2 |
| I-Organization | 3 |
| O | 4 |
| START | 5 |
| END | 6 |

We also have a sentence which has 5 words. The possible paths could be:

• 1) START B-Person B-Person B-Person B-Person B-Person END
• 2) START B-Person I-Person B-Person B-Person B-Person END
• 10) START B-Person I-Person O B-Organization O END
• N) O O O O O O O

# CRF Layer on the Top of BiLSTM - 2

## Review

In the previous section, we learned that the CRF layer can learn constraints from the training dataset to ensure that the final predicted entity label sequences are valid.

The constraints could be:

• The label of the first word in a sentence should start with “B-” or “O”, not “I-”.
• In the pattern “B-label1 I-label2 I-label3 I-…”, label1, label2, label3, … should be the same named entity label. For example, “B-Person I-Person” is valid, but “B-Person I-Organization” is invalid.
• “O I-label” is invalid. The first label of a named entity should start with “B-”, not “I-”; in other words, the valid pattern is “O B-label”.
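The three constraints listed above can be spelled out as a small checker. This is only a sketch of the rules themselves: the CRF layer learns such constraints from the training data rather than being given hard rules. The function name is hypothetical, and the START and END labels are assumed to be omitted.

```python
def is_valid_label_sequence(labels):
    """Check the three constraints on the list of labels for one sentence."""
    previous = "O"  # treat the position before the first word like "O"
    for label in labels:
        if label.startswith("I-"):
            entity = label[2:]
            # "I-x" must follow "B-x" or "I-x" of the same entity type.
            if previous not in ("B-" + entity, "I-" + entity):
                return False
        previous = label
    return True

print(is_valid_label_sequence(["B-Person", "I-Person", "O", "B-Organization", "O"]))  # True
print(is_valid_label_sequence(["B-Person", "I-Organization"]))  # False
print(is_valid_label_sequence(["O", "I-Person"]))  # False
print(is_valid_label_sequence(["I-Person", "O"]))  # False
```

Initialising `previous` to "O" lets a single rule cover both the first-word constraint and the “O I-label” constraint.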

# CRF Layer on the Top of BiLSTM - 1

## Outline

The article series will include:

• Introduction - the general idea of the CRF layer on the top of BiLSTM for named entity recognition tasks
• A Detailed Example - a toy example to explain how the CRF layer works step by step
• Chainer Implementation - a Chainer implementation of the CRF layer