# CRF Layer on the Top of BiLSTM - 5

#### 2.5 The total score of all the paths

In the last section, we learned how to calculate the label path score of one path that is $e^{S_i}$. So far, we have one more problem which is needed to be solved, how to obtain the total score of all the paths ($P_{total} = P_1 + P_2 + … + P_N = e^{S_1} + e^{S_2} + … + e^{S_N}$).

The simplest way to measure the total score is that: enumerating all the possible paths and sum their scores. Yes, you can calculate the total score in that way. However, it is very inefficient. The training time will be unbearable.

Before you explore the following content, I suggest that you can prepare a blank paper and a pen, and follow the steps listed in the toy example. I am sure this will help you to understand the algorithm details well. Moreover, you should know how to implement it by your favorite programming language.

Step 1: recall the CRF loss function

In section 2.3, we defined the CRF loss function as:

$Loss Function = \frac{P_{RealPath}}{P_1 + P_2 + … + P_N}$

Now We change the loss function into a log loss function:

$LogLossFunction = \log \frac{P_{RealPath}}{P_1 + P_2 + … + P_N}$

Because when we are training a model, usually our goal is to minimize our loss function, we add a negative sign:

$Log Loss Function$
$= - \log \frac{P_{RealPath}}{P_1 + P_2 + … + P_N}$

$= - \log \frac{e^{S_{RealPath}}}{e^{S_1} + e^{S_2} + … + e^{S_N}}$

$= - (\log(e^{S_{RealPath}}) - \log(e^{S_1} + e^{S_2} + … + e^{S_N}))$

$= - (S_{RealPath} - \log(e^{S_1} + e^{S_2} + … + e^{S_N}))$

$= - ( \sum_{i=1}^{N} x_{iy_i} + \sum_{i=1}^{N-1} t_{y_iy_{i+1}} - \log(e^{S_1} + e^{S_2} + … + e^{S_N}))$

In previous section, we have already known how to calculate the real path score, now we need to find an effective solution to calculating $\log(e^{S_1} + e^{S_2} + … + e^{S_N})$.

Step 2: recall the Emission and Transition Score

To simplify, we assume we are training our model from this toy sentence whose length is only 3:

$\mathbf{x} = [w_0, w_1, w_2]$

Moreover, in our dataset, we have two labels:
$LabelSet = \{l_1,l_2\}$

We also have the Emission scores obtained from the output of Bi-LSTM layer as described in section 2.1:

$\mathbf{l_1}$ $\mathbf{l_2}$
$\mathbf{w_0}$ $x_{01}$ $x_{02}$
$\mathbf{w_1}$ $x_{11}$ $x_{12}$
$\mathbf{w_2}$ $x_{21}$ $x_{22}$

$x_{ij}$ denotes the score of $w_i$ being labelled $l_j$.

Moreover, the Transition scores are from the CRF Layer as shown in section 2.2:

$\mathbf{l_1}$ $\mathbf{l_2}$
$\mathbf{l_1}$ $t_{11}$ $t_{12}$
$\mathbf{l_2}$ $t_{21}$ $t_{22}$

$t_{ij}$ is the score of label transition from label $i$ to label $j$.

Step 3: START FIGHTING! (Paper and Pen get ready)

Remeber: Our goal is: $\log(e^{S_1} + e^{S_2} + … + e^{S_N})$

The process is an accumlation of scores. The idea is similar with dynamic programing (if you do not know what is dynamic programing, you can also continue reading this article. I will explain the toy example step-by-step. But I strongly recommend you to learn the dynamic programing algorithm). In short, the total score of all the possible paths of $w_0$ is calculated. Then, we use the total score to calculate that of $w_0 → w_1$. Finally, we use the lastest total score to calculate that of $w_0 → w_1 → w_2$. The final total score is what we need.

In the following step, you will see two variables: obs and previous. previous stores the final results of previous steps. obs denotes the information from current word.

$w_0$:

$obs = [x_{01}, x_{02}]$
$previous = None$

If our sentence only has one word, $w_0$, we do not have the results from previous steps, therefore $previous$ is None. In addition, we can only observe the first word which is $obs = [x_{01}, x_{02}]$. $x_{01}$ and $x_{02}$ are the emission scores mentioned above.

You may think, what is the total score of all the possible paths of $w_0$? The answer is very simple and it is:

$TotalScore(w_0)=\log (e^{x_{01}} + e^{x_{02}})$

$w_0$ → $w_1$:

$obs = [x_{11}, x_{12}]$
$previous = [x_{01}, x_{02}]$

1) Expand the previous to:

2) Expand the obs to:

You may be wondering, why we need to expand previous and obs to matrices. Because maxtrices can make the calculation of total score efficient. You will see that very soon in the following procedure.

3) Sum previous, obs and transition scores:

Then:

Change the value of previous for the next iteration:

Actually, the second iteration has been finished. In case, some people would like to know how to calculate the total score of all the possible paths ($label_1$ → $label_1$, $label_1$ → $label_2$, $label_2$ → $label_1$, $label_2$ → $label_2$) from $w_0$ to $w_1$, you can do the calculation as follows.

We use the elements in the new previous:

$TotalScore(w_0 → w_1)$
$=\log (e^{previous[0]} + e^{previous[1]})$
$=\log (e^{\log(e^{x_{01}+x_{11}+t_{11}} + e^{x_{02}+x_{11}+t_{21}})}+ e^{\log(e^{x_{01}+x_{12}+t_{12}} + e^{x_{02}+x_{12}+t_{22}})} )$
$=\log(e^{x_{01}+x_{11}+t_{11}}+e^{x_{02}+x_{11}+t_{21}}+e^{x_{01}+x_{12}+t_{12}}+e^{x_{02}+x_{12}+t_{22}})$

Do you find something? Yes, this is exactly our goal, $\log(e^{S_1} + e^{S_2} + … + e^{S_N})$.

In the equation, we can see:

• $S_1 = x_{01}+x_{11}+t_{11}$ ($label_1$ → $label_1$)
• $S_2 = x_{02}+x_{11}+t_{21}$ ($label_2$ → $label_1$)
• $S_3 = x_{01}+x_{12}+t_{12}$ ($label_1$ → $label_2$)
• $S_4 = x_{02}+x_{12}+t_{22}$ ($label_2$ → $label_2$)

$w_0$ → $w_1$ → $w_2$:

If you are reading here, you are almost done. In fact, in this iteration, we will do the same procedure as described in the last iteration.

$obs = [x_{21}, x_{22}]$
$previous=[\log (e^{x_{01}+x_{11}+t_{11}} + e^{x_{02}+x_{11}+t_{21}}), \log (e^{x_{01}+x_{12}+t_{12}} + e^{x_{02}+x_{12}+t_{22}})]$

1) Expand the previous to:

2) Expand the obs to:

3) Sum previous, obs and transition scores:

Then:

Change the value of previous for the next iteration:

$previous = [$

$]$
$=[\log( (e^{x_{01}+x_{11}+t_{11}} + e^{x_{02}+x_{11}+t_{21}})e^{x_{21} + t_{11}} + (e^{x_{01}+x_{12}+t_{12}} + e^{x_{02}+x_{12}+t_{22}})e^{x_{21} + t_{21}} ),$
$\log( (e^{x_{01}+x_{11}+t_{11}} + e^{x_{02}+x_{11}+t_{21}})e^{x_{22} + t_{12}} + (e^{x_{01}+x_{12}+t_{12}} + e^{x_{02}+x_{12}+t_{22}})e^{x_{22} + t_{22}})]$

As described in the last iteration, we use the elements in the new previous to calculate the total score:

$TotalScore(w_0 → w_1 → w_2)$
$=\log (e^{previous[0]} + e^{previous[1]})$
$=\log (e^{\log( (e^{x_{01}+x_{11}+t_{11}} + e^{x_{02}+x_{11}+t_{21}})e^{x_{21} + t_{11}} + (e^{x_{01}+x_{12}+t_{12}} + e^{x_{02}+x_{12}+t_{22}})e^{x_{21} + t_{21}} )}$
$+e^{\log( (e^{x_{01}+x_{11}+t_{11}} + e^{x_{02}+x_{11}+t_{21}})e^{x_{22} + t_{12}} + (e^{x_{01}+x_{12}+t_{12}} + e^{x_{02}+x_{12}+t_{22}})e^{x_{22} + t_{22}})} )$

$=\log (e^{x_{01}+x_{11}+t_{11}+x_{21}+t_{11}}+e^{x_{02}+x_{11}+t_{21}+x_{21}+t_{11}}$
$+e^{x_{01}+x_{12}+t_{12}+x_{21}+t_{21}}+e^{x_{02}+x_{12}+t_{22}+x_{21}+t_{21}}$
$+e^{x_{01}+x_{11}+t_{11}+x_{22}+t_{12}}+e^{x_{02}+x_{11}+t_{21}+x_{22}+t_{12}}$
$+e^{x_{01}+x_{12}+t_{12}+x_{22}+t_{22}}+e^{x_{02}+x_{12}+t_{22}+x_{22}+t_{22}})$

Congratulations!

We achieve the goal, $\log(e^{S_1} + e^{S_2} + … + e^{S_N})$. There are three words in our toy sentence and two labels in our label set. Therefore, there should be totally 8 possible label paths.

Before you enjoy a cup of coffee or a sweet cake to have a rest, please allow me to say something. Although you find the process quite complicated, the implementation of this algorithm would be much much easier. One of the advantages of computer is to do repetitive work.

Now you can implement the CRF loss function by yourself and start to train your own model.

### Next

#### 2.6 Infer the labels for a new sentence

We have learnt the details of CRF loss function, the next step is how to infer the labels for a new sentence when we apply our model to a test set.

(Sorry for my late update, I will try my best to squeeze time for updating the following sections.)

## References

[1] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K. and Dyer, C., 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
https://arxiv.org/abs/1603.01360