# Super Machine Learning Revision Notes

### [Last Updated: 06/01/2019]

• basic concepts in machine learning (e.g. gradient descent, back propagation etc.)
• different algorithms and various popular models
• some practical tips and examples learned from my own practice and from online courses such as Deep Learning AI.

If you are a student studying machine learning, I hope this article helps you shorten your revision time and gives you some useful inspiration. If you are not a student, I hope it is helpful when you cannot recall a model or an algorithm.

Moreover, you can also treat it as a “Quick Check Guide”. Feel free to use Ctrl+F to search for any keywords that interest you.

Any comments and suggestions are most welcome!

# CRF Layer on the Top of BiLSTM - 8

#### 3.4 Demo

In this section, we will make two fake sentences that have only 2 words and 1 word, respectively. We will also randomly generate their true answers. Finally, we will show how to train the CRF layer using Chainer v2.0. All the code, including the CRF layer, is available on GitHub.

# CRF Layer on the Top of BiLSTM - 7

#### 3 Chainer Implementation

In this section, the structure of the code will be explained. In addition, an important tip for implementing the CRF loss layer will be given. Finally, the Chainer (version 2.0) implementation source code will be released in the next article.

# CRF Layer on the Top of BiLSTM - 6

#### 2.6 Infer the labels for a new sentence

In the previous sections, we learned the structure of the BiLSTM-CRF model and the details of the CRF loss function. You can implement your own BiLSTM-CRF model with various open-source frameworks (Keras, Chainer, TensorFlow etc.). One of the greatest things is that these frameworks compute the backpropagation for your model automatically, so you do not need to implement it yourself (i.e. compute the gradients and update the parameters) to train your model. Moreover, some frameworks have already implemented the CRF layer, so combining a CRF layer with your own model can be as easy as adding about one line of code.

In this section, we will explore how to infer the labels for a sentence at test time, once our model is ready.
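As a rough illustration of what inference involves, the sketch below picks the highest-scoring label path with Viterbi decoding, a common choice for CRF layers. It is a minimal sketch under stated assumptions: the names `viterbi_decode`, `emissions` and `transitions` are hypothetical, the START and END labels are left out for brevity, and `transitions[a, b]` is assumed to hold the score of moving from label `a` to label `b`.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the best label path and its score S (hypothetical helper).

    emissions:   (n_words, n_labels) array of emission scores.
    transitions: (n_labels, n_labels) array, transitions[a, b] = score of a -> b.
    START/END labels are omitted here (an assumption made for brevity).
    """
    n_words, n_labels = emissions.shape
    best = emissions[0].copy()                    # best score of paths ending in each label
    backptr = np.zeros((n_words, n_labels), dtype=int)
    for t in range(1, n_words):
        # scores[a, b] = best path ending in a, followed by transition a -> b
        scores = best[:, None] + transitions
        backptr[t] = scores.argmax(axis=0)        # best previous label for each b
        best = scores.max(axis=0) + emissions[t]
    # Follow the back-pointers from the best final label.
    path = [int(best.argmax())]
    for t in range(n_words - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return path, float(best.max())

# Toy example: 2 words, 2 labels (made-up scores).
emissions = np.array([[1.0, 0.5], [0.2, 2.0]])
transitions = np.array([[0.1, 0.4], [0.3, 0.6]])
print(viterbi_decode(emissions, transitions))
```

The loop only keeps the best score per label at each word, so the cost grows linearly with sentence length instead of with the number of possible paths.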

# CRF Layer on the Top of BiLSTM - 5

#### 2.5 The total score of all the paths

In the last section, we learned how to calculate the score of one label path, that is, $e^{S_i}$. One problem remains to be solved: how to obtain the total score of all the paths ($P_{total} = P_1 + P_2 + \dots + P_N = e^{S_1} + e^{S_2} + \dots + e^{S_N}$).

The simplest way to compute the total score is to enumerate all the possible paths and sum their scores. Yes, you can calculate the total score that way. However, it is very inefficient, because the number of paths grows exponentially with the sentence length, and the training time would be unbearable.
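To make the cost concrete, the sketch below computes $P_{total}$ by brute-force enumeration and compares it with a dynamic-programming recursion (often called the forward algorithm) that works in log space. This is only a sketch under assumptions: the function names are hypothetical, START and END labels are omitted, and the recursion shown is one common efficient method, not necessarily the exact one derived in this series.

```python
import itertools
import numpy as np

def path_score(emissions, transitions, path):
    """S_i of one path: sum of its emission and transition scores."""
    score = emissions[0, path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1], path[t]] + emissions[t, path[t]]
    return score

def total_score_brute_force(emissions, transitions):
    """P_total by enumerating all n_labels ** n_words paths (slow)."""
    n_words, n_labels = emissions.shape
    paths = itertools.product(range(n_labels), repeat=n_words)
    return sum(np.exp(path_score(emissions, transitions, p)) for p in paths)

def log_total_score_forward(emissions, transitions):
    """log(P_total) with a forward recursion, O(n_words * n_labels ** 2)."""
    alpha = emissions[0].copy()   # log-sum of scores of paths ending in each label
    for t in range(1, len(emissions)):
        # scores[a, b] = alpha[a] + transitions[a, b] + emissions[t, b]
        scores = alpha[:, None] + transitions + emissions[t][None, :]
        m = scores.max(axis=0)    # subtract the max for numerical stability
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

# Made-up scores: 4 words, 3 labels, so 3 ** 4 = 81 paths to enumerate.
rng = np.random.default_rng(0)
emissions = rng.normal(size=(4, 3))
transitions = rng.normal(size=(3, 3))
print(np.log(total_score_brute_force(emissions, transitions)))
print(log_total_score_forward(emissions, transitions))
```

Both functions should agree up to floating-point error; only the second one remains practical when a sentence has dozens of words.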

# CRF Layer on the Top of BiLSTM - 4

#### 2.4 Real path score

In section 2.3, we supposed that every possible path has a score $P_{i}$ and that there are $N$ possible paths in total. The total score of all the paths is $P_{total} = P_1 + P_2 + \dots + P_N = e^{S_1} + e^{S_2} + \dots + e^{S_N}$, where $e$ is the mathematical constant.

Obviously, exactly one of the possible paths is the real one. For example, the real path of the sentence in section 1.2 is “START B-Person I-Person O B-Organization O END”. The others are incorrect, such as “START B-Person B-Organization O I-Person I-Person B-Person”. $e^{S_i}$ is the score of the $i^{th}$ path.

During the training process, the CRF loss function only needs two scores: the score of the real path and the total score of all the possible paths. As training proceeds, the proportion of the real path score among the scores of all the possible paths gradually increases.
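One common way to turn this proportion into a loss is to take its negative logarithm. The form below is a sketch under that assumption, with $S_{real}$ used here as a hypothetical name for the score $S_i$ of the real path:

$$
Loss = -\log \frac{P_{real}}{P_{total}} = -\log \frac{e^{S_{real}}}{e^{S_1} + e^{S_2} + \dots + e^{S_N}} = -\left( S_{real} - \log\left(e^{S_1} + e^{S_2} + \dots + e^{S_N}\right) \right)
$$

Minimizing this quantity pushes the proportion $P_{real} / P_{total}$ towards 1, which matches the behaviour described above.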

The calculation of a real path score, $e^{S_i}$, is very straightforward.
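As a minimal sketch of that calculation, assume $S_i$ is the sum of the emission scores of the chosen labels and the transition scores between consecutive labels. The helper name `path_score` and the toy numbers are hypothetical, and the START and END labels are omitted to keep the example short.

```python
import numpy as np

def path_score(emissions, transitions, path):
    """S_i of one label path (hypothetical helper).

    emissions:   (n_words, n_labels) array of emission scores.
    transitions: (n_labels, n_labels) array, transitions[a, b] = score of a -> b.
    path:        one label index per word.
    """
    score = emissions[0, path[0]]
    for t in range(1, len(path)):
        # add the transition into the current label, then its emission score
        score += transitions[path[t - 1], path[t]] + emissions[t, path[t]]
    return score

# Toy example: 2 words, 2 labels, real path = [0, 1] (made-up scores).
emissions = np.array([[1.0, 0.5], [0.2, 2.0]])
transitions = np.array([[0.1, 0.4], [0.3, 0.6]])
s_real = path_score(emissions, transitions, [0, 1])  # 1.0 + 0.4 + 2.0
print(s_real, np.exp(s_real))                        # S_i and e^{S_i}
```

In practice it is usually enough to keep $S_i$ itself rather than $e^{S_i}$, since a log-based loss only needs the exponent.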