2018-01-23

Super Machine Learning Revision Notes

[Last Updated: 06/01/2019]

This article aims to summarise:

basic concepts in machine learning (e.g. gradient descent, back propagation etc.)
different algorithms and various popular models
some practical tips and examples learned from my own practice and from online courses such as Deep Learning AI.

If you are a student studying machine learning, I hope this article helps you shorten your revision time and gives you some useful inspiration. If you are not a student, I hope it is helpful when you cannot recall a model or an algorithm.

Moreover, you can also treat it as a “Quick Check Guide”. Please feel free to use Ctrl+F to search for any keywords that interest you.

Any comments and suggestions are most welcome!

Note

Please note: the WeChat Public Account is available now! If you find this article useful and would like to find more information about this series, please subscribe to the public account on WeChat. (2020-04-03)

Name	Function	Derivative
sigmoid	$g(z)=\frac{1}{1+e^{-z}}$	$g(z)(1-g(z))$
tanh	$\tanh(z)$	$1-\tanh^2(z)$
ReLU	$\max(0,z)$	0 if $z<0$; 1 if $z>0$; undefined if $z=0$
Leaky ReLU	$\max(0.01z,z)$	0.01 if $z<0$; 1 if $z>0$; undefined if $z=0$
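
As a quick check of the table above, here is a minimal pure-Python sketch of the four activation functions and their derivatives. The function names are my own, and returning the left-hand value at $z=0$ (where the ReLU derivatives are undefined) is only a common convention that I am assuming here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1.0 - g)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Undefined at z == 0; returning 0 there is an assumed convention.
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, slope=0.01):
    return max(slope * z, z)

def leaky_relu_grad(z, slope=0.01):
    # Undefined at z == 0; returning the slope there is an assumed convention.
    return 1.0 if z > 0 else slope
```

For example, `sigmoid(0)` is 0.5 and `sigmoid_grad(0)` is 0.25, which is the largest value the sigmoid derivative can take.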

Other Methods	Equation
exponential decay	$\alpha=0.95^{EpochNumber}\alpha_0$
epoch number related	$\alpha=\frac{k}{EpochNumber}\alpha_0$
mini-batch number related	$\alpha=\frac{k}{t}\alpha_0$
discrete staircase
manual decay	decrease the learning rate manually, e.g. day by day or hour by hour
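
The three formula-based schedules in the table can be sketched as small Python functions. This is only an illustration: the function and argument names are hypothetical, `k` is the constant from the table, and the staircase and manual schedules are left out because the table gives no formula for them.

```python
def exponential_decay(alpha0, epoch, base=0.95):
    # alpha = base^EpochNumber * alpha0, with base = 0.95 as in the table
    return (base ** epoch) * alpha0

def epoch_related_decay(alpha0, epoch, k):
    # alpha = k / EpochNumber * alpha0 (epoch counts from 1)
    return k / epoch * alpha0

def minibatch_related_decay(alpha0, t, k):
    # alpha = k / t * alpha0, where t is the mini-batch number (from 1)
    return k / t * alpha0
```

With `alpha0 = 0.1`, `exponential_decay` shrinks the learning rate by 5% per epoch, while the other two shrink it in proportion to the inverse of the epoch or mini-batch count.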

Priority	Hyper Parameter
1	learning rate $\alpha$
2	$\beta_1$, $\beta_2$ and $\epsilon$ (parameters of momentum and RMSprop)
2	the number of hidden units
2	mini-batch size
3	the number of layers
3	learning rate decay


Human-Level Error	0.9%	0.9%	0.9%	0.9%
Training Set Error	1%	15%	15%	0.5%
Test Set Error	11%	16%	30%	1%
Comments	overfitting (high variance)	underfitting (high bias)	underfitting (high bias and high variance)	good (low bias and low variance)


Human-Level Error	1%	7.5%
Training Set Error	8%	8%
Test Set Error	10%	10%
Comments	high bias	high variance
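
The comparison in the table above comes down to two subtractions: the gap between training error and human-level error (avoidable bias) and the gap between test error and training error (variance). A minimal sketch follows. The function name is hypothetical, and the "larger gap wins" rule is my own simplification for illustration.

```python
def compare_gaps(human_error, train_error, test_error):
    """All errors are given in percent."""
    avoidable_bias = train_error - human_error
    variance = test_error - train_error
    # Simplifying assumption: focus on whichever gap is larger.
    focus = "high bias" if avoidable_bias > variance else "high variance"
    return avoidable_bias, variance, focus

print(compare_gaps(1.0, 8.0, 10.0))   # avoidable bias 7.0, variance 2.0
print(compare_gaps(7.5, 8.0, 10.0))   # avoidable bias 0.5, variance 2.0
```

With the same training and test errors, only the human-level error changes the conclusion: a 7-point gap to human level points to bias, while a 0.5-point gap leaves the 2-point train/test gap as the bigger problem.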


Human-Level Error	0%	0%	0%	0%
Train Error	1%	1%	10%	10%
Train-Dev Error	9%	1.5%	11%	11%
Dev Error	10%	10%	12%	20%
Problem	high variance	data mismatch	high bias	high bias + data mismatch
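
The table above can be read as three gaps: train minus human-level (bias), train-dev minus train (variance), and dev minus train-dev (data mismatch). The sketch below flags every gap that exceeds a threshold. The function name and the 5-point threshold are assumptions of mine, chosen only so that the four columns are easy to reproduce.

```python
def diagnose(human, train, train_dev, dev, threshold=5.0):
    """All errors are in percent; threshold is a hypothetical cut-off."""
    gaps = {
        "high bias": train - human,
        "high variance": train_dev - train,
        "data mismatch": dev - train_dev,
    }
    # Keep the problems whose gap is larger than the cut-off.
    return [name for name, gap in gaps.items() if gap > threshold]

# The four columns of the table, left to right.
columns = [(0, 1, 9, 10), (0, 1, 1.5, 10), (0, 10, 11, 12), (0, 10, 11, 20)]
for column in columns:
    print(column, diagnose(*column))
```

In practice there is no fixed threshold; what matters is which gap is largest relative to the others.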

Image	Dog	Big Cat	Blurry	Comments
1	$\surd$
2			$\surd$
3		$\surd$	$\surd$
…	…	…	…	…
percentage	8%	43%	61%

Epoch	$\alpha$
1	0.1
2	0.067
3	0.05
4	0.04
5	…
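
One common epoch-based schedule that starts at 0.1 in the first epoch and keeps shrinking is $\alpha=\frac{1}{1+DecayRate\times EpochNumber}\alpha_0$. The sketch below assumes $\alpha_0=0.2$ and a decay rate of 1. Both values, and the function name, are my assumptions and are not stated next to the table.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    # alpha = alpha0 / (1 + decay_rate * epoch), epoch counts from 1
    return alpha0 / (1.0 + decay_rate * epoch)

for epoch in range(1, 5):
    alpha = decayed_learning_rate(0.2, 1.0, epoch)
    print(epoch, f"{alpha:.3f}")   # 0.100, 0.067, 0.050, 0.040
```

A larger decay rate makes the learning rate fall faster, so it becomes one more hyper parameter to tune.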


$\beta$	0.9	0.99	0.999
$1-\beta$	0.1	0.01	0.001
$r$	-1	-2	-3
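
The table shows why $\beta$ is better sampled on a log scale: $1-\beta=10^{r}$, so drawing $r$ uniformly between -3 and -1 spreads the samples evenly across 0.9, 0.99 and 0.999. A minimal sketch, with hypothetical function names and a fixed seed used only to make the run repeatable:

```python
import random

def beta_from_r(r):
    # 1 - beta = 10^r, so beta = 1 - 10^r
    return 1.0 - 10.0 ** r

def sample_beta(rng, r_low=-3.0, r_high=-1.0):
    # Sample the exponent uniformly, not beta itself.
    r = rng.uniform(r_low, r_high)
    return beta_from_r(r)

rng = random.Random(0)
samples = [sample_beta(rng) for _ in range(5)]
print(samples)
```

Sampling $\beta$ uniformly in [0.9, 0.999] instead would put about 90% of the samples between 0.9 and 0.99 and leave the sensitive region near 0.999 almost unexplored.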

Table of Contents

Activation Functions

Gradient Descent

- Computation Graph

- Backpropagation

- Gradients for L2 Regularization (weight decay)

- Vanishing/Exploding Gradients

- Mini-Batch Gradient Descent

- Stochastic Gradient Descent

- Choosing Mini-Batch Size

- Gradient Descent with Momentum (almost always faster than standard gradient descent)

- Gradient Descent with RMSprop

- Adam (put Momentum and RMSprop together)

- Learning Rate Decay Methods

Decay based on the number of epochs

- Batch Normalization

Batch Normalization at Train Time

Batch Normalization at Test Time

Parameters

- Learnable and Hyper Parameters

- Parameters Initialization

Small Initial Values

More Hidden Units, Smaller Weights

Xavier Initialization

- Hyper Parameter Tuning

Uniform sample for hidden units and layers

Sample on log scale

Regularization

- L2 Regularization (weight decay)

- L1 Regularization

- Dropout (inverted dropout)

- Early Stopping

Models

- Logistic Regression

- Multi-Class Classification (Softmax Regression)

Loss Function

- Transfer Learning

- Multi-Task Learning

- Convolutional Neural Network (CNN)

Filter/Kernel

- Stride

- Padding (valid and same convolutions)

- A Convolutional Layer

- 1*1 Convolution

- Pooling Layer (Max and Average Pooling)

- LeNet-5

- AlexNet

- VGG-16

- ResNet (More Advanced and Powerful)

- Inception Network

- Object Detection

- Classification with Localisation

- Landmark Detection

- Sliding Windows Detection Algorithm

- Region Proposal (R-CNN, only run detection on a few windows)

- YOLO Algorithm

- Bounding Box Predictions (Basics of YOLO)

- Intersection Over Union

- Non-max Suppression

- Anchor Boxes

- Face Verification

- One-Shot Learning (Learning a “similarity” function)

- Siamese Network (Learning difference/similar degree)

- Triplet Loss (See three pictures at one time)

- Face Recognition/Verification and Binary Classification

- Neural Style Transfer

- 1D and 3D Convolution Generalisations

Sequence Models

- Recurrent Neural Network Model

- Gated Recurrent Unit (GRU)

- GRU (Simplified)

- GRU (Full)

- Long Short Term Memory (LSTM)

- Bidirectional RNN

- Deep RNN Example

- Word Embedding

- One-Hot

- Embedding Matrix ($E$)