# Machine Learning

## Basic knowledge

### Bias and variance

1. Bias

Represents fitting ability; a model that is too simple leads to high bias because of underfitting.

2. Variance

Represents stability; a model that is too complex leads to high variance because of overfitting.

$$\text{Generalization Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

### Generative Model and Discriminative Model

1. Discriminative Model

Learns a function or the conditional probability $P(Y|X)$ (posterior probability) directly.

2. Generative Model

Learns the joint probability $P(X, Y)$, then uses it to calculate $P(Y|X)$.

### Hyper-parameter search

1. Grid search
2. Random search
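A minimal scikit-learn sketch of both strategies; the estimator, parameter grid, and dataset below are illustrative assumptions, not part of the original note.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Grid search: exhaustively tries every combination in param_grid.
grid = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

# Random search: samples a fixed number of combinations at random.
rand = RandomizedSearchCV(SVC(), param_grid, n_iter=5, cv=5, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)
```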

### Euclidean distance and Cosine distance

Example: A=[2, 2, 2] and B=[5, 5, 5] are two users' review scores for the same three movies.
The Euclidean distance is $\sqrt{3^2 + 3^2 + 3^2}$, while the cosine similarity is $1$ (cosine distance $0$). As a result, cosine distance ignores differences in magnitude and only measures differences in direction.

After normalizing the vectors to unit length, the two are essentially equivalent:
$$D = \|x-y\|^2 = \|x\|^2 + \|y\|^2 - 2\|x\|\|y\|\cos A = 2 - 2\cos A = 2(1-\cos A)$$
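A quick NumPy check of the movie-score example above (purely illustrative):

```python
import numpy as np

A = np.array([2.0, 2.0, 2.0])
B = np.array([5.0, 5.0, 5.0])

euclidean = np.linalg.norm(A - B)                           # sqrt(3^2 + 3^2 + 3^2) ~ 5.196
cos_sim = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))   # 1.0: same direction
print(euclidean, 1 - cos_sim)                               # cosine distance is 0
```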

### Confusion Matrix

1. accuracy: $ACC = \frac{TP+TN}{TP+TN+FP+FN}$
2. precision: $P = \frac{TP}{TP+FP}$
3. recall: $R = \frac{TP}{TP+FN}$
4. F1: $F_1 = \frac{2TP}{2TP+FP+FN}$
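A minimal sketch computing these metrics from raw confusion-matrix counts; the counts are made-up numbers for illustration.

```python
TP, TN, FP, FN = 40, 50, 5, 5

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * TP / (2 * TP + FP + FN)   # equivalently 2*P*R / (P+R)

print(accuracy, precision, recall, f1)
```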

### Deal with missing values

1. Many missing values: drop the feature column.
2. Few missing values: fill in a value, as in the sketch below.
   1. Fill with a constant/outlier value: data.fillna(0)
   2. Fill with the mean: data.fillna(data.mean())
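A short pandas sketch of both filling strategies; the DataFrame here is an illustrative assumption.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"age": [22, np.nan, 35], "income": [3000, 4000, np.nan]})

filled_zero = data.fillna(0)            # fill every missing entry with a constant
filled_mean = data.fillna(data.mean())  # fill each column with its own mean

print(filled_mean)
```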

### Describe your project

1. Abstract the real-world problem into a mathematical problem
2. Describe your data
3. Preprocessing and feature selection
4. Model training and tuning

## Algorithm

### Logistic regression

#### Definition

Model the posterior probability with a sigmoid over a linear function: $P(y=1|x) = \sigma(w^\top x + b)$.

#### Loss: negative log loss

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$$

### Decision Tree

1. ID3: use information gain
2. C4.5: use information gain ratio
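A minimal sketch of entropy and information gain, the quantities ID3 splits on; the toy labels and split below are illustrative assumptions.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y       = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # labels before the split
y_left  = np.array([1, 1, 1, 0])                # labels in the left branch
y_right = np.array([0, 0, 1, 0])                # labels in the right branch

gain = entropy(y) - (len(y_left) / len(y)) * entropy(y_left) \
                  - (len(y_right) / len(y)) * entropy(y_right)
print(gain)   # information gain of this split
```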

### Ensemble Learning

#### Boosting: AdaBoost, GBDT

Serial strategy: each new learner is built on top of the previous one.

#### Bagging: Random Forest and Dropout in neural networks

Parallel strategy: there is no dependency between the individual learners.
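A brief scikit-learn sketch contrasting the two strategies; the dataset and hyper-parameters are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging  = RandomForestClassifier(n_estimators=100, random_state=0)      # parallel, independent trees
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)  # serial, each tree fits the residual

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```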

# Deep Learning

## Basic Knowledge

### Overfitting and underfitting

#### Deal with overfitting

1. Data augmentation
   1. Images: translation, rotation, scaling
   2. GAN: generate new data
   3. NLP: generate new data via neural machine translation (back-translation)
2. Decrease model complexity
   1. Neural network: reduce the number of layers and neurons
   2. Decision tree: limit tree depth, pruning
3. Constrain weights
   1. L1 regularization
   2. L2 regularization
4. Ensemble learning
   1. Neural network: Dropout
   2. Decision tree: random forest, GBDT
5. Early stopping (see the sketch below)
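A hedged Keras sketch combining three of these remedies (L2 weight penalty, Dropout, early stopping); the architecture and the toy data are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf

# Toy data, purely for illustration.
x_train = np.random.rand(200, 20).astype("float32")
y_train = (x_train.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 penalty on weights
    tf.keras.layers.Dropout(0.5),                                              # ensemble-like regularization
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                              restore_best_weights=True)
model.fit(x_train, y_train, validation_split=0.2, epochs=50,
          callbacks=[early_stop], verbose=0)
```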

#### Deal with underfitting

1. Add new features
2. Increase model complexity
3. Decrease regularization

### Back-propagation

TODO: https://github.com/imhuay/Algorithm_Interview_Notes-Chinese/blob/master/A-%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0/A-%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E5%9F%BA%E7%A1%80.md

TODO: MSE as loss function (another expression).

### Activation function: improves the expressive power of the network

#### sigmoid(z)

$$\sigma(z)=\frac{1}{1+\exp(-z)}$$

where the output range is $(0, 1)$.

The derivative of the sigmoid is:

$$\sigma'(z)=\sigma(z)\bigl(1-\sigma(z)\bigr)$$
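For reference, this follows directly from the chain rule applied to $\sigma(z)=\frac{1}{1+\exp(-z)}$:

$$\sigma'(z) = \frac{\exp(-z)}{(1+\exp(-z))^2} = \sigma(z)\cdot\frac{\exp(-z)}{1+\exp(-z)} = \sigma(z)\bigl(1-\sigma(z)\bigr)$$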

### Batch Normalization

Goal: keep the inputs of each layer in a similar distribution by normalizing the data before each layer.
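For reference, the standard batch-norm transform over a mini-batch of size $m$, with learnable scale $\gamma$ and shift $\beta$ (the notation is the usual one, not from this note):

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i,\qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2$$

$$\hat{x}_i = \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}},\qquad y_i = \gamma\hat{x}_i+\beta$$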

### Optimizers

#### SGD

Stochastic Gradient Descent: update the weights on each mini-batch.

#### Momentum

Accumulate previous gradients with a decay factor into the current update.
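A common form of the momentum update, where $\gamma$ is the decay coefficient and $\eta$ the learning rate (notation assumed, not from this note):

$$v_t = \gamma v_{t-1} + \eta\,\nabla_\theta J(\theta),\qquad \theta \leftarrow \theta - v_t$$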

#### Adagrad

Dynamically adjusts the learning rate during training.

The effective learning rate of each parameter is inversely proportional to the square root of the sum of its squared historical gradients.

#### Adam

Dynamically adjusts the learning rate during training.

Uses first-order and second-order moment estimates of the gradient to keep the updates stable.
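For reference, the standard Adam update with bias-corrected moment estimates (notation follows the original Adam paper, not this note):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t,\qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t},\qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t},\qquad \theta_t = \theta_{t-1} - \frac{\eta\,\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$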

#### How to deal with L1 not being differentiable

Update the parameters along the coordinate-axis directions (coordinate descent).

### How to initialize the neural network

Initialize the network with a Gaussian distribution or a uniform distribution.

Glorot (Xavier) Initializer:
$$W_{i,j} \sim U\left(-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right)$$
where $m$ and $n$ are the fan-in and fan-out of the layer.
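A tiny NumPy sketch of Glorot/Xavier uniform initialization for one weight matrix; the layer sizes are illustrative assumptions.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, seed=0):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(256, 128)   # weights for a 256 -> 128 dense layer
print(W.min(), W.max())        # stays inside [-limit, limit]
```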

# Computer Vision

## Models and History

- 2015 VGGNet (16/19): Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015.
- 2015 GoogLeNet (Inception-v1): Going Deeper with Convolutions, CVPR 2015.
- 2016 Inception-v2/v3: Rethinking the Inception Architecture for Computer Vision, CVPR 2016.
- 2016 ResNet: Deep Residual Learning for Image Recognition, CVPR 2016.
- 2017 Xception: Xception: Deep Learning with Depthwise Separable Convolutions, CVPR 2017.
- 2017 Inception-ResNet-v1/v2, Inception-v4: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, AAAI 2017.
- 2017 MobileNet: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv 2017.
- 2017 DenseNet: Densely Connected Convolutional Networks, CVPR 2017.
- 2017 NASNet: Learning Transferable Architectures for Scalable Image Recognition, arXiv 2017.
- 2018 MobileNetV2: MobileNetV2: Inverted Residuals and Linear Bottlenecks, CVPR 2018.

# Practice experience

## Loss function declines to 0.0000

Usually caused by numerical overflow/underflow in TensorFlow or another framework. It is better to initialize the parameters within a reasonable interval; Xavier initialization and Kaiming initialization are the common solutions.

## Do not regularize the bias in a neural network

Penalizing the bias term drives $b$ toward zero (a sparse $b$), which leads to underfitting.

## Do not set the learning rate too large

When using the Adam optimizer, try values between $10^{-3}$ and $10^{-4}$.

## Do not forget to shuffle the training data

Otherwise the model may overfit to the ordering of the training data.

## Do not use the same label within a batch

Otherwise the model may overfit to that label.

## Do not use the vanilla SGD optimizer

Vanilla SGD can get stuck at saddle points; momentum or adaptive optimizers escape them more easily.

## Check the gradient in each layer

Because of potential gradient explosion, use gradient clipping to cap the gradient.
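A hedged PyTorch sketch of gradient clipping inside one training step; the model, data, and max_norm value are illustrative assumptions.

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale gradients so their global norm does not exceed 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```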

## Problem of classification confidence

Symptom: the loss is increasing while the accuracy is also still increasing.

The cause is falling confidence in the correct class: e.g. predicted probabilities [0.9, 0.01, 0.02, 0.07] at epoch 5 vs. [0.5, 0.4, 0.05, 0.05] at epoch 20; the prediction is still correct, but the loss is larger.

Overall, this phenomenon is a form of overfitting.

## Do not use batch normalization layer with small batch size

A small batch cannot represent the statistics (mean and variance) of the whole dataset.

## Improper use of dropout in Conv layers may lead to worse performance

It is better to use dropout with a low probability, such as 0.1 or 0.2.

This acts like adding a little noise to the Conv layer for regularization.

## Evaluation accuracy better than training accuracy

Because the distributions of the training set and the test set differ substantially.

Try methods from transfer learning.

## KL divergence goes negative

Make sure the probabilities are computed with a proper softmax (non-negative and normalized to sum to 1); otherwise the computed KL divergence can go negative.
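A small NumPy sketch computing KL(p || q) from logits via a numerically stable softmax; the logits are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) = sum_i p_i * (log p_i - log q_i), always >= 0 for valid distributions
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

p = softmax(np.array([2.0, 1.0, 0.1]))
q = softmax(np.array([0.5, 1.5, 0.2]))
print(kl_divergence(p, q))             # non-negative
```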
