CS231n Convolutional Neural Networks for Visual Recognition - Key Notes



  • The forum's formatting isn't great; these notes are also on my blog:
    CS231n Convolutional Neural Networks for Visual Recognition - Key Notes

    Lecture2

    1. Distance Metric

    1.1 L1 (Manhattan) distance

    d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|

    1.2 L2 (Euclidean) distance

    d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}
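
    A minimal numpy sketch of both metrics, assuming the two images are already flattened into 1-D float arrays (names and values are illustrative):

      import numpy as np

      def l1_distance(I1, I2):
          # Manhattan distance: sum of absolute pixel-wise differences
          return np.sum(np.abs(I1 - I2))

      def l2_distance(I1, I2):
          # Euclidean distance: square root of the summed squared differences
          return np.sqrt(np.sum((I1 - I2) ** 2))

      I1 = np.array([10., 20., 30.])   # toy flattened "images"
      I2 = np.array([12., 18., 33.])
      print(l1_distance(I1, I2))       # 7.0
      print(l2_distance(I1, I2))       # ~4.12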


    Lecture3

    1. SVM Hinge loss

    L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)
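
    A numpy sketch for a single example, where scores is the class-score vector s and y is the index of the correct class (the toy numbers are in the spirit of the lecture's cat/car/frog example):

      import numpy as np

      def svm_hinge_loss(scores, y):
          margins = np.maximum(0, scores - scores[y] + 1)  # margin of 1
          margins[y] = 0                                   # skip the j == y_i term
          return np.sum(margins)

      scores = np.array([3.2, 5.1, -1.7])   # cat, car, frog
      print(svm_hinge_loss(scores, 0))      # 2.9 when the correct class is "cat"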

    2. Regularization

    Prevent overfit.

    L(W) = \frac{1}{N} \sum_{i=1}^N L_i(f(x_i, W), y_i) + \lambda R(W)

    2.1 L1 regularization

    R(W) = \sum_k \sum_l |W_{k,l}|

    2.2 L2 regularization

    R(W) = \sum_k \sum_l W_{k,l}^2

    2.3 Elastic net (L1 + L2)

    R(W) = \sum_k \sum_l \beta W_{k,l}^2 + |W_{k,l}|
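
    As a sketch in numpy (beta weights the L2 term, as in the elastic net formula above):

      import numpy as np

      def l1_reg(W):
          return np.sum(np.abs(W))            # encourages sparse weights

      def l2_reg(W):
          return np.sum(W ** 2)               # prefers small, spread-out weights

      def elastic_net_reg(W, beta=0.5):
          return beta * l2_reg(W) + l1_reg(W)

      # full objective: data loss + lambda * R(W), e.g.
      # loss = data_loss + lam * l2_reg(W)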

    3. Softmax -- score -> probabilities

    \mathrm{softmax}_i(X) = \frac{\exp(X_i)}{\sum_{j=1}^N \exp(X_j)}

    4. Softmax cross-entropy loss

    L_i = -\log(\mathrm{softmax}_{y_i}(X))
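
    A numerically stable sketch; subtracting the max score before exponentiating does not change the result but avoids overflow (names illustrative):

      import numpy as np

      def softmax(scores):
          shifted = np.exp(scores - np.max(scores))   # stability: shift by the max
          return shifted / np.sum(shifted)

      def cross_entropy_loss(scores, y):
          return -np.log(softmax(scores)[y])          # -log P(correct class)

      scores = np.array([3.2, 5.1, -1.7])
      print(softmax(scores))                # probabilities that sum to 1
      print(cross_entropy_loss(scores, 0))  # ~2.04 (natural log)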

    5. Gradient descent



    Lecture4

    • Backpropagation
    • Chain rule

    Lecture5

    1. Fully Connected Layer -- stretch the input to a 1-D vector, then Wx maps it to a new length


    2. Convolution Layer

    3. Pooling Layer

    • Max pool
    • Average pool
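
    A small sketch of 2x2 max pooling with stride 2 on one feature map, just to make the downsampling concrete (a toy helper, not the assignment code):

      import numpy as np

      def max_pool_2x2(x):
          # x: (H, W) feature map with even H and W; output is (H/2, W/2)
          H, W = x.shape
          return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

      x = np.array([[1, 2, 0, 1],
                    [3, 4, 1, 0],
                    [5, 1, 2, 2],
                    [0, 2, 3, 4]], dtype=float)
      print(max_pool_2x2(x))  # [[4. 1.]
                              #  [5. 4.]]

    Average pooling is the same with .mean(axis=(1, 3)) in place of .max.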

    Lecture6

    1. Mini-batch SGD

    Loop:

    1. Sample a batch of data
    2. Forward computation & calculate loss
    3. Backprop to calculate gradients
    4. Optimizer to update the parameters using gradients
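
    A runnable toy version of this loop, training a linear softmax classifier on random data with vanilla SGD (the data, shapes and hyperparameters are made up for illustration):

      import numpy as np

      np.random.seed(0)
      N, D, C = 1000, 20, 10                 # examples, features, classes
      X, y = np.random.randn(N, D), np.random.randint(C, size=N)
      W = 0.01 * np.random.randn(D, C)
      learning_rate, batch_size = 1e-2, 64

      for step in range(200):
          # 1. sample a batch of data
          idx = np.random.choice(N, batch_size)
          xb, yb = X[idx], y[idx]
          # 2. forward: scores and softmax cross-entropy loss
          scores = xb.dot(W)
          scores -= scores.max(axis=1, keepdims=True)
          probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
          loss = -np.log(probs[np.arange(batch_size), yb]).mean()
          # 3. backprop: gradient of the loss w.r.t. W
          dscores = (probs - np.eye(C)[yb]) / batch_size
          dW = xb.T.dot(dscores)
          # 4. update the parameters using the gradient
          W -= learning_rate * dW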

    2. Activation Functions

    • Sigmoid

      \sigma(x) = \frac{1}{1+e^{-x}}

    • tanh

      \tanh(x)

    • ReLU

      \max(0, x)

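    The three activations as numpy one-liners (a quick sketch):

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))  # squashes to (0, 1), saturates for large |x|

      def tanh(x):
          return np.tanh(x)                # zero-centered, squashes to (-1, 1)

      def relu(x):
          return np.maximum(0, x)          # does not saturate for x > 0, cheap to compute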

    3. Vanishing Gradients & Exploding Gradients

    Reference: a detailed explanation of vanishing and exploding gradients in machine learning (linked article)

    4. Weight Initialization

    • Small random numbers -- NO!
    • Xavier initialization
      W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
      
    • ...
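
    A sketch of the lecture's initialization experiment on a deep tanh net: with small random numbers the activation scale collapses toward zero layer after layer, while Xavier scaling keeps it roughly stable (depth and layer sizes here are illustrative):

      import numpy as np

      def activation_stds(init_fn, num_layers=10, size=500):
          # push random data through a stack of tanh layers and track the std
          x = np.random.randn(1000, size)
          stds = []
          for _ in range(num_layers):
              W = init_fn(size, size)
              x = np.tanh(x.dot(W))
              stds.append(x.std())
          return stds

      small_random = lambda n_in, n_out: 0.01 * np.random.randn(n_in, n_out)
      xavier = lambda n_in, n_out: np.random.randn(n_in, n_out) / np.sqrt(n_in)

      print(activation_stds(small_random)[-1])  # ~0: activations (and gradients) vanish
      print(activation_stds(xavier)[-1])        # stays at a healthy scale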

    5. Batch Normalization

    1. Compute the empirical mean and variance independently for each dimension
    2. Normalize

    \hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}

    Usually inserted after Fully Connected or Convolutional layers, and before nonlinearity.
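
    A sketch of the training-time forward pass for the fully connected case, including the learnable scale and shift (gamma, beta) applied after the normalization above; the running statistics used at test time are omitted here:

      import numpy as np

      def batchnorm_forward(x, gamma, beta, eps=1e-5):
          # x: (N, D) batch; statistics are computed per dimension (axis 0)
          mu = x.mean(axis=0)                    # 1. empirical mean
          var = x.var(axis=0)                    # 1. empirical variance
          x_hat = (x - mu) / np.sqrt(var + eps)  # 2. normalize
          return gamma * x_hat + beta            # learnable scale and shift

      x = np.random.randn(64, 100) * 3 + 5
      out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
      print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # ~0 and ~1 per dimension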



    Features

    • Improves gradient flow through the network
    • Allows higher learning rates
    • Reduces the strong dependence on initialization
    • Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe


    Lecture7

    1. Optimizer

    • SGD

    x_{t+1} = x_t - \alpha \nabla f(x_t)

    • SGD + Momentum

    v_{t+1} = \rho v_t + \nabla f(x_t)

    x_{t+1} = x_t - \alpha v_{t+1}

    • Nesterov Momentum

    v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)

    x_{t+1} = x_t + v_{t+1}

    • AdaGrad

    dx = \nabla f(x_t)

    g = g + (dx)^2

    x_{t+1} = x_t - \alpha \frac{dx}{\sqrt{g} + 10^{-7}}

    • RMSProp

    dx = \nabla f(x_t)

    g = \beta g + (1-\beta)(dx)^2

    x_{t+1} = x_t - \alpha \frac{dx}{\sqrt{g} + 10^{-7}}

    • Adam (almost)

      first_moment = 0
      second_moment = 0
      while True:
          dx = compute_gradient(x)
          # momentum-like first moment (moving average of the gradients)
          first_moment = beta1 * first_moment + (1 - beta1) * dx
          # RMSProp-like second moment (moving average of the squared gradients)
          second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
          x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)
      
    • Adam (full form)
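
    The full form adds bias correction so that the first and second moment estimates are not biased toward zero during the first steps; a schematic sketch in the style of the snippet above (compute_gradient, x and the hyperparameters are assumed to be defined):

      first_moment = 0
      second_moment = 0
      for t in range(1, num_iterations + 1):
          dx = compute_gradient(x)
          first_moment = beta1 * first_moment + (1 - beta1) * dx          # momentum-like
          second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSProp-like
          first_unbias = first_moment / (1 - beta1 ** t)                  # bias correction
          second_unbias = second_moment / (1 - beta2 ** t)                # bias correction
          x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)

    Per the lecture, Adam with beta1 = 0.9, beta2 = 0.999 and learning_rate = 1e-3 or 5e-4 is a good starting point for many models.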

    2. Dropout

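    Inverted dropout (the variant used in the lecture): during training each unit is kept with probability p and the surviving activations are scaled by 1/p, so nothing needs to change at test time. A per-layer sketch (p and the array names are illustrative):

      import numpy as np

      p = 0.5  # probability of keeping a unit

      def dropout_train(h):
          mask = (np.random.rand(*h.shape) < p) / p  # drop units and rescale by 1/p
          return h * mask

      def dropout_test(h):
          return h  # use all units; no scaling needed thanks to inverted dropout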


 
