Neural Network Cheat Sheet

By Afshine Amidi and Shervine Amidi Overview. Architecture of a traditional RNN Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. Download the complete set of ultra HD AI cheat sheets. This PDF includes the most comprehensive set of cheat sheets for AI, Neural Networks, Machine Learning, Deep Learning & Data Science.

Neural Networks Journal
Neural Network Architecture Cheat Sheet
Neural Network Cheat Sheet Free
Neural Network Cheat Sheet Pdf

The following article is a cheat sheet on neural networks. My sources are based on the following course and article:

the excellent Machine Learning course on Coursera from Professor Andrew Ng, Stanford,
the very good article from Michael Nielsen, explaining the backpropagation algorithm.

Why the neural networks are powerful ?¶

Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data. Posted by Andrea Manero-Bastin on August 27, 2020 at 11:30pm; View Blog.
Check out this collection of high-quality deep learning cheat sheets, filled with valuable, concise information on a variety of neural network-related topics. Of data science for kids. Or 50% off hardcopy. By Matthew Mayo, KDnuggets. Shervine Amidi, graduate student at Stanford, and Afshine Amidi, of MIT and Uber - creators of a recent set of.

It is proven mathematically that:

Suppose we’re given a [continuous] function f(x) which we’d like to compute to within some desired accuracy ϵ>0. The guarantee is that by using enough hidden neurons we can always find a neural network whose output g(x) satisfies:|g(x)−f(x)|<ϵ, for all inputs x.

Michael Nielsen — From the following article

Conventions¶

Let’s define a neural network with the following convention:

L = total number of layers in the network.
$s_l$ = number of units (not counting bias unit) in layer l.
K = number of units in output layer ( = $s_L$ ).

With:

$$ forall l in {2, .., L-1}, a^{(l)} in mathbb{R}^{s_{l}+1} and begin{cases}a^{(l)}_0 = 1 (bias node)a^{(l)}_i = g(z^{(l)}_i), forall i in {1, .., s_l} end{cases}z^{(l)}_i = [ theta^{(l)}_i ]^T . a^{(l-1)}, forall i in {1, .., s_l} with theta^{(l)}_i in mathbb{R}^{s_{l-1}+1} and z^{(l)} in mathbb{R}^{s_{l}} $$

We define the matrix θ of the weights for the layer l as following:

$$theta^{(l)} in mathbb{R}^{s_l times (s_{(l-1)}+1)}$$$$theta^{(l)} = begin{bmatrix} [ theta^{(l)}_1 ]^T [ theta^{(l)}_2 ]^T vdots [ theta^{(l)}_{s_{l}} ]^Tend{bmatrix} =begin{bmatrix} theta_{1,0} & dots & theta_{1,j} & dots & theta_{1,s_{l-1}} vdots & & vdots & & vdots theta_{i,0} & dots & theta_{i,j} & dots & theta_{i,s_{l-1}} vdots & & vdots & & vdots theta_{s_l,0} & dots & theta_{s_l,j} & dots & theta_{s_l,s_{l-1}} end{bmatrix}$$

Hence, we have the following relation: $$a^{(l)} =left(begin{smallmatrix}1 g(theta^{(l)}.a^{(l-1)})end{smallmatrix}right) tag{0}$$

The cost function of a Neural Network¶

The training set is defined by: $ { (x^1,y^1), .., (x^m,y^m) } $

x and y are vectors, with respectively the same dimensions as the input and output layers of the neural network.

The cost function of a neural network is the following:

$$J(theta) = - frac{1}{m} sum_{i=1}^m sum_{k=1}^K left[ cost( a^{(L)}_k, y^{(i)}_k) right] + frac{lambda}{2m}sum_{l=2}^{L} sum_{j=1}^{s_l} sum_{i=1}^{s_{l+1}} ( theta_{i,j}^{(l)})^2$$

$a^{(L)}_k$ is the output of the neural network, and is dependent of the weights ? of the neural network.
Please note that the regularization term does not include the weights of the bias nodes.

Now, the objective is to train the neural network and find the minimum of the cost function J(?).

Mathematic reminder: the chain rule¶

Let’s define the functions f, g and h as following:

$$ f:mathbb{R}^n rightarrow mathbb{R} $$$$ g:mathbb{R}^p rightarrow mathbb{R}^n $$$$ h = f circ g $$

The derivative of h is given by the chain rule theorem:

$$forall_i in { !1, .. , !p }, frac{partial h}{partial x_i} = sum_{k=1}^{n} frac{partial f}{partial g_k} frac{partial g_k}{partial x_i}$$

(See the following course online on partial derivation from the MIT)

The backpropagation algorithm¶

We use the gradient descent to find the minimum of J on ?: $ minlimits_{theta} J(theta)$

The gradient descent requires to compute:

$$ frac{partial J(theta)}{partial theta^{(l)}_{i,j}} $$

In the following parts, we consider only the first part of J(θ) (as if the regularisation term λ=0). The partial derivative of the second term of J(θ) is easy to compute.

The following course from Andrej Karpathy gives an outstanding explaination how the partial derivatives works at the level of a node in the backpropagation algorithm.

Definition of ẟ¶

Let’s define the function ẟ. When ẟ of the layer l is multiplied by the output of the layer (l-1), we obtain the partial derivative of the cost function on θ.

Let’s use the chain rule and develop this derivative on z:

$$ frac{partial J(theta)}{partial theta^{(l)}_{i,j}} =sum^{s_l}_{k = 1} frac{partial J(theta)}{partial z^{(l)}_k} frac{partial z^{(l)}_k}{partial theta^{(l)}_{i,j}}$$

(Remind that J is dependent of z)

As: $$z^{(l)}_k = [ theta^{(l)}_k ]^T . a^{(l-1)} = sum_{p=0}^{s_{l-1}} theta^{(l)}_{k,p} times a^{(l-1)}_p$$

$$frac{partial z^{(l)}_k}{partial theta^{(l)}_{i,j}} = 0 for k ≠ i and p ≠ j in the sum.$$$$And frac{partial z^{(l)}_k}{partial theta^{(l)}_{i,j}} = a^{(l-1)}_j for k = i and p = j in the sum.$$

We define the output error ?: $$ delta^{(l)}_k = frac{partial J(theta)}{partial z^{(l)}_k}, forall k in {1, .., s_l},and delta^{(l)} = nabla_{z^{(l)}} J(theta) in mathbb{R}^{s_{l}}$$

So we have:

$$ frac{partial J(theta)}{partial theta^{(l)}_{i,j}} =delta^{(l)}_i . a^{(l-1)}_j$$

More specifically, for the derivatives of the bias node's weights, we have ($a^{(l)}_0 = 1, forall l$):

$$ frac{partial J(theta)}{partial theta^{(l)}_{i,0}} =delta^{(l)}_i$$

Value of ẟ for the layer L¶

Now let’s find ? for the output layer (layer L):

$$delta^L_i = frac{partial J(theta)}{partial z^{(L)}_i} = sum^{s_{L}}_{k = 1} frac{partial J(theta)}{partial a^{(L)}_k} frac{partial a^{(L)}_k}{partial z^{(L)}_i}$$

As:

$$a^{(l)}_k = g(z^{(l)}_k), frac{partial a^{(L)}_k}{partial z^{(L)}_i} = 0 if k ≠ i$$

Hence:

Neural Networks Journal

$$delta^L_i = frac{partial J(theta)}{partial a^{(L)}_i} frac{partial g(z^{(L)}_i)}{partial z^{(L)}_i} =frac{partial J(theta)}{partial a^{(L)}_i} . g'(z^{(L)}_i)$$

Relationship of ẟ between the layer l and the layer (l-1)¶

Now, we try to find a relation between ? of the layer l and ? of the next layer (l+1):

$$delta^{(l)}_i = frac{partial J(theta)}{partial z^{(l)}_i} = sum^{s_{l+1}}_{k = 1} frac{partial J(theta)}{partial z^{(l+1)}_k} frac{partial z^{(l+1)}_k}{partial z^{(l)}_i}$$

With:

$$z^{(l+1)}_k = [ theta^{(l+1)}_k ]^T . g(z^{(l)}_p) = sum_{p=0}^{s_{l}} theta^{(l+1)}_{k,p} times g(z^{(l)}_p)$$

And:

$$frac{partial z^{(l+1)}_k}{partial z^{(l)}_i} = theta^{(l+1)}_{k,i} . g'(z^{(l)}_i) for p = i, else 0.$$

Hence:

$$delta^{(l)}_i =sum^{s_{l+1}}_{k = 1} delta^{(l+1)}_k . theta^{(l+1)}_{k,i} . g'(z^{(l)}_i) $$

The meaning of this equation is the following:

The backpropagation algorithm explained¶

We have the following:

we have found a function ẟ for the layer l such that when we multiply this function by the output of the layer (l-1), we obtain the partial derivative of the cost function J on the weights θ of the layer l,
the function ẟ for the layer l has a relation with ẟ of the layer (l+1),
as we have a training set, we can compute the values of ẟ for the layer L.

So, we start to compute the values of ẟ for the layer L. As we have a relation between ẟ for the layer l and ẟ for the layer (l+1), we can compute the values for the layers (L-1), (L-2), …, 2.

We can then compute the partial derivative of the cost function J on the weights θ of the layer l, by multiplying ẟ for the layer l by the output of the layer (l-1).

The vectorized backpropagation’s equations¶

The first equation gives the partial derivatives of J with respect to θ. We have added the regularization term.

$$nabla_{theta^{(l)}} J(theta) = begin{bmatrix} frac{partial J}{partial theta^{(l)}_{1,0}} & dots & frac{partial J}{partial theta^{(l)}_{1,j}} & dots & frac{partial J}{partial theta^{(l)}_{1,s_{l-1}}} vdots & & vdots & & vdots frac{partial J}{partial theta^{(l)}_{i,0}} & dots & frac{partial J}{partial theta^{(l)}_{i,j}} & dots & frac{partial J}{partial theta^{(l)}_{i,s_{l-1}}} vdots & & vdots & & vdots frac{partial J}{partial theta^{(l)}_{s_l,0}} & dots & frac{partial J}{partial theta^{(l)}_{s_l,j}} & dots & frac{partial J}{partial theta^{(l)}_{s_l,s_{l-1}}} end{bmatrix}= delta^{(l)} otimes [a^{(l-1)}]^T + frac{lambda}{m} theta^{(l)}_{zero bias}tag{1}$$

With $theta^{(l)}_{zero bias}$ is $theta^{(l)}$ with $theta^{(l)}_{i,0} = 0, forall i in {1, .., s_l}$ (the regularization term does not include the bias node's weights).

The second equation gives the relation between ẟ in layer l and ẟ in layer (l+1):

$$delta^{(l)} = [(theta^{(l+1)}_{remove bias})^T . delta^{(l+1)}] odot g'(z^l) tag{2}$$

With $theta^{(l)}_{remove bias}$ is $theta^{(l)}$ with the column of the bias nodes' weights removed.

The third equation gives the value of ẟ for the layer L:

$$delta^{(L)} = nabla_{a^{(L)}} J(theta) odot g'(z^L) tag{3}$$

Conclusion¶

This cheat sheet explains the backpropagation algorithm used to train a neural network. I have created this article after following the great Machine Learning’s course of Professor Andrew Ng on Coursera. The conventions used in this article are not exactly the ones used in the course of Professor Ng, nor exactly the ones used in the article of Michael Nielsen.

If you notice any error, please do not hesitate to contact me.

Imports¶

General¶

Neural Network API¶

See autograd,nn,functionaland optim

Torchscript and JIT¶

See Torchscript

ONNX¶

See onnx

Vision¶

Seetorchvision

Distributed Training¶

See distributedandmultiprocessing

Tensors¶

Creation¶

See tensor

Dimensionality¶

See tensor

Algebra¶

See mathoperations

GPU Usage¶

See cuda

Neural Network Architecture Cheat Sheet

Deep Learning¶

See nn

Loss Functions¶

See lossfunctions

Activation Functions¶

See activationfunctions

Optimizers¶

See optimizers

Learning rate scheduling¶

See learning ratescheduler

Data Utilities¶

Datasets¶

Seedatasets

Dataloaders and DataSamplers¶

Seedataloader

Neural Network Cheat Sheet Free

Also see¶

Neural Network Cheat Sheet Pdf

Deep Learning with PyTorch: A 60 MinuteBlitz(pytorch.org)
PyTorch Forums(discuss.pytorch.org)
PyTorch for Numpyusers(github.com/wkentaro/pytorch-for-numpy-users)