Neural Networks (Geoffrey Hinton Course)
Some Simple Models of Neurons
$y$ output, $x_i$ input.
Linear Neurons
$y = b + \sum_{i} x_i w_i$
$w_i$ weights, $b$ bias
Binary Threshold Neurons
$z = \sum_{i} x_i w_i$
$y = 1$ if $z \geq \theta$, $0$ otherwise.
Or, equivalently,
$z = b + \sum_{i} x_i w_i$
$y = 1$ if $z \geq 0$, $0$ otherwise.
Rectified Linear Neurons
$z = b + \sum_{i} x_i w_i$
$y = z$ if $z > 0$, $0$ otherwise. (linear above zero, decision at zero.)
Sigmoid Neurons
Give a real-valued output that is a smooth and bounded function of their total input.
$z = b + \sum_{i} x_i w_i$
$y = \frac{1}{1 + e^{-z}}$
Stochastic Binary Neurons
Same equations as logistic units, but outputs $1$ (=spike) or $0$ randomly based on the probability. They treat the output of the logistic as the probability of producing a spike in a short time window.
$z = b + \sum_{i} x_i w_i$
$P(s = 1) = \frac{1}{1 + e^{-z}}$
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.
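A minimal NumPy sketch of these neuron models (the example input, weights, and function names are illustrative, not from the course):

```python
import numpy as np

def linear(x, w, b):
    return b + np.dot(w, x)                       # y = b + sum_i x_i w_i

def binary_threshold(x, w, b):
    return 1.0 if b + np.dot(w, x) >= 0 else 0.0  # y = 1 if z >= 0, else 0

def rectified_linear(x, w, b):
    z = b + np.dot(w, x)
    return max(z, 0.0)                            # linear above zero

def sigmoid(x, w, b):
    z = b + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))               # smooth, bounded in (0, 1)

def stochastic_binary(x, w, b, rng=np.random.default_rng()):
    p = sigmoid(x, w, b)                          # logistic output as spike probability
    return 1.0 if rng.random() < p else 0.0

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(linear(x, w, 0.2), sigmoid(x, w, 0.2))
```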
Types of Learning
Supervised Learning
Learn to predict an output when given an input vector.
- Regression: The target output is a real number or a whole vector of real numbers.
- Classification: The target output is a class label.
How Supervised Learning Typically Works
- Start by choosing a model-class: $y = f(x;W)$
- A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.
- Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.
- For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.
- For classification there are other measures that are generally more sensible (they also work better).
Reinforcement Learning
Learn to select an action to maximize payoff.
- The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.
- The goal in selecting each action is to maximize the expected sum of the future rewards.
- We usually use a discount factor for delayed rewards so that we don't have to look too far into the future.
- Reinforcement learning is difficult because:
- The rewards are typically delayed so it's hard to know where we went wrong (or right).
- A scalar reward does not supply much information.
- You typically can't learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning); typically you learn dozens of parameters, or perhaps a thousand.
- Will not be covered in this course.
Unsupervised Learning
Discover a good internal representation of the input.
- For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).
- It is hard to say what the aim of unsupervised learning is:
- One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.
- You can compute the distance to a surface by using the disparity between two images. But you don't want to learn to compute disparities by stubbing your toe thousands of times.
- Other goals:
- Providing a compact, low-dimensional representation of the input.
- High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds)
- Principal Component Analysis is a widely used linear method for finding a low-dimensional representation.
- Providing an economical high-dimensional representation of the input in terms of learned features.
- Binary features
- Real-valued features that are nearly all zero
- Finding sensible clusters in the input
- This is an example of a very sparse code in which only one of the features is non-zero.
Neural Network Architectures
- Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.
- Recurrent neural network: information can flow around in cycles.
- Symmetrically connected network: weights are the same in both directions between two units.
Feed-forward Neural Networks
- The commonest type of neural network
- The first layer is the input and the last layer is the output
- Called "deep" neural networks if there is more than one hidden layer.
- They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.
Recurrent Networks
- Have directed cycles in their connection graph.
- Have complicated dynamics — difficult to train.
- More biologically realistic.
- Very natural way to model sequential data
- Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).
Symmetrically Connected Networks
- Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).
- Much easier to analyze than recurrent networks (John Hopfield, et al)
- More restricted in what they can do, because they obey an energy function. For example, they cannot model cycles.
The First Generation of Neural Networks
Standard Paradigm for Statistical Pattern Recognition
- Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).
- Learn how to weight each of the feature activations to get a single scalar quantity.
- If this quantity is above some threshold, decide that the input vector is a positive example of the target class.
Perceptrons
input units => (by hand-coded programs) => feature units => decision unit
- Popularized by Frank Rosenblatt in the early 1960s.
- Lots of grand claims were made for what they could learn to do.
- In 1969, Minsky and Papert showed Perceptrons' limitations (Group Invariance Theorem). Many people thought these limitations applied to all neural network models.
- Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.
- Decision units in perceptrons are binary threshold neurons.
Perceptron Learning Procedure
- Add an extra component with value 1 to each input vector; its weight acts as the bias, so we can forget about the threshold (bias = -threshold).
- Pick training cases using any policy that ensures that every training case will keep getting picked.
- If the output unit is correct, leave its weights alone.
- If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
- If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.
- This is guaranteed to find a set of weights that gets the right answer for all the training cases if any such set exists.
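A hedged sketch of this procedure, with the bias folded in as an extra input component of 1 (the AND data and the epoch count are made up for illustration):

```python
import numpy as np

def train_perceptron(X, t, epochs=100):
    """X: (n_cases, n_features) inputs, t: 0/1 targets."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])    # extra component = 1; its weight is the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                          # every case keeps getting picked
        for x, target in zip(X, t):
            y = 1.0 if np.dot(w, x) >= 0 else 0.0
            if y < target:                           # incorrectly output 0: add the input vector
                w += x
            elif y > target:                         # incorrectly output 1: subtract the input vector
                w -= x
    return w

# AND is linearly separable, so a correct weight vector exists and is found
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(train_perceptron(X, np.array([0, 0, 0, 1])))
```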
Geometrical View of Perceptrons
- Weight space (each point corresponds to a particular setting for all the weights)
- Each training case defines a hyperplane through the origin (if we eliminate the threshold/bias). That hyperplane is perpendicular to the input vector.
- The weights must lie on one side of this hyperplane to get the answer correct.
- To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).
- If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. This means the problem is convex.
Limitations of Perceptrons
- If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.
- But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.
- A binary threshold output unit cannot even tell if two single bit features are the same!
- Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$
- Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$
- The four input-output pairs give four inequalities that are impossible to satisfy: $w_1 + w_2 \geq \theta$, $0 \geq \theta$, $w_1 < \theta$, $w_2 < \theta$. Summing the last two gives $w_1 + w_2 < 2\theta$, which with the first forces $\theta > 0$, contradicting $0 \geq \theta$.
- Imagine 'data-space' in which the axes correspond to components of an input vector.
- Each input vector is a point in this space, a weight vector defines a plane in data-space.
- The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.
- There are many cases where we cannot separate positive and negative cases only by a hyperplane. Those cases are called not linearly separable.
- Cannot discriminate simple patterns under translation with wrap-around.
Conclusion
- Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help, it's still linear. Fixed output non-linearities are not enough.
- We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?
- We need an efficient way of adapting all the weights, not just the last layer. This is hard.
- Learning the weights going into hidden units is equivalent to learning features.
- This is difficult because nobody is telling us directly what the hidden units should do.
Linear Neurons
Learning Overview
$y = \sum_{i} w_i x_i = \mathbb{w}^T \mathbb{x}$
$y$: Neuron's estimate of the desired output, $\mathbb{w}$: weight vector, $\mathbb{x}$: input vector
- Perceptron convergence procedure doesn't work for more complex networks (because the problem is not convex). Multi-layer neural networks do not use the perceptron learning procedure.
- Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.
- The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.
- Why don't we solve it analytically?
- Scientific answer: We want a method that real neurons could use.
- Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.
Learning Procedure
- The delta-rule for learning:
- $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value
- $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error
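A minimal online delta-rule loop for a linear neuron (the toy dataset and hyperparameters are made up):

```python
import numpy as np

def delta_rule(X, t, epsilon=0.01, epochs=200):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            y = np.dot(w, x)                   # linear neuron output
            w += epsilon * x * (target - y)    # delta rule: eps * x_i * (t - y)
    return w

# toy problem: targets generated by known weights [2.0, -1.0]
X = np.random.randn(100, 2)
t = X @ np.array([2.0, -1.0])
print(delta_rule(X, t))   # should approach [2, -1]
```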
The Error Surface for Linear Neurons
- The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
- For a linear neuron with a squared error, it is a quadratic bowl.
- Vertical cross-sections are parabolas.
- Horizontal cross-sections are ellipses
- For multi-layer, non-linear nets the error surface is much more complicated.
- The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.
- The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.
Logistic Neurons
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$
- These give a real-valued output that is a smooth and bounded function of their total input.
- The derivatives:
- $\frac{\partial z}{\partial w_i} = x_i$
- $\frac{\partial z}{\partial x_i} = w_i$
- $\frac{dy}{dz} = y(1-y)$
- By the chain rule, $\frac{\partial y}{\partial w_i} = x_i y (1-y)$.
- By the chain rule, $\frac{\partial E}{\partial w_i} = \sum_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum_{n} x_i^n y^n (1-y^n) (t^n - y^n)$.
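A small sketch that computes these derivatives for a batch of cases (the toy inputs and targets are illustrative):

```python
import numpy as np

def logistic_grad(X, t, w, b):
    """Gradient of E = 0.5 * sum_n (t^n - y^n)^2 for a single logistic neuron."""
    z = X @ w + b
    y = 1.0 / (1.0 + np.exp(-z))
    # dE/dw_i = - sum_n x_i^n * y^n (1 - y^n) * (t^n - y^n)
    dE_dz = -(t - y) * y * (1 - y)
    return X.T @ dE_dz, dE_dz.sum()   # gradients w.r.t. the weights and the bias

X = np.array([[0.0, 1.0], [1.0, 1.0]])
t = np.array([0.0, 1.0])
print(logistic_grad(X, t, np.zeros(2), 0.0))
```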
The Backpropagation Algorithm
Other Ideas
Learning by Perturbing Weights
- The idea occurs to everyone who knows about evolution.
- Randomly perturb one weight and see if it improves performance. If so, save the change.
- This is a form of reinforcement learning.
- Very inefficient. We need to do multiple forward passes on a representative set of training cases just to change one weight.
- Towards the end of learning, large weight perturbations will nearly always make things worse.
Learning by Using Perturbations
- We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.
- Not any better, because we need lots of trials on each training case to see the effect of changing one weight through the noise created by all the changes to the other weights.
- A better idea: Randomly perturb the activities of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights.
The Idea behind Backpropagation
- We don't know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.
- We can compute error derivatives for all the hidden units efficiently at the same time.
- Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit.
Sketch of the Backpropagation Algorithm
- First convert the discrepancy between each output and its target value into an error derivative.
- Then compute error derivatives in each hidden layer from error derivatives in the layer above.
- Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.
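A compact sketch of this recipe for one hidden layer of logistic units feeding a linear output with squared error (layer sizes, data, and learning rate are illustrative, not a specific net from the course):

```python
import numpy as np

def backprop_step(x, t, W1, b1, W2, b2, eps=0.1):
    # forward pass
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))        # hidden activities (logistic)
    y = W2 @ h + b2                                  # linear output
    # convert the output discrepancy into an error derivative
    dE_dy = y - t
    # error derivatives for the hidden activities, from the layer above
    dE_dh = W2.T @ dE_dy
    dE_dz1 = dE_dh * h * (1 - h)                     # back through the logistic non-linearity
    # error derivatives w.r.t. the incoming weights, then a gradient step
    W2 -= eps * np.outer(dE_dy, h); b2 -= eps * dE_dy
    W1 -= eps * np.outer(dE_dz1, x); b1 -= eps * dE_dz1
    return 0.5 * np.sum((y - t) ** 2)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (3, 2)), np.zeros(3)
W2, b2 = rng.normal(0, 0.1, (1, 3)), np.zeros(1)
for _ in range(1000):
    backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, b1, W2, b2)
```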
Using the Derivatives Computed by Backpropagation
- The backpropagation algorithm is an efficient way of computing the error derivative $\partial E / \partial w$ for every weight on a single training case.
- To get a fully specified learning procedure, we still need to make a lot of other decisions about how to use these error derivatives:
- Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?
- Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?
Optimization Issues
- How often to update the weights:
- Online: after each training case.
- Full batch: after a full sweep through the training data.
- Mini-batch: after a small sample of training cases. This is typically used when training big nets on big data sets.
- How much to update: (will be discussed further later)
- Use a fixed learning rate?
- Adapt the global learning rate?
- Adapt the learning rate on each connection separately?
- Don't use steepest descent?
Generalization Issue (Overfitting)
- The downside of using powerful models.
- The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise:
- The target values may be unreliable (usually only a minor worry).
- There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.
- When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well. This is a disaster.
Ways to Reduce Overfitting
- A large number of different methods have been developed:
- Weight-decay: keep weights small / zero.
- Weight-sharing: insist that many weights have the same value.
- Early stopping: Hold out a fake test set. While training the net, stop as soon as performance on the fake test set starts getting worse.
- Model averaging: Train many different neural nets and average them.
- Bayesian fitting of neural nets: really just a fancy form of model averaging.
- Dropout: Randomly omit hidden units during training.
- Generative pre-training: (Will be discussed towards the end of the course.)
Learning to Predict the Next Word
Diversion into Cognitive Science
There has been a long debate in cognitive science between two rival theories of what it means to have a concept:
- The feature theory: A concept is a set of semantic features
- Good for explaining similarities between concepts.
- It's convenient -- a concept is a vector of feature activities.
- The structuralist theory: The meaning of a concept lies in its relationships to other concepts.
- So conceptual knowledge is best expressed as a relational graph.
- Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.
Both sides are wrong.
- These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.
- We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just "seeing" the answer with no conscious intervening steps.
Localist and Distributed Representations of Concepts
The obvious way to implement a relational graph in a neural net is to treat a neuron as a node in the graph and a connection as a binary relationship. But this "localist" method will not work:
- We need many different types of relationship and the connections in a neural net do not have discrete labels.
- We need ternary relationships as well as binary ones, e.g. A is between B and C
The right way to implement relational knowledge in a neural net is still an open issue.
- But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a "distributed representation". It's a many-to-many mapping between concepts and neurons.
The Softmax Output Function (Another Diversion)
The squared error measure has some drawbacks. Is there a different cost function that works better?
Yes, force the outputs to represent a probability distribution across discrete alternatives.
Softmax
The output units in a softmax group use a non-local non-linearity. Let $y_i$ be the output of a neuron in a softmax group, and $z_i$ (called the logit) the input of the same neuron:
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j} }$$
Note that $\sum y_i = 1$, so we force the $y_i$ to represent a probability distribution. We can show:
$$\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$
The right cost function to use with softmax is the negative log probability of the right answer. This is called the Cross Entropy cost function. ($t_j$ is the target value.)
$$C = - \sum_{j} t_j \log y_j$$
$C$ has a very big gradient when the target value is 1 and the output is almost zero.
- A value of 0.000001 is much better than 0.000000001 (one-in-a-million chance vs one-in-a-billion chance).
- The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$.
$$\frac{\partial C}{\partial z_i} = \sum_{j} \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$
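A small sketch of the softmax, the cross-entropy cost, and the $y_i - t_i$ gradient (the logits and the one-hot target are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtracting the max improves numerical stability
    return e / e.sum()

def cross_entropy(y, t):
    return -np.sum(t * np.log(y))    # C = -sum_j t_j log y_j

z = np.array([2.0, 1.0, 0.1])        # logits
t = np.array([1.0, 0.0, 0.0])        # one-hot target
y = softmax(z)
print(cross_entropy(y, t))
print(y - t)                         # dC/dz_i = y_i - t_i
```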
Neuro-Probabilistic Language Models
- We cannot identify phonemes perfectly in noisy speech.
- People use their understanding of the meaning of the utterance to hear the right words.
- "We do this unconsciously when we wreck a nice beach"
- We are very good at it.
- This means speech recognizers have to know which words are likely to come next and which are not.
- Fortunately words can be predicted quite well without full understanding.
The Standard "Trigram" Method
- Take a huge amount of text and count the frequencies of all triples of words.
- Use these frequencies to make bets on the relative probabilities of words given the previous two words.
- Until very recently this was the state-of-the-art.
- We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.
- We have to "back-off" to digrams when the count for a trigram is too small. The probability is not zero just because the count is zero!
- A trigram model doesn't understand the similarities between words.
- The sentence "The cat got squashed in the garden on friday" should help us predict words in the sentence "The dog got flattened in the yard on monday".
- To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.
- Using a feature representation also allows a context that contains many more previous words, e.g. 10.
- Yoshua Bengio's neural net for predicting the next word
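A toy caricature of trigram counting with back-off (the corpus and the count threshold are made up; a real system would use smoothed probabilities over a huge corpus):

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams  = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(w1, w2, w3, min_count=1):
    """P(w3 | w1, w2), backing off to the bigram when the trigram count is too small."""
    if trigrams[(w1, w2, w3)] >= min_count:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    return bigrams[(w2, w3)] / unigrams[w2]     # back-off: the probability is not zero

print(p_next("sat", "on", "the"))
```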
Object Recognition
Why Object Recognition is Difficult
- Segmentation: Real scenes are cluttered with other objects.
- Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.
- Deformation: Objects can deform in a variety of non-affine ways.
- Affordances: Object classes are things defined by how they are used.
- Viewpoint: Changes in viewpoint cause changes in images that standard learning methods cannot cope with.
Achieving Viewpoint Invariance
- Viewpoint invariance is one of the main difficulties in making computers perceive.
- Several different approaches:
- Use redundant invariant features.
- Put a box around the object and use normalized pixels.
- Use replicated features with pooling. This is called convolutional neural nets (in later lectures).
- Use a hierarchy of parts that have explicit poses relative to the camera (in later lectures).
Invariant Feature Approach
- Extract a large, redundant set of features that are invariant under transformations.
- e.g. pair of roughly parallel lines with a red dot between them (this is what baby herring gulls use to know where to peck for food).
- With enough invariant features, there is only one way to assemble them into an object.
- But for recognition, we must avoid forming features from parts of different objects.
Judicious Normalization Approach
- Put a box around the object and use it as a coordinate frame for a set of normalized pixels.
- This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.
- The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, ...
- But choosing the box is difficult because of:
- Segmentation errors, occlusion, unusual orientations.
- We need to recognize the shape to get the box right!
Brute Force Normalization Approach
- When training the recognizer, use well-segmented, upright images to fit the correct box.
- At test time try all possible boxes in a range of positions and scales.
- This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.
- It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.
Convolutional Nets for Digit Recognition
The Replicated Feature Approach
- This is currently the dominant approach for neural networks.
- Use many different copies of the same feature detector with different positions.
- Could also replicate across scale and orientation but it is tricky and expensive.
- Replication greatly reduces the number of free parameters to be learned.
- Use several different feature types, each with its own map of replicated detectors.
Backpropagation with Weight Constraints
- It's easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.
- We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints - so if the weights started off satisfying the constraints, they will continue to satisfy them.
To constrain $w_1 = w_2$, we need $\Delta w_1 = \Delta w_2$.
Compute $\partial E / \partial w_1$ and $\partial E / \partial w_2$, then use $\partial E / \partial w_1 + \partial E / \partial w_2$ for both $w_1$ and $w_2$.
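A tiny sketch of this constraint trick for two tied weights (the function name and numbers are illustrative):

```python
def tied_update(w, dE_dw1, dE_dw2, eps=0.1):
    """Keep w1 == w2 by applying the same combined gradient to both."""
    g = dE_dw1 + dE_dw2          # use dE/dw1 + dE/dw2 for both weights
    w1 = w - eps * g
    w2 = w - eps * g             # they stay equal if they started out equal
    return w1, w2

print(tied_update(0.5, 0.2, -0.1))
```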
What Does Replicating the Feature Detectors Achieve?
- Equivariant activities: Replicated features do not make the neural activities invariant to translation. The activities are equivariant.
- Invariant knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.
Pooling the Outputs of Replicated Feature Detectors
- Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.
- This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.
- Taking the maximum of the four works slightly better.
- Problem: After several levels of pooling, we have lost information about the precise positions of things.
- This makes it impossible to use the precise spatial relationships between high-level parts for recognition.
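A small sketch of pooling non-overlapping 2x2 blocks of a feature map (the array is illustrative):

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """Pool each non-overlapping 2x2 block of a feature map down to a single value."""
    h, w = fmap.shape
    blocks = fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(pool2x2(fmap, "mean"))   # averaging four neighboring detectors
print(pool2x2(fmap, "max"))    # taking the maximum works slightly better
```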
LeNet
- Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:
- Many hidden layers.
- Many maps of replicated units in each layer.
- Pooling of the outputs of nearby replicated units.
- A wide net that can cope with several characters at once even if they overlap.
- A clever way of training a complete system, not just a recognizer.
- This net was used for reading ~10% of the checks in North America.
- Look at the impressive demos of LeNet at http://yann.lecun.com
Priors and Prejudice
- We can put our prior knowledge about the task into the network by designing appropriate:
- Connectivity
- Weight constraints
- Neuron activation functions
- This is less intrusive than hand-designing the features
- But it still prejudices the network towards the particular way of solving the problem that we had in mind.
- Alternatively, we can use our prior knowledge to create a whole lot more training data.
- This may require a lot of work (Hofmann & Tresp, 1993): real data + synthetic data.
- It may make learning take much longer.
- It allows optimization to discover clever ways of using the multi-layer network that we did not think of - and we may never fully understand how it does it.
The Brute Force Approach
- LeNet uses knowledge about the invariances to design:
- the local connectivity
- the weight-sharing
- the pooling
- This achieves about 80 errors.
- This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008).
- Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:
- For each training image, they produce many new training examples by applying many different transformations.
- They can then train a large, deep, dumb net on a GPU without much overfitting.
- They achieved about 35 errors.
- With model averaging they can now get about 25 errors (which is about the same as human error rate).
How to Detect a Significant Drop in the Error Rate
- Is 30 errors in 10,000 test cases significantly better than 40 errors?
- It all depends on the particular errors.
- The McNemar test uses the particular errors and can be much more powerful than a test that just uses the number of errors.
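A hedged sketch of an exact McNemar test on the discordant errors of two models (the counts are made up to match the 40-errors-vs-30-errors question above):

```python
from math import comb

def mcnemar_exact_p(n01, n10):
    """Two-sided exact McNemar test.
    n01: cases model A got wrong but model B got right,
    n10: cases model A got right but model B got wrong.
    Cases both models got right or both got wrong carry no information."""
    n = n01 + n10
    k = min(n01, n10)
    # two-sided binomial tail probability under the null hypothesis p = 0.5
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# e.g. model A makes 40 errors and model B makes 30: what matters is the particular
# errors.  If B fixes 11 of A's errors and introduces only 1 new one:
print(mcnemar_exact_p(11, 1))   # small p-value: the improvement is significant
```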
Convolutional Nets for Object Recognition
From Hand-Written Digits to 3D Objects
- Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:
- Hundred times as many classes (1000 vs 10)
- Hundred times as many pixels (256x256 color vs 28x28 gray)
- Two dimensional image of three-dimensional scene.
- Cluttered scenes requiring segmentation.
- Multiple objects in each image.
- Will the same type of convolutional neural network work?
The ILSVRC-2012 Competition on ImageNet
- The dataset has 1.2 million high-resolution training images.
- The classification task:
- Get the "correct" class in your top 5 bets. There are 1000 classes.
- The localization task:
- For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.
- Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...
- Computer vision systems use complicated multi-stage systems.
- The early stages are typically hand-tuned by optimizing a few parameters.
Error Rates on the ILSVRC-2012 Competition
- University of Tokyo: 26.1% (classification)
- Oxford University Computer Vision Group: 26.9% (classification)
- INRIA (the French national research institute in CS) + XRCE (Xerox Research Center Europe): 27.0% (classification)
- University of Amsterdam: 29.5% (classification)
- University of Toronto (Alex Krizhevsky): 16.4% (classification)
A Neural Network for ImageNet
- Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its architecture was:
- 7 hidden layers not counting some max pooling layers.
- The early layers were convolutional.
- The last two layers were globally connected.
- The activation functions were:
- Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.
- Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.
Tricks that Significantly Improve Generalization
- Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.
- At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.
- Use "dropout" to regularize the weights in the globally connected layers (which contain most of the parameters).
- Dropout means that half of the hidden units in a layer are randomly removed for each training example.
- This stops hidden units from relying too much on other hidden units.
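A minimal sketch of dropout as described above; scaling the activities (or, equivalently, the outgoing weights) at test time keeps the expected input to the next layer unchanged. Array sizes are illustrative:

```python
import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=np.random.default_rng()):
    """Randomly omit hidden units during training; scale at test time."""
    if train:
        mask = rng.random(h.shape) >= p_drop
        return h * mask                 # roughly half the units are removed
    return h * (1.0 - p_drop)           # test time: keep all units, scale down

h = np.ones(8)
print(dropout(h))                 # training: some units zeroed out
print(dropout(h, train=False))    # test: scaled activities
```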
The Hardware Required for Alex Krizhevsky's Net
- He uses a very efficient implementation of convolutional nets on two Nvidia GTX 580 Graphics Processor Units (over 1000 fast little cores)
- GPUs are very good for matrix-matrix multiplies.
- GPUs have very high bandwidth to memory.
- This allows him to train the network in a week.
- It also makes it quick to combine results from 10 patches at test time.
- We can spread a network over many cores if we can communicate the states fast enough.
- As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.
Finding Roads in High-Resolution Images
- Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images.
- It takes a large image patch and predicts a binary road label for the central 16x16 pixels.
- There is lots of labeled training data available for this task.
- The task is hard for many reasons:
- Occlusion by buildings, trees, and cars.
- Shadows, Lighting changes.
- Minor viewpoint changes.
- The worst problems are incorrect labels:
- Badly registered maps
- Arbitrary decisions about what counts as a road (vs laneway etc.)
Mini-Batch Gradient Descent
Overview
- The Error Surface: Lies in a space with a horizontal axis for each weight and one vertical axis for the error.
- For a linear neuron with a squared error, it is a quadratic bowl where vertical cross-sections are parabolas and horizontal cross-sections are ellipses.
- For multi-layer, non-linear nets the error surface is much more complicated. But locally, a piece of a quadratic bowl is usually a very good approximation.
- Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.
- The gradient is big in the direction in which we only want to travel a small distance, small in the direction in which we want to travel a large distance.
- If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges.
- What we would like to achieve: move quickly in directions with small but consistent gradients, and move slowly in directions with big but inconsistent gradients.
Stochastic Gradient Descent
- If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.
- So, instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.
- The extreme version of this approach updates weights after each case. It's called "online" learning.
- Mini-batches are usually better than online.
- Less computation is used updating the weights.
- Computing the gradient for many cases simultaneously uses matrix-matrix multiplies which are very efficient, especially on GPUs.
- Mini-batches need to be balanced for classes (one way to approximate this is to pick points randomly).
Two types of learning algorithm
- If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).
- For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.
Basic Mini-Batch Gradient Descent Algorithm
- Guess an initial learning rate.
- If the error keeps getting worse or oscillates wildly, reduce the learning rate.
- If the error is falling fairly consistently but slowly, increase the learning rate.
- Write a simple program to automate this way of adjusting the learning rate.
- Toward the end of mini-batch learning, it nearly always helps to turn down the learning rate.
- This removes fluctuations in the final weights caused by the variations between mini-batches.
- Turn down the learning rate when the error stops decreasing.
- Note: Compute the error in a separate validation set.
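A hedged sketch of this recipe for a linear model (the dataset, batch size, and halve-the-rate rule are illustrative choices, not prescribed by the course):

```python
import numpy as np

def minibatch_sgd(X, t, X_val, t_val, eps=0.05, batch=10, epochs=100):
    """Mini-batch gradient descent; the learning rate is turned down
    when the error on a separate validation set stops decreasing."""
    w = np.zeros(X.shape[1])
    best_val = np.inf
    for _ in range(epochs):
        perm = np.random.permutation(len(X))          # random picking roughly balances classes
        for i in range(0, len(X), batch):
            idx = perm[i:i + batch]
            y = X[idx] @ w
            w -= eps * X[idx].T @ (y - t[idx]) / len(idx)
        val_err = np.mean((X_val @ w - t_val) ** 2)   # error on the validation set
        if val_err > best_val:
            eps *= 0.5                                # turn down the learning rate
        best_val = min(best_val, val_err)
    return w

X = np.random.randn(200, 3); t = X @ np.array([1.0, -2.0, 0.5])
print(minibatch_sgd(X[:160], t[:160], X[160:], t[160:]))
```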
Tricks for Mini-Batch Gradient Descent
Initializing The Weights
- If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.
- So they can never learn to be different features.
- We break symmetry by initializing the weights to have small random values.
- If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.
- We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to $1/\sqrt{\text{fan-in}}$.
- We can also scale the learning rate the same way.
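A small sketch of fan-in-scaled initialization (the layer sizes are illustrative):

```python
import numpy as np

def init_weights(fan_in, fan_out, rng=np.random.default_rng()):
    """Small random weights, scaled down when the fan-in is big."""
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

W1 = init_weights(fan_in=784, fan_out=100)   # big fan-in -> smaller weights
W2 = init_weights(fan_in=100, fan_out=10)
print(W1.std(), W2.std())
```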
Shifting the Inputs
- When using steepest descent, shifting the input values makes a big difference.
- It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.
- The hyperbolic tangent (which is $2\times \text{logistic} - 1$) produces hidden activations that are roughly zero mean.
- In this respect it's better than the logistic.
Scaling the Inputs
- When using steepest descent, scaling the input values makes a big difference.
- It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.
A More Thorough Method: Decorrelate the Input Components
- For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.
- There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.
- Drop the principal components with the smallest eigenvalues ― This achieves some dimensionality reduction.
- Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.
- For a circular error surface, the gradient points straight towards the minimum.
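A hedged sketch of this preprocessing with PCA (the data and the number of components kept are illustrative):

```python
import numpy as np

def pca_whiten(X, n_keep):
    """Decorrelate the inputs with PCA, drop the smallest components,
    and scale the remaining ones to unit variance."""
    X = X - X.mean(axis=0)                         # zero-mean each component
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_keep]     # keep the largest components
    P = eigvecs[:, order]
    return (X @ P) / np.sqrt(eigvals[order])       # divide by sqrt of the eigenvalues

X = np.random.randn(500, 5) @ np.random.randn(5, 5)
Xw = pca_whiten(X, n_keep=3)
print(np.round(np.cov(Xw, rowvar=False), 2))       # approximately the identity
```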
Common Problems That Occur in Multilayer Networks
- If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.
- The error derivatives for the hidden units will all become tiny and the error will not decrease.
- This is usually a plateau, but people often mistake it for a local minimum.
- In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.
- The network finds this strategy quickly and may take a long time to improve on it by making use of the input.
- This is another plateau that looks like a local minimum.
Be Careful about Turning Down the Learning Rate
- Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.
- So we get a quick win
- But then we get slower learning.
- Don't turn down the learning rate too soon!
Four Ways to Speed up Mini-Batch Learning
- Use "momentum"
- Instead of using the gradient to change the position of the weight "particle", use it to change the velocity.
- Use separate adaptive learning rates for each parameter.
- Slowly adjust the rate using the consistency of the gradient for that parameter.
- rmsprop: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.
- This is the mini-batch version of just using the sign of the gradient.
- Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)
- Adapt it to work for neural nets.
- Adapt it to work for mini-batches.
The Momentum Method
Intuition
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.
- The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.
- Its momentum makes it keep going in the previous direction.
- It damps oscillations in directions of high curvature by combining gradients with opposite signs.
- It builds up speed in directions with a gentle but consistent gradient.
Equations
$$\mathbb{v}(t) = \alpha \mathbb{v}(t-1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$
The effect of the gradient is to increment the previous velocity. The velocity also decays by a factor $\alpha$, which is slightly less than 1.
The weight change is equal to the current velocity.
$$\Delta \mathbb{w}(t) = \mathbb{v} (t)$$
$$ = \alpha \mathbb{v} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$
$$ = \alpha \Delta \mathbb{w} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$
The weight change can be expressed in terms of the previous weight change and the current gradient.
Behavior
- If the error surface is a tilted plane, the ball reaches a terminal velocity $\mathbb{v}(\infty)$.
- If the momentum is close to 1, this is much faster than simple gradient descent.
$$\mathbb{v}(\infty) = \frac{1}{1 - \alpha} \Big( - \epsilon \frac{\partial E}{\partial \mathbb{w}} \Big)$$
- At the beginning of learning there may be very large gradients.
- So it pays to use a small momentum (e.g. 0.5).
- Once the large gradients have disappeared and the weights are stuck in a ravine the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99)
- This allows us to learn at a rate that would cause divergent oscillations without the momentum.
Better Type of Momentum (Nesterov 1983)
- The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient.
- Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.
- Inspired by the Nesterov method for optimizing convex functions.
- First make a big jump in the direction of the previous accumulated gradient.
- Then measure the gradient where you end up and make a correction.
- It's better to correct a mistake after you have made it!
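A small sketch of both forms of momentum on a toy quadratic error (the gradient function and hyperparameters are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad_fn, eps=0.01, alpha=0.9):
    """Standard momentum: v(t) = alpha*v(t-1) - eps*dE/dw(t); delta_w = v(t)."""
    v = alpha * v - eps * grad_fn(w)
    return w + v, v

def nesterov_step(w, v, grad_fn, eps=0.01, alpha=0.9):
    """Nesterov-style momentum: first jump in the direction of the previous
    accumulated gradient, then measure the gradient where you end up and correct."""
    w_ahead = w + alpha * v
    v = alpha * v - eps * grad_fn(w_ahead)
    return w + v, v

grad_fn = lambda w: 2 * w                 # gradient of E = w^2 (toy example)
w, v = np.array([5.0]), np.zeros(1)
for _ in range(300):
    w, v = nesterov_step(w, v, grad_fn)
print(w)                                  # approaches the minimum at 0
```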
Adaptive Learning Rates for Each Connection
Intuition
- In a multilayer net, the appropriate learning rates can vary widely between weights:
- The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.
- The fan-in of a unit determines the size of the "overshoot" effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.
- So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.
One Way to Determine the Individual Learning Rates
$$\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$$
- Start with a local gain of 1 for every weight. i.e. $g_{ij}(0) = 1$
- Increase the local gain if the gradient for that weight does not change sign.
- Use small additive increases and multiplicative decreases.
- This ensures that big gains decay rapidly when oscillations start.
- If the gradient is totally random, the gain will hover around 1 when we increase it additively by $\delta$ half the time and decrease it multiplicatively by a factor of $1 - \delta$ half the time.
$$\text{if } \Big( \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) \Big) > 0$$ $$\text{then } g_{ij}(t) = g_{ij}(t - 1) + 0.05$$ $$\text{else } g_{ij}(t) = g_{ij}(t - 1) * 0.95$$
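A small sketch of this gain-adaptation rule (the toy gradient and hyperparameters are illustrative):

```python
import numpy as np

def update_gains(gains, grad, prev_grad):
    """Additive increase when the gradient keeps its sign, multiplicative decrease otherwise."""
    agree = grad * prev_grad > 0
    gains = np.where(agree, gains + 0.05, gains * 0.95)
    return np.clip(gains, 0.1, 10.0)      # limit the gains to a reasonable range

def adaptive_step(w, grad, prev_grad, gains, eps=0.01):
    gains = update_gains(gains, grad, prev_grad)
    return w - eps * gains * grad, gains  # delta_w = -eps * g_ij * dE/dw_ij

w, gains = np.zeros(3), np.ones(3)
prev_grad = np.zeros(3)
for _ in range(20):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))   # toy quadratic error gradient
    w, gains = adaptive_step(w, grad, prev_grad, gains)
    prev_grad = grad
print(w, gains)
```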
Tricks for Making Adaptive Learning Rates Work Better
- Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].
- Use full batch learning or very big mini-batches.
- This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.
- Adaptive learning rates can be combined with momentum.
- Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).
- Adaptive learning rates only deal with axis-aligned effects.
- Momentum does not care about the alignment of the axes.
RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude
RPROP: Using only the sign of the gradient
- The magnitude of the gradient can be very different for different weights and can change during learning.
- This makes it hard to choose a single global learning rate.
- For full batch learning, we can deal with this variation by only using the sign of the gradient.
- The weight updates are all of the same magnitude.
- This escapes from plateaus with tiny gradients quickly.
- RPROP: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.
- Increase the step size for a weight multiplicatively (e.g. times 1.2) if the signs of its last two gradients agree.
- Otherwise decrease the step size multiplicatively (e.g. times 0.5).
- Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster's advice).
Why RPROP Does Not Work with Mini-Batches
- The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.
- Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.
- We want this weight to stay roughly where it is.
- RPROP would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale), so the weight would grow a lot.
- Is there a way to combine:
- The robustness of RPROP.
- The efficiency of mini-batches.
- The effective averaging of gradients over mini-batches.
RMSPROP: Mini-Batch Version of RPROP
- RPROP is equivalent to using the gradient but also dividing by the size of the gradient.
- The problem with mini-batch RPROP is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?
- RMSPROP: Keep a moving average of the squared gradient for each weight.
$$MeanSquare(w,t) = 0.9 \cdot MeanSquare(w, t-1) + 0.1 \Big( \frac{\partial E}{\partial w}(t) \Big) ^2$$
- Dividing the gradient by $\sqrt{MeanSquare(w,t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).
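A minimal rmsprop update implementing the running average above (the toy problem and the small constant added for numerical safety are illustrative):

```python
import numpy as np

def rmsprop_step(w, grad, mean_sq, eps=0.01, decay=0.9, tiny=1e-8):
    """MeanSquare(w,t) = 0.9*MeanSquare(w,t-1) + 0.1*grad^2; divide the gradient by its square root."""
    mean_sq = decay * mean_sq + (1 - decay) * grad ** 2
    w = w - eps * grad / (np.sqrt(mean_sq) + tiny)   # tiny avoids division by zero
    return w, mean_sq

w, mean_sq = np.array([5.0]), np.zeros(1)
for _ in range(1000):
    grad = 2 * w                                     # gradient of E = w^2 (toy example)
    w, mean_sq = rmsprop_step(w, grad, mean_sq)
print(w)                                             # ends up near the minimum at 0
```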
Further Developments of RMSPROP
- Combining RMSPROP with standard momentum
- Momentum does not help as much as it normally does. Needs more investigation.
- Combining RMSPROP with Nesterov momentum (Sutskever 2012)
- It works best if the RMS of the recent gradients is used to divide the correction rather than the jump in the direction of accumulated corrections.
- Combining RMSPROP with adaptive learning rates for each connection
- Needs more investigation.
- Other methods related to RMSPROP
- Yann LeCun's group has a fancy version in "No more pesky learning rates"
Summary of Learning Methods for Neural Networks
- For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.
- Conjugate gradient, LBFGS, ...
- Adaptive learning rates, RPROP, ...
- For big, redundant datasets, use mini-batches.
- Try gradient descent with momentum.
- Try RMSPROP (with momentum?)
- Try LeCun's latest recipe.
- Why there is no simple recipe:
- Neural nets differ a lot:
- Very deep nets (especially ones with narrow bottlenecks).
- Recurrent nets.
- Wide shallow nets.
- Tasks differ a lot:
- Some require very accurate weights, some don't.
- Some have many very rare cases (e.g. words).