Neural Networks (Geoffrey Hinton Course)
Some Simple Models of Neurons
$y$ output, $x_i$ input.
Linear Neurons
$y = b + \sum_{i} x_i w_i$
$w_i$ weights, $b$ bias
Binary Threshold Neurons
$z = \sum_{i} x_i w_i$
$y = 1$ if $z \geq \theta$, $0$ otherwise.
Or, equivalently,
$z = b + \sum_{i} x_i w_i$
$y = 1$ if $z \geq 0$, $0$ otherwise.
Rectified Linear Neurons
$z = b + \sum_{i} x_i w_i$
$y = z$ if $z > 0$, $0$ otherwise (linear above zero, with a hard decision at zero).
Sigmoid Neurons
Give a real-valued output that is a smooth and bounded function of their total input.
$z = b + \sum_{i} x_i w_i$
$y = \frac{1}{1 + e^{-z}}$
Stochastic Binary Neurons
Same equations as logistic units, but the output is $1$ (a spike) or $0$, chosen at random: the output of the logistic is treated as the probability of producing a spike in a short time window.
$z = b + \sum_{i} x_i w_i$
$P(s = 1) = \frac{1}{1 + e^{-z}}$
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.
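As a quick illustration of these models, here is a minimal NumPy sketch (my own code, not from the course; function names are made up):

    import numpy as np

    def z_total(x, w, b):
        # Total input: z = b + sum_i x_i * w_i
        return b + np.dot(w, x)

    def linear(x, w, b):
        return z_total(x, w, b)

    def binary_threshold(x, w, b):
        # Outputs 1 if z >= 0, else 0 (threshold folded into the bias)
        return 1.0 if z_total(x, w, b) >= 0 else 0.0

    def rectified_linear(x, w, b):
        # Linear above zero, 0 otherwise
        return max(0.0, z_total(x, w, b))

    def logistic(x, w, b):
        # Smooth, bounded output in (0, 1)
        return 1.0 / (1.0 + np.exp(-z_total(x, w, b)))

    def stochastic_binary(x, w, b, rng=np.random.default_rng()):
        # Spike (1) with probability given by the logistic output
        return 1.0 if rng.random() < logistic(x, w, b) else 0.0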
Types of Learning
Supervised Learning
Learn to predict an output when given an input vector.
- Regression: The target output is a real number or a whole vector of real numbers.
- Classification: The target output is a class label.
How Supervised Learning Typically Works
- Start by choosing a model-class: $y = f(x;W)$
- A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.
- Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.
- For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.
- For classification there are other measures that are generally more sensible (they also work better).
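As a toy sketch of this setup (my own illustration; the names `f` and `discrepancy` are hypothetical): a linear model-class with the squared-error measure for regression:

    import numpy as np

    def f(x, W):
        # A simple model-class: a linear map from input vector x to output y
        return W @ x

    def discrepancy(y, t):
        # Squared-error measure of how far the actual output y is from the target t
        return 0.5 * np.sum((y - t) ** 2)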
Reinforcement Learning
Learn to select an action to maximize payoff.
- The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.
- The goal in selecting each action is to maximize the expected sum of the future rewards.
- We usually use a discount factor for delayed rewards so that we don't have to look too far into the future (see the discounted-return formula after this list).
- Reinforcement learning is difficult because:
  - The rewards are typically delayed, so it's hard to know where we went wrong (or right).
  - A scalar reward does not supply much information.
  - You typically can't learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning); dozens or perhaps thousands of parameters is more realistic.
- Will not be covered in this course.
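To make the discount idea concrete, the usual form of the discounted return (standard reinforcement-learning notation, supplied here for illustration rather than taken from the notes) is:

$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, with $0 \le \gamma < 1$

Because $\gamma^k$ shrinks geometrically, rewards far in the future contribute almost nothing, so we don't have to look too far ahead.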
Unsupervised Learning
Discover a good internal representation of the input.
- For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).
- It is hard to say what the aim of unsupervised learning is:
  - One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.
  - For example, you can compute the distance to a surface by using the disparity between two images, but you don't want to learn to compute disparities by stubbing your toe thousands of times.
- Other goals:
  - Providing a compact, low-dimensional representation of the input.
    - High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds).
    - Principal Component Analysis is a widely used linear method for finding a low-dimensional representation (see the sketch after this list).
  - Providing an economical high-dimensional representation of the input in terms of learned features.
    - Binary features.
    - Real-valued features that are nearly all zero.
  - Finding sensible clusters in the input.
    - Clustering is an example of a very sparse code in which only one of the features is non-zero.
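A minimal sketch of PCA via the singular value decomposition (my own illustration of the standard method, not code from the course):

    import numpy as np

    def pca(X, k):
        # X: (n_samples, n_dims) data matrix; k: target dimensionality
        Xc = X - X.mean(axis=0)          # center the data
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        components = Vt[:k]              # top-k principal directions
        projected = Xc @ components.T    # low-dimensional representation
        return projected, components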
Neural Network Architectures
- Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.
- Recurrent neural network: information can flow around in cycles.
- Symmetrically connected network: weights are the same in both directions between two units.
Feed-forward Neural Networks
- The commonest type of neural network
- The first layer is the input and the last layer is the output
- Called "deep" neural networks if there is more than one hidden layer.
- They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.
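A minimal sketch of that forward pass (my own illustration; logistic units are one common choice, consistent with the neuron models above, and the layer sizes are made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def feedforward(x, layers):
        # layers: list of (W, b) pairs; each layer's activities are a
        # non-linear function of the activities in the layer below
        activity = x
        for W, b in layers:
            activity = sigmoid(W @ activity + b)
        return activity

    # Example: 3 inputs -> 4 hidden -> 2 outputs
    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
              (rng.normal(size=(2, 4)), np.zeros(2))]
    y = feedforward(rng.normal(size=3), layers)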
Recurrent Networks
- Have directed cycles in their connection graph.
- Have complicated dynamics — difficult to train.
- More biologically realistic.
- Very natural way to model sequential data.
- Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).
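A sketch of one step of a simple recurrent network's dynamics (an illustration; the tanh form is a common choice, not specified in these notes):

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b):
        # The new hidden state depends on the current input and the
        # previous hidden state, so information can persist over time
        return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)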
Symmetrically Connected Networks
- Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).
- Much easier to analyze than general recurrent networks (John Hopfield et al.).
- More restricted in what they can do because they obey an energy function. For example, they cannot model cycles.
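For concreteness, the energy function of a network of binary units $s_i$ with biases $b_i$ and symmetric weights $w_{ij} = w_{ji}$ (Hopfield's standard form, not given explicitly in the notes above) is:

$E = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij}$

Asynchronous binary-threshold updates never increase $E$, so the network settles into an energy minimum rather than following a cycle.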
The First Generation of Neural Networks
Standard Paradigm for Statistical Pattern Recognition
- Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).
- Learn how to weight each of the feature activations to get a single scalar quantity.
- If this quantity is above some threshold, decide that the input vector is a positive example of the target class.
Perceptrons
input units => (by hand-coded programs) => feature units => decision unit
- Popularized by Frank Rosenblatt in the early 1960s.
- Lots of grand claims were made for what they could learn to do.
- In 1969, Minsky and Papert showed Perceptrons' limitations (Group Invariance Theorem). Many people thought these limitations applied to all neural network models.
- Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.
- Decision units in perceptrons are binary threshold neurons.
Perceptron Learning Procedure
- Add an extra component with value 1 to each input vector; its weight acts as the bias, so we can forget about the threshold (bias = -threshold).
- Pick training cases using any policy that ensures that every training case will keep getting picked.
- If the output unit is correct, leave its weights alone.
- If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
- If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.
- This is guaranteed to find a set of weights that gets the right answer for all the training cases if any such set exists.
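A minimal NumPy sketch of this procedure (my own illustration; the always-1 bias component is the trick described above, and cycling through the cases in order is one policy that keeps picking every case):

    import numpy as np

    def train_perceptron(X, t, epochs=100):
        # X: (n_cases, n_features) inputs; t: (n_cases,) targets in {0, 1}
        X = np.hstack([X, np.ones((len(X), 1))])  # append the always-1 bias component
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x, target in zip(X, t):
                y = 1.0 if np.dot(w, x) >= 0 else 0.0
                if y < target:       # incorrectly output a zero
                    w += x           # add the input vector to the weights
                elif y > target:     # incorrectly output a one
                    w -= x           # subtract the input vector from the weights
        return w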
Geometrical View of Perceptrons
- Weight space (each point corresponds to a particular setting for all the weights)
- Each training case defines a hyperplane through the origin (once we eliminate the threshold using the bias trick). The hyperplane is perpendicular to the input vector.
- The weight vector must lie on the correct side of this hyperplane to get the answer right.
- To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).
- If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. A cone is convex, so the average of two good weight vectors is also a good weight vector, which makes the problem convex.
Limitations of Perceptrons
- If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.
- But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.
- A binary threshold output unit cannot even tell if two single-bit features are the same!
  - Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$
  - Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$
  - The four input-output pairs give four inequalities that are impossible to satisfy (spelled out after this list).
- Imagine a 'data-space' in which the axes correspond to components of an input vector.
  - Each input vector is a point in this space, and a weight vector defines a plane in data-space.
  - The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.
- There are many cases where positive and negative examples cannot be separated by a hyperplane; such cases are called not linearly separable.
- A binary threshold unit also cannot discriminate simple patterns under translation with wrap-around.
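Spelling out the four inequalities (using the bias form, where the unit outputs $1$ when $w_1 x_1 + w_2 x_2 + b \geq 0$):

$w_1 + w_2 + b \geq 0$ and $b \geq 0$ (positive cases)

$w_1 + b < 0$ and $w_2 + b < 0$ (negative cases)

Adding the two positive-case inequalities gives $w_1 + w_2 + 2b \geq 0$, while adding the two negative-case inequalities gives $w_1 + w_2 + 2b < 0$. These contradict each other, so no weights and bias satisfy all four cases.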
Conclusion
- Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help: the result is still linear. Fixed output non-linearities are not enough.
- We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?
- We need an efficient way of adapting all the weights, not just the last layer. This is hard.
- Learning the weights going into hidden units is equivalent to learning features.
- This is difficult because nobody is telling us directly what the hidden units should do.
Linear Neurons
Learning Overview
$y = \sum_{i} w_i x_i = \mathbf{w}^T \mathbf{x}$
$y$: neuron's estimate of the desired output, $\mathbf{w}$: weight vector, $\mathbf{x}$: input vector
- Perceptron convergence procedure doesn't work for more complex networks (because the problem is not convex). Multi-layer neural networks do not use the perceptron learning procedure.
- Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.
- The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.
- Why don't we solve it analytically?
- Scientific answer: We want a method that real neurons could use.
- Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.
Learning Procedure
- The delta-rule for learning:
- $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value
- $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error
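A sketch of online training of a linear neuron with the delta-rule (my own minimal illustration of the rule above, not code from the course):

    import numpy as np

    def train_linear_neuron(X, t, epsilon=0.01, epochs=50):
        # X: (n_cases, n_inputs) inputs; t: (n_cases,) real-valued targets
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x, target in zip(X, t):
                y = np.dot(w, x)                  # y = w^T x
                w += epsilon * x * (target - y)   # delta-rule: dw_i = eps * x_i * (t - y)
        E = 0.5 * np.sum((t - X @ w) ** 2)        # error summed over all training cases
        return w, E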
The Error Surface for Linear Neurons
- The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
- For a linear neuron with a squared error, it is a quadratic bowl.
- Vertical cross-sections are parabolas.
- Horizontal cross-sections are ellipses.
- For multi-layer, non-linear nets the error surface is much more complicated.
- The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.
- The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.
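A sketch contrasting the two update styles for the linear neuron above (an illustration under the same squared-error setup, not course code):

    import numpy as np

    def batch_update(w, X, t, epsilon):
        # Steepest descent on the full error surface: one step along
        # the gradient summed over all training cases
        return w + epsilon * X.T @ (t - X @ w)

    def online_update(w, x, target, epsilon):
        # One step using a single training case; successive steps
        # zig-zag around the direction of steepest descent
        return w + epsilon * x * (target - np.dot(w, x))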
Logistic Neurons
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$
- These give a real-valued output that is a smooth and bounded function of their total input.