<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.tedyun.com/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tedyun</id>
	<title>TedYunWiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.tedyun.com/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tedyun"/>
	<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php/Special:Contributions/Tedyun"/>
	<updated>2026-04-25T16:30:47Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.35.10</generator>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Git_Tips&amp;diff=519</id>
		<title>Git Tips</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Git_Tips&amp;diff=519"/>
		<updated>2017-11-14T22:12:51Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cheatsheet ==&lt;br /&gt;
&lt;br /&gt;
See the state of your working tree and index, and how your local branch compares to its upstream&lt;br /&gt;
 git status&lt;br /&gt;
&lt;br /&gt;
You are working on a new feature - newfeat&lt;br /&gt;
&lt;br /&gt;
1) create a new branch, newfeat&lt;br /&gt;
&lt;br /&gt;
 git checkout -b newfeat master&lt;br /&gt;
&lt;br /&gt;
2) make and commit changes in your branch&lt;br /&gt;
&lt;br /&gt;
3) push feature branch into repository&lt;br /&gt;
&lt;br /&gt;
 git push -u origin newfeat&lt;br /&gt;
&lt;br /&gt;
4) merge back into master&lt;br /&gt;
&lt;br /&gt;
 git checkout master&lt;br /&gt;
 git merge --no-ff newfeat&lt;br /&gt;
 git branch -d newfeat&lt;br /&gt;
 git push origin master&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If master changes and you want to incorporate those changes into your feature branch (run from the feature branch):&lt;br /&gt;
 git rebase master&lt;br /&gt;
or&lt;br /&gt;
 git merge master&lt;br /&gt;
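A throwaway end-to-end sketch of the rebase option (assumptions: git is installed, version 2.28 or later for the -b flag of git init; all file and branch names are illustrative):

```shell
# Build a scratch repo with a master branch and a feature branch.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b master
git config user.email demo@example.com
git config user.name Demo
echo base > file.txt; git add file.txt; git commit -q -m base

# Create the feature branch and commit on it.
git checkout -q -b newfeat
echo feat > feat.txt; git add feat.txt; git commit -q -m feat

# Meanwhile master moves ahead.
git checkout -q master
echo more >> file.txt; git commit -q -am master-change

# Replay newfeat's commits on top of the updated master.
git checkout -q newfeat
git rebase -q master
git log --oneline   # feat now sits on top of master-change
```

With git merge master instead of the rebase, the same histories would be joined by a merge commit rather than rewritten.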
&lt;br /&gt;
&lt;br /&gt;
Delete branch&lt;br /&gt;
local:&lt;br /&gt;
 git branch -d old_branch&lt;br /&gt;
remote:&lt;br /&gt;
 git push origin :old_branch&lt;br /&gt;
or, equivalently,&lt;br /&gt;
 git push origin --delete old_branch&lt;br /&gt;
&lt;br /&gt;
Undo the last commit:&lt;br /&gt;
&lt;br /&gt;
1) discard the commit and reset files to the state before it&lt;br /&gt;
 git reset --hard HEAD~1&lt;br /&gt;
2) just undo the &amp;quot;commit&amp;quot; action but leave the files and index as they are&lt;br /&gt;
 git reset --soft HEAD~1&lt;br /&gt;
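A throwaway demo of a soft reset (assumptions: git is installed, version 2.28 or later for git init -b; names are illustrative), showing that the undone commit's changes remain staged:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b master
git config user.email demo@example.com
git config user.name Demo
echo one > a.txt; git add a.txt; git commit -q -m first
echo two > b.txt; git add b.txt; git commit -q -m second

# Undo "second" but keep its changes in the index.
git reset --soft HEAD~1
git status --short   # b.txt is still staged ("A  b.txt")
```

A --hard reset of the same commit would instead delete b.txt from the index and the working tree as well.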
&lt;br /&gt;
To see the list of remote branches:&lt;br /&gt;
 git branch -a&lt;br /&gt;
or&lt;br /&gt;
 git branch -r&lt;br /&gt;
&lt;br /&gt;
To update the list of remote branches:&lt;br /&gt;
 git remote update origin --prune&lt;br /&gt;
or&lt;br /&gt;
 git remote prune origin&lt;br /&gt;
&lt;br /&gt;
To see diff between the current and the last version&lt;br /&gt;
 git diff HEAD^ HEAD&lt;br /&gt;
&lt;br /&gt;
To see the list of contributors and the number of commits made by each person&lt;br /&gt;
 git shortlog -s -n&lt;br /&gt;
&lt;br /&gt;
To clear the working directory, the following command removes untracked files, including directories (-d). Replace the -f argument with -n for a dry run, or with -i for interactive mode, to see what would be removed before removing it. Add -x to also remove files ignored by git.&lt;br /&gt;
 git clean -d -f&lt;br /&gt;
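A throwaway demo of the dry run versus the real clean (assumptions: git is installed, version 2.28 or later for git init -b; names are illustrative):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b master
git config user.email demo@example.com
git config user.name Demo
echo keep > tracked.txt; git add tracked.txt; git commit -q -m init

# Untracked clutter: a directory git does not know about.
mkdir junk; echo x > junk/scratch.txt

git clean -d -n   # dry run: reports that junk/ would be removed
git clean -d -f   # actually removes junk/; tracked.txt survives
```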
&lt;br /&gt;
To skip the editor prompt for the merge commit message, add &amp;quot;export GIT_MERGE_AUTOEDIT=no&amp;quot; to .bashrc in your home directory. In older versions:&lt;br /&gt;
 git config --global core.mergeoptions --no-edit&lt;br /&gt;
&lt;br /&gt;
To speed up git&amp;#039;s autocomplete on Windows (see https://github.com/msysgit/msysgit/wiki/Diagnosing-why-Git-is-so-slow):&lt;br /&gt;
 git config --global core.fscache true&lt;br /&gt;
&lt;br /&gt;
Cleanup unnecessary files and optimize the local repository&lt;br /&gt;
 git gc&lt;br /&gt;
&lt;br /&gt;
Store credentials for 24 hours when using https&lt;br /&gt;
 git config credential.helper &amp;#039;cache --timeout=86400&amp;#039;&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=New_desktop&amp;diff=518</id>
		<title>New desktop</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=New_desktop&amp;diff=518"/>
		<updated>2017-10-22T20:36:23Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* CPU: Intel i7 2600 3.4GHz - $265&lt;br /&gt;
* Motherboard: Asus P8P67 - $170&lt;br /&gt;
* Memory: Kingston HyperX Blu DDR3 1600 8GB - $40 ✔&lt;br /&gt;
* Graphic: Sapphire Radeon 6870 - $160 ✔&lt;br /&gt;
* Sound: on board&lt;br /&gt;
* Network: dual on board&lt;br /&gt;
* HDD1: OCZ Vertex 2 3.5&amp;quot; SSD 90GB - $135 ✔&lt;br /&gt;
* HDD2: Samsung Spinpoint F3 HD103SJ 1TB 7200RPM - $55 ✔&lt;br /&gt;
* ODD: Samsung BD-ROM B123L - $59 ✔&lt;br /&gt;
* Power Supply: Corsair 650TX - $59 ✔&lt;br /&gt;
* CPU Cooler: Cooler Master Hyper 212 Plus - $25.50&lt;br /&gt;
* Case: Cooler Master HAF 922 - $89 ✔&lt;br /&gt;
* Keyboard: Logitech K320 Wireless Keyboard with Universal Receiver - $20 ✔&lt;br /&gt;
* Mouse: Logitech M505 Wireless Mouse with Universal Receiver - ✔&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=New_desktop&amp;diff=517</id>
		<title>New desktop</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=New_desktop&amp;diff=517"/>
		<updated>2017-10-22T20:35:03Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* CPU: Intel i7 2600 3.4GHz - $265&lt;br /&gt;
* Motherboard: Asus P8P67 - $170&lt;br /&gt;
* Memory: Kingston HyperX Blu DDR3 1600 8GB - $40 ✔&lt;br /&gt;
* Graphic: Sapphire Radeon 6870 - $160 ✔&lt;br /&gt;
* Sound: on board&lt;br /&gt;
* Network: dual on board&lt;br /&gt;
* HDD1: OCZ Vertex 2 3.5&amp;quot; SSD 90GB - $135 ✔&lt;br /&gt;
* HDD2: Samsung Spinpoint F3 HD103SJ 1TB 7200RPM - $55 ✔&lt;br /&gt;
* ODD: Samsung BD-ROM B123L - $59 ✔&lt;br /&gt;
* Power Supply: Corsair 650TX - $59 ✔&lt;br /&gt;
* CPU Cooler: Cooler Master Hyper 212 Plus - $25.50&lt;br /&gt;
* Case: Cooler Master HAF 922 - $89 ✔&lt;br /&gt;
* Keyboard: Logitech K320 Wireless Keyboard with Universal Receiver - $20 ✔&lt;br /&gt;
* Mouse: Logitech M505 Wireless Mouse with Universal Receiver - ✔&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=New_desktop&amp;diff=516</id>
		<title>New desktop</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=New_desktop&amp;diff=516"/>
		<updated>2017-10-22T20:34:00Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* CPU: Intel i7 2600 3.4GHz - $265&lt;br /&gt;
* Motherboard: Asus P8P67 - $170&lt;br /&gt;
* Memory: Kingston HyperX Blu DDR3 1600 8GB - $40 ✔&lt;br /&gt;
* Graphic: Sapphire Radeon 6870 - $160 ✔&lt;br /&gt;
* Sound: on board&lt;br /&gt;
* Network: dual on board&lt;br /&gt;
* HDD1: OCZ Vertex 2 3.5&amp;quot; SSD 90GB - $135 ✔&lt;br /&gt;
* HDD2: Samsung Spinpoint F3 HD103SJ 1TB 7200RPM - $55 ✔&lt;br /&gt;
* ODD: Samsung BD-ROM B123L - $59 ✔&lt;br /&gt;
* Power Supply: Corsair 650TX - $59 ✔&lt;br /&gt;
* CPU Cooler: Cooler Master Hyper 212 Plus - $25.50&lt;br /&gt;
* Case: Cooler Master HAF 922 - $89 ✔&lt;br /&gt;
* Keyboard: Logitech K320 Wireless Keyboard with Universal Receiver - $20 ✔&lt;br /&gt;
* Mouse: Logitech M505 Wireless Mouse with Universal Receiver - ✔&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=New_desktop&amp;diff=515</id>
		<title>New desktop</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=New_desktop&amp;diff=515"/>
		<updated>2017-10-22T20:33:17Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* CPU: Intel i7 2600 3.4GHz - $265&lt;br /&gt;
* Motherboard: Asus P8P67 - $170&lt;br /&gt;
* Memory: Kingston HyperX Blu DDR3 1600 8GB - $40 ✔&lt;br /&gt;
* Graphic: Sapphire Radeon 6870 - $160 ✔&lt;br /&gt;
* Sound: on board&lt;br /&gt;
* Network: dual on board&lt;br /&gt;
* HDD1: OCZ Vertex 2 3.5&amp;quot; SSD 90GB - $135 ✔&lt;br /&gt;
* HDD2: Samsung Spinpoint F3 HD103SJ 1TB 7200RPM - $55 ✔&lt;br /&gt;
* ODD: Samsung BD-ROM B123L - $59 ✔&lt;br /&gt;
* Power Supply: Corsair 650TX - $59 ✔&lt;br /&gt;
* CPU Cooler: Cooler Master Hyper 212 Plus - $25.50&lt;br /&gt;
* Case: Cooler Master HAF 922 - $89 ✔&lt;br /&gt;
* Keyboard: Logitech K320 Wireless Keyboard with Universal Receiver - $20 ✔&lt;br /&gt;
* Mouse: Logitech M505 Wireless Mouse with Universal Receiver - ✔&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=514</id>
		<title>Desktop Build 2017</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=514"/>
		<updated>2017-09-01T20:07:41Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* CPU: AMD Ryzen 7 1800X 3.6 GHz 8 Core AM4 (350)&lt;br /&gt;
* Motherboard: ASUS PRIME X370-PRO AM4 ATX AMD Motherboard (110)&lt;br /&gt;
* Memory:&lt;br /&gt;
* Cooler: Cooler Master Hyper 212 EVO + Cooler Master AMD AM4 CPU Cooler Upgrade kit (25)&lt;br /&gt;
* Power: X&lt;br /&gt;
* Case: X&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=513</id>
		<title>Desktop Build 2017</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=513"/>
		<updated>2017-09-01T20:07:24Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* CPU: AMD Ryzen 7 1800X 3.6 GHz 8 Core AM4 ($350)&lt;br /&gt;
* Motherboard: ASUS PRIME X370-PRO AM4 ATX AMD Motherboard ($110)&lt;br /&gt;
* Memory:&lt;br /&gt;
* Cooler: Cooler Master Hyper 212 EVO + Cooler Master AMD AM4 CPU Cooler Upgrade kit ($25)&lt;br /&gt;
* Power: X&lt;br /&gt;
* Case: X&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=512</id>
		<title>Desktop Build 2017</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=512"/>
		<updated>2017-08-30T22:53:29Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* CPU: AMD Ryzen 7 1800X 3.6 GHz 8 Core AM4&lt;br /&gt;
* Motherboard: ASUS PRIME X370-PRO AM4 ATX AMD Motherboard&lt;br /&gt;
* Memory:&lt;br /&gt;
* Cooler: Cooler Master Hyper 212 EVO + Cooler Master AMD AM4 CPU Cooler Upgrade kit&lt;br /&gt;
* Power(?):&lt;br /&gt;
* Case(?):&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=511</id>
		<title>Desktop Build 2017</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=511"/>
		<updated>2017-08-30T22:44:08Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* CPU: AMD Ryzen 7 1800X 3.6 GHz 8 Core AM4&lt;br /&gt;
* Motherboard: ASUS PRIME X370-PRO AM4 ATX AMD Motherboard&lt;br /&gt;
* Memory:&lt;br /&gt;
* Cooler: Cooler Master Hyper 212 EVO&lt;br /&gt;
* Power(?):&lt;br /&gt;
* Case(?):&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=510</id>
		<title>Desktop Build 2017</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=510"/>
		<updated>2017-08-30T22:42:20Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* CPU: AMD Ryzen 7 1800X 3.6 GHz 8 Core AM4&lt;br /&gt;
* Motherboard: ASUS PRIME X370-PRO AM4 ATX AMD Motherboard&lt;br /&gt;
* Memory:&lt;br /&gt;
* Cooler:&lt;br /&gt;
* Power(?):&lt;br /&gt;
* Case(?):&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=509</id>
		<title>Desktop Build 2017</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Desktop_Build_2017&amp;diff=509"/>
		<updated>2017-08-30T22:14:34Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: Created page with &amp;quot;* AMD Ryzen 7 1800X 3.6 GHz 8 Core AM4 * ASUS PRIME X370-PRO AM4 ATX AMD Motherboard&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* AMD Ryzen 7 1800X 3.6 GHz 8 Core AM4&lt;br /&gt;
* ASUS PRIME X370-PRO AM4 ATX AMD Motherboard&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=508</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=508"/>
		<updated>2017-08-30T22:13:29Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Others */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Plan ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Life!!Long Term!!Short Term&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Health&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Family&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Joy&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Make Difference&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# [[Bitcoin and Cryptocurrency Technologies]]&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequence and time complexity http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in Binary Search Tree.&lt;br /&gt;
# Find a deterministic (fast) algorithm to get the square root of a number modulo $p$ where $p$ is a prime and $p \equiv 1$ mod 4.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (major open problem) In other words, computing an $e$-th root modulo $N=pq$ (given $gcd(e, \phi(N)) = 1$) is as hard as factoring $N$.&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the 3rd root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is true for $e = 2$ but $gcd(2, \phi(N)) \neq 1$ in this case.)&lt;br /&gt;
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* When proving that some multiplication has conjectural structure constants $C_{a,b}^c$, &amp;lt;br&amp;gt; define $\mu(X_a, X_b) := \sum C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for $X_r$, $r\in S$, where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert zip codes to city, State. (first need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulted polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodamaa and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
$u \cdot v = w$ where $u$ and $v$ are long cycles; find the probability that 1 and 2 are in the same cycle of $w$&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ $a_{ij}$&amp;#039;s, where $1\leq i &amp;lt; j \leq n$, such that each $a_{ij}$ occurs between the i-th $x$ and the j-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
&lt;br /&gt;
*The n-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary n-tuples), with an edge between u and v if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where c(G) is the number of spanning trees of G.&lt;br /&gt;
&lt;br /&gt;
*Find a bijection between a staircase tableaux with no $\delta$ of size $n$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding volumes (Ehrhart polynomials) of valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$ where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$ is the minimal distributive lattice with the property that it has antichains that has one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* Diffie-Hellman key exchange provides a non-interactive way to exchange keys between two people. Find such a protocol for general $n$ people. ($n=3$ case solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot; http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[Desktop Build 2017]]&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# repn theory of quivers http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Bitcoin_and_Cryptocurrency_Technologies&amp;diff=507</id>
		<title>Bitcoin and Cryptocurrency Technologies</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Bitcoin_and_Cryptocurrency_Technologies&amp;diff=507"/>
		<updated>2017-07-22T21:33:27Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Cryptographic Hash Functions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;https://www.coursera.org/learn/cryptocurrency&lt;br /&gt;
&lt;br /&gt;
== Cryptographic Hash Functions ==&lt;br /&gt;
&lt;br /&gt;
=== Hash Function ===&lt;br /&gt;
&lt;br /&gt;
* Takes any string as input&lt;br /&gt;
* Fixed-size output (we&amp;#039;ll use 256 bits)&lt;br /&gt;
* efficiently computable&lt;br /&gt;
&lt;br /&gt;
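The fixed-size output property is easy to check from a shell (a sketch assuming the coreutils sha256sum tool is available): every input, short or long, hashes to 256 bits, i.e. 64 hex characters.

```shell
# Both digests have length 64 (256 bits in hex), despite the
# very different input lengths (assumes coreutils sha256sum).
printf '%s' "x" | sha256sum | awk '{ print length($1) }'
printf '%s' "a much, much longer input string" | sha256sum | awk '{ print length($1) }'
```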
Security properties:&lt;br /&gt;
&lt;br /&gt;
* Collision-free&lt;br /&gt;
* Hiding&lt;br /&gt;
* Puzzle-friendly&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Bitcoin_and_Cryptocurrency_Technologies&amp;diff=506</id>
		<title>Bitcoin and Cryptocurrency Technologies</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Bitcoin_and_Cryptocurrency_Technologies&amp;diff=506"/>
		<updated>2017-07-22T21:33:06Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Cryptographic Hash Functions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;https://www.coursera.org/learn/cryptocurrency&lt;br /&gt;
&lt;br /&gt;
== Cryptographic Hash Functions ==&lt;br /&gt;
&lt;br /&gt;
=== Hash Function ===&lt;br /&gt;
&lt;br /&gt;
- Takes any string as input&lt;br /&gt;
- Fixed-size output (we&amp;#039;ll use 256 bits)&lt;br /&gt;
- efficiently computable&lt;br /&gt;
&lt;br /&gt;
Security properties:&lt;br /&gt;
&lt;br /&gt;
- Collision-free&lt;br /&gt;
- Hiding&lt;br /&gt;
- Puzzle-friendly&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Bitcoin_and_Cryptocurrency_Technologies&amp;diff=505</id>
		<title>Bitcoin and Cryptocurrency Technologies</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Bitcoin_and_Cryptocurrency_Technologies&amp;diff=505"/>
		<updated>2017-07-22T21:19:00Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: Created page with &amp;quot;https://www.coursera.org/learn/cryptocurrency  == Cryptographic Hash Functions ==&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;https://www.coursera.org/learn/cryptocurrency&lt;br /&gt;
&lt;br /&gt;
== Cryptographic Hash Functions ==&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=504</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=504"/>
		<updated>2017-07-22T21:08:17Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Status */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Plan ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Life!!Long Term!!Short Term&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Health&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Family&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Joy&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Make Difference&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# [[Bitcoin and Cryptocurrency Technologies]]&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequence and time complexity http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find an efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in a Binary Search Tree.&lt;br /&gt;
# Find a fast deterministic algorithm to compute the square root of a number modulo $p$, where $p$ is a prime and $p \equiv 1 \pmod 4$.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (major open problem) In other words, computing an $e$-th root modulo $N=pq$ (given $gcd(e, \phi(N)) = 1$) is as hard as factoring $N$.&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the 3rd root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is yes for $e = 2$, but $gcd(2, \phi(N)) \neq 1$ in this case.)&lt;br /&gt;
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* When proving a structure constant of some multiplication with conjectural structure constant $C$, &amp;lt;br&amp;gt; define $\mu(X_a, X_b) := \sum C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for $X_r, r\in S$, where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert zip codes to city and state. (First need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulting polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodama and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
If $u \cdot v = w$ where $u$ and $v$ are long cycles, find the probability that 1 and 2 are in the same cycle of $w$.&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ $a_{ij}$&amp;#039;s, where $1\leq i &amp;lt; j \leq n$, such that each $a_{ij}$ occurs between the $i$-th $x$ and the $j$-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
&lt;br /&gt;
*The $n$-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary $n$-tuples), with an edge between $u$ and $v$ if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where $c(G)$ is the number of spanning trees of $G$.&lt;br /&gt;
&lt;br /&gt;
*Find a bijection between staircase tableaux of size $n$ with no $\delta$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding volumes (Ehrhart polynomials) of valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$, where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$, is the minimal distributive lattice with the property that it has antichains with one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* The Diffie-Hellman key exchange protocol provides a &amp;quot;non-interactive&amp;quot; way to exchange keys between two people. Find such a protocol for general $n$ people. (The $n=3$ case was solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot; http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# representation theory of quivers http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=503</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=503"/>
		<updated>2017-07-15T20:11:12Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Plan */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Plan ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Life!!Long Term!!Short Term&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Health&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Family&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Joy&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Make Difference&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time, or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequence and time complexity http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find an efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in a Binary Search Tree.&lt;br /&gt;
# Find a fast deterministic algorithm to compute the square root of a number modulo $p$, where $p$ is a prime and $p \equiv 1 \pmod 4$.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (major open problem) In other words, computing an $e$-th root modulo $N=pq$ (given $gcd(e, \phi(N)) = 1$) is as hard as factoring $N$.&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the 3rd root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is yes for $e = 2$, but $gcd(2, \phi(N)) \neq 1$ in this case.)&lt;br /&gt;
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* When proving a structure constant of some multiplication with conjectural structure constant $C$, &amp;lt;br&amp;gt; define $\mu(X_a, X_b) := \sum C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for $X_r, r\in S$, where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert zip codes to city and state. (First need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulting polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodama and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
If $u \cdot v = w$ where $u$ and $v$ are long cycles, find the probability that 1 and 2 are in the same cycle of $w$.&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ $a_{ij}$&amp;#039;s, where $1\leq i &amp;lt; j \leq n$, such that each $a_{ij}$ occurs between the $i$-th $x$ and the $j$-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
&lt;br /&gt;
*The $n$-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary $n$-tuples), with an edge between $u$ and $v$ if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where $c(G)$ is the number of spanning trees of $G$.&lt;br /&gt;
&lt;br /&gt;
*Find a bijection between staircase tableaux of size $n$ with no $\delta$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding volumes (Ehrhart polynomials) of valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$, where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$, is the minimal distributive lattice with the property that it has antichains with one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* The Diffie-Hellman key exchange protocol provides a &amp;quot;non-interactive&amp;quot; way to exchange keys between two people. Find such a protocol for general $n$ people. (The $n=3$ case was solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot; http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# representation theory of quivers http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=502</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=502"/>
		<updated>2017-07-15T20:09:45Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Plan */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Plan ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Life!!Long Term!!Short Term&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Health&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Family&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Wealth&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time, or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequence and time complexity http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find an efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in a Binary Search Tree.&lt;br /&gt;
# Find a fast deterministic algorithm to compute the square root of a number modulo $p$, where $p$ is a prime and $p \equiv 1 \pmod 4$.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (major open problem) In other words, computing an $e$-th root modulo $N=pq$ (given $gcd(e, \phi(N)) = 1$) is as hard as factoring $N$.&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the 3rd root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is yes for $e = 2$, but $gcd(2, \phi(N)) \neq 1$ in this case.)&lt;br /&gt;
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* When proving a structure constant of some multiplication with conjectural structure constant $C$, &amp;lt;br&amp;gt; define $\mu(X_a, X_b) := \sum C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for $X_r, r\in S$, where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert zip codes to city and state. (First need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulting polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodama and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
If $u \cdot v = w$ where $u$ and $v$ are long cycles, find the probability that 1 and 2 are in the same cycle of $w$.&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ $a_{ij}$&amp;#039;s, where $1\leq i &amp;lt; j \leq n$, such that each $a_{ij}$ occurs between the $i$-th $x$ and the $j$-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
&lt;br /&gt;
*The $n$-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary $n$-tuples), with an edge between $u$ and $v$ if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where $c(G)$ is the number of spanning trees of $G$.&lt;br /&gt;
&lt;br /&gt;
*Find a bijection between staircase tableaux of size $n$ with no $\delta$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding volumes (Ehrhart polynomials) of valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$, where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$, is the minimal distributive lattice with the property that it has antichains with one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* The Diffie-Hellman key exchange protocol provides a &amp;quot;non-interactive&amp;quot; way to exchange keys between two people. Find such a protocol for general $n$ people. (The $n=3$ case was solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot; http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# representation theory of quivers http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=501</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=501"/>
		<updated>2017-07-15T20:08:53Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Plan ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Life!!Long Term!!Short Term&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Family&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Wealth&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Health&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time, or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequences and time complexity: http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find an efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in a binary search tree.&lt;br /&gt;
# Find a fast deterministic algorithm to compute a square root of a number modulo $p$, where $p$ is a prime with $p \equiv 1 \pmod 4$.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (Major open problem.) In other words, is computing an $e$-th root modulo $N=pq$ (given $\gcd(e, \phi(N)) = 1$) as hard as factoring $N$?&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the cube root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is yes for $e = 2$, but $\gcd(2, \phi(N)) \neq 1$ in that case.)&lt;br /&gt;
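The parenthetical claim for $e = 2$ is Rabin's reduction; a minimal sketch, where the square-root oracle is simulated with hypothetical known primes $p, q \equiv 3 \pmod 4$ (which is exactly what a real adversary would not have):

```python
import math
import random

# Sketch of the e = 2 case (Rabin): an oracle returning square roots
# modulo N = p*q lets us factor N. We simulate the oracle using the
# secret primes, assuming hypothetical p, q congruent to 3 mod 4 so a
# root of c is simply pow(c, (p + 1) // 4, p).

def sqrt_oracle(c, p, q):
    rp = pow(c, (p + 1) // 4, p)   # a square root of c mod p
    rq = pow(c, (q + 1) // 4, q)   # a square root of c mod q
    n = p * q
    # combine the two roots with the Chinese Remainder Theorem; the
    # oracle's choice of root does not depend on the caller's x
    return (rp * q * pow(q, -1, p) + rq * p * pow(p, -1, q)) % n

def factor_with_oracle(p, q):
    n = p * q
    while True:
        x = random.randrange(2, n)
        g = math.gcd(x, n)
        if g != 1:
            return g               # lucky hit: x already shares a factor
        y = sqrt_oracle(x * x % n, p, q)
        # y is one of the four roots of x^2; with probability 1/2 it is
        # neither x nor n - x, and then gcd(x - y, n) is a proper factor
        g = math.gcd(x - y, n)
        if g != 1 and g != n:
            return g
```

Each random $x$ yields a nontrivial factor with probability about 1/2, so the loop terminates quickly.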
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* When proving that some multiplication has conjectural structure constants $C_{a,b}^c$, define $\mu(X_a, X_b) := \sum C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for $X_r$, $r\in S$, where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert ZIP codes to city and state. (First need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulted polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodama and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
For $u \cdot v = w$, where $u$ and $v$ are long cycles, find the probability that $1$ and $2$ lie in the same cycle of $w$.&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ $a_{ij}$&amp;#039;s, where $1\leq i &amp;lt; j \leq n$, such that each $a_{ij}$ occurs between the i-th $x$ and the j-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
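The displayed formula can be sanity-checked by brute force for very small $n$ and $t$; a sketch (function names are mine):

```python
from itertools import permutations
from math import comb, factorial
from fractions import Fraction

# Brute-force check of the product formula for f(n, t) on small cases.

def f_formula(n, t):
    val = Fraction(factorial(n + t * n * (n - 1)),
                   factorial(n) * factorial(t) ** n
                   * factorial(2 * t) ** comb(n, 2))
    for j in range(1, n + 1):
        val *= Fraction(factorial((j - 1) * t) ** 2 * factorial(j * t),
                        factorial(1 + (n + j - 2) * t))
    return val

def f_brute(n, t):
    # the multiset: n copies of 'x' and 2t copies of each a_ij
    symbols = ['x'] * n
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            symbols += [(i, j)] * (2 * t)
    count = 0
    for seq in set(permutations(symbols)):
        xpos = [k for k, s in enumerate(seq) if s == 'x']
        ok = True
        for k, s in enumerate(seq):
            if s != 'x':
                i, j = s
                # each a_ij must lie strictly between the i-th and j-th x
                if k not in range(xpos[i - 1] + 1, xpos[j - 1]):
                    ok = False
        if ok:
            count += 1
    return count
```

For example, $f(2,1) = 1$: the only valid sequence is $x\,a_{12}\,a_{12}\,x$.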
&lt;br /&gt;
*The n-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary n-tuples), with an edge between u and v if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where c(G) is the number of spanning trees of G.&lt;br /&gt;
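This product formula can be checked for small $n$ via the Matrix-Tree theorem, which says $c(G)$ equals any cofactor of the graph Laplacian; a minimal sketch:

```python
from fractions import Fraction
from itertools import product
from math import comb

# Check the spanning-tree count of the n-cube against the stated
# product formula, using the Matrix-Tree theorem.

def cube_spanning_trees(n):
    verts = list(product([0, 1], repeat=n))
    idx = {v: i for i, v in enumerate(verts)}
    m = len(verts)
    L = [[Fraction(0)] * m for _ in range(m)]
    for v in verts:
        for k in range(n):
            w = list(v)
            w[k] = 1 - w[k]               # flip one coordinate
            L[idx[v]][idx[tuple(w)]] -= 1
            L[idx[v]][idx[v]] += 1
    # delete row 0 and column 0, then take the determinant by Gaussian
    # elimination (pivots stay positive: the reduced Laplacian of a
    # connected graph is positive definite)
    M = [row[1:] for row in L[1:]]
    det = Fraction(1)
    for i in range(m - 1):
        det *= M[i][i]
        for r in range(i + 1, m - 1):
            f = M[r][i] / M[i][i]
            for c in range(i, m - 1):
                M[r][c] -= f * M[i][c]
    return det

def cube_formula(n):
    val = 2 ** (2 ** n - n - 1)
    for k in range(1, n + 1):
        val *= k ** comb(n, k)
    return val
```

For $n = 2$ (the 4-cycle) both sides give 4, and for $n = 3$ both give 384.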
&lt;br /&gt;
*Find a bijection between a staircase tableaux with no $\delta$ of size $n$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding volumes (Ehrhart polynomials) of valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$, where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$, is the minimal distributive lattice with the property that it has antichains that have one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley, http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* The Diffie-Hellman key exchange protocol provides a &amp;quot;non-interactive&amp;quot; way to exchange keys between two people. Find such a protocol for general $n$ people. (The $n=3$ case was solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot;, http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# representation theory of quivers: http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=500</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=500"/>
		<updated>2017-07-15T20:08:14Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Plan */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Plan ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Life!!Long Term!!Short Term&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Family&lt;br /&gt;
|.&lt;br /&gt;
|.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Wealth&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Health&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time, or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequences and time complexity: http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find an efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in a binary search tree.&lt;br /&gt;
# Find a fast deterministic algorithm to compute a square root of a number modulo $p$, where $p$ is a prime with $p \equiv 1 \pmod 4$.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (Major open problem.) In other words, is computing an $e$-th root modulo $N=pq$ (given $\gcd(e, \phi(N)) = 1$) as hard as factoring $N$?&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the cube root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is yes for $e = 2$, but $\gcd(2, \phi(N)) \neq 1$ in that case.)&lt;br /&gt;
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* When proving that some multiplication has conjectural structure constants $C_{a,b}^c$, define $\mu(X_a, X_b) := \sum C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for $X_r$, $r\in S$, where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert ZIP codes to city and state. (First need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulted polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodama and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
For $u \cdot v = w$, where $u$ and $v$ are long cycles, find the probability that $1$ and $2$ lie in the same cycle of $w$.&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ $a_{ij}$&amp;#039;s, where $1\leq i &amp;lt; j \leq n$, such that each $a_{ij}$ occurs between the i-th $x$ and the j-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
&lt;br /&gt;
*The n-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary n-tuples), with an edge between u and v if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where c(G) is the number of spanning trees of G.&lt;br /&gt;
&lt;br /&gt;
*Find a bijection between a staircase tableaux with no $\delta$ of size $n$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding volumes (Ehrhart polynomials) of valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$, where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$, is the minimal distributive lattice with the property that it has antichains that have one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley, http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* The Diffie-Hellman key exchange protocol provides a &amp;quot;non-interactive&amp;quot; way to exchange keys between two people. Find such a protocol for general $n$ people. (The $n=3$ case was solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot;, http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# representation theory of quivers: http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=499</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=499"/>
		<updated>2017-07-15T20:07:41Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Plan */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Plan ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Life!!Long Term!!Short Term&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Family&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Wealth&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Health&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time, or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequences and time complexity: http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find an efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in a binary search tree.&lt;br /&gt;
# Find a fast deterministic algorithm to compute a square root of a number modulo $p$, where $p$ is a prime with $p \equiv 1 \pmod 4$.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (Major open problem.) In other words, is computing an $e$-th root modulo $N=pq$ (given $\gcd(e, \phi(N)) = 1$) as hard as factoring $N$?&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the cube root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is yes for $e = 2$, but $\gcd(2, \phi(N)) \neq 1$ in that case.)&lt;br /&gt;
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* When proving that some multiplication has conjectural structure constants $C_{a,b}^c$, define $\mu(X_a, X_b) := \sum C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for $X_r$, $r\in S$, where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert ZIP codes to city and state. (First need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulted polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodama and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
For $u \cdot v = w$, where $u$ and $v$ are long cycles, find the probability that $1$ and $2$ lie in the same cycle of $w$.&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ $a_{ij}$&amp;#039;s, where $1\leq i &amp;lt; j \leq n$, such that each $a_{ij}$ occurs between the i-th $x$ and the j-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
&lt;br /&gt;
*The $n$-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary $n$-tuples), with an edge between $u$ and $v$ if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where $c(G)$ is the number of spanning trees of $G$.&lt;br /&gt;
&lt;br /&gt;
*Find a bijection between staircase tableaux of size $n$ with no $\delta$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding the volume (Ehrhart polynomial) of the valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$, where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$, is the minimal distributive lattice with the property that it has an antichain with one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* The Diffie-Hellman key exchange protocol provides a &amp;quot;non-interactive&amp;quot; way to exchange keys between two people. Find such a protocol for general $n$ people. ($n=3$ case solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot; http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
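The spanning-tree count for the $n$-cube listed above can be checked against Kirchhoff's Matrix-Tree theorem. A minimal sketch (numpy assumed; $n = 3$; all names are mine):

```python
import numpy as np
from math import comb
from itertools import product

n = 3
verts = list(product([0, 1], repeat=n))
idx = {v: i for i, v in enumerate(verts)}

# Graph Laplacian of the n-cube: degree n on the diagonal,
# -1 for each pair of vertices differing in exactly one coordinate.
L = np.zeros((2**n, 2**n))
for v in verts:
    for k in range(n):
        w = list(v)
        w[k] ^= 1
        L[idx[v]][idx[tuple(w)]] = -1.0
    L[idx[v]][idx[v]] = float(n)

# Matrix-Tree theorem: the spanning-tree count is any cofactor of L.
trees = round(np.linalg.det(L[1:, 1:]))

# Closed form claimed above: c(C_n) = 2^(2^n - n - 1) * prod_k k^C(n,k).
formula = 2**(2**n - n - 1)
for k in range(1, n + 1):
    formula *= k**comb(n, k)

print(trees, formula)  # both 384 for n = 3
```

For $n = 3$ both numbers come out to 384, matching the product formula.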
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# representation theory of quivers http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=498</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=498"/>
		<updated>2017-07-15T20:03:36Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Plan */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Plan ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Life!!Long Term!!Short Term&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Bread &amp;amp; Butter&lt;br /&gt;
|Pie&lt;br /&gt;
|Buns&lt;br /&gt;
|Danish&lt;br /&gt;
|colspan=&amp;quot;2&amp;quot;|Croissant&lt;br /&gt;
|-&lt;br /&gt;
|Cheese&lt;br /&gt;
|colspan=&amp;quot;2&amp;quot;|Ice cream&lt;br /&gt;
|Butter&lt;br /&gt;
|Yogurt&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time, or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequence and time complexity http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find an efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in a Binary Search Tree.&lt;br /&gt;
# Find a deterministic (fast) algorithm to compute the square root of a number modulo $p$, where $p$ is a prime and $p \equiv 1 \pmod 4$.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (Major open problem.) In other words, is computing an $e$-th root modulo $N=pq$ (given $\gcd(e, \phi(N)) = 1$) as hard as factoring $N$?&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the 3rd root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is yes for $e = 2$, but $\gcd(2, \phi(N)) \neq 1$ in that case.)&lt;br /&gt;
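A note on the modular square-root item above: the hard open case is $p \equiv 1 \pmod 4$. For contrast, when $p \equiv 3 \pmod 4$ there is a well-known deterministic one-liner, $r = a^{(p+1)/4} \bmod p$. A small illustrative sketch (the function name is mine):

```python
def sqrt_mod(a, p):
    """Deterministic square root of a quadratic residue a modulo a
    prime p with p % 4 == 3. If a is a residue, a^((p-1)/2) = 1, so
    r = a^((p+1)/4) satisfies r^2 = a^((p+1)/2) = a (mod p)."""
    assert p % 4 == 3
    r = pow(a, (p + 1) // 4, p)
    if r * r % p != a % p:
        raise ValueError("a is not a quadratic residue mod p")
    return r

# 2 is a quadratic residue mod 7 (3*3 = 9 = 2 mod 7).
print(sqrt_mod(2, 7))   # 4, since 4*4 = 16 = 2 (mod 7)
print(sqrt_mod(7, 19))  # 11, since 11*11 = 121 = 7 (mod 19)
```

No comparably direct deterministic formula is known for $p \equiv 1 \pmod 4$, which is exactly the open problem stated above.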
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* To prove that a conjectural set of structure constants $C$ really defines a multiplication, define $\mu(X_a, X_b) := \sum_c C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for all $r \in S$, where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert zip codes to city and state. (First need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulted polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodama and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
Let $u \cdot v = w$, where $u$ and $v$ are long cycles; find the probability that 1 and 2 lie in the same cycle of $w$.&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ copies of each $a_{ij}$, where $1\leq i &amp;lt; j \leq n$, such that every copy of $a_{ij}$ occurs between the $i$-th $x$ and the $j$-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
&lt;br /&gt;
*The $n$-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary $n$-tuples), with an edge between $u$ and $v$ if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where $c(G)$ is the number of spanning trees of $G$.&lt;br /&gt;
&lt;br /&gt;
*Find a bijection between staircase tableaux of size $n$ with no $\delta$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding the volume (Ehrhart polynomial) of the valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$, where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$, is the minimal distributive lattice with the property that it has an antichain with one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* The Diffie-Hellman key exchange protocol provides a &amp;quot;non-interactive&amp;quot; way to exchange keys between two people. Find such a protocol for general $n$ people. ($n=3$ case solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot; http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# representation theory of quivers http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=497</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=497"/>
		<updated>2017-07-15T20:02:45Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Plan */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Plan ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Life|Long Term|Short Term&lt;br /&gt;
|-&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Bread &amp;amp; Butter&lt;br /&gt;
|Pie&lt;br /&gt;
|Buns&lt;br /&gt;
|Danish&lt;br /&gt;
|colspan=&amp;quot;2&amp;quot;|Croissant&lt;br /&gt;
|-&lt;br /&gt;
|Cheese&lt;br /&gt;
|colspan=&amp;quot;2&amp;quot;|Ice cream&lt;br /&gt;
|Butter&lt;br /&gt;
|Yogurt&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time, or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequence and time complexity http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find an efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in a Binary Search Tree.&lt;br /&gt;
# Find a deterministic (fast) algorithm to compute the square root of a number modulo $p$, where $p$ is a prime and $p \equiv 1 \pmod 4$.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (Major open problem.) In other words, is computing an $e$-th root modulo $N=pq$ (given $\gcd(e, \phi(N)) = 1$) as hard as factoring $N$?&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the 3rd root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is yes for $e = 2$, but $\gcd(2, \phi(N)) \neq 1$ in that case.)&lt;br /&gt;
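On the $e = 2$ parenthetical above: an oracle returning arbitrary square roots modulo $N$ does let one factor $N$ (Rabin's reduction), since two independent roots $x, y$ of the same square with $y \neq \pm x$ give a factor via $\gcd(x - y, N)$. A toy sketch with an artificial oracle standing in (all names and the tiny modulus are illustrative only):

```python
import random
from math import gcd

def factor_with_sqrt_oracle(N, oracle):
    """If an oracle returns *some* square root mod N = p*q, then
    gcd(x - y, N) splits N with probability 1/2 per attempt."""
    while True:
        x = random.randrange(2, N)
        if gcd(x, N) != 1:
            return gcd(x, N)  # lucky hit: x already shares a factor
        y = oracle(x * x % N)
        if y != x and y != N - x:
            # x and y agree mod exactly one prime factor of N.
            return gcd(x - y, N)

# Toy oracle that "knows" the factors 11 and 19 (N = 209), demo only.
def toy_oracle(s, p=11, q=19):
    rp = next(r for r in range(p) if r * r % p == s % p)
    rq = next(r for r in range(q) if r * r % q == s % q)
    # CRT-combine one fixed root mod p with one fixed root mod q.
    for z in range(p * q):
        if z % p == rp and z % q == rq:
            return z

d = factor_with_sqrt_oracle(209, toy_oracle)
print(d, 209 // d)
```

The open question in the list is whether the analogous reduction exists for $e = 3$, where the oracle's answer is unique and this trick does not apply.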
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* To prove that a conjectural set of structure constants $C$ really defines a multiplication, define $\mu(X_a, X_b) := \sum_c C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for all $r \in S$, where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert zip codes to city and state. (First need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulted polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodama and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
Let $u \cdot v = w$, where $u$ and $v$ are long cycles; find the probability that 1 and 2 lie in the same cycle of $w$.&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ copies of each $a_{ij}$, where $1\leq i &amp;lt; j \leq n$, such that every copy of $a_{ij}$ occurs between the $i$-th $x$ and the $j$-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
&lt;br /&gt;
*The $n$-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary $n$-tuples), with an edge between $u$ and $v$ if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where $c(G)$ is the number of spanning trees of $G$.&lt;br /&gt;
&lt;br /&gt;
*Find a bijection between staircase tableaux of size $n$ with no $\delta$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding the volume (Ehrhart polynomial) of the valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$, where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$, is the minimal distributive lattice with the property that it has an antichain with one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* The Diffie-Hellman key exchange protocol provides a &amp;quot;non-interactive&amp;quot; way to exchange keys between two people. Find such a protocol for general $n$ people. ($n=3$ case solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot; http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# representation theory of quivers http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=496</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=496"/>
		<updated>2017-07-15T20:01:43Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Plan ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|rowspan=&amp;quot;2&amp;quot;|Bread &amp;amp; Butter&lt;br /&gt;
|Pie&lt;br /&gt;
|Buns&lt;br /&gt;
|Danish&lt;br /&gt;
|colspan=&amp;quot;2&amp;quot;|Croissant&lt;br /&gt;
|-&lt;br /&gt;
|Cheese&lt;br /&gt;
|colspan=&amp;quot;2&amp;quot;|Ice cream&lt;br /&gt;
|Butter&lt;br /&gt;
|Yogurt&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time, or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequence and time complexity http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find an efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in a Binary Search Tree.&lt;br /&gt;
# Find a deterministic (fast) algorithm to compute the square root of a number modulo $p$, where $p$ is a prime and $p \equiv 1 \pmod 4$.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (Major open problem.) In other words, is computing an $e$-th root modulo $N=pq$ (given $\gcd(e, \phi(N)) = 1$) as hard as factoring $N$?&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the 3rd root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is yes for $e = 2$, but $\gcd(2, \phi(N)) \neq 1$ in that case.)&lt;br /&gt;
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* To prove that a conjectural set of structure constants $C$ really defines a multiplication, define $\mu(X_a, X_b) := \sum_c C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for all $r \in S$, where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert zip codes to city and state. (First need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulted polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodama and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
$u \cdot v = w$ where $u$ and $v$ are long cycles; find the probability that 1 and 2 are in the same cycle of $w$.&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ $a_{ij}$&amp;#039;s, where $1\leq i &amp;lt; j \leq n$, such that each $a_{ij}$ occurs between the $i$-th $x$ and the $j$-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
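The product formula can be sanity-checked numerically (a sketch; the function name is mine). Note that for $n = 2$ every valid sequence is forced: all $2t$ copies of $a_{12}$ must sit between the two $x$'s, so $f(2, t) = 1$ for every $t$, and the formula should reproduce that.

```python
from fractions import Fraction
from math import comb, factorial

def f(n, t):
    """Evaluate the conjectured product formula for f(n, t) exactly."""
    val = Fraction(factorial(n + t * n * (n - 1)),
                   factorial(n) * factorial(t) ** n
                   * factorial(2 * t) ** comb(n, 2))
    for j in range(1, n + 1):
        val *= Fraction(factorial((j - 1) * t) ** 2 * factorial(j * t),
                        factorial(1 + (n + j - 2) * t))
    assert val.denominator == 1  # a count must be an integer
    return int(val)
```

For instance, `f(2, t)` returns 1 for every small `t`, matching the forced sequence x a … a x.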
&lt;br /&gt;
*The $n$-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary $n$-tuples), with an edge between $u$ and $v$ if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where $c(G)$ is the number of spanning trees of $G$.&lt;br /&gt;
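The closed form can be checked against the Matrix-Tree theorem for small $n$ (a self-contained sketch; function names are mine, and exact rational arithmetic avoids floating-point determinants):

```python
from fractions import Fraction
from math import comb, prod

def cube_trees_formula(n):
    """Closed-form count of spanning trees of the n-cube."""
    return 2 ** (2 ** n - n - 1) * prod(k ** comb(n, k)
                                        for k in range(1, n + 1))

def cube_trees_matrix_tree(n):
    """Matrix-Tree theorem: delete vertex 0's row and column from the
    Laplacian of C_n and take the determinant, exactly over Fractions."""
    m = 2 ** n
    L = [[Fraction(0)] * (m - 1) for _ in range(m - 1)]
    for v in range(1, m):
        L[v - 1][v - 1] = Fraction(n)      # every vertex has degree n
        for i in range(n):
            u = v ^ 2 ** i                 # flip one coordinate
            if u != 0:                     # skip the deleted vertex 0
                L[v - 1][u - 1] = Fraction(-1)
    det = Fraction(1)
    for col in range(m - 1):               # Gaussian elimination
        piv = next(r for r in range(col, m - 1) if L[r][col] != 0)
        if piv != col:
            L[col], L[piv] = L[piv], L[col]
            det = -det
        det *= L[col][col]
        for r in range(col + 1, m - 1):
            q = L[r][col] / L[col][col]
            if q != 0:
                for c in range(col, m - 1):
                    L[r][c] -= q * L[col][c]
    assert det.denominator == 1
    return int(det)
```

Both agree on the small cases, e.g. $c(C_2) = 4$ (the 4-cycle) and $c(C_3) = 384$.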
&lt;br /&gt;
*Find a bijection between staircase tableaux of size $n$ with no $\delta$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding the volumes (Ehrhart polynomials) of the valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$ where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$ is the minimal distributive lattice with the property that it has antichains that have one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* The Diffie-Hellman key exchange protocol provides a &amp;quot;non-interactive&amp;quot; way to exchange keys between two people. Find such a protocol for general $n$ people. ($n=3$ case solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot; http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# repn theory of quivers http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=495</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=495"/>
		<updated>2017-02-19T06:52:00Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= The Basics of Spanish =&lt;br /&gt;
&lt;br /&gt;
== Gender of Nouns ==&lt;br /&gt;
&lt;br /&gt;
=== The general rule ===&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
=== Masculine to Feminine ===&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional Examples. In some cases, the noun itself does not change, but the article changes according to whom it refers.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
=== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ===&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (issue)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
=== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ===&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
=== More Feminine Noun rules ===&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;br /&gt;
&lt;br /&gt;
== Plural Forms of Nouns ==&lt;br /&gt;
&lt;br /&gt;
=== Definite Articles in the Plural Form ===&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;el&amp;quot; becomes &amp;quot;los&amp;quot;&lt;br /&gt;
* &amp;quot;la&amp;quot; becomes &amp;quot;las&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Vowel Ending ===&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a vowel, then it becomes plural by adding the letter &amp;quot;s&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* casa (house) =&amp;gt; casas&lt;br /&gt;
* el libro (book) =&amp;gt; los libros&lt;br /&gt;
* la pluma (pen) =&amp;gt; las plumas&lt;br /&gt;
* el chico (boy) =&amp;gt; los chicos&lt;br /&gt;
* la señora (woman; used for older women) =&amp;gt; las señoras&lt;br /&gt;
&lt;br /&gt;
=== Consonant Ending ===&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a consonant, it becomes plural by adding &amp;quot;es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* país (country) =&amp;gt; países&lt;br /&gt;
* el borrador (draft) =&amp;gt; los borradores&lt;br /&gt;
* la universidad (university) =&amp;gt; las universidades&lt;br /&gt;
* el profesor (teacher) =&amp;gt; los profesores&lt;br /&gt;
* la ciudad (city) =&amp;gt; las ciudades&lt;br /&gt;
&lt;br /&gt;
=== &amp;quot;-ión&amp;quot; Ending ===&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-ión&amp;quot;, add &amp;quot;-es&amp;quot; and drop the written accent.&lt;br /&gt;
&lt;br /&gt;
* el avión (aeroplane) =&amp;gt; los aviones (Note. this is &amp;#039;&amp;#039;masculine&amp;#039;&amp;#039;, an exception to the rule above.)&lt;br /&gt;
* la conversación (conversation) =&amp;gt; las conversaciones&lt;br /&gt;
* la sección (section) =&amp;gt; las secciones&lt;br /&gt;
* la televisión (TV) =&amp;gt; las televisiones&lt;br /&gt;
&lt;br /&gt;
=== &amp;quot;-z&amp;quot; Ending ===&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-z&amp;quot;, change the &amp;quot;z&amp;quot; to &amp;quot;c&amp;quot; and add &amp;quot;-es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* el lápiz =&amp;gt; los lápices&lt;br /&gt;
* la voz (voice) =&amp;gt; las voces&lt;br /&gt;
* el tapiz (tapestry) =&amp;gt; los tapices&lt;br /&gt;
* la actriz (actress) =&amp;gt; las actrices&lt;br /&gt;
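The pluralization rules above can be collected into a small rule-based function (a sketch in Python; the function name is mine, and irregular nouns and stress-shifting cases are ignored):

```python
def pluralize(noun):
    """Pluralize a Spanish noun using the rules above (sketch only:
    ignores irregulars such as régimen/regímenes)."""
    if noun.endswith("ión"):
        return noun[:-3] + "iones"  # drop the written accent, add -es
    if noun.endswith("z"):
        return noun[:-1] + "ces"    # z changes to c before -es
    if noun[-1] in "aeiouáéíóú":
        return noun + "s"           # vowel ending: add -s
    return noun + "es"              # consonant ending: add -es
```

For example, `pluralize("lápiz")` gives "lápices" and `pluralize("conversación")` gives "conversaciones".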
&lt;br /&gt;
=== Masculine and Feminine Plural Nouns Combined ===&lt;br /&gt;
&lt;br /&gt;
When the plural refers to two or more nouns of different genders, the masculine plural is used.&lt;br /&gt;
&lt;br /&gt;
* 1 camello (camel) + 4 camellas = 5 camellos&lt;br /&gt;
&lt;br /&gt;
=== Compound Nouns ===&lt;br /&gt;
&lt;br /&gt;
Compound nouns are always masculine; therefore, in the plural form the definite article changes from &amp;quot;el&amp;quot; to &amp;quot;los&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* abre (open) + latas (cans) = abrelatas (can opener)&lt;br /&gt;
** el abrelatas =&amp;gt; los abrelatas&lt;br /&gt;
&lt;br /&gt;
== Definite Articles ==&lt;br /&gt;
&lt;br /&gt;
el, la, los, las.&lt;br /&gt;
&lt;br /&gt;
The definite article in Spanish must agree with the gender and number of the noun that follows it.&lt;br /&gt;
&lt;br /&gt;
=== Spanish Definite Article ===&lt;br /&gt;
&lt;br /&gt;
* El colegio = The school&lt;br /&gt;
&lt;br /&gt;
* La manzana = The apple&lt;br /&gt;
&lt;br /&gt;
* Los chicos = The boys&lt;br /&gt;
&lt;br /&gt;
* Las chicas = The girls&lt;br /&gt;
&lt;br /&gt;
=== Spanish&amp;#039;s Unique Uses of the Definite Article ===&lt;br /&gt;
&lt;br /&gt;
The following topics require the definite article:&lt;br /&gt;
&lt;br /&gt;
* Days of the week&lt;br /&gt;
* Seasons of the year&lt;br /&gt;
* With more than one noun&lt;br /&gt;
* Before the names of some countries&lt;br /&gt;
* With generic nouns&lt;br /&gt;
* With names of languages&lt;br /&gt;
* With clothing, body parts and other personal items&lt;br /&gt;
* With infinitives used as subjects&lt;br /&gt;
* Before names of streets&lt;br /&gt;
* With personal titles&lt;br /&gt;
* In certain set phrases&lt;br /&gt;
&lt;br /&gt;
=== Days of the Week ===&lt;br /&gt;
&lt;br /&gt;
* el lunes (Monday), el jueves (Thursday)&lt;br /&gt;
* Note that the days of the week are all masculine because they all refer to &amp;quot;el día&amp;quot; (day) which is masculine.&lt;br /&gt;
* The days of the week are preceded by either &amp;quot;el&amp;quot; or &amp;quot;los&amp;quot; depending on whether the day is singular or plural.&lt;br /&gt;
** Voy a la biblioteca el viernes = I am going to the library on Friday.&lt;br /&gt;
** Voy a la biblioteca los viernes = I go to the library on Fridays.&lt;br /&gt;
* When using &amp;quot;ser&amp;quot; which means &amp;quot;to be&amp;quot;, the definite article is not used.&lt;br /&gt;
** Today is Sunday = Hoy es domingo.&lt;br /&gt;
&lt;br /&gt;
=== Seasons of the Year ===&lt;br /&gt;
&lt;br /&gt;
* Seasons usually have the definite article.&lt;br /&gt;
** Prefiero los veranos = I prefer summers. (preferir = to prefer)&lt;br /&gt;
* The article is optional after &amp;quot;de&amp;quot;, &amp;quot;en&amp;quot;, or a form of &amp;quot;ser&amp;quot;.&lt;br /&gt;
** No quiero estudiar en invierno = I do not want to study in the winter.&lt;br /&gt;
&lt;br /&gt;
=== With More Than One Noun ===&lt;br /&gt;
&lt;br /&gt;
* Hoy, leo el libro y la revista = Today I am reading the book and the magazine.&lt;br /&gt;
* Quiero la camisa y los pantalones = I want the shirt and trousers.&lt;br /&gt;
&lt;br /&gt;
=== Generic Nouns ===&lt;br /&gt;
&lt;br /&gt;
* This refers to nouns naming a concept, substance, or idea in general, not a specific one.&lt;br /&gt;
** Los árabes son ricos = Arabs are rich.&lt;br /&gt;
** La manzana es sana = Apples are healthy.&lt;br /&gt;
&lt;br /&gt;
* Exception: The article is omitted after the preposition &amp;quot;de&amp;quot; which could be used to describe possession, or when the noun following &amp;quot;de&amp;quot; describes the first noun and therefore doesn&amp;#039;t refer to a specific person/thing.&lt;br /&gt;
** Los zapatos de mujeres = women&amp;#039;s shoes.&lt;br /&gt;
** Los zapatos de las mujeres = the shoes of the women.&lt;br /&gt;
&lt;br /&gt;
=== Names of Languages ===&lt;br /&gt;
&lt;br /&gt;
* El francés = French&lt;br /&gt;
* El alemán = German&lt;br /&gt;
* El español = Spanish&lt;br /&gt;
&lt;br /&gt;
* Exceptions&lt;br /&gt;
*# When the language immediately follows &amp;quot;en&amp;quot;, which means &amp;quot;in&amp;quot;.&lt;br /&gt;
*#* En español... = In Spanish...&lt;br /&gt;
*# For a verb that is often used with languages, such as &amp;quot;hablar&amp;quot; (to speak), &amp;quot;aprender&amp;quot; (to learn), &amp;quot;estudiar&amp;quot; (to study), and &amp;quot;saber&amp;quot; (to know).&lt;br /&gt;
*#* Hablo español = I speak Spanish.&lt;br /&gt;
*# The article is required after an adverb/preposition other than en.&lt;br /&gt;
*#* Hablo bien el español = I speak Spanish well.&lt;br /&gt;
*# The article is required when used with other verbs.&lt;br /&gt;
*#* Prefiero el francés = I prefer French.&lt;br /&gt;
&lt;br /&gt;
=== Clothing, Human Body and Personal Items ===&lt;br /&gt;
&lt;br /&gt;
* In Spanish, it is normal to use the definite article where English uses the possessive adjective.&lt;br /&gt;
** ¡Abre los ojos! = Open your eyes!&lt;br /&gt;
** Perdió los zapatos = He lost his shoes.&lt;br /&gt;
&lt;br /&gt;
=== Verb Infinitives ===&lt;br /&gt;
&lt;br /&gt;
The infinitive in English is the verb&amp;#039;s base form, which translates as &amp;quot;to x&amp;quot;. For example, &amp;quot;comprar&amp;quot; is the infinitive form of the verb meaning &amp;quot;to buy&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
In Spanish, we can say &amp;quot;smoking is prohibited&amp;quot; using the verb infinitive and the definite article.&lt;br /&gt;
&lt;br /&gt;
* El fumar está prohibido = Smoking is prohibited.&lt;br /&gt;
&lt;br /&gt;
=== Prior to the Names of Certain Countries ===&lt;br /&gt;
&lt;br /&gt;
* Only for certain countries:&lt;br /&gt;
** For example, el Reino Unido (the UK), la India (India), la China (China)&lt;br /&gt;
&lt;br /&gt;
* Voy a México = I am going to Mexico.&lt;br /&gt;
* We use the definite article when there is an adjective describing the noun&lt;br /&gt;
** Voy al México bello = I am going to beautiful Mexico.&lt;br /&gt;
* Note. &amp;quot;al&amp;quot; = &amp;quot;a&amp;quot; + &amp;quot;el&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== Street Names ===&lt;br /&gt;
&lt;br /&gt;
* Streets, avenues, plazas and similar places are usually preceded by a definite article.&lt;br /&gt;
** La Casa Blanca es la residencia de Barack Obama = The White House is the residence of Barack Obama.&lt;br /&gt;
&lt;br /&gt;
=== Personal Titles ===&lt;br /&gt;
&lt;br /&gt;
* A definite article is used before personal titles when you are talking about people, not when you are speaking directly to them.&lt;br /&gt;
** La señora Lopez está ocupada = Mrs. Lopez is busy.&lt;br /&gt;
** ¿Señor Jones, estás bien? = Mr. Jones, are you alright?&lt;br /&gt;
&lt;br /&gt;
=== Certain Set Phrases ===&lt;br /&gt;
&lt;br /&gt;
These are set expressions/phrases that cannot be changed. These usually involve locations/places.&lt;br /&gt;
&lt;br /&gt;
* En el espacio (in space)&lt;br /&gt;
* En la televisión (on television)&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=494</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=494"/>
		<updated>2017-02-19T06:50:26Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional Examples. In some cases, the noun itself does not change, but the article changes according to whom it refers.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (issue)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
==== More Feminine Noun rules ====&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;br /&gt;
&lt;br /&gt;
=== Plural Forms of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== Definite Articles in the Plural Form ====&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;el&amp;quot; becomes &amp;quot;los&amp;quot;&lt;br /&gt;
* &amp;quot;la&amp;quot; becomes &amp;quot;las&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==== Vowel Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a vowel, then it becomes plural by adding the letter &amp;quot;s&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* casa (house) =&amp;gt; casas&lt;br /&gt;
* el libro (book) =&amp;gt; los libros&lt;br /&gt;
* la pluma (pen) =&amp;gt; las plumas&lt;br /&gt;
* el chico (boy) =&amp;gt; los chicos&lt;br /&gt;
* la señora (woman; used for older women) =&amp;gt; las señoras&lt;br /&gt;
&lt;br /&gt;
==== Consonant Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a consonant, it becomes plural by adding &amp;quot;es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* país (country) =&amp;gt; países&lt;br /&gt;
* el borrador (draft) =&amp;gt; los borradores&lt;br /&gt;
* la universidad (university) =&amp;gt; las universidades&lt;br /&gt;
* el profesor (teacher) =&amp;gt; los profesores&lt;br /&gt;
* la ciudad (city) =&amp;gt; las ciudades&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-ión&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-ión&amp;quot;, add &amp;quot;-es&amp;quot; and drop the written accent.&lt;br /&gt;
&lt;br /&gt;
* el avión (aeroplane) =&amp;gt; los aviones (Note. this is &amp;#039;&amp;#039;masculine&amp;#039;&amp;#039;, an exception to the rule above.)&lt;br /&gt;
* la conversación (conversation) =&amp;gt; las conversaciones&lt;br /&gt;
* la sección (section) =&amp;gt; las secciones&lt;br /&gt;
* la televisión (TV) =&amp;gt; las televisiones&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-z&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-z&amp;quot;, change the &amp;quot;z&amp;quot; to &amp;quot;c&amp;quot; and add &amp;quot;-es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* el lápiz =&amp;gt; los lápices&lt;br /&gt;
* la voz (voice) =&amp;gt; las voces&lt;br /&gt;
* el tapiz (tapestry) =&amp;gt; los tapices&lt;br /&gt;
* la actriz (actress) =&amp;gt; las actrices&lt;br /&gt;
&lt;br /&gt;
==== Masculine and Feminine Plural Nouns Combined ====&lt;br /&gt;
&lt;br /&gt;
When the plural refers to two or more nouns of different genders, the masculine plural is used.&lt;br /&gt;
&lt;br /&gt;
* 1 camello (camel) + 4 camellas = 5 camellos&lt;br /&gt;
&lt;br /&gt;
==== Compound Nouns ====&lt;br /&gt;
&lt;br /&gt;
Compound nouns are always masculine; therefore, in the plural form the definite article changes from &amp;quot;el&amp;quot; to &amp;quot;los&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* abre (open) + latas (cans) = abrelatas (can opener)&lt;br /&gt;
** el abrelatas =&amp;gt; los abrelatas&lt;br /&gt;
&lt;br /&gt;
=== Definite Articles ===&lt;br /&gt;
&lt;br /&gt;
el, la, los, las.&lt;br /&gt;
&lt;br /&gt;
The definite article in Spanish must agree with the gender and number of the noun that follows it.&lt;br /&gt;
&lt;br /&gt;
==== Spanish Definite Article ====&lt;br /&gt;
&lt;br /&gt;
* El colegio = The school&lt;br /&gt;
&lt;br /&gt;
* La manzana = The apple&lt;br /&gt;
&lt;br /&gt;
* Los chicos = The boys&lt;br /&gt;
&lt;br /&gt;
* Las chicas = The girls&lt;br /&gt;
&lt;br /&gt;
==== Spanish&amp;#039;s Unique Uses of the Definite Article ====&lt;br /&gt;
&lt;br /&gt;
The following topics require the definite article:&lt;br /&gt;
&lt;br /&gt;
* Days of the week&lt;br /&gt;
* Seasons of the year&lt;br /&gt;
* With more than one noun&lt;br /&gt;
* Before the names of some countries&lt;br /&gt;
* With generic nouns&lt;br /&gt;
* With names of languages&lt;br /&gt;
* With clothing, body parts and other personal items&lt;br /&gt;
* With infinitives used as subjects&lt;br /&gt;
* Before names of streets&lt;br /&gt;
* With personal titles&lt;br /&gt;
* In certain set phrases&lt;br /&gt;
&lt;br /&gt;
==== Days of the Week ====&lt;br /&gt;
&lt;br /&gt;
* el lunes (Monday), el jueves (Thursday)&lt;br /&gt;
* Note that the days of the week are all masculine because they all refer to &amp;quot;el día&amp;quot; (day) which is masculine.&lt;br /&gt;
* The days of the week are preceded by either &amp;quot;el&amp;quot; or &amp;quot;los&amp;quot; depending on whether the day is singular or plural.&lt;br /&gt;
** Voy a la biblioteca el viernes = I am going to the library on Friday.&lt;br /&gt;
** Voy a la biblioteca los viernes = I go to the library on Fridays.&lt;br /&gt;
* When using &amp;quot;ser&amp;quot; which means &amp;quot;to be&amp;quot;, the definite article is not used.&lt;br /&gt;
** Today is Sunday = Hoy es domingo.&lt;br /&gt;
&lt;br /&gt;
==== Seasons of the Year ====&lt;br /&gt;
&lt;br /&gt;
* Seasons usually have the definite article.&lt;br /&gt;
** Prefiero los veranos = I prefer summers. (preferir = to prefer)&lt;br /&gt;
* The article is optional after &amp;quot;de&amp;quot;, &amp;quot;en&amp;quot;, or a form of &amp;quot;ser&amp;quot;.&lt;br /&gt;
** No quiero estudiar en invierno = I do not want to study in the winter.&lt;br /&gt;
&lt;br /&gt;
==== With More Than One Noun ====&lt;br /&gt;
&lt;br /&gt;
* Hoy, leo el libro y la revista = Today I am reading the book and the magazine.&lt;br /&gt;
* Quiero la camisa y los pantalones = I want the shirt and trousers.&lt;br /&gt;
&lt;br /&gt;
==== Generic Nouns ====&lt;br /&gt;
&lt;br /&gt;
* This refers to nouns naming a concept, substance, or idea in general, not a specific one.&lt;br /&gt;
** Los árabes son ricos = Arabs are rich.&lt;br /&gt;
** La manzana es sana = Apples are healthy.&lt;br /&gt;
&lt;br /&gt;
* Exception: The article is omitted after the preposition &amp;quot;de&amp;quot; which could be used to describe possession, or when the noun following &amp;quot;de&amp;quot; describes the first noun and therefore doesn&amp;#039;t refer to a specific person/thing.&lt;br /&gt;
** Los zapatos de mujeres = women&amp;#039;s shoes.&lt;br /&gt;
** Los zapatos de las mujeres = the shoes of the women.&lt;br /&gt;
&lt;br /&gt;
==== Names of Languages ====&lt;br /&gt;
&lt;br /&gt;
* El francés = French&lt;br /&gt;
* El alemán = German&lt;br /&gt;
* El español = Spanish&lt;br /&gt;
&lt;br /&gt;
* Exceptions&lt;br /&gt;
*# When the language immediately follows &amp;quot;en&amp;quot;, which means &amp;quot;in&amp;quot;.&lt;br /&gt;
*#* En español... = In Spanish...&lt;br /&gt;
*# For a verb that is often used with languages, such as &amp;quot;hablar&amp;quot; (to speak), &amp;quot;aprender&amp;quot; (to learn), &amp;quot;estudiar&amp;quot; (to study), and &amp;quot;saber&amp;quot; (to know).&lt;br /&gt;
*#* Hablo español = I speak Spanish.&lt;br /&gt;
*# The article is required after an adverb/preposition other than en.&lt;br /&gt;
*#* Hablo bien el español = I speak Spanish well.&lt;br /&gt;
*# The article is required when used with other verbs.&lt;br /&gt;
*#* Prefiero el francés = I prefer French.&lt;br /&gt;
&lt;br /&gt;
==== Clothing, Human Body and Personal Items ====&lt;br /&gt;
&lt;br /&gt;
* In Spanish, it is normal to use the definite article where English uses the possessive adjective.&lt;br /&gt;
** ¡Abre los ojos! = Open your eyes!&lt;br /&gt;
** Perdió los zapatos = He lost his shoes.&lt;br /&gt;
&lt;br /&gt;
==== Verb Infinitives ====&lt;br /&gt;
&lt;br /&gt;
The infinitive in English is the verb&amp;#039;s base form, which translates as &amp;quot;to x&amp;quot;. For example, &amp;quot;comprar&amp;quot; is the infinitive form of the verb meaning &amp;quot;to buy&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
In Spanish, we can say &amp;quot;smoking is prohibited&amp;quot; using the verb infinitive and the definite article.&lt;br /&gt;
&lt;br /&gt;
* El fumar está prohibido = Smoking is prohibited.&lt;br /&gt;
&lt;br /&gt;
==== Prior to the Names of Certain Countries ====&lt;br /&gt;
&lt;br /&gt;
* Only for certain countries:&lt;br /&gt;
** For example, el Reino Unido (the UK), la India (India), la China (China)&lt;br /&gt;
&lt;br /&gt;
* Voy a México = I am going to Mexico.&lt;br /&gt;
* We use the definite article when there is an adjective describing the noun&lt;br /&gt;
** Voy al México bello = I am going to beautiful Mexico.&lt;br /&gt;
* Note. &amp;quot;al&amp;quot; = &amp;quot;a&amp;quot; + &amp;quot;el&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
==== Street Names ====&lt;br /&gt;
&lt;br /&gt;
* Streets, avenues, plazas and similar places are usually preceded by a definite article.&lt;br /&gt;
** La Casa Blanca es la residencia de Barack Obama = The White House is the residence of Barack Obama.&lt;br /&gt;
&lt;br /&gt;
==== Personal Titles ====&lt;br /&gt;
&lt;br /&gt;
* A definite article is used before personal titles when you are talking about people, not when you are speaking directly to them.&lt;br /&gt;
** La señora Lopez está ocupada = Mrs. Lopez is busy.&lt;br /&gt;
** ¿Señor Jones, estás bien? = Mr. Jones, are you alright?&lt;br /&gt;
&lt;br /&gt;
==== Certain Set Phrases ====&lt;br /&gt;
&lt;br /&gt;
These are set expressions/phrases that cannot be changed. These usually involve locations/places.&lt;br /&gt;
&lt;br /&gt;
* En el espacio (in space)&lt;br /&gt;
* En la televisión (on television)&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=493</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=493"/>
		<updated>2017-02-19T06:49:29Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= The Basics of Spanish =&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional examples: in some cases the noun itself does not change, but the article changes according to the person it refers to.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (topic, theme)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
==== More Feminine Noun rules ====&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;br /&gt;
&lt;br /&gt;
=== Plural Forms of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== Definite Articles in the Plural Form ====&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;el&amp;quot; becomes &amp;quot;los&amp;quot;&lt;br /&gt;
* &amp;quot;la&amp;quot; becomes &amp;quot;las&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==== Vowel Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a vowel, then it becomes plural by adding the letter &amp;quot;s&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* casa (house) =&amp;gt; casas&lt;br /&gt;
* el libro (book) =&amp;gt; los libros&lt;br /&gt;
* la pluma (pen) =&amp;gt; las plumas&lt;br /&gt;
* el chico (boy) =&amp;gt; los chicos&lt;br /&gt;
* la señora (lady; used for older or married women) =&amp;gt; las señoras&lt;br /&gt;
&lt;br /&gt;
==== Consonant Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a consonant, it becomes plural by adding &amp;quot;es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* país (country) =&amp;gt; países&lt;br /&gt;
* el borrador (draft) =&amp;gt; los borradores&lt;br /&gt;
* la universidad (university) =&amp;gt; las universidades&lt;br /&gt;
* el profesor (teacher) =&amp;gt; los profesores&lt;br /&gt;
* la ciudad (city) =&amp;gt; las ciudades&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-ión&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-ión&amp;quot;, add &amp;quot;-es&amp;quot; and drop the written accent.&lt;br /&gt;
&lt;br /&gt;
* el avión (aeroplane) =&amp;gt; los aviones (Note: this word is &amp;#039;&amp;#039;masculine&amp;#039;&amp;#039;, an exception to the rule above.)&lt;br /&gt;
* la conversación (conversation) =&amp;gt; las conversaciones&lt;br /&gt;
* la sección (section) =&amp;gt; las secciones&lt;br /&gt;
* la televisión (TV) =&amp;gt; las televisiones&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-z&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-z&amp;quot;, change the &amp;quot;z&amp;quot; to &amp;quot;c&amp;quot; and add &amp;quot;-es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* el lápiz =&amp;gt; los lápices&lt;br /&gt;
* la voz (voice) =&amp;gt; las voces&lt;br /&gt;
* el tapiz (tapestry) =&amp;gt; los tapices&lt;br /&gt;
* la actriz (actress) =&amp;gt; las actrices&lt;br /&gt;
&lt;br /&gt;
==== Masculine and Feminine Plural Nouns Combined ====&lt;br /&gt;
&lt;br /&gt;
When the plural refers to two or more nouns of different genders, the masculine plural is used.&lt;br /&gt;
&lt;br /&gt;
* 1 camello (camel) + 4 camellas = 5 camellos&lt;br /&gt;
&lt;br /&gt;
==== Compound Nouns ====&lt;br /&gt;
&lt;br /&gt;
Compound nouns are always masculine; therefore, in the plural the definite article changes from &amp;quot;el&amp;quot; to &amp;quot;los&amp;quot; while the noun itself stays the same.&lt;br /&gt;
&lt;br /&gt;
* abre (open) + latas (cans) = abrelatas (can opener)&lt;br /&gt;
** el abrelatas =&amp;gt; los abrelatas&lt;br /&gt;
&lt;br /&gt;
=== Definite Articles ===&lt;br /&gt;
&lt;br /&gt;
el, la, los, las.&lt;br /&gt;
&lt;br /&gt;
The definite article in Spanish must agree with the gender and number of the noun that follows it.&lt;br /&gt;
&lt;br /&gt;
==== Spanish Definite Article ====&lt;br /&gt;
&lt;br /&gt;
* El colegio = The school&lt;br /&gt;
&lt;br /&gt;
* La manzana = The apple&lt;br /&gt;
&lt;br /&gt;
* Los chicos = The boys&lt;br /&gt;
&lt;br /&gt;
* Las chicas = The girls&lt;br /&gt;
&lt;br /&gt;
==== Spanish&amp;#039;s Unique Uses of the Definite Article ====&lt;br /&gt;
&lt;br /&gt;
The following topics require the definite article:&lt;br /&gt;
&lt;br /&gt;
* Days of the week&lt;br /&gt;
* Seasons of the year&lt;br /&gt;
* With more than one noun&lt;br /&gt;
* Before the names of some countries&lt;br /&gt;
* With generic nouns&lt;br /&gt;
* With names of languages&lt;br /&gt;
* With clothing, body parts and other personal items&lt;br /&gt;
* With infinitives used as subjects&lt;br /&gt;
* Before names of streets&lt;br /&gt;
* With personal titles&lt;br /&gt;
* In certain set phrases&lt;br /&gt;
&lt;br /&gt;
==== Days of the Week ====&lt;br /&gt;
&lt;br /&gt;
* el lunes (Monday), el jueves (Thursday)&lt;br /&gt;
* Note that the days of the week are all masculine because they all refer to &amp;quot;el día&amp;quot; (day) which is masculine.&lt;br /&gt;
* The days of the week are preceded by either &amp;quot;el&amp;quot; or &amp;quot;los&amp;quot; depending on whether the day is singular or plural.&lt;br /&gt;
** Voy a la biblioteca el viernes = I am going to the library on Friday.&lt;br /&gt;
** Voy a la biblioteca los viernes = I go to the library on Fridays.&lt;br /&gt;
* When using &amp;quot;ser&amp;quot; (&amp;quot;to be&amp;quot;), the definite article is not used.&lt;br /&gt;
** Today is Sunday = Hoy es domingo.&lt;br /&gt;
&lt;br /&gt;
==== Season of the Year ====&lt;br /&gt;
&lt;br /&gt;
* Seasons usually have the definite article.&lt;br /&gt;
** Prefiero los veranos = I prefer summers. (preferir = to prefer)&lt;br /&gt;
* The article is optional after &amp;quot;de&amp;quot;, &amp;quot;en&amp;quot;, or a form of &amp;quot;ser&amp;quot;.&lt;br /&gt;
** No quiero estudiar en invierno = I do not want to study in the winter.&lt;br /&gt;
&lt;br /&gt;
==== With More Than One Noun ====&lt;br /&gt;
&lt;br /&gt;
* Hoy, leo el libro y la revista = Today I am reading the book and the magazine.&lt;br /&gt;
* Quiero la camisa y los pantalones = I want the shirt and trousers.&lt;br /&gt;
&lt;br /&gt;
==== Generic Nouns ====&lt;br /&gt;
&lt;br /&gt;
* These are nouns that refer to a concept, substance, or idea in general, not to a specific one.&lt;br /&gt;
** Los árabes son ricos = Arabs are rich.&lt;br /&gt;
** La manzana es sana = Apples are healthy. (generic use of the article)&lt;br /&gt;
&lt;br /&gt;
* Exception: The article is omitted after the preposition &amp;quot;de&amp;quot; which could be used to describe possession, or when the noun following &amp;quot;de&amp;quot; describes the first noun and therefore doesn&amp;#039;t refer to a specific person/thing.&lt;br /&gt;
** Los zapatos de mujeres = women&amp;#039;s shoes.&lt;br /&gt;
** Los zapatos de las mujeres = the shoes of the women.&lt;br /&gt;
&lt;br /&gt;
==== Names of Languages ====&lt;br /&gt;
&lt;br /&gt;
* El francés = French&lt;br /&gt;
* El alemán = German&lt;br /&gt;
* El español = Spanish&lt;br /&gt;
&lt;br /&gt;
* Exceptions&lt;br /&gt;
*# When the language immediately follows &amp;quot;en&amp;quot;, which means &amp;quot;in&amp;quot;.&lt;br /&gt;
*#* En español... = In Spanish...&lt;br /&gt;
*# After verbs often used with languages, such as &amp;quot;hablar&amp;quot; (to speak), &amp;quot;aprender&amp;quot; (to learn), &amp;quot;estudiar&amp;quot; (to study), and &amp;quot;saber&amp;quot; (to know).&lt;br /&gt;
*#* Hablo español = I speak Spanish.&lt;br /&gt;
*# The article is required after an adverb or preposition other than &amp;quot;en&amp;quot;.&lt;br /&gt;
*#* Hablo bien el español = I speak Spanish well.&lt;br /&gt;
*# The article is required when used with other verbs.&lt;br /&gt;
*#* Prefiero el francés = I prefer French.&lt;br /&gt;
&lt;br /&gt;
==== Clothing, Human Body and Personal Items ====&lt;br /&gt;
&lt;br /&gt;
* In Spanish, it is normal to use the definite article where English uses a possessive adjective.&lt;br /&gt;
** ¡Abre los ojos! = Open your eyes!&lt;br /&gt;
** Perdió los zapatos = He lost his shoes.&lt;br /&gt;
&lt;br /&gt;
==== Verb Infinitives ====&lt;br /&gt;
&lt;br /&gt;
The infinitive is the base form of a verb, which translates into English as &amp;quot;to x&amp;quot;. For example, &amp;quot;comprar&amp;quot; is the infinitive meaning &amp;quot;to buy&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
In Spanish, we can say &amp;quot;smoking is prohibited&amp;quot; by using the verb infinitive with the definite article.&lt;br /&gt;
&lt;br /&gt;
* El fumar está prohibido = Smoking is prohibited.&lt;br /&gt;
&lt;br /&gt;
==== Prior to the Names of Certain Countries ====&lt;br /&gt;
&lt;br /&gt;
* Only for certain countries:&lt;br /&gt;
** For example, el Reino Unido (the UK), la India (India), la China (China)&lt;br /&gt;
&lt;br /&gt;
* Voy a México = I am going to Mexico.&lt;br /&gt;
* We use the definite article when an adjective describes the noun.&lt;br /&gt;
** Voy al México bello = I am going to beautiful Mexico.&lt;br /&gt;
* Note. &amp;quot;al&amp;quot; = &amp;quot;a&amp;quot; + &amp;quot;el&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
==== Street Names ====&lt;br /&gt;
&lt;br /&gt;
* Streets, avenues, plazas and similar places are usually preceded by a definite article.&lt;br /&gt;
** La Casa Blanca es la residencia de Barack Obama = The White House is the residence of Barack Obama.&lt;br /&gt;
&lt;br /&gt;
==== Personal Titles ====&lt;br /&gt;
&lt;br /&gt;
* A definite article is used before personal titles when talking about people, but not when speaking directly to them.&lt;br /&gt;
** La señora Lopez está ocupada = Mrs. Lopez is busy.&lt;br /&gt;
** Señor Jones, ¿está bien? = Mr. Jones, are you alright?&lt;br /&gt;
&lt;br /&gt;
==== Certain Set Phrases ====&lt;br /&gt;
&lt;br /&gt;
These are set expressions/phrases that cannot be changed. These usually involve locations/places.&lt;br /&gt;
&lt;br /&gt;
* En el espacio (in space)&lt;br /&gt;
* En la televisión (on television)&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=492</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=492"/>
		<updated>2017-02-19T06:48:59Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Definite Articles */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note: words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional examples: in some cases the noun itself does not change, but the article changes according to the person it refers to.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (topic, theme)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
==== More Feminine Noun rules ====&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;br /&gt;
&lt;br /&gt;
=== Plural Forms of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== Definite Articles in the Plural Form ====&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;el&amp;quot; becomes &amp;quot;los&amp;quot;&lt;br /&gt;
* &amp;quot;la&amp;quot; becomes &amp;quot;las&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==== Vowel Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a vowel, then it becomes plural by adding the letter &amp;quot;s&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* casa (house) =&amp;gt; casas&lt;br /&gt;
* el libro (book) =&amp;gt; los libros&lt;br /&gt;
* la pluma (pen) =&amp;gt; las plumas&lt;br /&gt;
* el chico (boy) =&amp;gt; los chicos&lt;br /&gt;
* la señora (lady; used for older or married women) =&amp;gt; las señoras&lt;br /&gt;
&lt;br /&gt;
==== Consonant Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a consonant, it becomes plural by adding &amp;quot;es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* país (country) =&amp;gt; países&lt;br /&gt;
* el borrador (draft) =&amp;gt; los borradores&lt;br /&gt;
* la universidad (university) =&amp;gt; las universidades&lt;br /&gt;
* el profesor (teacher) =&amp;gt; los profesores&lt;br /&gt;
* la ciudad (city) =&amp;gt; las ciudades&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-ión&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-ión&amp;quot;, add &amp;quot;-es&amp;quot; and drop the written accent.&lt;br /&gt;
&lt;br /&gt;
* el avión (aeroplane) =&amp;gt; los aviones (Note: this word is &amp;#039;&amp;#039;masculine&amp;#039;&amp;#039;, an exception to the rule above.)&lt;br /&gt;
* la conversación (conversation) =&amp;gt; las conversaciones&lt;br /&gt;
* la sección (section) =&amp;gt; las secciones&lt;br /&gt;
* la televisión (TV) =&amp;gt; las televisiones&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-z&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-z&amp;quot;, change the &amp;quot;z&amp;quot; to &amp;quot;c&amp;quot; and add &amp;quot;-es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* el lápiz =&amp;gt; los lápices&lt;br /&gt;
* la voz (voice) =&amp;gt; las voces&lt;br /&gt;
* el tapiz (tapestry) =&amp;gt; los tapices&lt;br /&gt;
* la actriz (actress) =&amp;gt; las actrices&lt;br /&gt;
&lt;br /&gt;
==== Masculine and Feminine Plural Nouns Combined ====&lt;br /&gt;
&lt;br /&gt;
When the plural refers to two or more nouns of different genders, the masculine plural is used.&lt;br /&gt;
&lt;br /&gt;
* 1 camello (camel) + 4 camellas = 5 camellos&lt;br /&gt;
&lt;br /&gt;
==== Compound Nouns ====&lt;br /&gt;
&lt;br /&gt;
Compound nouns are always masculine; therefore, in the plural the definite article changes from &amp;quot;el&amp;quot; to &amp;quot;los&amp;quot; while the noun itself stays the same.&lt;br /&gt;
&lt;br /&gt;
* abre (open) + latas (cans) = abrelatas (can opener)&lt;br /&gt;
** el abrelatas =&amp;gt; los abrelatas&lt;br /&gt;
&lt;br /&gt;
=== Definite Articles ===&lt;br /&gt;
&lt;br /&gt;
el, la, los, las.&lt;br /&gt;
&lt;br /&gt;
The definite article in Spanish must agree with the gender and number of the noun that follows it.&lt;br /&gt;
&lt;br /&gt;
==== Spanish Definite Article ====&lt;br /&gt;
&lt;br /&gt;
* El colegio = The school&lt;br /&gt;
&lt;br /&gt;
* La manzana = The apple&lt;br /&gt;
&lt;br /&gt;
* Los chicos = The boys&lt;br /&gt;
&lt;br /&gt;
* Las chicas = The girls&lt;br /&gt;
&lt;br /&gt;
==== Spanish&amp;#039;s Unique Uses of the Definite Article ====&lt;br /&gt;
&lt;br /&gt;
The following topics require the definite article:&lt;br /&gt;
&lt;br /&gt;
* Days of the week&lt;br /&gt;
* Seasons of the year&lt;br /&gt;
* With more than one noun&lt;br /&gt;
* Before the names of some countries&lt;br /&gt;
* With generic nouns&lt;br /&gt;
* With names of languages&lt;br /&gt;
* With clothing, body parts and other personal items&lt;br /&gt;
* With infinitives used as subjects&lt;br /&gt;
* Before names of streets&lt;br /&gt;
* With personal titles&lt;br /&gt;
* In certain set phrases&lt;br /&gt;
&lt;br /&gt;
==== Days of the Week ====&lt;br /&gt;
&lt;br /&gt;
* el lunes (Monday), el jueves (Thursday)&lt;br /&gt;
* Note that the days of the week are all masculine because they all refer to &amp;quot;el día&amp;quot; (day) which is masculine.&lt;br /&gt;
* The days of the week are preceded by either &amp;quot;el&amp;quot; or &amp;quot;los&amp;quot; depending on whether the day is singular or plural.&lt;br /&gt;
** Voy a la biblioteca el viernes = I am going to the library on Friday.&lt;br /&gt;
** Voy a la biblioteca los viernes = I go to the library on Fridays.&lt;br /&gt;
* When using &amp;quot;ser&amp;quot; (&amp;quot;to be&amp;quot;), the definite article is not used.&lt;br /&gt;
** Today is Sunday = Hoy es domingo.&lt;br /&gt;
&lt;br /&gt;
==== Season of the Year ====&lt;br /&gt;
&lt;br /&gt;
* Seasons usually have the definite article.&lt;br /&gt;
** Prefiero los veranos = I prefer summers. (preferir = to prefer)&lt;br /&gt;
* The article is optional after &amp;quot;de&amp;quot;, &amp;quot;en&amp;quot;, or a form of &amp;quot;ser&amp;quot;.&lt;br /&gt;
** No quiero estudiar en invierno = I do not want to study in the winter.&lt;br /&gt;
&lt;br /&gt;
==== With More Than One Noun ====&lt;br /&gt;
&lt;br /&gt;
* Hoy, leo el libro y la revista = Today I am reading the book and the magazine.&lt;br /&gt;
* Quiero la camisa y los pantalones = I want the shirt and trousers.&lt;br /&gt;
&lt;br /&gt;
==== Generic Nouns ====&lt;br /&gt;
&lt;br /&gt;
* These are nouns that refer to a concept, substance, or idea in general, not to a specific one.&lt;br /&gt;
** Los árabes son ricos = Arabs are rich.&lt;br /&gt;
** La manzana es sana = Apples are healthy. (generic use of the article)&lt;br /&gt;
&lt;br /&gt;
* Exception: The article is omitted after the preposition &amp;quot;de&amp;quot; which could be used to describe possession, or when the noun following &amp;quot;de&amp;quot; describes the first noun and therefore doesn&amp;#039;t refer to a specific person/thing.&lt;br /&gt;
** Los zapatos de mujeres = women&amp;#039;s shoes.&lt;br /&gt;
** Los zapatos de las mujeres = the shoes of the women.&lt;br /&gt;
&lt;br /&gt;
==== Names of Languages ====&lt;br /&gt;
&lt;br /&gt;
* El francés = French&lt;br /&gt;
* El alemán = German&lt;br /&gt;
* El español = Spanish&lt;br /&gt;
&lt;br /&gt;
* Exceptions&lt;br /&gt;
*# When the language immediately follows &amp;quot;en&amp;quot;, which means &amp;quot;in&amp;quot;.&lt;br /&gt;
*#* En español... = In Spanish...&lt;br /&gt;
*# After verbs often used with languages, such as &amp;quot;hablar&amp;quot; (to speak), &amp;quot;aprender&amp;quot; (to learn), &amp;quot;estudiar&amp;quot; (to study), and &amp;quot;saber&amp;quot; (to know).&lt;br /&gt;
*#* Hablo español = I speak Spanish.&lt;br /&gt;
*# The article is required after an adverb or preposition other than &amp;quot;en&amp;quot;.&lt;br /&gt;
*#* Hablo bien el español = I speak Spanish well.&lt;br /&gt;
*# The article is required when used with other verbs.&lt;br /&gt;
*#* Prefiero el francés = I prefer French.&lt;br /&gt;
&lt;br /&gt;
==== Clothing, Human Body and Personal Items ====&lt;br /&gt;
&lt;br /&gt;
* In Spanish, it is normal to use the definite article where English uses a possessive adjective.&lt;br /&gt;
** ¡Abre los ojos! = Open your eyes!&lt;br /&gt;
** Perdió los zapatos = He lost his shoes.&lt;br /&gt;
&lt;br /&gt;
==== Verb Infinitives ====&lt;br /&gt;
&lt;br /&gt;
The infinitive is the base form of a verb, which translates into English as &amp;quot;to x&amp;quot;. For example, &amp;quot;comprar&amp;quot; is the infinitive meaning &amp;quot;to buy&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
In Spanish, we can say &amp;quot;smoking is prohibited&amp;quot; by using the verb infinitive with the definite article.&lt;br /&gt;
&lt;br /&gt;
* El fumar está prohibido = Smoking is prohibited.&lt;br /&gt;
&lt;br /&gt;
==== Prior to the Names of Certain Countries ====&lt;br /&gt;
&lt;br /&gt;
* Only for certain countries:&lt;br /&gt;
** For example, el Reino Unido (the UK), la India (India), la China (China)&lt;br /&gt;
&lt;br /&gt;
* Voy a México = I am going to Mexico.&lt;br /&gt;
* We use the definite article when an adjective describes the noun.&lt;br /&gt;
** Voy al México bello = I am going to beautiful Mexico.&lt;br /&gt;
* Note. &amp;quot;al&amp;quot; = &amp;quot;a&amp;quot; + &amp;quot;el&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
==== Street Names ====&lt;br /&gt;
&lt;br /&gt;
* Streets, avenues, plazas and similar places are usually preceded by a definite article.&lt;br /&gt;
** La Casa Blanca es la residencia de Barack Obama = The White House is the residence of Barack Obama.&lt;br /&gt;
&lt;br /&gt;
==== Personal Titles ====&lt;br /&gt;
&lt;br /&gt;
* A definite article is used before personal titles when talking about people, but not when speaking directly to them.&lt;br /&gt;
** La señora Lopez está ocupada = Mrs. Lopez is busy.&lt;br /&gt;
** Señor Jones, ¿está bien? = Mr. Jones, are you alright?&lt;br /&gt;
&lt;br /&gt;
==== Certain Set Phrases ====&lt;br /&gt;
&lt;br /&gt;
These are set expressions/phrases that cannot be changed. These usually involve locations/places.&lt;br /&gt;
&lt;br /&gt;
* En el espacio (in space)&lt;br /&gt;
* En la televisión (on television)&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=491</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=491"/>
		<updated>2017-02-19T06:46:15Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Definite Articles */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note: words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional examples: in some cases the noun itself does not change, but the article changes according to the person it refers to.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (topic, theme)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
==== More Feminine Noun rules ====&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;br /&gt;
&lt;br /&gt;
=== Plural Forms of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== Definite Articles in the Plural Form ====&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;el&amp;quot; becomes &amp;quot;los&amp;quot;&lt;br /&gt;
* &amp;quot;la&amp;quot; becomes &amp;quot;las&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==== Vowel Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a vowel, then it becomes plural by adding the letter &amp;quot;s&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* casa (house) =&amp;gt; casas&lt;br /&gt;
* el libro (book) =&amp;gt; los libros&lt;br /&gt;
* la pluma (pen) =&amp;gt; las plumas&lt;br /&gt;
* el chico (boy) =&amp;gt; los chicos&lt;br /&gt;
* la señora (lady; used for older or married women) =&amp;gt; las señoras&lt;br /&gt;
&lt;br /&gt;
==== Consonant Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a consonant, it becomes plural by adding &amp;quot;es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* país (country) =&amp;gt; países&lt;br /&gt;
* el borrador (draft) =&amp;gt; los borradores&lt;br /&gt;
* la universidad (university) =&amp;gt; las universidades&lt;br /&gt;
* el profesor (teacher) =&amp;gt; los profesores&lt;br /&gt;
* la ciudad (city) =&amp;gt; las ciudades&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-ión&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-ión&amp;quot;, add &amp;quot;-es&amp;quot; and drop the written accent.&lt;br /&gt;
&lt;br /&gt;
* el avión (aeroplane) =&amp;gt; los aviones (Note: this word is &amp;#039;&amp;#039;masculine&amp;#039;&amp;#039;, an exception to the rule above.)&lt;br /&gt;
* la conversación (conversation) =&amp;gt; las conversaciones&lt;br /&gt;
* la sección (section) =&amp;gt; las secciones&lt;br /&gt;
* la televisión (TV) =&amp;gt; las televisiones&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-z&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-z&amp;quot;, change the &amp;quot;z&amp;quot; to &amp;quot;c&amp;quot; and add &amp;quot;-es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* el lápiz =&amp;gt; los lápices&lt;br /&gt;
* la voz (voice) =&amp;gt; las voces&lt;br /&gt;
* el tapiz (tapestry) =&amp;gt; los tapices&lt;br /&gt;
* la actriz (actress) =&amp;gt; las actrices&lt;br /&gt;
&lt;br /&gt;
==== Masculine and Feminine Plural Nouns Combined ====&lt;br /&gt;
&lt;br /&gt;
When the plural refers to two or more nouns of different genders, the masculine plural is used.&lt;br /&gt;
&lt;br /&gt;
* 1 camello (camel) + 4 camellas = 5 camellos&lt;br /&gt;
&lt;br /&gt;
==== Compound Nouns ====&lt;br /&gt;
&lt;br /&gt;
Compound nouns are always masculine; therefore, in the plural the definite article changes from &amp;quot;el&amp;quot; to &amp;quot;los&amp;quot; while the noun itself stays the same.&lt;br /&gt;
&lt;br /&gt;
* abre (open) + latas (cans) = abrelatas (can opener)&lt;br /&gt;
** el abrelatas =&amp;gt; los abrelatas&lt;br /&gt;
&lt;br /&gt;
=== Definite Articles ===&lt;br /&gt;
&lt;br /&gt;
el, la, los, las.&lt;br /&gt;
&lt;br /&gt;
The definite article in Spanish must agree with the gender and number of the noun that follows it.&lt;br /&gt;
&lt;br /&gt;
==== Spanish Definite Article ====&lt;br /&gt;
&lt;br /&gt;
* El colegio = The school&lt;br /&gt;
&lt;br /&gt;
* La manzana = The apple&lt;br /&gt;
&lt;br /&gt;
* Los chicos = The boys&lt;br /&gt;
&lt;br /&gt;
* Las chicas = The girls&lt;br /&gt;
&lt;br /&gt;
==== Spanish&amp;#039;s Unique Uses of the Definite Article ====&lt;br /&gt;
&lt;br /&gt;
The following topics require the definite article:&lt;br /&gt;
&lt;br /&gt;
* Days of the week&lt;br /&gt;
* Seasons of the year&lt;br /&gt;
* With more than one noun&lt;br /&gt;
* Before the names of some countries&lt;br /&gt;
* With generic nouns&lt;br /&gt;
* With names of languages&lt;br /&gt;
* With clothing, body parts and other personal items&lt;br /&gt;
* With infinitives used as subjects&lt;br /&gt;
* Before names of streets&lt;br /&gt;
* With personal titles&lt;br /&gt;
* In certain set phrases&lt;br /&gt;
&lt;br /&gt;
==== Days of the Week ====&lt;br /&gt;
&lt;br /&gt;
* el lunes (Monday), el jueves (Thursday)&lt;br /&gt;
* Note that the days of the week are all masculine because they all refer to &amp;quot;el día&amp;quot; (day) which is masculine.&lt;br /&gt;
* The days of the week are preceded by either &amp;quot;el&amp;quot; or &amp;quot;los&amp;quot; depending on whether the day is singular or plural.&lt;br /&gt;
** Voy a la biblioteca el viernes = I am going to the library on Friday.&lt;br /&gt;
** Voy a la biblioteca los viernes = I go to the library on Fridays.&lt;br /&gt;
* When using &amp;quot;ser&amp;quot; (&amp;quot;to be&amp;quot;), the definite article is not used.&lt;br /&gt;
** Today is Sunday = Hoy es domingo.&lt;br /&gt;
&lt;br /&gt;
==== Season of the Year ====&lt;br /&gt;
&lt;br /&gt;
* Seasons usually have the definite article.&lt;br /&gt;
** Prefiero los veranos = I prefer summers. (preferir = to prefer)&lt;br /&gt;
* The article is optional after &amp;quot;de&amp;quot;, &amp;quot;en&amp;quot;, or a form of &amp;quot;ser&amp;quot;.&lt;br /&gt;
** No quiero estudiar en invierno = I do not want to study in the winter.&lt;br /&gt;
&lt;br /&gt;
==== With More Than One Noun ====&lt;br /&gt;
&lt;br /&gt;
* Hoy, leo el libro y la revista = Today I am reading the book and the magazine.&lt;br /&gt;
* Quiero la camisa y los pantalones = I want the shirt and trousers.&lt;br /&gt;
&lt;br /&gt;
==== Generic Nouns ====&lt;br /&gt;
&lt;br /&gt;
* This applies to nouns that refer to a concept, substance, or idea in general rather than to a specific one.&lt;br /&gt;
** Los árabes son ricos = Arabs are rich.&lt;br /&gt;
** La manzana es sana = Apples are healthy.&lt;br /&gt;
&lt;br /&gt;
* Exception: The article is omitted after the preposition &amp;quot;de&amp;quot; which could be used to describe possession, or when the noun following &amp;quot;de&amp;quot; describes the first noun and therefore doesn&amp;#039;t refer to a specific person/thing.&lt;br /&gt;
** Los zapatos de mujeres = women&amp;#039;s shoes.&lt;br /&gt;
** Los zapatos de las mujeres = the shoes of the women.&lt;br /&gt;
&lt;br /&gt;
==== Names of Languages ====&lt;br /&gt;
&lt;br /&gt;
* El francés = French&lt;br /&gt;
* El alemán = German&lt;br /&gt;
* El español = Spanish&lt;br /&gt;
&lt;br /&gt;
* Exceptions&lt;br /&gt;
*# When the language is immediately followed by &amp;quot;en&amp;quot; which means &amp;quot;in&amp;quot;.&lt;br /&gt;
*#* En español... = In Spanish...&lt;br /&gt;
*# After verbs that are often used with languages, such as &amp;quot;hablar&amp;quot; (to speak), &amp;quot;aprender&amp;quot; (to learn), &amp;quot;estudiar&amp;quot; (to study), and &amp;quot;saber&amp;quot; (to know).&lt;br /&gt;
*#* Hablo español = I speak Spanish.&lt;br /&gt;
*# The article is required after an adverb/preposition other than en.&lt;br /&gt;
*#* Hablo bien el español = I speak Spanish well.&lt;br /&gt;
*# The article is required when used with other verbs.&lt;br /&gt;
*#* Prefiero el francés = I prefer French.&lt;br /&gt;
&lt;br /&gt;
==== Clothing, Human Body and Personal Items ====&lt;br /&gt;
&lt;br /&gt;
* In Spanish, the definite article is normally used where English would use a possessive adjective.&lt;br /&gt;
** ¡Abre los ojos! = Open your eyes!&lt;br /&gt;
** Perdió los zapatos = He lost his shoes.&lt;br /&gt;
&lt;br /&gt;
==== Verb Infinitives ====&lt;br /&gt;
&lt;br /&gt;
The infinitive is the base form of a verb, which translates into English as &amp;quot;to x&amp;quot;. For example, &amp;quot;comprar&amp;quot; is the infinitive meaning &amp;quot;to buy&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
In Spanish, we can say &amp;quot;smoking is prohibited&amp;quot; by using the verb infinitive with the definite article.&lt;br /&gt;
&lt;br /&gt;
* El fumar está prohibido = Smoking is prohibited.&lt;br /&gt;
&lt;br /&gt;
==== Prior to the Names of Certain Countries ====&lt;br /&gt;
&lt;br /&gt;
* Only for certain countries:&lt;br /&gt;
** For example, el Reino Unido (the UK), la India (India), la China (China)&lt;br /&gt;
&lt;br /&gt;
* Most country names take no article: Voy a México = I am going to Mexico.&lt;br /&gt;
* We use the definite article when an adjective describes the country name.&lt;br /&gt;
** Voy al México bello = I am going to beautiful Mexico.&lt;br /&gt;
* Note. &amp;quot;al&amp;quot; = &amp;quot;a&amp;quot; + &amp;quot;el&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
==== Street Names ====&lt;br /&gt;
&lt;br /&gt;
* Streets, avenues, plazas and similar places are usually preceded by a definite article.&lt;br /&gt;
** La Casa Blanca es la residencia de Barack Obama = The White House is the residence of Barack Obama.&lt;br /&gt;
&lt;br /&gt;
==== Personal Titles ====&lt;br /&gt;
&lt;br /&gt;
A definite article is used before a personal title when you are talking about the person, not when you are speaking directly to them.&lt;br /&gt;
&lt;br /&gt;
* La señora Lopez está ocupada = Mrs. Lopez is busy.&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=490</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=490"/>
		<updated>2017-02-19T06:35:16Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Definite Articles */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional Examples. In some cases the noun itself does not change, but the article does, according to whom it refers.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (topic)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
==== More Feminine Noun rules ====&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;br /&gt;
&lt;br /&gt;
=== Plural Forms of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== Definite Articles in the Plural Form ====&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;el&amp;quot; becomes &amp;quot;los&amp;quot;&lt;br /&gt;
* &amp;quot;la&amp;quot; becomes &amp;quot;las&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==== Vowel Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a vowel, then it becomes plural by adding the letter &amp;quot;s&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* casa (house) =&amp;gt; casas&lt;br /&gt;
* el libro (book) =&amp;gt; los libros&lt;br /&gt;
* la pluma (pen) =&amp;gt; las plumas&lt;br /&gt;
* el chico (boy) =&amp;gt; los chicos&lt;br /&gt;
* la señora (lady; used for married or older women) =&amp;gt; las señoras&lt;br /&gt;
&lt;br /&gt;
==== Consonant Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a consonant, it becomes plural by adding &amp;quot;es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* país (country) =&amp;gt; países&lt;br /&gt;
* el borrador (draft) =&amp;gt; los borradores&lt;br /&gt;
* la universidad (university) =&amp;gt; las universidades&lt;br /&gt;
* el profesor (teacher) =&amp;gt; los profesores&lt;br /&gt;
* la ciudad (city) =&amp;gt; las ciudades&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-ión&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-ión&amp;quot;, add &amp;quot;-es&amp;quot; and drop the written accent.&lt;br /&gt;
&lt;br /&gt;
* el avión (aeroplane) =&amp;gt; los aviones (Note. this is &amp;#039;&amp;#039;masculine&amp;#039;&amp;#039;, an exception to the rule above.)&lt;br /&gt;
* la conversación (conversation) =&amp;gt; las conversaciones&lt;br /&gt;
* la sección (section) =&amp;gt; las secciones&lt;br /&gt;
* la televisión (TV) =&amp;gt; las televisiones&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-z&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-z&amp;quot;, change the &amp;quot;z&amp;quot; to &amp;quot;c&amp;quot; and add &amp;quot;-es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* el lápiz =&amp;gt; los lápices&lt;br /&gt;
* la voz (voice) =&amp;gt; las voces&lt;br /&gt;
* el tapiz (tapestry) =&amp;gt; los tapices&lt;br /&gt;
* la actriz (actress) =&amp;gt; las actrices&lt;br /&gt;
&lt;br /&gt;
==== Masculine and Feminine Plural Nouns Combined ====&lt;br /&gt;
&lt;br /&gt;
When the plural refers to two or more nouns of different genders, the masculine plural is used.&lt;br /&gt;
&lt;br /&gt;
* 1 camello (camel) + 4 camellas = 5 camellos&lt;br /&gt;
&lt;br /&gt;
==== Compound Nouns ====&lt;br /&gt;
&lt;br /&gt;
Compound nouns are always masculine; therefore, in the plural form the definite article changes from &amp;quot;el&amp;quot; to &amp;quot;los&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* abre (open) + latas (cans) = abrelatas (can opener)&lt;br /&gt;
** el abrelatas =&amp;gt; los abrelatas&lt;br /&gt;
&lt;br /&gt;
=== Definite Articles ===&lt;br /&gt;
&lt;br /&gt;
el, la, los, las.&lt;br /&gt;
&lt;br /&gt;
The definite article in Spanish must agree with the gender and number of the noun that follows it.&lt;br /&gt;
&lt;br /&gt;
==== Spanish Definite Article ====&lt;br /&gt;
&lt;br /&gt;
* El colegio = The school&lt;br /&gt;
&lt;br /&gt;
* La manzana = The apple&lt;br /&gt;
&lt;br /&gt;
* Los chicos = The boys&lt;br /&gt;
&lt;br /&gt;
* Las chicas = The girls&lt;br /&gt;
&lt;br /&gt;
==== Spanish&amp;#039;s Unique Uses of the Definite Article ====&lt;br /&gt;
&lt;br /&gt;
The following topics require the definite article:&lt;br /&gt;
&lt;br /&gt;
* Days of the week&lt;br /&gt;
* Seasons of the year&lt;br /&gt;
* With more than one noun&lt;br /&gt;
* Before the names of some countries&lt;br /&gt;
* With generic nouns&lt;br /&gt;
* With names of languages&lt;br /&gt;
* With clothing, body parts and other personal items&lt;br /&gt;
* With infinitives used as subjects&lt;br /&gt;
* Before names of streets&lt;br /&gt;
* With personal titles&lt;br /&gt;
* In certain set phrases&lt;br /&gt;
&lt;br /&gt;
==== Days of the Week ====&lt;br /&gt;
&lt;br /&gt;
* el lunes (Monday), el jueves (Thursday)&lt;br /&gt;
* Note that the days of the week are all masculine because they all refer to &amp;quot;el día&amp;quot; (day) which is masculine.&lt;br /&gt;
* The days of the week are preceded by either &amp;quot;el&amp;quot; or &amp;quot;los&amp;quot; depending on whether the day is singular or plural.&lt;br /&gt;
** Voy a la biblioteca el viernes = I am going to the library on Friday.&lt;br /&gt;
** Voy a la biblioteca los viernes = I go to the library on Fridays.&lt;br /&gt;
* When using &amp;quot;ser&amp;quot; (to be), the definite article is not used.&lt;br /&gt;
** Hoy es domingo = Today is Sunday.&lt;br /&gt;
&lt;br /&gt;
==== Seasons of the Year ====&lt;br /&gt;
&lt;br /&gt;
* Seasons usually have the definite article.&lt;br /&gt;
** Prefiero los veranos = I prefer summers. (preferir = to prefer)&lt;br /&gt;
* The article is optional after &amp;quot;de&amp;quot;, &amp;quot;en&amp;quot;, or a form of &amp;quot;ser&amp;quot;.&lt;br /&gt;
** No quiero estudiar en invierno = I do not want to study in the winter.&lt;br /&gt;
&lt;br /&gt;
==== With More Than One Noun ====&lt;br /&gt;
&lt;br /&gt;
* Hoy, leo el libro y la revista = Today I am reading the book and the magazine.&lt;br /&gt;
* Quiero la camisa y los pantalones = I want the shirt and trousers.&lt;br /&gt;
&lt;br /&gt;
==== Generic Nouns ====&lt;br /&gt;
&lt;br /&gt;
* This applies to nouns that refer to a concept, substance, or idea in general rather than to a specific one.&lt;br /&gt;
** Los árabes son ricos = Arabs are rich.&lt;br /&gt;
** La manzana es sana = Apples are healthy.&lt;br /&gt;
&lt;br /&gt;
* Exception: The article is omitted after the preposition &amp;quot;de&amp;quot; which could be used to describe possession, or when the noun following &amp;quot;de&amp;quot; describes the first noun and therefore doesn&amp;#039;t refer to a specific person/thing.&lt;br /&gt;
** Los zapatos de mujeres = women&amp;#039;s shoes.&lt;br /&gt;
** Los zapatos de las mujeres = the shoes of the women.&lt;br /&gt;
&lt;br /&gt;
==== Names of Languages ====&lt;br /&gt;
&lt;br /&gt;
* El francés = French&lt;br /&gt;
* El alemán = German&lt;br /&gt;
* El español = Spanish&lt;br /&gt;
&lt;br /&gt;
* Exceptions&lt;br /&gt;
*# When the language is immediately followed by &amp;quot;en&amp;quot; which means &amp;quot;in&amp;quot;.&lt;br /&gt;
*#* En español... = In Spanish...&lt;br /&gt;
*# After verbs that are often used with languages, such as &amp;quot;hablar&amp;quot; (to speak), &amp;quot;aprender&amp;quot; (to learn), &amp;quot;estudiar&amp;quot; (to study), and &amp;quot;saber&amp;quot; (to know).&lt;br /&gt;
*#* Hablo español = I speak Spanish.&lt;br /&gt;
*# The article is required after an adverb/preposition other than en.&lt;br /&gt;
*#* Hablo bien el español = I speak Spanish well.&lt;br /&gt;
*# The article is required when used with other verbs.&lt;br /&gt;
*#* Prefiero el francés = I prefer French.&lt;br /&gt;
&lt;br /&gt;
==== Clothing, Human Body and Personal Items ====&lt;br /&gt;
&lt;br /&gt;
* In Spanish, the definite article is normally used where English would use a possessive adjective.&lt;br /&gt;
** ¡Abre los ojos! = Open your eyes!&lt;br /&gt;
** Perdió los zapatos = He lost his shoes.&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=489</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=489"/>
		<updated>2017-02-19T06:34:08Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Definite Articles */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional Examples. In some cases the noun itself does not change, but the article does, according to whom it refers.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (topic)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
==== More Feminine Noun rules ====&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;br /&gt;
&lt;br /&gt;
=== Plural Forms of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== Definite Articles in the Plural Form ====&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;el&amp;quot; becomes &amp;quot;los&amp;quot;&lt;br /&gt;
* &amp;quot;la&amp;quot; becomes &amp;quot;las&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==== Vowel Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a vowel, then it becomes plural by adding the letter &amp;quot;s&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* casa (house) =&amp;gt; casas&lt;br /&gt;
* el libro (book) =&amp;gt; los libros&lt;br /&gt;
* la pluma (pen) =&amp;gt; las plumas&lt;br /&gt;
* el chico (boy) =&amp;gt; los chicos&lt;br /&gt;
* la señora (lady; used for married or older women) =&amp;gt; las señoras&lt;br /&gt;
&lt;br /&gt;
==== Consonant Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a consonant, it becomes plural by adding &amp;quot;es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* país (country) =&amp;gt; países&lt;br /&gt;
* el borrador (draft) =&amp;gt; los borradores&lt;br /&gt;
* la universidad (university) =&amp;gt; las universidades&lt;br /&gt;
* el profesor (teacher) =&amp;gt; los profesores&lt;br /&gt;
* la ciudad (city) =&amp;gt; las ciudades&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-ión&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-ión&amp;quot;, add &amp;quot;-es&amp;quot; and drop the written accent.&lt;br /&gt;
&lt;br /&gt;
* el avión (aeroplane) =&amp;gt; los aviones (Note. this is &amp;#039;&amp;#039;masculine&amp;#039;&amp;#039;, an exception to the rule above.)&lt;br /&gt;
* la conversación (conversation) =&amp;gt; las conversaciones&lt;br /&gt;
* la sección (section) =&amp;gt; las secciones&lt;br /&gt;
* la televisión (TV) =&amp;gt; las televisiones&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-z&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-z&amp;quot;, change the &amp;quot;z&amp;quot; to &amp;quot;c&amp;quot; and add &amp;quot;-es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* el lápiz =&amp;gt; los lápices&lt;br /&gt;
* la voz (voice) =&amp;gt; las voces&lt;br /&gt;
* el tapiz (tapestry) =&amp;gt; los tapices&lt;br /&gt;
* la actriz (actress) =&amp;gt; las actrices&lt;br /&gt;
&lt;br /&gt;
==== Masculine and Feminine Plural Nouns Combined ====&lt;br /&gt;
&lt;br /&gt;
When the plural refers to two or more nouns of different genders, the masculine plural is used.&lt;br /&gt;
&lt;br /&gt;
* 1 camello (camel) + 4 camellas = 5 camellos&lt;br /&gt;
&lt;br /&gt;
==== Compound Nouns ====&lt;br /&gt;
&lt;br /&gt;
Compound nouns are always masculine; therefore, in the plural form the definite article changes from &amp;quot;el&amp;quot; to &amp;quot;los&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* abre (open) + latas (cans) = abrelatas (can opener)&lt;br /&gt;
** el abrelatas =&amp;gt; los abrelatas&lt;br /&gt;
&lt;br /&gt;
== Definite Articles ==&lt;br /&gt;
&lt;br /&gt;
el, la, los, las.&lt;br /&gt;
&lt;br /&gt;
The definite article in Spanish must agree with the gender and number of the noun that follows it.&lt;br /&gt;
&lt;br /&gt;
=== Spanish Definite Article ===&lt;br /&gt;
&lt;br /&gt;
* El colegio = The school&lt;br /&gt;
&lt;br /&gt;
* La manzana = The apple&lt;br /&gt;
&lt;br /&gt;
* Los chicos = The boys&lt;br /&gt;
&lt;br /&gt;
* Las chicas = The girls&lt;br /&gt;
&lt;br /&gt;
=== Spanish&amp;#039;s Unique Uses of the Definite Article ===&lt;br /&gt;
&lt;br /&gt;
The following topics require the definite article:&lt;br /&gt;
&lt;br /&gt;
* Days of the week&lt;br /&gt;
* Seasons of the year&lt;br /&gt;
* With more than one noun&lt;br /&gt;
* Before the names of some countries&lt;br /&gt;
* With generic nouns&lt;br /&gt;
* With names of languages&lt;br /&gt;
* With clothing, body parts and other personal items&lt;br /&gt;
* With infinitives used as subjects&lt;br /&gt;
* Before names of streets&lt;br /&gt;
* With personal titles&lt;br /&gt;
* In certain set phrases&lt;br /&gt;
&lt;br /&gt;
=== Days of the Week ===&lt;br /&gt;
&lt;br /&gt;
* el lunes (Monday), el jueves (Thursday)&lt;br /&gt;
* Note that the days of the week are all masculine because they all refer to &amp;quot;el día&amp;quot; (day) which is masculine.&lt;br /&gt;
* The days of the week are preceded by either &amp;quot;el&amp;quot; or &amp;quot;los&amp;quot; depending on whether the day is singular or plural.&lt;br /&gt;
** Voy a la biblioteca el viernes = I am going to the library on Friday.&lt;br /&gt;
** Voy a la biblioteca los viernes = I go to the library on Fridays.&lt;br /&gt;
* When using &amp;quot;ser&amp;quot; (to be), the definite article is not used.&lt;br /&gt;
** Hoy es domingo = Today is Sunday.&lt;br /&gt;
&lt;br /&gt;
=== Seasons of the Year ===&lt;br /&gt;
&lt;br /&gt;
* Seasons usually have the definite article.&lt;br /&gt;
** Prefiero los veranos = I prefer summers. (preferir = to prefer)&lt;br /&gt;
* The article is optional after &amp;quot;de&amp;quot;, &amp;quot;en&amp;quot;, or a form of &amp;quot;ser&amp;quot;.&lt;br /&gt;
** No quiero estudiar en invierno = I do not want to study in the winter.&lt;br /&gt;
&lt;br /&gt;
=== With More Than One Noun ===&lt;br /&gt;
&lt;br /&gt;
* Hoy, leo el libro y la revista = Today I am reading the book and the magazine.&lt;br /&gt;
* Quiero la camisa y los pantalones = I want the shirt and trousers.&lt;br /&gt;
&lt;br /&gt;
=== Generic Nouns ===&lt;br /&gt;
&lt;br /&gt;
* This applies to nouns that refer to a concept, substance, or idea in general rather than to a specific one.&lt;br /&gt;
** Los árabes son ricos = Arabs are rich.&lt;br /&gt;
** La manzana es sana = Apples are healthy.&lt;br /&gt;
&lt;br /&gt;
* Exception: The article is omitted after the preposition &amp;quot;de&amp;quot; which could be used to describe possession, or when the noun following &amp;quot;de&amp;quot; describes the first noun and therefore doesn&amp;#039;t refer to a specific person/thing.&lt;br /&gt;
** Los zapatos de mujeres = women&amp;#039;s shoes.&lt;br /&gt;
** Los zapatos de las mujeres = the shoes of the women.&lt;br /&gt;
&lt;br /&gt;
=== Names of Languages ===&lt;br /&gt;
&lt;br /&gt;
* El francés = French&lt;br /&gt;
* El alemán = German&lt;br /&gt;
* El español = Spanish&lt;br /&gt;
&lt;br /&gt;
* Exceptions&lt;br /&gt;
*# When the language is immediately followed by &amp;quot;en&amp;quot; which means &amp;quot;in&amp;quot;.&lt;br /&gt;
*#* En español... = In Spanish...&lt;br /&gt;
*# After verbs that are often used with languages, such as &amp;quot;hablar&amp;quot; (to speak), &amp;quot;aprender&amp;quot; (to learn), &amp;quot;estudiar&amp;quot; (to study), and &amp;quot;saber&amp;quot; (to know).&lt;br /&gt;
*#* Hablo español = I speak Spanish.&lt;br /&gt;
*# The article is required after an adverb/preposition other than en.&lt;br /&gt;
*#* Hablo bien el español = I speak Spanish well.&lt;br /&gt;
*# The article is required when used with other verbs.&lt;br /&gt;
*#* Prefiero el francés = I prefer French.&lt;br /&gt;
&lt;br /&gt;
=== Clothing, Human Body and Personal Items ===&lt;br /&gt;
&lt;br /&gt;
* In Spanish, the definite article is normally used where English would use a possessive adjective.&lt;br /&gt;
** ¡Abre los ojos! = Open your eyes!&lt;br /&gt;
** Perdió los zapatos = He lost his shoes.&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=488</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=488"/>
		<updated>2017-02-19T05:00:39Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional Examples. In some cases the noun itself does not change, but the article does, according to whom it refers.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (topic)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
==== More Feminine Noun rules ====&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;br /&gt;
&lt;br /&gt;
=== Plural Forms of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== Definite Articles in the Plural Form ====&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;el&amp;quot; becomes &amp;quot;los&amp;quot;&lt;br /&gt;
* &amp;quot;la&amp;quot; becomes &amp;quot;las&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==== Vowel Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a vowel, then it becomes plural by adding the letter &amp;quot;s&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* casa (house) =&amp;gt; casas&lt;br /&gt;
* el libro (book) =&amp;gt; los libros&lt;br /&gt;
* la pluma (pen) =&amp;gt; las plumas&lt;br /&gt;
* el chico (boy) =&amp;gt; los chicos&lt;br /&gt;
* la señora (lady; used for married or older women) =&amp;gt; las señoras&lt;br /&gt;
&lt;br /&gt;
==== Consonant Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a consonant, it becomes plural by adding &amp;quot;es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* país (country) =&amp;gt; países&lt;br /&gt;
* el borrador (draft) =&amp;gt; los borradores&lt;br /&gt;
* la universidad (university) =&amp;gt; las universidades&lt;br /&gt;
* el profesor (teacher) =&amp;gt; los profesores&lt;br /&gt;
* la ciudad (city) =&amp;gt; las ciudades&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-ión&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-ión&amp;quot;, add &amp;quot;-es&amp;quot; and drop the written accent.&lt;br /&gt;
&lt;br /&gt;
* el avión (aeroplane) =&amp;gt; los aviones (Note. this is &amp;#039;&amp;#039;masculine&amp;#039;&amp;#039;, an exception to the rule above.)&lt;br /&gt;
* la conversación (conversation) =&amp;gt; las conversaciones&lt;br /&gt;
* la sección (section) =&amp;gt; las secciones&lt;br /&gt;
* la televisión (TV) =&amp;gt; las televisiones&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-z&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-z&amp;quot;, change the &amp;quot;z&amp;quot; to &amp;quot;c&amp;quot; and add &amp;quot;-es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* el lápiz =&amp;gt; los lápices&lt;br /&gt;
* la voz (voice) =&amp;gt; las voces&lt;br /&gt;
* el tapiz (tapestry) =&amp;gt; los tapices&lt;br /&gt;
* la actriz (actress) =&amp;gt; las actrices&lt;br /&gt;
&lt;br /&gt;
==== Masculine and Feminine Plural Nouns Combined ====&lt;br /&gt;
&lt;br /&gt;
When the plural refers to two or more nouns of different genders, the masculine plural is used.&lt;br /&gt;
&lt;br /&gt;
* 1 camello (camel) + 4 camellas = 5 camellos&lt;br /&gt;
&lt;br /&gt;
==== Compound Nouns ====&lt;br /&gt;
&lt;br /&gt;
Compound nouns are always masculine; therefore, in the plural form the definite article changes from &amp;quot;el&amp;quot; to &amp;quot;los&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* abre (open) + latas (cans) = abrelatas (can opener)&lt;br /&gt;
** el abrelatas =&amp;gt; los abrelatas&lt;br /&gt;
&lt;br /&gt;
== Definite Articles ==&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=487</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=487"/>
		<updated>2017-02-10T07:39:51Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Plural Forms of Nouns */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional Examples. In some cases, the noun itself does not change, but the article changes according to whom it refers.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (issue)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
==== More Feminine Noun rules ====&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;br /&gt;
&lt;br /&gt;
=== Plural Forms of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== Definite Articles in the Plural Form ====&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;el&amp;quot; becomes &amp;quot;los&amp;quot;&lt;br /&gt;
* &amp;quot;la&amp;quot; becomes &amp;quot;las&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==== Vowel Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a vowel, then it becomes plural by adding the letter &amp;quot;s&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* casa (house) =&amp;gt; casas&lt;br /&gt;
* el libro (book) =&amp;gt; los libros&lt;br /&gt;
* la pluma (pen) =&amp;gt; las plumas&lt;br /&gt;
* el chico (boy) =&amp;gt; los chicos&lt;br /&gt;
* la señora (lady; used for older or married women) =&amp;gt; las señoras&lt;br /&gt;
&lt;br /&gt;
==== Consonant Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in a consonant, it becomes plural by adding &amp;quot;es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* país (country) =&amp;gt; países&lt;br /&gt;
* el borrador (draft) =&amp;gt; los borradores&lt;br /&gt;
* la universidad (university) =&amp;gt; las universidades&lt;br /&gt;
* el profesor (teacher) =&amp;gt; los profesores&lt;br /&gt;
* la ciudad (city) =&amp;gt; las ciudades&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-ión&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-ión&amp;quot;, add &amp;quot;-es&amp;quot; and drop the written accent.&lt;br /&gt;
&lt;br /&gt;
* el avión (aeroplane) =&amp;gt; los aviones (Note. this is &amp;#039;&amp;#039;masculine&amp;#039;&amp;#039;, an exception to the rule above.)&lt;br /&gt;
* la conversación (conversation) =&amp;gt; las conversaciones&lt;br /&gt;
* la sección (section) =&amp;gt; las secciones&lt;br /&gt;
* la televisión (TV) =&amp;gt; las televisiones&lt;br /&gt;
&lt;br /&gt;
==== &amp;quot;-z&amp;quot; Ending ====&lt;br /&gt;
&lt;br /&gt;
If a noun ends in &amp;quot;-z&amp;quot;, change the &amp;quot;z&amp;quot; to &amp;quot;c&amp;quot; and add &amp;quot;-es&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* el lápiz =&amp;gt; los lápices&lt;br /&gt;
* la voz (voice) =&amp;gt; las voces&lt;br /&gt;
* el tapiz (tapestry) =&amp;gt; los tapices&lt;br /&gt;
* la actriz (actress) =&amp;gt; las actrices&lt;br /&gt;
&lt;br /&gt;
==== Masculine and Feminine Plural Nouns Combined ====&lt;br /&gt;
&lt;br /&gt;
When the plural refers to two or more nouns of different genders, the masculine plural is used.&lt;br /&gt;
&lt;br /&gt;
* 1 camello (camel) + 4 camellas = 5 camellos&lt;br /&gt;
&lt;br /&gt;
==== Compound Nouns ====&lt;br /&gt;
&lt;br /&gt;
Compound nouns are always masculine; therefore, in the plural form the definite article changes from &amp;quot;el&amp;quot; to &amp;quot;los&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* abre (open) + latas (cans) = abrelatas (can opener)&lt;br /&gt;
** el abrelatas =&amp;gt; los abrelatas&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=486</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=486"/>
		<updated>2017-02-10T07:24:26Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* The Basics of Spanish */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional Examples. In some cases, the noun itself does not change, but the article changes according to whom it refers.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (issue)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
==== More Feminine Noun rules ====&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;br /&gt;
&lt;br /&gt;
=== Plural Forms of Nouns ===&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=485</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=485"/>
		<updated>2017-02-09T06:42:00Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Gender of Nouns */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional Examples. In some cases, the noun itself does not change, but the article changes according to whom it refers.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional masculine nouns ending in &amp;quot;a&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (issue)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
==== Exceptional feminine nouns ending in &amp;quot;o&amp;quot; ====&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
==== More Feminine Noun rules ====&lt;br /&gt;
&lt;br /&gt;
Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;br /&gt;
* la conversación (conversation)&lt;br /&gt;
* la ciudad (city)&lt;br /&gt;
* la actitud (attitude)&lt;br /&gt;
* la libertad (liberty)&lt;br /&gt;
* la certidumbre (certainty)&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=484</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=484"/>
		<updated>2017-02-09T06:39:27Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Gender of Nouns */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional Examples. In some cases, the noun itself does not change, but the article changes according to whom it refers.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
Exceptional masculine nouns ending in &amp;quot;a&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (issue)&lt;br /&gt;
* el sistema (system)&lt;br /&gt;
&lt;br /&gt;
Exceptional feminine nouns ending in &amp;quot;o&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
* la mano (hand)&lt;br /&gt;
* la radio (radio)&lt;br /&gt;
&lt;br /&gt;
More Feminine Noun rules: Nouns ending in &amp;#039;&amp;#039;&amp;#039;-sión&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-ción&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-dad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tad&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-tud&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;-umbre&amp;#039;&amp;#039;&amp;#039; are feminine.&lt;br /&gt;
&lt;br /&gt;
* la televisión (TV)&lt;br /&gt;
* la decisión (decision)&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=483</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=483"/>
		<updated>2017-02-09T06:36:12Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Gender of Nouns */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
* el mundo = world&lt;br /&gt;
* el chico = boy&lt;br /&gt;
* el colegio = school&lt;br /&gt;
* el coche = car&lt;br /&gt;
* el miedo = fear&lt;br /&gt;
&lt;br /&gt;
* la chica = girl&lt;br /&gt;
* la piscina = swimming pool&lt;br /&gt;
* la mesa = table&lt;br /&gt;
* la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* el gato =&amp;gt; la gata (cat)&lt;br /&gt;
* el perro =&amp;gt; la perra (dog)&lt;br /&gt;
* el chico (boy) =&amp;gt; la chica (girl)&lt;br /&gt;
* el hermano (brother) =&amp;gt; la hermana (sister)&lt;br /&gt;
&lt;br /&gt;
Note. Words with feminine meanings or connotations tend to be feminine. Exceptions:&lt;br /&gt;
&lt;br /&gt;
* el vestido (dress)&lt;br /&gt;
* la corbata (tie)&lt;br /&gt;
&lt;br /&gt;
More examples:&lt;br /&gt;
&lt;br /&gt;
* el doctor =&amp;gt; la doctora (doctor)&lt;br /&gt;
* el profesor =&amp;gt; la profesora (teacher)&lt;br /&gt;
&lt;br /&gt;
Exceptional Examples. In some cases, the noun itself does not change, but the article changes according to whom it refers.&lt;br /&gt;
&lt;br /&gt;
* el estudiante =&amp;gt; la estudiante (student)&lt;br /&gt;
* el artista =&amp;gt; la artista (artist)&lt;br /&gt;
&lt;br /&gt;
Exceptional masculine nouns ending in &amp;quot;a&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
* el problema (problem)&lt;br /&gt;
* el mapa (map)&lt;br /&gt;
* el tema (issue)&lt;br /&gt;
* el sistema (system)&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=482</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=482"/>
		<updated>2017-02-09T06:29:30Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Gender of Nouns */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
*el mundo = world&lt;br /&gt;
*el chico = boy&lt;br /&gt;
*el colegio = school&lt;br /&gt;
*el coche = car&lt;br /&gt;
*el miedo = fear&lt;br /&gt;
&lt;br /&gt;
*la chica = girl&lt;br /&gt;
*la piscina = swimming pool&lt;br /&gt;
*la mesa = table&lt;br /&gt;
*la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;br /&gt;
&lt;br /&gt;
No general rule.&lt;br /&gt;
&lt;br /&gt;
* cat: el gato =&amp;gt; la gata&lt;br /&gt;
* dog: el perro =&amp;gt; la perra&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=481</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=481"/>
		<updated>2017-02-09T06:28:33Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;br /&gt;
&lt;br /&gt;
=== Gender of Nouns ===&lt;br /&gt;
&lt;br /&gt;
==== The general rule ====&lt;br /&gt;
&lt;br /&gt;
If a word ends in &amp;quot;o&amp;quot; it is usually masculine. If a word ends in &amp;quot;a&amp;quot; it is usually feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the&amp;quot; = &amp;quot;el&amp;quot; for masculine, &amp;quot;la&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;a&amp;quot; = &amp;quot;un&amp;quot; for masculine, &amp;quot;una&amp;quot; for feminine.&lt;br /&gt;
&lt;br /&gt;
*el mundo = world&lt;br /&gt;
*el chico = boy&lt;br /&gt;
*el colegio = school&lt;br /&gt;
*el coche = car&lt;br /&gt;
*el miedo = fear&lt;br /&gt;
&lt;br /&gt;
*la chica = girl&lt;br /&gt;
*la piscina = swimming pool&lt;br /&gt;
*la mesa = table&lt;br /&gt;
*la libertad = liberty&lt;br /&gt;
&lt;br /&gt;
==== Masculine to Feminine ====&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=480</id>
		<title>Spanish 101</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Spanish_101&amp;diff=480"/>
		<updated>2017-02-09T06:23:04Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: Created page with &amp;quot;== The Basics of Spanish ==&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Basics of Spanish ==&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=479</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Main_Page&amp;diff=479"/>
		<updated>2017-02-09T06:22:50Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Status */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Status ==&lt;br /&gt;
&lt;br /&gt;
* Ongoing&lt;br /&gt;
# Statistics, Machine Learning, and Data Science&lt;br /&gt;
# [[Neural Networks (Geoffrey Hinton Course)]]&lt;br /&gt;
# [[Spanish 101]]&lt;br /&gt;
&lt;br /&gt;
* Intermittent (at least once a month)&lt;br /&gt;
&lt;br /&gt;
# Cryptography&lt;br /&gt;
# Weekly Articles (Atlantic, Economist)&lt;br /&gt;
# Novels&lt;br /&gt;
&lt;br /&gt;
* Backlog&lt;br /&gt;
&lt;br /&gt;
# TypeScript and Other Web-related technologies&lt;br /&gt;
# Spanish&lt;br /&gt;
# Japanese&lt;br /&gt;
# French&lt;br /&gt;
# English Advanced Writing/Reading&lt;br /&gt;
# Physics&lt;br /&gt;
# Numerical Analysis&lt;br /&gt;
&lt;br /&gt;
== Job Options ==&lt;br /&gt;
&lt;br /&gt;
# Software Developer&lt;br /&gt;
# Data Scientist&lt;br /&gt;
# Hedge Fund&lt;br /&gt;
# Academics - Applied Math&lt;br /&gt;
# Academics - Computer Science&lt;br /&gt;
# Academics - Business Schools, e.g. Operations Research&lt;br /&gt;
# Academics - Mathematics&lt;br /&gt;
&lt;br /&gt;
== Priorities ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Data Science]]&lt;br /&gt;
# [[Oracle]]&lt;br /&gt;
# [[Git Tips]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
# [[OLAP]]&lt;br /&gt;
# [[Visualization]]&lt;br /&gt;
# [[D3.js]]&lt;br /&gt;
# [[MapReduce]]&lt;br /&gt;
# [[NoSQL]]&lt;br /&gt;
# [[Information Theory, Pattern Recognition, and Neural Networks]]&lt;br /&gt;
# [[Optics]]&lt;br /&gt;
# [[Multi-Pivot Quicksort]]&lt;br /&gt;
&lt;br /&gt;
== Things to Do NOW ==&lt;br /&gt;
&lt;br /&gt;
# [[HubAnalytics]]&lt;br /&gt;
# Look into Differential poset problem&lt;br /&gt;
# Learn [[Discrete Morse Theory]]&lt;br /&gt;
# Solve the P//G problem at least for rank 2 graph!! NO&lt;br /&gt;
# Submit rainbow graph paper to a journal&lt;br /&gt;
# Apply for Conferences&lt;br /&gt;
# Apply for Internship: http://www.gs.com&lt;br /&gt;
# Study Machine Learning: http://courses.csail.mit.edu/6.867/wiki/index.php?title=6.867_Machine_Learning_(2011_Fall)&lt;br /&gt;
# Study Graphical Models: https://stellar.mit.edu/S/course/6/fa11/6.438/&lt;br /&gt;
# CFA level 1&lt;br /&gt;
# Coding theory and combinatorial optimization (linear programming)&lt;br /&gt;
&lt;br /&gt;
== Teaching ==&lt;br /&gt;
&lt;br /&gt;
# [[18.02 Calculus]]&lt;br /&gt;
&lt;br /&gt;
== Housing ==&lt;br /&gt;
&lt;br /&gt;
# [[Housing]]&lt;br /&gt;
&lt;br /&gt;
== List of Projects ==&lt;br /&gt;
&lt;br /&gt;
# [[Affine Balanced Labellings]]&lt;br /&gt;
# [[Affine Quasisymmetric Functions]]&lt;br /&gt;
# [[Differential Posets]]&lt;br /&gt;
# [[Permutation RSK distribution]]&lt;br /&gt;
# [[Hopf Algebras]]&lt;br /&gt;
# [[Groups Acting on Posets]]&lt;br /&gt;
# &amp;lt;strike&amp;gt;Stanley&amp;#039;s Cycle Problem: http://math.mit.edu/~ahmorales/mywiki/doku.php?id=cycles:stanley_s_cycle_problem&amp;lt;/strike&amp;gt;&lt;br /&gt;
# [[Eigenvalue Problem]]&lt;br /&gt;
# [[Markov Chains and Schubert Polynomials]]&lt;br /&gt;
# [[Toric Arrangements and Arithmetic Matroids]]&lt;br /&gt;
# [[Restricted Partition Posets and Specht Modules]]&lt;br /&gt;
&lt;br /&gt;
== List of Big Problems ==&lt;br /&gt;
&lt;br /&gt;
# Finding the percolation threshold&lt;br /&gt;
# Prove that the 3-sum problem takes at least $n^2 \log n$ time, or find a faster algorithm.&lt;br /&gt;
# Shellsort increment sequence and time complexity http://en.wikipedia.org/wiki/Shellsort&lt;br /&gt;
# Find efficient and &amp;quot;symmetric&amp;quot; algorithm for deletion in Binary Search Tree.&lt;br /&gt;
# Find a deterministic (fast) algorithm to get the square root of a number modulo $p$ where $p$ is a prime and $p \equiv 1$ mod 4.&lt;br /&gt;
# Is breaking RSA as hard as factoring? (major open problem) In other words, computing an $e$-th root modulo $N=pq$ (given $gcd(e, \phi(N)) = 1$) is as hard as factoring $N$.&lt;br /&gt;
# In the RSA setting, given an efficient algorithm to compute the 3rd root modulo $N$ (i.e. $e = 3$), is there an efficient algorithm to factor $N$? (The answer is true for $e = 2$ but $gcd(2, \phi(N)) \neq 1$ in this case.)&lt;br /&gt;
&lt;br /&gt;
== Theories I need to learn ==&lt;br /&gt;
&lt;br /&gt;
# Review algebraic geometry with cohomology theory&lt;br /&gt;
# Coxeter groups, root systems, Hecke Algebras&lt;br /&gt;
# Crystals&lt;br /&gt;
# Promotion and Evacuation&lt;br /&gt;
# Fulton&amp;#039;s Young Tableaux&lt;br /&gt;
# [[Tropical RSK Correspondence]]&lt;br /&gt;
&lt;br /&gt;
== Dictionary ==&lt;br /&gt;
&lt;br /&gt;
# [[Promotion and Evacuation]]&lt;br /&gt;
# [[Hecke Algebras]]&lt;br /&gt;
# [[Grothendieck Groups]]&lt;br /&gt;
# [[Ito Calculus]]&lt;br /&gt;
# [[Machine Learning]]&lt;br /&gt;
&lt;br /&gt;
== Random Ideas ==&lt;br /&gt;
&lt;br /&gt;
* When proving a structure constant of some multiplication with conjectural structure constant $C$, &amp;lt;br&amp;gt; define $\mu(X_a, X_b) := \sum C_{a,b}^c X_c$ and show that $\mu(1,X_a) = X_a$ and $\mu(X_r X_a,X_b) = \mu(X_r, X_a X_b)$ for $X_r, r\in S$ where $S$ generates $R$. Then $\mu$ should be the multiplication.&lt;br /&gt;
&lt;br /&gt;
== Android App Ideas ==&lt;br /&gt;
&lt;br /&gt;
* Convert zip codes to city, State. (first need to build the database.)&lt;br /&gt;
&lt;br /&gt;
== Papers to read ==&lt;br /&gt;
&lt;br /&gt;
* On the cd-index and gamma-vector of S*-shellable CW-spheres - Satoshi Murai, Eran Nevo&lt;br /&gt;
We show that the $\gamma$-vector of the order complex of any polytope is the f-vector of a balanced simplicial complex. This is done by proving this statement for a subclass of Stanley&amp;#039;s S-shellable spheres which includes all polytopes. The proof shows that certain parts of the cd-index, when specializing $c=1$ and considering the resulted polynomial in $d$, are the f-polynomials of simplicial complexes that can be colored with &amp;quot;few&amp;quot; colors. We conjecture that the cd-index of a regular CW-sphere is itself the flag f-vector of a colored simplicial complex in a certain sense.&lt;br /&gt;
http://arxiv.org/pdf/1102.0096v1&lt;br /&gt;
&lt;br /&gt;
* KP solitons, total positivity, and cluster algebras - Yuji Kodama and Lauren K. Williams&lt;br /&gt;
Soliton solutions of the KP equation have been studied since 1970, when Kadomtsev and Petviashvili [Kadomtsev BB, Petviashvili VI (1970) Sov Phys Dokl 15:539–541] proposed a two-dimensional nonlinear dispersive wave equation now known as the KP equation. It is well-known that the Wronskian approach to the KP equation provides a method to construct soliton solutions. The regular soliton solutions that one obtains in this way come from points of the totally nonnegative part of the Grassmannian. In this paper we explain how the theory of total positivity and cluster algebras provides a framework for understanding these soliton solutions to the KP equation. We then use this framework to give an explicit construction of certain soliton contour graphs and solve the inverse problem for soliton solutions coming from the totally positive part of the Grassmannian.&lt;br /&gt;
http://arxiv.org/abs/1106.0023&lt;br /&gt;
&lt;br /&gt;
== List of Problems ==&lt;br /&gt;
&lt;br /&gt;
*Prove positivity for linear Laurent phenomenon algebras &amp;amp; polytopality of linear Laurent phenomenon complexes&lt;br /&gt;
http://arxiv.org/abs/1206.2611&lt;br /&gt;
&lt;br /&gt;
*Bunkbed conjecture - Svante Linusson&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930738&lt;br /&gt;
http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;amp;aid=7930750&lt;br /&gt;
&lt;br /&gt;
*Bona&amp;#039;s Conjecture on Product of Cycles&lt;br /&gt;
$u \cdot v = w$ where $u$ and $v$ are long cycles, probability that 1 and 2 are in the same cycle of $w$&lt;br /&gt;
&lt;br /&gt;
*Conjecture of Thomas Lam on Affine NilCoxeter Algebras&lt;br /&gt;
http://arxiv.org/pdf/1007.2871v1&lt;br /&gt;
&lt;br /&gt;
*Pieri rule for toric Schur functions and Affine Stanley symmetric functions&lt;br /&gt;
http://arxiv.org/pdf/math/0501335v1&lt;br /&gt;
&lt;br /&gt;
*Let $n\geq 2$ and $t \geq 0$. Let $f(n,t)$ be the number of sequences with $n$ $x$&amp;#039;s and $2t$ $a_{ij}$&amp;#039;s, where $1\leq i &amp;lt; j \leq n$, such that each $a_{ij}$ occurs between the i-th $x$ and the j-th $x$ in the sequence. (Thus the total number of terms in each sequence is $n + 2t\binom{n}{2}$.) Then,&lt;br /&gt;
$f(n,t) = \frac{(n+tn(n-1))!}{n!t!^n(2t)!^{\binom{n}{2}}} \displaystyle\prod_{j=1}^{n} \frac{((j-1)t)!^2(jt)!}{(1+(n+j-2)t)!}$. (Problem 27 of http://math.mit.edu/~rstan/bij.pdf )&lt;br /&gt;
&lt;br /&gt;
*The n-cube $C_n$ (as a graph) is the graph with vertex set $\{ 0,1\} ^n$ (i.e., all binary n-tuples), with an edge between u and v if they differ in exactly one coordinate. Thus $C_n$ has $2^n$ vertices and $n2^{n-1}$ edges. Then,&lt;br /&gt;
&lt;br /&gt;
$c(C_n) = 2^{2^n-n-1}\displaystyle\prod_{k=1}^{n} k^{\binom{n}{k}}$,&lt;br /&gt;
&lt;br /&gt;
where c(G) is the number of spanning trees of G.&lt;br /&gt;
&lt;br /&gt;
*Find a bijection between a staircase tableaux with no $\delta$ of size $n$ and perfect matchings of $\{1,2,\ldots,2n+2\}$. http://math.mit.edu/~rstan/papers/staircase2.pdf&lt;br /&gt;
&lt;br /&gt;
*Problem of finding volumes (Ehrhart polynomials) of valuation polytope (EC1 2nd ed. Exer. 4.62)&lt;br /&gt;
&lt;br /&gt;
*Prove that $J(P)$ where $P = \{(a_0,a_1), (a_0,b_1), (a_1,a_2), (a_1,b_2), \ldots \}$ is the minimal distributive lattice with the property that it has antichains that has one element at each rank. (&amp;quot;An extremal problem for finite topologies and distributive lattices&amp;quot; - Richard Stanley http://math.mit.edu/~rstan/pubs/pubfiles/13.pdf)&lt;br /&gt;
&lt;br /&gt;
* The Diffie-Hellman key exchange protocol provides a &amp;quot;non-interactive&amp;quot; way to exchange keys between two people. Find such a protocol for general $n$ people. ($n=3$ case solved by Joux, &amp;quot;A One Round Protocol for Tripartite Diffie–Hellman&amp;quot; http://cgi.di.uoa.gr/~aggelos/crypto/page4/assets/joux-tripartite.pdf)&lt;br /&gt;
&lt;br /&gt;
== Others ==&lt;br /&gt;
&lt;br /&gt;
# [[new desktop]]&lt;br /&gt;
# [[Lenovo x220]]&lt;br /&gt;
# math dept sage notebook server: http://localhost:8000/home/admin/ from runge.mit.edu&lt;br /&gt;
# office sage server: https://mathstation051.mit.edu:8000/&lt;br /&gt;
# home: https://tedyun.mit.edu:8000/&lt;br /&gt;
# repn theory of quivers http://www-fourier.ujf-grenoble.fr/~mbrion/notes_quivers_rev.pdf&lt;br /&gt;
# [[The Federal Budget Deficit: Causes, Consequences, and Potential Remedies]]&lt;br /&gt;
# [[Computer Science Master&amp;#039;s Program]]&lt;br /&gt;
# [[Emacs]]&lt;br /&gt;
# [[Networking 101]]&lt;br /&gt;
# [[New England Tour]]&lt;br /&gt;
# [[NSA Back Door to NIST]]&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=478</id>
		<title>Neural Networks (Geoffrey Hinton Course)</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=478"/>
		<updated>2017-01-15T04:07:54Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Summary of Learning Methods for Neural Networks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Some Simple Models of Neurons ==&lt;br /&gt;
&lt;br /&gt;
$y$ output, $x_i$ input.&lt;br /&gt;
&lt;br /&gt;
=== Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$y = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$w_i$ weights, $b$ bias&lt;br /&gt;
&lt;br /&gt;
=== Binary Threshold Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq \theta$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
Or, equivalently,&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq 0$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
=== Rectified Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = z$ if $z &amp;gt; 0$, $0$ otherwise. (Linear above zero, with the decision at zero.)&lt;br /&gt;
&lt;br /&gt;
=== Sigmoid Neurons ===&lt;br /&gt;
&lt;br /&gt;
Give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
=== Stochastic Binary Neurons ===&lt;br /&gt;
&lt;br /&gt;
Same equations as logistic units, but outputs $1$ (=spike) or $0$ randomly based on the probability. They treat the output of the logistic as the &amp;lt;em&amp;gt;probability&amp;lt;/em&amp;gt; of producing a spike in a short time window.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$P(s = 1) = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.&lt;br /&gt;
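&lt;br /&gt;
The models above can be collected into a short sketch (assuming NumPy; all function names here are mine, not from the course):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def total_input(x, w, b):
    # z = b + sum_i x_i w_i
    return b + np.dot(x, w)

def linear(z):
    return z

def binary_threshold(z):
    # outputs 1 when z is at least 0, else 0 (bias form of the threshold)
    return np.heaviside(z, 1.0)

def rectified_linear(z):
    # linear above zero, zero otherwise
    return np.maximum(z, 0.0)

def sigmoid(z):
    # smooth, bounded output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_binary(z, rng):
    # spike (1) with probability sigmoid(z): floor(p + U) is 1 exactly
    # when the uniform draw U lands in the top p of the unit interval
    p = sigmoid(z)
    return np.floor(p + rng.random(np.shape(p)))
```
&lt;br /&gt;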
&lt;br /&gt;
== Types of Learning ==&lt;br /&gt;
&lt;br /&gt;
=== Supervised Learning ===&lt;br /&gt;
Learn to predict an output when given an input vector.&lt;br /&gt;
* Regression: The target output is a real number or a whole vector of real numbers.&lt;br /&gt;
* Classification: The target output is a class label.&lt;br /&gt;
&lt;br /&gt;
==== How Supervised Learning Typically Works ====&lt;br /&gt;
# Start by choosing a model-class: $y = f(x;W)$&lt;br /&gt;
#* A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.&lt;br /&gt;
# Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.&lt;br /&gt;
#* For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.&lt;br /&gt;
#* For classification there are other measures that are generally more sensible (they also work better).&lt;br /&gt;
&lt;br /&gt;
=== Reinforcement Learning ===&lt;br /&gt;
Learn to select an action to maximize payoff.&lt;br /&gt;
&lt;br /&gt;
* The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.&lt;br /&gt;
** The goal in selecting each action is to maximize the expected sum of the future rewards.&lt;br /&gt;
** We usually use a discount factor for delayed rewards so that we don&amp;#039;t have to look too far into the future.&lt;br /&gt;
* Reinforcement learning is difficult because:&lt;br /&gt;
** The rewards are typically delayed so it&amp;#039;s hard to know where we went wrong (or right).&lt;br /&gt;
** A scalar reward does not supply much information.&lt;br /&gt;
* Typically you can&amp;#039;t learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning); you typically learn only dozens or perhaps thousands of parameters.&lt;br /&gt;
* Will not be covered in this course.&lt;br /&gt;
&lt;br /&gt;
=== Unsupervised Learning ===&lt;br /&gt;
Discover a good internal representation of the input.&lt;br /&gt;
&lt;br /&gt;
* For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).&lt;br /&gt;
* It is hard to say what the aim of unsupervised learning is:&lt;br /&gt;
** One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.&lt;br /&gt;
** You can compute the distance to a surface by using the disparity between two images. But you don&amp;#039;t want to learn to compute disparities by stubbing your toe thousands of times.&lt;br /&gt;
* Other goals:&lt;br /&gt;
** Providing a compact, low-dimensional representation of the input.&lt;br /&gt;
*** High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds).&lt;br /&gt;
*** Principal Component Analysis is a widely used linear method for finding a low-dimensional representation.&lt;br /&gt;
** Providing an economical high-dimensional representation of the input in terms of learned features.&lt;br /&gt;
*** Binary features&lt;br /&gt;
*** Real-valued features that are nearly all zero&lt;br /&gt;
** Finding sensible clusters in the input&lt;br /&gt;
*** This is an example of a &amp;lt;em&amp;gt;very&amp;lt;/em&amp;gt; sparse code in which only one of the features is non-zero.&lt;br /&gt;
&lt;br /&gt;
== Neural Network Architectures ==&lt;br /&gt;
&lt;br /&gt;
# Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.&lt;br /&gt;
# Recurrent neural network: information can flow around in cycles.&lt;br /&gt;
# Symmetrically connected network: weights are the same in both directions between two units.&lt;br /&gt;
&lt;br /&gt;
===  Feed-forward Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* The commonest type of neural network&lt;br /&gt;
* The first layer is the input and the last layer is the output&lt;br /&gt;
* Called &amp;quot;deep&amp;quot; neural networks if there is more than one hidden layer.&lt;br /&gt;
* They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.&lt;br /&gt;
&lt;br /&gt;
=== Recurrent Networks ===&lt;br /&gt;
&lt;br /&gt;
* Have directed cycles in their connection graph.&lt;br /&gt;
* Have complicated dynamics — difficult to train.&lt;br /&gt;
* More biologically realistic.&lt;br /&gt;
* Very natural way to model &amp;lt;em&amp;gt;sequential&amp;lt;/em&amp;gt; data&lt;br /&gt;
* Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).&lt;br /&gt;
&lt;br /&gt;
=== Symmetrically Connected Networks ===&lt;br /&gt;
&lt;br /&gt;
* Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).&lt;br /&gt;
* Much easier to analyze than recurrent networks (John Hopfield, et al)&lt;br /&gt;
* More restricted in what they can do, because they obey an energy function. For example, they cannot model cycles.&lt;br /&gt;
&lt;br /&gt;
== The First Generation of Neural Networks ==&lt;br /&gt;
&lt;br /&gt;
=== Standard Paradigm for Statistical Pattern Recognition ===&lt;br /&gt;
&lt;br /&gt;
* Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).&lt;br /&gt;
* Learn how to weight each of the feature activations to get a single scalar quantity.&lt;br /&gt;
* If this quantity is above some threshold, decide that the input vector is a positive example of the target class.&lt;br /&gt;
&lt;br /&gt;
=== Perceptrons ===&lt;br /&gt;
&lt;br /&gt;
input units =&amp;gt; (by hand-coded programs) =&amp;gt; feature units =&amp;gt; decision unit&lt;br /&gt;
&lt;br /&gt;
* Popularized by Frank Rosenblatt in the early 1960s.&lt;br /&gt;
* Lots of grand claims were made for what they could learn to do.&lt;br /&gt;
* In 1969, Minsky and Papert showed Perceptrons&amp;#039; limitations (&amp;#039;&amp;#039;&amp;#039;Group Invariance Theorem&amp;#039;&amp;#039;&amp;#039;). Many people thought these limitations applied to all neural network models.&lt;br /&gt;
* Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.&lt;br /&gt;
* Decision units in perceptrons are &amp;#039;&amp;#039;&amp;#039;binary threshold neurons&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Perceptron Learning Procedure ====&lt;br /&gt;
&lt;br /&gt;
# Add an extra component with value 1 to each input vector (&amp;quot;bias&amp;quot; weight) and forget about the threshold. (bias) = -(threshold)&lt;br /&gt;
# Pick training cases using any policy that ensures that every training case will keep getting picked.&lt;br /&gt;
#* If the output unit is correct, leave its weights alone.&lt;br /&gt;
#* If the output unit incorrectly outputs a zero, add the input vector to the weight vector.&lt;br /&gt;
#* If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.&lt;br /&gt;
# This is guaranteed to find a set of weights that gets the right answer for all the training cases &amp;#039;&amp;#039;if any such set exists&amp;#039;&amp;#039;.&lt;br /&gt;
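&lt;br /&gt;
A minimal sketch of this procedure (assuming NumPy; the AND function below is my example, and it is linearly separable, so the guarantee applies):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def perceptron_train(inputs, targets, epochs=20):
    # each input row already has a trailing 1 appended (the bias component)
    w = np.zeros(inputs.shape[1])
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            y = np.heaviside(np.dot(w, x), 1.0)  # binary threshold unit
            # wrongly outputs 0: add the input vector (t - y = +1);
            # wrongly outputs 1: subtract it (t - y = -1);
            # correct: t - y = 0, so the weights are left alone
            w += (t - y) * x
    return w

# AND of two bits, with the bias component appended to each case
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
t = np.array([0., 0., 0., 1.])
w = perceptron_train(X, t)
assert [float(np.heaviside(np.dot(w, x), 1.0)) for x in X] == t.tolist()
```
&lt;br /&gt;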
&lt;br /&gt;
==== Geometrical View of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* Weight space (each point corresponds to a particular setting for all the weights)&lt;br /&gt;
* Each training case represents a hyperplane through the origin (if we eliminate the threshold/bias). That hyperplane is perpendicular to the &amp;#039;&amp;#039;input&amp;#039;&amp;#039; vector.&lt;br /&gt;
** The weights must lie on one side of this hyperplane to get the answer correct.&lt;br /&gt;
* To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).&lt;br /&gt;
* If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. This means the problem is &amp;#039;&amp;#039;&amp;#039;convex&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Limitations of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.&lt;br /&gt;
* But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.&lt;br /&gt;
** A binary threshold output unit cannot even tell if two single bit features are the same!&lt;br /&gt;
*** Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$&lt;br /&gt;
*** Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$&lt;br /&gt;
*** The four input-output pairs give four inequalities that are impossible to satisfy.&lt;br /&gt;
** Imagine &amp;#039;data-space&amp;#039; in which the axes correspond to components of an input vector.&lt;br /&gt;
*** Each input vector is a point in this space, a weight vector defines a plane in data-space.&lt;br /&gt;
*** The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.&lt;br /&gt;
*** There are many cases where we cannot separate positive and negative cases only by a hyperplane. Those cases are called &amp;#039;&amp;#039;&amp;#039;not linearly separable&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Cannot discriminate simple patterns under translation with wrap-around.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Conclusion ====&lt;br /&gt;
&lt;br /&gt;
* Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help: the result is still linear. Fixed output non-linearities are not enough.&lt;br /&gt;
* We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?&lt;br /&gt;
** We need an efficient way of adapting all the weights, not just the last layer. This is hard.&lt;br /&gt;
** Learning the weights going into hidden units is equivalent to learning features.&lt;br /&gt;
** This is difficult because nobody is telling us directly what the hidden units should do.&lt;br /&gt;
&lt;br /&gt;
== Linear Neurons ==&lt;br /&gt;
&lt;br /&gt;
=== Learning Overview ===&lt;br /&gt;
&lt;br /&gt;
$y = \sum_{i} w_i x_i = \mathbf{w}^T \mathbf{x}$&lt;br /&gt;
&lt;br /&gt;
$y$: Neuron&amp;#039;s estimate of the desired output, $\mathbf{w}$: weight vector, $\mathbf{x}$: input vector&lt;br /&gt;
&lt;br /&gt;
* Perceptron convergence procedure doesn&amp;#039;t work for more complex networks (because the problem is not &amp;#039;&amp;#039;convex&amp;#039;&amp;#039;). Multi-layer neural networks do not use the perceptron learning procedure.&lt;br /&gt;
* Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.&lt;br /&gt;
* The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.&lt;br /&gt;
&lt;br /&gt;
* Why don&amp;#039;t we solve it analytically?&lt;br /&gt;
** Scientific answer: We want a method that real neurons could use.&lt;br /&gt;
** Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.&lt;br /&gt;
&lt;br /&gt;
=== Learning Procedure ===&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;delta-rule&amp;#039;&amp;#039; for learning:&lt;br /&gt;
** $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value&lt;br /&gt;
** $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error&lt;br /&gt;
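&lt;br /&gt;
An online version of the delta-rule in code (a sketch assuming NumPy; the three training cases and the learning rate are my choices):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def delta_rule_epoch(w, X, T, eps):
    # online learning: one weight update per training case
    for x, t in zip(X, T):
        y = np.dot(w, x)           # linear neuron output
        w = w + eps * x * (t - y)  # Delta w_i = eps * x_i * (t - y)
    return w

# a consistent linear problem: T is generated from true_w, so the
# error E = 1/2 * sum_n (t^n - y^n)^2 can be driven to zero
X = np.array([[1., 1., 0.], [0., 1., 1.], [1., 0., 1.]])
true_w = np.array([150., 50., 100.])
T = X.dot(true_w)

w = np.zeros(3)
for _ in range(2000):
    w = delta_rule_epoch(w, X, T, 0.1)
assert np.allclose(w, true_w, atol=1e-4)
```
&lt;br /&gt;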
&lt;br /&gt;
=== The Error Surface for Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
* The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl.&lt;br /&gt;
** Vertical cross-sections are parabolas.&lt;br /&gt;
** Horizontal cross-sections are ellipses&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated.&lt;br /&gt;
* The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.&lt;br /&gt;
* The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.&lt;br /&gt;
&lt;br /&gt;
== Logistic Neurons ==&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
* These give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
* The derivatives:&lt;br /&gt;
** $\frac{\partial z}{\partial w_i} = x_i$&lt;br /&gt;
** $\frac{\partial z}{\partial x_i} = w_i$&lt;br /&gt;
** $\frac{dy}{dz} = y(1-y)$&lt;br /&gt;
** By the chain rule, $\frac{\partial y}{\partial w_i} = x_i y (1-y)$.&lt;br /&gt;
** By the chain rule, $\frac{\partial E}{\partial w_i} = \sum_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum_{n} x_i^n y^n (1-y^n) (t^n - y^n)$.&lt;br /&gt;
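&lt;br /&gt;
The final chain-rule expression can be checked against central finite differences (a sketch assuming NumPy; the data here is random and purely illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(w, b, X, T):
    # E = 1/2 * sum_n (t^n - y^n)^2
    Y = sigmoid(b + X.dot(w))
    return 0.5 * np.sum((T - Y) ** 2)

def grad_w(w, b, X, T):
    # dE/dw_i = -sum_n x_i^n y^n (1 - y^n) (t^n - y^n)
    Y = sigmoid(b + X.dot(w))
    return -X.T.dot(Y * (1.0 - Y) * (T - Y))

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
T = rng.random(5)
w, b = rng.standard_normal(3), 0.2

eps = 1e-6
for i in range(3):
    dw = np.zeros(3)
    dw[i] = eps
    fd = (error(w + dw, b, X, T) - error(w - dw, b, X, T)) / (2 * eps)
    assert np.isclose(fd, grad_w(w, b, X, T)[i], atol=1e-5)
```
&lt;br /&gt;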
&lt;br /&gt;
== The Backpropagation Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== Other Ideas ===&lt;br /&gt;
&lt;br /&gt;
==== Learning by Perturbing Weights ====&lt;br /&gt;
&lt;br /&gt;
* The idea occurs to everyone who knows about evolution.&lt;br /&gt;
* Randomly perturb one weight and see if it improves performance. If so, save the change.&lt;br /&gt;
** This is a form of reinforcement learning.&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;Very inefficient&amp;#039;&amp;#039;&amp;#039;. We need to do multiple forward passes on a representative set of training cases just to change one weight.&lt;br /&gt;
** Towards the end of learning, large weight perturbations will nearly always make things worse.&lt;br /&gt;
&lt;br /&gt;
==== Learning by Using Perturbations ====&lt;br /&gt;
&lt;br /&gt;
* We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.&lt;br /&gt;
* This is no better, because we need lots of trials on each training case to see the effect of changing one weight through the noise created by all the changes to the other weights.&lt;br /&gt;
* A better idea: Randomly perturb the &amp;#039;&amp;#039;activities&amp;#039;&amp;#039; of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights.&lt;br /&gt;
&lt;br /&gt;
=== The Idea behind Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* We don&amp;#039;t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.&lt;br /&gt;
* We can compute error derivatives for all the hidden units efficiently at the same time.&lt;br /&gt;
** Once we have the error derivatives for the hidden activities, it&amp;#039;s easy to get the error derivatives for the weights going into a hidden unit.&lt;br /&gt;
&lt;br /&gt;
=== Sketch of the Backpropagation Algorithm ===&lt;br /&gt;
&lt;br /&gt;
* First convert the discrepancy between each output and its target value into an error derivative.&lt;br /&gt;
* Then compute error derivatives in each hidden layer from error derivatives in the layer above.&lt;br /&gt;
* Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.&lt;br /&gt;
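&lt;br /&gt;
Those three steps, for one hidden layer of logistic units and a linear output with squared error, might look like this (a sketch assuming NumPy; the shapes and names are mine), with a finite-difference check that the derivatives come out right:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    h = sigmoid(W1.dot(x) + b1)  # hidden activities
    y = W2.dot(h) + b2           # linear output unit
    return h, y

def loss(params, x, t):
    # squared error E = 1/2 * (y - t)^2
    _, y = forward(params, x)
    return 0.5 * np.sum((y - t) ** 2)

def backprop(params, x, t):
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    dy = y - t             # error derivative at the output
    dh = W2.T.dot(dy)      # hidden-activity derivatives, from the layer above
    dz = dh * h * (1.0 - h)  # back through the logistic non-linearity
    # derivatives w.r.t. the incoming weights of each layer
    return np.outer(dz, x), dz, np.outer(dy, h), dy

rng = np.random.default_rng(0)
params = [rng.standard_normal((4, 3)), rng.standard_normal(4),
          rng.standard_normal((1, 4)), rng.standard_normal(1)]
x, t = rng.standard_normal(3), np.array([0.7])

# finite-difference check on one weight going into a hidden unit
dW1 = backprop(params, x, t)[0]
eps = 1e-6
params[0][0, 1] += eps
up = loss(params, x, t)
params[0][0, 1] -= 2 * eps
down = loss(params, x, t)
params[0][0, 1] += eps
assert np.isclose((up - down) / (2 * eps), dW1[0, 1], atol=1e-5)
```
&lt;br /&gt;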
&lt;br /&gt;
=== Using the Derivatives Computed by Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* The backpropagation algorithm is an efficient way of computing the error derivative $\partial E / \partial w$ for every weight on a single training case.&lt;br /&gt;
* To get a fully specified learning procedure, we still need to make a lot of other decisions about  how to use these error derivatives:&lt;br /&gt;
** Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?&lt;br /&gt;
** Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?&lt;br /&gt;
&lt;br /&gt;
==== Optimization Issues ====&lt;br /&gt;
&lt;br /&gt;
* How often to update the weights:&lt;br /&gt;
** Online: after each training case.&lt;br /&gt;
** Full batch: after a full sweep through the training data.&lt;br /&gt;
** Mini-batch: after a small sample of training cases. This is typically used when training big nets on big data sets.&lt;br /&gt;
* How much to update: (will be discussed further later)&lt;br /&gt;
** Use a fixed learning rate?&lt;br /&gt;
** Adapt the global learning rate?&lt;br /&gt;
** Adapt the learning rate on each connection separately?&lt;br /&gt;
** Don&amp;#039;t use steepest descent?&lt;br /&gt;
&lt;br /&gt;
==== Generalization Issue (Overfitting) ====&lt;br /&gt;
&lt;br /&gt;
* The downside of using powerful models.&lt;br /&gt;
* The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise:&lt;br /&gt;
** The target values may be unreliable (usually only a minor worry).&lt;br /&gt;
** There is &amp;#039;&amp;#039;&amp;#039;sampling error&amp;#039;&amp;#039;&amp;#039;. There will be accidental regularities just because of the particular training cases that were chosen.&lt;br /&gt;
* When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well. &amp;#039;&amp;#039;This is a disaster&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Ways to Reduce Overfitting ====&lt;br /&gt;
&lt;br /&gt;
* A large number of different methods have been developed:&lt;br /&gt;
** Weight-decay: keep weights small / zero.&lt;br /&gt;
** Weight-sharing: insist that many weights have the same value.&lt;br /&gt;
** Early stopping: Make a fake test set. While training the net, stop as soon as the performance on the fake test set starts getting worse.&lt;br /&gt;
** Model averaging: Train many different neural nets and average them.&lt;br /&gt;
** Bayesian fitting of neural nets: really just a fancy form of model averaging.&lt;br /&gt;
** Dropout: Randomly omit a hidden unit.&lt;br /&gt;
** Generative pre-training: (Will be discussed towards the end of the course.)&lt;br /&gt;
&lt;br /&gt;
== Learning to Predict the Next Word ==&lt;br /&gt;
&lt;br /&gt;
=== Diversion into Cognitive Science ===&lt;br /&gt;
&lt;br /&gt;
There has been a long debate in cognitive science between two rival theories of what it means to have a &amp;#039;&amp;#039;concept&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
* The feature theory: A concept is a set of semantic features&lt;br /&gt;
** Good for explaining similarities between concepts.&lt;br /&gt;
** It&amp;#039;s convenient -- a concept is a vector of feature activities.&lt;br /&gt;
* The structuralist theory: The meaning of a concept lies in its relationships to other concepts.&lt;br /&gt;
** So conceptual knowledge is best expressed as a relational graph.&lt;br /&gt;
** Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.&lt;br /&gt;
&lt;br /&gt;
Both sides are wrong.&lt;br /&gt;
&lt;br /&gt;
* These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.&lt;br /&gt;
* We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just &amp;quot;seeing&amp;quot; the answer with no conscious intervening steps.&lt;br /&gt;
&lt;br /&gt;
==== Localist and Distributed Representations of Concepts ====&lt;br /&gt;
&lt;br /&gt;
The obvious way to implement a relational graph in a neural net is to treat a neuron as a node in the graph and a connection as a binary relationship. But this &amp;quot;localist&amp;quot; method will not work:&lt;br /&gt;
&lt;br /&gt;
* We need many different types of relationship and the connections in a neural net do not have discrete labels.&lt;br /&gt;
* We need ternary relationships as well as binary ones, e.g. A is between B and C&lt;br /&gt;
&lt;br /&gt;
The right way to implement relational knowledge in a neural net is still an open issue.&lt;br /&gt;
&lt;br /&gt;
* But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a &amp;quot;distributed representation&amp;quot;. It&amp;#039;s a many-to-many mapping between concepts and neurons.&lt;br /&gt;
&lt;br /&gt;
=== The Softmax Output Function (Another Diversion) ===&lt;br /&gt;
&lt;br /&gt;
The squared error measure has some drawbacks. Is there a different cost function that works better?&lt;br /&gt;
&lt;br /&gt;
Yes, force the outputs to represent a probability distribution across discrete alternatives.&lt;br /&gt;
&lt;br /&gt;
==== Softmax ====&lt;br /&gt;
&lt;br /&gt;
The output units in a softmax group use a non-local non-linearity. Let $y_i$ be the output of a neuron in a softmax group, and $z_i$ (called the &amp;#039;&amp;#039;logit&amp;#039;&amp;#039;) the input of the same neuron:&lt;br /&gt;
&lt;br /&gt;
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j} }$$&lt;br /&gt;
&lt;br /&gt;
Note that $\sum y_i = 1$, so we force the $y_i$ to represent a probability distribution. We can show:&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$&lt;br /&gt;
&lt;br /&gt;
The right cost function to use with softmax is the negative log probability of the right answer. This is called the &amp;#039;&amp;#039;Cross Entropy&amp;#039;&amp;#039; cost function. ($t_j$ is the target value.)&lt;br /&gt;
&lt;br /&gt;
$$C = - \sum_{j} t_j \log y_j$$&lt;br /&gt;
&lt;br /&gt;
$C$ has a very big gradient when the target value is 1 and the output is almost zero.&lt;br /&gt;
* A value of 0.000001 is much better than 0.000000001 (one-in-a-million chance vs one-in-a-billion chance).&lt;br /&gt;
* The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$.&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial C}{\partial z_i} = \sum_{j} \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$&lt;br /&gt;
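&lt;br /&gt;
A small numerical check of these formulas (assuming NumPy; subtracting the max logit is a standard stability trick that leaves $y$ unchanged):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def softmax(z):
    # subtract the max logit for numerical stability; y is unchanged
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(y, t):
    # C = -sum_j t_j log y_j
    return -np.sum(t * np.log(y))

z = np.array([1.0, 2.0, 0.5])
t = np.array([0.0, 1.0, 0.0])  # the right answer is alternative 1
y = softmax(z)
assert np.isclose(np.sum(y), 1.0)  # the outputs form a distribution

# check dC/dz_i = y_i - t_i against central finite differences
eps = 1e-6
for i in range(3):
    step = np.zeros(3)
    step[i] = eps
    up = cross_entropy(softmax(z + step), t)
    down = cross_entropy(softmax(z - step), t)
    assert np.isclose((up - down) / (2 * eps), y[i] - t[i], atol=1e-6)
```
&lt;br /&gt;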
&lt;br /&gt;
&lt;br /&gt;
=== Neuro-Probabilistic Language Models ===&lt;br /&gt;
&lt;br /&gt;
* We cannot identify phonemes perfectly in noisy speech.&lt;br /&gt;
* People use their understanding of the meaning of the utterance to hear the right words.&lt;br /&gt;
** &amp;quot;We do this unconsciously when we wreck a nice beach&amp;quot;&lt;br /&gt;
** We are very good at it.&lt;br /&gt;
* This means speech recognizers have to know which words are likely to come next and which are not.&lt;br /&gt;
** Fortunately words can be predicted quite well without full understanding.&lt;br /&gt;
&lt;br /&gt;
==== The Standard &amp;quot;Trigram&amp;quot; Method ====&lt;br /&gt;
&lt;br /&gt;
* Take a huge amount of text and count the frequencies of all triples of words.&lt;br /&gt;
&lt;br /&gt;
* Use these frequencies to make bets on the relative probabilities of words given the previous two words.&lt;br /&gt;
&lt;br /&gt;
* Until very recently this was the state-of-the-art.&lt;br /&gt;
** We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.&lt;br /&gt;
** We have to &amp;quot;back-off&amp;quot; to digrams when the count for a trigram is too small. The probability is not zero just because the count is zero!&lt;br /&gt;
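&lt;br /&gt;
A toy count-based trigram model with bigram back-off might look like this (pure standard library; the tiny corpus and the back-off rule are simplified illustrations, not a production smoothing scheme):&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter

def train_counts(words):
    # count all word triples and word pairs in the corpus
    trigrams, bigrams = Counter(), Counter()
    for a, b, c in zip(words, words[1:], words[2:]):
        trigrams[(a, b, c)] += 1
    for a, b in zip(words, words[1:]):
        bigrams[(a, b)] += 1
    return trigrams, bigrams

def prob_next(trigrams, bigrams, w1, w2, w3):
    # relative frequency of w3 after the context (w1, w2);
    # back off to the bigram estimate when the trigram context is unseen
    context = sum(n for (a, b, c), n in trigrams.items() if (a, b) == (w1, w2))
    if context:
        return trigrams[(w1, w2, w3)] / context
    pair = sum(n for (a, b), n in bigrams.items() if a == w2)
    if pair:
        return bigrams[(w2, w3)] / pair
    return 0.0

words = "the cat sat on the mat the cat sat on the rug".split()
tri, bi = train_counts(words)
assert prob_next(tri, bi, "the", "cat", "sat") == 1.0  # seen trigram context
assert prob_next(tri, bi, "dog", "on", "the") == 1.0   # backs off to P(the | on)
```
&lt;br /&gt;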
&lt;br /&gt;
* A trigram model doesn&amp;#039;t understand the similarities between words.&lt;br /&gt;
** The sentence &amp;quot;The cat got squashed in the garden on friday&amp;quot; should help us predict words in the sentence &amp;quot;The dog got flattened in the yard on monday&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.&lt;br /&gt;
** Using a feature representation also allows a context that contains many more previous words, e.g. 10.&lt;br /&gt;
&lt;br /&gt;
* Yoshua Bengio&amp;#039;s neural net for predicting the next word&lt;br /&gt;
&lt;br /&gt;
== Object Recognition ==&lt;br /&gt;
&lt;br /&gt;
=== Why Object Recognition is Difficult ===&lt;br /&gt;
&lt;br /&gt;
* Segmentation: Real scenes are cluttered with other objects.&lt;br /&gt;
* Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.&lt;br /&gt;
* Deformation: Objects can deform in a variety of non-affine ways.&lt;br /&gt;
* Affordances: Object classes are things defined by how they are used.&lt;br /&gt;
* Viewpoint: Changes in viewpoint cause changes in images that standard learning methods cannot cope with.&lt;br /&gt;
&lt;br /&gt;
=== Achieving Viewpoint Invariance ===&lt;br /&gt;
&lt;br /&gt;
* Viewpoint invariance is one of the main difficulties in making computers perceive.&lt;br /&gt;
* Several different approaches:&lt;br /&gt;
** Use redundant invariant features.&lt;br /&gt;
** Put a box around the object and use normalized pixels.&lt;br /&gt;
** Use replicated features with pooling. This is called &amp;#039;&amp;#039;convolutional neural nets&amp;#039;&amp;#039; (in later lectures).&lt;br /&gt;
** Use a hierarchy of parts that have explicit poses relative to the camera (in later lectures).&lt;br /&gt;
&lt;br /&gt;
==== Invariant Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* Extract a large, redundant set of features that are invariant under transformations.&lt;br /&gt;
** e.g. pair of roughly parallel lines with a red dot between them (this is what baby herring gulls use to know where to peck for food).&lt;br /&gt;
* With enough invariant features, there is only one way to assemble them into an object.&lt;br /&gt;
* But for recognition, we must avoid forming features from parts of different objects.&lt;br /&gt;
&lt;br /&gt;
==== Judicious Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* Put a box around the object and use it as a coordinate frame for a set of normalized pixels.&lt;br /&gt;
** This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.&lt;br /&gt;
** The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, ...&lt;br /&gt;
&lt;br /&gt;
* But choosing the box is difficult because of:&lt;br /&gt;
** Segmentation errors, occlusion, unusual orientations.&lt;br /&gt;
&lt;br /&gt;
* We need to recognize the shape to get the box right!&lt;br /&gt;
&lt;br /&gt;
==== Brute Force Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* When training the recognizer, use well-segmented, upright images to fit the correct box.&lt;br /&gt;
&lt;br /&gt;
* At test time try all possible boxes in a range of positions and scales.&lt;br /&gt;
&lt;br /&gt;
** This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.&lt;br /&gt;
** It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Digit Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== The Replicated Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* This is currently the dominant approach for neural networks.&lt;br /&gt;
&lt;br /&gt;
* Use many different copies of the same feature detector with different positions.&lt;br /&gt;
&lt;br /&gt;
** Could also replicate across scale and orientation but it is tricky and expensive.&lt;br /&gt;
** Replication greatly reduces the number of free parameters to be learned.&lt;br /&gt;
&lt;br /&gt;
* Use several different feature types, each with its own map of replicated detectors.&lt;br /&gt;
&lt;br /&gt;
==== Backpropagation with Weight Constraints ====&lt;br /&gt;
&lt;br /&gt;
* It&amp;#039;s easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.&lt;br /&gt;
&lt;br /&gt;
* We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints - so if the weights started off satisfying the constraints, they will continue to satisfy them.&lt;br /&gt;
&lt;br /&gt;
To constrain $w_1 = w_2$, we need $\Delta w_1 = \Delta w_2$.&lt;br /&gt;
&lt;br /&gt;
Compute $\partial E / \partial w_1$ and $\partial E / \partial w_2$, then use $\partial E / \partial w_1 + \partial E / \partial w_2$ for both $w_1$ and $w_2$.&lt;br /&gt;
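&lt;br /&gt;
In code, the constraint is maintained by giving every weight in a tied group the same summed gradient (a sketch assuming NumPy; the function name is mine):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def tie_gradients(grads, groups):
    # give every weight in a tied group the same summed gradient, so
    # weights that start off equal stay equal after each update
    g = np.array(grads, dtype=float)
    for group in groups:
        g[list(group)] = np.sum(g[list(group)])
    return g

# w_1 and w_2 (indices 0 and 1) are constrained to stay equal:
# both receive dE/dw_1 + dE/dw_2
tied = tie_gradients([0.5, 0.25, 2.0], [(0, 1)])
assert tied.tolist() == [0.75, 0.75, 2.0]
```
&lt;br /&gt;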
&lt;br /&gt;
==== What Does Replicating the Feature Detectors Achieve? ====&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Equivariant&amp;#039;&amp;#039;&amp;#039; activities: Replicated features do &amp;#039;&amp;#039;not&amp;#039;&amp;#039; make the neural activities invariant to translation. The activities are equivariant.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Invariant&amp;#039;&amp;#039;&amp;#039; knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.&lt;br /&gt;
&lt;br /&gt;
==== Pooling the Outputs of Replicated Feature Detectors ====&lt;br /&gt;
&lt;br /&gt;
* Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.&lt;br /&gt;
** This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.&lt;br /&gt;
** Taking the maximum of the four works slightly better.&lt;br /&gt;
&lt;br /&gt;
* Problem: After several levels of pooling, we have lost information about the precise positions of things.&lt;br /&gt;
** This makes it impossible to use the precise spatial relationships between high-level parts for recognition.&lt;br /&gt;
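A minimal sketch of the 2x2 pooling described above, with both the averaging and the slightly better max variant (toy feature-map numbers, plain Python):

```python
# Pool a feature map by collapsing each 2x2 block of neighboring detector
# outputs into one value, either by averaging or by taking the maximum.

def pool2x2(fmap, mode="max"):
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for i in range(0, rows, 2):
        row = []
        for j in range(0, cols, 2):
            block = [fmap[i][j], fmap[i][j+1], fmap[i+1][j], fmap[i+1][j+1]]
            row.append(max(block) if mode == "max" else sum(block) / 4.0)
        out.append(row)
    return out

fmap = [[1, 3, 0, 2],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 1, 7, 8]]
assert pool2x2(fmap, "max") == [[4, 2], [1, 8]]
assert pool2x2(fmap, "avg") == [[2.5, 1.0], [0.5, 6.5]]
```

Either way, the 4x4 map shrinks to 2x2, which is exactly how the precise positions get lost after several levels.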
&lt;br /&gt;
==== LeNet ====&lt;br /&gt;
&lt;br /&gt;
* Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:&lt;br /&gt;
** Many hidden layers.&lt;br /&gt;
** Many maps of replicated units in each layer.&lt;br /&gt;
** Pooling of the outputs of nearby replicated units.&lt;br /&gt;
** A wide net that can cope with several characters at once even if they overlap.&lt;br /&gt;
** A clever way of training a complete system, not just a recognizer.&lt;br /&gt;
&lt;br /&gt;
* This net was used for reading ~10% of the checks in North America.&lt;br /&gt;
&lt;br /&gt;
* Look at the impressive demos of LeNet at http://yann.lecun.com&lt;br /&gt;
&lt;br /&gt;
==== Priors and Prejudice ====&lt;br /&gt;
&lt;br /&gt;
* We can put our prior knowledge about the task into the network by designing appropriate:&lt;br /&gt;
** Connectivity&lt;br /&gt;
** Weight constraints&lt;br /&gt;
** Neuron activation functions&lt;br /&gt;
&lt;br /&gt;
* This is less intrusive than hand-designing the features&lt;br /&gt;
** But it still prejudices the network towards the particular way of solving the problem that we had in mind.&lt;br /&gt;
&lt;br /&gt;
* Alternatively, we can use our prior knowledge to create a whole lot more training data.&lt;br /&gt;
** This may require a lot of work (Hofman &amp;amp; Tresp, 1993): real data + synthetic data.&lt;br /&gt;
** It may make learning take much longer.&lt;br /&gt;
** It allows optimization to discover clever ways of using the multi-layer network that we did not think of - and we may never fully understand how it does it.&lt;br /&gt;
&lt;br /&gt;
==== The Brute Force Approach ====&lt;br /&gt;
&lt;br /&gt;
* LeNet uses knowledge about the invariances to design:&lt;br /&gt;
** the local connectivity&lt;br /&gt;
** the weight-sharing&lt;br /&gt;
** the pooling&lt;br /&gt;
&lt;br /&gt;
* This achieves about 80 errors.&lt;br /&gt;
** This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008).&lt;br /&gt;
&lt;br /&gt;
* Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:&lt;br /&gt;
** For each training image, they produce many new training examples by applying many different transformations.&lt;br /&gt;
** They can then train a large, deep, dumb net on a GPU without much overfitting.&lt;br /&gt;
** They achieved about 35 errors.&lt;br /&gt;
&lt;br /&gt;
* With model averaging they can now get about 25 errors (which is about the same as human error rate).&lt;br /&gt;
&lt;br /&gt;
==== How to Detect a Significant Drop in the Error Rate ====&lt;br /&gt;
&lt;br /&gt;
* Is 30 errors in 10,000 test cases significantly better than 40 errors?&lt;br /&gt;
** It all depends on the particular errors.&lt;br /&gt;
** The McNemar test uses the particular errors and can be much more powerful than a test that just uses the number of errors.&lt;br /&gt;
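A sketch of the McNemar statistic computed from the two disagreement counts (the counts below are hypothetical, chosen to match a 40-errors vs 30-errors comparison):

```python
# McNemar's test looks only at the test cases where the two models
# disagree: n01 = cases the old model got right but the new one got wrong,
# n10 = the reverse. Under the null hypothesis that the models are
# equally good, the disagreements should split 50/50.

def mcnemar_chi2(n01, n10):
    # Chi-squared statistic (1 degree of freedom) with continuity correction.
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Hypothetical counts: the new model fixes 29 of the old model's 40 errors
# but introduces 19 new ones, giving 30 errors in total.
chi2 = mcnemar_chi2(19, 29)
# chi2 == 1.6875 here, below the 3.84 needed for significance at the 5%
# level: with these particular errors, 30 vs 40 is not a significant drop.
```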
&lt;br /&gt;
=== Convolutional Nets for Object Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== From Hand-Written Digits to 3D Objects ====&lt;br /&gt;
&lt;br /&gt;
* Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:&lt;br /&gt;
&lt;br /&gt;
** Hundred times as many classes (1000 vs 10)&lt;br /&gt;
** Hundred times as many pixels (256x256 color vs 28x28 gray)&lt;br /&gt;
** Two dimensional image of three-dimensional scene.&lt;br /&gt;
** Cluttered scenes requiring segmentation.&lt;br /&gt;
** Multiple objects in each image.&lt;br /&gt;
&lt;br /&gt;
* Will the same type of convolutional neural network work?&lt;br /&gt;
&lt;br /&gt;
==== The ILSVRC-2012 Competition on ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* The dataset has 1.2 million high-resolution training images.&lt;br /&gt;
&lt;br /&gt;
* The classification task:&lt;br /&gt;
** Get the &amp;quot;correct&amp;quot; class in your top 5 bets. There are 1000 classes.&lt;br /&gt;
&lt;br /&gt;
* The localization task:&lt;br /&gt;
** For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.&lt;br /&gt;
&lt;br /&gt;
* Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...&lt;br /&gt;
** Computer vision systems use complicated multi-stage systems.&lt;br /&gt;
** The early stages are typically hand-tuned by optimizing a few parameters.&lt;br /&gt;
&lt;br /&gt;
==== Error Rates on the ILSVRC-2012 Competition ====&lt;br /&gt;
&lt;br /&gt;
* University of Tokyo: 26.1% (classification)&lt;br /&gt;
* Oxford University Computer Vision Group: 26.9% (classification)&lt;br /&gt;
* INRIA(French national research institute in CS) + XRCE(Xerox Research Center Europe): 27.0% (classification)&lt;br /&gt;
* University of Amsterdam: 29.5% (classification)&lt;br /&gt;
* University of Toronto (&amp;#039;&amp;#039;&amp;#039;Alex Krizhevsky&amp;#039;&amp;#039;&amp;#039;): &amp;#039;&amp;#039;&amp;#039;16.4%&amp;#039;&amp;#039;&amp;#039; (classification)&lt;br /&gt;
&lt;br /&gt;
==== A Neural Network for ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its &amp;#039;&amp;#039;architecture&amp;#039;&amp;#039; was:&lt;br /&gt;
** 7 hidden layers not counting some max pooling layers.&lt;br /&gt;
** The early layers were convolutional.&lt;br /&gt;
** The last two layers were globally connected.&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;activation functions&amp;#039;&amp;#039; were:&lt;br /&gt;
** Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.&lt;br /&gt;
** Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.&lt;br /&gt;
&lt;br /&gt;
==== Tricks that Significantly Improve Generalization ====&lt;br /&gt;
&lt;br /&gt;
* Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.&lt;br /&gt;
** At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.&lt;br /&gt;
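The ten-patch scheme can be sketched like this (toy 6x6 image and 4x4 patches instead of 256x256 and 224x224; `crop` and `ten_patches` are illustrative helpers, not Krizhevsky's code):

```python
# Build the ten test-time patches: the four corner crops, the central
# crop, and the left-right reflection of each of those five.

def crop(img, top, left, size):
    return [row[left:left + size] for row in img[top:top + size]]

def ten_patches(img, size):
    n = len(img)
    c = (n - size) // 2
    origins = [(0, 0), (0, n - size), (n - size, 0), (n - size, n - size),
               (c, c)]
    patches = [crop(img, t, l, size) for t, l in origins]
    patches += [[row[::-1] for row in p] for p in patches]  # reflections
    return patches

img = [[i * 6 + j for j in range(6)] for i in range(6)]
patches = ten_patches(img, 4)
assert len(patches) == 10
```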
&lt;br /&gt;
* Use &amp;quot;dropout&amp;quot; to regularize the weights in the globally connected layers (which contain most of the parameters).&lt;br /&gt;
** Dropout means that half of the hidden units in a layer are randomly removed for each training example.&lt;br /&gt;
** This stops hidden units from relying too much on other hidden units.&lt;br /&gt;
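A minimal sketch of dropout as described above (the activities and seed are illustrative):

```python
import random

# Dropout: for each training case, randomly remove hidden units by
# zeroing their activities, keeping each unit with probability p = 0.5.

def dropout(activities, p=0.5, rng=random):
    return [a if rng.random() < p else 0.0 for a in activities]

rng = random.Random(0)
h = [0.2, 1.5, 0.7, 0.9]
h_dropped = dropout(h, rng=rng)
# Each activity is either kept unchanged or zeroed.
assert all(v == 0.0 or v in h for v in h_dropped)
```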
&lt;br /&gt;
==== The Hardware Required for Alex Krizhevsky&amp;#039;s Net ====&lt;br /&gt;
&lt;br /&gt;
* He uses a very efficient implementation of convolutional nets on two Nvidia GTX 580 Graphics Processor Units (over 1000 fast little cores)&lt;br /&gt;
** GPUs are very good for matrix-matrix multiplies.&lt;br /&gt;
** GPUs have very high bandwidth to memory.&lt;br /&gt;
** This allows him to train the network in a week.&lt;br /&gt;
** It also makes it quick to combine results from 10 patches at test time.&lt;br /&gt;
&lt;br /&gt;
* We can spread a network over many cores if we can communicate the states fast enough.&lt;br /&gt;
&lt;br /&gt;
* As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.&lt;br /&gt;
&lt;br /&gt;
==== Finding Roads in High-Resolution Images ====&lt;br /&gt;
&lt;br /&gt;
* Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images.&lt;br /&gt;
** It takes a large image patch and predicts a binary road label for the central 16x16 pixels.&lt;br /&gt;
** There is lots of labeled training data available for this task.&lt;br /&gt;
&lt;br /&gt;
* The task is hard for many reasons:&lt;br /&gt;
** Occlusion by buildings, trees, and cars.&lt;br /&gt;
** Shadows and lighting changes.&lt;br /&gt;
** Minor viewpoint changes.&lt;br /&gt;
&lt;br /&gt;
* The worst problems are incorrect labels:&lt;br /&gt;
** Badly registered maps&lt;br /&gt;
** Arbitrary decisions about what counts as a road (vs laneway etc.)&lt;br /&gt;
&lt;br /&gt;
== Mini-Batch Gradient Descent ==&lt;br /&gt;
&lt;br /&gt;
=== Overview ===&lt;br /&gt;
&lt;br /&gt;
* The Error Surface: Lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl where vertical cross-sections are parabolas and horizontal cross-sections are ellipses.&lt;br /&gt;
&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated. But locally, a piece of a quadratic bowl is usually a very good approximation.&lt;br /&gt;
&lt;br /&gt;
* Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.&lt;br /&gt;
** The gradient is big in the direction in which we only want to travel a small distance, small in the direction in which we want to travel a large distance.&lt;br /&gt;
&lt;br /&gt;
* If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges.&lt;br /&gt;
&lt;br /&gt;
* What we would like to achieve: move quickly in directions with small but consistent gradients, and move slowly in directions with big but inconsistent gradients.&lt;br /&gt;
&lt;br /&gt;
==== Stochastic Gradient Descent ====&lt;br /&gt;
&lt;br /&gt;
* If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.&lt;br /&gt;
** So, instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.&lt;br /&gt;
** The extreme version of this approach updates weights after each case. It&amp;#039;s called &amp;quot;online&amp;quot; learning.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches are usually better than online.&lt;br /&gt;
** Less computation is used updating the weights.&lt;br /&gt;
** Computing the gradient for many cases simultaneously uses matrix-matrix multiplies which are very efficient, especially on GPUs.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches need to be balanced for classes (one way to approximate this is to pick points randomly).&lt;br /&gt;
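The mini-batch idea can be sketched for a single linear neuron with squared error (toy noiseless data; the learning rate, batch size, and epoch count are illustrative):

```python
import random

# Mini-batch gradient descent: shuffle the cases each epoch, slice them
# into small batches, and update the weights after every batch instead
# of after a full sweep through the training set.

def minibatch_sgd(data, w, b, batch_size=2, lr=0.1, epochs=300, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(data)          # random batches approximate class balance
        for s in range(0, len(data), batch_size):
            batch = data[s:s + batch_size]
            gw = sum((w * x + b - t) * x for x, t in batch) / len(batch)
            gb = sum((w * x + b - t) for x, t in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

# Toy data generated from t = 2x + 1 with no noise.
data = [(x, 2.0 * x + 1.0) for x in [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]]
w, b = minibatch_sgd(data, w=0.0, b=0.0)
assert abs(w - 2.0) < 1e-2 and abs(b - 1.0) < 1e-2
```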
&lt;br /&gt;
==== Two types of learning algorithm ====&lt;br /&gt;
&lt;br /&gt;
* If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).&lt;br /&gt;
&lt;br /&gt;
* For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.&lt;br /&gt;
&lt;br /&gt;
==== Basic Mini-Batch Gradient Descent Algorithm ====&lt;br /&gt;
&lt;br /&gt;
* Guess an initial learning rate.&lt;br /&gt;
** If the error keeps getting worse or oscillates wildly, reduce the learning rate.&lt;br /&gt;
** If the error is falling fairly consistently but slowly, increase the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Write a simple program to automate this way of adjusting the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Toward the end of mini-batch learning, it nearly always helps to turn down the learning rate.&lt;br /&gt;
** This removes fluctuations in the final weights caused by the variations between mini-batches.&lt;br /&gt;
&lt;br /&gt;
* Turn down the learning rate when the error stops decreasing.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Note&amp;#039;&amp;#039;&amp;#039;: Compute the error in a separate validation set.&lt;br /&gt;
&lt;br /&gt;
=== Tricks for Mini-Batch Gradient Descent ===&lt;br /&gt;
&lt;br /&gt;
==== Initializing The Weights ====&lt;br /&gt;
&lt;br /&gt;
* If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.&lt;br /&gt;
&lt;br /&gt;
** So they can never learn to be different features.&lt;br /&gt;
** We break symmetry by initializing the weights to have small random values.&lt;br /&gt;
&lt;br /&gt;
* If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.&lt;br /&gt;
** We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to $1/\sqrt{\text{fan-in}}$.&lt;br /&gt;
&lt;br /&gt;
* We can also scale the learning rate the same way.&lt;br /&gt;
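A sketch of fan-in-scaled initialization (assuming the $1/\sqrt{\text{fan-in}}$ scaling above; the layer sizes are illustrative):

```python
import math
import random

# Give each unit small random incoming weights whose scale shrinks as
# 1/sqrt(fan-in), so units with many inputs do not get oversized total
# input. The random values also break the symmetry between hidden units.

def init_weights(fan_in, fan_out, rng):
    scale = 1.0 / math.sqrt(fan_in)
    return [[rng.uniform(-scale, scale) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)
W = init_weights(fan_in=100, fan_out=10, rng=rng)
assert all(abs(wi) <= 0.1 for row in W for wi in row)  # 1/sqrt(100) = 0.1
```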
&lt;br /&gt;
==== Shifting the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, shifting the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.&lt;br /&gt;
&lt;br /&gt;
* The hyperbolic tangent (which satisfies $\tanh(z) = 2\,\text{logistic}(2z) - 1$) produces hidden activations that are roughly zero mean. &lt;br /&gt;
** In this respect it&amp;#039;s better than the logistic.&lt;br /&gt;
&lt;br /&gt;
==== Scaling the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, scaling the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.&lt;br /&gt;
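Shifting and scaling together amount to standardizing each input component; a sketch with toy data:

```python
import math

# Shift each input component to zero mean and scale it to unit variance
# over the whole training set.

def standardize(data):
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n)
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)]
            for row in data]

# One component near 100, one near 0: very different shifts and scales.
data = [[101.0, -0.2], [99.0, 0.2], [100.0, 0.0], [98.0, 0.4], [102.0, -0.4]]
z = standardize(data)
col0 = [row[0] for row in z]
assert abs(sum(col0)) < 1e-9                               # zero mean
assert abs(sum(v * v for v in col0) / len(col0) - 1.0) < 1e-9  # unit variance
```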
&lt;br /&gt;
==== A More Thorough Method: Decorrelate the Input Components ====&lt;br /&gt;
&lt;br /&gt;
* For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.&lt;br /&gt;
&lt;br /&gt;
* There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.&lt;br /&gt;
** Drop the principal components with the smallest eigenvalues ― &amp;#039;&amp;#039;&amp;#039;This achieves some dimensionality reduction&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.&lt;br /&gt;
&lt;br /&gt;
* For a circular error surface, the gradient points straight towards the minimum.&lt;br /&gt;
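A sketch of PCA decorrelation plus whitening in the two-component case, with the 2x2 eigen-decomposition written out by hand (the data is illustrative):

```python
import math

# PCA-whiten two correlated input components: rotate onto the principal
# axes of the 2x2 covariance matrix, then divide each component by the
# square root of its eigenvalue, making the error surface circular.

def whiten2d(data):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    cxx = sum((x - mx) ** 2 for x, _ in data) / n
    cyy = sum((y - my) ** 2 for _, y in data) / n
    cxy = sum((x - mx) * (y - my) for x, y in data) / n
    # Rotation angle that diagonalizes [[cxx, cxy], [cxy, cyy]].
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    l1 = cxx * c * c + 2 * cxy * c * s + cyy * s * s   # eigenvalues
    l2 = cxx * s * s - 2 * cxy * c * s + cyy * c * c
    return [((c * (x - mx) + s * (y - my)) / math.sqrt(l1),
             (-s * (x - mx) + c * (y - my)) / math.sqrt(l2))
            for x, y in data]

data = [(1.0, 1.1), (2.0, 2.3), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
z = whiten2d(data)
cov01 = sum(a * b for a, b in z) / len(z)
assert abs(cov01) < 1e-9   # the whitened components are uncorrelated
```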
&lt;br /&gt;
&lt;br /&gt;
==== Common Problems That Occur in Multilayer Networks ====&lt;br /&gt;
&lt;br /&gt;
* If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.&lt;br /&gt;
** The error derivatives for the hidden units will all become tiny and the error will not decrease.&lt;br /&gt;
** This is usually a plateau, but people often mistake it for a local minimum.&lt;br /&gt;
&lt;br /&gt;
* In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.&lt;br /&gt;
** The network finds this strategy quickly and may take a long time to improve on it by making use of the input.&lt;br /&gt;
** This is another plateau that looks like a local minimum.&lt;br /&gt;
&lt;br /&gt;
==== Be Careful about Turning Down the Learning Rate ====&lt;br /&gt;
&lt;br /&gt;
* Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.&lt;br /&gt;
** So we get a quick win&lt;br /&gt;
** But then we get slower learning.&lt;br /&gt;
&lt;br /&gt;
* Don&amp;#039;t turn down the learning rate too soon!&lt;br /&gt;
&lt;br /&gt;
==== Four Ways to Speed up Mini-Batch Learning ====&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;momentum&amp;quot;&lt;br /&gt;
** Instead of using the gradient to change the &amp;#039;&amp;#039;position&amp;#039;&amp;#039; of the weight &amp;quot;particle&amp;quot;, use it to change its &amp;#039;&amp;#039;velocity&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
* Use separate adaptive learning rates for each parameter.&lt;br /&gt;
** Slowly adjust the rate using the consistency of the gradient for that parameter.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;rmsprop&amp;#039;&amp;#039;&amp;#039;: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. &lt;br /&gt;
** This is the mini-batch version of just using the sign of the gradient.&lt;br /&gt;
&lt;br /&gt;
* Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)&lt;br /&gt;
** Adapt it to work for neural nets.&lt;br /&gt;
** Adapt it to work for mini-batches.&lt;br /&gt;
&lt;br /&gt;
=== The Momentum Method ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.&lt;br /&gt;
# The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.&lt;br /&gt;
# Its momentum makes it keep going in the previous direction.&lt;br /&gt;
&lt;br /&gt;
* It damps oscillations in directions of high curvature by combining gradients with opposite signs.&lt;br /&gt;
&lt;br /&gt;
* It builds up speed in directions with a gentle but consistent gradient.&lt;br /&gt;
&lt;br /&gt;
==== Equations ====&lt;br /&gt;
&lt;br /&gt;
$$\mathbf{v}(t) = \alpha \mathbf{v}(t-1) - \epsilon \frac{\partial E}{\partial \mathbf{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The effect of the gradient is to increment the previous velocity. The velocity also decays by a factor $\alpha$, which is slightly less than 1.&lt;br /&gt;
&lt;br /&gt;
The weight change is equal to the current velocity.&lt;br /&gt;
&lt;br /&gt;
$$\Delta \mathbf{w}(t) = \mathbf{v} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \mathbf{v} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbf{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \Delta \mathbf{w} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbf{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The weight change can be expressed in terms of the previous weight change and the current gradient.&lt;br /&gt;
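The update equations can be run directly on a one-dimensional quadratic error (the constants are illustrative, not from the lecture):

```python
# Momentum update on E(w) = w**2 / 2, whose gradient is simply w.
# The velocity accumulates a decaying sum of past gradients and the
# weight moves by the current velocity.

def momentum_step(w, v, grad, alpha=0.9, eps=0.1):
    v = alpha * v - eps * grad      # v(t) = alpha*v(t-1) - eps*dE/dw
    w = w + v                       # delta w(t) = v(t)
    return w, v

w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)
assert abs(w) < 1e-2                # spirals in to the minimum at w = 0
```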
&lt;br /&gt;
==== Behavior ====&lt;br /&gt;
&lt;br /&gt;
* If the error surface is a tilted plane, the ball reaches a terminal velocity $\mathbf{v}(\infty)$.&lt;br /&gt;
** If the momentum is close to 1, this is much faster than simple gradient descent.&lt;br /&gt;
&lt;br /&gt;
$$\mathbf{v}(\infty) = \frac{1}{1 - \alpha} \Big( - \epsilon \frac{\partial E}{\partial \mathbf{w}} \Big)$$&lt;br /&gt;
&lt;br /&gt;
* At the beginning of learning there may be very large gradients.&lt;br /&gt;
** So it pays to use a small momentum (e.g. 0.5).&lt;br /&gt;
** Once the large gradients have disappeared and the weights are stuck in a ravine the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99)&lt;br /&gt;
&lt;br /&gt;
* This allows us to learn at a rate that would cause divergent oscillations without the momentum.&lt;br /&gt;
&lt;br /&gt;
==== Better Type of Momentum (Nesterov 1983) ====&lt;br /&gt;
&lt;br /&gt;
* The standard momentum method &amp;#039;&amp;#039;first&amp;#039;&amp;#039; computes the gradient at the current location and &amp;#039;&amp;#039;then&amp;#039;&amp;#039; takes a big jump in the direction of the updated accumulated gradient.&lt;br /&gt;
* Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.&lt;br /&gt;
** Inspired by the Nesterov method for optimizing convex functions.&lt;br /&gt;
* &amp;#039;&amp;#039;First&amp;#039;&amp;#039; make a big jump in the direction of the previous accumulated gradient.&lt;br /&gt;
* &amp;#039;&amp;#039;Then&amp;#039;&amp;#039; measure the gradient where you end up and make a correction.&lt;br /&gt;
** It&amp;#039;s better to correct a mistake &amp;#039;&amp;#039;after&amp;#039;&amp;#039; you have made it!&lt;br /&gt;
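The jump-then-correct order can be sketched on the same toy quadratic used for standard momentum (constants illustrative):

```python
# Nesterov-style momentum: first jump in the direction of the stored
# velocity, then measure the gradient where you end up and correct.
# Shown on the 1-D quadratic E(w) = w**2 / 2, whose gradient is w.

def nesterov_step(w, v, grad_fn, alpha=0.9, eps=0.1):
    w_ahead = w + alpha * v     # big jump along the previous velocity
    g = grad_fn(w_ahead)        # gradient at the point we jumped to
    v = alpha * v - eps * g     # correction after the mistake is made
    return w + v, v

w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn=lambda x: x)
assert abs(w) < 1e-2
```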
&lt;br /&gt;
=== Adaptive Learning Rates for Each Connection ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
* In a multilayer net, the appropriate learning rates can vary widely between weights:&lt;br /&gt;
** The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.&lt;br /&gt;
** The fan-in of a unit determines the size of the &amp;quot;overshoot&amp;quot; effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.&lt;br /&gt;
&lt;br /&gt;
* So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.&lt;br /&gt;
&lt;br /&gt;
==== One Way to Determine the Individual Learning Rates ====&lt;br /&gt;
&lt;br /&gt;
$$\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$$&lt;br /&gt;
&lt;br /&gt;
* Start with a local gain of 1 for every weight. i.e. $g_{ij}(0) = 1$&lt;br /&gt;
* Increase the local gain if the gradient for that weight does not change sign.&lt;br /&gt;
* Use small additive increases and multiplicative decreases.&lt;br /&gt;
** This ensures that big gains decay rapidly when oscillations start.&lt;br /&gt;
** If the gradient is totally random, the gain will hover around 1 when we increase by &amp;#039;&amp;#039;plus&amp;#039;&amp;#039; $\delta$ half the time and decrease by &amp;#039;&amp;#039;times&amp;#039;&amp;#039; $1 - \delta$ half the time.&lt;br /&gt;
&lt;br /&gt;
$$\text{if } \Big( \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) \Big) &amp;gt; 0$$&lt;br /&gt;
$$\text{then } g_{ij}(t) = g_{ij}(t - 1) + 0.05$$&lt;br /&gt;
$$\text{else } g_{ij}(t) = g_{ij}(t - 1) * 0.95$$&lt;br /&gt;
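The gain-update rule above in code, with the gains clipped to a reasonable range such as [0.1, 10] (the gradient sequence is illustrative):

```python
# Per-weight gain update: additive increase when the gradient keeps its
# sign, multiplicative decrease when it flips, so big gains decay
# rapidly once oscillations start.

def update_gain(gain, grad, prev_grad, up=0.05, down=0.95,
                lo=0.1, hi=10.0):
    if grad * prev_grad > 0:
        gain = gain + up        # consistent sign: grow additively
    else:
        gain = gain * down      # sign flip: shrink multiplicatively
    return min(max(gain, lo), hi)

gain = 1.0
for g_prev, g in [(0.2, 0.3), (0.3, 0.1), (0.1, -0.4)]:
    gain = update_gain(gain, g, g_prev)
# Two agreements then one flip: 1.0 -> 1.05 -> 1.10 -> 1.045
assert abs(gain - 1.045) < 1e-9
```

The weight update is then $\Delta w_{ij} = - \epsilon\, g_{ij}\, \partial E / \partial w_{ij}$ as in the equation above.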
&lt;br /&gt;
==== Tricks for Making Adaptive Learning Rates Work Better ====&lt;br /&gt;
&lt;br /&gt;
* Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].&lt;br /&gt;
&lt;br /&gt;
* Use full batch learning or very big mini-batches.&lt;br /&gt;
** This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini batch.&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates can be combined with momentum.&lt;br /&gt;
** Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates only deal with axis-aligned effects.&lt;br /&gt;
** Momentum does not care about the alignment of the axes.&lt;br /&gt;
&lt;br /&gt;
=== RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude ===&lt;br /&gt;
&lt;br /&gt;
==== RPROP: Using only the sign of the gradient ====&lt;br /&gt;
&lt;br /&gt;
* The magnitude of the gradient can be very different for different weights and can change during learning.&lt;br /&gt;
** This makes it hard to choose a single global learning rate.&lt;br /&gt;
&lt;br /&gt;
* For &amp;#039;&amp;#039;full batch learning&amp;#039;&amp;#039;, we can deal with this variation by only using the sign of the gradient.&lt;br /&gt;
** The weight updates are all of the same magnitude.&lt;br /&gt;
** This escapes from plateaus with tiny gradients quickly.&lt;br /&gt;
&lt;br /&gt;
* RPROP: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.&lt;br /&gt;
** Increase the step size for a weight &amp;#039;&amp;#039;multiplicatively&amp;#039;&amp;#039; (e.g. times 1.2) if the signs of its last two gradients agree.&lt;br /&gt;
** Otherwise decrease the step size multiplicatively (e.g. times 0.5).&lt;br /&gt;
** Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster&amp;#039;s advice).&lt;br /&gt;
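An RPROP step for a single weight, following the multiplicative factors and limits suggested above (the quadratic test problem is illustrative):

```python
# RPROP: only the sign of the gradient is used; the step size for each
# weight adapts multiplicatively based on sign agreement.

def rprop_step(w, step, grad, prev_grad,
               up=1.2, down=0.5, max_step=50.0, min_step=1e-6):
    if grad * prev_grad > 0:
        step = min(step * up, max_step)     # signs agree: bigger steps
    elif grad * prev_grad < 0:
        step = max(step * down, min_step)   # signs flip: smaller steps
    if grad > 0:
        w -= step
    elif grad < 0:
        w += step
    return w, step

# Minimize E(w) = w**2 / 2 (gradient = w) starting far from the minimum:
# the step grows until w overshoots zero, then shrinks as signs flip.
w, step, prev = 100.0, 1.0, 0.0
for _ in range(100):
    g = w
    w, step = rprop_step(w, step, g, prev)
    prev = g
assert abs(w) < 1.0
```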
&lt;br /&gt;
==== Why RPROP Does Not Work with Mini-Batches ====&lt;br /&gt;
&lt;br /&gt;
* The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.&lt;br /&gt;
** Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.&lt;br /&gt;
** We want this weight to stay roughly where it is.&lt;br /&gt;
&lt;br /&gt;
* RPROP would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale), so the weight would grow a lot.&lt;br /&gt;
* Is there a way to combine:&lt;br /&gt;
** The robustness of RPROP.&lt;br /&gt;
** The efficiency of mini-batches.&lt;br /&gt;
** The effective averaging of gradients over mini-batches.&lt;br /&gt;
&lt;br /&gt;
==== RMSPROP: Mini-Batch Version of RPROP ====&lt;br /&gt;
&lt;br /&gt;
* RPROP is equivalent to using the gradient but also dividing by the size of the gradient.&lt;br /&gt;
** The problem with mini-batch RPROP is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?&lt;br /&gt;
&lt;br /&gt;
* RMSPROP: Keep a moving average of the squared gradient for each weight.&lt;br /&gt;
$$MeanSquare(w,t) = 0.9 \cdot MeanSquare(w, t-1) + 0.1 \Big( \frac{\partial E}{\partial w}(t) \Big) ^2$$&lt;br /&gt;
&lt;br /&gt;
* Dividing the gradient by $\sqrt{MeanSquare(w,t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).&lt;br /&gt;
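The update in code, using the same 0.9/0.1 moving average as the equation above (the learning rate and test problem are illustrative):

```python
import math

# rmsprop for one weight: keep a decaying average of the squared
# gradient and divide the current gradient by its square root.

def rmsprop_step(w, ms, grad, lr=0.01, decay=0.9, eps=1e-8):
    ms = decay * ms + (1 - decay) * grad ** 2
    w -= lr * grad / (math.sqrt(ms) + eps)   # eps avoids division by zero
    return w, ms

# Minimize E(w) = w**2 / 2 (gradient = w): the effective step stays
# close to lr even as the raw gradient shrinks.
w, ms = 5.0, 0.0
for _ in range(2000):
    w, ms = rmsprop_step(w, ms, grad=w)
assert abs(w) < 0.1
```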
&lt;br /&gt;
==== Further Developments of RMSPROP ====&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with standard momentum&lt;br /&gt;
** Momentum does not help as much as it normally does. Needs more investigation.&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with Nesterov momentum (Sutskever 2012)&lt;br /&gt;
** It works best if the RMS of the recent gradients is used to divide the correction rather than the jump in the direction of accumulated corrections.&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with adaptive learning rates for each connection&lt;br /&gt;
** Needs more investigation.&lt;br /&gt;
&lt;br /&gt;
* Other methods related to RMSPROP&lt;br /&gt;
** Yann LeCun&amp;#039;s group has a fancy version in &amp;quot;No more pesky learning rates&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Summary of Learning Methods for Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.&lt;br /&gt;
** Conjugate gradient, LBFGS, ...&lt;br /&gt;
** Adaptive learning rates, RPROP, ...&lt;br /&gt;
&lt;br /&gt;
* For big, redundant datasets, use mini-batches.&lt;br /&gt;
** Try gradient descent with momentum.&lt;br /&gt;
** Try RMSPROP (with momentum?)&lt;br /&gt;
** Try LeCun&amp;#039;s latest recipe.&lt;br /&gt;
&lt;br /&gt;
* Why there is no simple recipe:&lt;br /&gt;
*# &amp;#039;&amp;#039;&amp;#039;Neural nets differ a lot&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
*#* Very deep nets (especially ones with narrow bottlenecks).&lt;br /&gt;
*#* Recurrent nets.&lt;br /&gt;
*#* Wide shallow nets.&lt;br /&gt;
*# &amp;#039;&amp;#039;&amp;#039;Tasks differ a lot&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
*#* Some require very accurate weights, some don&amp;#039;t.&lt;br /&gt;
*#* Some have many very rare cases (e.g. words).&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=477</id>
		<title>Neural Networks (Geoffrey Hinton Course)</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=477"/>
		<updated>2017-01-15T04:07:23Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Summary of Learning Methods for Neural Networks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Some Simple Models of Neurons ==&lt;br /&gt;
&lt;br /&gt;
$y$ output, $x_i$ input.&lt;br /&gt;
&lt;br /&gt;
=== Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$y = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$w_i$ weights, $b$ bias&lt;br /&gt;
&lt;br /&gt;
=== Binary Threshold Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq \theta$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
Or, equivalently,&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq 0$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
=== Rectified Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = z$ if $z &amp;gt; 0$, $0$ otherwise. (Linear above zero, zero below.)&lt;br /&gt;
&lt;br /&gt;
=== Sigmoid Neurons ===&lt;br /&gt;
&lt;br /&gt;
Give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
=== Stochastic Binary Neurons ===&lt;br /&gt;
&lt;br /&gt;
Same equations as logistic units, but outputs $1$ (=spike) or $0$ randomly based on the probability. They treat the output of the logistic as the &amp;lt;em&amp;gt;probability&amp;lt;/em&amp;gt; of producing a spike in a short time window.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$P(s = 1) = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.&lt;br /&gt;
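The neuron models above all share the same total input $z = b + \sum_i x_i w_i$ and differ only in how the output is computed from it; a sketch with toy inputs:

```python
import math

# The deterministic neuron models: same total input, different outputs.

def total_input(x, w, b):
    return b + sum(xi * wi for xi, wi in zip(x, w))

def binary_threshold(z):
    return 1 if z >= 0 else 0       # the theta-form folded into the bias

def rectified_linear(z):
    return z if z > 0 else 0.0      # linear above zero, zero below

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))   # smooth, bounded in (0, 1)

z = total_input(x=[1.0, -2.0], w=[0.5, 0.25], b=1.0)   # z = 1.0
assert binary_threshold(z) == 1
assert rectified_linear(z) == 1.0
assert abs(logistic(z) - 1.0 / (1.0 + math.exp(-1.0))) < 1e-12
```

For a stochastic binary neuron, `logistic(z)` would be used as the probability of emitting a spike rather than as the output itself.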
&lt;br /&gt;
== Types of Learning ==&lt;br /&gt;
&lt;br /&gt;
=== Supervised Learning ===&lt;br /&gt;
Learn to predict an output when given an input vector.&lt;br /&gt;
* Regression: The target output is a real number or a whole vector of real numbers.&lt;br /&gt;
* Classification: The target output is a class label.&lt;br /&gt;
&lt;br /&gt;
==== How Supervised Learning Typically Works ====&lt;br /&gt;
# Start by choosing a model-class: $y = f(x;W)$&lt;br /&gt;
#* A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.&lt;br /&gt;
# Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.&lt;br /&gt;
#* For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.&lt;br /&gt;
#* For classification there are other measures that are generally more sensible (they also work better).&lt;br /&gt;
&lt;br /&gt;
=== Reinforcement Learning ===&lt;br /&gt;
Learn to select an action to maximize payoff.&lt;br /&gt;
&lt;br /&gt;
* The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.&lt;br /&gt;
** The goal in selecting each action is to maximize the expected sum of the future rewards.&lt;br /&gt;
** We usually use a discount factor for delayed rewards so that we don&amp;#039;t have to look too far into the future.&lt;br /&gt;
* Reinforcement learning is difficult because:&lt;br /&gt;
** The rewards are typically delayed so it&amp;#039;s hard to know where we went wrong (or right).&lt;br /&gt;
** A scalar reward does not supply much information.&lt;br /&gt;
* Typically you can&amp;#039;t learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning); you typically learn only dozens or thousands of parameters.&lt;br /&gt;
* Will not be covered in this course.&lt;br /&gt;
&lt;br /&gt;
=== Unsupervised Learning ===&lt;br /&gt;
Discover a good internal representation of the input.&lt;br /&gt;
&lt;br /&gt;
* For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).&lt;br /&gt;
* It is hard to say what the aim of unsupervised learning is:&lt;br /&gt;
** One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.&lt;br /&gt;
** You can compute the distance to a surface by using the disparity between two images. But you don&amp;#039;t want to learn to compute disparities by stubbing your toe thousands of times.&lt;br /&gt;
* Other goals:&lt;br /&gt;
** Providing a compact, low-dimensional representation of the input.&lt;br /&gt;
*** High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds)&lt;br /&gt;
*** Principal Component Analysis is a widely used linear method for finding a low-dimensional representation.&lt;br /&gt;
** Providing an economical high-dimensional representation of the input in terms of learned features.&lt;br /&gt;
*** Binary features&lt;br /&gt;
*** Real-valued features that are nearly all zero&lt;br /&gt;
** Finding sensible clusters in the input&lt;br /&gt;
*** This is an example of a &amp;lt;em&amp;gt;very&amp;lt;/em&amp;gt; sparse code in which only one of the features is non-zero.&lt;br /&gt;
&lt;br /&gt;
== Neural Network Architectures ==&lt;br /&gt;
&lt;br /&gt;
# Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.&lt;br /&gt;
# Recurrent neural network: information can flow around in cycles.&lt;br /&gt;
# Symmetrically connected network: weights are the same in both directions between two units.&lt;br /&gt;
&lt;br /&gt;
===  Feed-forward Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* The commonest type of neural network&lt;br /&gt;
* The first layer is the input and the last layer is the output&lt;br /&gt;
* Called &amp;quot;deep&amp;quot; neural networks if there is more than one hidden layer.&lt;br /&gt;
* They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.&lt;br /&gt;
&lt;br /&gt;
=== Recurrent Networks ===&lt;br /&gt;
&lt;br /&gt;
* Have directed cycles in their connection graph.&lt;br /&gt;
* Have complicated dynamics — difficult to train.&lt;br /&gt;
* More biologically realistic.&lt;br /&gt;
* Very natural way to model &amp;lt;em&amp;gt;sequential&amp;lt;/em&amp;gt; data&lt;br /&gt;
* Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).&lt;br /&gt;
&lt;br /&gt;
=== Symmetrically Connected Networks ===&lt;br /&gt;
&lt;br /&gt;
* Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).&lt;br /&gt;
* Much easier to analyze than recurrent networks (John Hopfield, et al)&lt;br /&gt;
* More restricted in what they can do, because they obey an energy function. For example, they cannot model cycles.&lt;br /&gt;
&lt;br /&gt;
== The First Generation of Neural Networks ==&lt;br /&gt;
&lt;br /&gt;
=== Standard Paradigm for Statistical Pattern Recognition ===&lt;br /&gt;
&lt;br /&gt;
* Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).&lt;br /&gt;
* Learn how to weight each of the feature activations to get a single scalar quantity.&lt;br /&gt;
* If this quantity is above some threshold, decide that the input vector is a positive example of the target class.&lt;br /&gt;
&lt;br /&gt;
=== Perceptrons ===&lt;br /&gt;
&lt;br /&gt;
input units =&amp;gt; (by hand-coded programs) =&amp;gt; feature units =&amp;gt; decision unit&lt;br /&gt;
&lt;br /&gt;
* Popularized by Frank Rosenblatt in the early 1960s.&lt;br /&gt;
* Lots of grand claims were made for what they could learn to do.&lt;br /&gt;
* In 1969, Minsky and Papert showed Perceptrons&amp;#039; limitations (&amp;#039;&amp;#039;&amp;#039;Group Invariance Theorem&amp;#039;&amp;#039;&amp;#039;). Many people thought these limitations applied to all neural network models.&lt;br /&gt;
* Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.&lt;br /&gt;
* Decision units in perceptrons are &amp;#039;&amp;#039;&amp;#039;binary threshold neurons&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Perceptron Learning Procedure ====&lt;br /&gt;
&lt;br /&gt;
# Add an extra component with value 1 to each input vector and learn a &amp;quot;bias&amp;quot; weight on it, so we can forget about the threshold: the bias equals minus the threshold.&lt;br /&gt;
# Pick training cases using any policy that ensures that every training case will keep getting picked.&lt;br /&gt;
#* If the output unit is correct, leave its weights alone.&lt;br /&gt;
#* If the output unit incorrectly outputs a zero, add the input vector to the weight vector.&lt;br /&gt;
#* If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.&lt;br /&gt;
# This is guaranteed to find a set of weights that gets the right answer for all the training cases &amp;#039;&amp;#039;if any such set exists&amp;#039;&amp;#039;.&lt;br /&gt;
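&lt;br /&gt;
A minimal Python sketch of this procedure (my own illustration, not code from the course); cycling through the cases in order ensures that every training case keeps getting picked:&lt;br /&gt;
&lt;br /&gt;
```python
def predict(w, x):
    # Binary threshold unit with the bias folded into the weights.
    xb = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0

def train_perceptron(cases, n_epochs=100):
    # cases: list of (input_vector, target) pairs with targets 0 or 1.
    w = [0.0] * (len(cases[0][0]) + 1)   # +1 for the bias component
    for _ in range(n_epochs):
        changed = False
        for x, t in cases:
            y = predict(w, x)
            if y == t:
                continue                 # correct output: leave weights alone
            xb = list(x) + [1.0]
            sign = 1 if t == 1 else -1   # add input if we wrongly said 0, subtract if we wrongly said 1
            w = [wi + sign * xi for wi, xi in zip(w, xb)]
            changed = True
        if not changed:                  # a full error-free sweep: converged
            break
    return w

# AND is linearly separable, so the procedure is guaranteed to converge.
w = train_perceptron([([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)])
```
&lt;br /&gt;
For data that is not linearly separable, such as the two-bit same/different task discussed below, the loop simply exhausts its epochs without ever converging.&lt;br /&gt;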
&lt;br /&gt;
==== Geometrical View of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* Weight space (each point corresponds to a particular setting for all the weights)&lt;br /&gt;
* Each training case defines a hyperplane through the origin (if we eliminate the threshold/bias). That hyperplane is perpendicular to the &amp;#039;&amp;#039;input&amp;#039;&amp;#039; vector.&lt;br /&gt;
** The weights must lie on one side of this hyperplane to get the answer right.&lt;br /&gt;
* To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).&lt;br /&gt;
* If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. This means the problem is &amp;#039;&amp;#039;&amp;#039;convex&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Limitations of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.&lt;br /&gt;
* But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.&lt;br /&gt;
** A binary threshold output unit cannot even tell if two single bit features are the same!&lt;br /&gt;
*** Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$&lt;br /&gt;
*** Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$&lt;br /&gt;
*** The four input-output pairs give four inequalities that are impossible to satisfy.&lt;br /&gt;
** Imagine &amp;#039;data-space&amp;#039; in which the axes correspond to components of an input vector.&lt;br /&gt;
*** Each input vector is a point in this space, a weight vector defines a plane in data-space.&lt;br /&gt;
*** The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.&lt;br /&gt;
*** There are many cases where we cannot separate positive and negative cases only by a hyperplane. Those cases are called &amp;#039;&amp;#039;&amp;#039;not linearly separable&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Cannot discriminate simple patterns under translation with wrap-around.&lt;br /&gt;
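&lt;br /&gt;
The four impossible inequalities mentioned above can be written out explicitly (a short derivation filling in the step the bullet alludes to). With weights $w_1, w_2$ and threshold $\theta$, the binary threshold unit outputs 1 exactly when $w_1 x_1 + w_2 x_2 \ge \theta$, so the four cases require:&lt;br /&gt;
&lt;br /&gt;
$$w_1 + w_2 \ge \theta \text{ (from } (1,1) \text{)}, \qquad 0 \ge \theta \text{ (from } (0,0) \text{)}$$&lt;br /&gt;
$$w_1 &amp;lt; \theta \text{ (from } (1,0) \text{)}, \qquad w_2 &amp;lt; \theta \text{ (from } (0,1) \text{)}$$&lt;br /&gt;
&lt;br /&gt;
Adding the two negative-case inequalities gives $w_1 + w_2 &amp;lt; 2\theta$; the $(0,0)$ case gives $\theta \le 0$, hence $2\theta \le \theta$, so $w_1 + w_2 &amp;lt; \theta$, contradicting the $(1,1)$ case.&lt;br /&gt;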
&lt;br /&gt;
&lt;br /&gt;
==== Conclusion ====&lt;br /&gt;
&lt;br /&gt;
* Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help; the result is still linear. Fixed output non-linearities are not enough.&lt;br /&gt;
* We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?&lt;br /&gt;
** We need an efficient way of adapting all the weights, not just the last layer. This is hard.&lt;br /&gt;
** Learning the weights going into hidden units is equivalent to learning features.&lt;br /&gt;
** This is difficult because nobody is telling us directly what the hidden units should do.&lt;br /&gt;
&lt;br /&gt;
== Linear Neurons ==&lt;br /&gt;
&lt;br /&gt;
=== Learning Overview ===&lt;br /&gt;
&lt;br /&gt;
$y = \sum_{i} w_i x_i = \mathbf{w}^T \mathbf{x}$&lt;br /&gt;
&lt;br /&gt;
$y$: Neuron&amp;#039;s estimate of the desired output, $\mathbf{w}$: weight vector, $\mathbf{x}$: input vector&lt;br /&gt;
&lt;br /&gt;
* Perceptron convergence procedure doesn&amp;#039;t work for more complex networks (because the problem is not &amp;#039;&amp;#039;convex&amp;#039;&amp;#039;). Multi-layer neural networks do not use the perceptron learning procedure.&lt;br /&gt;
* Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.&lt;br /&gt;
* The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.&lt;br /&gt;
&lt;br /&gt;
* Why don&amp;#039;t we solve it analytically?&lt;br /&gt;
** Scientific answer: We want a method that real neurons could use.&lt;br /&gt;
** Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.&lt;br /&gt;
&lt;br /&gt;
=== Learning Procedure ===&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;delta-rule&amp;#039;&amp;#039; for learning:&lt;br /&gt;
** $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value&lt;br /&gt;
** $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error&lt;br /&gt;
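&lt;br /&gt;
A small Python sketch of online delta-rule learning for a linear neuron (illustrative; the learning rate and the toy data are made up):&lt;br /&gt;
&lt;br /&gt;
```python
def train_linear_neuron(cases, epsilon=0.05, n_epochs=200):
    # Online delta-rule: after each case, w_i gets epsilon * x_i * (t - y).
    w = [0.0] * len(cases[0][0])
    for _ in range(n_epochs):
        for x, t in cases:
            y = sum(wi * xi for wi, xi in zip(w, x))   # linear neuron output
            w = [wi + epsilon * xi * (t - y) for wi, xi in zip(w, x)]
    return w

# The targets are generated by t = 2*x1 - x2, so the delta rule
# should drive the weights toward (2, -1).
cases = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = train_linear_neuron(cases)
```
&lt;br /&gt;
With a small enough learning rate and consistent data the weights converge toward the exact solution; with inconsistent data they settle near the least-squares compromise.&lt;br /&gt;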
&lt;br /&gt;
=== The Error Surface for Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
* The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl.&lt;br /&gt;
** Vertical cross-sections are parabolas.&lt;br /&gt;
** Horizontal cross-sections are ellipses&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated.&lt;br /&gt;
* The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.&lt;br /&gt;
* The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.&lt;br /&gt;
&lt;br /&gt;
== Logistic Neurons ==&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
* These give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
* The derivatives:&lt;br /&gt;
** $\frac{\partial z}{\partial w_i} = x_i$&lt;br /&gt;
** $\frac{\partial z}{\partial x_i} = w_i$&lt;br /&gt;
** $\frac{dy}{dz} = y(1-y)$&lt;br /&gt;
** By the chain rule, $\frac{\partial y}{\partial w_i} = x_i y (1-y)$.&lt;br /&gt;
** By the chain rule, $\frac{\partial E}{\partial w_i} = \sum_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum_{n} x_i^n y^n (1-y^n) (t^n - y^n)$.&lt;br /&gt;
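&lt;br /&gt;
The final formula can be checked numerically with finite differences (a generic sanity check using a made-up training case, not part of the lecture):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def logistic_output(w, x, b):
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def analytic_grad(w, x, b, t):
    # dE/dw_i = -x_i * y * (1 - y) * (t - y) for E = 0.5 * (t - y)^2,
    # i.e. the single-case term of the sum derived above.
    y = logistic_output(w, x, b)
    return [-xi * y * (1 - y) * (t - y) for xi in x]

w, x, b, t = [0.3, -0.7], [1.0, 2.0], 0.1, 1.0
eps = 1e-6
errs = []
for i in range(len(w)):
    wp = list(w); wp[i] += eps
    wm = list(w); wm[i] -= eps
    Ep = 0.5 * (t - logistic_output(wp, x, b)) ** 2
    Em = 0.5 * (t - logistic_output(wm, x, b)) ** 2
    numeric = (Ep - Em) / (2 * eps)
    errs.append(abs(numeric - analytic_grad(w, x, b, t)[i]))
max_err = max(errs)   # tiny if the analytic formula is right
```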
&lt;br /&gt;
== The Backpropagation Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== Other Ideas ===&lt;br /&gt;
&lt;br /&gt;
==== Learning by Perturbing Weights ====&lt;br /&gt;
&lt;br /&gt;
* The idea occurs to everyone who knows about evolution.&lt;br /&gt;
* Randomly perturb one weight and see if it improves performance. If so, save the change.&lt;br /&gt;
** This is a form of reinforcement learning.&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;Very inefficient&amp;#039;&amp;#039;&amp;#039;. We need to do multiple forward passes on a representative set of training cases just to change one weight.&lt;br /&gt;
** Towards the end of learning, large weight perturbations will nearly always make things worse.&lt;br /&gt;
&lt;br /&gt;
==== Learning by Using Perturbations ====&lt;br /&gt;
&lt;br /&gt;
* We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.&lt;br /&gt;
* This is no better, because we need lots of trials on each training case to see the effect of changing one weight through the noise created by all the changes to the other weights.&lt;br /&gt;
* A better idea: Randomly perturb the &amp;#039;&amp;#039;activities&amp;#039;&amp;#039; of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights.&lt;br /&gt;
&lt;br /&gt;
=== The Idea behind Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* We don&amp;#039;t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.&lt;br /&gt;
* We can compute error derivatives for all the hidden units efficiently at the same time.&lt;br /&gt;
** Once we have the error derivatives for the hidden activities, it&amp;#039;s easy to get the error derivatives for the weights going into a hidden unit.&lt;br /&gt;
&lt;br /&gt;
=== Sketch of the Backpropagation Algorithm ===&lt;br /&gt;
&lt;br /&gt;
* First convert the discrepancy between each output and its target value into an error derivative.&lt;br /&gt;
* Then compute error derivatives in each hidden layer from error derivatives in the layer above.&lt;br /&gt;
* Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.&lt;br /&gt;
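&lt;br /&gt;
The three steps can be written down concretely for a single hidden layer of logistic units with squared error (a from-scratch sketch; the variable names are mine):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(x, t, W1, W2):
    # One training case. W1 maps inputs to hidden units, W2 maps hidden
    # units to outputs. Returns (dE/dW1, dE/dW2) for E = 0.5*sum((y-t)^2).
    # Forward pass.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W2]
    # Step 1: convert each output discrepancy into dE/dz at the output.
    dz_out = [(yi - ti) * yi * (1.0 - yi) for yi, ti in zip(y, t)]
    # Step 2: error derivatives for the hidden activities, computed from
    # the error derivatives in the layer above.
    dh = [sum(dz_out[k] * W2[k][j] for k in range(len(W2)))
          for j in range(len(h))]
    dz_hid = [dhj * hj * (1.0 - hj) for dhj, hj in zip(dh, h)]
    # Step 3: derivatives w.r.t. activities give derivatives w.r.t. the
    # incoming weights.
    dW2 = [[dz * hj for hj in h] for dz in dz_out]
    dW1 = [[dz * xi for xi in x] for dz in dz_hid]
    return dW1, dW2
```
&lt;br /&gt;
Running this over all training cases and summing the returned derivatives gives the full-batch gradient.&lt;br /&gt;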
&lt;br /&gt;
=== Using the Derivatives Computed by Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* The backpropagation algorithm is an efficient way of computing the error derivative $\partial E / \partial w$ for every weight on a single training case.&lt;br /&gt;
* To get a fully specified learning procedure, we still need to make a lot of other decisions about how to use these error derivatives:&lt;br /&gt;
** Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?&lt;br /&gt;
** Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?&lt;br /&gt;
&lt;br /&gt;
==== Optimization Issues ====&lt;br /&gt;
&lt;br /&gt;
* How often to update the weights:&lt;br /&gt;
** Online: after each training case.&lt;br /&gt;
** Full batch: after a full sweep through the training data.&lt;br /&gt;
** Mini-batch: after a small sample of training cases. This is typically used when training big nets on big data sets.&lt;br /&gt;
* How much to update: (will be discussed further later)&lt;br /&gt;
** Use a fixed learning rate?&lt;br /&gt;
** Adapt the global learning rate?&lt;br /&gt;
** Adapt the learning rate on each connection separately?&lt;br /&gt;
** Don&amp;#039;t use steepest descent?&lt;br /&gt;
&lt;br /&gt;
==== Generalization Issue (Overfitting) ====&lt;br /&gt;
&lt;br /&gt;
* The downside of using powerful models.&lt;br /&gt;
* The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise:&lt;br /&gt;
** The target values may be unreliable (usually only a minor worry).&lt;br /&gt;
** There is &amp;#039;&amp;#039;&amp;#039;sampling error&amp;#039;&amp;#039;&amp;#039;. There will be accidental regularities just because of the particular training cases that were chosen.&lt;br /&gt;
* When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well. &amp;#039;&amp;#039;This is a disaster&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Ways to Reduce Overfitting ====&lt;br /&gt;
&lt;br /&gt;
* A large number of different methods have been developed:&lt;br /&gt;
** Weight-decay: keep weights small / zero.&lt;br /&gt;
** Weight-sharing: insist that many weights have the same value.&lt;br /&gt;
** Early stopping: Hold out a fake test set. While training the net, stop as soon as performance on the fake test set starts getting worse.&lt;br /&gt;
** Model averaging: Train many different neural nets and average them.&lt;br /&gt;
** Bayesian fitting of neural nets: really just a fancy form of model averaging.&lt;br /&gt;
** Dropout: Randomly omit a hidden unit.&lt;br /&gt;
** Generative pre-training: (Will be discussed towards the end of the course.)&lt;br /&gt;
&lt;br /&gt;
== Learning to Predict the Next Word ==&lt;br /&gt;
&lt;br /&gt;
=== Diversion into Cognitive Science ===&lt;br /&gt;
&lt;br /&gt;
There has been a long debate in cognitive science between two rival theories of what it means to have a &amp;#039;&amp;#039;concept&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
* The feature theory: A concept is a set of semantic features&lt;br /&gt;
** Good for explaining similarities between concepts.&lt;br /&gt;
** It&amp;#039;s convenient -- a concept is a vector of feature activities.&lt;br /&gt;
* The structuralist theory: The meaning of a concept lies in its relationships to other concepts.&lt;br /&gt;
** So conceptual knowledge is best expressed as a relational graph.&lt;br /&gt;
** Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.&lt;br /&gt;
&lt;br /&gt;
Both sides are wrong.&lt;br /&gt;
&lt;br /&gt;
* These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.&lt;br /&gt;
* We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just &amp;quot;seeing&amp;quot; the answer with no conscious intervening steps.&lt;br /&gt;
&lt;br /&gt;
==== Localist and Distributed Representations of Concepts ====&lt;br /&gt;
&lt;br /&gt;
The obvious way to implement a relational graph in a neural net is to treat a neuron as a node in the graph and a connection as a binary relationship. But this &amp;quot;localist&amp;quot; method will not work:&lt;br /&gt;
&lt;br /&gt;
* We need many different types of relationship and the connections in a neural net do not have discrete labels.&lt;br /&gt;
* We need ternary relationships as well as binary ones, e.g. A is between B and C&lt;br /&gt;
&lt;br /&gt;
The right way to implement relational knowledge in a neural net is still an open issue.&lt;br /&gt;
&lt;br /&gt;
* But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a &amp;quot;distributed representation&amp;quot;. It&amp;#039;s a many-to-many mapping between concepts and neurons.&lt;br /&gt;
&lt;br /&gt;
=== The Softmax Output Function (Another Diversion) ===&lt;br /&gt;
&lt;br /&gt;
The squared error measure has some drawbacks. Is there a different cost function that works better?&lt;br /&gt;
&lt;br /&gt;
Yes, force the outputs to represent a probability distribution across discrete alternatives.&lt;br /&gt;
&lt;br /&gt;
==== Softmax ====&lt;br /&gt;
&lt;br /&gt;
The output units in a softmax group use a non-local non-linearity. Let $y_i$ be the output of a neuron in a softmax group, and $z_i$ (called the &amp;#039;&amp;#039;logit&amp;#039;&amp;#039;) the input of the same neuron:&lt;br /&gt;
&lt;br /&gt;
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j} }$$&lt;br /&gt;
&lt;br /&gt;
Note that $\sum y_i = 1$, so we force the $y_i$ to represent a probability distribution. We can show:&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$&lt;br /&gt;
&lt;br /&gt;
The right cost function to use with softmax is the negative log probability of the right answer. This is called the &amp;#039;&amp;#039;Cross Entropy&amp;#039;&amp;#039; cost function. ($t_j$ is the target value.)&lt;br /&gt;
&lt;br /&gt;
$$C = - \sum_{j} t_j \log y_j$$&lt;br /&gt;
&lt;br /&gt;
$C$ has a very big gradient when the target value is 1 and the output is almost zero.&lt;br /&gt;
* A value of 0.000001 is much better than 0.000000001 (one-in-a-million chance vs one-in-a-billion chance).&lt;br /&gt;
* The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$.&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial C}{\partial z_i} = \sum_{j} \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$&lt;br /&gt;
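&lt;br /&gt;
These formulas are easy to implement and check (a short sketch; the logits and target below are made up):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def softmax(z):
    # Subtract the largest logit first: a standard numerical-stability trick.
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, t):
    # C = -sum_j t_j * log(y_j)
    return -sum(tj * math.log(yj) for tj, yj in zip(t, y))

def grad_logits(y, t):
    # The clean result derived above: dC/dz_i = y_i - t_i.
    return [yi - ti for yi, ti in zip(y, t)]

z = [2.0, 1.0, -1.0]
t = [1.0, 0.0, 0.0]          # one-hot target
y = softmax(z)
g = grad_logits(y, t)
```
&lt;br /&gt;
Note that the gradient with respect to the logits comes out as simply $y_i - t_i$: the steep cost gradient and the flat logistic gradient cancel, as stated above.&lt;br /&gt;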
&lt;br /&gt;
&lt;br /&gt;
=== Neuro-Probabilistic Language Models ===&lt;br /&gt;
&lt;br /&gt;
* We cannot identify phonemes perfectly in noisy speech.&lt;br /&gt;
* People use their understanding of the meaning of the utterance to hear the right words.&lt;br /&gt;
** &amp;quot;We do this unconsciously when we wreck a nice beach&amp;quot;&lt;br /&gt;
** We are very good at it.&lt;br /&gt;
* This means speech recognizers have to know which words are likely to come next and which are not.&lt;br /&gt;
** Fortunately words can be predicted quite well without full understanding.&lt;br /&gt;
&lt;br /&gt;
==== The Standard &amp;quot;Trigram&amp;quot; Method ====&lt;br /&gt;
&lt;br /&gt;
* Take a huge amount of text and count the frequencies of all triples of words.&lt;br /&gt;
&lt;br /&gt;
* Use these frequencies to make bets on the relative probabilities of words given the previous two words.&lt;br /&gt;
&lt;br /&gt;
* Until very recently this was the state-of-the-art.&lt;br /&gt;
** We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.&lt;br /&gt;
** We have to &amp;quot;back-off&amp;quot; to bigrams when the count for a trigram is too small. The probability is not zero just because the count is zero!&lt;br /&gt;
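&lt;br /&gt;
The counting scheme behind a trigram model fits in a few lines (a sketch of the idea only; a real model would use proper smoothing and discounting rather than this crude backoff):&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter

def train_counts(tokens):
    # Count all triples, pairs, and single words in the corpus.
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens)
    return tri, bi, uni

def prob_next(w1, w2, w3, tri, bi, uni, total):
    # P(w3 given w1 w2) from trigram counts, backing off to the
    # bigram and then the unigram estimate when counts are zero.
    if tri[(w1, w2, w3)] > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        return bi[(w2, w3)] / uni[w2]
    return uni[w3] / total

tokens = "the cat sat on the mat the cat sat".split()
tri, bi, uni = train_counts(tokens)
p = prob_next("the", "cat", "sat", tri, bi, uni, len(tokens))
```
&lt;br /&gt;
The point is that the whole model is a big table of counts, with no notion of similarity between words.&lt;br /&gt;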
&lt;br /&gt;
* A trigram model doesn&amp;#039;t understand the similarities between words.&lt;br /&gt;
** The sentence &amp;quot;The cat got squashed in the garden on friday&amp;quot; should help us predict words in the sentence &amp;quot;The dog got flattened in the yard on monday&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.&lt;br /&gt;
** Using a feature representation also allows a context that contains many more previous words, e.g. 10.&lt;br /&gt;
&lt;br /&gt;
* Yoshua Bengio&amp;#039;s neural net for predicting the next word&lt;br /&gt;
&lt;br /&gt;
== Object Recognition ==&lt;br /&gt;
&lt;br /&gt;
=== Why Object Recognition is Difficult ===&lt;br /&gt;
&lt;br /&gt;
* Segmentation: Real scenes are cluttered with other objects.&lt;br /&gt;
* Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.&lt;br /&gt;
* Deformation: Objects can deform in a variety of non-affine ways.&lt;br /&gt;
* Affordances: Object classes are things defined by how they are used.&lt;br /&gt;
* Viewpoint: Changes in viewpoint cause changes in images that standard learning methods cannot cope with.&lt;br /&gt;
&lt;br /&gt;
=== Achieving Viewpoint Invariance ===&lt;br /&gt;
&lt;br /&gt;
* Viewpoint invariance is one of the main difficulties in making computers perceive.&lt;br /&gt;
* Several different approaches:&lt;br /&gt;
** Use redundant invariant features.&lt;br /&gt;
** Put a box around the object and use normalized pixels.&lt;br /&gt;
** Use replicated features with pooling. This is called &amp;#039;&amp;#039;convolutional neural nets&amp;#039;&amp;#039; (in later lectures).&lt;br /&gt;
** Use a hierarchy of parts that have explicit poses relative to the camera (in later lectures).&lt;br /&gt;
&lt;br /&gt;
==== Invariant Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* Extract a large, redundant set of features that are invariant under transformations.&lt;br /&gt;
** e.g. pair of roughly parallel lines with a red dot between them (this is what baby herring gulls use to know where to peck for food).&lt;br /&gt;
* With enough invariant features, there is only one way to assemble them into an object.&lt;br /&gt;
* But for recognition, we must avoid forming features from parts of different objects.&lt;br /&gt;
&lt;br /&gt;
==== Judicious Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* Put a box around the object and use it as a coordinate frame for a set of normalized pixels.&lt;br /&gt;
** This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.&lt;br /&gt;
** The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, ...&lt;br /&gt;
&lt;br /&gt;
* But choosing the box is difficult because of:&lt;br /&gt;
** Segmentation errors, occlusion, unusual orientations.&lt;br /&gt;
&lt;br /&gt;
* We need to recognize the shape to get the box right!&lt;br /&gt;
&lt;br /&gt;
==== Brute Force Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* When training the recognizer, use well-segmented, upright images to fit the correct box.&lt;br /&gt;
&lt;br /&gt;
* At test time try all possible boxes in a range of positions and scales.&lt;br /&gt;
&lt;br /&gt;
** This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.&lt;br /&gt;
** It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Digit Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== The Replicated Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* This is currently the dominant approach for neural networks.&lt;br /&gt;
&lt;br /&gt;
* Use many different copies of the same feature detector with different positions.&lt;br /&gt;
&lt;br /&gt;
** Could also replicate across scale and orientation but it is tricky and expensive.&lt;br /&gt;
** Replication greatly reduces the number of free parameters to be learned.&lt;br /&gt;
&lt;br /&gt;
* Use several different feature types, each with its own map of replicated detectors.&lt;br /&gt;
&lt;br /&gt;
==== Backpropagation with Weight Constraints ====&lt;br /&gt;
&lt;br /&gt;
* It&amp;#039;s easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.&lt;br /&gt;
&lt;br /&gt;
* We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints - so if the weights started off satisfying the constraints, they will continue to satisfy them.&lt;br /&gt;
&lt;br /&gt;
To constrain $w_1 = w_2$, we need $\Delta w_1 = \Delta w_2$.&lt;br /&gt;
&lt;br /&gt;
Compute $\partial E / \partial w_1$ and $\partial E / \partial w_2$, then use $\partial E / \partial w_1 + \partial E / \partial w_2$ for both $w_1$ and $w_2$.&lt;br /&gt;
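&lt;br /&gt;
In code, the constraint step is just a post-processing of the gradient vector (an illustrative helper, not from the lecture):&lt;br /&gt;
&lt;br /&gt;
```python
def tie_gradients(grads, groups):
    # grads: list of dE/dw values. groups: lists of indices whose weights
    # are constrained to be equal. Every weight in a group receives the
    # sum of the gradients over the group, so weights that start out
    # equal stay equal after any gradient-based update.
    out = list(grads)
    for group in groups:
        total = sum(grads[i] for i in group)
        for i in group:
            out[i] = total
    return out

# w1 and w2 (indices 0 and 1) are tied; both receive 0.1 + 0.3.
g = tie_gradients([0.1, 0.3, -0.2], groups=[[0, 1]])
```
&lt;br /&gt;
Using the average of the group instead of the sum merely rescales the effective learning rate for the tied weights.&lt;br /&gt;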
&lt;br /&gt;
==== What Does Replicating the Feature Detectors Achieve? ====&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Equivariant&amp;#039;&amp;#039;&amp;#039; activities: Replicated features do &amp;#039;&amp;#039;not&amp;#039;&amp;#039; make the neural activities invariant to translation. The activities are equivariant.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Invariant&amp;#039;&amp;#039;&amp;#039; knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.&lt;br /&gt;
&lt;br /&gt;
==== Pooling the Outputs of Replicated Feature Detectors ====&lt;br /&gt;
&lt;br /&gt;
* Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.&lt;br /&gt;
** This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.&lt;br /&gt;
** Taking the maximum of the four works slightly better.&lt;br /&gt;
&lt;br /&gt;
* Problem: After several levels of pooling, we have lost information about the precise positions of things.&lt;br /&gt;
** This makes it impossible to use the precise spatial relationships between high-level parts for recognition.&lt;br /&gt;
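&lt;br /&gt;
Pooling over non-overlapping 2x2 neighbourhoods can be sketched as follows (the feature-map values are made up):&lt;br /&gt;
&lt;br /&gt;
```python
def pool_2x2(fmap, mode="max"):
    # Pool non-overlapping 2x2 blocks of a feature map (a list of rows).
    # "max" keeps the strongest response; "avg" averages the four detectors.
    pooled = []
    for r in range(0, len(fmap) - 1, 2):
        row = []
        for c in range(0, len(fmap[r]) - 1, 2):
            block = [fmap[r][c], fmap[r][c + 1],
                     fmap[r + 1][c], fmap[r + 1][c + 1]]
            row.append(max(block) if mode == "max" else sum(block) / 4.0)
        pooled.append(row)
    return pooled

fm = [[1, 3, 0, 2],
      [4, 2, 1, 1],
      [0, 0, 5, 6],
      [0, 0, 7, 8]]
```
&lt;br /&gt;
Note how the 4x4 map shrinks to 2x2: the exact position of each strong response within its block is lost, which is exactly the information-loss problem noted above.&lt;br /&gt;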
&lt;br /&gt;
==== Le Net ====&lt;br /&gt;
&lt;br /&gt;
* Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:&lt;br /&gt;
** Many hidden layers.&lt;br /&gt;
** Many maps of replicated units in each layer.&lt;br /&gt;
** Pooling of the outputs of nearby replicated units.&lt;br /&gt;
** A wide net that can cope with several characters at once even if they overlap.&lt;br /&gt;
** A clever way of training a complete system, not just a recognizer.&lt;br /&gt;
&lt;br /&gt;
* This net was used for reading ~10% of the checks in North America.&lt;br /&gt;
&lt;br /&gt;
* Look at the impressive demos of LeNet at http://yann.lecun.com&lt;br /&gt;
&lt;br /&gt;
==== Priors and Prejudice ====&lt;br /&gt;
&lt;br /&gt;
* We can put our prior knowledge about the task into the network by designing appropriate:&lt;br /&gt;
** Connectivity&lt;br /&gt;
** Weight constraints&lt;br /&gt;
** Neuron activation functions&lt;br /&gt;
&lt;br /&gt;
* This is less intrusive than hand-designing the features&lt;br /&gt;
** But it still prejudices the network towards the particular way of solving the problem that we had in mind.&lt;br /&gt;
&lt;br /&gt;
* Alternatively, we can use our prior knowledge to create a whole lot more training data.&lt;br /&gt;
** This may require a lot of work (Hofman &amp;amp; Tresp, 1993). Real data + synthetic data.&lt;br /&gt;
** It may make learning take much longer.&lt;br /&gt;
** It allows optimization to discover clever ways of using the multi-layer network that we did not think of - and we may never fully understand how it does it.&lt;br /&gt;
&lt;br /&gt;
==== The Brute Force Approach ====&lt;br /&gt;
&lt;br /&gt;
* LeNet uses knowledge about the invariances to design:&lt;br /&gt;
** the local connectivity&lt;br /&gt;
** the weight-sharing&lt;br /&gt;
** the pooling&lt;br /&gt;
&lt;br /&gt;
* This achieves about 80 errors on the 10,000 MNIST test cases.&lt;br /&gt;
** This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008).&lt;br /&gt;
&lt;br /&gt;
* Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:&lt;br /&gt;
** For each training image, they produce many new training examples by applying many different transformations.&lt;br /&gt;
** They can then train a large, deep, dumb net on a GPU without much overfitting.&lt;br /&gt;
** They achieved about 35 errors.&lt;br /&gt;
&lt;br /&gt;
* With model averaging they can now get about 25 errors (which is about the same as human error rate).&lt;br /&gt;
&lt;br /&gt;
==== How to Detect a Significant Drop in the Error Rate ====&lt;br /&gt;
&lt;br /&gt;
* Is 30 errors in 10,000 test cases significantly better than 40 errors?&lt;br /&gt;
** It all depends on the particular errors.&lt;br /&gt;
** The McNemar test uses the particular errors and can be much more powerful than a test that just uses the number of errors.&lt;br /&gt;
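&lt;br /&gt;
A sketch of the test (my own illustration; it uses the normal approximation with a continuity correction rather than the exact binomial):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def mcnemar(errors_a, errors_b):
    # errors_a, errors_b: per-case booleans, True where that model got
    # the test case wrong. Only the disagreements carry information.
    n01 = sum(a and not b for a, b in zip(errors_a, errors_b))  # A wrong, B right
    n10 = sum(b and not a for a, b in zip(errors_a, errors_b))  # B wrong, A right
    n = n01 + n10
    if n == 0:
        return n01, n10, 1.0
    # Under the null hypothesis the disagreements split 50/50.
    z = (abs(n01 - n10) - 1) / math.sqrt(n)     # continuity-corrected
    p = min(1.0, math.erfc(z / math.sqrt(2)))   # two-sided tail probability
    return n01, n10, p

# 10,000 test cases: model A errs on cases 0..39, model B on cases 0..29,
# so all 10 disagreements favour B.
a_errors = [i in range(40) for i in range(10000)]
b_errors = [i in range(30) for i in range(10000)]
n01, n10, p = mcnemar(a_errors, b_errors)
```
&lt;br /&gt;
Here the 40-vs-30 difference looks significant because all ten disagreements favour the same model; if the two sets of errors were scattered independently, the same totals might not be, which is why the particular errors matter.&lt;br /&gt;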
&lt;br /&gt;
=== Convolutional Nets for Object Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== From Hand-Written Digits to 3D Objects ====&lt;br /&gt;
&lt;br /&gt;
* Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:&lt;br /&gt;
&lt;br /&gt;
** A hundred times as many classes (1000 vs 10)&lt;br /&gt;
** A hundred times as many pixels (256x256 color vs 28x28 gray)&lt;br /&gt;
** Two dimensional image of three-dimensional scene.&lt;br /&gt;
** Cluttered scenes requiring segmentation.&lt;br /&gt;
** Multiple objects in each image.&lt;br /&gt;
&lt;br /&gt;
* Will the same type of convolutional neural network work?&lt;br /&gt;
&lt;br /&gt;
==== The ILSVRC-2012 Competition on ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* The dataset has 1.2 million high-resolution training images.&lt;br /&gt;
&lt;br /&gt;
* The classification task:&lt;br /&gt;
** Get the &amp;quot;correct&amp;quot; class in your top 5 bets. There are 1000 classes.&lt;br /&gt;
&lt;br /&gt;
* The localization task:&lt;br /&gt;
** For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.&lt;br /&gt;
&lt;br /&gt;
* Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...&lt;br /&gt;
** Computer vision systems use complicated multi-stage systems.&lt;br /&gt;
** The early stages are typically hand-tuned by optimizing a few parameters.&lt;br /&gt;
&lt;br /&gt;
==== Error Rates on the ILSVRC-2012 Competition ====&lt;br /&gt;
&lt;br /&gt;
* University of Tokyo: 26.1% (classification)&lt;br /&gt;
* Oxford University Computer Vision Group: 26.9% (classification)&lt;br /&gt;
* INRIA (French national research institute in CS) + XRCE (Xerox Research Centre Europe): 27.0% (classification)&lt;br /&gt;
* University of Amsterdam: 29.5% (classification)&lt;br /&gt;
* University of Toronto (&amp;#039;&amp;#039;&amp;#039;Alex Krizhevsky&amp;#039;&amp;#039;&amp;#039;): &amp;#039;&amp;#039;&amp;#039;16.4%&amp;#039;&amp;#039;&amp;#039; (classification)&lt;br /&gt;
&lt;br /&gt;
==== A Neural Network for ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its &amp;#039;&amp;#039;architecture&amp;#039;&amp;#039; was:&lt;br /&gt;
** 7 hidden layers not counting some max pooling layers.&lt;br /&gt;
** The early layers were convolutional.&lt;br /&gt;
** The last two layers were globally connected.&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;activation functions&amp;#039;&amp;#039; were:&lt;br /&gt;
** Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.&lt;br /&gt;
** Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.&lt;br /&gt;
&lt;br /&gt;
==== Tricks that Significantly Improve Generalization ====&lt;br /&gt;
&lt;br /&gt;
* Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.&lt;br /&gt;
** At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.&lt;br /&gt;
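&lt;br /&gt;
The ten test-time patches can be enumerated explicitly; a Python sketch (coordinates assume the 256 and 224 sizes above, returned here as (top, left, mirrored) tuples):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the ten test-time patches: the four 224x224 corner crops of
# the 256x256 image, the central crop, and the left-right reflections of
# those five.
def ten_crop_boxes(size=256, crop=224):
    m = (size - crop) // 2
    corners = [(0, 0), (0, size - crop), (size - crop, 0),
               (size - crop, size - crop), (m, m)]
    return [(top, left, mirror)
            for mirror in (False, True)
            for top, left in corners]
```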
&lt;br /&gt;
* Use &amp;quot;dropout&amp;quot; to regularize the weights in the globally connected layers (which contain most of the parameters).&lt;br /&gt;
** Dropout means that half of the hidden units in a layer are randomly removed for each training example.&lt;br /&gt;
** This stops hidden units from relying too much on other hidden units.&lt;br /&gt;
&lt;br /&gt;
==== The Hardware Required for Alex Krizhevsky&amp;#039;s Net ====&lt;br /&gt;
&lt;br /&gt;
* He uses a very efficient implementation of convolutional nets on two Nvidia GTX 580 Graphics Processor Units (over 1000 fast little cores)&lt;br /&gt;
** GPUs are very good for matrix-matrix multiplies.&lt;br /&gt;
** GPUs have very high bandwidth to memory.&lt;br /&gt;
** This allows him to train the network in a week.&lt;br /&gt;
** It also makes it quick to combine results from 10 patches at test time.&lt;br /&gt;
&lt;br /&gt;
* We can spread a network over many cores if we can communicate the states fast enough.&lt;br /&gt;
&lt;br /&gt;
* As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.&lt;br /&gt;
&lt;br /&gt;
==== Finding Roads in High-Resolution Images ====&lt;br /&gt;
&lt;br /&gt;
* Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images.&lt;br /&gt;
** It takes a large image patch and predicts a binary road label for the central 16x16 pixels.&lt;br /&gt;
** There is lots of labeled training data available for this task.&lt;br /&gt;
&lt;br /&gt;
* The task is hard for many reasons:&lt;br /&gt;
** Occlusion by buildings, trees, and cars.&lt;br /&gt;
** Shadows and lighting changes.&lt;br /&gt;
** Minor viewpoint changes.&lt;br /&gt;
&lt;br /&gt;
* The worst problems are incorrect labels:&lt;br /&gt;
** Badly registered maps&lt;br /&gt;
** Arbitrary decisions about what counts as a road (vs laneway etc.)&lt;br /&gt;
&lt;br /&gt;
== Mini-Batch Gradient Descent ==&lt;br /&gt;
&lt;br /&gt;
=== Overview ===&lt;br /&gt;
&lt;br /&gt;
* The Error Surface: Lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl where vertical cross-sections are parabolas and horizontal cross-sections are ellipses.&lt;br /&gt;
&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated. But locally, a piece of a quadratic bowl is usually a very good approximation.&lt;br /&gt;
&lt;br /&gt;
* Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.&lt;br /&gt;
** The gradient is big in the direction in which we only want to travel a small distance, small in the direction in which we want to travel a large distance.&lt;br /&gt;
&lt;br /&gt;
* If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges.&lt;br /&gt;
&lt;br /&gt;
* What we would like to achieve: move quickly in directions with small but consistent gradients, and move slowly in directions with big but inconsistent gradients.&lt;br /&gt;
&lt;br /&gt;
==== Stochastic Gradient Descent ====&lt;br /&gt;
&lt;br /&gt;
* If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.&lt;br /&gt;
** So, instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.&lt;br /&gt;
** The extreme version of this approach updates weights after each case. It&amp;#039;s called &amp;quot;online&amp;quot; learning.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches are usually better than online.&lt;br /&gt;
** Less computation is used updating the weights.&lt;br /&gt;
** Computing the gradient for many cases simultaneously uses matrix-matrix multiplies which are very efficient, especially on GPUs.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches need to be balanced for classes (one way to approximate this is to pick points randomly).&lt;br /&gt;
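&lt;br /&gt;
The scheme above can be sketched in Python for a single linear neuron with squared error (the data, learning rate, and batch size are invented for illustration):&lt;br /&gt;
&lt;br /&gt;
```python
import random

# Minimal sketch of mini-batch gradient descent for a linear neuron
# y = w*x + b with squared error. Shuffling before slicing into batches
# roughly balances the batches.
def minibatch_sgd(data, w, b, lr=0.1, batch_size=2, epochs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            gw = sum(2 * (w * x + b - t) * x for x, t in batch) / len(batch)
            gb = sum(2 * (w * x + b - t) for x, t in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

# Fit t = 3x + 1 from a few noise-free samples:
data = [(x, 3 * x + 1) for x in (-2, -1, 0, 1, 2)]
w, b = minibatch_sgd(data, w=0.0, b=0.0)
```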
&lt;br /&gt;
==== Two types of learning algorithm ====&lt;br /&gt;
&lt;br /&gt;
* If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).&lt;br /&gt;
&lt;br /&gt;
* For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.&lt;br /&gt;
&lt;br /&gt;
==== Basic Mini-Batch Gradient Descent Algorithm ====&lt;br /&gt;
&lt;br /&gt;
* Guess an initial learning rate.&lt;br /&gt;
** If the error keeps getting worse or oscillates wildly, reduce the learning rate.&lt;br /&gt;
** If the error is falling fairly consistently but slowly, increase the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Write a simple program to automate this way of adjusting the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Toward the end of mini-batch learning, it nearly always helps to turn down the learning rate.&lt;br /&gt;
** This removes fluctuations in the final weights caused by the variations between mini-batches.&lt;br /&gt;
&lt;br /&gt;
* Turn down the learning rate when the error stops decreasing.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Note&amp;#039;&amp;#039;&amp;#039;: Compute the error in a separate validation set.&lt;br /&gt;
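&lt;br /&gt;
A sketch of such a program in Python (the thresholds and factors here are assumptions chosen for illustration):&lt;br /&gt;
&lt;br /&gt;
```python
# Simple controller for the learning rate: cut it when the validation
# error gets worse, raise it when the error falls only very slowly.
def adjust_learning_rate(lr, recent_errors):
    """recent_errors: validation-set errors, oldest to newest."""
    last, prev = recent_errors[-1], recent_errors[-2]
    if last > prev:            # error got worse: oscillating or diverging
        return lr * 0.5
    if last > 0.999 * prev:    # falling, but only very slowly
        return lr * 1.1
    return lr                  # falling steadily: leave the rate alone
```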
&lt;br /&gt;
=== Tricks for Mini-Batch Gradient Descent ===&lt;br /&gt;
&lt;br /&gt;
==== Initializing The Weights ====&lt;br /&gt;
&lt;br /&gt;
* If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.&lt;br /&gt;
** So they can never learn to be different features.&lt;br /&gt;
** We break symmetry by initializing the weights to have small random values.&lt;br /&gt;
&lt;br /&gt;
* If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.&lt;br /&gt;
** We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to $1/\sqrt{\text{fan-in}}$.&lt;br /&gt;
&lt;br /&gt;
* We can also scale the learning rate the same way.&lt;br /&gt;
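&lt;br /&gt;
A Python sketch of fan-in-scaled initialization (the Gaussian and the exact constant are assumptions; scaling the spread by one over the square root of the fan-in is the usual form of this heuristic):&lt;br /&gt;
&lt;br /&gt;
```python
import random

# Small random weights whose spread shrinks as the fan-in grows, so that
# the total input to a unit does not scale up with the number of inputs.
def init_weights(fan_in, fan_out, scale=1.0, seed=0):
    rng = random.Random(seed)
    sd = scale / fan_in ** 0.5
    return [[rng.gauss(0.0, sd) for _ in range(fan_in)]
            for _ in range(fan_out)]
```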
&lt;br /&gt;
==== Shifting the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, shifting the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.&lt;br /&gt;
&lt;br /&gt;
* The hyperbolic tangent (which is $2\times \text{logistic} - 1$) produces hidden activations that are roughly zero mean. &lt;br /&gt;
** In this respect it&amp;#039;s better than the logistic.&lt;br /&gt;
&lt;br /&gt;
==== Scaling the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, scaling the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.&lt;br /&gt;
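&lt;br /&gt;
Both transformations together, as a Python sketch (zero mean and unit variance per input component over the training set):&lt;br /&gt;
&lt;br /&gt;
```python
# Shift each input component to zero mean, then scale it to unit
# variance, computed over the whole training set.
def standardize_columns(data):
    n = len(data)
    dims = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dims)]
    sds = [(sum((row[j] - means[j]) ** 2 for row in data) / n) ** 0.5
           for j in range(dims)]
    return [[(row[j] - means[j]) / sds[j] for j in range(dims)]
            for row in data]
```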
&lt;br /&gt;
==== A More Thorough Method: Decorrelate the Input Components ====&lt;br /&gt;
&lt;br /&gt;
* For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.&lt;br /&gt;
&lt;br /&gt;
* There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.&lt;br /&gt;
** Drop the principal components with the smallest eigenvalues ― &amp;#039;&amp;#039;&amp;#039;This achieves some dimensionality reduction&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.&lt;br /&gt;
&lt;br /&gt;
* For a circular error surface, the gradient points straight towards the minimum.&lt;br /&gt;
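&lt;br /&gt;
A Python sketch for two input components, where the 2x2 covariance matrix has a closed-form eigendecomposition (dropping small components is omitted here; rotate onto the principal axes, then divide by the square roots of the eigenvalues):&lt;br /&gt;
&lt;br /&gt;
```python
import math

# PCA whitening for 2-D inputs: after this transform, the components are
# decorrelated and each has unit variance.
def whiten_2d(data):
    n = len(data)
    mx = sum(x for x, y in data) / n
    my = sum(y for x, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    cxx = sum(x * x for x, y in centered) / n
    cyy = sum(y * y for x, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)  # angle of principal axis
    c, s = math.cos(theta), math.sin(theta)
    e1 = cxx * c * c + 2 * cxy * c * s + cyy * s * s  # variance on axis 1
    e2 = cxx * s * s - 2 * cxy * c * s + cyy * c * c  # variance on axis 2
    return [((c * x + s * y) / math.sqrt(e1),
             (c * y - s * x) / math.sqrt(e2))
            for x, y in centered]
```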
&lt;br /&gt;
&lt;br /&gt;
==== Common Problems That Occur in Multilayer Networks ====&lt;br /&gt;
&lt;br /&gt;
* If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.&lt;br /&gt;
** The error derivatives for the hidden units will all become tiny and the error will not decrease.&lt;br /&gt;
** This is usually a plateau, but people often mistake it for a local minimum.&lt;br /&gt;
&lt;br /&gt;
* In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.&lt;br /&gt;
** The network finds this strategy quickly but may then take a long time to improve on it by making use of the input.&lt;br /&gt;
** This is another plateau that looks like a local minimum.&lt;br /&gt;
&lt;br /&gt;
==== Be Careful about Turning Down the Learning Rate ====&lt;br /&gt;
&lt;br /&gt;
* Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.&lt;br /&gt;
** So we get a quick win&lt;br /&gt;
** But then we get slower learning.&lt;br /&gt;
&lt;br /&gt;
* Don&amp;#039;t turn down the learning rate too soon!&lt;br /&gt;
&lt;br /&gt;
==== Four Ways to Speed up Mini-Batch Learning ====&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;momentum&amp;quot;&lt;br /&gt;
** Instead of using the gradient to change the &amp;#039;&amp;#039;position&amp;#039;&amp;#039; of the weight &amp;quot;particle&amp;quot;, use it to change the &amp;#039;&amp;#039;velocity&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
* Use separate adaptive learning rates for each parameter.&lt;br /&gt;
** Slowly adjust the rate using the consistency of the gradient for that parameter.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;rmsprop&amp;#039;&amp;#039;&amp;#039;: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. &lt;br /&gt;
** This is the mini-batch version of just using the sign of the gradient.&lt;br /&gt;
&lt;br /&gt;
* Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)&lt;br /&gt;
** Adapt it to work for neural nets.&lt;br /&gt;
** Adapt it to work for mini-batches.&lt;br /&gt;
&lt;br /&gt;
=== The Momentum Method ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.&lt;br /&gt;
# The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.&lt;br /&gt;
# Its momentum makes it keep going in the previous direction.&lt;br /&gt;
&lt;br /&gt;
* It damps oscillations in directions of high curvature by combining gradients with opposite signs.&lt;br /&gt;
&lt;br /&gt;
* It builds up speed in directions with a gentle but consistent gradient.&lt;br /&gt;
&lt;br /&gt;
==== Equations ====&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(t) = \alpha \mathbb{v}(t-1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The effect of the gradient is to increment the previous velocity. The velocity also decays by $\alpha$, which is slightly less than 1.&lt;br /&gt;
&lt;br /&gt;
The weight change is equal to the current velocity.&lt;br /&gt;
&lt;br /&gt;
$$\Delta \mathbb{w}(t) = \mathbb{v} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \mathbb{v} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \Delta \mathbb{w} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The weight change can be expressed in terms of the previous weight change and the current gradient.&lt;br /&gt;
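&lt;br /&gt;
The equations above can be sketched directly in Python (the constant gradient below imitates a tilted plane, so the velocity approaches its terminal value):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the momentum update: the gradient increments a decaying
# velocity, and the weight change equals the current velocity.
def momentum_step(w, v, grad, alpha=0.9, eps=0.1):
    v = alpha * v - eps * grad    # v(t) = alpha*v(t-1) - eps*dE/dw(t)
    return w + v, v               # delta w(t) = v(t)

# On a tilted plane (constant gradient) the velocity approaches the
# terminal value -eps * grad / (1 - alpha):
w, v = 0.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, grad=1.0)
```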
&lt;br /&gt;
==== Behavior ====&lt;br /&gt;
&lt;br /&gt;
* If the error surface is a tilted plane, the ball reaches a terminal velocity $\mathbb{v}(\infty)$.&lt;br /&gt;
** If the momentum is close to 1, this is much faster than simple gradient descent.&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(\infty) = \frac{1}{1 - \alpha} \Big( - \epsilon \frac{\partial E}{\partial \mathbb{w}} \Big)$$&lt;br /&gt;
&lt;br /&gt;
* At the beginning of learning there may be very large gradients.&lt;br /&gt;
** So it pays to use a small momentum (e.g. 0.5).&lt;br /&gt;
** Once the large gradients have disappeared and the weights are stuck in a ravine the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99)&lt;br /&gt;
&lt;br /&gt;
* This allows us to learn at a rate that would cause divergent oscillations without the momentum.&lt;br /&gt;
&lt;br /&gt;
==== Better Type of Momentum (Nesterov 1983) ====&lt;br /&gt;
&lt;br /&gt;
* The standard momentum method &amp;#039;&amp;#039;first&amp;#039;&amp;#039; computes the gradient at the current location and &amp;#039;&amp;#039;then&amp;#039;&amp;#039; takes a big jump in the direction of the updated accumulated gradient.&lt;br /&gt;
* Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.&lt;br /&gt;
** Inspired by the Nesterov method for optimizing convex functions.&lt;br /&gt;
* &amp;#039;&amp;#039;First&amp;#039;&amp;#039; make a big jump in the direction of the previous accumulated gradient.&lt;br /&gt;
* &amp;#039;&amp;#039;Then&amp;#039;&amp;#039; measure the gradient where you end up and make a correction.&lt;br /&gt;
** It&amp;#039;s better to correct a mistake &amp;#039;&amp;#039;after&amp;#039;&amp;#039; you have made it!&lt;br /&gt;
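&lt;br /&gt;
A Python sketch of this jump-then-correct update (the quadratic example is invented for illustration):&lt;br /&gt;
&lt;br /&gt;
```python
# Nesterov-style momentum: first make the big jump in the direction of
# the accumulated velocity, then measure the gradient where you land and
# make the correction there.
def nesterov_step(w, v, grad_fn, alpha=0.9, eps=0.1):
    lookahead = w + alpha * v                  # big jump first
    v = alpha * v - eps * grad_fn(lookahead)   # then correct
    return w + v, v

# Minimizing E(w) = w^2 (gradient 2w) from w = 5:
w, v = 5.0, 0.0
for _ in range(500):
    w, v = nesterov_step(w, v, lambda u: 2 * u)
```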
&lt;br /&gt;
=== Adaptive Learning Rates for Each Connection ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
* In a multilayer net, the appropriate learning rates can vary widely between weights:&lt;br /&gt;
** The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.&lt;br /&gt;
** The fan-in of a unit determines the size of the &amp;quot;overshoot&amp;quot; effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.&lt;br /&gt;
&lt;br /&gt;
* So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.&lt;br /&gt;
&lt;br /&gt;
==== One Way to Determine the Individual Learning Rates ====&lt;br /&gt;
&lt;br /&gt;
$$\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$$&lt;br /&gt;
&lt;br /&gt;
* Start with a local gain of 1 for every weight. i.e. $g_{ij}(0) = 1$&lt;br /&gt;
* Increase the local gain if the gradient for that weight does not change sign.&lt;br /&gt;
* Use small additive increases and multiplicative decreases.&lt;br /&gt;
** This ensures that big gains decay rapidly when oscillations start.&lt;br /&gt;
** If the gradient is totally random, the gain will hover around 1 when we increase by &amp;#039;&amp;#039;plus&amp;#039;&amp;#039; $\delta$ half the time and decrease by &amp;#039;&amp;#039;times&amp;#039;&amp;#039; $1 - \delta$ half the time.&lt;br /&gt;
&lt;br /&gt;
$$\text{if } \Big( \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) \Big) &amp;gt; 0$$&lt;br /&gt;
$$\text{then } g_{ij}(t) = g_{ij}(t - 1) + 0.05$$&lt;br /&gt;
$$\text{else } g_{ij}(t) = g_{ij}(t - 1) * 0.95$$&lt;br /&gt;
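&lt;br /&gt;
The rule above as a Python sketch (the clamp to the range [0.1, 10] follows the range suggested in the next section):&lt;br /&gt;
&lt;br /&gt;
```python
# Per-weight gain update: additive increase while the gradient keeps its
# sign, multiplicative decrease when it flips.
def update_gain(gain, grad_now, grad_prev, delta=0.05):
    if grad_now * grad_prev > 0:
        gain = gain + delta          # small additive increase
    else:
        gain = gain * (1 - delta)    # multiplicative decrease
    return max(0.1, min(gain, 10.0)) # keep gains in a reasonable range

# The weight step is then -epsilon * gain * dE/dw for each weight.
```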
&lt;br /&gt;
==== Tricks for Making Adaptive Learning Rates Work Better ====&lt;br /&gt;
&lt;br /&gt;
* Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].&lt;br /&gt;
&lt;br /&gt;
* Use full batch learning or very big mini-batches.&lt;br /&gt;
** This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini batch.&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates can be combined with momentum.&lt;br /&gt;
** Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates only deal with axis-aligned effects.&lt;br /&gt;
** Momentum does not care about the alignment of the axes.&lt;br /&gt;
&lt;br /&gt;
=== RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude ===&lt;br /&gt;
&lt;br /&gt;
==== RPROP: Using only the sign of the gradient ====&lt;br /&gt;
&lt;br /&gt;
* The magnitude of the gradient can be very different for different weights and can change during learning.&lt;br /&gt;
** This makes it hard to choose a single global learning rate.&lt;br /&gt;
&lt;br /&gt;
* For &amp;#039;&amp;#039;full batch learning&amp;#039;&amp;#039;, we can deal with this variation by only using the sign of the gradient.&lt;br /&gt;
** The weight updates are all of the same magnitude.&lt;br /&gt;
** This escapes from plateaus with tiny gradients quickly.&lt;br /&gt;
&lt;br /&gt;
* RPROP: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.&lt;br /&gt;
** Increase the step size for a weight &amp;#039;&amp;#039;multiplicatively&amp;#039;&amp;#039; (e.g. times 1.2) if the signs of its last two gradients agree.&lt;br /&gt;
** Otherwise decrease the step size multiplicatively (e.g. times 0.5).&lt;br /&gt;
** Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster&amp;#039;s advice).&lt;br /&gt;
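&lt;br /&gt;
A Python sketch of RPROP for full-batch learning (the initial step size is an assumption; the growth and shrink factors and the step-size limits follow the advice above):&lt;br /&gt;
&lt;br /&gt;
```python
# RPROP: adapt a step size per weight using only the sign of the
# gradient, then move opposite to the current gradient sign.
def rprop_step(w, step, grad, prev_grad):
    prod = grad * prev_grad
    if prod > 0:                   # last two gradients agree in sign
        step = min(step * 1.2, 50.0)
    elif prod != 0:                # signs disagree
        step = max(step * 0.5, 1e-6)
    if grad > 0:                   # move opposite to the gradient sign
        w = w - step
    elif grad != 0:
        w = w + step
    return w, step

# Full-batch minimization of E(w) = w^2 (gradient 2w):
w, step, prev = 5.0, 0.1, 0.0
for _ in range(300):
    grad = 2 * w
    w, step = rprop_step(w, step, grad, prev)
    prev = grad
```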
&lt;br /&gt;
==== Why RPROP Does Not Work with Mini-Batches ====&lt;br /&gt;
&lt;br /&gt;
* The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.&lt;br /&gt;
** Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.&lt;br /&gt;
** We want this weight to stay roughly where it is.&lt;br /&gt;
&lt;br /&gt;
* RPROP would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale), so the weight would grow a lot.&lt;br /&gt;
* Is there a way to combine:&lt;br /&gt;
** The robustness of RPROP.&lt;br /&gt;
** The efficiency of mini-batches.&lt;br /&gt;
** The effective averaging of gradients over mini-batches.&lt;br /&gt;
&lt;br /&gt;
==== RMSPROP: Mini-Batch Version of RPROP ====&lt;br /&gt;
&lt;br /&gt;
* RPROP is equivalent to using the gradient but also dividing by the size of the gradient.&lt;br /&gt;
** The problem with mini-batch RPROP is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?&lt;br /&gt;
&lt;br /&gt;
* RMSPROP: Keep a moving average of the squared gradient for each weight.&lt;br /&gt;
$$MeanSquare(w,t) = 0.9 \cdot MeanSquare(w, t-1) + 0.1 \Big( \frac{\partial E}{\partial w}(t) \Big) ^2$$&lt;br /&gt;
&lt;br /&gt;
* Dividing the gradient by $\sqrt{MeanSquare(w,t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).&lt;br /&gt;
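&lt;br /&gt;
A Python sketch following the moving-average formula above (the learning rate and the small constant added for numerical safety are assumptions):&lt;br /&gt;
&lt;br /&gt;
```python
# RMSPROP: keep a moving average of the squared gradient for each
# weight, and divide the gradient by its square root.
def rmsprop_step(w, mean_square, grad, lr=0.01, decay=0.9):
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    w = w - lr * grad / (mean_square ** 0.5 + 1e-8)
    return w, mean_square

# Minimizing E(w) = w^2 (gradient 2w) from w = 3:
w, ms = 3.0, 0.0
for _ in range(2000):
    w, ms = rmsprop_step(w, ms, 2 * w)
```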
&lt;br /&gt;
==== Further Developments of RMSPROP ====&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with standard momentum&lt;br /&gt;
** Momentum does not help as much as it normally does. Needs more investigation.&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with Nesterov momentum (Sutskever 2012)&lt;br /&gt;
** It works best if the RMS of the recent gradients is used to divide the correction rather than the jump in the direction of accumulated corrections.&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with adaptive learning rates for each connection&lt;br /&gt;
** Needs more investigation.&lt;br /&gt;
&lt;br /&gt;
* Other methods related to RMSPROP&lt;br /&gt;
** Yann LeCun&amp;#039;s group has a fancy version in &amp;quot;No more pesky learning rates&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Summary of Learning Methods for Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.&lt;br /&gt;
** Conjugate gradient, LBFGS, ...&lt;br /&gt;
** adaptive learning rates, RPROP, ...&lt;br /&gt;
&lt;br /&gt;
* For big, redundant datasets, use mini-batches.&lt;br /&gt;
** Try gradient descent with momentum.&lt;br /&gt;
** Try RMSPROP (with momentum?)&lt;br /&gt;
** Try LeCun&amp;#039;s latest recipe.&lt;br /&gt;
&lt;br /&gt;
* Why there is no simple recipe:&lt;br /&gt;
*# &amp;#039;&amp;#039;&amp;#039;Neural nets differ a lot&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
*#* Very deep nets (especially ones with narrow bottlenecks).&lt;br /&gt;
*#* Recurrent nets.&lt;br /&gt;
*#* Wide shallow nets.&lt;br /&gt;
*# &amp;#039;&amp;#039;&amp;#039;Tasks differ a lot&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
*#* Some require very accurate weights, some don&amp;#039;t.&lt;br /&gt;
*#* Some have many very rare cases (e.g. words).&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=476</id>
		<title>Neural Networks (Geoffrey Hinton Course)</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=476"/>
		<updated>2017-01-15T04:07:05Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Summary of Learning Methods for Neural Networks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Some Simple Models of Neurons ==&lt;br /&gt;
&lt;br /&gt;
$y$ output, $x_i$ input.&lt;br /&gt;
&lt;br /&gt;
=== Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$y = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$w_i$ weights, $b$ bias&lt;br /&gt;
&lt;br /&gt;
=== Binary Threshold Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq \theta$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
Or, equivalently,&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq 0$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
=== Rectified Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = z$ if $z &amp;gt; 0$, $0$ otherwise. (Linear above zero; the decision is at zero.)&lt;br /&gt;
&lt;br /&gt;
=== Sigmoid Neurons ===&lt;br /&gt;
&lt;br /&gt;
Give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
=== Stochastic Binary Neurons ===&lt;br /&gt;
&lt;br /&gt;
Same equations as logistic units, but outputs $1$ (=spike) or $0$ randomly based on the probability. They treat the output of the logistic as the &amp;lt;em&amp;gt;probability&amp;lt;/em&amp;gt; of producing a spike in a short time window.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$P(s = 1) = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.&lt;br /&gt;
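&lt;br /&gt;
These neuron models can be written as plain Python functions (a sketch; $z$ is the total input in every case):&lt;br /&gt;
&lt;br /&gt;
```python
import math

# Total input to a unit: z = b + sum_i x_i w_i
def total_input(x, w, b):
    return b + sum(xi * wi for xi, wi in zip(x, w))

def binary_threshold(z):
    return 1 if z >= 0 else 0

def rectified_linear(z):
    return z if z > 0 else 0

def sigmoid(z):
    return 1 / (1 + math.exp(-z))
```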
&lt;br /&gt;
== Types of Learning ==&lt;br /&gt;
&lt;br /&gt;
=== Supervised Learning ===&lt;br /&gt;
Learn to predict an output when given an input vector.&lt;br /&gt;
* Regression: The target output is a real number or a whole vector of real numbers.&lt;br /&gt;
* Classification: The target output is a class label.&lt;br /&gt;
&lt;br /&gt;
==== How Supervised Learning Typically Works ====&lt;br /&gt;
# Start by choosing a model-class: $y = f(x;W)$&lt;br /&gt;
#* A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.&lt;br /&gt;
# Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.&lt;br /&gt;
#* For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.&lt;br /&gt;
#* For classification there are other measures that are generally more sensible (they also work better).&lt;br /&gt;
&lt;br /&gt;
=== Reinforcement Learning ===&lt;br /&gt;
Learn to select an action to maximize payoff.&lt;br /&gt;
&lt;br /&gt;
* The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.&lt;br /&gt;
** The goal in selecting each action is to maximize the expected sum of the future rewards.&lt;br /&gt;
** We usually use a discount factor for delayed rewards so that we don&amp;#039;t have to look too far into the future.&lt;br /&gt;
* Reinforcement learning is difficult because:&lt;br /&gt;
** The rewards are typically delayed so it&amp;#039;s hard to know where we went wrong (or right).&lt;br /&gt;
** A scalar reward does not supply much information.&lt;br /&gt;
* Typically you can&amp;#039;t learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning). Typically you learn dozens or thousands of parameters.&lt;br /&gt;
* Will not be covered in this course.&lt;br /&gt;
&lt;br /&gt;
=== Unsupervised Learning ===&lt;br /&gt;
Discover a good internal representation of the input.&lt;br /&gt;
&lt;br /&gt;
* For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).&lt;br /&gt;
* It is hard to say what the aim of unsupervised learning is:&lt;br /&gt;
** One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.&lt;br /&gt;
** You can compute the distance to a surface by using the disparity between two images. But you don&amp;#039;t want to learn to compute disparities by stubbing your toe thousands of times.&lt;br /&gt;
* Other goals:&lt;br /&gt;
** Providing a compact, low-dimensional representation of the input.&lt;br /&gt;
*** High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds)&lt;br /&gt;
*** Principal Component Analysis is a widely used linear method for finding a low-dimensional representation.&lt;br /&gt;
** Providing an economical high-dimensional representation of the input in terms of learned features.&lt;br /&gt;
*** Binary features&lt;br /&gt;
*** Real-valued features that are nearly all zero&lt;br /&gt;
** Finding sensible clusters in the input&lt;br /&gt;
*** This is an example of a &amp;lt;em&amp;gt;very&amp;lt;/em&amp;gt; sparse code in which only one of the features is non-zero.&lt;br /&gt;
&lt;br /&gt;
== Neural Network Architectures ==&lt;br /&gt;
&lt;br /&gt;
# Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.&lt;br /&gt;
# Recurrent neural network: information can flow around in cycles.&lt;br /&gt;
# Symmetrically connected network: weights are the same in both directions between two units.&lt;br /&gt;
&lt;br /&gt;
===  Feed-forward Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* The commonest type of neural network&lt;br /&gt;
* The first layer is the input and the last layer is the output&lt;br /&gt;
* Called &amp;quot;deep&amp;quot; neural networks if there is more than one hidden layer.&lt;br /&gt;
* They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.&lt;br /&gt;
&lt;br /&gt;
=== Recurrent Networks ===&lt;br /&gt;
&lt;br /&gt;
* Have directed cycles in their connection graph.&lt;br /&gt;
* Have complicated dynamics — difficult to train.&lt;br /&gt;
* More biologically realistic.&lt;br /&gt;
* Very natural way to model &amp;lt;em&amp;gt;sequential&amp;lt;/em&amp;gt; data&lt;br /&gt;
* Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).&lt;br /&gt;
&lt;br /&gt;
=== Symmetrically Connected Networks ===&lt;br /&gt;
&lt;br /&gt;
* Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).&lt;br /&gt;
* Much easier to analyze than recurrent networks (John Hopfield, et al)&lt;br /&gt;
* More restricted in what they can do, because they obey an energy function. For example, they cannot model cycles.&lt;br /&gt;
&lt;br /&gt;
== The First Generation of Neural Networks ==&lt;br /&gt;
&lt;br /&gt;
=== Standard Paradigm for Statistical Pattern Recognition ===&lt;br /&gt;
&lt;br /&gt;
* Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).&lt;br /&gt;
* Learn how to weight each of the feature activations to get a single scalar quantity.&lt;br /&gt;
* If this quantity is above some threshold, decide that the input vector is a positive example of the target class.&lt;br /&gt;
&lt;br /&gt;
=== Perceptrons ===&lt;br /&gt;
&lt;br /&gt;
input units =&amp;gt; (by hand-coded programs) =&amp;gt; feature units =&amp;gt; decision unit&lt;br /&gt;
&lt;br /&gt;
* Popularized by Frank Rosenblatt in the early 1960s.&lt;br /&gt;
* Lots of grand claims were made for what they could learn to do.&lt;br /&gt;
* In 1969, Minsky and Papert showed Perceptrons&amp;#039; limitations (&amp;#039;&amp;#039;&amp;#039;Group Invariance Theorem&amp;#039;&amp;#039;&amp;#039;). Many people thought these limitations applied to all neural network models.&lt;br /&gt;
* Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.&lt;br /&gt;
* Decision units in perceptrons are &amp;#039;&amp;#039;&amp;#039;binary threshold neurons&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Perceptron Learning Procedure ====&lt;br /&gt;
&lt;br /&gt;
# Add an extra component with value 1 to each input vector (&amp;quot;bias&amp;quot; weight) and forget about the threshold. (bias) = -(threshold)&lt;br /&gt;
# Pick training cases using any policy that ensures that every training case will keep getting picked.&lt;br /&gt;
#* If the output unit is correct, leave its weights alone.&lt;br /&gt;
#* If the output unit incorrectly outputs a zero, add the input vector to the weight vector.&lt;br /&gt;
#* If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.&lt;br /&gt;
# This is guaranteed to find a set of weights that gets the right answer for all the training cases &amp;#039;&amp;#039;if any such set exists&amp;#039;&amp;#039;.&lt;br /&gt;
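&lt;br /&gt;
The procedure above can be sketched in Python (a minimal illustration, not from the lecture; the AND task, learning loop, and shuffling policy are invented for the example):&lt;br /&gt;

```python
import random

def train_perceptron(cases, epochs=100, seed=0):
    """Perceptron learning procedure for a binary threshold unit.

    Each input vector gets an extra component of 1 so the bias
    replaces the threshold: predict 1 iff w . x >= 0.
    """
    rng = random.Random(seed)
    dim = len(cases[0][0]) + 1          # +1 for the bias component
    w = [0.0] * dim
    for _ in range(epochs):
        # Shuffling each epoch ensures every case keeps getting picked.
        for x, t in sorted(cases, key=lambda _: rng.random()):
            xb = list(x) + [1.0]        # append the bias input
            y = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0
            if y == 0 and t == 1:       # wrongly output 0: add the input
                w = [wi + xi for wi, xi in zip(w, xb)]
            elif y == 1 and t == 0:     # wrongly output 1: subtract the input
                w = [wi - xi for wi, xi in zip(w, xb)]
    return w

# AND is linearly separable, so the procedure is guaranteed to converge:
and_cases = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_cases)
```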
&lt;br /&gt;
==== Geometrical View of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* Weight space (each point corresponds to a particular setting for all the weights)&lt;br /&gt;
* Each training case represents a hyperplane through the origin (if we eliminate the threshold/bias). That hyperplane is perpendicular to the &amp;#039;&amp;#039;input&amp;#039;&amp;#039; vector.&lt;br /&gt;
** The weights must lie on one side of this hyperplane to get the answer right.&lt;br /&gt;
* To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).&lt;br /&gt;
* If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. This means the problem is &amp;#039;&amp;#039;&amp;#039;convex&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Limitations of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.&lt;br /&gt;
* But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.&lt;br /&gt;
** A binary threshold output unit cannot even tell if two single bit features are the same!&lt;br /&gt;
*** Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$&lt;br /&gt;
*** Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$&lt;br /&gt;
*** The four input-output pairs give four inequalities that are impossible to satisfy.&lt;br /&gt;
** Imagine &amp;#039;data-space&amp;#039; in which the axes correspond to components of an input vector.&lt;br /&gt;
*** Each input vector is a point in this space, a weight vector defines a plane in data-space.&lt;br /&gt;
*** The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.&lt;br /&gt;
*** There are many cases where we cannot separate positive and negative cases only by a hyperplane. Those cases are called &amp;#039;&amp;#039;&amp;#039;not linearly separable&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Cannot discriminate simple patterns under translation with wrap-around.&lt;br /&gt;
&lt;br /&gt;
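A small brute-force check of the impossibility argument above (the weight grid and code are illustrative, not from the notes):&lt;br /&gt;

```python
import itertools

# The "same vs different" task on two single-bit features:
cases = [((1, 1), 1), ((0, 0), 1), ((1, 0), 0), ((0, 1), 0)]

def classifies_all(w1, w2, theta):
    """Binary threshold unit: output 1 iff w1*x1 + w2*x2 >= theta."""
    return all((w1 * x1 + w2 * x2 >= theta) == bool(t)
               for (x1, x2), t in cases)

# The four cases give: w1 + w2 >= theta, 0 >= theta, w1 < theta, w2 < theta.
# Adding the first two gives w1 + w2 >= 2*theta, while adding the last two
# gives w1 + w2 < 2*theta, a contradiction.  So a grid search finds nothing:
grid = [v / 2 for v in range(-10, 11)]
solutions = [c for c in itertools.product(grid, repeat=3)
             if classifies_all(*c)]
```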
&lt;br /&gt;
==== Conclusion ====&lt;br /&gt;
&lt;br /&gt;
* Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help: the composition is still linear. Fixed output non-linearities are not enough.&lt;br /&gt;
* We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?&lt;br /&gt;
** We need an efficient way of adapting all the weights, not just the last layer. This is hard.&lt;br /&gt;
** Learning the weights going into hidden units is equivalent to learning features.&lt;br /&gt;
** This is difficult because nobody is telling us directly what the hidden units should do.&lt;br /&gt;
&lt;br /&gt;
== Linear Neurons ==&lt;br /&gt;
&lt;br /&gt;
=== Learning Overview ===&lt;br /&gt;
&lt;br /&gt;
$y = \sum_{i} w_i x_i = \mathbf{w}^T \mathbf{x}$&lt;br /&gt;
&lt;br /&gt;
$y$: Neuron&amp;#039;s estimate of the desired output, $\mathbf{w}$: weight vector, $\mathbf{x}$: input vector&lt;br /&gt;
&lt;br /&gt;
* Perceptron convergence procedure doesn&amp;#039;t work for more complex networks (because the problem is not &amp;#039;&amp;#039;convex&amp;#039;&amp;#039;). Multi-layer neural networks do not use the perceptron learning procedure.&lt;br /&gt;
* Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.&lt;br /&gt;
* The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.&lt;br /&gt;
&lt;br /&gt;
* Why don&amp;#039;t we solve it analytically?&lt;br /&gt;
** Scientific answer: We want a method that real neurons could use.&lt;br /&gt;
** Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.&lt;br /&gt;
&lt;br /&gt;
=== Learning Procedure ===&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;delta-rule&amp;#039;&amp;#039; for learning:&lt;br /&gt;
** $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value&lt;br /&gt;
** $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error&lt;br /&gt;
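&lt;br /&gt;
The delta rule can be sketched on a toy problem (the training cases, true weights, and learning rate are invented for illustration):&lt;br /&gt;

```python
def delta_rule_epoch(w, cases, eps):
    """One online sweep of the delta rule: w_i += eps * x_i * (t - y)."""
    for x, t in cases:
        y = sum(wi * xi for wi, xi in zip(w, x))   # y = w^T x
        w = [wi + eps * xi * (t - y) for wi, xi in zip(w, x)]
    return w

# Targets generated by the true weights (2, -3); the rule should recover them.
cases = [((1.0, 0.0), 2.0), ((0.0, 1.0), -3.0), ((1.0, 1.0), -1.0)]
w = [0.0, 0.0]
for _ in range(200):
    w = delta_rule_epoch(w, cases, eps=0.1)
```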
&lt;br /&gt;
=== The Error Surface for Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
* The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl.&lt;br /&gt;
** Vertical cross-sections are parabolas.&lt;br /&gt;
** Horizontal cross-sections are ellipses.&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated.&lt;br /&gt;
* The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.&lt;br /&gt;
* The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.&lt;br /&gt;
&lt;br /&gt;
== Logistic Neurons ==&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
* These give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
* The derivatives:&lt;br /&gt;
** $\frac{\partial z}{\partial w_i} = x_i$&lt;br /&gt;
** $\frac{\partial z}{\partial x_i} = w_i$&lt;br /&gt;
** $\frac{dy}{dz} = y(1-y)$&lt;br /&gt;
** By the chain rule, $\frac{\partial y}{\partial w_i} = x_i y (1-y)$.&lt;br /&gt;
** By the chain rule, $\frac{\partial E}{\partial w_i} = \sum_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum_{n} x_i^n y^n (1-y^n) (t^n - y^n)$.&lt;br /&gt;
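&lt;br /&gt;
These derivatives can be checked numerically for a single training case (the weights, bias, input, and target below are arbitrary illustration):&lt;br /&gt;

```python
import math

def logistic_forward(w, b, x):
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Analytic gradient of E = 1/2 (t - y)^2 for one case, using
# dE/dw_i = -x_i * y * (1 - y) * (t - y):
w, b, x, t = [0.5, -0.3], 0.1, [1.0, 2.0], 1.0
y = logistic_forward(w, b, x)
grad = [-xi * y * (1 - y) * (t - y) for xi in x]

# Numerical check by central finite differences:
h = 1e-6
num = []
for i in range(len(w)):
    wp = list(w); wp[i] += h
    wm = list(w); wm[i] -= h
    Ep = 0.5 * (t - logistic_forward(wp, b, x)) ** 2
    Em = 0.5 * (t - logistic_forward(wm, b, x)) ** 2
    num.append((Ep - Em) / (2 * h))
```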
&lt;br /&gt;
== The Backpropagation Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== Other Ideas ===&lt;br /&gt;
&lt;br /&gt;
==== Learning by Perturbing Weights ====&lt;br /&gt;
&lt;br /&gt;
* This idea occurs to everyone who knows about evolution.&lt;br /&gt;
* Randomly perturb one weight and see if it improves performance. If so, save the change.&lt;br /&gt;
** This is a form of reinforcement learning.&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;Very inefficient&amp;#039;&amp;#039;&amp;#039;. We need to do multiple forward passes on a representative set of training cases just to change one weight.&lt;br /&gt;
** Towards the end of learning, large weight perturbations will nearly always make things worse.&lt;br /&gt;
&lt;br /&gt;
==== Learning by Using Perturbations ====&lt;br /&gt;
&lt;br /&gt;
* We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.&lt;br /&gt;
* Not any better, because we need lots of trials on each training case to see the effect of changing one weight through the noise created by all the changes to the other weights.&lt;br /&gt;
* A better idea: Randomly perturb the &amp;#039;&amp;#039;activities&amp;#039;&amp;#039; of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights.&lt;br /&gt;
&lt;br /&gt;
=== The Idea behind Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* We don&amp;#039;t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.&lt;br /&gt;
* We can compute error derivatives for all the hidden units efficiently at the same time.&lt;br /&gt;
** Once we have the error derivatives for the hidden activities, it&amp;#039;s easy to get the error derivatives for the weights going into a hidden unit.&lt;br /&gt;
&lt;br /&gt;
=== Sketch of the Backpropagation Algorithm ===&lt;br /&gt;
&lt;br /&gt;
* First convert the discrepancy between each output and its target value into an error derivative.&lt;br /&gt;
* Then compute error derivatives in each hidden layer from error derivatives in the layer above.&lt;br /&gt;
* Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.&lt;br /&gt;
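&lt;br /&gt;
The three steps can be sketched for a tiny net with one hidden layer of logistic units and a linear output (the sizes, weights, and squared-error cost are arbitrary choices for the example):&lt;br /&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(x, t, W1, W2):
    """Logistic hidden layer, linear output, E = 1/2 (t - y)^2.
    Returns (E, dE/dW1, dE/dW2)."""
    # Forward pass
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sum(w * hi for w, hi in zip(W2, h))
    E = 0.5 * (t - y) ** 2
    # Backward pass: output discrepancy -> hidden activities -> weights
    dE_dy = -(t - y)
    dE_dh = [dE_dy * w for w in W2]                          # layer above
    dE_dz = [d * hi * (1 - hi) for d, hi in zip(dE_dh, h)]   # logistic slope
    dW2 = [dE_dy * hi for hi in h]                           # incoming weights
    dW1 = [[dz * xi for xi in x] for dz in dE_dz]
    return E, dW1, dW2

x, t = [1.0, -2.0], 0.5
W1 = [[0.1, 0.2], [-0.3, 0.4]]
W2 = [0.7, -0.6]
E, dW1, dW2 = backprop(x, t, W1, W2)
```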
&lt;br /&gt;
=== Using the Derivatives Computed by Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* The backpropagation algorithm is an efficient way of computing the error derivative $\partial E / \partial w$ for every weight on a single training case.&lt;br /&gt;
* To get a fully specified learning procedure, we still need to make a lot of other decisions about  how to use these error derivatives:&lt;br /&gt;
** Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?&lt;br /&gt;
** Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?&lt;br /&gt;
&lt;br /&gt;
==== Optimization Issues ====&lt;br /&gt;
&lt;br /&gt;
* How often to update the weights:&lt;br /&gt;
** Online: after each training case.&lt;br /&gt;
** Full batch: after a full sweep through the training data.&lt;br /&gt;
** Mini-batch: after a small sample of training cases. This is typically used when training big nets on big data sets.&lt;br /&gt;
* How much to update: (will be discussed further later)&lt;br /&gt;
** Use a fixed learning rate?&lt;br /&gt;
** Adapt the global learning rate?&lt;br /&gt;
** Adapt the learning rate on each connection separately?&lt;br /&gt;
** Don&amp;#039;t use steepest descent?&lt;br /&gt;
&lt;br /&gt;
==== Generalization Issue (Overfitting) ====&lt;br /&gt;
&lt;br /&gt;
* The downside of using powerful models.&lt;br /&gt;
* The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise:&lt;br /&gt;
** The target values may be unreliable (usually only a minor worry).&lt;br /&gt;
** There is &amp;#039;&amp;#039;&amp;#039;sampling error&amp;#039;&amp;#039;&amp;#039;. There will be accidental regularities just because of the particular training cases that were chosen.&lt;br /&gt;
* When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well. &amp;#039;&amp;#039;This is a disaster&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Ways to Reduce Overfitting ====&lt;br /&gt;
&lt;br /&gt;
* A large number of different methods have been developed:&lt;br /&gt;
** Weight-decay: keep weights small / zero.&lt;br /&gt;
** Weight-sharing: insist that many weights have the same value.&lt;br /&gt;
** Early stopping: Hold out a fake test set. While training the net, stop as soon as performance on the fake test set starts getting worse.&lt;br /&gt;
** Model averaging: Train many different neural nets and average them.&lt;br /&gt;
** Bayesian fitting of neural nets: really just a fancy form of model averaging.&lt;br /&gt;
** Dropout: Randomly omit a hidden unit.&lt;br /&gt;
** Generative pre-training: (Will be discussed towards the end of the course.)&lt;br /&gt;
&lt;br /&gt;
== Learning to Predict the Next Word ==&lt;br /&gt;
&lt;br /&gt;
=== Diversion into Cognitive Science ===&lt;br /&gt;
&lt;br /&gt;
There has been a long debate in cognitive science between two rival theories of what it means to have a &amp;#039;&amp;#039;concept&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
* The feature theory: A concept is a set of semantic features&lt;br /&gt;
** Good for explaining similarities between concepts.&lt;br /&gt;
** It&amp;#039;s convenient -- a concept is a vector of feature activities.&lt;br /&gt;
* The structuralist theory: The meaning of a concept lies in its relationships to other concepts.&lt;br /&gt;
** So conceptual knowledge is best expressed as a relational graph.&lt;br /&gt;
** Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.&lt;br /&gt;
&lt;br /&gt;
Both sides are wrong.&lt;br /&gt;
&lt;br /&gt;
* These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.&lt;br /&gt;
* We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just &amp;quot;seeing&amp;quot; the answer with no conscious intervening steps.&lt;br /&gt;
&lt;br /&gt;
==== Localist and Distributed Representations of Concepts ====&lt;br /&gt;
&lt;br /&gt;
The obvious way to implement a relational graph in a neural net is to treat a neuron as a node in the graph and a connection as a binary relationship. But this &amp;quot;localist&amp;quot; method will not work:&lt;br /&gt;
&lt;br /&gt;
* We need many different types of relationship and the connections in a neural net do not have discrete labels.&lt;br /&gt;
* We need ternary relationships as well as binary ones, e.g. A is between B and C&lt;br /&gt;
&lt;br /&gt;
The right way to implement relational knowledge in a neural net is still an open issue.&lt;br /&gt;
&lt;br /&gt;
* But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a &amp;quot;distributed representation&amp;quot;. It&amp;#039;s a many-to-many mapping between concepts and neurons.&lt;br /&gt;
&lt;br /&gt;
=== The Softmax Output Function (Another Diversion) ===&lt;br /&gt;
&lt;br /&gt;
The squared error measure has some drawbacks. Is there a different cost function that works better?&lt;br /&gt;
&lt;br /&gt;
Yes, force the outputs to represent a probability distribution across discrete alternatives.&lt;br /&gt;
&lt;br /&gt;
==== Softmax ====&lt;br /&gt;
&lt;br /&gt;
The output units in a softmax group use a non-local non-linearity. Let $y_i$ be the output of a neuron in a softmax group, and $z_i$ (called the &amp;#039;&amp;#039;logit&amp;#039;&amp;#039;) the input of the same neuron:&lt;br /&gt;
&lt;br /&gt;
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j} }$$&lt;br /&gt;
&lt;br /&gt;
Note that $\sum y_i = 1$, so we force the $y_i$ to represent a probability distribution. We can show:&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$&lt;br /&gt;
&lt;br /&gt;
The right cost function to use with softmax is the negative log probability of the right answer. This is called the &amp;#039;&amp;#039;Cross Entropy&amp;#039;&amp;#039; cost function. ($t_j$ is the target value.)&lt;br /&gt;
&lt;br /&gt;
$$C = - \sum_{j} t_j \log y_j$$&lt;br /&gt;
&lt;br /&gt;
$C$ has a very big gradient when the target value is 1 and the output is almost zero.&lt;br /&gt;
* A value of 0.000001 is much better than 0.000000001 (one-in-a-million chance vs one-in-a-billion chance).&lt;br /&gt;
* The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$.&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial C}{\partial z_i} = \sum_{j} \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$&lt;br /&gt;
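&lt;br /&gt;
A quick numerical sketch of softmax with the cross-entropy gradient (the logits and one-hot target are invented; subtracting the max logit is a standard stability trick not mentioned in the notes):&lt;br /&gt;

```python
import math

def softmax(z):
    m = max(z)                            # stability: shift logits by max
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

z = [2.0, 1.0, -1.0]
t = [1.0, 0.0, 0.0]                       # one-hot target
y = softmax(z)
C = -sum(tj * math.log(yj) for tj, yj in zip(t, y))
grad = [yi - ti for yi, ti in zip(y, t)]  # dC/dz_i = y_i - t_i
```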
&lt;br /&gt;
&lt;br /&gt;
=== Neuro-Probabilistic Language Models ===&lt;br /&gt;
&lt;br /&gt;
* We cannot identify phonemes perfectly in noisy speech.&lt;br /&gt;
* People use their understanding of the meaning of the utterance to hear the right words.&lt;br /&gt;
** &amp;quot;We do this unconsciously when we wreck a nice beach&amp;quot;&lt;br /&gt;
** We are very good at it.&lt;br /&gt;
* This means speech recognizers have to know which words are likely to come next and which are not.&lt;br /&gt;
** Fortunately words can be predicted quite well without full understanding.&lt;br /&gt;
&lt;br /&gt;
==== The Standard &amp;quot;Trigram&amp;quot; Method ====&lt;br /&gt;
&lt;br /&gt;
* Take a huge amount of text and count the frequencies of all triples of words.&lt;br /&gt;
&lt;br /&gt;
* Use these frequencies to make bets on the relative probabilities of words given the previous two words.&lt;br /&gt;
&lt;br /&gt;
* Until very recently this was the state-of-the-art.&lt;br /&gt;
** We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.&lt;br /&gt;
** We have to &amp;quot;back-off&amp;quot; to digrams when the count for a trigram is too small. The probability is not zero just because the count is zero!&lt;br /&gt;
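&lt;br /&gt;
Trigram counting with bigram back-off can be sketched as follows (the tiny corpus is invented; real back-off schemes also smooth the probabilities rather than using raw relative frequencies):&lt;br /&gt;

```python
from collections import Counter

def trigram_counts(words):
    return Counter(zip(words, words[1:], words[2:]))

def next_word_probs(tri, bi, w1, w2):
    """P(w3 | w1, w2) by relative frequency; falls back to the
    bigram distribution when the trigram context was never seen."""
    context = {k[2]: c for k, c in tri.items() if k[:2] == (w1, w2)}
    if context:
        total = sum(context.values())
        return {w: c / total for w, c in context.items()}
    fallback = {k[1]: c for k, c in bi.items() if k[0] == w2}
    total = sum(fallback.values()) or 1
    return {w: c / total for w, c in fallback.items()}

words = "the cat sat on the mat the cat ate".split()
tri = trigram_counts(words)
bi = Counter(zip(words, words[1:]))
probs = next_word_probs(tri, bi, "the", "cat")
```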
&lt;br /&gt;
* A trigram model doesn&amp;#039;t understand the similarities between words.&lt;br /&gt;
** The sentence &amp;quot;The cat got squashed in the garden on Friday&amp;quot; should help us predict words in the sentence &amp;quot;The dog got flattened in the yard on Monday&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.&lt;br /&gt;
** Using a feature representation also allows a context that contains many more previous words, e.g. 10.&lt;br /&gt;
&lt;br /&gt;
* Yoshua Bengio&amp;#039;s neural net for predicting the next word&lt;br /&gt;
&lt;br /&gt;
== Object Recognition ==&lt;br /&gt;
&lt;br /&gt;
=== Why Object Recognition is Difficult ===&lt;br /&gt;
&lt;br /&gt;
* Segmentation: Real scenes are cluttered with other objects.&lt;br /&gt;
* Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.&lt;br /&gt;
* Deformation: Objects can deform in a variety of non-affine ways.&lt;br /&gt;
* Affordances: Object classes are things defined by how they are used.&lt;br /&gt;
* Viewpoint: Changes in viewpoint cause changes in images that standard learning methods cannot cope with.&lt;br /&gt;
&lt;br /&gt;
=== Achieving Viewpoint Invariance ===&lt;br /&gt;
&lt;br /&gt;
* Viewpoint invariance is one of the main difficulties in making computers perceive.&lt;br /&gt;
* Several different approaches:&lt;br /&gt;
** Use redundant invariant features.&lt;br /&gt;
** Put a box around the object and use normalized pixels.&lt;br /&gt;
** Use replicated features with pooling. This is called &amp;#039;&amp;#039;convolutional neural nets&amp;#039;&amp;#039; (in later lectures).&lt;br /&gt;
** Use a hierarchy of parts that have explicit poses relative to the camera (in later lectures).&lt;br /&gt;
&lt;br /&gt;
==== Invariant Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* Extract a large, redundant set of features that are invariant under transformations.&lt;br /&gt;
** e.g. pair of roughly parallel lines with a red dot between them (this is what baby herring gulls use to know where to peck for food).&lt;br /&gt;
* With enough invariant features, there is only one way to assemble them into an object.&lt;br /&gt;
* But for recognition, we must avoid forming features from parts of different objects.&lt;br /&gt;
&lt;br /&gt;
==== Judicious Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* Put a box around the object and use it as a coordinate frame for a set of normalized pixels.&lt;br /&gt;
** This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.&lt;br /&gt;
** The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, ...&lt;br /&gt;
&lt;br /&gt;
* But choosing the box is difficult because of:&lt;br /&gt;
** Segmentation errors, occlusion, unusual orientations.&lt;br /&gt;
&lt;br /&gt;
* We need to recognize the shape to get the box right!&lt;br /&gt;
&lt;br /&gt;
==== Brute Force Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* When training the recognizer, use well-segmented, upright images to fit the correct box.&lt;br /&gt;
&lt;br /&gt;
* At test time try all possible boxes in a range of positions and scales.&lt;br /&gt;
&lt;br /&gt;
** This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.&lt;br /&gt;
** It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Digit Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== The Replicated Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* This is currently the dominant approach for neural networks.&lt;br /&gt;
&lt;br /&gt;
* Use many different copies of the same feature detector with different positions.&lt;br /&gt;
&lt;br /&gt;
** Could also replicate across scale and orientation but it is tricky and expensive.&lt;br /&gt;
** Replication greatly reduces the number of free parameters to be learned.&lt;br /&gt;
&lt;br /&gt;
* Use several different feature types, each with its own map of replicated detectors.&lt;br /&gt;
&lt;br /&gt;
==== Backpropagation with Weight Constraints ====&lt;br /&gt;
&lt;br /&gt;
* It&amp;#039;s easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.&lt;br /&gt;
&lt;br /&gt;
* We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints - so if the weights started off satisfying the constraints, they will continue to satisfy them.&lt;br /&gt;
&lt;br /&gt;
To constrain $w_1 = w_2$, we need $\Delta w_1 = \Delta w_2$.&lt;br /&gt;
&lt;br /&gt;
Compute $\partial E / \partial w_1$ and $\partial E / \partial w_2$, then use $\partial E / \partial w_1 + \partial E / \partial w_2$ for both $w_1$ and $w_2$.&lt;br /&gt;
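&lt;br /&gt;
The tied-weight update can be sketched on a toy objective (the quadratic objective and learning rate are invented; the point is that both weights receive the summed gradient, so $w_1 = w_2$ stays true at every step):&lt;br /&gt;

```python
def tied_descent(w, eps, steps):
    """Minimize (w1 - 1)^2 + (w2 - 3)^2 subject to w1 = w2 = w.
    g1 = 2(w - 1) and g2 = 2(w - 3); both weights get g1 + g2."""
    for _ in range(steps):
        g = 2 * (w - 1) + 2 * (w - 3)   # summed gradient for the tied pair
        w = w - eps * g
    return w

# The constrained minimum is at w = 2, halfway between the two targets:
w = tied_descent(0.0, eps=0.1, steps=100)
```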
&lt;br /&gt;
==== What Does Replicating the Feature Detectors Achieve? ====&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Equivariant&amp;#039;&amp;#039;&amp;#039; activities: Replicated features do &amp;#039;&amp;#039;not&amp;#039;&amp;#039; make the neural activities invariant to translation. The activities are equivariant.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Invariant&amp;#039;&amp;#039;&amp;#039; knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.&lt;br /&gt;
&lt;br /&gt;
==== Pooling the Outputs of Replicated Feature Detectors ====&lt;br /&gt;
&lt;br /&gt;
* Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.&lt;br /&gt;
** This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.&lt;br /&gt;
** Taking the maximum of the four works slightly better.&lt;br /&gt;
&lt;br /&gt;
* Problem: After several levels of pooling, we have lost information about the precise positions of things.&lt;br /&gt;
** This makes it impossible to use the precise spatial relationships between high-level parts for recognition.&lt;br /&gt;
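&lt;br /&gt;
Pooling four neighboring detector outputs can be sketched as follows (the 4x4 feature map is invented; both the averaging and the slightly better max variant are shown):&lt;br /&gt;

```python
def pool2x2(fmap, mode="max"):
    """Pool non-overlapping 2x2 blocks of a feature map (list of rows)."""
    out = []
    for r in range(0, len(fmap) - 1, 2):
        row = []
        for c in range(0, len(fmap[0]) - 1, 2):
            block = [fmap[r][c], fmap[r][c + 1],
                     fmap[r + 1][c], fmap[r + 1][c + 1]]
            row.append(max(block) if mode == "max" else sum(block) / 4.0)
        out.append(row)
    return out

fmap = [[1, 3, 0, 2],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [0, 0, 7, 8]]
pooled = pool2x2(fmap)           # each output summarizes a 2x2 block
```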
&lt;br /&gt;
==== LeNet ====&lt;br /&gt;
&lt;br /&gt;
* Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:&lt;br /&gt;
** Many hidden layers.&lt;br /&gt;
** Many maps of replicated units in each layer.&lt;br /&gt;
** Pooling of the outputs of nearby replicated units.&lt;br /&gt;
** A wide net that can cope with several characters at once even if they overlap.&lt;br /&gt;
** A clever way of training a complete system, not just a recognizer.&lt;br /&gt;
&lt;br /&gt;
* This net was used for reading ~10% of the checks in North America.&lt;br /&gt;
&lt;br /&gt;
* Look at the impressive demos of LeNet at http://yann.lecun.com&lt;br /&gt;
&lt;br /&gt;
==== Priors and Prejudice ====&lt;br /&gt;
&lt;br /&gt;
* We can put our prior knowledge about the task into the network by designing appropriate:&lt;br /&gt;
** Connectivity&lt;br /&gt;
** Weight constraints&lt;br /&gt;
** Neuron activation functions&lt;br /&gt;
&lt;br /&gt;
* This is less intrusive than hand-designing the features&lt;br /&gt;
** But it still prejudices the network towards the particular way of solving the problem that we had in mind.&lt;br /&gt;
&lt;br /&gt;
* Alternatively, we can use our prior knowledge to create a whole lot more training data.&lt;br /&gt;
** This may require a lot of work (Hofman &amp;amp; Tresp, 1993). Real data + synthetic data.&lt;br /&gt;
** It may make learning take much longer.&lt;br /&gt;
** It allows optimization to discover clever ways of using the multi-layer network that we did not think of - and we may never fully understand how it does it.&lt;br /&gt;
&lt;br /&gt;
==== The Brute Force Approach ====&lt;br /&gt;
&lt;br /&gt;
* LeNet uses knowledge about the invariances to design:&lt;br /&gt;
** the local connectivity&lt;br /&gt;
** the weight-sharing&lt;br /&gt;
** the pooling&lt;br /&gt;
&lt;br /&gt;
* This achieves about 80 errors.&lt;br /&gt;
** This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008).&lt;br /&gt;
&lt;br /&gt;
* Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:&lt;br /&gt;
** For each training image, they produce many new training examples by applying many different transformations.&lt;br /&gt;
** They can then train a large, deep, dumb net on a GPU without much overfitting.&lt;br /&gt;
** They achieved about 35 errors.&lt;br /&gt;
&lt;br /&gt;
* With model averaging they can now get about 25 errors (which is about the same as human error rate).&lt;br /&gt;
&lt;br /&gt;
==== How to Detect a Significant Drop in the Error Rate ====&lt;br /&gt;
&lt;br /&gt;
* Is 30 errors in 10,000 test cases significantly better than 40 errors?&lt;br /&gt;
** It all depends on the particular errors.&lt;br /&gt;
** The McNemar test uses the particular errors and can be much more powerful than a test that just uses the number of errors.&lt;br /&gt;
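&lt;br /&gt;
The point can be sketched with the McNemar statistic (this uses the continuity-corrected chi-squared form, one common variant; all the error counts below are hypothetical):&lt;br /&gt;

```python
# Only the discordant cases matter: n01 = model A wrong but B right,
# n10 = A right but B wrong.  Values above ~3.84 are significant at
# the 5% level for a chi-squared distribution with 1 degree of freedom.
def mcnemar_statistic(n01, n10):
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Same totals (40 vs 30 errors), very different conclusions:
# if B's 30 errors are a subset of A's 40, the improvement is significant;
# if the two error sets barely overlap (say 11 shared), it is not.
subset_case = mcnemar_statistic(n01=10, n10=0)
overlap_case = mcnemar_statistic(n01=29, n10=19)
```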
&lt;br /&gt;
=== Convolutional Nets for Object Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== From Hand-Written Digits to 3D Objects ====&lt;br /&gt;
&lt;br /&gt;
* Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:&lt;br /&gt;
&lt;br /&gt;
** A hundred times as many classes (1000 vs 10)&lt;br /&gt;
** A hundred times as many pixels (256x256 color vs 28x28 gray)&lt;br /&gt;
** Two dimensional image of three-dimensional scene.&lt;br /&gt;
** Cluttered scenes requiring segmentation.&lt;br /&gt;
** Multiple objects in each image.&lt;br /&gt;
&lt;br /&gt;
* Will the same type of convolutional neural network work?&lt;br /&gt;
&lt;br /&gt;
==== The ILSVRC-2012 Competition on ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* The dataset has 1.2 million high-resolution training images.&lt;br /&gt;
&lt;br /&gt;
* The classification task:&lt;br /&gt;
** Get the &amp;quot;correct&amp;quot; class in your top 5 bets. There are 1000 classes.&lt;br /&gt;
&lt;br /&gt;
* The localization task:&lt;br /&gt;
** For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.&lt;br /&gt;
&lt;br /&gt;
* Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...&lt;br /&gt;
** Computer vision systems use complicated multi-stage systems.&lt;br /&gt;
** The early stages are typically hand-tuned by optimizing a few parameters.&lt;br /&gt;
&lt;br /&gt;
==== Error Rates on the ILSVRC-2012 Competition ====&lt;br /&gt;
&lt;br /&gt;
* University of Tokyo: 26.1% (classification)&lt;br /&gt;
* Oxford University Computer Vision Group: 26.9% (classification)&lt;br /&gt;
* INRIA(French national research institute in CS) + XRCE(Xerox Research Center Europe): 27.0% (classification)&lt;br /&gt;
* University of Amsterdam: 29.5% (classification)&lt;br /&gt;
* University of Toronto (&amp;#039;&amp;#039;&amp;#039;Alex Krizhevsky&amp;#039;&amp;#039;&amp;#039;): &amp;#039;&amp;#039;&amp;#039;16.4%&amp;#039;&amp;#039;&amp;#039; (classification)&lt;br /&gt;
&lt;br /&gt;
==== A Neural Network for ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its &amp;#039;&amp;#039;architecture&amp;#039;&amp;#039; was:&lt;br /&gt;
** 7 hidden layers not counting some max pooling layers.&lt;br /&gt;
** The early layers were convolutional.&lt;br /&gt;
** The last two layers were globally connected.&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;activation functions&amp;#039;&amp;#039; were:&lt;br /&gt;
** Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.&lt;br /&gt;
** Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.&lt;br /&gt;
&lt;br /&gt;
==== Tricks that Significantly Improve Generalization ====&lt;br /&gt;
&lt;br /&gt;
* Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.&lt;br /&gt;
** At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;dropout&amp;quot; to regularize the weights in the globally connected layers (which contain most of the parameters).&lt;br /&gt;
** Dropout means that half of the hidden units in a layer are randomly removed for each training example.&lt;br /&gt;
** This stops hidden units from relying too much on other hidden units.&lt;br /&gt;
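&lt;br /&gt;
A dropout mask can be sketched as follows (this uses the &amp;quot;inverted dropout&amp;quot; convention of scaling survivors at training time; the original formulation instead halves the outgoing weights at test time):&lt;br /&gt;

```python
import random

def dropout(h, p=0.5, rng=random):
    """Zero each hidden activity with probability p, scaling survivors
    by 1/(1-p) so the expected activity is unchanged."""
    return [hi / (1 - p) if rng.random() >= p else 0.0 for hi in h]

random.seed(0)
h = [0.2, 0.9, 0.4, 0.7]
h_train = dropout(h)    # roughly half the units are zeroed this pass
h_test = h              # at test time, all units are used
```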
&lt;br /&gt;
==== The Hardware Required for Alex Krizhevsky&amp;#039;s Net ====&lt;br /&gt;
&lt;br /&gt;
* He uses a very efficient implementation of convolutional nets on two Nvidia GTX 580 Graphics Processing Units (over 1000 fast little cores)&lt;br /&gt;
** GPUs are very good for matrix-matrix multiplies.&lt;br /&gt;
** GPUs have very high bandwidth to memory.&lt;br /&gt;
** This allows him to train the network in a week.&lt;br /&gt;
** It also makes it quick to combine results from 10 patches at test time.&lt;br /&gt;
&lt;br /&gt;
* We can spread a network over many cores if we can communicate the states fast enough.&lt;br /&gt;
&lt;br /&gt;
* As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.&lt;br /&gt;
&lt;br /&gt;
==== Finding Roads in High-Resolution Images ====&lt;br /&gt;
&lt;br /&gt;
* Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images.&lt;br /&gt;
** It takes a large image patch and predicts a binary road label for the central 16x16 pixels.&lt;br /&gt;
** There is lots of labeled training data available for this task.&lt;br /&gt;
&lt;br /&gt;
* The task is hard for many reasons:&lt;br /&gt;
** Occlusion by buildings, trees, and cars.&lt;br /&gt;
** Shadows and lighting changes.&lt;br /&gt;
** Minor viewpoint changes.&lt;br /&gt;
&lt;br /&gt;
* The worst problems are incorrect labels:&lt;br /&gt;
** Badly registered maps&lt;br /&gt;
** Arbitrary decisions about what counts as a road (vs laneway etc.)&lt;br /&gt;
&lt;br /&gt;
== Mini-Batch Gradient Descent ==&lt;br /&gt;
&lt;br /&gt;
=== Overview ===&lt;br /&gt;
&lt;br /&gt;
* The Error Surface: Lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl where vertical cross-sections are parabolas and horizontal cross-sections are ellipses.&lt;br /&gt;
&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated. But locally, a piece of a quadratic bowl is usually a very good approximation.&lt;br /&gt;
&lt;br /&gt;
* Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.&lt;br /&gt;
** The gradient is big in the direction in which we only want to travel a small distance, small in the direction in which we want to travel a large distance.&lt;br /&gt;
&lt;br /&gt;
* If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges.&lt;br /&gt;
&lt;br /&gt;
* What we would like to achieve: move quickly in directions with small but consistent gradients, and move slowly in directions with big but inconsistent gradients.&lt;br /&gt;
&lt;br /&gt;
==== Stochastic Gradient Descent ====&lt;br /&gt;
&lt;br /&gt;
* If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.&lt;br /&gt;
** So, instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.&lt;br /&gt;
** The extreme version of this approach updates weights after each case. It&amp;#039;s called &amp;quot;online&amp;quot; learning.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches are usually better than online.&lt;br /&gt;
** Less computation is used updating the weights.&lt;br /&gt;
** Computing the gradient for many cases simultaneously uses matrix-matrix multiplies which are very efficient, especially on GPUs.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches need to be balanced for classes (one way to approximate this is to pick points randomly).&lt;br /&gt;
&lt;br /&gt;
==== Two types of learning algorithm ====&lt;br /&gt;
&lt;br /&gt;
* If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).&lt;br /&gt;
&lt;br /&gt;
* For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.&lt;br /&gt;
&lt;br /&gt;
==== Basic Mini-Batch Gradient Descent Algorithm ====&lt;br /&gt;
&lt;br /&gt;
* Guess an initial learning rate.&lt;br /&gt;
** If the error keeps getting worse or oscillates wildly, reduce the learning rate.&lt;br /&gt;
** If the error is falling fairly consistently but slowly, increase the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Write a simple program to automate this way of adjusting the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Toward the end of mini-batch learning, it nearly always helps to turn down the learning rate.&lt;br /&gt;
** This removes fluctuations in the final weights caused by the variations between mini-batches.&lt;br /&gt;
&lt;br /&gt;
* Turn down the learning rate when the error stops decreasing.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Note&amp;#039;&amp;#039;&amp;#039;: Compute the error on a separate validation set.&lt;br /&gt;
&lt;br /&gt;
=== Tricks for Mini-Batch Gradient Descent ===&lt;br /&gt;
&lt;br /&gt;
==== Initializing The Weights ====&lt;br /&gt;
&lt;br /&gt;
* If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.&lt;br /&gt;
** So they can never learn to be different features.&lt;br /&gt;
** We break symmetry by initializing the weights to have small random values.&lt;br /&gt;
&lt;br /&gt;
* If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.&lt;br /&gt;
** We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to $1/\sqrt{\text{fan-in}}$.&lt;br /&gt;
&lt;br /&gt;
* We can also scale the learning rate the same way.&lt;br /&gt;
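&lt;br /&gt;
A minimal sketch of this initialization (illustrative only; the function name and the Gaussian choice are assumptions, not from the lecture):&lt;br /&gt;
&lt;br /&gt;
```python
import math
import random

def init_weights(fan_in, fan_out, seed=0):
    # Small random values break the symmetry between hidden units.
    # Scaling the spread by 1/sqrt(fan-in) keeps the total input to a
    # unit at roughly the same size whatever its number of incoming
    # weights, which helps avoid overshoot when fan-in is big.
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(fan_in)
    return [[rng.gauss(0.0, scale) for _ in range(fan_in)]
            for _ in range(fan_out)]

weights = init_weights(fan_in=100, fan_out=10)
```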
&lt;br /&gt;
==== Shifting the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, shifting the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.&lt;br /&gt;
&lt;br /&gt;
* The hyperbolic tangent (which is $2\times \text{logistic} - 1$) produces hidden activations that are roughly zero mean. &lt;br /&gt;
** In this respect it&amp;#039;s better than the logistic.&lt;br /&gt;
&lt;br /&gt;
==== Scaling the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, scaling the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.&lt;br /&gt;
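&lt;br /&gt;
Shifting and scaling can be combined into a single standardization pass over the training set (a sketch; the helper name is made up):&lt;br /&gt;
&lt;br /&gt;
```python
def standardize(data):
    # Transform each component of the input vectors to zero mean and
    # unit variance over the whole training set.
    n = len(data)
    dims = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((row[j] - means[j]) ** 2 for row in data) / n
        # Guard against constant components (zero variance).
        stds.append(var ** 0.5 or 1.0)
    return [[(row[j] - means[j]) / stds[j] for j in range(dims)]
            for row in data]
```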
&lt;br /&gt;
==== A More Thorough Method: Decorrelate the Input Components ====&lt;br /&gt;
&lt;br /&gt;
* For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.&lt;br /&gt;
&lt;br /&gt;
* There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.&lt;br /&gt;
** Drop the principal components with the smallest eigenvalues ― &amp;#039;&amp;#039;&amp;#039;This achieves some dimensionality reduction&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.&lt;br /&gt;
&lt;br /&gt;
* For a circular error surface, the gradient points straight towards the minimum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Common Problems That Occur in Multilayer Networks ====&lt;br /&gt;
&lt;br /&gt;
* If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.&lt;br /&gt;
** The error derivatives for the hidden units will all become tiny and the error will not decrease.&lt;br /&gt;
** This is usually a plateau, but people often mistake it for a local minimum.&lt;br /&gt;
&lt;br /&gt;
* In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.&lt;br /&gt;
** The network finds this strategy quickly and may take a long time to improve on it by making use of the input.&lt;br /&gt;
** This is another plateau that looks like a local minimum.&lt;br /&gt;
&lt;br /&gt;
==== Be Careful about Turning Down the Learning Rate ====&lt;br /&gt;
&lt;br /&gt;
* Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.&lt;br /&gt;
** So we get a quick win.&lt;br /&gt;
** But then we get slower learning.&lt;br /&gt;
&lt;br /&gt;
* Don&amp;#039;t turn down the learning rate too soon!&lt;br /&gt;
&lt;br /&gt;
==== Four Ways to Speed up Mini-Batch Learning ====&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;momentum&amp;quot;&lt;br /&gt;
** Instead of using the gradient to change the &amp;#039;&amp;#039;position&amp;#039;&amp;#039; of the weight &amp;quot;particle&amp;quot;, use it to change the &amp;#039;&amp;#039;velocity&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
* Use separate adaptive learning rates for each parameter.&lt;br /&gt;
** Slowly adjust the rate using the consistency of the gradient for that parameter.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;rmsprop&amp;#039;&amp;#039;&amp;#039;: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. &lt;br /&gt;
** This is the mini-batch version of just using the sign of the gradient.&lt;br /&gt;
&lt;br /&gt;
* Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)&lt;br /&gt;
** Adapt it to work for neural nets.&lt;br /&gt;
** Adapt it to work for mini-batches.&lt;br /&gt;
&lt;br /&gt;
=== The Momentum Method ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.&lt;br /&gt;
# The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.&lt;br /&gt;
# Its momentum makes it keep going in the previous direction.&lt;br /&gt;
&lt;br /&gt;
* It damps oscillations in directions of high curvature by combining gradients with opposite signs.&lt;br /&gt;
&lt;br /&gt;
* It builds up speed in directions with a gentle but consistent gradient.&lt;br /&gt;
&lt;br /&gt;
==== Equations ====&lt;br /&gt;
&lt;br /&gt;
$$\mathbf{v}(t) = \alpha \mathbf{v}(t-1) - \epsilon \frac{\partial E}{\partial \mathbf{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The effect of the gradient is to increment the previous velocity. The velocity also decays by a factor $\alpha$, which is slightly less than 1.&lt;br /&gt;
&lt;br /&gt;
The weight change is equal to the current velocity.&lt;br /&gt;
&lt;br /&gt;
$$\Delta \mathbf{w}(t) = \mathbf{v} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \mathbf{v} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbf{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \Delta \mathbf{w} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbf{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The weight change can be expressed in terms of the previous weight change and the current gradient.&lt;br /&gt;
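&lt;br /&gt;
The two update equations translate directly into code (a sketch with made-up names; $\alpha$ is the momentum and $\epsilon$ the learning rate):&lt;br /&gt;
&lt;br /&gt;
```python
def momentum_step(w, v, grad, alpha=0.9, eps=0.01):
    # v(t) = alpha * v(t-1) - eps * dE/dw(t)
    v_new = [alpha * vi - eps * gi for vi, gi in zip(v, grad)]
    # delta w(t) = v(t): the weight change equals the current velocity.
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new
```
On a tilted plane (constant gradient) the velocity settles at the terminal value $\frac{1}{1-\alpha}\big(-\epsilon \frac{\partial E}{\partial w}\big)$.&lt;br /&gt;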
&lt;br /&gt;
==== Behavior ====&lt;br /&gt;
&lt;br /&gt;
* If the error surface is a tilted plane, the ball reaches a terminal velocity $\mathbf{v}(\infty)$.&lt;br /&gt;
** If the momentum is close to 1, this is much faster than simple gradient descent.&lt;br /&gt;
&lt;br /&gt;
$$\mathbf{v}(\infty) = \frac{1}{1 - \alpha} \Big( - \epsilon \frac{\partial E}{\partial \mathbf{w}} \Big)$$&lt;br /&gt;
&lt;br /&gt;
* At the beginning of learning there may be very large gradients.&lt;br /&gt;
** So it pays to use a small momentum (e.g. 0.5).&lt;br /&gt;
** Once the large gradients have disappeared and the weights are stuck in a ravine, the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99).&lt;br /&gt;
&lt;br /&gt;
* This allows us to learn at a rate that would cause divergent oscillations without the momentum.&lt;br /&gt;
&lt;br /&gt;
==== Better Type of Momentum (Nesterov 1983) ====&lt;br /&gt;
&lt;br /&gt;
* The standard momentum method &amp;#039;&amp;#039;first&amp;#039;&amp;#039; computes the gradient at the current location and &amp;#039;&amp;#039;then&amp;#039;&amp;#039; takes a big jump in the direction of the updated accumulated gradient.&lt;br /&gt;
* Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.&lt;br /&gt;
** Inspired by the Nesterov method for optimizing convex functions.&lt;br /&gt;
* &amp;#039;&amp;#039;First&amp;#039;&amp;#039; make a big jump in the direction of the previous accumulated gradient.&lt;br /&gt;
* &amp;#039;&amp;#039;Then&amp;#039;&amp;#039; measure the gradient where you end up and make a correction.&lt;br /&gt;
** It&amp;#039;s better to correct a mistake &amp;#039;&amp;#039;after&amp;#039;&amp;#039; you have made it!&lt;br /&gt;
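&lt;br /&gt;
A sketch of the Nesterov-style update described above (names are invented; grad_fn returns the gradient at a given weight vector):&lt;br /&gt;
&lt;br /&gt;
```python
def nesterov_step(w, v, grad_fn, alpha=0.9, eps=0.01):
    # First make the big jump in the direction of the previous
    # accumulated gradient...
    lookahead = [wi + alpha * vi for wi, vi in zip(w, v)]
    # ...then measure the gradient where you end up and make a
    # correction there.
    g = grad_fn(lookahead)
    v_new = [alpha * vi - eps * gi for vi, gi in zip(v, g)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new
```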
&lt;br /&gt;
=== Adaptive Learning Rates for Each Connection ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
* In a multilayer net, the appropriate learning rates can vary widely between weights:&lt;br /&gt;
** The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.&lt;br /&gt;
** The fan-in of a unit determines the size of the &amp;quot;overshoot&amp;quot; effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.&lt;br /&gt;
&lt;br /&gt;
* So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.&lt;br /&gt;
&lt;br /&gt;
==== One Way to Determine the Individual Learning Rates ====&lt;br /&gt;
&lt;br /&gt;
$$\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$$&lt;br /&gt;
&lt;br /&gt;
* Start with a local gain of 1 for every weight, i.e. $g_{ij}(0) = 1$.&lt;br /&gt;
* Increase the local gain if the gradient for that weight does not change sign.&lt;br /&gt;
* Use small additive increases and multiplicative decreases.&lt;br /&gt;
** This ensures that big gains decay rapidly when oscillations start.&lt;br /&gt;
** If the gradient is totally random, the gain will hover around 1 when we increase by &amp;#039;&amp;#039;plus&amp;#039;&amp;#039; $\delta$ half the time and decrease by &amp;#039;&amp;#039;times&amp;#039;&amp;#039; $1 - \delta$ half the time.&lt;br /&gt;
&lt;br /&gt;
$$\text{if } \Big( \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) \Big) &amp;gt; 0$$&lt;br /&gt;
$$\text{then } g_{ij}(t) = g_{ij}(t - 1) + 0.05$$&lt;br /&gt;
$$\text{else } g_{ij}(t) = g_{ij}(t - 1) \cdot 0.95$$&lt;br /&gt;
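&lt;br /&gt;
In code, the gain update and the gained weight step might look like this (a sketch; the helper names are assumptions):&lt;br /&gt;
&lt;br /&gt;
```python
def update_gains(gains, grad, prev_grad, up=0.05, down=0.95):
    # Additive increase when the gradient keeps its sign,
    # multiplicative decrease when it flips, so big gains decay
    # rapidly once oscillation starts.
    new = []
    for g, cur, prev in zip(gains, grad, prev_grad):
        if cur * prev > 0:
            new.append(g + up)
        else:
            new.append(g * down)
    return new

def gained_step(w, gains, grad, eps=0.01):
    # delta w_ij = -eps * g_ij * dE/dw_ij
    return [wi - eps * gi * di for wi, gi, di in zip(w, gains, grad)]
```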
&lt;br /&gt;
==== Tricks for Making Adaptive Learning Rates Work Better ====&lt;br /&gt;
&lt;br /&gt;
* Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].&lt;br /&gt;
&lt;br /&gt;
* Use full batch learning or very big mini-batches.&lt;br /&gt;
** This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates can be combined with momentum.&lt;br /&gt;
** Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates only deal with axis-aligned effects.&lt;br /&gt;
** Momentum does not care about the alignment of the axes.&lt;br /&gt;
&lt;br /&gt;
=== RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude ===&lt;br /&gt;
&lt;br /&gt;
==== RPROP: Using only the sign of the gradient ====&lt;br /&gt;
&lt;br /&gt;
* The magnitude of the gradient can be very different for different weights and can change during learning.&lt;br /&gt;
** This makes it hard to choose a single global learning rate.&lt;br /&gt;
&lt;br /&gt;
* For &amp;#039;&amp;#039;full batch learning&amp;#039;&amp;#039;, we can deal with this variation by only using the sign of the gradient.&lt;br /&gt;
** The weight updates are all of the same magnitude.&lt;br /&gt;
** This escapes from plateaus with tiny gradients quickly.&lt;br /&gt;
&lt;br /&gt;
* RPROP: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.&lt;br /&gt;
** Increase the step size for a weight &amp;#039;&amp;#039;multiplicatively&amp;#039;&amp;#039; (e.g. times 1.2) if the signs of its last two gradients agree.&lt;br /&gt;
** Otherwise decrease the step size multiplicatively (e.g. times 0.5).&lt;br /&gt;
** Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster&amp;#039;s advice).&lt;br /&gt;
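&lt;br /&gt;
A sketch of one RPROP step for a vector of weights (illustrative; the bookkeeping of previous signs is one of several possible formulations):&lt;br /&gt;
&lt;br /&gt;
```python
def rprop_step(w, step, prev_sign, grad, inc=1.2, dec=0.5,
               max_step=50.0, min_step=1e-6):
    # Use only the sign of the gradient; adapt a separate step size
    # for each weight multiplicatively.
    w_new, step_new, sign_new = [], [], []
    for wi, si, ps, gi in zip(w, step, prev_sign, grad):
        s = (gi > 0) - (0 > gi)           # sign of the current gradient
        if s * ps > 0:
            si = min(si * inc, max_step)  # last two signs agree: grow
        elif 0 > s * ps:
            si = max(si * dec, min_step)  # signs disagree: shrink
        w_new.append(wi - si * s)
        step_new.append(si)
        sign_new.append(s)
    return w_new, step_new, sign_new
```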
&lt;br /&gt;
==== Why RPROP Does Not Work with Mini-Batches ====&lt;br /&gt;
&lt;br /&gt;
* The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.&lt;br /&gt;
** Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.&lt;br /&gt;
** We want this weight to stay roughly where it is.&lt;br /&gt;
&lt;br /&gt;
* RPROP would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale), so the weight would grow a lot.&lt;br /&gt;
* Is there a way to combine:&lt;br /&gt;
** The robustness of RPROP.&lt;br /&gt;
** The efficiency of mini-batches.&lt;br /&gt;
** The effective averaging of gradients over mini-batches.&lt;br /&gt;
&lt;br /&gt;
==== RMSPROP: Mini-Batch Version of RPROP ====&lt;br /&gt;
&lt;br /&gt;
* RPROP is equivalent to using the gradient but also dividing by the size of the gradient.&lt;br /&gt;
** The problem with mini-batch RPROP is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?&lt;br /&gt;
&lt;br /&gt;
* RMSPROP: Keep a moving average of the squared gradient for each weight.&lt;br /&gt;
$$MeanSquare(w,t) = 0.9 \cdot MeanSquare(w, t-1) + 0.1 \Big( \frac{\partial E}{\partial w}(t) \Big) ^2$$&lt;br /&gt;
&lt;br /&gt;
* Dividing the gradient by $\sqrt{MeanSquare(w,t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).&lt;br /&gt;
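&lt;br /&gt;
A sketch of the update (the small constant added before dividing is a common safeguard against division by zero, not part of the formula above):&lt;br /&gt;
&lt;br /&gt;
```python
def rmsprop_step(w, ms, grad, eps=0.01, decay=0.9, tiny=1e-8):
    # MeanSquare(w, t) = 0.9 * MeanSquare(w, t-1) + 0.1 * grad^2
    ms_new = [decay * m + (1.0 - decay) * g * g
              for m, g in zip(ms, grad)]
    # Divide the gradient by sqrt(MeanSquare) before the update, so
    # the effective step size is roughly eps whatever the gradient
    # magnitude for that weight.
    w_new = [wi - eps * g / ((m ** 0.5) + tiny)
             for wi, g, m in zip(w, grad, ms_new)]
    return w_new, ms_new
```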
&lt;br /&gt;
==== Further Developments of RMSPROP ====&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with standard momentum&lt;br /&gt;
** Momentum does not help as much as it normally does. Needs more investigation.&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with Nesterov momentum (Sutskever 2012)&lt;br /&gt;
** It works best if the RMS of the recent gradients is used to divide the correction rather than the jump in the direction of accumulated corrections.&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with adaptive learning rates for each connection&lt;br /&gt;
** Needs more investigation.&lt;br /&gt;
&lt;br /&gt;
* Other methods related to RMSPROP&lt;br /&gt;
** Yann LeCun&amp;#039;s group has a fancy version in &amp;quot;No more pesky learning rates&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Summary of Learning Methods for Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.&lt;br /&gt;
** Conjugate gradient, LBFGS, ...&lt;br /&gt;
** adaptive learning rates, RPROP, ...&lt;br /&gt;
&lt;br /&gt;
* For big, redundant datasets, use mini-batches.&lt;br /&gt;
** Try gradient descent with momentum.&lt;br /&gt;
** Try RMSPROP (with momentum?)&lt;br /&gt;
** Try LeCun&amp;#039;s latest recipe.&lt;br /&gt;
&lt;br /&gt;
* Why there is no simple recipe:&lt;br /&gt;
*# &amp;#039;&amp;#039;&amp;#039;Neural nets differ a lot&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
*#* Very deep nets (especially ones with narrow bottlenecks).&lt;br /&gt;
*#* Recurrent nets.&lt;br /&gt;
*#* Wide shallow nets.&lt;br /&gt;
*# &amp;#039;&amp;#039;&amp;#039;Tasks differ a lot&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
*#* Some require very accurate weights, some don&amp;#039;t.&lt;br /&gt;
*#* Some have many very rare cases (e.g. words).&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=475</id>
		<title>Neural Networks (Geoffrey Hinton Course)</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=475"/>
		<updated>2017-01-15T04:05:54Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Some Simple Models of Neurons ==&lt;br /&gt;
&lt;br /&gt;
$y$ output, $x_i$ input.&lt;br /&gt;
&lt;br /&gt;
=== Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$y = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$w_i$ weights, $b$ bias&lt;br /&gt;
&lt;br /&gt;
=== Binary Threshold Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq \theta$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
Or, equivalently,&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq 0$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
=== Rectified Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = z$ if $z &amp;gt; 0$, $0$ otherwise. (Linear above zero; the only non-linearity is the decision at zero.)&lt;br /&gt;
&lt;br /&gt;
=== Sigmoid Neurons ===&lt;br /&gt;
&lt;br /&gt;
Give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
=== Stochastic Binary Neurons ===&lt;br /&gt;
&lt;br /&gt;
Same equations as logistic units, but outputs $1$ (=spike) or $0$ randomly based on the probability. They treat the output of the logistic as the &amp;lt;em&amp;gt;probability&amp;lt;/em&amp;gt; of producing a spike in a short time window.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$P(s = 1) = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.&lt;br /&gt;
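&lt;br /&gt;
These unit types are easy to state as small functions of the total input $z$ (a sketch; the course defines the math, not this code):&lt;br /&gt;
&lt;br /&gt;
```python
import math
import random

def binary_threshold(z):
    # Output 1 when the total input reaches the (zero) threshold.
    return 1 if z >= 0 else 0

def rectified_linear(z):
    # Linear above zero, zero below.
    return max(z, 0.0)

def logistic(z):
    # Smooth, bounded output in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def stochastic_binary(z, rng=random):
    # Treat the logistic output as the probability of a spike.
    return 1 if logistic(z) > rng.random() else 0
```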
&lt;br /&gt;
== Types of Learning ==&lt;br /&gt;
&lt;br /&gt;
=== Supervised Learning ===&lt;br /&gt;
Learn to predict an output when given an input vector.&lt;br /&gt;
* Regression: The target output is a real number or a whole vector of real numbers.&lt;br /&gt;
* Classification: The target output is a class label.&lt;br /&gt;
&lt;br /&gt;
==== How Supervised Learning Typically Works ====&lt;br /&gt;
# Start by choosing a model-class: $y = f(x;W)$&lt;br /&gt;
#* A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.&lt;br /&gt;
# Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.&lt;br /&gt;
#* For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.&lt;br /&gt;
#* For classification there are other measures that are generally more sensible (they also work better).&lt;br /&gt;
&lt;br /&gt;
=== Reinforcement Learning ===&lt;br /&gt;
Learn to select an action to maximize payoff.&lt;br /&gt;
&lt;br /&gt;
* The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.&lt;br /&gt;
** The goal in selecting each action is to maximize the expected sum of the future rewards.&lt;br /&gt;
** We usually use a discount factor for delayed rewards so that we don&amp;#039;t have to look too far into the future.&lt;br /&gt;
* Reinforcement learning is difficult because:&lt;br /&gt;
** The rewards are typically delayed so it&amp;#039;s hard to know where we went wrong (or right).&lt;br /&gt;
** A scalar reward does not supply much information.&lt;br /&gt;
* Typically you can&amp;#039;t learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning). Typically you learn dozens or thousands of parameters.&lt;br /&gt;
* Will not be covered in this course.&lt;br /&gt;
&lt;br /&gt;
=== Unsupervised Learning ===&lt;br /&gt;
Discover a good internal representation of the input.&lt;br /&gt;
&lt;br /&gt;
* For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).&lt;br /&gt;
* It is hard to say what the aim of unsupervised learning is:&lt;br /&gt;
** One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.&lt;br /&gt;
** You can compute the distance to a surface by using the disparity between two images. But you don&amp;#039;t want to learn to compute disparities by stubbing your toe thousands of times.&lt;br /&gt;
* Other goals:&lt;br /&gt;
** Providing a compact, low-dimensional representation of the input.&lt;br /&gt;
*** High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds)&lt;br /&gt;
*** Principal Component Analysis is a widely used linear method for finding a low-dimensional representation.&lt;br /&gt;
** Providing an economical high-dimensional representation of the input in terms of learned features.&lt;br /&gt;
*** Binary features&lt;br /&gt;
*** Real-valued features that are nearly all zero&lt;br /&gt;
** Finding sensible clusters in the input&lt;br /&gt;
*** This is an example of a &amp;lt;em&amp;gt;very&amp;lt;/em&amp;gt; sparse code in which only one of the features is non-zero.&lt;br /&gt;
&lt;br /&gt;
== Neural Network Architectures ==&lt;br /&gt;
&lt;br /&gt;
# Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.&lt;br /&gt;
# Recurrent neural network: information can flow around in cycles.&lt;br /&gt;
# Symmetrically connected network: weights are the same in both directions between two units.&lt;br /&gt;
&lt;br /&gt;
===  Feed-forward Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* The commonest type of neural network&lt;br /&gt;
* The first layer is the input and the last layer is the output&lt;br /&gt;
* Called &amp;quot;deep&amp;quot; neural networks if there is more than one hidden layer.&lt;br /&gt;
* They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.&lt;br /&gt;
&lt;br /&gt;
=== Recurrent Networks ===&lt;br /&gt;
&lt;br /&gt;
* Have directed cycles in their connection graph.&lt;br /&gt;
* Have complicated dynamics — difficult to train.&lt;br /&gt;
* More biologically realistic.&lt;br /&gt;
* Very natural way to model &amp;lt;em&amp;gt;sequential&amp;lt;/em&amp;gt; data&lt;br /&gt;
* Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).&lt;br /&gt;
&lt;br /&gt;
=== Symmetrically Connected Networks ===&lt;br /&gt;
&lt;br /&gt;
* Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).&lt;br /&gt;
* Much easier to analyze than recurrent networks (John Hopfield, et al)&lt;br /&gt;
* More restricted in what they can do, because they obey an energy function. For example, they cannot model cycles.&lt;br /&gt;
&lt;br /&gt;
== The First Generation of Neural Networks ==&lt;br /&gt;
&lt;br /&gt;
=== Standard Paradigm for Statistical Pattern Recognition ===&lt;br /&gt;
&lt;br /&gt;
* Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).&lt;br /&gt;
* Learn how to weight each of the feature activations to get a single scalar quantity.&lt;br /&gt;
* If this quantity is above some threshold, decide that the input vector is a positive example of the target class.&lt;br /&gt;
&lt;br /&gt;
=== Perceptrons ===&lt;br /&gt;
&lt;br /&gt;
input units =&amp;gt; (by hand-coded programs) =&amp;gt; feature units =&amp;gt; decision unit&lt;br /&gt;
&lt;br /&gt;
* Popularized by Frank Rosenblatt in the early 1960s.&lt;br /&gt;
* Lots of grand claims were made for what they could learn to do.&lt;br /&gt;
* In 1969, Minsky and Papert showed Perceptrons&amp;#039; limitations (&amp;#039;&amp;#039;&amp;#039;Group Invariance Theorem&amp;#039;&amp;#039;&amp;#039;). Many people thought these limitations applied to all neural network models.&lt;br /&gt;
* Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.&lt;br /&gt;
* Decision units in perceptrons are &amp;#039;&amp;#039;&amp;#039;binary threshold neurons&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Perceptron Learning Procedure ====&lt;br /&gt;
&lt;br /&gt;
# Add an extra component with value 1 to each input vector (&amp;quot;bias&amp;quot; weight) and forget about the threshold. (bias) = -(threshold)&lt;br /&gt;
# Pick training cases using any policy that ensures that every training case will keep getting picked.&lt;br /&gt;
#* If the output unit is correct, leave its weights alone.&lt;br /&gt;
#* If the output unit incorrectly outputs a zero, add the input vector to the weight vector.&lt;br /&gt;
#* If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.&lt;br /&gt;
# This is guaranteed to find a set of weights that gets the right answer for all the training cases &amp;#039;&amp;#039;if any such set exists&amp;#039;&amp;#039;.&lt;br /&gt;
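&lt;br /&gt;
The procedure fits in a few lines (a sketch; the case format and pass count are made up for illustration):&lt;br /&gt;
&lt;br /&gt;
```python
def train_perceptron(cases, passes=10):
    # Each case is (input_vector, target). An extra component of 1 on
    # every input replaces the threshold with a learned bias weight.
    dim = len(cases[0][0]) + 1
    w = [0.0] * dim
    for _ in range(passes):
        for x, t in cases:
            xb = list(x) + [1.0]
            y = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0
            if y == 0 and t == 1:
                # Wrongly output a zero: add the input vector.
                w = [wi + xi for wi, xi in zip(w, xb)]
            elif y == 1 and t == 0:
                # Wrongly output a one: subtract the input vector.
                w = [wi - xi for wi, xi in zip(w, xb)]
    return w
```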
&lt;br /&gt;
==== Geometrical View of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* Weight space (each point corresponds to a particular setting for all the weights)&lt;br /&gt;
* Each training case represents a hyperplane through the origin (if we eliminate the threshold/bias). That hyperplane is perpendicular to the &amp;#039;&amp;#039;input&amp;#039;&amp;#039; vector.&lt;br /&gt;
** The weights must lie on one side of this hyperplane to get the answer correct.&lt;br /&gt;
* To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).&lt;br /&gt;
* If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. This means the problem is &amp;#039;&amp;#039;&amp;#039;convex&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Limitations of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.&lt;br /&gt;
* But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.&lt;br /&gt;
** A binary threshold output unit cannot even tell if two single bit features are the same!&lt;br /&gt;
*** Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$&lt;br /&gt;
*** Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$&lt;br /&gt;
*** The four input-output pairs give four inequalities that are impossible to satisfy.&lt;br /&gt;
** Imagine &amp;#039;data-space&amp;#039; in which the axes correspond to components of an input vector.&lt;br /&gt;
*** Each input vector is a point in this space, a weight vector defines a plane in data-space.&lt;br /&gt;
*** The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.&lt;br /&gt;
*** There are many cases where we cannot separate positive and negative cases only by a hyperplane. Those cases are called &amp;#039;&amp;#039;&amp;#039;not linearly separable&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Cannot discriminate simple patterns under translation with wrap-around.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Conclusion ====&lt;br /&gt;
&lt;br /&gt;
* Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help, it&amp;#039;s still linear. Fixed output non-linearities are not enough.&lt;br /&gt;
* We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?&lt;br /&gt;
** We need an efficient way of adapting all the weights, not just the last layer. This is hard.&lt;br /&gt;
** Learning the weights going into hidden units is equivalent to learning features.&lt;br /&gt;
** This is difficult because nobody is telling us directly what the hidden units should do.&lt;br /&gt;
&lt;br /&gt;
== Linear Neurons ==&lt;br /&gt;
&lt;br /&gt;
=== Learning Overview ===&lt;br /&gt;
&lt;br /&gt;
$y = \sum_{i} w_i x_i = \mathbf{w}^T \mathbf{x}$&lt;br /&gt;
&lt;br /&gt;
$y$: Neuron&amp;#039;s estimate of the desired output, $\mathbf{w}$: weight vector, $\mathbf{x}$: input vector&lt;br /&gt;
&lt;br /&gt;
* Perceptron convergence procedure doesn&amp;#039;t work for more complex networks (because the problem is not &amp;#039;&amp;#039;convex&amp;#039;&amp;#039;). Multi-layer neural networks do not use the perceptron learning procedure.&lt;br /&gt;
* Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.&lt;br /&gt;
* The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.&lt;br /&gt;
&lt;br /&gt;
* Why don&amp;#039;t we solve it analytically?&lt;br /&gt;
** Scientific answer: We want a method that real neurons could use.&lt;br /&gt;
** Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.&lt;br /&gt;
&lt;br /&gt;
=== Learning Procedure ===&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;delta-rule&amp;#039;&amp;#039; for learning:&lt;br /&gt;
** $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value&lt;br /&gt;
** $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error&lt;br /&gt;
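&lt;br /&gt;
The delta-rule above can be sketched directly in code. A minimal illustration (the toy data, learning rate, and epoch count are made up for the example, not taken from the lecture):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def train_linear_neuron(X, t, epsilon=0.05, epochs=1000):
    # Online delta-rule: for each case, w_i += epsilon * x_i * (t - y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            y_n = w @ x_n                      # estimate y = w.x
            w += epsilon * x_n * (t_n - y_n)   # the delta-rule update
    return w

# Toy targets generated from true weights [2, -1]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
t = X @ np.array([2.0, -1.0])
w = train_linear_neuron(X, t)
```
Because the toy targets are noise-free and realizable, the learned weights converge to the generating weights.&lt;br /&gt;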
&lt;br /&gt;
=== The Error Surface for Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
* The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl.&lt;br /&gt;
** Vertical cross-sections are parabolas.&lt;br /&gt;
** Horizontal cross-sections are ellipses.&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated.&lt;br /&gt;
* The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.&lt;br /&gt;
* The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.&lt;br /&gt;
&lt;br /&gt;
== Logistic Neurons ==&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
* These give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
* The derivatives:&lt;br /&gt;
** $\frac{\partial z}{\partial w_i} = x_i$&lt;br /&gt;
** $\frac{\partial z}{\partial x_i} = w_i$&lt;br /&gt;
** $\frac{dy}{dz} = y(1-y)$&lt;br /&gt;
** By the chain rule, $\frac{\partial y}{\partial w_i} = x_i y (1-y)$.&lt;br /&gt;
** By the chain rule, $\frac{\partial E}{\partial w_i} = \sum_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum_{n} x_i^n y^n (1-y^n) (t^n - y^n)$.&lt;br /&gt;
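&lt;br /&gt;
These chain-rule derivatives are easy to sanity-check numerically. A small sketch (the weights, inputs, and bias are arbitrary values chosen for the check):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def logistic_output(w, x, b):
    z = b + x @ w                     # z = b + sum_i x_i w_i
    return 1.0 / (1.0 + np.exp(-z))   # y = 1 / (1 + e^{-z})

w = np.array([0.5, -0.3]); x = np.array([1.2, 0.7]); b = 0.1
y = logistic_output(w, x, b)
analytic = x * y * (1 - y)            # dy/dw_i = x_i * y * (1 - y)

eps = 1e-6                            # finite-difference comparison
numeric = np.array([
    (logistic_output(w + eps * np.eye(2)[i], x, b) - y) / eps
    for i in range(2)
])
```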
&lt;br /&gt;
== The Backpropagation Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== Other Ideas ===&lt;br /&gt;
&lt;br /&gt;
==== Learning by Perturbing Weights ====&lt;br /&gt;
&lt;br /&gt;
* The idea occurs to everyone who knows about evolution.&lt;br /&gt;
* Randomly perturb one weight and see if it improves performance. If so, save the change.&lt;br /&gt;
** This is a form of reinforcement learning.&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;Very inefficient&amp;#039;&amp;#039;&amp;#039;. We need to do multiple forward passes on a representative set of training cases just to change one weight.&lt;br /&gt;
** Towards the end of learning, large weight perturbations will nearly always make things worse.&lt;br /&gt;
&lt;br /&gt;
==== Learning by Using Perturbations ====&lt;br /&gt;
&lt;br /&gt;
* We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.&lt;br /&gt;
* This is no better, because we need lots of trials on each training case to see the effect of changing one weight through the noise created by all the changes to the other weights.&lt;br /&gt;
* A better idea: Randomly perturb the &amp;#039;&amp;#039;activities&amp;#039;&amp;#039; of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights.&lt;br /&gt;
&lt;br /&gt;
=== The Idea behind Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* We don&amp;#039;t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.&lt;br /&gt;
* We can compute error derivatives for all the hidden units efficiently at the same time.&lt;br /&gt;
** Once we have the error derivatives for the hidden activities, it&amp;#039;s easy to get the error derivatives for the weights going into a hidden unit.&lt;br /&gt;
&lt;br /&gt;
=== Sketch of the Backpropagation Algorithm ===&lt;br /&gt;
&lt;br /&gt;
* First convert the discrepancy between each output and its target value into an error derivative.&lt;br /&gt;
* Then compute error derivatives in each hidden layer from error derivatives in the layer above.&lt;br /&gt;
* Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.&lt;br /&gt;
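&lt;br /&gt;
The three steps of the sketch can be written out for a tiny net with one logistic hidden layer, a linear output, and squared error (all sizes and values below are arbitrary, chosen only to check the derivatives):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def forward(W1, W2, X):
    h = 1.0 / (1.0 + np.exp(-(X @ W1)))  # logistic hidden activities
    return h, h @ W2                      # linear output

def loss(W1, W2, X, t):
    _, y = forward(W1, W2, X)
    return 0.5 * np.sum((t - y) ** 2)

def backprop(W1, W2, X, t):
    h, y = forward(W1, W2, X)
    dE_dy = y - t                 # 1) discrepancy -> output error derivative
    dE_dh = dE_dy @ W2.T          # 2) hidden derivatives from the layer above
    dE_dz = dE_dh * h * (1 - h)   #    ...through the logistic non-linearity
    return X.T @ dE_dz, h.T @ dE_dy   # 3) derivatives w.r.t. incoming weights

rng = np.random.default_rng(0)
X, t = rng.normal(size=(8, 3)), rng.normal(size=(8, 1))
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))
gW1, gW2 = backprop(W1, W2, X, t)

# Finite-difference check of one weight derivative
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (loss(W1p, W2, X, t) - loss(W1, W2, X, t)) / eps
```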
&lt;br /&gt;
=== Using the Derivatives Computed by Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* The backpropagation algorithm is an efficient way of computing the error derivative $\partial E / \partial w$ for every weight on a single training case.&lt;br /&gt;
* To get a fully specified learning procedure, we still need to make a lot of other decisions about how to use these error derivatives:&lt;br /&gt;
** Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?&lt;br /&gt;
** Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?&lt;br /&gt;
&lt;br /&gt;
==== Optimization Issues ====&lt;br /&gt;
&lt;br /&gt;
* How often to update the weights:&lt;br /&gt;
** Online: after each training case.&lt;br /&gt;
** Full batch: after a full sweep through the training data.&lt;br /&gt;
** Mini-batch: after a small sample of training cases. This is typically used when training big nets on big data sets.&lt;br /&gt;
* How much to update: (will be discussed further later)&lt;br /&gt;
** Use a fixed learning rate?&lt;br /&gt;
** Adapt the global learning rate?&lt;br /&gt;
** Adapt the learning rate on each connection separately?&lt;br /&gt;
** Don&amp;#039;t use steepest descent?&lt;br /&gt;
&lt;br /&gt;
==== Generalization Issue (Overfitting) ====&lt;br /&gt;
&lt;br /&gt;
* The downside of using powerful models.&lt;br /&gt;
* The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise:&lt;br /&gt;
** The target values may be unreliable (usually only a minor worry).&lt;br /&gt;
** There is &amp;#039;&amp;#039;&amp;#039;sampling error&amp;#039;&amp;#039;&amp;#039;. There will be accidental regularities just because of the particular training cases that were chosen.&lt;br /&gt;
* When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well. &amp;#039;&amp;#039;This is a disaster&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Ways to Reduce Overfitting ====&lt;br /&gt;
&lt;br /&gt;
* A large number of different methods have been developed:&lt;br /&gt;
** Weight-decay: keep weights small / zero.&lt;br /&gt;
** Weight-sharing: insist that many weights have the same value.&lt;br /&gt;
** Early stopping: Make a fake test set. While training the net, stop as soon as performance on the fake test set starts getting worse.&lt;br /&gt;
** Model averaging: Train many different neural nets and average them.&lt;br /&gt;
** Bayesian fitting of neural nets: really just a fancy form of model averaging.&lt;br /&gt;
** Dropout: Randomly omit a hidden unit.&lt;br /&gt;
** Generative pre-training: (Will be discussed towards the end of the course.)&lt;br /&gt;
&lt;br /&gt;
== Learning to Predict the Next Word ==&lt;br /&gt;
&lt;br /&gt;
=== Diversion into Cognitive Science ===&lt;br /&gt;
&lt;br /&gt;
There has been a long debate in cognitive science between two rival theories of what it means to have a &amp;#039;&amp;#039;concept&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
* The feature theory: A concept is a set of semantic features&lt;br /&gt;
** Good for explaining similarities between concepts.&lt;br /&gt;
** It&amp;#039;s convenient -- a concept is a vector of feature activities.&lt;br /&gt;
* The structuralist theory: The meaning of a concept lies in its relationships to other concepts.&lt;br /&gt;
** So conceptual knowledge is best expressed as a relational graph.&lt;br /&gt;
** Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.&lt;br /&gt;
&lt;br /&gt;
Both sides are wrong.&lt;br /&gt;
&lt;br /&gt;
* These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.&lt;br /&gt;
* We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just &amp;quot;seeing&amp;quot; the answer with no conscious intervening steps.&lt;br /&gt;
&lt;br /&gt;
==== Localist and Distributed Representations of Concepts ====&lt;br /&gt;
&lt;br /&gt;
The obvious way to implement a relational graph in a neural net is to treat a neuron as a node in the graph and a connection as a binary relationship. But this &amp;quot;localist&amp;quot; method will not work:&lt;br /&gt;
&lt;br /&gt;
* We need many different types of relationship and the connections in a neural net do not have discrete labels.&lt;br /&gt;
* We need ternary relationships as well as binary ones, e.g. A is between B and C.&lt;br /&gt;
&lt;br /&gt;
The right way to implement relational knowledge in a neural net is still an open issue.&lt;br /&gt;
&lt;br /&gt;
* But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a &amp;quot;distributed representation&amp;quot;. It&amp;#039;s a many-to-many mapping between concepts and neurons.&lt;br /&gt;
&lt;br /&gt;
=== The Softmax Output Function (Another Diversion) ===&lt;br /&gt;
&lt;br /&gt;
The squared error measure has some drawbacks. Is there a different cost function that works better?&lt;br /&gt;
&lt;br /&gt;
Yes, force the outputs to represent a probability distribution across discrete alternatives.&lt;br /&gt;
&lt;br /&gt;
==== Softmax ====&lt;br /&gt;
&lt;br /&gt;
The output units in a softmax group use a non-local non-linearity. Let $y_i$ be the output of a neuron in a softmax group, and $z_i$ (called the &amp;#039;&amp;#039;logit&amp;#039;&amp;#039;) the input of the same neuron:&lt;br /&gt;
&lt;br /&gt;
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j} }$$&lt;br /&gt;
&lt;br /&gt;
Note that $\sum y_i = 1$, so we force the $y_i$ to represent a probability distribution. We can show:&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$&lt;br /&gt;
&lt;br /&gt;
The right cost function to use with softmax is the negative log probability of the right answer. This is called the &amp;#039;&amp;#039;Cross Entropy&amp;#039;&amp;#039; cost function. ($t_j$ is the target value.)&lt;br /&gt;
&lt;br /&gt;
$$C = - \sum_{j} t_j \log y_j$$&lt;br /&gt;
&lt;br /&gt;
$C$ has a very big gradient when the target value is 1 and the output is almost zero.&lt;br /&gt;
* A value of 0.000001 is much better than 0.000000001 (one-in-a-million chance vs one-in-a-billion chance).&lt;br /&gt;
* The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$.&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial C}{\partial z_i} = \sum_{j} \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$&lt;br /&gt;
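&lt;br /&gt;
The softmax and cross-entropy identities above are easy to verify numerically. A small sketch with an arbitrary logit vector and a one-hot target:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift logits for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])          # logits
t = np.array([0.0, 1.0, 0.0])          # one-hot target
y = softmax(z)
C = -np.sum(t * np.log(y))             # cross-entropy cost

grad = y - t                           # analytic dC/dz_i = y_i - t_i
eps = 1e-6                             # finite-difference comparison
numeric = np.array([
    (-np.sum(t * np.log(softmax(z + eps * np.eye(3)[i]))) - C) / eps
    for i in range(3)
])
```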
&lt;br /&gt;
&lt;br /&gt;
=== Neuro-Probabilistic Language Models ===&lt;br /&gt;
&lt;br /&gt;
* We cannot identify phonemes perfectly in noisy speech.&lt;br /&gt;
* People use their understanding of the meaning of the utterance to hear the right words.&lt;br /&gt;
** &amp;quot;We do this unconsciously when we wreck a nice beach&amp;quot;&lt;br /&gt;
** We are very good at it.&lt;br /&gt;
* This means speech recognizers have to know which words are likely to come next and which are not.&lt;br /&gt;
** Fortunately words can be predicted quite well without full understanding.&lt;br /&gt;
&lt;br /&gt;
==== The Standard &amp;quot;Trigram&amp;quot; Method ====&lt;br /&gt;
&lt;br /&gt;
* Take a huge amount of text and count the frequencies of all triples of words.&lt;br /&gt;
&lt;br /&gt;
* Use these frequencies to make bets on the relative probabilities of words given the previous two words.&lt;br /&gt;
&lt;br /&gt;
* Until very recently this was the state-of-the-art.&lt;br /&gt;
** We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.&lt;br /&gt;
** We have to &amp;quot;back-off&amp;quot; to bigrams when the count for a trigram is too small. The probability is not zero just because the count is zero!&lt;br /&gt;
&lt;br /&gt;
* A trigram model doesn&amp;#039;t understand the similarities between words.&lt;br /&gt;
** The sentence &amp;quot;The cat got squashed in the garden on friday&amp;quot; should help us predict words in the sentence &amp;quot;The dog got flattened in the yard on monday&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.&lt;br /&gt;
** Using a feature representation also allows a context that contains many more previous words, e.g. 10.&lt;br /&gt;
&lt;br /&gt;
* Yoshua Bengio&amp;#039;s neural net for predicting the next word&lt;br /&gt;
&lt;br /&gt;
== Object Recognition ==&lt;br /&gt;
&lt;br /&gt;
=== Why Object Recognition is Difficult ===&lt;br /&gt;
&lt;br /&gt;
* Segmentation: Real scenes are cluttered with other objects.&lt;br /&gt;
* Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.&lt;br /&gt;
* Deformation: Objects can deform in a variety of non-affine ways.&lt;br /&gt;
* Affordances: Object classes are things defined by how they are used.&lt;br /&gt;
* Viewpoint: Changes in viewpoint cause changes in images that standard learning methods cannot cope with.&lt;br /&gt;
&lt;br /&gt;
=== Achieving Viewpoint Invariance ===&lt;br /&gt;
&lt;br /&gt;
* Viewpoint invariance is one of the main difficulties in making computers perceive.&lt;br /&gt;
* Several different approaches:&lt;br /&gt;
** Use redundant invariant features.&lt;br /&gt;
** Put a box around the object and use normalized pixels.&lt;br /&gt;
** Use replicated features with pooling. This is called &amp;#039;&amp;#039;convolutional neural nets&amp;#039;&amp;#039; (in later lectures).&lt;br /&gt;
** Use a hierarchy of parts that have explicit poses relative to the camera (in later lectures).&lt;br /&gt;
&lt;br /&gt;
==== Invariant Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* Extract a large, redundant set of features that are invariant under transformations.&lt;br /&gt;
** e.g. pair of roughly parallel lines with a red dot between them (this is what baby herring gulls use to know where to peck for food).&lt;br /&gt;
* With enough invariant features, there is only one way to assemble them into an object.&lt;br /&gt;
* But for recognition, we must avoid forming features from parts of different objects.&lt;br /&gt;
&lt;br /&gt;
==== Judicious Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* Put a box around the object and use it as a coordinate frame for a set of normalized pixels.&lt;br /&gt;
** This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.&lt;br /&gt;
** The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, ...&lt;br /&gt;
&lt;br /&gt;
* But choosing the box is difficult because of:&lt;br /&gt;
** Segmentation errors, occlusion, unusual orientations.&lt;br /&gt;
&lt;br /&gt;
* We need to recognize the shape to get the box right!&lt;br /&gt;
&lt;br /&gt;
==== Brute Force Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* When training the recognizer, use well-segmented, upright images to fit the correct box.&lt;br /&gt;
&lt;br /&gt;
* At test time try all possible boxes in a range of positions and scales.&lt;br /&gt;
&lt;br /&gt;
** This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.&lt;br /&gt;
** It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Digit Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== The Replicated Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* This is currently the dominant approach for neural networks.&lt;br /&gt;
&lt;br /&gt;
* Use many different copies of the same feature detector with different positions.&lt;br /&gt;
&lt;br /&gt;
** Could also replicate across scale and orientation but it is tricky and expensive.&lt;br /&gt;
** Replication greatly reduces the number of free parameters to be learned.&lt;br /&gt;
&lt;br /&gt;
* Use several different feature types, each with its own map of replicated detectors.&lt;br /&gt;
&lt;br /&gt;
==== Backpropagation with Weight Constraints ====&lt;br /&gt;
&lt;br /&gt;
* It&amp;#039;s easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.&lt;br /&gt;
&lt;br /&gt;
* We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints - so if the weights started off satisfying the constraints, they will continue to satisfy them.&lt;br /&gt;
&lt;br /&gt;
To constrain $w_1 = w_2$, we need $\Delta w_1 = \Delta w_2$.&lt;br /&gt;
&lt;br /&gt;
Compute $\partial E / \partial w_1$ and $\partial E / \partial w_2$, then use $\partial E / \partial w_1 + \partial E / \partial w_2$ for both $w_1$ and $w_2$.&lt;br /&gt;
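&lt;br /&gt;
In code, the constraint is maintained by giving both tied weights the combined gradient and identical updates. A minimal sketch with a made-up quadratic error:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

w = np.array([0.3, 0.3, -0.5])   # w[0] and w[1] start out tied

def grad(w):
    # gradient of an illustrative quadratic error with minimum at [1, 2, 0]
    return 2.0 * (w - np.array([1.0, 2.0, 0.0]))

for _ in range(100):
    g = grad(w)
    g[0] = g[1] = g[0] + g[1]    # use dE/dw1 + dE/dw2 for both tied weights
    w -= 0.05 * g                # identical updates keep w[0] == w[1]
```
The tied pair settles at 1.5, the value minimizing the summed error of the two tied components, while staying exactly equal throughout.&lt;br /&gt;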
&lt;br /&gt;
==== What Does Replicating the Feature Detectors Achieve? ====&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Equivariant&amp;#039;&amp;#039;&amp;#039; activities: Replicated features do &amp;#039;&amp;#039;not&amp;#039;&amp;#039; make the neural activities invariant to translation. The activities are equivariant.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Invariant&amp;#039;&amp;#039;&amp;#039; knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.&lt;br /&gt;
&lt;br /&gt;
==== Pooling the Outputs of Replicated Feature Detectors ====&lt;br /&gt;
&lt;br /&gt;
* Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.&lt;br /&gt;
** This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.&lt;br /&gt;
** Taking the maximum of the four works slightly better.&lt;br /&gt;
&lt;br /&gt;
* Problem: After several levels of pooling, we have lost information about the precise positions of things.&lt;br /&gt;
** This makes it impossible to use the precise spatial relationships between high-level parts for recognition.&lt;br /&gt;
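&lt;br /&gt;
Both pooling variants described above, averaging four neighboring detector outputs or taking their maximum, amount to reducing each 2x2 block of a feature map to one number. An illustrative sketch on a made-up 4x4 map:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def pool2x2(a, mode="max"):
    # Reduce each non-overlapping 2x2 block of a 2D feature map to one output
    h, w = a.shape
    blocks = a.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

a = np.array([[1.0, 2.0, 0.0, 0.0],
              [3.0, 4.0, 0.0, 1.0],
              [0.0, 0.0, 5.0, 6.0],
              [0.0, 2.0, 7.0, 8.0]])
mx = pool2x2(a, "max")     # one output per 2x2 block
avg = pool2x2(a, "mean")
```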
&lt;br /&gt;
==== Le Net ====&lt;br /&gt;
&lt;br /&gt;
* Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:&lt;br /&gt;
** Many hidden layers.&lt;br /&gt;
** Many maps of replicated units in each layer.&lt;br /&gt;
** Pooling of the outputs of nearby replicated units.&lt;br /&gt;
** A wide net that can cope with several characters at once even if they overlap.&lt;br /&gt;
** A clever way of training a complete system, not just a recognizer.&lt;br /&gt;
&lt;br /&gt;
* This net was used for reading ~10% of the checks in North America.&lt;br /&gt;
&lt;br /&gt;
* Look at the impressive demos of LENET at http://yann.lecun.com&lt;br /&gt;
&lt;br /&gt;
==== Priors and Prejudice ====&lt;br /&gt;
&lt;br /&gt;
* We can put our prior knowledge about the task into the network by designing appropriate:&lt;br /&gt;
** Connectivity&lt;br /&gt;
** Weight constraints&lt;br /&gt;
** Neuron activation functions&lt;br /&gt;
&lt;br /&gt;
* This is less intrusive than hand-designing the features&lt;br /&gt;
** But it still prejudices the network towards the particular way of solving the problem that we had in mind.&lt;br /&gt;
&lt;br /&gt;
* Alternatively, we can use our prior knowledge to create a whole lot more training data.&lt;br /&gt;
** This may require a lot of work (Hofman &amp;amp; Tresp, 1993). Real data + synthetic data.&lt;br /&gt;
** It may make learning take much longer.&lt;br /&gt;
** It allows optimization to discover clever ways of using the multi-layer network that we did not think of - and we may never fully understand how it does it.&lt;br /&gt;
&lt;br /&gt;
==== The Brute Force Approach ====&lt;br /&gt;
&lt;br /&gt;
* LeNet uses knowledge about the invariances to design:&lt;br /&gt;
** the local connectivity&lt;br /&gt;
** the weight-sharing&lt;br /&gt;
** the pooling&lt;br /&gt;
&lt;br /&gt;
* This achieves about 80 errors.&lt;br /&gt;
** This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008).&lt;br /&gt;
&lt;br /&gt;
* Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:&lt;br /&gt;
** For each training image, they produce many new training examples by applying many different transformations.&lt;br /&gt;
** They can then train a large, deep, dumb net on a GPU without much overfitting.&lt;br /&gt;
** They achieved about 35 errors.&lt;br /&gt;
&lt;br /&gt;
* With model averaging they can now get about 25 errors (which is about the same as the human error rate).&lt;br /&gt;
&lt;br /&gt;
==== How to Detect a Significant Drop in the Error Rate ====&lt;br /&gt;
&lt;br /&gt;
* Is 30 errors in 10,000 test cases significantly better than 40 errors?&lt;br /&gt;
** It all depends on the particular errors.&lt;br /&gt;
** The McNemar test uses the particular errors and can be much more powerful than a test that just uses the number of errors.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Object Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== From Hand-Written Digits to 3D Objects ====&lt;br /&gt;
&lt;br /&gt;
* Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:&lt;br /&gt;
&lt;br /&gt;
** Hundred times as many classes (1000 vs 10)&lt;br /&gt;
** Hundred times as many pixels (256x256 color vs 28x28 gray)&lt;br /&gt;
** Two dimensional image of three-dimensional scene.&lt;br /&gt;
** Cluttered scenes requiring segmentation.&lt;br /&gt;
** Multiple objects in each image.&lt;br /&gt;
&lt;br /&gt;
* Will the same type of convolutional neural network work?&lt;br /&gt;
&lt;br /&gt;
==== The ILSVRC-2012 Competition on ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* The dataset has 1.2 million high-resolution training images.&lt;br /&gt;
&lt;br /&gt;
* The classification task:&lt;br /&gt;
** Get the &amp;quot;correct&amp;quot; class in your top 5 bets. There are 1000 classes.&lt;br /&gt;
&lt;br /&gt;
* The localization task:&lt;br /&gt;
** For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.&lt;br /&gt;
&lt;br /&gt;
* Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...&lt;br /&gt;
** Computer vision systems use complicated multi-stage systems.&lt;br /&gt;
** The early stages are typically hand-tuned by optimizing a few parameters.&lt;br /&gt;
&lt;br /&gt;
==== Error Rates on the ILSVRC-2012 Competition ====&lt;br /&gt;
&lt;br /&gt;
* University of Tokyo: 26.1% (classification)&lt;br /&gt;
* Oxford University Computer Vision Group: 26.9% (classification)&lt;br /&gt;
* INRIA(French national research institute in CS) + XRCE(Xerox Research Center Europe): 27.0% (classification)&lt;br /&gt;
* University of Amsterdam: 29.5% (classification)&lt;br /&gt;
* University of Toronto (&amp;#039;&amp;#039;&amp;#039;Alex Krizhevsky&amp;#039;&amp;#039;&amp;#039;): &amp;#039;&amp;#039;&amp;#039;16.4%&amp;#039;&amp;#039;&amp;#039; (classification)&lt;br /&gt;
&lt;br /&gt;
==== A Neural Network for ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its &amp;#039;&amp;#039;architecture&amp;#039;&amp;#039; was:&lt;br /&gt;
** 7 hidden layers not counting some max pooling layers.&lt;br /&gt;
** The early layers were convolutional.&lt;br /&gt;
** The last two layers were globally connected.&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;activation functions&amp;#039;&amp;#039; were:&lt;br /&gt;
** Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.&lt;br /&gt;
** Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.&lt;br /&gt;
&lt;br /&gt;
==== Tricks that Significantly Improve Generalization ====&lt;br /&gt;
&lt;br /&gt;
* Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.&lt;br /&gt;
** At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;dropout&amp;quot; to regularize the weights in the globally connected layers (which contain most of the parameters).&lt;br /&gt;
** Dropout means that half of the hidden units in a layer are randomly removed for each training example.&lt;br /&gt;
** This stops hidden units from relying too much on other hidden units.&lt;br /&gt;
&lt;br /&gt;
==== The Hardware Required for Alex Krizhevsky&amp;#039;s Net ====&lt;br /&gt;
&lt;br /&gt;
* He uses a very efficient implementation of convolutional nets on two Nvidia GTX 580 Graphics Processing Units (over 1000 fast little cores)&lt;br /&gt;
** GPUs are very good for matrix-matrix multiplies.&lt;br /&gt;
** GPUs have very high bandwidth to memory.&lt;br /&gt;
** This allows him to train the network in a week.&lt;br /&gt;
** It also makes it quick to combine results from 10 patches at test time.&lt;br /&gt;
&lt;br /&gt;
* We can spread a network over many cores if we can communicate the states fast enough.&lt;br /&gt;
&lt;br /&gt;
* As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.&lt;br /&gt;
&lt;br /&gt;
==== Finding Roads in High-Resolution Images ====&lt;br /&gt;
&lt;br /&gt;
* Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images.&lt;br /&gt;
** It takes a large image patch and predicts a binary road label for the central 16x16 pixels.&lt;br /&gt;
** There is lots of labeled training data available for this task.&lt;br /&gt;
&lt;br /&gt;
* The task is hard for many reasons:&lt;br /&gt;
** Occlusion by buildings, trees, and cars.&lt;br /&gt;
** Shadows, lighting changes.&lt;br /&gt;
** Minor viewpoint changes.&lt;br /&gt;
&lt;br /&gt;
* The worst problems are incorrect labels:&lt;br /&gt;
** Badly registered maps&lt;br /&gt;
** Arbitrary decisions about what counts as a road (vs laneway etc.)&lt;br /&gt;
&lt;br /&gt;
== Mini-Batch Gradient Descent ==&lt;br /&gt;
&lt;br /&gt;
=== Overview ===&lt;br /&gt;
&lt;br /&gt;
* The Error Surface: Lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl where vertical cross-sections are parabolas and horizontal cross-sections are ellipses.&lt;br /&gt;
&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated. But locally, a piece of a quadratic bowl is usually a very good approximation.&lt;br /&gt;
&lt;br /&gt;
* Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.&lt;br /&gt;
** The gradient is big in the direction in which we only want to travel a small distance, small in the direction in which we want to travel a large distance.&lt;br /&gt;
&lt;br /&gt;
* If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges.&lt;br /&gt;
&lt;br /&gt;
* What we would like to achieve: move quickly in directions with small but consistent gradients, and move slowly in directions with big but inconsistent gradients.&lt;br /&gt;
&lt;br /&gt;
==== Stochastic Gradient Descent ====&lt;br /&gt;
&lt;br /&gt;
* If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.&lt;br /&gt;
** So, instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.&lt;br /&gt;
** The extreme version of this approach updates weights after each case. It&amp;#039;s called &amp;quot;online&amp;quot; learning.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches are usually better than online.&lt;br /&gt;
** Less computation is used updating the weights.&lt;br /&gt;
** Computing the gradient for many cases simultaneously uses matrix-matrix multiplies which are very efficient, especially on GPUs.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches need to be balanced for classes (one way to approximate this is to pick points randomly).&lt;br /&gt;
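&lt;br /&gt;
A minimal mini-batch loop for a linear neuron, with random shuffling as the crude approximation to class balance mentioned above (the data, batch size, and learning rate are made up for the sketch):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(1000, 3))       # a redundant, noise-free training set
t = X @ w_true

w = np.zeros(3)
batch, lr = 20, 0.1
for epoch in range(30):
    perm = rng.permutation(len(X))   # reshuffle cases each epoch
    for i in range(0, len(X), batch):
        idx = perm[i:i + batch]
        y = X[idx] @ w
        g = X[idx].T @ (y - t[idx]) / batch  # gradient on this mini-batch only
        w -= lr * g
```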
&lt;br /&gt;
==== Two types of learning algorithm ====&lt;br /&gt;
&lt;br /&gt;
* If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).&lt;br /&gt;
&lt;br /&gt;
* For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.&lt;br /&gt;
&lt;br /&gt;
==== Basic Mini-Batch Gradient Descent Algorithm ====&lt;br /&gt;
&lt;br /&gt;
* Guess an initial learning rate.&lt;br /&gt;
** If the error keeps getting worse or oscillates wildly, reduce the learning rate.&lt;br /&gt;
** If the error is falling fairly consistently but slowly, increase the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Write a simple program to automate this way of adjusting the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Toward the end of mini-batch learning, it nearly always helps to turn down the learning rate.&lt;br /&gt;
** This removes fluctuations in the final weights caused by the variations between mini-batches.&lt;br /&gt;
&lt;br /&gt;
* Turn down the learning rate when the error stops decreasing.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Note&amp;#039;&amp;#039;&amp;#039;: Compute the error in a separate validation set.&lt;br /&gt;
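&lt;br /&gt;
The simple adjustment program mentioned above might look like this. The multipliers 1.05 and 0.7 are invented for the sketch, and the error list stands in for measurements on a separate validation set:&lt;br /&gt;
&lt;br /&gt;
```python
def adjust_learning_rate(lr, errors, up=1.05, down=0.7):
    # Raise the rate while the error falls consistently; cut it when the
    # error gets worse (which also damps wild oscillation).
    if len(errors) >= 2:
        return lr * down if errors[-1] > errors[-2] else lr * up
    return lr

lr = 0.1
errors = [1.0, 0.8, 0.9, 0.7]   # hypothetical per-epoch validation errors
history = []
for i in range(1, len(errors) + 1):
    lr = adjust_learning_rate(lr, errors[:i])
    history.append(lr)
```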
&lt;br /&gt;
=== Tricks for Mini-Batch Gradient Descent ===&lt;br /&gt;
&lt;br /&gt;
==== Initializing The Weights ====&lt;br /&gt;
&lt;br /&gt;
* If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.&lt;br /&gt;
** So they can never learn to be different features.&lt;br /&gt;
** We break symmetry by initializing the weights to have small random values.&lt;br /&gt;
&lt;br /&gt;
* If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.&lt;br /&gt;
** We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to $1 / \sqrt{\text{fan-in}}$.&lt;br /&gt;
&lt;br /&gt;
* We can also scale the learning rate the same way.&lt;br /&gt;
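A minimal sketch of this initialization in numpy (the layer sizes are assumptions); scaling small random weights down by the square root of the fan-in keeps the total input to a unit roughly the same size regardless of how many incoming connections it has:&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

fan_in, fan_out = 400, 100            # assumed layer sizes
# Small random values break symmetry; dividing by sqrt(fan_in) makes the
# weight magnitudes shrink as the fan-in grows.
W = rng.normal(size=(fan_in, fan_out)) / np.sqrt(fan_in)

# The input to each unit is a sum of fan_in terms, so for unit-variance
# inputs its standard deviation stays near 1 instead of growing.
x = rng.normal(size=fan_in)
z = x @ W
```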
&lt;br /&gt;
==== Shifting the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, shifting the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.&lt;br /&gt;
&lt;br /&gt;
* The hyperbolic tangent (which is $2\times \text{logistic} - 1$) produces hidden activations that are roughly zero mean. &lt;br /&gt;
** In this respect it&amp;#039;s better than the logistic.&lt;br /&gt;
&lt;br /&gt;
==== Scaling the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, scaling the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.&lt;br /&gt;
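Taken together, the shifting and scaling steps are just standardization of each input component over the training set. A sketch with assumed synthetic data:&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))   # raw inputs (assumed)

mean = X.mean(axis=0)        # per-component mean over the training set
std = X.std(axis=0)
X_std = (X - mean) / std     # zero mean and unit variance per component
```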
&lt;br /&gt;
==== A More Thorough Method: Decorrelate the Input Components ====&lt;br /&gt;
&lt;br /&gt;
* For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.&lt;br /&gt;
&lt;br /&gt;
* There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.&lt;br /&gt;
** Drop the principal components with the smallest eigenvalues ― &amp;#039;&amp;#039;&amp;#039;This achieves some dimensionality reduction&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.&lt;br /&gt;
&lt;br /&gt;
* For a circular error surface, the gradient points straight towards the minimum.&lt;br /&gt;
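The PCA recipe above can be sketched directly: project onto the principal components, drop the tiny ones, and divide each remaining component by the square root of its eigenvalue. The correlated toy data here is an assumption:&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D inputs (assumed synthetic data).
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.normal(size=(500, 2)) @ A.T
X = X - X.mean(axis=0)

cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
keep = eigvals > 1e-6                     # drop the smallest components
P = eigvecs[:, keep]
# Project, then rescale each component by 1/sqrt(eigenvalue):
X_white = (X @ P) / np.sqrt(eigvals[keep])

cov_white = X_white.T @ X_white / len(X_white)
```

After this transformation the covariance of the inputs is the identity, i.e. the elliptical error surface of a linear neuron becomes circular.&lt;br /&gt;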
&lt;br /&gt;
&lt;br /&gt;
==== Common Problems That Occur in Multilayer Networks ====&lt;br /&gt;
&lt;br /&gt;
* If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.&lt;br /&gt;
** The error derivatives for the hidden units will all become tiny and the error will not decrease.&lt;br /&gt;
** This is usually a plateau, but people often mistake it for a local minimum.&lt;br /&gt;
&lt;br /&gt;
* In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.&lt;br /&gt;
** The network finds this strategy quickly and may take a long time to improve on it by making use of the input.&lt;br /&gt;
** This is another plateau that looks like a local minimum.&lt;br /&gt;
&lt;br /&gt;
==== Be Careful about Turning Down the Learning Rate ====&lt;br /&gt;
&lt;br /&gt;
* Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.&lt;br /&gt;
** So we get a quick win.&lt;br /&gt;
** But then we get slower learning.&lt;br /&gt;
&lt;br /&gt;
* Don&amp;#039;t turn down the learning rate too soon!&lt;br /&gt;
&lt;br /&gt;
==== Four Ways to Speed up Mini-Batch Learning ====&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;momentum&amp;quot;&lt;br /&gt;
** Instead of using the gradient to change the &amp;#039;&amp;#039;position&amp;#039;&amp;#039; of the weight &amp;quot;particle&amp;quot;, use it to change the &amp;#039;&amp;#039;velocity&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
* Use separate adaptive learning rates for each parameter.&lt;br /&gt;
** Slowly adjust the rate using the consistency of the gradient for that parameter.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;rmsprop&amp;#039;&amp;#039;&amp;#039;: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. &lt;br /&gt;
** This is the mini-batch version of just using the sign of the gradient.&lt;br /&gt;
&lt;br /&gt;
* Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)&lt;br /&gt;
** Adapt it to work for neural nets.&lt;br /&gt;
** Adapt it to work for mini-batches.&lt;br /&gt;
&lt;br /&gt;
=== The Momentum Method ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.&lt;br /&gt;
# The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.&lt;br /&gt;
# Its momentum makes it keep going in the previous direction.&lt;br /&gt;
&lt;br /&gt;
* It damps oscillations in directions of high curvature by combining gradients with opposite signs.&lt;br /&gt;
&lt;br /&gt;
* It builds up speed in directions with a gentle but consistent gradient.&lt;br /&gt;
&lt;br /&gt;
==== Equations ====&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(t) = \alpha \mathbb{v}(t-1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The effect of the gradient is to increment the previous velocity. The velocity also decays by $\alpha$ which is slightly less than 1.&lt;br /&gt;
&lt;br /&gt;
The weight change is equal to the current velocity.&lt;br /&gt;
&lt;br /&gt;
$$\Delta \mathbb{w}(t) = \mathbb{v} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \mathbb{v} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \Delta \mathbb{w} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The weight change can be expressed in terms of the previous weight change and the current gradient.&lt;br /&gt;
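The equations translate directly into an update rule. A sketch on an elongated quadratic bowl (the toy error surface and the particular $\alpha$, $\epsilon$ values are assumptions):&lt;br /&gt;

```python
import numpy as np

# Toy quadratic error E(w) = 0.5 * w @ H @ w, so the gradient is H @ w.
H = np.diag([10.0, 0.1])          # high curvature in one direction, gentle in the other
w = np.array([1.0, 1.0])
v = np.zeros(2)
alpha, eps = 0.9, 0.05            # momentum and learning rate

for _ in range(500):
    grad = H @ w
    v = alpha * v - eps * grad    # velocity decays by alpha; gradient increments it
    w = w + v                     # the weight change equals the current velocity

final_norm = np.linalg.norm(w)
```

Along the gentle direction the velocity builds up toward its terminal value $\epsilon / (1 - \alpha)$ times the gradient, while oscillations along the steep direction are damped.&lt;br /&gt;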
&lt;br /&gt;
==== Behavior ====&lt;br /&gt;
&lt;br /&gt;
* If the error surface is a tilted plane, the ball reaches a terminal velocity $\mathbb{v}(\infty)$.&lt;br /&gt;
** If the momentum is close to 1, this is much faster than simple gradient descent.&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(\infty) = \frac{1}{1 - \alpha} \Big( - \epsilon \frac{\partial E}{\partial \mathbb{w}} \Big)$$&lt;br /&gt;
&lt;br /&gt;
* At the beginning of learning there may be very large gradients.&lt;br /&gt;
** So it pays to use a small momentum (e.g. 0.5).&lt;br /&gt;
** Once the large gradients have disappeared and the weights are stuck in a ravine, the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99).&lt;br /&gt;
&lt;br /&gt;
* This allows us to learn at a rate that would cause divergent oscillations without the momentum.&lt;br /&gt;
&lt;br /&gt;
==== Better Type of Momentum (Nesterov 1983) ====&lt;br /&gt;
&lt;br /&gt;
* The standard momentum method &amp;#039;&amp;#039;first&amp;#039;&amp;#039; computes the gradient at the current location and &amp;#039;&amp;#039;then&amp;#039;&amp;#039; takes a big jump in the direction of the updated accumulated gradient.&lt;br /&gt;
* Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.&lt;br /&gt;
** Inspired by the Nesterov method for optimizing convex functions.&lt;br /&gt;
* &amp;#039;&amp;#039;First&amp;#039;&amp;#039; make a big jump in the direction of the previous accumulated gradient.&lt;br /&gt;
* &amp;#039;&amp;#039;Then&amp;#039;&amp;#039; measure the gradient where you end up and make a correction.&lt;br /&gt;
** It&amp;#039;s better to correct a mistake &amp;#039;&amp;#039;after&amp;#039;&amp;#039; you have made it!&lt;br /&gt;
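The only difference from standard momentum is where the gradient is measured. A sketch on the same kind of toy quadratic bowl (the surface and hyperparameters are assumptions):&lt;br /&gt;

```python
import numpy as np

H = np.diag([10.0, 0.1])              # toy quadratic error; gradient = H @ w
grad = lambda w: H @ w

w = np.array([1.0, 1.0])
v = np.zeros(2)
alpha, eps = 0.9, 0.05

for _ in range(500):
    # First make the big jump in the direction of the accumulated gradient,
    w_lookahead = w + alpha * v
    # then measure the gradient where you end up and make a correction.
    v = alpha * v - eps * grad(w_lookahead)
    w = w + v

final_norm = np.linalg.norm(w)
```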
&lt;br /&gt;
=== Adaptive Learning Rates for Each Connection ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
* In a multilayer net, the appropriate learning rates can vary widely between weights:&lt;br /&gt;
** The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.&lt;br /&gt;
** The fan-in of a unit determines the size of the &amp;quot;overshoot&amp;quot; effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.&lt;br /&gt;
&lt;br /&gt;
* So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.&lt;br /&gt;
&lt;br /&gt;
==== One Way to Determine the Individual Learning Rates ====&lt;br /&gt;
&lt;br /&gt;
$$\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$$&lt;br /&gt;
&lt;br /&gt;
* Start with a local gain of 1 for every weight, i.e. $g_{ij}(0) = 1$.&lt;br /&gt;
* Increase the local gain if the gradient for that weight does not change sign.&lt;br /&gt;
* Use small additive increases and multiplicative decreases.&lt;br /&gt;
** This ensures that big gains decay rapidly when oscillations start.&lt;br /&gt;
** If the gradient is totally random, the gain will hover around 1 when we increase by &amp;#039;&amp;#039;plus&amp;#039;&amp;#039; $\delta$ half the time and decrease by &amp;#039;&amp;#039;times&amp;#039;&amp;#039; $1 - \delta$ half the time.&lt;br /&gt;
&lt;br /&gt;
$$\text{if } \Big( \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) \Big) &amp;gt; 0$$&lt;br /&gt;
$$\text{then } g_{ij}(t) = g_{ij}(t - 1) + 0.05$$&lt;br /&gt;
$$\text{else } g_{ij}(t) = g_{ij}(t - 1) * 0.95$$&lt;br /&gt;
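The gain-update rule can be sketched directly. The stand-in random gradients and the number of weights are assumptions; with totally random gradients the gains should hover around 1, as noted above:&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
g = np.ones(n)                 # local gains all start at 1
prev_grad = np.zeros(n)

for _ in range(100):
    grad = rng.normal(size=n)  # stand-in gradients; totally random here
    agree = grad * prev_grad > 0
    # Additive increase when the sign is consistent, multiplicative decrease otherwise:
    g = np.where(agree, g + 0.05, g * 0.95)
    g = np.clip(g, 0.1, 10.0)  # keep gains in a reasonable range
    prev_grad = grad

mean_gain = g.mean()
```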
&lt;br /&gt;
==== Tricks for Making Adaptive Learning Rates Work Better ====&lt;br /&gt;
&lt;br /&gt;
* Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].&lt;br /&gt;
&lt;br /&gt;
* Use full batch learning or very big mini-batches.&lt;br /&gt;
** This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates can be combined with momentum.&lt;br /&gt;
** Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates only deal with axis-aligned effects.&lt;br /&gt;
** Momentum does not care about the alignment of the axes.&lt;br /&gt;
&lt;br /&gt;
=== RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude ===&lt;br /&gt;
&lt;br /&gt;
==== RPROP: Using only the sign of the gradient ====&lt;br /&gt;
&lt;br /&gt;
* The magnitude of the gradient can be very different for different weights and can change during learning.&lt;br /&gt;
** This makes it hard to choose a single global learning rate.&lt;br /&gt;
&lt;br /&gt;
* For &amp;#039;&amp;#039;full batch learning&amp;#039;&amp;#039;, we can deal with this variation by only using the sign of the gradient.&lt;br /&gt;
** The weight updates are all of the same magnitude.&lt;br /&gt;
** This escapes from plateaus with tiny gradients quickly.&lt;br /&gt;
&lt;br /&gt;
* RPROP: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.&lt;br /&gt;
** Increase the step size for a weight &amp;#039;&amp;#039;multiplicatively&amp;#039;&amp;#039; (e.g. times 1.2) if the signs of its last two gradients agree.&lt;br /&gt;
** Otherwise decrease the step size multiplicatively (e.g. times 0.5).&lt;br /&gt;
** Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster&amp;#039;s advice).&lt;br /&gt;
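A sketch of this sign-based, per-weight step adaptation on a toy full-batch problem (the quadratic error and initial step size are assumptions; this is the simple variant without weight backtracking):&lt;br /&gt;

```python
import numpy as np

def rprop_step(step, grad, prev_grad):
    """One RPROP update of the per-weight step sizes and the weight change."""
    agree = grad * prev_grad > 0
    step = np.where(agree, step * 1.2, step * 0.5)  # grow on agreement, else shrink
    step = np.clip(step, 1e-6, 50.0)                # limit the step sizes
    delta_w = -np.sign(grad) * step                 # only the sign of the gradient is used
    return step, delta_w

# Full-batch descent on E(w) = 0.5 * w**2 per weight (assumed), gradient = w.
w = np.array([5.0, -3.0])
step = np.full(2, 0.1)
prev_grad = np.zeros(2)
for _ in range(100):
    grad = w
    step, delta_w = rprop_step(step, grad, prev_grad)
    w = w + delta_w
    prev_grad = grad

final_norm = np.linalg.norm(w)
```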
&lt;br /&gt;
==== Why RPROP Does Not Work with Mini-Batches ====&lt;br /&gt;
&lt;br /&gt;
* The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.&lt;br /&gt;
** Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.&lt;br /&gt;
** We want this weight to stay roughly where it is.&lt;br /&gt;
&lt;br /&gt;
* RPROP would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale), so the weight would grow a lot.&lt;br /&gt;
* Is there a way to combine:&lt;br /&gt;
** The robustness of RPROP.&lt;br /&gt;
** The efficiency of mini-batches.&lt;br /&gt;
** The effective averaging of gradients over mini-batches.&lt;br /&gt;
&lt;br /&gt;
==== RMSPROP: Mini-Batch Version of RPROP ====&lt;br /&gt;
&lt;br /&gt;
* RPROP is equivalent to using the gradient but also dividing by the size of the gradient.&lt;br /&gt;
** The problem with mini-batch RPROP is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?&lt;br /&gt;
&lt;br /&gt;
* RMSPROP: Keep a moving average of the squared gradient for each weight.&lt;br /&gt;
$$MeanSquare(w,t) = 0.9 \cdot MeanSquare(w, t-1) + 0.1 \Big( \frac{\partial E}{\partial w}(t) \Big) ^2$$&lt;br /&gt;
&lt;br /&gt;
* Dividing the gradient by $\sqrt{MeanSquare(w,t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).&lt;br /&gt;
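The update can be sketched per weight (the toy error surface, learning rate, and the small constant added for numerical safety are assumptions):&lt;br /&gt;

```python
import numpy as np

# RMSPROP on E(w) = 0.5 * w**2 per weight (assumed toy problem), gradient = w.
w = np.array([5.0, -3.0])
mean_square = np.zeros(2)
lr = 0.1

for _ in range(200):
    grad = w
    # Moving average of the squared gradient for each weight:
    mean_square = 0.9 * mean_square + 0.1 * grad ** 2
    # Divide the gradient by the root of that average:
    w = w - lr * grad / (np.sqrt(mean_square) + 1e-8)

final_norm = np.linalg.norm(w)
```

Because the gradient is divided by its own recent magnitude, the effective step is roughly the learning rate times the sign of the gradient, but the divisor changes only slowly between mini-batches.&lt;br /&gt;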
&lt;br /&gt;
==== Further Developments of RMSPROP ====&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with standard momentum&lt;br /&gt;
** Momentum does not help as much as it normally does. Needs more investigation.&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with Nesterov momentum (Sutskever 2012)&lt;br /&gt;
** It works best if the RMS of the recent gradients is used to divide the correction rather than the jump in the direction of accumulated corrections.&lt;br /&gt;
&lt;br /&gt;
* Combining RMSPROP with adaptive learning rates for each connection&lt;br /&gt;
** Needs more investigation.&lt;br /&gt;
&lt;br /&gt;
* Other methods related to RMSPROP&lt;br /&gt;
** Yann LeCun&amp;#039;s group has a fancy version in &amp;quot;No more pesky learning rates&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Summary of Learning Methods for Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.&lt;br /&gt;
** Conjugate gradient, LBFGS, ...&lt;br /&gt;
** adaptive learning rates, RPROP, ...&lt;br /&gt;
&lt;br /&gt;
* For big, redundant datasets, use mini-batches.&lt;br /&gt;
** Try gradient descent with momentum.&lt;br /&gt;
** Try RMSPROP (with momentum?)&lt;br /&gt;
** Try LeCun&amp;#039;s latest recipe.&lt;br /&gt;
&lt;br /&gt;
* Why there is no simple recipe: &amp;#039;&amp;#039;&amp;#039;Neural nets differ a lot&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Very deep nets (especially ones with narrow bottlenecks).&lt;br /&gt;
** Recurrent nets.&lt;br /&gt;
** Wide shallow nets.&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=474</id>
		<title>Neural Networks (Geoffrey Hinton Course)</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=474"/>
		<updated>2017-01-15T03:59:47Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Some Simple Models of Neurons ==&lt;br /&gt;
&lt;br /&gt;
$y$ output, $x_i$ input.&lt;br /&gt;
&lt;br /&gt;
=== Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$y = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$w_i$ weights, $b$ bias&lt;br /&gt;
&lt;br /&gt;
=== Binary Threshold Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq \theta$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
Or, equivalently,&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq 0$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
=== Rectified Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = z$ if $z &amp;gt; 0$, $0$ otherwise. (linear above zero, decision at zero.)&lt;br /&gt;
&lt;br /&gt;
=== Sigmoid Neurons ===&lt;br /&gt;
&lt;br /&gt;
Give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
=== Stochastic Binary Neurons ===&lt;br /&gt;
&lt;br /&gt;
Same equations as logistic units, but outputs $1$ (=spike) or $0$ randomly based on the probability. They treat the output of the logistic as the &amp;lt;em&amp;gt;probability&amp;lt;/em&amp;gt; of producing a spike in a short time window.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$P(s = 1) = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.&lt;br /&gt;
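The neuron models above differ only in how the total input $z$ is turned into an output. A minimal sketch (the particular weights and inputs are assumed toy values):&lt;br /&gt;

```python
import numpy as np

def total_input(x, w, b):
    # z = b + sum_i x_i w_i
    return b + np.dot(x, w)

def binary_threshold(z):
    return 1.0 if z >= 0 else 0.0

def rectified_linear(z):
    return max(z, 0.0)        # linear above zero, zero below

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -2.0])     # toy input
w = np.array([0.5, 0.25])     # toy weights
b = 1.0                       # toy bias
z = total_input(x, w, b)
```

A stochastic binary neuron would use `logistic(z)` not as the output itself but as the probability of producing a spike.&lt;br /&gt;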
&lt;br /&gt;
== Types of Learning ==&lt;br /&gt;
&lt;br /&gt;
=== Supervised Learning ===&lt;br /&gt;
Learn to predict an output when given an input vector.&lt;br /&gt;
* Regression: The target output is a real number or a whole vector of real numbers.&lt;br /&gt;
* Classification: The target output is a class label.&lt;br /&gt;
&lt;br /&gt;
==== How Supervised Learning Typically Works ====&lt;br /&gt;
# Start by choosing a model-class: $y = f(x;W)$&lt;br /&gt;
#* A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.&lt;br /&gt;
# Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.&lt;br /&gt;
#* For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.&lt;br /&gt;
#* For classification there are other measures that are generally more sensible (they also work better).&lt;br /&gt;
&lt;br /&gt;
=== Reinforcement Learning ===&lt;br /&gt;
Learn to select an action to maximize payoff.&lt;br /&gt;
&lt;br /&gt;
* The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.&lt;br /&gt;
** The goal in selecting each action is to maximize the expected sum of the future rewards.&lt;br /&gt;
** We usually use a discount factor for delayed rewards so that we don&amp;#039;t have to look too far into the future.&lt;br /&gt;
* Reinforcement learning is difficult because:&lt;br /&gt;
** The rewards are typically delayed so it&amp;#039;s hard to know where we went wrong (or right).&lt;br /&gt;
** A scalar reward does not supply much information.&lt;br /&gt;
* Typically you can&amp;#039;t learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning). Typically you learn dozens or thousands of parameters.&lt;br /&gt;
* Will not be covered in this course.&lt;br /&gt;
&lt;br /&gt;
=== Unsupervised Learning ===&lt;br /&gt;
Discover a good internal representation of the input.&lt;br /&gt;
&lt;br /&gt;
* For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).&lt;br /&gt;
* It is hard to say what the aim of unsupervised learning is:&lt;br /&gt;
** One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.&lt;br /&gt;
** You can compute the distance to a surface by using the disparity between two images. But you don&amp;#039;t want to learn to compute disparities by stubbing your toe thousands of times.&lt;br /&gt;
* Other goals:&lt;br /&gt;
** Providing a compact, low-dimensional representation of the input.&lt;br /&gt;
*** High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds)&lt;br /&gt;
*** Principal Component Analysis is a widely used linear method for finding a low-dimensional representation.&lt;br /&gt;
** Providing an economical high-dimensional representation of the input in terms of learned features.&lt;br /&gt;
*** Binary features&lt;br /&gt;
*** Real-valued features that are nearly all zero&lt;br /&gt;
** Finding sensible clusters in the input&lt;br /&gt;
*** This is an example of a &amp;lt;em&amp;gt;very&amp;lt;/em&amp;gt; sparse code in which only one of the features is non-zero.&lt;br /&gt;
&lt;br /&gt;
== Neural Network Architectures ==&lt;br /&gt;
&lt;br /&gt;
# Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.&lt;br /&gt;
# Recurrent neural network: information can flow around in cycles.&lt;br /&gt;
# Symmetrically connected network: weights are the same in both directions between two units.&lt;br /&gt;
&lt;br /&gt;
===  Feed-forward Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* The commonest type of neural network&lt;br /&gt;
* The first layer is the input and the last layer is the output&lt;br /&gt;
* Called &amp;quot;deep&amp;quot; neural networks if there is more than one hidden layer.&lt;br /&gt;
* They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.&lt;br /&gt;
&lt;br /&gt;
=== Recurrent Networks ===&lt;br /&gt;
&lt;br /&gt;
* Have directed cycles in their connection graph.&lt;br /&gt;
* Have complicated dynamics — difficult to train.&lt;br /&gt;
* More biologically realistic.&lt;br /&gt;
* Very natural way to model &amp;lt;em&amp;gt;sequential&amp;lt;/em&amp;gt; data&lt;br /&gt;
* Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).&lt;br /&gt;
&lt;br /&gt;
=== Symmetrically Connected Networks ===&lt;br /&gt;
&lt;br /&gt;
* Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).&lt;br /&gt;
* Much easier to analyze than recurrent networks (John Hopfield, et al)&lt;br /&gt;
* More restricted in what they can do, because they obey an energy function. For example, they cannot model cycles.&lt;br /&gt;
&lt;br /&gt;
== The First Generation of Neural Networks ==&lt;br /&gt;
&lt;br /&gt;
=== Standard Paradigm for Statistical Pattern Recognition ===&lt;br /&gt;
&lt;br /&gt;
* Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).&lt;br /&gt;
* Learn how to weight each of the feature activations to get a single scalar quantity.&lt;br /&gt;
* If this quantity is above some threshold, decide that the input vector is a positive example of the target class.&lt;br /&gt;
&lt;br /&gt;
=== Perceptrons ===&lt;br /&gt;
&lt;br /&gt;
input units =&amp;gt; (by hand-coded programs) =&amp;gt; feature units =&amp;gt; decision unit&lt;br /&gt;
&lt;br /&gt;
* Popularized by Frank Rosenblatt in the early 1960s.&lt;br /&gt;
* Lots of grand claims were made for what they could learn to do.&lt;br /&gt;
* In 1969, Minsky and Papert showed Perceptrons&amp;#039; limitations (&amp;#039;&amp;#039;&amp;#039;Group Invariance Theorem&amp;#039;&amp;#039;&amp;#039;). Many people thought these limitations applied to all neural network models.&lt;br /&gt;
* Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.&lt;br /&gt;
* Decision units in perceptrons are &amp;#039;&amp;#039;&amp;#039;binary threshold neurons&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Perceptron Learning Procedure ====&lt;br /&gt;
&lt;br /&gt;
# Add an extra component with value 1 to each input vector (&amp;quot;bias&amp;quot; weight) and forget about the threshold. (bias) = -(threshold)&lt;br /&gt;
# Pick training cases using any policy that ensures that every training case will keep getting picked.&lt;br /&gt;
#* If the output unit is correct, leave its weights alone.&lt;br /&gt;
#* If the output unit incorrectly outputs a zero, add the input vector to the weight vector.&lt;br /&gt;
#* If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.&lt;br /&gt;
# This is guaranteed to find a set of weights that gets the right answer for all the training cases &amp;#039;&amp;#039;if any such set exists&amp;#039;&amp;#039;.&lt;br /&gt;
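The procedure can be sketched directly. The AND function is used here as an assumed, linearly separable example (the extra first component of each input vector is the bias component fixed at 1):&lt;br /&gt;

```python
import numpy as np

# Training cases for AND, with an extra bias component fixed at 1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)

w = np.zeros(3)
for _ in range(25):                    # keep cycling so every case keeps getting picked
    for x, target in zip(X, t):
        y = 1.0 if x @ w >= 0 else 0.0
        if y == target:
            continue                   # correct: leave the weights alone
        elif target == 1:
            w += x                     # incorrectly output a 0: add the input vector
        else:
            w -= x                     # incorrectly output a 1: subtract the input vector

predictions = np.array([1.0 if x @ w >= 0 else 0.0 for x in X])
```

Since a separating weight vector exists for AND, the procedure is guaranteed to find one.&lt;br /&gt;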
&lt;br /&gt;
==== Geometrical View of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* Weight space (each point corresponds to a particular setting for all the weights)&lt;br /&gt;
* Each training case represents a hyperplane through the origin (if we eliminate the threshold/bias). That hyperplane is perpendicular to the &amp;#039;&amp;#039;input&amp;#039;&amp;#039; vector.&lt;br /&gt;
** The weights must lie on one side of this hyperplane to get the correct answer.&lt;br /&gt;
* To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).&lt;br /&gt;
* If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. This means the problem is &amp;#039;&amp;#039;&amp;#039;convex&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Limitations of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.&lt;br /&gt;
* But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.&lt;br /&gt;
** A binary threshold output unit cannot even tell if two single bit features are the same!&lt;br /&gt;
*** Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$&lt;br /&gt;
*** Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$&lt;br /&gt;
*** The four input-output pairs give four inequalities that are impossible to satisfy.&lt;br /&gt;
** Imagine &amp;#039;data-space&amp;#039; in which the axes correspond to components of an input vector.&lt;br /&gt;
*** Each input vector is a point in this space, a weight vector defines a plane in data-space.&lt;br /&gt;
*** The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.&lt;br /&gt;
*** There are many cases where we cannot separate positive and negative cases only by a hyperplane. Those cases are called &amp;#039;&amp;#039;&amp;#039;not linearly separable&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Cannot discriminate simple patterns under translation with wrap-around.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Conclusion ====&lt;br /&gt;
&lt;br /&gt;
* Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help; the result is still linear. Fixed output non-linearities are not enough.&lt;br /&gt;
* We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?&lt;br /&gt;
** We need an efficient way of adapting all the weights, not just the last layer. This is hard.&lt;br /&gt;
** Learning the weights going into hidden units is equivalent to learning features.&lt;br /&gt;
** This is difficult because nobody is telling us directly what the hidden units should do.&lt;br /&gt;
&lt;br /&gt;
== Linear Neurons ==&lt;br /&gt;
&lt;br /&gt;
=== Learning Overview ===&lt;br /&gt;
&lt;br /&gt;
$y = \sum_{i} w_i x_i = \mathbb{w}^T \mathbb{x}$&lt;br /&gt;
&lt;br /&gt;
$y$: Neuron&amp;#039;s estimate of the desired output, $\mathbb{w}$: weight vector, $\mathbb{x}$: input vector&lt;br /&gt;
&lt;br /&gt;
* Perceptron convergence procedure doesn&amp;#039;t work for more complex networks (because the problem is not &amp;#039;&amp;#039;convex&amp;#039;&amp;#039;). Multi-layer neural networks do not use the perceptron learning procedure.&lt;br /&gt;
* Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.&lt;br /&gt;
* The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.&lt;br /&gt;
&lt;br /&gt;
* Why don&amp;#039;t we solve it analytically?&lt;br /&gt;
** Scientific answer: We want a method that real neurons could use.&lt;br /&gt;
** Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.&lt;br /&gt;
&lt;br /&gt;
=== Learning Procedure ===&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;delta-rule&amp;#039;&amp;#039; for learning:&lt;br /&gt;
** $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value&lt;br /&gt;
** $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error&lt;br /&gt;
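Applying the delta-rule case by case can be sketched as follows; the synthetic training data and learning rate are assumptions. The weight vector is driven toward the one that generated the targets:&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # training inputs (assumed synthetic)
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true                          # targets produced by a known linear neuron

w = np.zeros(3)
eps = 0.05                              # learning rate
for _ in range(200):
    for x, target in zip(X, t):
        y = x @ w                       # neuron's estimate of the desired output
        w += eps * x * (target - y)     # delta-rule: Delta w_i = eps * x_i * (t - y)

error = 0.5 * np.sum((X @ w - t) ** 2)
```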
&lt;br /&gt;
=== The Error Surface for Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
* The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl.&lt;br /&gt;
** Vertical cross-sections are parabolas.&lt;br /&gt;
** Horizontal cross-sections are ellipses.&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated.&lt;br /&gt;
* The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.&lt;br /&gt;
* The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.&lt;br /&gt;
&lt;br /&gt;
== Logistic Neurons ==&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
* These give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
* The derivatives:&lt;br /&gt;
** $\frac{\partial z}{\partial w_i} = x_i$&lt;br /&gt;
** $\frac{\partial z}{\partial x_i} = w_i$&lt;br /&gt;
** $\frac{dy}{dz} = y(1-y)$&lt;br /&gt;
** By the chain rule, $\frac{\partial y}{\partial w_i} = x_i y (1-y)$.&lt;br /&gt;
** By the chain rule, $\frac{\partial E}{\partial w_i} = \sum_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum_{n} x_i^n y^n (1-y^n) (t^n - y^n)$.&lt;br /&gt;
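The chain of derivatives above can be checked numerically for a single training case; the toy input, weights, and target are assumptions:&lt;br /&gt;

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # assumed toy input
w = np.array([0.1, 0.2, -0.3])   # assumed toy weights
b = 0.05
t = 1.0                          # target

def error(w):
    y = logistic(b + x @ w)
    return 0.5 * (t - y) ** 2

y = logistic(b + x @ w)
# Analytic gradient from the chain rule: dE/dw_i = -x_i * y * (1 - y) * (t - y)
analytic = -x * y * (1 - y) * (t - y)

# Central finite-difference check of the same gradient:
h = 1e-6
numeric = np.array([
    (error(w + h * e) - error(w - h * e)) / (2 * h)
    for e in np.eye(3)
])
```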
&lt;br /&gt;
== The Backpropagation Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== Other Ideas ===&lt;br /&gt;
&lt;br /&gt;
==== Learning by Perturbing Weights ====&lt;br /&gt;
&lt;br /&gt;
* This idea occurs to everyone who knows about evolution.&lt;br /&gt;
* Randomly perturb one weight and see if it improves performance. If so, save the change.&lt;br /&gt;
** This is a form of reinforcement learning.&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;Very inefficient&amp;#039;&amp;#039;&amp;#039;. We need to do multiple forward passes on a representative set of training cases just to change one weight.&lt;br /&gt;
** Towards the end of learning, large weight perturbations will nearly always make things worse.&lt;br /&gt;
&lt;br /&gt;
==== Learning by Using Perturbations ====&lt;br /&gt;
&lt;br /&gt;
* We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.&lt;br /&gt;
* Not any better, because we need lots of trials on each training case to see the effect of changing one weight through the noise created by all the changes to the other weights.&lt;br /&gt;
* A better idea: Randomly perturb the &amp;#039;&amp;#039;activities&amp;#039;&amp;#039; of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights.&lt;br /&gt;
&lt;br /&gt;
=== The Idea behind Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* We don&amp;#039;t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.&lt;br /&gt;
* We can compute error derivatives for all the hidden units efficiently at the same time.&lt;br /&gt;
** Once we have the error derivatives for the hidden activities, it&amp;#039;s easy to get the error derivatives for the weights going into a hidden unit.&lt;br /&gt;
&lt;br /&gt;
=== Sketch of the Backpropagation Algorithm ===&lt;br /&gt;
&lt;br /&gt;
* First convert the discrepancy between each output and its target value into an error derivative.&lt;br /&gt;
* Then compute error derivatives in each hidden layer from error derivatives in the layer above.&lt;br /&gt;
* Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.&lt;br /&gt;
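The three steps above can be sketched for a tiny two-layer net: one logistic hidden unit, one logistic output, squared error (weights and inputs are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny net: input x, hidden unit h (weight w1), output y (weight w2)
x, t = 0.7, 1.0           # input and target
w1, w2 = 0.4, -0.6        # weights (no biases, for brevity)

# Forward pass
h = sigmoid(w1 * x)
y = sigmoid(w2 * h)
E = 0.5 * (t - y) ** 2

# Backward pass: convert the output discrepancy into an error
# derivative, then push derivatives down one layer at a time.
dE_dy = -(t - y)                  # derivative of the squared error
dE_dz2 = dE_dy * y * (1.0 - y)    # through the output logistic
dE_dh = dE_dz2 * w2               # error derivative for the hidden activity
dE_dz1 = dE_dh * h * (1.0 - h)    # through the hidden logistic

# Error derivatives w.r.t. the weights
dE_dw2 = dE_dz2 * h
dE_dw1 = dE_dz1 * x
```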
&lt;br /&gt;
=== Using the Derivatives Computed by Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* The backpropagation algorithm is an efficient way of computing the error derivative $\partial E / \partial w$ for every weight on a single training case.&lt;br /&gt;
* To get a fully specified learning procedure, we still need to make a lot of other decisions about how to use these error derivatives:&lt;br /&gt;
** Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?&lt;br /&gt;
** Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?&lt;br /&gt;
&lt;br /&gt;
==== Optimization Issues ====&lt;br /&gt;
&lt;br /&gt;
* How often to update the weights:&lt;br /&gt;
** Online: after each training case.&lt;br /&gt;
** Full batch: after a full sweep through the training data.&lt;br /&gt;
** Mini-batch: after a small sample of training cases. This is typically used when training big nets on big data sets.&lt;br /&gt;
* How much to update: (will be discussed further later)&lt;br /&gt;
** Use a fixed learning rate?&lt;br /&gt;
** Adapt the global learning rate?&lt;br /&gt;
** Adapt the learning rate on each connection separately?&lt;br /&gt;
** Don&amp;#039;t use steepest descent?&lt;br /&gt;
&lt;br /&gt;
==== Generalization Issue (Overfitting) ====&lt;br /&gt;
&lt;br /&gt;
* The downside of using powerful models.&lt;br /&gt;
* The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise:&lt;br /&gt;
** The target values may be unreliable (usually only a minor worry).&lt;br /&gt;
** There is &amp;#039;&amp;#039;&amp;#039;sampling error&amp;#039;&amp;#039;&amp;#039;. There will be accidental regularities just because of the particular training cases that were chosen.&lt;br /&gt;
* When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well. &amp;#039;&amp;#039;This is a disaster&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Ways to Reduce Overfitting ====&lt;br /&gt;
&lt;br /&gt;
* A large number of different methods have been developed:&lt;br /&gt;
** Weight-decay: keep weights small / zero.&lt;br /&gt;
** Weight-sharing: insist that many weights have the same value.&lt;br /&gt;
** Early stopping: Make a fake test set. While training the net, stop as soon as performance on the fake test set starts getting worse.&lt;br /&gt;
** Model averaging: Train many different neural nets and average them.&lt;br /&gt;
** Bayesian fitting of neural nets: really just a fancy form of model averaging.&lt;br /&gt;
** Dropout: Randomly omit hidden units during training.&lt;br /&gt;
** Generative pre-training: (Will be discussed towards the end of the course.)&lt;br /&gt;
&lt;br /&gt;
== Learning to Predict the Next Word ==&lt;br /&gt;
&lt;br /&gt;
=== Diversion into Cognitive Science ===&lt;br /&gt;
&lt;br /&gt;
There has been a long debate in cognitive science between two rival theories of what it means to have a &amp;#039;&amp;#039;concept&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
* The feature theory: A concept is a set of semantic features&lt;br /&gt;
** Good for explaining similarities between concepts.&lt;br /&gt;
** It&amp;#039;s convenient -- a concept is a vector of feature activities.&lt;br /&gt;
* The structuralist theory: The meaning of a concept lies in its relationships to other concepts.&lt;br /&gt;
** So conceptual knowledge is best expressed as a relational graph.&lt;br /&gt;
** Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.&lt;br /&gt;
&lt;br /&gt;
Both sides are wrong.&lt;br /&gt;
&lt;br /&gt;
* These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.&lt;br /&gt;
* We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just &amp;quot;seeing&amp;quot; the answer with no conscious intervening steps.&lt;br /&gt;
&lt;br /&gt;
==== Localist and Distributed Representations of Concepts ====&lt;br /&gt;
&lt;br /&gt;
The obvious way to implement a relational graph in a neural net is to treat a neuron as a node in the graph and a connection as a binary relationship. But this &amp;quot;localist&amp;quot; method will not work:&lt;br /&gt;
&lt;br /&gt;
* We need many different types of relationship and the connections in a neural net do not have discrete labels.&lt;br /&gt;
* We need ternary relationships as well as binary ones, e.g. A is between B and C.&lt;br /&gt;
&lt;br /&gt;
The right way to implement relational knowledge in a neural net is still an open issue.&lt;br /&gt;
&lt;br /&gt;
* But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a &amp;quot;distributed representation&amp;quot;. It&amp;#039;s a many-to-many mapping between concepts and neurons.&lt;br /&gt;
&lt;br /&gt;
=== The Softmax Output Function (Another Diversion) ===&lt;br /&gt;
&lt;br /&gt;
The squared error measure has some drawbacks. Is there a different cost function that works better?&lt;br /&gt;
&lt;br /&gt;
Yes, force the outputs to represent a probability distribution across discrete alternatives.&lt;br /&gt;
&lt;br /&gt;
==== Softmax ====&lt;br /&gt;
&lt;br /&gt;
The output units in a softmax group use a non-local non-linearity. Let $y_i$ be the output of a neuron in a softmax group, and $z_i$ (called the &amp;#039;&amp;#039;logit&amp;#039;&amp;#039;) the input of the same neuron:&lt;br /&gt;
&lt;br /&gt;
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j} }$$&lt;br /&gt;
&lt;br /&gt;
Note that $\sum y_i = 1$, so we force the $y_i$ to represent a probability distribution. We can show:&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$&lt;br /&gt;
&lt;br /&gt;
The right cost function to use with softmax is the negative log probability of the right answer. This is called the &amp;#039;&amp;#039;Cross Entropy&amp;#039;&amp;#039; cost function. ($t_j$ is the target value.)&lt;br /&gt;
&lt;br /&gt;
$$C = - \sum_{j} t_j \log y_j$$&lt;br /&gt;
&lt;br /&gt;
$C$ has a very big gradient when the target value is 1 and the output is almost zero.&lt;br /&gt;
* A value of 0.000001 is much better than 0.000000001 (one-in-a-million chance vs one-in-a-billion chance).&lt;br /&gt;
* The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$.&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial C}{\partial z_i} = \sum_{j} \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$&lt;br /&gt;
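A minimal sketch of the softmax forward pass, the cross-entropy cost, and its simple gradient $y_i - t_i$ (the logits and targets are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def softmax(z):
    # Subtract the max logit for numerical stability
    # (this does not change the result).
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [2.0, 1.0, 0.1]       # logits
t = [1.0, 0.0, 0.0]       # one-hot target
y = softmax(z)

# Cross-entropy cost and its gradient w.r.t. the logits
C = -sum(tj * math.log(yj) for tj, yj in zip(t, y))
dC_dz = [yi - ti for yi, ti in zip(y, t)]
```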
&lt;br /&gt;
&lt;br /&gt;
=== Neuro-Probabilistic Language Models ===&lt;br /&gt;
&lt;br /&gt;
* We cannot identify phonemes perfectly in noisy speech.&lt;br /&gt;
* People use their understanding of the meaning of the utterance to hear the right words.&lt;br /&gt;
** &amp;quot;We do this unconsciously when we wreck a nice beach&amp;quot;&lt;br /&gt;
** We are very good at it.&lt;br /&gt;
* This means speech recognizers have to know which words are likely to come next and which are not.&lt;br /&gt;
** Fortunately words can be predicted quite well without full understanding.&lt;br /&gt;
&lt;br /&gt;
==== The Standard &amp;quot;Trigram&amp;quot; Method ====&lt;br /&gt;
&lt;br /&gt;
* Take a huge amount of text and count the frequencies of all triples of words.&lt;br /&gt;
&lt;br /&gt;
* Use these frequencies to make bets on the relative probabilities of words given the previous two words.&lt;br /&gt;
&lt;br /&gt;
* Until very recently this was the state-of-the-art.&lt;br /&gt;
** We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.&lt;br /&gt;
** We have to &amp;quot;back-off&amp;quot; to digrams when the count for a trigram is too small. The probability is not zero just because the count is zero!&lt;br /&gt;
&lt;br /&gt;
* A trigram model doesn&amp;#039;t understand the similarities between words.&lt;br /&gt;
** The sentence &amp;quot;The cat got squashed in the garden on friday&amp;quot; should help us predict words in the sentence &amp;quot;The dog got flattened in the yard on monday&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.&lt;br /&gt;
** Using a feature representation also allows a context that contains many more previous words, e.g. 10.&lt;br /&gt;
&lt;br /&gt;
* Yoshua Bengio&amp;#039;s neural net for predicting the next word&lt;br /&gt;
&lt;br /&gt;
== Object Recognition ==&lt;br /&gt;
&lt;br /&gt;
=== Why Object Recognition is Difficult ===&lt;br /&gt;
&lt;br /&gt;
* Segmentation: Real scenes are cluttered with other objects.&lt;br /&gt;
* Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.&lt;br /&gt;
* Deformation: Objects can deform in a variety of non-affine ways.&lt;br /&gt;
* Affordances: Object classes are things defined by how they are used.&lt;br /&gt;
* Viewpoint: Changes in viewpoint cause changes in images that standard learning methods cannot cope with.&lt;br /&gt;
&lt;br /&gt;
=== Achieving Viewpoint Invariance ===&lt;br /&gt;
&lt;br /&gt;
* Viewpoint invariance is one of the main difficulties in making computers perceive.&lt;br /&gt;
* Several different approaches:&lt;br /&gt;
** Use redundant invariant features.&lt;br /&gt;
** Put a box around the object and use normalized pixels.&lt;br /&gt;
** Use replicated features with pooling. This is called &amp;#039;&amp;#039;convolutional neural nets&amp;#039;&amp;#039; (in later lectures).&lt;br /&gt;
** Use a hierarchy of parts that have explicit poses relative to the camera (in later lectures).&lt;br /&gt;
&lt;br /&gt;
==== Invariant Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* Extract a large, redundant set of features that are invariant under transformations.&lt;br /&gt;
** e.g. pair of roughly parallel lines with a red dot between them (this is what baby herring gulls use to know where to peck for food).&lt;br /&gt;
* With enough invariant features, there is only one way to assemble them into an object.&lt;br /&gt;
* But for recognition, we must avoid forming features from parts of different objects.&lt;br /&gt;
&lt;br /&gt;
==== Judicious Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* Put a box around the object and use it as a coordinate frame for a set of normalized pixels.&lt;br /&gt;
** This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.&lt;br /&gt;
** The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, ...&lt;br /&gt;
&lt;br /&gt;
* But choosing the box is difficult because of:&lt;br /&gt;
** Segmentation errors, occlusion, unusual orientations.&lt;br /&gt;
&lt;br /&gt;
* We need to recognize the shape to get the box right!&lt;br /&gt;
&lt;br /&gt;
==== Brute Force Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* When training the recognizer, use well-segmented, upright images to fit the correct box.&lt;br /&gt;
&lt;br /&gt;
* At test time try all possible boxes in a range of positions and scales.&lt;br /&gt;
&lt;br /&gt;
** This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.&lt;br /&gt;
** It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Digit Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== The Replicated Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* This is currently the dominant approach for neural networks.&lt;br /&gt;
&lt;br /&gt;
* Use many different copies of the same feature detector with different positions.&lt;br /&gt;
&lt;br /&gt;
** Could also replicate across scale and orientation but it is tricky and expensive.&lt;br /&gt;
** Replication greatly reduces the number of free parameters to be learned.&lt;br /&gt;
&lt;br /&gt;
* Use several different feature types, each with its own map of replicated detectors.&lt;br /&gt;
&lt;br /&gt;
==== Backpropagation with Weight Constraints ====&lt;br /&gt;
&lt;br /&gt;
* It&amp;#039;s easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.&lt;br /&gt;
&lt;br /&gt;
* We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints - so if the weights started off satisfying the constraints, they will continue to satisfy them.&lt;br /&gt;
&lt;br /&gt;
To constrain $w_1 = w_2$, we need $\Delta w_1 = \Delta w_2$.&lt;br /&gt;
&lt;br /&gt;
Compute $\partial E / \partial w_1$ and $\partial E / \partial w_2$, then use $\partial E / \partial w_1 + \partial E / \partial w_2$ for both $w_1$ and $w_2$.&lt;br /&gt;
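One gradient-descent step with this constraint might look as follows (the gradient values are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: tying w1 = w2 during gradient descent.
# Use the *sum* of the two gradients for both weights, so weights
# that start off equal stay equal after every update.
w1 = w2 = 0.5
g1, g2 = 0.3, -0.1        # dE/dw1 and dE/dw2 (illustrative values)
lr = 0.1                  # learning rate

shared_grad = g1 + g2
w1 = w1 - lr * shared_grad
w2 = w2 - lr * shared_grad
```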
&lt;br /&gt;
==== What Does Replicating the Feature Detectors Achieve? ====&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Equivariant&amp;#039;&amp;#039;&amp;#039; activities: Replicated features do &amp;#039;&amp;#039;not&amp;#039;&amp;#039; make the neural activities invariant to translation. The activities are equivariant.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Invariant&amp;#039;&amp;#039;&amp;#039; knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.&lt;br /&gt;
&lt;br /&gt;
==== Pooling the Outputs of Replicated Feature Detectors ====&lt;br /&gt;
&lt;br /&gt;
* Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.&lt;br /&gt;
** This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.&lt;br /&gt;
** Taking the maximum of the four works slightly better.&lt;br /&gt;
&lt;br /&gt;
* Problem: After several levels of pooling, we have lost information about the precise positions of things.&lt;br /&gt;
** This makes it impossible to use the precise spatial relationships between high-level parts for recognition.&lt;br /&gt;
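A sketch of pooling a 4x4 map of detector activities in 2x2 blocks, by averaging or by taking the maximum (the activity values are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
# 'acts' is a 4x4 map of replicated-detector activities.
acts = [
    [0.1, 0.9, 0.2, 0.0],
    [0.3, 0.5, 0.4, 0.6],
    [0.0, 0.0, 0.8, 0.1],
    [0.2, 0.7, 0.3, 0.3],
]

def pool(acts, mode):
    out = []
    for r in range(0, 4, 2):
        row = []
        for c in range(0, 4, 2):
            block = [acts[r][c], acts[r][c + 1],
                     acts[r + 1][c], acts[r + 1][c + 1]]
            if mode == "max":       # usually works slightly better
                row.append(max(block))
            else:                   # "average"
                row.append(sum(block) / 4.0)
        out.append(row)
    return out

pooled_max = pool(acts, "max")
pooled_avg = pool(acts, "average")
```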
&lt;br /&gt;
==== Le Net ====&lt;br /&gt;
&lt;br /&gt;
* Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:&lt;br /&gt;
** Many hidden layers.&lt;br /&gt;
** Many maps of replicated units in each layer.&lt;br /&gt;
** Pooling of the outputs of nearby replicated units.&lt;br /&gt;
** A wide net that can cope with several characters at once even if they overlap.&lt;br /&gt;
** A clever way of training a complete system, not just a recognizer.&lt;br /&gt;
&lt;br /&gt;
* This net was used for reading ~10% of the checks in North America.&lt;br /&gt;
&lt;br /&gt;
* Look at the impressive demos of LeNet at http://yann.lecun.com&lt;br /&gt;
&lt;br /&gt;
==== Priors and Prejudice ====&lt;br /&gt;
&lt;br /&gt;
* We can put our prior knowledge about the task into the network by designing appropriate:&lt;br /&gt;
** Connectivity&lt;br /&gt;
** Weight constraints&lt;br /&gt;
** Neuron activation functions&lt;br /&gt;
&lt;br /&gt;
* This is less intrusive than hand-designing the features&lt;br /&gt;
** But it still prejudices the network towards the particular way of solving the problem that we had in mind.&lt;br /&gt;
&lt;br /&gt;
* Alternatively, we can use our prior knowledge to create a whole lot more training data.&lt;br /&gt;
** This may require a lot of work (Hofman &amp;amp; Tresp, 1993). Real data + synthetic data.&lt;br /&gt;
** It may make learning take much longer.&lt;br /&gt;
** It allows optimization to discover clever ways of using the multi-layer network that we did not think of - and we may never fully understand how it does it.&lt;br /&gt;
&lt;br /&gt;
==== The Brute Force Approach ====&lt;br /&gt;
&lt;br /&gt;
* LeNet uses knowledge about the invariances to design:&lt;br /&gt;
** the local connectivity&lt;br /&gt;
** the weight-sharing&lt;br /&gt;
** the pooling&lt;br /&gt;
&lt;br /&gt;
* This achieves about 80 errors.&lt;br /&gt;
** This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008).&lt;br /&gt;
&lt;br /&gt;
* Ciresan et. al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:&lt;br /&gt;
** For each training image, they produce many new training examples by applying many different transformations.&lt;br /&gt;
** They can then train a large, deep, dumb net on a GPU without much overfitting.&lt;br /&gt;
** They achieved about 35 errors.&lt;br /&gt;
&lt;br /&gt;
* With model averaging they can now get about 25 errors (which is about the same as human error rate).&lt;br /&gt;
&lt;br /&gt;
==== How to Detect a Significant Drop in the Error Rate ====&lt;br /&gt;
&lt;br /&gt;
* Is 30 errors in 10,000 test cases significantly better than 40 errors?&lt;br /&gt;
** It all depends on the particular errors.&lt;br /&gt;
** The McNemar test uses the particular errors and can be much more powerful than a test that just uses the number of errors.&lt;br /&gt;
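A sketch of the McNemar statistic, which is computed from the discordant cases only (the counts here are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
# McNemar test sketch: compare two classifiers on the *same* test cases.
# Only the discordant cases matter:
#   n01 = cases model A got wrong and model B got right
#   n10 = cases model A got right and model B got wrong
n01, n10 = 30, 10

# Chi-squared statistic with continuity correction; values above 3.84
# are significant at the 5% level (1 degree of freedom).
chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
```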
&lt;br /&gt;
=== Convolutional Nets for Object Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== From Hand-Written Digits to 3D Objects ====&lt;br /&gt;
&lt;br /&gt;
* Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:&lt;br /&gt;
&lt;br /&gt;
** A hundred times as many classes (1000 vs 10).&lt;br /&gt;
** A hundred times as many pixels (256x256 color vs 28x28 gray).&lt;br /&gt;
** Two dimensional image of three-dimensional scene.&lt;br /&gt;
** Cluttered scenes requiring segmentation.&lt;br /&gt;
** Multiple objects in each image.&lt;br /&gt;
&lt;br /&gt;
* Will the same type of convolutional neural network work?&lt;br /&gt;
&lt;br /&gt;
==== The ILSVRC-2012 Competition on ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* The dataset has 1.2 million high-resolution training images.&lt;br /&gt;
&lt;br /&gt;
* The classification task:&lt;br /&gt;
** Get the &amp;quot;correct&amp;quot; class in your top 5 bets. There are 1000 classes.&lt;br /&gt;
&lt;br /&gt;
* The localization task:&lt;br /&gt;
** For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.&lt;br /&gt;
&lt;br /&gt;
* Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...&lt;br /&gt;
** Computer vision systems use complicated multi-stage pipelines.&lt;br /&gt;
** The early stages are typically hand-tuned by optimizing a few parameters.&lt;br /&gt;
&lt;br /&gt;
==== Error Rates on the ILSVRC-2012 Competition ====&lt;br /&gt;
&lt;br /&gt;
* University of Tokyo: 26.1% (classification)&lt;br /&gt;
* Oxford University Computer Vision Group: 26.9% (classification)&lt;br /&gt;
* INRIA(French national research institute in CS) + XRCE(Xerox Research Center Europe): 27.0% (classification)&lt;br /&gt;
* University of Amsterdam: 29.5% (classification)&lt;br /&gt;
* University of Toronto (&amp;#039;&amp;#039;&amp;#039;Alex Krizhevsky&amp;#039;&amp;#039;&amp;#039;): &amp;#039;&amp;#039;&amp;#039;16.4%&amp;#039;&amp;#039;&amp;#039; (classification)&lt;br /&gt;
&lt;br /&gt;
==== A Neural Network for ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its &amp;#039;&amp;#039;architecture&amp;#039;&amp;#039; was:&lt;br /&gt;
** 7 hidden layers not counting some max pooling layers.&lt;br /&gt;
** The early layers were convolutional.&lt;br /&gt;
** The last two layers were globally connected.&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;activation functions&amp;#039;&amp;#039; were:&lt;br /&gt;
** Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.&lt;br /&gt;
** Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.&lt;br /&gt;
&lt;br /&gt;
==== Tricks that Significantly Improve Generalization ====&lt;br /&gt;
&lt;br /&gt;
* Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.&lt;br /&gt;
** At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;dropout&amp;quot; to regularize the weights in the globally connected layers (which contain most of the parameters).&lt;br /&gt;
** Dropout means that half of the hidden units in a layer are randomly removed for each training example.&lt;br /&gt;
** This stops hidden units from relying too much on other hidden units.&lt;br /&gt;
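A sketch of the dropout mask for one training example, keeping each hidden unit with probability 0.5 (the activity values are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
import random

random.seed(0)

def dropout(activities):
    # Keep each hidden unit with probability 0.5, zero out the rest.
    # At test time, use all units and halve their outgoing weights
    # (roughly equivalent to averaging over many dropped-out nets).
    mask = [float(random.getrandbits(1)) for _ in activities]
    return [a * m for a, m in zip(activities, mask)], mask

hidden = [0.2, 1.3, 0.7, 0.9]
dropped, mask = dropout(hidden)
```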
&lt;br /&gt;
==== The Hardware Required for Alex Krizhevsky&amp;#039;s Net ====&lt;br /&gt;
&lt;br /&gt;
* He uses a very efficient implementation of convolutional nets on two Nvidia GTX 580 Graphics Processor Units (over 1000 fast little cores)&lt;br /&gt;
** GPUs are very good for matrix-matrix multiplies.&lt;br /&gt;
** GPUs have very high bandwidth to memory.&lt;br /&gt;
** This allows him to train the network in a week.&lt;br /&gt;
** It also makes it quick to combine results from 10 patches at test time.&lt;br /&gt;
&lt;br /&gt;
* We can spread a network over many cores if we can communicate the states fast enough.&lt;br /&gt;
&lt;br /&gt;
* As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.&lt;br /&gt;
&lt;br /&gt;
==== Finding Roads in High-Resolution Images ====&lt;br /&gt;
&lt;br /&gt;
* Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images.&lt;br /&gt;
** It takes a large image patch and predicts a binary road label for the central 16x16 pixels.&lt;br /&gt;
** There is lots of labeled training data available for this task.&lt;br /&gt;
&lt;br /&gt;
* The task is hard for many reasons:&lt;br /&gt;
** Occlusion by buildings, trees, and cars.&lt;br /&gt;
** Shadows, Lighting changes.&lt;br /&gt;
** Minor viewpoint changes.&lt;br /&gt;
&lt;br /&gt;
* The worst problems are incorrect labels:&lt;br /&gt;
** Badly registered maps&lt;br /&gt;
** Arbitrary decisions about what counts as a road (vs laneway etc.)&lt;br /&gt;
&lt;br /&gt;
== Mini-Batch Gradient Descent ==&lt;br /&gt;
&lt;br /&gt;
=== Overview ===&lt;br /&gt;
&lt;br /&gt;
* The Error Surface: Lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl where vertical cross-sections are parabolas and horizontal cross-sections are ellipses.&lt;br /&gt;
&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated. But locally, a piece of a quadratic bowl is usually a very good approximation.&lt;br /&gt;
&lt;br /&gt;
* Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.&lt;br /&gt;
** The gradient is big in the direction in which we only want to travel a small distance, and small in the direction in which we want to travel a large distance.&lt;br /&gt;
&lt;br /&gt;
* If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges.&lt;br /&gt;
&lt;br /&gt;
* What we would like to achieve: move quickly in directions with small but consistent gradients, and move slowly in directions with big but inconsistent gradients.&lt;br /&gt;
&lt;br /&gt;
==== Stochastic Gradient Descent ====&lt;br /&gt;
&lt;br /&gt;
* If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.&lt;br /&gt;
** So, instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.&lt;br /&gt;
** The extreme version of this approach updates weights after each case. It&amp;#039;s called &amp;quot;online&amp;quot; learning.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches are usually better than online.&lt;br /&gt;
** Less computation is used updating the weights.&lt;br /&gt;
** Computing the gradient for many cases simultaneously uses matrix-matrix multiplies which are very efficient, especially on GPUs.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches need to be balanced for classes (one way to approximate this is to pick points randomly).&lt;br /&gt;
&lt;br /&gt;
==== Two types of learning algorithm ====&lt;br /&gt;
&lt;br /&gt;
* If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).&lt;br /&gt;
&lt;br /&gt;
* For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.&lt;br /&gt;
&lt;br /&gt;
==== Basic Mini-Batch Gradient Descent Algorithm ====&lt;br /&gt;
&lt;br /&gt;
* Guess an initial learning rate.&lt;br /&gt;
** If the error keeps getting worse or oscillates wildly, reduce the learning rate.&lt;br /&gt;
** If the error is falling fairly consistently but slowly, increase the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Write a simple program to automate this way of adjusting the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Toward the end of mini-batch learning, it nearly always helps to turn down the learning rate.&lt;br /&gt;
** This removes fluctuations in the final weights caused by the variations between mini-batches.&lt;br /&gt;
&lt;br /&gt;
* Turn down the learning rate when the error stops decreasing.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Note&amp;#039;&amp;#039;&amp;#039;: Compute the error in a separate validation set.&lt;br /&gt;
&lt;br /&gt;
=== Tricks for Mini-Batch Gradient Descent ===&lt;br /&gt;
&lt;br /&gt;
==== Initializing The Weights ====&lt;br /&gt;
&lt;br /&gt;
* If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.&lt;br /&gt;
&lt;br /&gt;
** So they can never learn to be different features.&lt;br /&gt;
** We break symmetry by initializing the weights to have small random values.&lt;br /&gt;
&lt;br /&gt;
* If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.&lt;br /&gt;
** We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be inversely proportional to $\sqrt{\text{fan-in}}$.&lt;br /&gt;
&lt;br /&gt;
* We can also scale the learning rate the same way.&lt;br /&gt;
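A sketch of this initialization, drawing each weight from a Gaussian whose scale shrinks as $1/\sqrt{\text{fan-in}}$ (the layer sizes are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
import random

random.seed(0)

def init_weights(fan_in, n_units):
    # Bigger fan-in, smaller initial weights: std = 1 / sqrt(fan_in).
    # The small random values also break symmetry between hidden units.
    scale = 1.0 / fan_in ** 0.5
    return [[random.gauss(0.0, scale) for _ in range(fan_in)]
            for _ in range(n_units)]

w = init_weights(fan_in=100, n_units=3)
```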
&lt;br /&gt;
==== Shifting the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, shifting the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.&lt;br /&gt;
&lt;br /&gt;
* The hyperbolic tangent (which is $2\times \text{logistic} - 1$) produces hidden activations that are roughly zero mean. &lt;br /&gt;
** In this respect it&amp;#039;s better than the logistic.&lt;br /&gt;
&lt;br /&gt;
==== Scaling the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, scaling the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.&lt;br /&gt;
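Shifting and scaling can be combined into one standardization pass over the training set (the data values here are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
# Standardize each input component: zero mean, unit variance
# over the whole training set.
data = [[101.0, -0.1], [98.0, 0.3], [104.0, 0.1], [97.0, -0.3]]

n = len(data)
dims = len(data[0])
means = [sum(row[d] for row in data) / n for d in range(dims)]
stds = [(sum((row[d] - means[d]) ** 2 for row in data) / n) ** 0.5
        for d in range(dims)]
standardized = [[(row[d] - means[d]) / stds[d] for d in range(dims)]
                for row in data]
```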
&lt;br /&gt;
==== A More Thorough Method: Decorrelate the Input Components ====&lt;br /&gt;
&lt;br /&gt;
* For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.&lt;br /&gt;
&lt;br /&gt;
* There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.&lt;br /&gt;
** Drop the principal components with the smallest eigenvalues: &amp;#039;&amp;#039;&amp;#039;this achieves some dimensionality reduction&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.&lt;br /&gt;
&lt;br /&gt;
* For a circular error surface, the gradient points straight towards the minimum.&lt;br /&gt;
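A sketch of PCA whitening with NumPy: rotate onto the principal components, then divide by the square roots of the eigenvalues (the correlated data are synthetic):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(0)

# Make correlated 2-D inputs by mixing independent Gaussians.
raw = rng.normal(size=(500, 2))
X = raw @ np.array([[3.0, 0.0], [2.0, 0.5]])
X = X - X.mean(axis=0)

# PCA: eigen-decompose the covariance, rotate onto the principal
# components, then divide by sqrt(eigenvalue) to whiten. For a
# linear neuron this turns the elliptical error surface circular.
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)
rotated = X @ eigvecs
whitened = rotated / np.sqrt(eigvals)
```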
&lt;br /&gt;
&lt;br /&gt;
==== Common Problems That Occur in Multilayer Networks ====&lt;br /&gt;
&lt;br /&gt;
* If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.&lt;br /&gt;
** The error derivatives for the hidden units will all become tiny and the error will not decrease.&lt;br /&gt;
** This is usually a plateau, but people often mistake it for a local minimum.&lt;br /&gt;
&lt;br /&gt;
* In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.&lt;br /&gt;
** The network finds this strategy quickly and may take a long time to improve on it by making use of the input.&lt;br /&gt;
** This is another plateau that looks like a local minimum.&lt;br /&gt;
&lt;br /&gt;
==== Be Careful about Turning Down the Learning Rate ====&lt;br /&gt;
&lt;br /&gt;
* Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.&lt;br /&gt;
** So we get a quick win&lt;br /&gt;
** But then we get slower learning.&lt;br /&gt;
&lt;br /&gt;
* Don&amp;#039;t turn down the learning rate too soon!&lt;br /&gt;
&lt;br /&gt;
==== Four Ways to Speed up Mini-Batch Learning ====&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;momentum&amp;quot;&lt;br /&gt;
** Instead of using the gradient to change the &amp;#039;&amp;#039;position&amp;#039;&amp;#039; of the weight &amp;quot;particle&amp;quot;, use it to change its &amp;#039;&amp;#039;velocity&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
* Use separate adaptive learning rates for each parameter.&lt;br /&gt;
** Slowly adjust the rate using the consistency of the gradient for that parameter.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;rmsprop&amp;#039;&amp;#039;&amp;#039;: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. &lt;br /&gt;
** This is the mini-batch version of just using the sign of the gradient.&lt;br /&gt;
&lt;br /&gt;
* Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)&lt;br /&gt;
** Adapt it to work for neural nets.&lt;br /&gt;
** Adapt it to work for mini-batches.&lt;br /&gt;
&lt;br /&gt;
=== The Momentum Method ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.&lt;br /&gt;
# The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.&lt;br /&gt;
# Its momentum makes it keep going in the previous direction.&lt;br /&gt;
&lt;br /&gt;
* It damps oscillations in directions of high curvature by combining gradients with opposite signs.&lt;br /&gt;
&lt;br /&gt;
* It builds up speed in directions with a gentle but consistent gradient.&lt;br /&gt;
&lt;br /&gt;
==== Equations ====&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(t) = \alpha \mathbb{v}(t-1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The effect of the gradient is to increment the previous velocity. The velocity also decays by $\alpha$, which is slightly less than 1.&lt;br /&gt;
&lt;br /&gt;
The weight change is equal to the current velocity.&lt;br /&gt;
&lt;br /&gt;
$$\Delta \mathbb{w}(t) = \mathbb{v} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \mathbb{v} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \Delta \mathbb{w} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The weight change can be expressed in terms of the previous weight change and the current gradient.&lt;br /&gt;
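As a minimal sketch, the update equations above map directly onto code (variable names and the constant-gradient example are illustrative, not from the lecture):&lt;br /&gt;

```python
# Sketch of the momentum update:
#   v(t) = alpha * v(t-1) - eps * dE/dw(t),  delta_w(t) = v(t)
# alpha is the momentum coefficient, eps the learning rate.

def momentum_step(w, v, grad, alpha=0.9, eps=0.01):
    """One momentum update for a single weight."""
    v = alpha * v - eps * grad   # gradient increments the decayed velocity
    w = w + v                    # weight change equals the current velocity
    return w, v

# A weight sitting on a constant gradient builds up velocity:
w, v = 0.0, 0.0
for _ in range(3):
    w, v = momentum_step(w, v, grad=1.0)
```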
&lt;br /&gt;
==== Behavior ====&lt;br /&gt;
&lt;br /&gt;
* If the error surface is a tilted plane, the ball reaches a terminal velocity $\mathbb{v}(\infty)$.&lt;br /&gt;
** If the momentum is close to 1, this is much faster than simple gradient descent.&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(\infty) = \frac{1}{1 - \alpha} \Big( - \epsilon \frac{\partial E}{\partial \mathbb{w}} \Big)$$&lt;br /&gt;
&lt;br /&gt;
* At the beginning of learning there may be very large gradients.&lt;br /&gt;
** So it pays to use a small momentum (e.g. 0.5).&lt;br /&gt;
** Once the large gradients have disappeared and the weights are stuck in a ravine, the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99).&lt;br /&gt;
&lt;br /&gt;
* This allows us to learn at a rate that would cause divergent oscillations without the momentum.&lt;br /&gt;
&lt;br /&gt;
==== Better Type of Momentum (Nesterov 1983) ====&lt;br /&gt;
&lt;br /&gt;
* The standard momentum method &amp;#039;&amp;#039;first&amp;#039;&amp;#039; computes the gradient at the current location and &amp;#039;&amp;#039;then&amp;#039;&amp;#039; takes a big jump in the direction of the updated accumulated gradient.&lt;br /&gt;
* Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.&lt;br /&gt;
** Inspired by the Nesterov method for optimizing convex functions.&lt;br /&gt;
* &amp;#039;&amp;#039;First&amp;#039;&amp;#039; make a big jump in the direction of the previous accumulated gradient.&lt;br /&gt;
* &amp;#039;&amp;#039;Then&amp;#039;&amp;#039; measure the gradient where you end up and make a correction.&lt;br /&gt;
** It&amp;#039;s better to correct a mistake &amp;#039;&amp;#039;after&amp;#039;&amp;#039; you have made it!&lt;br /&gt;
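A sketch of the jump-then-correct idea (the `grad_fn` callback and the quadratic test surface are hypothetical, chosen only to make the example runnable):&lt;br /&gt;

```python
# Nesterov-style momentum: first jump along the previous accumulated
# gradient, then measure the gradient where you end up and correct.

def nesterov_step(w, v, grad_fn, alpha=0.9, eps=0.01):
    lookahead = w + alpha * v                  # big jump along previous velocity
    v = alpha * v - eps * grad_fn(lookahead)   # correction measured AFTER the jump
    return w + v, v

# On the bowl E = w^2 / 2 (so dE/dw = w), the weight heads toward 0:
w, v = 1.0, 0.0
for _ in range(5):
    w, v = nesterov_step(w, v, grad_fn=lambda x: x)
```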
&lt;br /&gt;
=== Adaptive Learning Rates for Each Connection ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
* In a multilayer net, the appropriate learning rates can vary widely between weights:&lt;br /&gt;
** The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.&lt;br /&gt;
** The fan-in of a unit determines the size of the &amp;quot;overshoot&amp;quot; effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.&lt;br /&gt;
&lt;br /&gt;
* So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.&lt;br /&gt;
&lt;br /&gt;
==== One Way to Determine the Individual Learning Rates ====&lt;br /&gt;
&lt;br /&gt;
$$\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$$&lt;br /&gt;
&lt;br /&gt;
* Start with a local gain of 1 for every weight, i.e. $g_{ij}(0) = 1$.&lt;br /&gt;
* Increase the local gain if the gradient for that weight does not change sign.&lt;br /&gt;
* Use small additive increases and multiplicative decreases.&lt;br /&gt;
** This ensures that big gains decay rapidly when oscillations start.&lt;br /&gt;
** If the gradient is totally random, the gain will hover around 1 when we increase by &amp;#039;&amp;#039;plus&amp;#039;&amp;#039; $\delta$ half the time and decrease by &amp;#039;&amp;#039;times&amp;#039;&amp;#039; $1 - \delta$ half the time.&lt;br /&gt;
&lt;br /&gt;
$$\text{if } \Big( \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) \Big) &amp;gt; 0$$&lt;br /&gt;
$$\text{then } g_{ij}(t) = g_{ij}(t - 1) + 0.05$$&lt;br /&gt;
$$\text{else } g_{ij}(t) = g_{ij}(t - 1) * 0.95$$&lt;br /&gt;
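The additive-increase / multiplicative-decrease rule above can be sketched for one weight (the helper name and the example gradients are illustrative):&lt;br /&gt;

```python
# Gain update for one weight: increase additively while the gradient
# keeps its sign, decrease multiplicatively when it flips.

def update_gain(g, grad_now, grad_prev):
    if grad_now * grad_prev > 0:   # gradient kept its sign
        return g + 0.05            # small additive increase
    return g * 0.95                # multiplicative decrease

g = 1.0
g = update_gain(g, grad_now=0.3, grad_prev=0.2)    # same sign: 1.05
g = update_gain(g, grad_now=-0.1, grad_prev=0.3)   # sign flip: 1.05 * 0.95
```

The multiplicative decrease is what makes big gains decay rapidly once oscillations start.&lt;br /&gt;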
&lt;br /&gt;
==== Tricks for Making Adaptive Learning Rates Work Better ====&lt;br /&gt;
&lt;br /&gt;
* Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].&lt;br /&gt;
&lt;br /&gt;
* Use full batch learning or very big mini-batches.&lt;br /&gt;
** This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates can be combined with momentum.&lt;br /&gt;
** Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates only deal with axis-aligned effects.&lt;br /&gt;
** Momentum does not care about the alignment of the axes.&lt;br /&gt;
&lt;br /&gt;
=== RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude ===&lt;br /&gt;
&lt;br /&gt;
==== RPROP: Using only the sign of the gradient ====&lt;br /&gt;
&lt;br /&gt;
* The magnitude of the gradient can be very different for different weights and can change during learning.&lt;br /&gt;
** This makes it hard to choose a single global learning rate.&lt;br /&gt;
&lt;br /&gt;
* For &amp;#039;&amp;#039;full batch learning&amp;#039;&amp;#039;, we can deal with this variation by only using the sign of the gradient.&lt;br /&gt;
** The weight updates are all of the same magnitude.&lt;br /&gt;
** This escapes from plateaus with tiny gradients quickly.&lt;br /&gt;
&lt;br /&gt;
* RPROP: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.&lt;br /&gt;
** Increase the step size for a weight &amp;#039;&amp;#039;multiplicatively&amp;#039;&amp;#039; (e.g. times 1.2) if the signs of its last two gradients agree.&lt;br /&gt;
** Otherwise decrease the step size multiplicatively (e.g. times 0.5).&lt;br /&gt;
** Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster&amp;#039;s advice).&lt;br /&gt;
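A sketch of the RPROP rule just described, for a single weight (function and variable names are illustrative):&lt;br /&gt;

```python
# RPROP: use only the sign of the gradient; adapt each weight's step
# size multiplicatively based on whether the last two gradients agree.

def rprop_step(w, step, grad_now, grad_prev,
               up=1.2, down=0.5, step_max=50.0, step_min=1e-6):
    prod = grad_now * grad_prev
    if prod > 0:                          # last two gradients agree
        step = min(step * up, step_max)
    elif prod != 0:                       # sign flipped
        step = max(step * down, step_min)
    if grad_now > 0:                      # move against the gradient,
        w -= step                         # ignoring its magnitude
    elif grad_now != 0:
        w += step
    return w, step

w, step = 0.0, 0.1
w, step = rprop_step(w, step, grad_now=1.0, grad_prev=1.0)  # step grows to 0.12
```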
&lt;br /&gt;
==== Why RPROP Does Not Work with Mini-Batches ====&lt;br /&gt;
&lt;br /&gt;
* The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.&lt;br /&gt;
** Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.&lt;br /&gt;
** We want this weight to stay roughly where it is.&lt;br /&gt;
&lt;br /&gt;
* RPROP would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale), so the weight would grow a lot.&lt;br /&gt;
* Is there a way to combine:&lt;br /&gt;
** The robustness of RPROP.&lt;br /&gt;
** The efficiency of mini-batches.&lt;br /&gt;
** The effective averaging of gradients over mini-batches.&lt;br /&gt;
&lt;br /&gt;
==== RMSPROP: Mini-Batch Version of RPROP ====&lt;br /&gt;
&lt;br /&gt;
* RPROP is equivalent to using the gradient but also dividing by the size of the gradient.&lt;br /&gt;
** The problem with mini-batch RPROP is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?&lt;br /&gt;
&lt;br /&gt;
* RMSPROP: Keep a moving average of the squared gradient for each weight.&lt;br /&gt;
$$MeanSquare(w,t) = 0.9 \cdot MeanSquare(w, t-1) + 0.1 \Big( \frac{\partial E}{\partial w}(t) \Big) ^2$$&lt;br /&gt;
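A sketch of this update for one weight (the tiny constant in the denominator is a standard numerical safeguard, not part of the slide; names are illustrative):&lt;br /&gt;

```python
# rmsprop: keep a decayed moving average of the squared gradient and
# divide the learning step by its square root.

def rmsprop_step(w, mean_sq, grad, eps=0.01, decay=0.9, tiny=1e-8):
    mean_sq = decay * mean_sq + (1 - decay) * grad ** 2
    # tiny guards against division by zero (a common safeguard, added here)
    w -= eps * grad / (mean_sq ** 0.5 + tiny)
    return w, mean_sq

w, ms = 0.0, 0.0
w, ms = rmsprop_step(w, ms, grad=2.0)
```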
&lt;br /&gt;
* Dividing the gradient by $\sqrt{MeanSquare(w,t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=473</id>
		<title>Neural Networks (Geoffrey Hinton Course)</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=473"/>
		<updated>2017-01-15T03:57:28Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Some Simple Models of Neurons ==&lt;br /&gt;
&lt;br /&gt;
$y$ output, $x_i$ input.&lt;br /&gt;
&lt;br /&gt;
=== Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$y = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$w_i$ weights, $b$ bias&lt;br /&gt;
&lt;br /&gt;
=== Binary Threshold Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq \theta$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
Or, equivalently,&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq 0$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
=== Rectified Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = z$ if $z &amp;gt; 0$, $0$ otherwise. (Linear above zero, decision at zero.)&lt;br /&gt;
&lt;br /&gt;
=== Sigmoid Neurons ===&lt;br /&gt;
&lt;br /&gt;
Give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
=== Stochastic Binary Neurons ===&lt;br /&gt;
&lt;br /&gt;
Same equations as logistic units, but outputs $1$ (=spike) or $0$ randomly based on the probability. They treat the output of the logistic as the &amp;lt;em&amp;gt;probability&amp;lt;/em&amp;gt; of producing a spike in a short time window.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$P(s = 1) = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.&lt;br /&gt;
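The deterministic neuron models above can each be sketched as a function of the total input $z = b + \sum_i x_i w_i$ (function names are illustrative):&lt;br /&gt;

```python
import math

# Each neuron model as a function of the total input z.

def linear(z):
    return z

def binary_threshold(z):
    return 1.0 if z >= 0 else 0.0

def rectified_linear(z):
    return z if z > 0 else 0.0       # linear above zero, decision at zero

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))   # smooth, bounded in (0, 1)
```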
&lt;br /&gt;
== Types of Learning ==&lt;br /&gt;
&lt;br /&gt;
=== Supervised Learning ===&lt;br /&gt;
Learn to predict an output when given an input vector.&lt;br /&gt;
* Regression: The target output is a real number or a whole vector of real numbers.&lt;br /&gt;
* Classification: The target output is a class label.&lt;br /&gt;
&lt;br /&gt;
==== How Supervised Learning Typically Works ====&lt;br /&gt;
# Start by choosing a model-class: $y = f(x;W)$&lt;br /&gt;
#* A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.&lt;br /&gt;
# Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.&lt;br /&gt;
#* For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.&lt;br /&gt;
#* For classification there are other measures that are generally more sensible (they also work better).&lt;br /&gt;
&lt;br /&gt;
=== Reinforcement Learning ===&lt;br /&gt;
Learn to select an action to maximize payoff.&lt;br /&gt;
&lt;br /&gt;
* The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.&lt;br /&gt;
** The goal in selecting each action is to maximize the expected sum of the future rewards.&lt;br /&gt;
** We usually use a discount factor for delayed rewards so that we don&amp;#039;t have to look too far into the future.&lt;br /&gt;
* Reinforcement learning is difficult because:&lt;br /&gt;
** The rewards are typically delayed so it&amp;#039;s hard to know where we went wrong (or right).&lt;br /&gt;
** A scalar reward does not supply much information.&lt;br /&gt;
* Typically you can&amp;#039;t learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning); you typically learn dozens or thousands of parameters.&lt;br /&gt;
* Will not be covered in this course.&lt;br /&gt;
&lt;br /&gt;
=== Unsupervised Learning ===&lt;br /&gt;
Discover a good internal representation of the input.&lt;br /&gt;
&lt;br /&gt;
* For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).&lt;br /&gt;
* It is hard to say what the aim of unsupervised learning is:&lt;br /&gt;
** One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.&lt;br /&gt;
** You can compute the distance to a surface by using the disparity between two images. But you don&amp;#039;t want to learn to compute disparities by stubbing your toe thousands of times.&lt;br /&gt;
* Other goals:&lt;br /&gt;
** Providing a compact, low-dimensional representation of the input.&lt;br /&gt;
*** High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds)&lt;br /&gt;
*** Principal Component Analysis is a widely used linear method for finding a low-dimensional representation.&lt;br /&gt;
** Providing an economical high-dimensional representation of the input in terms of learned features.&lt;br /&gt;
*** Binary features&lt;br /&gt;
*** Real-valued features that are nearly all zero&lt;br /&gt;
** Finding sensible clusters in the input&lt;br /&gt;
*** This is an example of a &amp;lt;em&amp;gt;very&amp;lt;/em&amp;gt; sparse code in which only one of the features is non-zero.&lt;br /&gt;
&lt;br /&gt;
== Neural Network Architectures ==&lt;br /&gt;
&lt;br /&gt;
# Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.&lt;br /&gt;
# Recurrent neural network: information can flow around in cycles.&lt;br /&gt;
# Symmetrically connected network: weights are the same in both directions between two units.&lt;br /&gt;
&lt;br /&gt;
===  Feed-forward Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* The commonest type of neural network&lt;br /&gt;
* The first layer is the input and the last layer is the output&lt;br /&gt;
* Called &amp;quot;deep&amp;quot; neural networks if there is more than one hidden layer.&lt;br /&gt;
* They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.&lt;br /&gt;
&lt;br /&gt;
=== Recurrent Networks ===&lt;br /&gt;
&lt;br /&gt;
* Have directed cycles in their connection graph.&lt;br /&gt;
* Have complicated dynamics — difficult to train.&lt;br /&gt;
* More biologically realistic.&lt;br /&gt;
* Very natural way to model &amp;lt;em&amp;gt;sequential&amp;lt;/em&amp;gt; data&lt;br /&gt;
* Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).&lt;br /&gt;
&lt;br /&gt;
=== Symmetrically Connected Networks ===&lt;br /&gt;
&lt;br /&gt;
* Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).&lt;br /&gt;
* Much easier to analyze than recurrent networks (John Hopfield, et al)&lt;br /&gt;
* More restricted in what they can do, because they obey an energy function. For example, they cannot model cycles.&lt;br /&gt;
&lt;br /&gt;
== The First Generation of Neural Networks ==&lt;br /&gt;
&lt;br /&gt;
=== Standard Paradigm for Statistical Pattern Recognition ===&lt;br /&gt;
&lt;br /&gt;
* Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).&lt;br /&gt;
* Learn how to weight each of the feature activations to get a single scalar quantity.&lt;br /&gt;
* If this quantity is above some threshold, decide that the input vector is a positive example of the target class.&lt;br /&gt;
&lt;br /&gt;
=== Perceptrons ===&lt;br /&gt;
&lt;br /&gt;
input units =&amp;gt; (by hand-coded programs) =&amp;gt; feature units =&amp;gt; decision unit&lt;br /&gt;
&lt;br /&gt;
* Popularized by Frank Rosenblatt in the early 1960s.&lt;br /&gt;
* Lots of grand claims were made for what they could learn to do.&lt;br /&gt;
* In 1969, Minsky and Papert showed Perceptrons&amp;#039; limitations (&amp;#039;&amp;#039;&amp;#039;Group Invariance Theorem&amp;#039;&amp;#039;&amp;#039;). Many people thought these limitations applied to all neural network models.&lt;br /&gt;
* Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.&lt;br /&gt;
* Decision units in perceptrons are &amp;#039;&amp;#039;&amp;#039;binary threshold neurons&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Perceptron Learning Procedure ====&lt;br /&gt;
&lt;br /&gt;
# Add an extra component with value 1 to each input vector (&amp;quot;bias&amp;quot; weight) and forget about the threshold. (bias) = -(threshold)&lt;br /&gt;
# Pick training cases using any policy that ensures that every training case will keep getting picked.&lt;br /&gt;
#* If the output unit is correct, leave its weights alone.&lt;br /&gt;
#* If the output unit incorrectly outputs a zero, add the input vector to the weight vector.&lt;br /&gt;
#* If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.&lt;br /&gt;
# This is guaranteed to find a set of weights that gets the right answer for all the training cases &amp;#039;&amp;#039;if any such set exists&amp;#039;&amp;#039;.&lt;br /&gt;
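The procedure above can be sketched directly; the AND-like training set and the helper name are illustrative, with the bias input of 1 already appended to each vector:&lt;br /&gt;

```python
# Perceptron learning procedure for a binary threshold decision unit.

def perceptron_update(w, x, target):
    z = sum(wi * xi for wi, xi in zip(w, x))
    y = 1 if z >= 0 else 0
    if y == target:
        return w                                  # correct: leave weights alone
    if target == 1:                               # wrongly output 0: add input
        return [wi + xi for wi, xi in zip(w, x)]
    return [wi - xi for wi, xi in zip(w, x)]      # wrongly output 1: subtract input

# Linearly separable case (AND of two bits; last component is the bias input):
w = [0.0, 0.0, 0.0]
data = [([1, 1, 1], 1), ([1, 0, 1], 0), ([0, 1, 1], 0), ([0, 0, 1], 0)]
for _ in range(20):
    for x, t in data:
        w = perceptron_update(w, x, t)
```

Because a separating weight vector exists for this data, the procedure settles on weights that classify all four cases correctly.&lt;br /&gt;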
&lt;br /&gt;
==== Geometrical View of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* Weight space (each point corresponds to a particular setting for all the weights)&lt;br /&gt;
* Each training case represents a hyperplane through the origin (if we eliminate the threshold/bias). That hyperplane is perpendicular to the &amp;#039;&amp;#039;input&amp;#039;&amp;#039; vector.&lt;br /&gt;
** The weights must lie on one side of this hyperplane to get the answer right.&lt;br /&gt;
* To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).&lt;br /&gt;
* If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. This means the problem is &amp;#039;&amp;#039;&amp;#039;convex&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Limitations of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.&lt;br /&gt;
* But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.&lt;br /&gt;
** A binary threshold output unit cannot even tell if two single bit features are the same!&lt;br /&gt;
*** Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$&lt;br /&gt;
*** Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$&lt;br /&gt;
*** The four input-output pairs give four inequalities that are impossible to satisfy.&lt;br /&gt;
** Imagine &amp;#039;data-space&amp;#039; in which the axes correspond to components of an input vector.&lt;br /&gt;
*** Each input vector is a point in this space, and a weight vector defines a plane in data-space.&lt;br /&gt;
*** The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.&lt;br /&gt;
*** There are many cases where we cannot separate positive and negative cases only by a hyperplane. Those cases are called &amp;#039;&amp;#039;&amp;#039;not linearly separable&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Cannot discriminate simple patterns under translation with wrap-around.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Conclusion ====&lt;br /&gt;
&lt;br /&gt;
* Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help, it&amp;#039;s still linear. Fixed output non-linearities are not enough.&lt;br /&gt;
* We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?&lt;br /&gt;
** We need an efficient way of adapting all the weights, not just the last layer. This is hard.&lt;br /&gt;
** Learning the weights going into hidden units is equivalent to learning features.&lt;br /&gt;
** This is difficult because nobody is telling us directly what the hidden units should do.&lt;br /&gt;
&lt;br /&gt;
== Linear Neurons ==&lt;br /&gt;
&lt;br /&gt;
=== Learning Overview ===&lt;br /&gt;
&lt;br /&gt;
$y = \sum_{i} w_i x_i = \mathbb{w}^T \mathbb{x}$&lt;br /&gt;
&lt;br /&gt;
$y$: Neuron&amp;#039;s estimate of the desired output, $\mathbb{w}$: weight vector, $\mathbb{x}$: input vector&lt;br /&gt;
&lt;br /&gt;
* Perceptron convergence procedure doesn&amp;#039;t work for more complex networks (because the problem is not &amp;#039;&amp;#039;convex&amp;#039;&amp;#039;). Multi-layer neural networks do not use the perceptron learning procedure.&lt;br /&gt;
* Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.&lt;br /&gt;
* The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.&lt;br /&gt;
&lt;br /&gt;
* Why don&amp;#039;t we solve it analytically?&lt;br /&gt;
** Scientific answer: We want a method that real neurons could use.&lt;br /&gt;
** Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.&lt;br /&gt;
&lt;br /&gt;
=== Learning Procedure ===&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;delta-rule&amp;#039;&amp;#039; for learning:&lt;br /&gt;
** $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value&lt;br /&gt;
** $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error&lt;br /&gt;
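The delta-rule can be sketched for a single training case (the bias is omitted and the input, target, and learning rate are illustrative):&lt;br /&gt;

```python
# Delta-rule for a linear neuron: delta_w_i = eps * x_i * (t - y).

def delta_rule_step(w, x, t, eps=0.1):
    y = sum(wi * xi for wi, xi in zip(w, x))      # neuron's estimate
    return [wi + eps * xi * (t - y) for wi, xi in zip(w, x)]

# Repeated training on one case drives the output toward the target:
w = [0.0, 0.0]
for _ in range(50):
    w = delta_rule_step(w, x=[1.0, 2.0], t=1.0)
y = w[0] * 1.0 + w[1] * 2.0
```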
&lt;br /&gt;
=== The Error Surface for Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
* The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl.&lt;br /&gt;
** Vertical cross-sections are parabolas.&lt;br /&gt;
** Horizontal cross-sections are ellipses&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated.&lt;br /&gt;
* The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.&lt;br /&gt;
* The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.&lt;br /&gt;
&lt;br /&gt;
== Logistic Neurons ==&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
* These give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
* The derivatives:&lt;br /&gt;
** $\frac{\partial z}{\partial w_i} = x_i$&lt;br /&gt;
** $\frac{\partial z}{\partial x_i} = w_i$&lt;br /&gt;
** $\frac{dy}{dz} = y(1-y)$&lt;br /&gt;
** By the chain rule, $\frac{\partial y}{\partial w_i} = x_i y (1-y)$.&lt;br /&gt;
** By the chain rule, $\frac{\partial E}{\partial w_i} = \sum_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum_{n} x_i^n y^n (1-y^n) (t^n - y^n)$.&lt;br /&gt;
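The chain-rule result $\partial y / \partial w_i = x_i y (1-y)$ can be checked numerically (the inputs, weights, and finite-difference step are illustrative):&lt;br /&gt;

```python
import math

# Compare the analytic derivative of a logistic neuron's output with
# a central finite difference.

def logistic(w, x, b=0.0):
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

w, x = [0.3, -0.2], [1.0, 2.0]
y = logistic(w, x)
analytic = x[0] * y * (1 - y)        # dy/dw_0 from the chain rule

h = 1e-6                             # finite-difference check
numeric = (logistic([w[0] + h, w[1]], x)
           - logistic([w[0] - h, w[1]], x)) / (2 * h)
```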
&lt;br /&gt;
== The Backpropagation Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== Other Ideas ===&lt;br /&gt;
&lt;br /&gt;
==== Learning by Perturbing Weights ====&lt;br /&gt;
&lt;br /&gt;
* The idea occurs to everyone who knows about evolution.&lt;br /&gt;
* Randomly perturb one weight and see if it improves performance. If so, save the change.&lt;br /&gt;
** This is a form of reinforcement learning.&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;Very inefficient&amp;#039;&amp;#039;&amp;#039;. We need to do multiple forward passes on a representative set of training cases just to change one weight.&lt;br /&gt;
** Towards the end of learning, large weight perturbations will nearly always make things worse.&lt;br /&gt;
&lt;br /&gt;
==== Learning by Using Perturbations ====&lt;br /&gt;
&lt;br /&gt;
* We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.&lt;br /&gt;
* Not any better, because we need lots of trials on each training case to see the effect of changing one weight through the noise created by all the changes to the other weights.&lt;br /&gt;
* A better idea: Randomly perturb the &amp;#039;&amp;#039;activities&amp;#039;&amp;#039; of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights.&lt;br /&gt;
&lt;br /&gt;
=== The Idea behind Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* We don&amp;#039;t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.&lt;br /&gt;
* We can compute error derivatives for all the hidden units efficiently at the same time.&lt;br /&gt;
** Once we have the error derivatives for the hidden activities, it&amp;#039;s easy to get the error derivatives for the weights going into a hidden unit.&lt;br /&gt;
&lt;br /&gt;
=== Sketch of the Backpropagation Algorithm ===&lt;br /&gt;
&lt;br /&gt;
* First convert the discrepancy between each output and its target value into an error derivative.&lt;br /&gt;
* Then compute error derivatives in each hidden layer from error derivatives in the layer above.&lt;br /&gt;
* Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.&lt;br /&gt;
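The three steps above can be sketched for one hidden layer of logistic units and a single linear output with squared error $E = (y-t)^2/2$ (all sizes and values are illustrative):&lt;br /&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(x, t, w_hid, w_out):
    # forward pass
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    y = sum(w * hi for w, hi in zip(w_out, h))
    # 1) discrepancy at the output -> error derivative
    dE_dy = y - t
    # 2) error derivatives for hidden activities, from the layer above
    dE_dh = [dE_dy * w for w in w_out]
    # 3) derivatives w.r.t. activities -> derivatives w.r.t. incoming weights
    dE_dw_out = [dE_dy * hi for hi in h]
    dE_dw_hid = [[d * hi * (1 - hi) * xi for xi in x]
                 for d, hi in zip(dE_dh, h)]
    return y, dE_dw_out, dE_dw_hid

y, g_out, g_hid = backprop(x=[1.0, 0.5], t=1.0,
                           w_hid=[[0.1, -0.2], [0.3, 0.4]],
                           w_out=[0.5, -0.5])
```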
&lt;br /&gt;
=== Using the Derivatives Computed by Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* The backpropagation algorithm is an efficient way of computing the error derivative $\partial E / \partial w$ for every weight on a single training case.&lt;br /&gt;
* To get a fully specified learning procedure, we still need to make a lot of other decisions about  how to use these error derivatives:&lt;br /&gt;
** Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?&lt;br /&gt;
** Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?&lt;br /&gt;
&lt;br /&gt;
==== Optimization Issues ====&lt;br /&gt;
&lt;br /&gt;
* How often to update the weights:&lt;br /&gt;
** Online: after each training case.&lt;br /&gt;
** Full batch: after a full sweep through the training data.&lt;br /&gt;
** Mini-batch: after a small sample of training cases. This is typically used when training big nets on big data sets.&lt;br /&gt;
* How much to update: (will be discussed further later)&lt;br /&gt;
** Use a fixed learning rate?&lt;br /&gt;
** Adapt the global learning rate?&lt;br /&gt;
** Adapt the learning rate on each connection separately?&lt;br /&gt;
** Don&amp;#039;t use steepest descent?&lt;br /&gt;
&lt;br /&gt;
==== Generalization Issue (Overfitting) ====&lt;br /&gt;
&lt;br /&gt;
* The downside of using powerful models.&lt;br /&gt;
* The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise:&lt;br /&gt;
** The target values may be unreliable (usually only a minor worry).&lt;br /&gt;
** There is &amp;#039;&amp;#039;&amp;#039;sampling error&amp;#039;&amp;#039;&amp;#039;. There will be accidental regularities just because of the particular training cases that were chosen.&lt;br /&gt;
* When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well. &amp;#039;&amp;#039;This is a disaster&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Ways to Reduce Overfitting ====&lt;br /&gt;
&lt;br /&gt;
* A large number of different methods have been developed:&lt;br /&gt;
** Weight-decay: keep weights small / zero.&lt;br /&gt;
** Weight-sharing: insist that many weights have the same value.&lt;br /&gt;
** Early stopping: Make a fake test set. While training the net, stop as soon as the performance on the fake test set starts getting worse.&lt;br /&gt;
** Model averaging: Train many different neural nets and average them.&lt;br /&gt;
** Bayesian fitting of neural nets: really just a fancy form of model averaging.&lt;br /&gt;
** Dropout: Randomly omit a hidden unit.&lt;br /&gt;
** Generative pre-training: (Will be discussed towards the end of the course.)&lt;br /&gt;
&lt;br /&gt;
== Learning to Predict the Next Word ==&lt;br /&gt;
&lt;br /&gt;
=== Diversion into Cognitive Science ===&lt;br /&gt;
&lt;br /&gt;
There has been a long debate in cognitive science between two rival theories of what it means to have a &amp;#039;&amp;#039;concept&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
* The feature theory: A concept is a set of semantic features&lt;br /&gt;
** Good for explaining similarities between concepts.&lt;br /&gt;
** It&amp;#039;s convenient -- a concept is a vector of feature activities.&lt;br /&gt;
* The structuralist theory: The meaning of a concept lies in its relationships to other concepts.&lt;br /&gt;
** So conceptual knowledge is best expressed as a relational graph.&lt;br /&gt;
** Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.&lt;br /&gt;
&lt;br /&gt;
Both sides are wrong.&lt;br /&gt;
&lt;br /&gt;
* These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.&lt;br /&gt;
* We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just &amp;quot;seeing&amp;quot; the answer with no conscious intervening steps.&lt;br /&gt;
&lt;br /&gt;
==== Localist and Distributed Representations of Concepts ====&lt;br /&gt;
&lt;br /&gt;
The obvious way to implement a relational graph in a neural net is to treat a neuron as a node in the graph and a connection as a binary relationship. But this &amp;quot;localist&amp;quot; method will not work:&lt;br /&gt;
&lt;br /&gt;
* We need many different types of relationship and the connections in a neural net do not have discrete labels.&lt;br /&gt;
* We need ternary relationships as well as binary ones, e.g. A is between B and C&lt;br /&gt;
&lt;br /&gt;
The right way to implement relational knowledge in a neural net is still an open issue.&lt;br /&gt;
&lt;br /&gt;
* But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a &amp;quot;distributed representation&amp;quot;. It&amp;#039;s a many-to-many mapping between concepts and neurons.&lt;br /&gt;
&lt;br /&gt;
=== The Softmax Output Function (Another Diversion) ===&lt;br /&gt;
&lt;br /&gt;
The squared error measure has some drawbacks. Is there a different cost function that works better?&lt;br /&gt;
&lt;br /&gt;
Yes, force the outputs to represent a probability distribution across discrete alternatives.&lt;br /&gt;
&lt;br /&gt;
==== Softmax ====&lt;br /&gt;
&lt;br /&gt;
The output units in a softmax group use a non-local non-linearity. Let $y_i$ be the output of a neuron in a softmax group, and $z_i$ (called the &amp;#039;&amp;#039;logit&amp;#039;&amp;#039;) the input of the same neuron:&lt;br /&gt;
&lt;br /&gt;
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j} }$$&lt;br /&gt;
&lt;br /&gt;
Note that $\sum y_i = 1$, so we force the $y_i$ to represent a probability distribution. We can show:&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$&lt;br /&gt;
&lt;br /&gt;
The right cost function to use with softmax is the negative log probability of the right answer. This is called the &amp;#039;&amp;#039;Cross Entropy&amp;#039;&amp;#039; cost function. ($t_j$ is the target value.)&lt;br /&gt;
&lt;br /&gt;
$$C = - \sum_{j} t_j \log y_j$$&lt;br /&gt;
&lt;br /&gt;
$C$ has a very big gradient when the target value is 1 and the output is almost zero.&lt;br /&gt;
* A value of 0.000001 is much better than 0.000000001 (one-in-a-million chance vs one-in-a-billion chance).&lt;br /&gt;
* The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$.&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial C}{\partial z_i} = \sum_{j} \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$&lt;br /&gt;
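&lt;br /&gt;
The clean gradient $y_i - t_i$ is easy to check numerically. A minimal sketch (assuming NumPy; the logits and one-hot target are made up for the example):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability; the outputs are unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, t):
    # Negative log probability of the right answer: C = -sum_j t_j log y_j
    return -np.sum(t * np.log(y))

z = np.array([2.0, 1.0, 0.1])   # made-up logits
t = np.array([1.0, 0.0, 0.0])   # one-hot target
y = softmax(z)

# Analytic gradient of C with respect to the logits: dC/dz_i = y_i - t_i
analytic = y - t

# Central-difference numerical gradient for comparison.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(softmax(zp), t) - cross_entropy(softmax(zm), t)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```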
&lt;br /&gt;
&lt;br /&gt;
=== Neuro-Probabilistic Language Models ===&lt;br /&gt;
&lt;br /&gt;
* We cannot identify phonemes perfectly in noisy speech.&lt;br /&gt;
* People use their understanding of the meaning of the utterance to hear the right words.&lt;br /&gt;
** &amp;quot;We do this unconsciously when we wreck a nice beach&amp;quot;&lt;br /&gt;
** We are very good at it.&lt;br /&gt;
* This means speech recognizers have to know which words are likely to come next and which are not.&lt;br /&gt;
** Fortunately words can be predicted quite well without full understanding.&lt;br /&gt;
&lt;br /&gt;
==== The Standard &amp;quot;Trigram&amp;quot; Method ====&lt;br /&gt;
&lt;br /&gt;
* Take a huge amount of text and count the frequencies of all triples of words.&lt;br /&gt;
&lt;br /&gt;
* Use these frequencies to make bets on the relative probabilities of words given the previous two words.&lt;br /&gt;
&lt;br /&gt;
* Until very recently this was the state-of-the-art.&lt;br /&gt;
** We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.&lt;br /&gt;
** We have to &amp;quot;back off&amp;quot; to bigrams when the count for a trigram is too small. The probability is not zero just because the count is zero!&lt;br /&gt;
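&lt;br /&gt;
The count-and-back-off scheme can be sketched in a few lines. This is a toy illustration (the corpus, the threshold, and the crude unigram floor are made up for the example), not a full back-off model with discounting:&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter

# Toy corpus; in practice the counts come from a huge amount of text.
corpus = "the cat sat on the mat the cat ate the rat".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams  = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def next_word_prob(w1, w2, w3, min_count=2):
    """P(w3 | w1, w2), backing off to the bigram when the trigram count is small."""
    if trigrams[(w1, w2, w3)] >= min_count and bigrams[(w1, w2)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    # Back off: a zero trigram count does not mean the probability is zero.
    if bigrams[(w2, w3)] > 0:
        return bigrams[(w2, w3)] / unigrams[w2]
    return 1.0 / len(unigrams)   # crude floor over the vocabulary
```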
&lt;br /&gt;
* A trigram model doesn&amp;#039;t understand the similarities between words.&lt;br /&gt;
** The sentence &amp;quot;The cat got squashed in the garden on friday&amp;quot; should help us predict words in the sentence &amp;quot;The dog got flattened in the yard on monday&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.&lt;br /&gt;
** Using a feature representation also allows a context that contains many more previous words, e.g. 10.&lt;br /&gt;
&lt;br /&gt;
* Yoshua Bengio&amp;#039;s neural net for predicting the next word was an early model of this kind: it converts each previous word to a learned feature vector and predicts the next word from those features.&lt;br /&gt;
&lt;br /&gt;
== Object Recognition ==&lt;br /&gt;
&lt;br /&gt;
=== Why Object Recognition is Difficult ===&lt;br /&gt;
&lt;br /&gt;
* Segmentation: Real scenes are cluttered with other objects.&lt;br /&gt;
* Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.&lt;br /&gt;
* Deformation: Objects can deform in a variety of non-affine ways.&lt;br /&gt;
* Affordances: Object classes are things defined by how they are used.&lt;br /&gt;
* Viewpoint: Changes in viewpoint cause changes in images that standard learning methods cannot cope with.&lt;br /&gt;
&lt;br /&gt;
=== Achieving Viewpoint Invariance ===&lt;br /&gt;
&lt;br /&gt;
* Viewpoint invariance is one of the main difficulties in making computers perceive.&lt;br /&gt;
* Several different approaches:&lt;br /&gt;
** Use redundant invariant features.&lt;br /&gt;
** Put a box around the object and use normalized pixels.&lt;br /&gt;
** Use replicated features with pooling. This is called &amp;#039;&amp;#039;convolutional neural nets&amp;#039;&amp;#039; (in later lectures).&lt;br /&gt;
** Use a hierarchy of parts that have explicit poses relative to the camera (in later lectures).&lt;br /&gt;
&lt;br /&gt;
==== Invariant Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* Extract a large, redundant set of features that are invariant under transformations.&lt;br /&gt;
** e.g. pair of roughly parallel lines with a red dot between them (this is what baby herring gulls use to know where to peck for food).&lt;br /&gt;
* With enough invariant features, there is only one way to assemble them into an object.&lt;br /&gt;
* But for recognition, we must avoid forming features from parts of different objects.&lt;br /&gt;
&lt;br /&gt;
==== Judicious Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* Put a box around the object and use it as a coordinate frame for a set of normalized pixels.&lt;br /&gt;
** This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.&lt;br /&gt;
** The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, ...&lt;br /&gt;
&lt;br /&gt;
* But choosing the box is difficult because of:&lt;br /&gt;
** Segmentation errors, occlusion, unusual orientations.&lt;br /&gt;
&lt;br /&gt;
* We need to recognize the shape to get the box right!&lt;br /&gt;
&lt;br /&gt;
==== Brute Force Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* When training the recognizer, use well-segmented, upright images to fit the correct box.&lt;br /&gt;
&lt;br /&gt;
* At test time try all possible boxes in a range of positions and scales.&lt;br /&gt;
&lt;br /&gt;
** This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.&lt;br /&gt;
** It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Digit Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== The Replicated Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* This is currently the dominant approach for neural networks.&lt;br /&gt;
&lt;br /&gt;
* Use many different copies of the same feature detector with different positions.&lt;br /&gt;
&lt;br /&gt;
** Could also replicate across scale and orientation but it is tricky and expensive.&lt;br /&gt;
** Replication greatly reduces the number of free parameters to be learned.&lt;br /&gt;
&lt;br /&gt;
* Use several different feature types, each with its own map of replicated detectors.&lt;br /&gt;
&lt;br /&gt;
==== Backpropagation with Weight Constraints ====&lt;br /&gt;
&lt;br /&gt;
* It&amp;#039;s easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.&lt;br /&gt;
&lt;br /&gt;
* We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints - so if the weights started off satisfying the constraints, they will continue to satisfy them.&lt;br /&gt;
&lt;br /&gt;
To constrain $w_1 = w_2$, we need $\Delta w_1 = \Delta w_2$.&lt;br /&gt;
&lt;br /&gt;
Compute $\partial E / \partial w_1$ and $\partial E / \partial w_2$, then use $\partial E / \partial w_1 + \partial E / \partial w_2$ for both $w_1$ and $w_2$.&lt;br /&gt;
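&lt;br /&gt;
In code, the constraint amounts to summing the two gradients and applying the same update to both weights. A minimal sketch with made-up gradient values:&lt;br /&gt;
&lt;br /&gt;
```python
# Tied weights w1 = w2: compute the gradients as usual, then use their sum
# for both updates, so weights that start equal stay equal.
w1 = w2 = 0.3
g1, g2 = 0.12, -0.05          # dE/dw1, dE/dw2 from ordinary backprop (made up)
g = g1 + g2                   # shared gradient used for both weights
lr = 0.1
w1 -= lr * g
w2 -= lr * g
assert w1 == w2               # the constraint is preserved after the update
```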
&lt;br /&gt;
==== What Does Replicating the Feature Detectors Achieve? ====&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Equivariant&amp;#039;&amp;#039;&amp;#039; activities: Replicated features do &amp;#039;&amp;#039;not&amp;#039;&amp;#039; make the neural activities invariant to translation. The activities are equivariant.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Invariant&amp;#039;&amp;#039;&amp;#039; knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.&lt;br /&gt;
&lt;br /&gt;
==== Pooling the Outputs of Replicated Feature Detectors ====&lt;br /&gt;
&lt;br /&gt;
* Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.&lt;br /&gt;
** This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.&lt;br /&gt;
** Taking the maximum of the four works slightly better.&lt;br /&gt;
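&lt;br /&gt;
Both pooling variants can be sketched with a reshape trick (assuming NumPy; the function name is made up for the example):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """Pool non-overlapping 2x2 neighborhoods of a 2-D feature map."""
    h, w = fmap.shape
    # Group the map into 2x2 blocks, then reduce each block to one value.
    blocks = fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # taking the max works slightly better
    return blocks.mean(axis=(1, 3))      # average of the four neighbors

fmap = np.arange(16.0).reshape(4, 4)
pooled = pool2x2(fmap)                   # 2x2 output: precise positions are lost
```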
&lt;br /&gt;
* Problem: After several levels of pooling, we have lost information about the precise positions of things.&lt;br /&gt;
** This makes it impossible to use the precise spatial relationships between high-level parts for recognition.&lt;br /&gt;
&lt;br /&gt;
==== LeNet ====&lt;br /&gt;
&lt;br /&gt;
* Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:&lt;br /&gt;
** Many hidden layers.&lt;br /&gt;
** Many maps of replicated units in each layer.&lt;br /&gt;
** Pooling of the outputs of nearby replicated units.&lt;br /&gt;
** A wide net that can cope with several characters at once even if they overlap.&lt;br /&gt;
** A clever way of training a complete system, not just a recognizer.&lt;br /&gt;
&lt;br /&gt;
* This net was used for reading ~10% of the checks in North America.&lt;br /&gt;
&lt;br /&gt;
* Look at the impressive demos of LeNet at http://yann.lecun.com&lt;br /&gt;
&lt;br /&gt;
==== Priors and Prejudice ====&lt;br /&gt;
&lt;br /&gt;
* We can put our prior knowledge about the task into the network by designing appropriate:&lt;br /&gt;
** Connectivity&lt;br /&gt;
** Weight constraints&lt;br /&gt;
** Neuron activation functions&lt;br /&gt;
&lt;br /&gt;
* This is less intrusive than hand-designing the features&lt;br /&gt;
** But it still prejudices the network towards the particular way of solving the problem that we had in mind.&lt;br /&gt;
&lt;br /&gt;
* Alternatively, we can use our prior knowledge to create a whole lot more training data.&lt;br /&gt;
** This may require a lot of work (Hofman &amp;amp; Tresp, 1993). Real data + synthetic data.&lt;br /&gt;
** It may make learning take much longer.&lt;br /&gt;
** It allows optimization to discover clever ways of using the multi-layer network that we did not think of - and we may never fully understand how it does it.&lt;br /&gt;
&lt;br /&gt;
==== The Brute Force Approach ====&lt;br /&gt;
&lt;br /&gt;
* LeNet uses knowledge about the invariances to design:&lt;br /&gt;
** the local connectivity&lt;br /&gt;
** the weight-sharing&lt;br /&gt;
** the pooling&lt;br /&gt;
&lt;br /&gt;
* This achieves about 80 errors on the MNIST test set.&lt;br /&gt;
** This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008).&lt;br /&gt;
&lt;br /&gt;
* Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:&lt;br /&gt;
** For each training image, they produce many new training examples by applying many different transformations.&lt;br /&gt;
** They can then train a large, deep, dumb net on a GPU without much overfitting.&lt;br /&gt;
** They achieved about 35 errors.&lt;br /&gt;
&lt;br /&gt;
* With model averaging they can now get about 25 errors (which is about the same as human error rate).&lt;br /&gt;
&lt;br /&gt;
==== How to Detect a Significant Drop in the Error Rate ====&lt;br /&gt;
&lt;br /&gt;
* Is 30 errors in 10,000 test cases significantly better than 40 errors?&lt;br /&gt;
** It all depends on the particular errors.&lt;br /&gt;
** The McNemar test uses the particular errors and can be much more powerful than a test that just uses the number of errors.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Object Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== From Hand-Written Digits to 3D Objects ====&lt;br /&gt;
&lt;br /&gt;
* Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:&lt;br /&gt;
&lt;br /&gt;
** A hundred times as many classes (1000 vs 10).&lt;br /&gt;
** Roughly a hundred times as many pixels (256x256 color vs 28x28 gray).&lt;br /&gt;
** Two dimensional image of three-dimensional scene.&lt;br /&gt;
** Cluttered scenes requiring segmentation.&lt;br /&gt;
** Multiple objects in each image.&lt;br /&gt;
&lt;br /&gt;
* Will the same type of convolutional neural network work?&lt;br /&gt;
&lt;br /&gt;
==== The ILSVRC-2012 Competition on ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* The dataset has 1.2 million high-resolution training images.&lt;br /&gt;
&lt;br /&gt;
* The classification task:&lt;br /&gt;
** Get the &amp;quot;correct&amp;quot; class in your top 5 bets. There are 1000 classes.&lt;br /&gt;
&lt;br /&gt;
* The localization task:&lt;br /&gt;
** For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.&lt;br /&gt;
&lt;br /&gt;
* Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...&lt;br /&gt;
** Computer vision systems use complicated multi-stage systems.&lt;br /&gt;
** The early stages are typically hand-tuned by optimizing a few parameters.&lt;br /&gt;
&lt;br /&gt;
==== Error Rates on the ILSVRC-2012 Competition ====&lt;br /&gt;
&lt;br /&gt;
* University of Tokyo: 26.1% (classification)&lt;br /&gt;
* Oxford University Computer Vision Group: 26.9% (classification)&lt;br /&gt;
* INRIA(French national research institute in CS) + XRCE(Xerox Research Center Europe): 27.0% (classification)&lt;br /&gt;
* University of Amsterdam: 29.5% (classification)&lt;br /&gt;
* University of Toronto (&amp;#039;&amp;#039;&amp;#039;Alex Krizhevsky&amp;#039;&amp;#039;&amp;#039;): &amp;#039;&amp;#039;&amp;#039;16.4%&amp;#039;&amp;#039;&amp;#039; (classification)&lt;br /&gt;
&lt;br /&gt;
==== A Neural Network for ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its &amp;#039;&amp;#039;architecture&amp;#039;&amp;#039; was:&lt;br /&gt;
** 7 hidden layers not counting some max pooling layers.&lt;br /&gt;
** The early layers were convolutional.&lt;br /&gt;
** The last two layers were globally connected.&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;activation functions&amp;#039;&amp;#039; were:&lt;br /&gt;
** Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.&lt;br /&gt;
** Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.&lt;br /&gt;
&lt;br /&gt;
==== Tricks that Significantly Improve Generalization ====&lt;br /&gt;
&lt;br /&gt;
* Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.&lt;br /&gt;
** At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;dropout&amp;quot; to regularize the weights in the globally connected layers (which contain most of the parameters).&lt;br /&gt;
** Dropout means that half of the hidden units in a layer are randomly removed for each training example.&lt;br /&gt;
** This stops hidden units from relying too much on other hidden units.&lt;br /&gt;
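&lt;br /&gt;
A common way to implement this is a random mask per training case. The sketch below uses the &amp;quot;inverted dropout&amp;quot; variant, which rescales at training time instead of halving the outgoing weights at test time:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5):
    """Randomly remove a fraction p of the hidden activities h."""
    mask = rng.random(h.shape) >= p      # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)          # rescale so the expected activity is unchanged

h = np.ones(10000)
out = dropout(h)                         # roughly half the units are zeroed
```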
&lt;br /&gt;
==== The Hardware Required for Alex Krizhevsky&amp;#039;s Net ====&lt;br /&gt;
&lt;br /&gt;
* He uses a very efficient implementation of convolutional nets on two Nvidia GTX 580 Graphics Processor Units (over 1000 fast little cores)&lt;br /&gt;
** GPUs are very good for matrix-matrix multiplies.&lt;br /&gt;
** GPUs have very high bandwidth to memory.&lt;br /&gt;
** This allows him to train the network in a week.&lt;br /&gt;
** It also makes it quick to combine results from 10 patches at test time.&lt;br /&gt;
&lt;br /&gt;
* We can spread a network over many cores if we can communicate the states fast enough.&lt;br /&gt;
&lt;br /&gt;
* As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.&lt;br /&gt;
&lt;br /&gt;
==== Finding Roads in High-Resolution Images ====&lt;br /&gt;
&lt;br /&gt;
* Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images.&lt;br /&gt;
** It takes a large image patch and predicts a binary road label for the central 16x16 pixels.&lt;br /&gt;
** There is lots of labeled training data available for this task.&lt;br /&gt;
&lt;br /&gt;
* The task is hard for many reasons:&lt;br /&gt;
** Occlusion by buildings, trees, and cars.&lt;br /&gt;
** Shadows, Lighting changes.&lt;br /&gt;
** Minor viewpoint changes.&lt;br /&gt;
&lt;br /&gt;
* The worst problems are incorrect labels:&lt;br /&gt;
** Badly registered maps&lt;br /&gt;
** Arbitrary decisions about what counts as a road (vs laneway etc.)&lt;br /&gt;
&lt;br /&gt;
== Mini-Batch Gradient Descent ==&lt;br /&gt;
&lt;br /&gt;
=== Overview ===&lt;br /&gt;
&lt;br /&gt;
* The Error Surface: Lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl where vertical cross-sections are parabolas and horizontal cross-sections are ellipses.&lt;br /&gt;
&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated. But locally, a piece of a quadratic bowl is usually a very good approximation.&lt;br /&gt;
&lt;br /&gt;
* Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.&lt;br /&gt;
** The gradient is big in the direction in which we only want to travel a small distance, small in the direction in which we want to travel a large distance.&lt;br /&gt;
&lt;br /&gt;
* If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges.&lt;br /&gt;
&lt;br /&gt;
* What we would like to achieve: move quickly in directions with small but consistent gradients, and move slowly in directions with big but inconsistent gradients.&lt;br /&gt;
&lt;br /&gt;
==== Stochastic Gradient Descent ====&lt;br /&gt;
&lt;br /&gt;
* If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.&lt;br /&gt;
** So, instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.&lt;br /&gt;
** The extreme version of this approach updates weights after each case. It&amp;#039;s called &amp;quot;online&amp;quot; learning.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches are usually better than online.&lt;br /&gt;
** Less computation is used updating the weights.&lt;br /&gt;
** Computing the gradient for many cases simultaneously uses matrix-matrix multiplies which are very efficient, especially on GPUs.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches need to be balanced for classes (one way to approximate this is to pick points randomly).&lt;br /&gt;
&lt;br /&gt;
==== Two types of learning algorithm ====&lt;br /&gt;
&lt;br /&gt;
* If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).&lt;br /&gt;
&lt;br /&gt;
* For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.&lt;br /&gt;
&lt;br /&gt;
==== Basic Mini-Batch Gradient Descent Algorithm ====&lt;br /&gt;
&lt;br /&gt;
* Guess an initial learning rate.&lt;br /&gt;
** If the error keeps getting worse or oscillates wildly, reduce the learning rate.&lt;br /&gt;
** If the error is falling fairly consistently but slowly, increase the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Write a simple program to automate this way of adjusting the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Toward the end of mini-batch learning, it nearly always helps to turn down the learning rate.&lt;br /&gt;
** This removes fluctuations in the final weights caused by the variations between mini-batches.&lt;br /&gt;
&lt;br /&gt;
* Turn down the learning rate when the error stops decreasing.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Note&amp;#039;&amp;#039;&amp;#039;: Compute the error in a separate validation set.&lt;br /&gt;
&lt;br /&gt;
=== Tricks for Mini-Batch Gradient Descent ===&lt;br /&gt;
&lt;br /&gt;
==== Initializing The Weights ====&lt;br /&gt;
&lt;br /&gt;
* If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.&lt;br /&gt;
&lt;br /&gt;
** So they can never learn to be different features.&lt;br /&gt;
** We break symmetry by initializing the weights to have small random values.&lt;br /&gt;
&lt;br /&gt;
* If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.&lt;br /&gt;
** We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to $1/\sqrt{\text{fan-in}}$.&lt;br /&gt;
&lt;br /&gt;
* We can also scale the learning rate the same way.&lt;br /&gt;
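&lt;br /&gt;
A sketch of fan-in-scaled initialization (the function name is made up; the $1/\sqrt{\text{fan-in}}$ scale follows the note above):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    # Small random values break the symmetry between hidden units; the scale
    # shrinks as fan-in grows, so units with many inputs do not overshoot.
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

w = init_weights(400, 100)
```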
&lt;br /&gt;
==== Shifting the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, shifting the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.&lt;br /&gt;
&lt;br /&gt;
* The hyperbolic tangent (which is $2\times \text{logistic} - 1$) produces hidden activations that are roughly zero mean. &lt;br /&gt;
** In this respect it&amp;#039;s better than the logistic.&lt;br /&gt;
&lt;br /&gt;
==== Scaling the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, scaling the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.&lt;br /&gt;
&lt;br /&gt;
==== A More Thorough Method: Decorrelate the Input Components ====&lt;br /&gt;
&lt;br /&gt;
* For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.&lt;br /&gt;
&lt;br /&gt;
* There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.&lt;br /&gt;
** Drop the principal components with the smallest eigenvalues ― &amp;#039;&amp;#039;&amp;#039;This achieves some dimensionality reduction&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.&lt;br /&gt;
&lt;br /&gt;
* For a circular error surface, the gradient points straight towards the minimum.&lt;br /&gt;
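&lt;br /&gt;
A sketch of the whole recipe on made-up correlated 2-D inputs: zero-mean the data, rotate onto the principal components, and divide by the square roots of the eigenvalues.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up correlated inputs: the second component mostly tracks the first.
x = rng.normal(size=(1000, 2))
x[:, 1] = 0.9 * x[:, 0] + 0.1 * x[:, 1]

x = x - x.mean(axis=0)                   # zero mean for each component
cov = x.T @ x / len(x)
eigvals, eigvecs = np.linalg.eigh(cov)

# Rotate onto the principal components, then divide by sqrt(eigenvalue).
# (Here both components are kept; dropping small-eigenvalue components
# before this step gives the dimensionality reduction mentioned above.)
z = (x @ eigvecs) / np.sqrt(eigvals)

cov_z = z.T @ z / len(z)                 # decorrelated: the identity matrix
```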
&lt;br /&gt;
&lt;br /&gt;
==== Common Problems That Occur in Multilayer Networks ====&lt;br /&gt;
&lt;br /&gt;
* If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.&lt;br /&gt;
** The error derivatives for the hidden units will all become tiny and the error will not decrease.&lt;br /&gt;
** This is usually a plateau, but people often mistake it for a local minimum.&lt;br /&gt;
&lt;br /&gt;
* In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.&lt;br /&gt;
** The network finds this strategy quickly and may take a long time to improve on it by making use of the input.&lt;br /&gt;
** This is another plateau that looks like a local minimum.&lt;br /&gt;
&lt;br /&gt;
==== Be Careful about Turning Down the Learning Rate ====&lt;br /&gt;
&lt;br /&gt;
* Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.&lt;br /&gt;
** So we get a quick win&lt;br /&gt;
** But then we get slower learning.&lt;br /&gt;
&lt;br /&gt;
* Don&amp;#039;t turn down the learning rate too soon!&lt;br /&gt;
&lt;br /&gt;
==== Four Ways to Speed up Mini-Batch Learning ====&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;momentum&amp;quot;&lt;br /&gt;
** Instead of using the gradient to change the &amp;#039;&amp;#039;position&amp;#039;&amp;#039; of the weight &amp;quot;particle&amp;quot;, use it to change the &amp;#039;&amp;#039;velocity&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
* Use separate adaptive learning rates for each parameter.&lt;br /&gt;
** Slowly adjust the rate using the consistency of the gradient for that parameter.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;rmsprop&amp;#039;&amp;#039;&amp;#039;: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. &lt;br /&gt;
** This is the mini-batch version of just using the sign of the gradient.&lt;br /&gt;
&lt;br /&gt;
* Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)&lt;br /&gt;
** Adapt it to work for neural nets.&lt;br /&gt;
** Adapt it to work for mini-batches.&lt;br /&gt;
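&lt;br /&gt;
The rmsprop idea above, sketched for a single weight (the decay constant 0.9 and the tiny epsilon are typical made-up choices; rmsprop is covered in more detail below):&lt;br /&gt;
&lt;br /&gt;
```python
def rmsprop_step(w, mean_sq, grad, lr=0.001, decay=0.9, tiny=1e-8):
    """One rmsprop update: divide by a running RMS of recent gradient magnitudes."""
    mean_sq = decay * mean_sq + (1.0 - decay) * grad ** 2
    w = w - lr * grad / ((mean_sq + tiny) ** 0.5)
    return w, mean_sq

# Big and small gradients produce steps of similar size.
w_big, _ = rmsprop_step(0.0, 0.0, grad=100.0)
w_small, _ = rmsprop_step(0.0, 0.0, grad=0.01)
```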
&lt;br /&gt;
=== The Momentum Method ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.&lt;br /&gt;
# The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.&lt;br /&gt;
# Its momentum makes it keep going in the previous direction.&lt;br /&gt;
&lt;br /&gt;
* It damps oscillations in directions of high curvature by combining gradients with opposite signs.&lt;br /&gt;
&lt;br /&gt;
* It builds up speed in directions with a gentle but consistent gradient.&lt;br /&gt;
&lt;br /&gt;
==== Equations ====&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(t) = \alpha \mathbb{v}(t-1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The effect of the gradient is to increment the previous velocity. The velocity also decays by a factor $\alpha$, which is slightly less than 1.&lt;br /&gt;
&lt;br /&gt;
The weight change is equal to the current velocity.&lt;br /&gt;
&lt;br /&gt;
$$\Delta \mathbb{w}(t) = \mathbb{v} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \mathbb{v} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \Delta \mathbb{w} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The weight change can be expressed in terms of the previous weight change and the current gradient.&lt;br /&gt;
&lt;br /&gt;
==== Behavior ====&lt;br /&gt;
&lt;br /&gt;
* If the error surface is a tilted plane, the ball reaches a terminal velocity $\mathbb{v}(\infty)$.&lt;br /&gt;
** If the momentum is close to 1, this is much faster than simple gradient descent.&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(\infty) = \frac{1}{1 - \alpha} \Big( - \epsilon \frac{\partial E}{\partial \mathbb{w}} \Big)$$&lt;br /&gt;
&lt;br /&gt;
* At the beginning of learning there may be very large gradients.&lt;br /&gt;
** So it pays to use a small momentum (e.g. 0.5).&lt;br /&gt;
** Once the large gradients have disappeared and the weights are stuck in a ravine the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99)&lt;br /&gt;
&lt;br /&gt;
* This allows us to learn at a rate that would cause divergent oscillations without the momentum.&lt;br /&gt;
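&lt;br /&gt;
The terminal-velocity claim can be checked directly. A minimal sketch on a tilted plane, where the gradient is constant:&lt;br /&gt;
&lt;br /&gt;
```python
# On a tilted plane the gradient is constant, so the velocity approaches
# v(inf) = -eps * grad / (1 - alpha), as derived above.
alpha, eps = 0.9, 0.01
grad = 2.0
v = 0.0
for _ in range(500):
    v = alpha * v - eps * grad          # v(t) = alpha * v(t-1) - eps * dE/dw
terminal = -eps * grad / (1.0 - alpha)  # ten times the plain gradient step here
assert round(v - terminal, 9) == 0
```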
&lt;br /&gt;
==== Better Type of Momentum (Nesterov 1983) ====&lt;br /&gt;
&lt;br /&gt;
* The standard momentum method &amp;#039;&amp;#039;first&amp;#039;&amp;#039; computes the gradient at the current location and &amp;#039;&amp;#039;then&amp;#039;&amp;#039; takes a big jump in the direction of the updated accumulated gradient.&lt;br /&gt;
* Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.&lt;br /&gt;
** Inspired by the Nesterov method for optimizing convex functions.&lt;br /&gt;
* &amp;#039;&amp;#039;First&amp;#039;&amp;#039; make a big jump in the direction of the previous accumulated gradient.&lt;br /&gt;
* &amp;#039;&amp;#039;Then&amp;#039;&amp;#039; measure the gradient where you end up and make a correction.&lt;br /&gt;
** It&amp;#039;s better to correct a mistake &amp;#039;&amp;#039;after&amp;#039;&amp;#039; you have made it!&lt;br /&gt;
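&lt;br /&gt;
A one-weight sketch of the jump-then-correct step, on the made-up quadratic error $E = w^2/2$ (so $dE/dw = w$): the gradient is measured at the look-ahead point reached by the big jump.&lt;br /&gt;
&lt;br /&gt;
```python
def nesterov_step(w, v, alpha=0.9, eps=0.1):
    """First jump along the accumulated gradient, then correct."""
    lookahead = w + alpha * v            # big jump in the previous direction
    grad = lookahead                     # dE/dw measured where we ended up
    v = alpha * v - eps * grad           # correction uses the look-ahead gradient
    return w + v, v

w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v)
# w has decayed close to the minimum at 0
```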
&lt;br /&gt;
=== Adaptive Learning Rates for Each Connection ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
* In a multilayer net, the appropriate learning rates can vary widely between weights:&lt;br /&gt;
** The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.&lt;br /&gt;
** The fan-in of a unit determines the size of the &amp;quot;overshoot&amp;quot; effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.&lt;br /&gt;
&lt;br /&gt;
* So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.&lt;br /&gt;
&lt;br /&gt;
==== One Way to Determine the Individual Learning Rates ====&lt;br /&gt;
&lt;br /&gt;
$$\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$$&lt;br /&gt;
&lt;br /&gt;
* Start with a local gain of 1 for every weight. i.e. $g_{ij}(0) = 1$&lt;br /&gt;
* Increase the local gain if the gradient for that weight does not change sign.&lt;br /&gt;
* Use small additive increases and multiplicative decreases.&lt;br /&gt;
** This ensures that big gains decay rapidly when oscillations start.&lt;br /&gt;
** If the gradient is totally random, the gain will hover around 1, since we increase it additively by $\delta$ half the time and decrease it multiplicatively by a factor $1 - \delta$ half the time.&lt;br /&gt;
&lt;br /&gt;
$$\text{if } \Big( \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) \Big) &amp;gt; 0$$&lt;br /&gt;
$$\text{then } g_{ij}(t) = g_{ij}(t - 1) + 0.05$$&lt;br /&gt;
$$\text{else } g_{ij}(t) = g_{ij}(t - 1) * 0.95$$&lt;br /&gt;
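&lt;br /&gt;
The additive-increase, multiplicative-decrease rule above, sketched for one weight (the gradient history values are made up):&lt;br /&gt;
&lt;br /&gt;
```python
def update_gain(gain, grad, prev_grad, up=0.05, down=0.95):
    """Grow the gain additively while the gradient sign is consistent,
    shrink it multiplicatively as soon as the sign flips."""
    if grad * prev_grad > 0:
        return gain + up
    return gain * down

gain = 1.0
grads = [0.3, 0.2, 0.25, -0.1, 0.05]     # made-up gradient history for one weight
for g_now, g_prev in zip(grads[1:], grads):
    gain = update_gain(gain, g_now, g_prev)
# the gain grew twice, then decayed twice once the sign started flipping
```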
&lt;br /&gt;
==== Tricks for Making Adaptive Learning Rates Work Better ====&lt;br /&gt;
&lt;br /&gt;
* Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].&lt;br /&gt;
&lt;br /&gt;
* Use full batch learning or very big mini-batches.&lt;br /&gt;
** This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini batch.&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates can be combined with momentum.&lt;br /&gt;
** Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates only deal with axis-aligned effects.&lt;br /&gt;
** Momentum does not care about the alignment of the axes.&lt;br /&gt;
&lt;br /&gt;
=== RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude ===&lt;br /&gt;
&lt;br /&gt;
==== RPROP: Using only the sign of the gradient ====&lt;br /&gt;
&lt;br /&gt;
* The magnitude of the gradient can be very different for different weights and can change during learning.&lt;br /&gt;
** This makes it hard to choose a single global learning rate.&lt;br /&gt;
&lt;br /&gt;
* For &amp;#039;&amp;#039;full batch learning&amp;#039;&amp;#039;, we can deal with this variation by only using the sign of the gradient.&lt;br /&gt;
** The weight updates are all of the same magnitude.&lt;br /&gt;
** This escapes from plateaus with tiny gradients quickly.&lt;br /&gt;
&lt;br /&gt;
* RPROP: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.&lt;br /&gt;
** Increase the step size for a weight &amp;#039;&amp;#039;multiplicatively&amp;#039;&amp;#039; (e.g. times 1.2) if the signs of its last two gradients agree.&lt;br /&gt;
** Otherwise decrease the step size multiplicatively (e.g. times 0.5).&lt;br /&gt;
** Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster&amp;#039;s advice).&lt;br /&gt;
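The three bullets above can be sketched as one update step. This is a simplified illustration (names and defaults are assumptions; real RPROP variants also revert the previous step when the sign flips, which is omitted here):

```python
import numpy as np

def rprop_step(w, grad, step, prev_grad,
               up=1.2, down=0.5, step_min=1e-6, step_max=50.0):
    """One full-batch RPROP update using only the sign of the gradient.

    Step sizes grow multiplicatively while the last two gradients agree
    in sign and shrink otherwise, kept within [step_min, step_max].
    """
    agree = grad * prev_grad > 0
    step = np.clip(np.where(agree, step * up, step * down),
                   step_min, step_max)
    w = w - np.sign(grad) * step
    return w, step
```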
&lt;br /&gt;
==== Why RPROP Does Not Work with Mini-Batches ====&lt;br /&gt;
&lt;br /&gt;
* The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.&lt;br /&gt;
** Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.&lt;br /&gt;
** We want this weight to stay roughly where it is.&lt;br /&gt;
&lt;br /&gt;
* RPROP would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale), so the weight would grow a lot.&lt;br /&gt;
* Is there a way to combine:&lt;br /&gt;
** The robustness of RPROP.&lt;br /&gt;
** The efficiency of mini-batches.&lt;br /&gt;
** The effective averaging of gradients over mini-batches.&lt;br /&gt;
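The nine-plus-one example can be checked numerically. In this illustrative sketch the step-size adaptation is frozen (as assumed above), so only the sign of each mini-batch gradient matters:

```python
# Gradients seen over ten mini-batches: they sum to zero, so the
# weight should stay roughly where it is.
grads = [0.1] * 9 + [-0.9]

eps = 0.01
sgd_w = 0.0
for g in grads:
    sgd_w -= eps * g              # accumulates -eps * sum(grads) = 0

step = 0.01                       # fixed step: sign-only updates
sign_w = 0.0
for g in grads:
    sign_w -= step * (1 if g > 0 else -1)

print(sgd_w)    # about 0: the gradients average out
print(sign_w)   # about -0.08: nine steps one way, one step back
```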
&lt;br /&gt;
==== RMSPROP: Mini-Batch Version of RPROP ====&lt;br /&gt;
&lt;br /&gt;
* RPROP is equivalent to using the gradient but also dividing by the size of the gradient.&lt;br /&gt;
** The problem with mini-batch RPROP is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?&lt;br /&gt;
&lt;br /&gt;
* RMSPROP: Keep a moving average of the squared gradient for each weight.&lt;br /&gt;
$$MeanSquare(w,t) = 0.9 MeanSquare(w, t-1) + 0.1 \Big( \frac{\partial E}{\partial w}(t) \Big) ^2$$&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=472</id>
		<title>Neural Networks (Geoffrey Hinton Course)</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=472"/>
		<updated>2017-01-15T03:57:14Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Some Simple Models of Neurons ==&lt;br /&gt;
&lt;br /&gt;
$y$ output, $x_i$ input.&lt;br /&gt;
&lt;br /&gt;
=== Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$y = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$w_i$ weights, $b$ bias&lt;br /&gt;
&lt;br /&gt;
=== Binary Threshold Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq \theta$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
Or, equivalently,&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq 0$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
=== Rectified Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = z$ if $z &amp;gt; 0$, $0$ otherwise. (linear above zero, decision at zero.)&lt;br /&gt;
&lt;br /&gt;
=== Sigmoid Neurons ===&lt;br /&gt;
&lt;br /&gt;
Give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
=== Stochastic Binary Neurons ===&lt;br /&gt;
&lt;br /&gt;
Same equations as logistic units, but outputs $1$ (=spike) or $0$ randomly based on the probability. They treat the output of the logistic as the &amp;lt;em&amp;gt;probability&amp;lt;/em&amp;gt; of producing a spike in a short time window.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$P(s = 1) = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.&lt;br /&gt;
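The four deterministic neuron types above can be summarized in a short sketch (plain Python; the function names are illustrative):

```python
import math

def linear(z):
    return z

def binary_threshold(z):
    return 1.0 if z >= 0 else 0.0    # threshold folded into the bias

def rectified_linear(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f):
    """Total input z = b + sum_i x_i w_i, passed through nonlinearity f."""
    z = b + sum(xi * wi for xi, wi in zip(x, w))
    return f(z)
```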
&lt;br /&gt;
== Types of Learning ==&lt;br /&gt;
&lt;br /&gt;
=== Supervised Learning ===&lt;br /&gt;
Learn to predict an output when given an input vector.&lt;br /&gt;
* Regression: The target output is a real number or a whole vector of real numbers.&lt;br /&gt;
* Classification: The target output is a class label.&lt;br /&gt;
&lt;br /&gt;
==== How Supervised Learning Typically Works ====&lt;br /&gt;
# Start by choosing a model-class: $y = f(x;W)$&lt;br /&gt;
#* A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.&lt;br /&gt;
# Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.&lt;br /&gt;
#* For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.&lt;br /&gt;
#* For classification there are other measures that are generally more sensible (they also work better).&lt;br /&gt;
&lt;br /&gt;
=== Reinforcement Learning ===&lt;br /&gt;
Learn to select an action to maximize payoff.&lt;br /&gt;
&lt;br /&gt;
* The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.&lt;br /&gt;
** The goal in selecting each action is to maximize the expected sum of the future rewards.&lt;br /&gt;
** We usually use a discount factor for delayed rewards so that we don&amp;#039;t have to look too far into the future.&lt;br /&gt;
* Reinforcement learning is difficult because:&lt;br /&gt;
** The rewards are typically delayed so it&amp;#039;s hard to know where we went wrong (or right).&lt;br /&gt;
** A scalar reward does not supply much information.&lt;br /&gt;
* Typically you can&amp;#039;t learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning). Typically you learn dozens or thousands of parameters.&lt;br /&gt;
* Will not be covered in this course.&lt;br /&gt;
&lt;br /&gt;
=== Unsupervised Learning ===&lt;br /&gt;
Discover a good internal representation of the input.&lt;br /&gt;
&lt;br /&gt;
* For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).&lt;br /&gt;
* It is hard to say what the aim of unsupervised learning is:&lt;br /&gt;
** One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.&lt;br /&gt;
** You can compute the distance to a surface by using the disparity between two images. But you don&amp;#039;t want to learn to compute disparities by stubbing your toe thousands of times.&lt;br /&gt;
* Other goals:&lt;br /&gt;
** Providing a compact, low-dimensional representation of the input.&lt;br /&gt;
*** High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds)&lt;br /&gt;
*** Principal Component Analysis is a widely used linear method for finding a low-dimensional representation.&lt;br /&gt;
** Providing an economical high-dimensional representation of the input in terms of learned features.&lt;br /&gt;
*** Binary features&lt;br /&gt;
*** Real-valued features that are nearly all zero&lt;br /&gt;
** Finding sensible clusters in the input&lt;br /&gt;
*** This is an example of a &amp;lt;em&amp;gt;very&amp;lt;/em&amp;gt; sparse code in which only one of the features is non-zero.&lt;br /&gt;
&lt;br /&gt;
== Neural Network Architectures ==&lt;br /&gt;
&lt;br /&gt;
# Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.&lt;br /&gt;
# Recurrent neural network: information can flow around in cycles.&lt;br /&gt;
# Symmetrically connected network: weights are the same in both directions between two units.&lt;br /&gt;
&lt;br /&gt;
===  Feed-forward Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* The commonest type of neural network&lt;br /&gt;
* The first layer is the input and the last layer is the output&lt;br /&gt;
* Called &amp;quot;deep&amp;quot; neural networks if there is more than one hidden layer.&lt;br /&gt;
* They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.&lt;br /&gt;
&lt;br /&gt;
=== Recurrent Networks ===&lt;br /&gt;
&lt;br /&gt;
* Have directed cycles in their connection graph.&lt;br /&gt;
* Have complicated dynamics — difficult to train.&lt;br /&gt;
* More biologically realistic.&lt;br /&gt;
* Very natural way to model &amp;lt;em&amp;gt;sequential&amp;lt;/em&amp;gt; data&lt;br /&gt;
* Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).&lt;br /&gt;
&lt;br /&gt;
=== Symmetrically Connected Networks ===&lt;br /&gt;
&lt;br /&gt;
* Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).&lt;br /&gt;
* Much easier to analyze than recurrent networks (John Hopfield, et al)&lt;br /&gt;
* More restricted in what they can do, because they obey an energy function. For example, they cannot model cycles.&lt;br /&gt;
&lt;br /&gt;
== The First Generation of Neural Networks ==&lt;br /&gt;
&lt;br /&gt;
=== Standard Paradigm for Statistical Pattern Recognition ===&lt;br /&gt;
&lt;br /&gt;
* Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).&lt;br /&gt;
* Learn how to weight each of the feature activations to get a single scalar quantity.&lt;br /&gt;
* If this quantity is above some threshold, decide that the input vector is a positive example of the target class.&lt;br /&gt;
&lt;br /&gt;
=== Perceptrons ===&lt;br /&gt;
&lt;br /&gt;
input units =&amp;gt; (by hand-coded programs) =&amp;gt; feature units =&amp;gt; decision unit&lt;br /&gt;
&lt;br /&gt;
* Popularized by Frank Rosenblatt in the early 1960s.&lt;br /&gt;
* Lots of grand claims were made for what they could learn to do.&lt;br /&gt;
* In 1969, Minsky and Papert showed Perceptrons&amp;#039; limitations (&amp;#039;&amp;#039;&amp;#039;Group Invariance Theorem&amp;#039;&amp;#039;&amp;#039;). Many people thought these limitations applied to all neural network models.&lt;br /&gt;
* Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.&lt;br /&gt;
* Decision units in perceptrons are &amp;#039;&amp;#039;&amp;#039;binary threshold neurons&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Perceptron Learning Procedure ====&lt;br /&gt;
&lt;br /&gt;
# Add an extra component with value 1 to each input vector (&amp;quot;bias&amp;quot; weight) and forget about the threshold. (bias) = -(threshold)&lt;br /&gt;
# Pick training cases using any policy that ensures that every training case will keep getting picked.&lt;br /&gt;
#* If the output unit is correct, leave its weights alone.&lt;br /&gt;
#* If the output unit incorrectly outputs a zero, add the input vector to the weight vector.&lt;br /&gt;
#* If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.&lt;br /&gt;
# This is guaranteed to find a set of weights that gets the right answer for all the training cases &amp;#039;&amp;#039;if any such set exists&amp;#039;&amp;#039;.&lt;br /&gt;
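The procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch with the bias folded in as an extra input fixed at 1; the function name and epoch cap are assumptions, and convergence is only guaranteed when a separating weight vector exists.

```python
import numpy as np

def perceptron_train(X, targets, epochs=100):
    """Perceptron learning: bias handled as an extra input fixed at 1."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, t in zip(Xb, targets):
            y = 1 if x @ w >= 0 else 0
            if y == 0 and t == 1:
                w = w + x      # wrongly output 0: add the input vector
            elif y == 1 and t == 0:
                w = w - x      # wrongly output 1: subtract the input vector
    return w
```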
&lt;br /&gt;
==== Geometrical View of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* Weight space (each point corresponds to a particular setting for all the weights)&lt;br /&gt;
* Each training case represents a hyperplane through the origin (if we eliminate the threshold/bias). That hyperplane is perpendicular to the &amp;#039;&amp;#039;input&amp;#039;&amp;#039; vector.&lt;br /&gt;
** The weights must lie on one side of this hyperplane to get the correct answer.&lt;br /&gt;
* To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).&lt;br /&gt;
* If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. This means the problem is &amp;#039;&amp;#039;&amp;#039;convex&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Limitations of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.&lt;br /&gt;
* But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.&lt;br /&gt;
** A binary threshold output unit cannot even tell if two single bit features are the same!&lt;br /&gt;
*** Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$&lt;br /&gt;
*** Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$&lt;br /&gt;
*** The four input-output pairs give four inequalities that are impossible to satisfy.&lt;br /&gt;
** Imagine &amp;#039;data-space&amp;#039; in which the axes correspond to components of an input vector.&lt;br /&gt;
*** Each input vector is a point in this space, a weight vector defines a plane in data-space.&lt;br /&gt;
*** The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.&lt;br /&gt;
*** There are many cases where we cannot separate positive and negative cases only by a hyperplane. Those cases are called &amp;#039;&amp;#039;&amp;#039;not linearly separable&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Cannot discriminate simple patterns under translation with wrap-around.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Conclusion ====&lt;br /&gt;
&lt;br /&gt;
* Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help, it&amp;#039;s still linear. Fixed output non-linearities are not enough.&lt;br /&gt;
* We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?&lt;br /&gt;
** We need an efficient way of adapting all the weights, not just the last layer. This is hard.&lt;br /&gt;
** Learning the weights going into hidden units is equivalent to learning features.&lt;br /&gt;
** This is difficult because nobody is telling us directly what the hidden units should do.&lt;br /&gt;
&lt;br /&gt;
== Linear Neurons ==&lt;br /&gt;
&lt;br /&gt;
=== Learning Overview ===&lt;br /&gt;
&lt;br /&gt;
$y = \sum_{i} w_i x_i = \mathbf{w}^T \mathbf{x}$&lt;br /&gt;
&lt;br /&gt;
$y$: Neuron&amp;#039;s estimate of the desired output, $\mathbf{w}$: weight vector, $\mathbf{x}$: input vector&lt;br /&gt;
&lt;br /&gt;
* Perceptron convergence procedure doesn&amp;#039;t work for more complex networks (because the problem is not &amp;#039;&amp;#039;convex&amp;#039;&amp;#039;). Multi-layer neural networks do not use the perceptron learning procedure.&lt;br /&gt;
* Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.&lt;br /&gt;
* The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.&lt;br /&gt;
&lt;br /&gt;
* Why don&amp;#039;t we solve it analytically?&lt;br /&gt;
** Scientific answer: We want a method that real neurons could use.&lt;br /&gt;
** Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.&lt;br /&gt;
&lt;br /&gt;
=== Learning Procedure ===&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;delta-rule&amp;#039;&amp;#039; for learning:&lt;br /&gt;
** $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value&lt;br /&gt;
** $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error&lt;br /&gt;
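One online sweep of the delta-rule might look like this sketch (illustrative names; eps is the learning rate):

```python
import numpy as np

def delta_rule_epoch(w, X, targets, eps=0.1):
    """One online sweep of the delta-rule for a linear neuron y = w.x:
    dw_i = eps * x_i * (t - y) after each training case."""
    for x, t in zip(X, targets):
        y = x @ w
        w = w + eps * x * (t - y)
    return w
```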
&lt;br /&gt;
=== The Error Surface for Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
* The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl.&lt;br /&gt;
** Vertical cross-sections are parabolas.&lt;br /&gt;
** Horizontal cross-sections are ellipses.&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated.&lt;br /&gt;
* The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.&lt;br /&gt;
* The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.&lt;br /&gt;
&lt;br /&gt;
== Logistic Neurons ==&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
* These give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
* The derivatives:&lt;br /&gt;
** $\frac{\partial z}{\partial w_i} = x_i$&lt;br /&gt;
** $\frac{\partial z}{\partial x_i} = w_i$&lt;br /&gt;
** $\frac{dy}{dz} = y(1-y)$&lt;br /&gt;
** By the chain rule, $\frac{\partial y}{\partial w_i} = x_i y (1-y)$.&lt;br /&gt;
** By the chain rule, $\frac{\partial E}{\partial w_i} = \sum_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum_{n} x_i^n y^n (1-y^n) (t^n - y^n)$.&lt;br /&gt;
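The last formula can be checked against a finite-difference approximation of the error. This NumPy sketch (illustrative names) implements it for the squared error over a batch of cases:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(w, b, X, t):
    """Squared error E = 1/2 * sum_n (t_n - y_n)^2 for a logistic neuron."""
    y = sigmoid(X @ w + b)
    return 0.5 * np.sum((t - y) ** 2)

def grad_error(w, b, X, t):
    """dE/dw_i = -sum_n x_i^n * y^n * (1 - y^n) * (t^n - y^n)."""
    y = sigmoid(X @ w + b)
    return -X.T @ (y * (1.0 - y) * (t - y))
```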
&lt;br /&gt;
== The Backpropagation Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== Other Ideas ===&lt;br /&gt;
&lt;br /&gt;
==== Learning by Perturbing Weights ====&lt;br /&gt;
&lt;br /&gt;
* The idea occurs to everyone who knows about evolution.&lt;br /&gt;
* Randomly perturb one weight and see if it improves performance. If so, save the change.&lt;br /&gt;
** This is a form of reinforcement learning.&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;Very inefficient&amp;#039;&amp;#039;&amp;#039;. We need to do multiple forward passes on a representative set of training cases just to change one weight.&lt;br /&gt;
** Towards the end of learning, large weight perturbations will nearly always make things worse.&lt;br /&gt;
&lt;br /&gt;
==== Learning by Using Perturbations ====&lt;br /&gt;
&lt;br /&gt;
* We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.&lt;br /&gt;
* Not any better, because we need lots of trials on each training case to see the effect of changing one weight through the noise created by all the changes to the other weights.&lt;br /&gt;
* A better idea: Randomly perturb the &amp;#039;&amp;#039;activities&amp;#039;&amp;#039; of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights.&lt;br /&gt;
&lt;br /&gt;
=== The Idea behind Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* We don&amp;#039;t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.&lt;br /&gt;
* We can compute error derivatives for all the hidden units efficiently at the same time.&lt;br /&gt;
** Once we have the error derivatives for the hidden activities, it&amp;#039;s easy to get the error derivatives for the weights going into a hidden unit.&lt;br /&gt;
&lt;br /&gt;
=== Sketch of the Backpropagation Algorithm ===&lt;br /&gt;
&lt;br /&gt;
* First convert the discrepancy between each output and its target value into an error derivative.&lt;br /&gt;
* Then compute error derivatives in each hidden layer from error derivatives in the layer above.&lt;br /&gt;
* Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.&lt;br /&gt;
&lt;br /&gt;
=== Using the Derivatives Computed by Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* The backpropagation algorithm is an efficient way of computing the error derivative $\partial E / \partial w$ for every weight on a single training case.&lt;br /&gt;
* To get a fully specified learning procedure, we still need to make a lot of other decisions about how to use these error derivatives:&lt;br /&gt;
** Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?&lt;br /&gt;
** Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?&lt;br /&gt;
&lt;br /&gt;
==== Optimization Issues ====&lt;br /&gt;
&lt;br /&gt;
* How often to update the weights:&lt;br /&gt;
** Online: after each training case.&lt;br /&gt;
** Full batch: after a full sweep through the training data.&lt;br /&gt;
** Mini-batch: after a small sample of training cases. This is typically used when training big nets on big data sets.&lt;br /&gt;
* How much to update: (will be discussed further later)&lt;br /&gt;
** Use a fixed learning rate?&lt;br /&gt;
** Adapt the global learning rate?&lt;br /&gt;
** Adapt the learning rate on each connection separately?&lt;br /&gt;
** Don&amp;#039;t use steepest descent?&lt;br /&gt;
&lt;br /&gt;
==== Generalization Issue (Overfitting) ====&lt;br /&gt;
&lt;br /&gt;
* The downside of using powerful models.&lt;br /&gt;
* The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise:&lt;br /&gt;
** The target values may be unreliable (usually only a minor worry).&lt;br /&gt;
** There is &amp;#039;&amp;#039;&amp;#039;sampling error&amp;#039;&amp;#039;&amp;#039;. There will be accidental regularities just because of the particular training cases that were chosen.&lt;br /&gt;
* When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well. &amp;#039;&amp;#039;This is a disaster&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Ways to Reduce Overfitting ====&lt;br /&gt;
&lt;br /&gt;
* A large number of different methods have been developed:&lt;br /&gt;
** Weight-decay: keep weights small / zero.&lt;br /&gt;
** Weight-sharing: insist that many weights have the same value.&lt;br /&gt;
** Early stopping: Make a fake test set. While training the net, stop as soon as the performance on the fake test set starts getting worse.&lt;br /&gt;
** Model averaging: Train many different neural nets and average them.&lt;br /&gt;
** Bayesian fitting of neural nets: really just a fancy form of model averaging.&lt;br /&gt;
** Dropout: Randomly omit a hidden unit.&lt;br /&gt;
** Generative pre-training: (Will be discussed towards the end of the course.)&lt;br /&gt;
&lt;br /&gt;
== Learning to Predict the Next Word ==&lt;br /&gt;
&lt;br /&gt;
=== Diversion into Cognitive Science ===&lt;br /&gt;
&lt;br /&gt;
There has been a long debate in cognitive science between two rival theories of what it means to have a &amp;#039;&amp;#039;concept&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
* The feature theory: A concept is a set of semantic features&lt;br /&gt;
** Good for explaining similarities between concepts.&lt;br /&gt;
** It&amp;#039;s convenient -- a concept is a vector of feature activities.&lt;br /&gt;
* The structuralist theory: The meaning of a concept lies in its relationships to other concepts.&lt;br /&gt;
** So conceptual knowledge is best expressed as a relational graph.&lt;br /&gt;
** Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.&lt;br /&gt;
&lt;br /&gt;
Both sides are wrong.&lt;br /&gt;
&lt;br /&gt;
* These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.&lt;br /&gt;
* We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just &amp;quot;seeing&amp;quot; the answer with no conscious intervening steps.&lt;br /&gt;
&lt;br /&gt;
==== Localist and Distributed Representations of Concepts ====&lt;br /&gt;
&lt;br /&gt;
The obvious way to implement a relational graph in a neural net is to treat a neuron as a node in the graph and a connection as a binary relationship. But this &amp;quot;localist&amp;quot; method will not work:&lt;br /&gt;
&lt;br /&gt;
* We need many different types of relationship and the connections in a neural net do not have discrete labels.&lt;br /&gt;
* We need ternary relationships as well as binary ones, e.g. A is between B and C&lt;br /&gt;
&lt;br /&gt;
The right way to implement relational knowledge in a neural net is still an open issue.&lt;br /&gt;
&lt;br /&gt;
* But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a &amp;quot;distributed representation&amp;quot;. It&amp;#039;s a many-to-many mapping between concepts and neurons.&lt;br /&gt;
&lt;br /&gt;
=== The Softmax Output Function (Another Diversion) ===&lt;br /&gt;
&lt;br /&gt;
The squared error measure has some drawbacks. Is there a different cost function that works better?&lt;br /&gt;
&lt;br /&gt;
Yes, force the outputs to represent a probability distribution across discrete alternatives.&lt;br /&gt;
&lt;br /&gt;
==== Softmax ====&lt;br /&gt;
&lt;br /&gt;
The output units in a softmax group use a non-local non-linearity. Let $y_i$ be the output of a neuron in a softmax group, and $z_i$ (called the &amp;#039;&amp;#039;logit&amp;#039;&amp;#039;) the input of the same neuron:&lt;br /&gt;
&lt;br /&gt;
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j} }$$&lt;br /&gt;
&lt;br /&gt;
Note that $\sum y_i = 1$, so we force the $y_i$ to represent a probability distribution. We can show:&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$&lt;br /&gt;
&lt;br /&gt;
The right cost function to use with softmax is the negative log probability of the right answer. This is called the &amp;#039;&amp;#039;Cross Entropy&amp;#039;&amp;#039; cost function. ($t_j$ is the target value.)&lt;br /&gt;
&lt;br /&gt;
$$C = - \sum_{j} t_j \log y_j$$&lt;br /&gt;
&lt;br /&gt;
$C$ has a very big gradient when the target value is 1 and the output is almost zero.&lt;br /&gt;
* A value of 0.000001 is much better than 0.000000001 (one-in-a-million chance vs one-in-a-billion chance).&lt;br /&gt;
* The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$.&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial C}{\partial z_i} = \sum_{j} \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$&lt;br /&gt;
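The softmax and its cross-entropy gradient are compact in code. A NumPy sketch (the max-subtraction is a standard numerical-stability trick, not something from the notes):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # subtract the max for numerical stability
    return e / e.sum()

def grad_logits(z, t):
    """Gradient of C = -sum_j t_j log y_j w.r.t. the logits: y_i - t_i."""
    return softmax(z) - t
```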
&lt;br /&gt;
&lt;br /&gt;
=== Neuro-Probabilistic Language Models ===&lt;br /&gt;
&lt;br /&gt;
* We cannot identify phonemes perfectly in noisy speech.&lt;br /&gt;
* People use their understanding of the meaning of the utterance to hear the right words.&lt;br /&gt;
** &amp;quot;We do this unconsciously when we wreck a nice beach&amp;quot;&lt;br /&gt;
** We are very good at it.&lt;br /&gt;
* This means speech recognizers have to know which words are likely to come next and which are not.&lt;br /&gt;
** Fortunately words can be predicted quite well without full understanding.&lt;br /&gt;
&lt;br /&gt;
==== The Standard &amp;quot;Trigram&amp;quot; Method ====&lt;br /&gt;
&lt;br /&gt;
* Take a huge amount of text and count the frequencies of all triples of words.&lt;br /&gt;
&lt;br /&gt;
* Use these frequencies to make bets on the relative probabilities of words given the previous two words.&lt;br /&gt;
&lt;br /&gt;
* Until very recently this was the state-of-the-art.&lt;br /&gt;
** We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.&lt;br /&gt;
** We have to &amp;quot;back-off&amp;quot; to digrams when the count for a trigram is too small. The probability is not zero just because the count is zero!&lt;br /&gt;
&lt;br /&gt;
* A trigram model doesn&amp;#039;t understand the similarities between words.&lt;br /&gt;
** The sentence &amp;quot;The cat got squashed in the garden on friday&amp;quot; should help us predict words in the sentence &amp;quot;The dog got flattened in the yard on monday&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.&lt;br /&gt;
** Using a feature representation also allows a context that contains many more previous words, e.g. 10.&lt;br /&gt;
&lt;br /&gt;
* Yoshua Bengio&amp;#039;s neural net for predicting the next word&lt;br /&gt;
&lt;br /&gt;
== Object Recognition ==&lt;br /&gt;
&lt;br /&gt;
=== Why Object Recognition is Difficult ===&lt;br /&gt;
&lt;br /&gt;
* Segmentation: Real scenes are cluttered with other objects.&lt;br /&gt;
* Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.&lt;br /&gt;
* Deformation: Objects can deform in a variety of non-affine ways.&lt;br /&gt;
* Affordances: Object classes are things defined by how they are used.&lt;br /&gt;
* Viewpoint: Changes in viewpoint cause changes in images that standard learning methods cannot cope with.&lt;br /&gt;
&lt;br /&gt;
=== Achieving Viewpoint Invariance ===&lt;br /&gt;
&lt;br /&gt;
* Viewpoint invariance is one of the main difficulties in making computers perceive.&lt;br /&gt;
* Several different approaches:&lt;br /&gt;
** Use redundant invariant features.&lt;br /&gt;
** Put a box around the object and use normalized pixels.&lt;br /&gt;
** Use replicated features with pooling. This is called &amp;#039;&amp;#039;convolutional neural nets&amp;#039;&amp;#039; (in later lectures).&lt;br /&gt;
** Use a hierarchy of parts that have explicit poses relative to the camera (in later lectures).&lt;br /&gt;
&lt;br /&gt;
==== Invariant Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* Extract a large, redundant set of features that are invariant under transformations.&lt;br /&gt;
** e.g. pair of roughly parallel lines with a red dot between them (this is what baby herring gulls use to know where to peck for food).&lt;br /&gt;
* With enough invariant features, there is only one way to assemble them into an object.&lt;br /&gt;
* But for recognition, we must avoid forming features from parts of different objects.&lt;br /&gt;
&lt;br /&gt;
==== Judicious Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* Put a box around the object and use it as a coordinate frame for a set of normalized pixels.&lt;br /&gt;
** This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.&lt;br /&gt;
** The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, ...&lt;br /&gt;
&lt;br /&gt;
* But choosing the box is difficult because of:&lt;br /&gt;
** Segmentation errors, occlusion, unusual orientations.&lt;br /&gt;
&lt;br /&gt;
* We need to recognize the shape to get the box right!&lt;br /&gt;
&lt;br /&gt;
==== Brute Force Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* When training the recognizer, use well-segmented, upright images to fit the correct box.&lt;br /&gt;
&lt;br /&gt;
* At test time try all possible boxes in a range of positions and scales.&lt;br /&gt;
&lt;br /&gt;
** This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.&lt;br /&gt;
** It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Digit Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== The Replicated Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* This is currently the dominant approach for neural networks.&lt;br /&gt;
&lt;br /&gt;
* Use many different copies of the same feature detector with different positions.&lt;br /&gt;
&lt;br /&gt;
** Could also replicate across scale and orientation but it is tricky and expensive.&lt;br /&gt;
** Replication greatly reduces the number of free parameters to be learned.&lt;br /&gt;
&lt;br /&gt;
* Use several different feature types, each with its own map of replicated detectors.&lt;br /&gt;
&lt;br /&gt;
==== Backpropagation with Weight Constraints ====&lt;br /&gt;
&lt;br /&gt;
* It&amp;#039;s easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.&lt;br /&gt;
&lt;br /&gt;
* We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints - so if the weights started off satisfying the constraints, they will continue to satisfy them.&lt;br /&gt;
&lt;br /&gt;
To constrain $w_1 = w_2$, we need $\Delta w_1 = \Delta w_2$.&lt;br /&gt;
&lt;br /&gt;
Compute $\partial E / \partial w_1$ and $\partial E / \partial w_2$, then use $\partial E / \partial w_1 + \partial E / \partial w_2$ for both $w_1$ and $w_2$.&lt;br /&gt;
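&lt;br /&gt;
A minimal sketch of this tied-weight update, with scalar weights and illustrative names:&lt;br /&gt;
&lt;br /&gt;
```python
def constrained_update(w1, w2, g1, g2, lr=0.1):
    # To keep w1 = w2, give both weights the combined gradient
    # dE/dw1 + dE/dw2, so equal weights stay equal after every update.
    g = g1 + g2
    return w1 - lr * g, w2 - lr * g
```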
&lt;br /&gt;
==== What Does Replicating the Feature Detectors Achieve? ====&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Equivariant&amp;#039;&amp;#039;&amp;#039; activities: Replicated features do &amp;#039;&amp;#039;not&amp;#039;&amp;#039; make the neural activities invariant to translation. The activities are equivariant.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Invariant&amp;#039;&amp;#039;&amp;#039; knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.&lt;br /&gt;
&lt;br /&gt;
==== Pooling the Outputs of Replicated Feature Detectors ====&lt;br /&gt;
&lt;br /&gt;
* Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.&lt;br /&gt;
** This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.&lt;br /&gt;
** Taking the maximum of the four works slightly better.&lt;br /&gt;
&lt;br /&gt;
* Problem: After several levels of pooling, we have lost information about the precise positions of things.&lt;br /&gt;
** This makes it impossible to use the precise spatial relationships between high-level parts for recognition.&lt;br /&gt;
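&lt;br /&gt;
A sketch of 2x2 pooling over one feature map (numpy, illustrative names):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def pool2x2(fmap, use_max=True):
    # Pool non-overlapping 2x2 neighborhoods of replicated detectors.
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)
    if use_max:
        return blocks.max(axis=(1, 3))   # taking the maximum works slightly better
    return blocks.mean(axis=(1, 3))      # averaging four neighboring detectors

x = np.arange(16.0).reshape(4, 4)
y = pool2x2(x)   # 2x2 output: the precise position inside each block is lost
```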
&lt;br /&gt;
==== Le Net ====&lt;br /&gt;
&lt;br /&gt;
* Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:&lt;br /&gt;
** Many hidden layers.&lt;br /&gt;
** Many maps of replicated units in each layer.&lt;br /&gt;
** Pooling of the outputs of nearby replicated units.&lt;br /&gt;
** A wide net that can cope with several characters at once even if they overlap.&lt;br /&gt;
** A clever way of training a complete system, not just a recognizer.&lt;br /&gt;
&lt;br /&gt;
* This net was used for reading ~10% of the checks in North America.&lt;br /&gt;
&lt;br /&gt;
* Look at the impressive demos of LeNet at http://yann.lecun.com&lt;br /&gt;
&lt;br /&gt;
==== Priors and Prejudice ====&lt;br /&gt;
&lt;br /&gt;
* We can put our prior knowledge about the task into the network by designing appropriate:&lt;br /&gt;
** Connectivity&lt;br /&gt;
** Weight constraints&lt;br /&gt;
** Neuron activation functions&lt;br /&gt;
&lt;br /&gt;
* This is less intrusive than hand-designing the features&lt;br /&gt;
** But it still prejudices the network towards the particular way of solving the problem that we had in mind.&lt;br /&gt;
&lt;br /&gt;
* Alternatively, we can use our prior knowledge to create a whole lot more training data.&lt;br /&gt;
** This may require a lot of work (Hofman &amp;amp; Tresp, 1993). Real data + synthetic data.&lt;br /&gt;
** It may make learning take much longer.&lt;br /&gt;
** It allows optimization to discover clever ways of using the multi-layer network that we did not think of - and we may never fully understand how it does it.&lt;br /&gt;
&lt;br /&gt;
==== The Brute Force Approach ====&lt;br /&gt;
&lt;br /&gt;
* LeNet uses knowledge about the invariances to design:&lt;br /&gt;
** the local connectivity&lt;br /&gt;
** the weight-sharing&lt;br /&gt;
** the pooling&lt;br /&gt;
&lt;br /&gt;
* This achieves about 80 errors.&lt;br /&gt;
** This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008).&lt;br /&gt;
&lt;br /&gt;
* Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:&lt;br /&gt;
** For each training image, they produce many new training examples by applying many different transformations.&lt;br /&gt;
** They can then train a large, deep, dumb net on a GPU without much overfitting.&lt;br /&gt;
** They achieved about 35 errors.&lt;br /&gt;
&lt;br /&gt;
* With model averaging they can now get about 25 errors (which is about the same as human error rate).&lt;br /&gt;
&lt;br /&gt;
==== How to Detect a Significant Drop in the Error Rate ====&lt;br /&gt;
&lt;br /&gt;
* Is 30 errors in 10,000 test cases significantly better than 40 errors?&lt;br /&gt;
** It all depends on the particular errors.&lt;br /&gt;
** The McNemar test uses the particular errors and can be much more powerful than a test that just uses the number of errors.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Object Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== From Hand-Written Digits to 3D Objects ====&lt;br /&gt;
&lt;br /&gt;
* Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:&lt;br /&gt;
&lt;br /&gt;
** Hundred times as many classes (1000 vs 10)&lt;br /&gt;
** Hundred times as many pixels (256x256 color vs 28x28 gray)&lt;br /&gt;
** Two dimensional image of three-dimensional scene.&lt;br /&gt;
** Cluttered scenes requiring segmentation.&lt;br /&gt;
** Multiple objects in each image.&lt;br /&gt;
&lt;br /&gt;
* Will the same type of convolutional neural network work?&lt;br /&gt;
&lt;br /&gt;
==== The ILSVRC-2012 Competition on ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* The dataset has 1.2 million high-resolution training images.&lt;br /&gt;
&lt;br /&gt;
* The classification task:&lt;br /&gt;
** Get the &amp;quot;correct&amp;quot; class in your top 5 bets. There are 1000 classes.&lt;br /&gt;
&lt;br /&gt;
* The localization task:&lt;br /&gt;
** For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.&lt;br /&gt;
&lt;br /&gt;
* Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...&lt;br /&gt;
** Computer vision systems use complicated multi-stage systems.&lt;br /&gt;
** The early stages are typically hand-tuned by optimizing a few parameters.&lt;br /&gt;
&lt;br /&gt;
==== Error Rates on the ILSVRC-2012 Competition ====&lt;br /&gt;
&lt;br /&gt;
* University of Tokyo: 26.1% (classification)&lt;br /&gt;
* Oxford University Computer Vision Group: 26.9% (classification)&lt;br /&gt;
* INRIA(French national research institute in CS) + XRCE(Xerox Research Center Europe): 27.0% (classification)&lt;br /&gt;
* University of Amsterdam: 29.5% (classification)&lt;br /&gt;
* University of Toronto (&amp;#039;&amp;#039;&amp;#039;Alex Krizhevsky&amp;#039;&amp;#039;&amp;#039;): &amp;#039;&amp;#039;&amp;#039;16.4%&amp;#039;&amp;#039;&amp;#039; (classification)&lt;br /&gt;
&lt;br /&gt;
==== A Neural Network for ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its &amp;#039;&amp;#039;architecture&amp;#039;&amp;#039; was:&lt;br /&gt;
** 7 hidden layers not counting some max pooling layers.&lt;br /&gt;
** The early layers were convolutional.&lt;br /&gt;
** The last two layers were globally connected.&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;activation functions&amp;#039;&amp;#039; were:&lt;br /&gt;
** Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.&lt;br /&gt;
** Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.&lt;br /&gt;
&lt;br /&gt;
==== Tricks that Significantly Improve Generalization ====&lt;br /&gt;
&lt;br /&gt;
* Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.&lt;br /&gt;
** At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;dropout&amp;quot; to regularize the weights in the globally connected layers (which contain most of the parameters).&lt;br /&gt;
** Dropout means that half of the hidden units in a layer are randomly removed for each training example.&lt;br /&gt;
** This stops hidden units from relying too much on other hidden units.&lt;br /&gt;
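&lt;br /&gt;
A sketch of dropout as described above; scaling the kept units by 1/(1-p) to preserve the expected activity is a common implementation choice, not from the lecture:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5):
    # Randomly remove a fraction p of the hidden units for this training case,
    # so units cannot rely too much on particular other hidden units.
    mask = rng.random(h.shape) > p   # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)
```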
&lt;br /&gt;
==== The Hardware Required for Alex Krizhevsky&amp;#039;s Net ====&lt;br /&gt;
&lt;br /&gt;
* He uses a very efficient implementation of convolutional nets on two Nvidia GTX 580 Graphics Processor Units (over 1000 fast little cores)&lt;br /&gt;
** GPUs are very good for matrix-matrix multiplies.&lt;br /&gt;
** GPUs have very high bandwidth to memory.&lt;br /&gt;
** This allows him to train the network in a week.&lt;br /&gt;
** It also makes it quick to combine results from 10 patches at test time.&lt;br /&gt;
&lt;br /&gt;
* We can spread a network over many cores if we can communicate the states fast enough.&lt;br /&gt;
&lt;br /&gt;
* As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.&lt;br /&gt;
&lt;br /&gt;
==== Finding Roads in High-Resolution Images ====&lt;br /&gt;
&lt;br /&gt;
* Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images.&lt;br /&gt;
** It takes a large image patch and predicts a binary road label for the central 16x16 pixels.&lt;br /&gt;
** There is lots of labeled training data available for this task.&lt;br /&gt;
&lt;br /&gt;
* The task is hard for many reasons:&lt;br /&gt;
** Occlusion by buildings, trees, and cars.&lt;br /&gt;
** Shadows, lighting changes.&lt;br /&gt;
** Minor viewpoint changes.&lt;br /&gt;
&lt;br /&gt;
* The worst problems are incorrect labels:&lt;br /&gt;
** Badly registered maps&lt;br /&gt;
** Arbitrary decisions about what counts as a road (vs laneway etc.)&lt;br /&gt;
&lt;br /&gt;
== Mini-Batch Gradient Descent ==&lt;br /&gt;
&lt;br /&gt;
=== Overview ===&lt;br /&gt;
&lt;br /&gt;
* The Error Surface: Lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl where vertical cross-sections are parabolas and horizontal cross-sections are ellipses.&lt;br /&gt;
&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated. But locally, a piece of a quadratic bowl is usually a very good approximation.&lt;br /&gt;
&lt;br /&gt;
* Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.&lt;br /&gt;
** The gradient is big in the direction in which we only want to travel a small distance, small in the direction in which we want to travel a large distance.&lt;br /&gt;
&lt;br /&gt;
* If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges.&lt;br /&gt;
&lt;br /&gt;
* What we would like to achieve: move quickly in directions with small but consistent gradients, and move slowly in directions with big but inconsistent gradients.&lt;br /&gt;
&lt;br /&gt;
==== Stochastic Gradient Descent ====&lt;br /&gt;
&lt;br /&gt;
* If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.&lt;br /&gt;
** So, instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.&lt;br /&gt;
** The extreme version of this approach updates weights after each case. It&amp;#039;s called &amp;quot;online&amp;quot; learning.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches are usually better than online.&lt;br /&gt;
** Less computation is used updating the weights.&lt;br /&gt;
** Computing the gradient for many cases simultaneously uses matrix-matrix multiplies which are very efficient, especially on GPUs.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches need to be balanced for classes (one way to approximate this is to pick points randomly).&lt;br /&gt;
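&lt;br /&gt;
The mini-batch update can be sketched for a linear neuron with squared error; random shuffling approximates class-balanced batches:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def minibatch_sgd(X, t, lr=0.1, batch=10, epochs=100, seed=0):
    # Update the weights after each mini-batch instead of the full gradient.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))      # pick points randomly
        for i in range(0, len(X), batch):
            idx = order[i:i + batch]
            y = X[idx] @ w                   # forward pass: matrix-vector work
            grad = X[idx].T @ (y - t[idx]) / len(idx)
            w -= lr * grad                   # update after each mini-batch
    return w
```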
&lt;br /&gt;
==== Two types of learning algorithm ====&lt;br /&gt;
&lt;br /&gt;
* If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).&lt;br /&gt;
&lt;br /&gt;
* For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.&lt;br /&gt;
&lt;br /&gt;
==== Basic Mini-Batch Gradient Descent Algorithm ====&lt;br /&gt;
&lt;br /&gt;
* Guess an initial learning rate.&lt;br /&gt;
** If the error keeps getting worse or oscillates wildly, reduce the learning rate.&lt;br /&gt;
** If the error is falling fairly consistently but slowly, increase the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Write a simple program to automate this way of adjusting the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Toward the end of mini-batch learning, it nearly always helps to turn down the learning rate.&lt;br /&gt;
** This removes fluctuations in the final weights caused by the variations between mini-batches.&lt;br /&gt;
&lt;br /&gt;
* Turn down the learning rate when the error stops decreasing.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Note&amp;#039;&amp;#039;&amp;#039;: Compute the error in a separate validation set.&lt;br /&gt;
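&lt;br /&gt;
The recipe above can be automated with a simple rule; the factors here are illustrative:&lt;br /&gt;
&lt;br /&gt;
```python
def adjust_lr(lr, errors, up=1.1, down=0.5):
    # errors: validation error after each epoch (computed on a separate set).
    if len(errors) > 1 and errors[-1] > errors[-2]:
        return lr * down   # error got worse or oscillates: reduce the rate
    return lr * up         # error falling fairly consistently: increase it
```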
&lt;br /&gt;
=== Tricks for Mini-Batch Gradient Descent ===&lt;br /&gt;
&lt;br /&gt;
==== Initializing The Weights ====&lt;br /&gt;
&lt;br /&gt;
* If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.&lt;br /&gt;
&lt;br /&gt;
** So they can never learn to be different features.&lt;br /&gt;
** We break symmetry by initializing the weights to have small random values.&lt;br /&gt;
&lt;br /&gt;
* If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.&lt;br /&gt;
** We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to $1/\sqrt{\text{fan-in}}$.&lt;br /&gt;
&lt;br /&gt;
* We can also scale the learning rate the same way.&lt;br /&gt;
&lt;br /&gt;
==== Shifting the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, shifting the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.&lt;br /&gt;
&lt;br /&gt;
* The hyperbolic tangent (which is $2\times \text{logistic} - 1$) produces hidden activations that are roughly zero mean. &lt;br /&gt;
** In this respect it&amp;#039;s better than the logistic.&lt;br /&gt;
&lt;br /&gt;
==== Scaling the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, scaling the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.&lt;br /&gt;
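&lt;br /&gt;
Shifting and scaling together amount to standardizing each input component over the training set; a numpy sketch:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def standardize(X):
    # Transform each component to zero mean and unit variance,
    # using statistics computed over the whole training set.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

X = np.array([[101.0, -0.1], [101.0, 0.0], [99.0, 0.1], [99.0, 0.0]])
Z = standardize(X)   # each column now has mean 0 and variance 1
```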
&lt;br /&gt;
==== A More Thorough Method: Decorrelate the Input Components ====&lt;br /&gt;
&lt;br /&gt;
* For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.&lt;br /&gt;
&lt;br /&gt;
* There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.&lt;br /&gt;
** Drop the principal components with the smallest eigenvalues ― &amp;#039;&amp;#039;&amp;#039;This achieves some dimensionality reduction&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.&lt;br /&gt;
&lt;br /&gt;
* For a circular error surface, the gradient points straight towards the minimum.&lt;br /&gt;
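&lt;br /&gt;
A sketch of PCA whitening, under the assumption that X is already centered:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def pca_whiten(X, k):
    # Decorrelate the inputs: keep the k leading principal components and
    # divide each by the square root of its eigenvalue.
    cov = X.T @ X / len(X)
    vals, vecs = np.linalg.eigh(cov)   # eigenvalues come in ascending order
    vals = vals[::-1][:k]              # drop the smallest eigenvalues
    vecs = vecs[:, ::-1][:, :k]
    return (X @ vecs) / np.sqrt(vals)  # circular error surface for a linear neuron
```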
&lt;br /&gt;
&lt;br /&gt;
==== Common Problems That Occur in Multilayer Networks ====&lt;br /&gt;
&lt;br /&gt;
* If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.&lt;br /&gt;
** The error derivatives for the hidden units will all become tiny and the error will not decrease.&lt;br /&gt;
** This is usually a plateau, but people often mistake it for a local minimum.&lt;br /&gt;
&lt;br /&gt;
* In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.&lt;br /&gt;
** The network finds this strategy quickly and may take a long time to improve on it by making use of the input.&lt;br /&gt;
** This is another plateau that looks like a local minimum.&lt;br /&gt;
&lt;br /&gt;
==== Be Careful about Turning Down the Learning Rate ====&lt;br /&gt;
&lt;br /&gt;
* Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.&lt;br /&gt;
** So we get a quick win&lt;br /&gt;
** But then we get slower learning.&lt;br /&gt;
&lt;br /&gt;
* Don&amp;#039;t turn down the learning rate too soon!&lt;br /&gt;
&lt;br /&gt;
==== Four Ways to Speed up Mini-Batch Learning ====&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;momentum&amp;quot;&lt;br /&gt;
** Instead of using the gradient to change the &amp;#039;&amp;#039;position&amp;#039;&amp;#039; of the weight &amp;quot;particle&amp;quot;, use it to change the &amp;#039;&amp;#039;velocity&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
* Use separate adaptive learning rates for each parameter.&lt;br /&gt;
** Slowly adjust the rate using the consistency of the gradient for that parameter.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;rmsprop&amp;#039;&amp;#039;&amp;#039;: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. &lt;br /&gt;
** This is the mini-batch version of just using the sign of the gradient.&lt;br /&gt;
&lt;br /&gt;
* Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)&lt;br /&gt;
** Adapt it to work for neural nets.&lt;br /&gt;
** Adapt it to work for mini-batches.&lt;br /&gt;
&lt;br /&gt;
=== The Momentum Method ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.&lt;br /&gt;
# The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.&lt;br /&gt;
# Its momentum makes it keep going in the previous direction.&lt;br /&gt;
&lt;br /&gt;
* It damps oscillations in directions of high curvature by combining gradients with opposite signs.&lt;br /&gt;
&lt;br /&gt;
* It builds up speed in directions with a gentle but consistent gradient.&lt;br /&gt;
&lt;br /&gt;
==== Equations ====&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(t) = \alpha \mathbb{v}(t-1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The effect of the gradient is to increment the previous velocity. The velocity also decays by $\alpha$ which is slightly less than 1.&lt;br /&gt;
&lt;br /&gt;
The weight change is equal to the current velocity.&lt;br /&gt;
&lt;br /&gt;
$$\Delta \mathbb{w}(t) = \mathbb{v} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \mathbb{v} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \Delta \mathbb{w} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The weight change can be expressed in terms of the previous weight change and the current gradient.&lt;br /&gt;
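&lt;br /&gt;
The two update equations can be written directly as code (scalar form for clarity):&lt;br /&gt;
&lt;br /&gt;
```python
def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    # The gradient increments the velocity, which decays by alpha;
    # the weight change equals the current velocity.
    v = alpha * v - lr * grad
    return w + v, v
```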
&lt;br /&gt;
==== Behavior ====&lt;br /&gt;
&lt;br /&gt;
* If the error surface is a tilted plane, the ball reaches a terminal velocity $\mathbb{v}(\infty)$.&lt;br /&gt;
** If the momentum is close to 1, this is much faster than simple gradient descent.&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(\infty) = \frac{1}{1 - \alpha} \Big( - \epsilon \frac{\partial E}{\partial \mathbb{w}} \Big)$$&lt;br /&gt;
&lt;br /&gt;
* At the beginning of learning there may be very large gradients.&lt;br /&gt;
** So it pays to use a small momentum (e.g. 0.5).&lt;br /&gt;
** Once the large gradients have disappeared and the weights are stuck in a ravine the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99)&lt;br /&gt;
&lt;br /&gt;
* This allows us to learn at a rate that would cause divergent oscillations without the momentum.&lt;br /&gt;
&lt;br /&gt;
==== Better Type of Momentum (Nesterov 1983) ====&lt;br /&gt;
&lt;br /&gt;
* The standard momentum method &amp;#039;&amp;#039;first&amp;#039;&amp;#039; computes the gradient at the current location and &amp;#039;&amp;#039;then&amp;#039;&amp;#039; takes a big jump in the direction of the updated accumulated gradient.&lt;br /&gt;
* Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.&lt;br /&gt;
** Inspired by the Nesterov method for optimizing convex functions.&lt;br /&gt;
* &amp;#039;&amp;#039;First&amp;#039;&amp;#039; make a big jump in the direction of the previous accumulated gradient.&lt;br /&gt;
* &amp;#039;&amp;#039;Then&amp;#039;&amp;#039; measure the gradient where you end up and make a correction.&lt;br /&gt;
** It&amp;#039;s better to correct a mistake &amp;#039;&amp;#039;after&amp;#039;&amp;#039; you have made it!&lt;br /&gt;
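&lt;br /&gt;
A sketch of the Nesterov-style step, with grad_fn standing in for the gradient evaluation:&lt;br /&gt;
&lt;br /&gt;
```python
def nesterov_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    lookahead = w + alpha * v                 # first the big jump
    v = alpha * v - lr * grad_fn(lookahead)   # then correct where you end up
    return w + v, v
```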
&lt;br /&gt;
=== Adaptive Learning Rates for Each Connection ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
* In a multilayer net, the appropriate learning rates can vary widely between weights:&lt;br /&gt;
** The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.&lt;br /&gt;
** The fan-in of a unit determines the size of the &amp;quot;overshoot&amp;quot; effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.&lt;br /&gt;
&lt;br /&gt;
* So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.&lt;br /&gt;
&lt;br /&gt;
==== One Way to Determine the Individual Learning Rates ====&lt;br /&gt;
&lt;br /&gt;
$$\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$$&lt;br /&gt;
&lt;br /&gt;
* Start with a local gain of 1 for every weight. i.e. $g_{ij}(0) = 1$&lt;br /&gt;
* Increase the local gain if the gradient for that weight does not change sign.&lt;br /&gt;
* Use small additive increases and multiplicative decreases.&lt;br /&gt;
** This ensures that big gains decay rapidly when oscillations start.&lt;br /&gt;
** If the gradient is totally random, the gain will hover around 1 when we increase by &amp;#039;&amp;#039;plus&amp;#039;&amp;#039; $\delta$ half the time and decrease by &amp;#039;&amp;#039;times&amp;#039;&amp;#039; $1 - \delta$ half the time.&lt;br /&gt;
&lt;br /&gt;
$$\text{if } \Big( \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) \Big) &amp;gt; 0$$&lt;br /&gt;
$$\text{then } g_{ij}(t) = g_{ij}(t - 1) + 0.05$$&lt;br /&gt;
$$\text{else } g_{ij}(t) = g_{ij}(t - 1) * 0.95$$&lt;br /&gt;
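&lt;br /&gt;
One per-weight gain update (additive increase, multiplicative decrease), with a range limit as suggested in the next subsection:&lt;br /&gt;
&lt;br /&gt;
```python
def update_gain(gain, grad, prev_grad, lo=0.1, hi=10.0):
    if grad * prev_grad > 0:
        gain = gain + 0.05   # gradient kept its sign: small additive increase
    else:
        gain = gain * 0.95   # sign changed: multiplicative decrease
    return max(lo, min(hi, gain))   # keep the gain in a reasonable range
```
&lt;br /&gt;
The weight step is then $\Delta w_{ij} = - \epsilon g_{ij} \partial E / \partial w_{ij}$.&lt;br /&gt;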
&lt;br /&gt;
==== Tricks for Making Adaptive Learning Rates Work Better ====&lt;br /&gt;
&lt;br /&gt;
* Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].&lt;br /&gt;
&lt;br /&gt;
* Use full batch learning or very big mini-batches.&lt;br /&gt;
** This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini batch.&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates can be combined with momentum.&lt;br /&gt;
** Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates only deal with axis-aligned effects.&lt;br /&gt;
** Momentum does not care about the alignment of the axes.&lt;br /&gt;
&lt;br /&gt;
=== RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude ===&lt;br /&gt;
&lt;br /&gt;
==== RPROP: Using only the sign of the gradient ====&lt;br /&gt;
&lt;br /&gt;
* The magnitude of the gradient can be very different for different weights and can change during learning.&lt;br /&gt;
** This makes it hard to choose a single global learning rate.&lt;br /&gt;
&lt;br /&gt;
* For &amp;#039;&amp;#039;full batch learning&amp;#039;&amp;#039;, we can deal with this variation by only using the sign of the gradient.&lt;br /&gt;
** The weight updates are all of the same magnitude.&lt;br /&gt;
** This escapes from plateaus with tiny gradients quickly.&lt;br /&gt;
&lt;br /&gt;
* RPROP: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.&lt;br /&gt;
** Increase the step size for a weight &amp;#039;&amp;#039;multiplicatively&amp;#039;&amp;#039; (e.g. times 1.2) if the signs of its last two gradients agree.&lt;br /&gt;
** Otherwise decrease the step size multiplicatively (e.g. times 0.5).&lt;br /&gt;
** Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster&amp;#039;s advice).&lt;br /&gt;
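&lt;br /&gt;
A simplified full-batch RPROP step in numpy, vectorized over all weights (the backtracking variants are omitted):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def rprop_step(w, step, grad, prev_grad):
    # Adapt each step size from the signs of the last two gradients,
    # then move by the step size in the downhill direction only.
    same_sign = grad * prev_grad > 0
    step = np.where(same_sign, step * 1.2, step * 0.5)
    step = np.clip(step, 1e-6, 50.0)   # limits suggested by Mike Shuster
    return w - np.sign(grad) * step, step
```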
&lt;br /&gt;
==== Why RPROP Does Not Work with Mini-Batches ====&lt;br /&gt;
&lt;br /&gt;
* The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.&lt;br /&gt;
** Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.&lt;br /&gt;
** We want this weight to stay roughly where it is.&lt;br /&gt;
&lt;br /&gt;
* RPROP would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale), so the weight would grow a lot.&lt;br /&gt;
* Is there a way to combine:&lt;br /&gt;
** The robustness of RPROP.&lt;br /&gt;
** The efficiency of mini-batches.&lt;br /&gt;
** The effective averaging of gradients over mini-batches.&lt;br /&gt;
&lt;br /&gt;
==== RMSPROP: Mini-Batch Version of RPROP ====&lt;br /&gt;
&lt;br /&gt;
* RPROP is equivalent to using the gradient but also dividing by the size of the gradient.&lt;br /&gt;
** The problem with mini-batch RPROP is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?&lt;br /&gt;
&lt;br /&gt;
* RMSPROP: Keep a moving average of the squared gradient for each weight.&lt;br /&gt;
$$MeanSquare(w,t) = 0.9 MeanSquare(w, t-1) + 0.1 \Big( \frac{\partial E}{\partial w}(t) \Big)^2$$&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=471</id>
		<title>Neural Networks (Geoffrey Hinton Course)</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=471"/>
		<updated>2017-01-15T03:56:12Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Some Simple Models of Neurons ==&lt;br /&gt;
&lt;br /&gt;
$y$ output, $x_i$ input.&lt;br /&gt;
&lt;br /&gt;
=== Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$y = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$w_i$ weights, $b$ bias&lt;br /&gt;
&lt;br /&gt;
=== Binary Threshold Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq \theta$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
Or, equivalently,&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq 0$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
=== Rectified Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = z$ if $z &amp;gt; 0$, $0$ otherwise. (Linear above zero, with a hard decision at zero.)&lt;br /&gt;
&lt;br /&gt;
=== Sigmoid Neurons ===&lt;br /&gt;
&lt;br /&gt;
Give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
=== Stochastic Binary Neurons ===&lt;br /&gt;
&lt;br /&gt;
Same equations as logistic units, but outputs $1$ (=spike) or $0$ randomly based on the probability. They treat the output of the logistic as the &amp;lt;em&amp;gt;probability&amp;lt;/em&amp;gt; of producing a spike in a short time window.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$P(s = 1) = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.&lt;br /&gt;
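&lt;br /&gt;
The deterministic neuron types above, as numpy one-liners; in each case the input is $z = b + \sum_i x_i w_i$:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def linear(z):            return z                        # linear neuron
def binary_threshold(z):  return (z >= 0).astype(float)   # 1 if z at or above 0
def relu(z):              return np.maximum(z, 0.0)       # rectified linear
def sigmoid(z):           return 1.0 / (1.0 + np.exp(-z)) # smooth, bounded
```
&lt;br /&gt;
A stochastic binary unit would instead emit a spike with probability sigmoid(z).&lt;br /&gt;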
&lt;br /&gt;
== Types of Learning ==&lt;br /&gt;
&lt;br /&gt;
=== Supervised Learning ===&lt;br /&gt;
Learn to predict an output when given an input vector.&lt;br /&gt;
* Regression: The target output is a real number or a whole vector of real numbers.&lt;br /&gt;
* Classification: The target output is a class label.&lt;br /&gt;
&lt;br /&gt;
==== How Supervised Learning Typically Works ====&lt;br /&gt;
# Start by choosing a model-class: $y = f(x;W)$&lt;br /&gt;
#* A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.&lt;br /&gt;
# Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.&lt;br /&gt;
#* For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.&lt;br /&gt;
#* For classification there are other measures that are generally more sensible (they also work better).&lt;br /&gt;
&lt;br /&gt;
=== Reinforcement Learning ===&lt;br /&gt;
Learn to select an action to maximize payoff.&lt;br /&gt;
&lt;br /&gt;
* The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.&lt;br /&gt;
** The goal in selecting each action is to maximize the expected sum of the future rewards.&lt;br /&gt;
** We usually use a discount factor for delayed rewards so that we don&amp;#039;t have to look too far into the future.&lt;br /&gt;
* Reinforcement learning is difficult because:&lt;br /&gt;
** The rewards are typically delayed so it&amp;#039;s hard to know where we went wrong (or right).&lt;br /&gt;
** A scalar reward does not supply much information.&lt;br /&gt;
* Typically you can&amp;#039;t learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning); you typically learn only dozens or perhaps thousands of parameters.&lt;br /&gt;
* Will not be covered in this course.&lt;br /&gt;
&lt;br /&gt;
=== Unsupervised Learning ===&lt;br /&gt;
Discover a good internal representation of the input.&lt;br /&gt;
&lt;br /&gt;
* For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).&lt;br /&gt;
* It is hard to say what the aim of unsupervised learning is:&lt;br /&gt;
** One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.&lt;br /&gt;
** You can compute the distance to a surface by using the disparity between two images. But you don&amp;#039;t want to learn to compute disparities by stubbing your toe thousands of times.&lt;br /&gt;
* Other goals:&lt;br /&gt;
** Providing a compact, low-dimensional representation of the input.&lt;br /&gt;
*** High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds).&lt;br /&gt;
*** Principal Component Analysis is a widely used linear method for finding a low-dimensional representation.&lt;br /&gt;
** Providing an economical high-dimensional representation of the input in terms of learned features.&lt;br /&gt;
*** Binary features&lt;br /&gt;
*** Real-valued features that are nearly all zero&lt;br /&gt;
** Finding sensible clusters in the input&lt;br /&gt;
*** This is an example of a &amp;lt;em&amp;gt;very&amp;lt;/em&amp;gt; sparse code in which only one of the features is non-zero.&lt;br /&gt;
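&lt;br /&gt;
For instance, the PCA step can be carried out with a plain SVD of the centered data (a minimal sketch; the random data is only for illustration):&lt;br /&gt;

```python
import numpy as np

def pca_project(X, k):
    """Project data onto its top-k principal components: a linear,
    low-dimensional representation of the input."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # coordinates along the top-k directions

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Z = pca_project(X, 2)                              # 5-D inputs -> 2-D codes
```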
&lt;br /&gt;
== Neural Network Architectures ==&lt;br /&gt;
&lt;br /&gt;
# Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.&lt;br /&gt;
# Recurrent neural network: information can flow around in cycles.&lt;br /&gt;
# Symmetrically connected network: weights are the same in both directions between two units.&lt;br /&gt;
&lt;br /&gt;
===  Feed-forward Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* The commonest type of neural network&lt;br /&gt;
* The first layer is the input and the last layer is the output&lt;br /&gt;
* Called &amp;quot;deep&amp;quot; neural networks if there is more than one hidden layer.&lt;br /&gt;
* They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.&lt;br /&gt;
&lt;br /&gt;
=== Recurrent Networks ===&lt;br /&gt;
&lt;br /&gt;
* Have directed cycles in their connection graph.&lt;br /&gt;
* Have complicated dynamics — difficult to train.&lt;br /&gt;
* More biologically realistic.&lt;br /&gt;
* Very natural way to model &amp;lt;em&amp;gt;sequential&amp;lt;/em&amp;gt; data&lt;br /&gt;
* Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).&lt;br /&gt;
&lt;br /&gt;
=== Symmetrically Connected Networks ===&lt;br /&gt;
&lt;br /&gt;
* Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).&lt;br /&gt;
* Much easier to analyze than recurrent networks (John Hopfield et al.)&lt;br /&gt;
* More restricted in what they can do because they obey an energy function. For example, they cannot model cycles.&lt;br /&gt;
&lt;br /&gt;
== The First Generation of Neural Networks ==&lt;br /&gt;
&lt;br /&gt;
=== Standard Paradigm for Statistical Pattern Recognition ===&lt;br /&gt;
&lt;br /&gt;
* Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).&lt;br /&gt;
* Learn how to weight each of the feature activations to get a single scalar quantity.&lt;br /&gt;
* If this quantity is above some threshold, decide that the input vector is a positive example of the target class.&lt;br /&gt;
&lt;br /&gt;
=== Perceptrons ===&lt;br /&gt;
&lt;br /&gt;
input units =&amp;gt; (by hand-coded programs) =&amp;gt; feature units =&amp;gt; decision unit&lt;br /&gt;
&lt;br /&gt;
* Popularized by Frank Rosenblatt in the early 1960s.&lt;br /&gt;
* Lots of grand claims were made for what they could learn to do.&lt;br /&gt;
* In 1969, Minsky and Papert showed Perceptrons&amp;#039; limitations (&amp;#039;&amp;#039;&amp;#039;Group Invariance Theorem&amp;#039;&amp;#039;&amp;#039;). Many people thought these limitations applied to all neural network models.&lt;br /&gt;
* Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.&lt;br /&gt;
* Decision units in perceptrons are &amp;#039;&amp;#039;&amp;#039;binary threshold neurons&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Perceptron Learning Procedure ====&lt;br /&gt;
&lt;br /&gt;
# Add an extra component with value 1 to each input vector. Its weight acts as the &amp;quot;bias&amp;quot;, equal to minus the threshold, so we can forget about the threshold.&lt;br /&gt;
# Pick training cases using any policy that ensures that every training case will keep getting picked.&lt;br /&gt;
#* If the output unit is correct, leave its weights alone.&lt;br /&gt;
#* If the output unit incorrectly outputs a zero, add the input vector to the weight vector.&lt;br /&gt;
#* If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.&lt;br /&gt;
# This is guaranteed to find a set of weights that gets the right answer for all the training cases &amp;#039;&amp;#039;if any such set exists&amp;#039;&amp;#039;.&lt;br /&gt;
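&lt;br /&gt;
The procedure above in a few lines (a sketch; AND is used as the target because it is linearly separable, so the guarantee applies):&lt;br /&gt;

```python
import numpy as np

def train_perceptron(X, t, epochs=100):
    """Perceptron learning procedure: a 1 is appended to each input
    so the bias is just another weight."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # extra component with value 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, t):
            y = int(x @ w >= 0)                 # binary threshold unit
            if y == target:
                continue                        # correct: leave weights alone
            w += x if target == 1 else -x       # add/subtract the input vector
    return w

# AND is linearly separable, so the procedure converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])
w = train_perceptron(X, t)
```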
&lt;br /&gt;
==== Geometrical View of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* Weight space (each point corresponds to a particular setting for all the weights)&lt;br /&gt;
* Each training case defines a hyperplane through the origin (if we eliminate the threshold/bias). That hyperplane is perpendicular to the &amp;#039;&amp;#039;input&amp;#039;&amp;#039; vector.&lt;br /&gt;
** The weights must lie on one side of this hyperplane to get the answer correct.&lt;br /&gt;
* To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).&lt;br /&gt;
* If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. This means the problem is &amp;#039;&amp;#039;&amp;#039;convex&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Limitations of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.&lt;br /&gt;
* But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.&lt;br /&gt;
** A binary threshold output unit cannot even tell if two single bit features are the same!&lt;br /&gt;
*** Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$&lt;br /&gt;
*** Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$&lt;br /&gt;
*** The four input-output pairs give four inequalities that are impossible to satisfy.&lt;br /&gt;
** Imagine &amp;#039;data-space&amp;#039; in which the axes correspond to components of an input vector.&lt;br /&gt;
*** Each input vector is a point in this space, a weight vector defines a plane in data-space.&lt;br /&gt;
*** The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.&lt;br /&gt;
*** There are many cases where we cannot separate positive and negative cases only by a hyperplane. Those cases are called &amp;#039;&amp;#039;&amp;#039;not linearly separable&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Cannot discriminate simple patterns under translation with wrap-around.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Conclusion ====&lt;br /&gt;
&lt;br /&gt;
* Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help, it&amp;#039;s still linear. Fixed output non-linearities are not enough.&lt;br /&gt;
* We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?&lt;br /&gt;
** We need an efficient way of adapting all the weights, not just the last layer. This is hard.&lt;br /&gt;
** Learning the weights going into hidden units is equivalent to learning features.&lt;br /&gt;
** This is difficult because nobody is telling us directly what the hidden units should do.&lt;br /&gt;
&lt;br /&gt;
== Linear Neurons ==&lt;br /&gt;
&lt;br /&gt;
=== Learning Overview ===&lt;br /&gt;
&lt;br /&gt;
$y = \sum_{i} w_i x_i = \mathbf{w}^T \mathbf{x}$&lt;br /&gt;
&lt;br /&gt;
$y$: Neuron&amp;#039;s estimate of the desired output, $\mathbf{w}$: weight vector, $\mathbf{x}$: input vector&lt;br /&gt;
&lt;br /&gt;
* Perceptron convergence procedure doesn&amp;#039;t work for more complex networks (because the problem is not &amp;#039;&amp;#039;convex&amp;#039;&amp;#039;). Multi-layer neural networks do not use the perceptron learning procedure.&lt;br /&gt;
* Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.&lt;br /&gt;
* The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.&lt;br /&gt;
&lt;br /&gt;
* Why don&amp;#039;t we solve it analytically?&lt;br /&gt;
** Scientific answer: We want a method that real neurons could use.&lt;br /&gt;
** Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.&lt;br /&gt;
&lt;br /&gt;
=== Learning Procedure ===&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;delta-rule&amp;#039;&amp;#039; for learning:&lt;br /&gt;
** $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value&lt;br /&gt;
** $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error&lt;br /&gt;
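&lt;br /&gt;
A minimal sketch of the delta rule on a noiseless linear problem (the data and learning rate are illustrative):&lt;br /&gt;

```python
import numpy as np

def delta_rule_epoch(X, t, w, eps=0.05):
    """One online sweep of the delta rule: Delta w_i = eps * x_i * (t - y)."""
    for x, target in zip(X, t):
        y = x @ w                        # linear neuron's output
        w = w + eps * x * (target - y)   # move the weights toward the target
    return w

# Recover y = 2*x1 - 1*x2 from noiseless examples.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
t = X @ np.array([2.0, -1.0])
w = np.zeros(2)
for _ in range(50):
    w = delta_rule_epoch(X, t, w)
```

With noiseless targets the weights converge to the true values; with noisy targets they would settle near the least-squares solution.&lt;br /&gt;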
&lt;br /&gt;
=== The Error Surface for Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
* The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl.&lt;br /&gt;
** Vertical cross-sections are parabolas.&lt;br /&gt;
** Horizontal cross-sections are ellipses.&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated.&lt;br /&gt;
* The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.&lt;br /&gt;
* The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.&lt;br /&gt;
&lt;br /&gt;
== Logistic Neurons ==&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
* These give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
* The derivatives:&lt;br /&gt;
** $\frac{\partial z}{\partial w_i} = x_i$&lt;br /&gt;
** $\frac{\partial z}{\partial x_i} = w_i$&lt;br /&gt;
** $\frac{dy}{dz} = y(1-y)$&lt;br /&gt;
** By the chain rule, $\frac{\partial y}{\partial w_i} = x_i y (1-y)$.&lt;br /&gt;
** By the chain rule, $\frac{\partial E}{\partial w_i} = \sum_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum_{n} x_i^n y^n (1-y^n) (t^n - y^n)$.&lt;br /&gt;
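&lt;br /&gt;
These derivatives can be checked numerically (a sketch with made-up data):&lt;br /&gt;

```python
import numpy as np

def logistic_grad(X, t, w, b):
    """dE/dw_i = -sum_n x_i^n y^n (1 - y^n)(t^n - y^n), as derived above."""
    y = 1.0 / (1.0 + np.exp(-(b + X @ w)))
    return -(X.T @ (y * (1 - y) * (t - y)))

def E(X, t, w, b):
    """Squared error summed over training cases."""
    y = 1.0 / (1.0 + np.exp(-(b + X @ w)))
    return 0.5 * np.sum((t - y) ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3)); t = rng.random(10)
w = rng.normal(size=3); b = 0.1
g = logistic_grad(X, t, w, b)

eps = 1e-6                              # central-difference check on w_0
dw = np.array([eps, 0.0, 0.0])
numeric = (E(X, t, w + dw, b) - E(X, t, w - dw, b)) / (2 * eps)
```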
&lt;br /&gt;
== The Backpropagation Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== Other Ideas ===&lt;br /&gt;
&lt;br /&gt;
==== Learning by Perturbing Weights ====&lt;br /&gt;
&lt;br /&gt;
* This idea occurs to everyone who knows about evolution.&lt;br /&gt;
* Randomly perturb one weight and see if it improves performance. If so, save the change.&lt;br /&gt;
** This is a form of reinforcement learning.&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;Very inefficient&amp;#039;&amp;#039;&amp;#039;. We need to do multiple forward passes on a representative set of training cases just to change one weight.&lt;br /&gt;
** Towards the end of learning, large weight perturbations will nearly always make things worse.&lt;br /&gt;
&lt;br /&gt;
==== Learning by Using Perturbations ====&lt;br /&gt;
&lt;br /&gt;
* We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.&lt;br /&gt;
* Not any better, because we need lots of trials on each training case to see the effect of changing one weight through the noise created by all the changes to the other weights.&lt;br /&gt;
* A better idea: Randomly perturb the &amp;#039;&amp;#039;activities&amp;#039;&amp;#039; of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights.&lt;br /&gt;
&lt;br /&gt;
=== The Idea behind Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* We don&amp;#039;t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.&lt;br /&gt;
* We can compute error derivatives for all the hidden units efficiently at the same time.&lt;br /&gt;
** Once we have the error derivatives for the hidden activities, it&amp;#039;s easy to get the error derivatives for the weights going into a hidden unit.&lt;br /&gt;
&lt;br /&gt;
=== Sketch of the Backpropagation Algorithm ===&lt;br /&gt;
&lt;br /&gt;
* First convert the discrepancy between each output and its target value into an error derivative.&lt;br /&gt;
* Then compute error derivatives in each hidden layer from error derivatives in the layer above.&lt;br /&gt;
* Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.&lt;br /&gt;
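&lt;br /&gt;
The three steps above, sketched for one hidden layer of logistic units with a linear output and squared error (the sizes, data, and learning rate are arbitrary):&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 3))            # 8 training cases, 3 inputs
t = rng.normal(size=(8, 1))            # target outputs
W1 = 0.1 * rng.normal(size=(3, 4))     # input -> hidden weights
W2 = 0.1 * rng.normal(size=(4, 1))     # hidden -> output weights

def error(y):
    return 0.5 * np.sum((y - t) ** 2)

h = 1.0 / (1.0 + np.exp(-(X @ W1)))
E_start = error(h @ W2)

for step in range(500):
    h = 1.0 / (1.0 + np.exp(-(X @ W1)))  # forward pass: hidden activities
    y = h @ W2                            # linear output
    dE_dy = y - t                         # output discrepancy -> error derivative
    dE_dh = dE_dy @ W2.T                  # error derivatives one layer below
    dE_dz = dE_dh * h * (1 - h)           # back through the logistic non-linearity
    W2 -= 0.05 * (h.T @ dE_dy)            # derivatives w.r.t. incoming weights
    W1 -= 0.05 * (X.T @ dE_dz)

h = 1.0 / (1.0 + np.exp(-(X @ W1)))
E_end = error(h @ W2)
```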
&lt;br /&gt;
=== Using the Derivatives Computed by Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* The backpropagation algorithm is an efficient way of computing the error derivative $\partial E / \partial w$ for every weight on a single training case.&lt;br /&gt;
* To get a fully specified learning procedure, we still need to make a lot of other decisions about how to use these error derivatives:&lt;br /&gt;
** Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?&lt;br /&gt;
** Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?&lt;br /&gt;
&lt;br /&gt;
==== Optimization Issues ====&lt;br /&gt;
&lt;br /&gt;
* How often to update the weights:&lt;br /&gt;
** Online: after each training case.&lt;br /&gt;
** Full batch: after a full sweep through the training data.&lt;br /&gt;
** Mini-batch: after a small sample of training cases. This is typically used when training big nets on big data sets.&lt;br /&gt;
* How much to update: (will be discussed further later)&lt;br /&gt;
** Use a fixed learning rate?&lt;br /&gt;
** Adapt the global learning rate?&lt;br /&gt;
** Adapt the learning rate on each connection separately?&lt;br /&gt;
** Don&amp;#039;t use steepest descent?&lt;br /&gt;
&lt;br /&gt;
==== Generalization Issue (Overfitting) ====&lt;br /&gt;
&lt;br /&gt;
* The downside of using powerful models.&lt;br /&gt;
* The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise:&lt;br /&gt;
** The target values may be unreliable (usually only a minor worry).&lt;br /&gt;
** There is &amp;#039;&amp;#039;&amp;#039;sampling error&amp;#039;&amp;#039;&amp;#039;. There will be accidental regularities just because of the particular training cases that were chosen.&lt;br /&gt;
* When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well. &amp;#039;&amp;#039;This is a disaster&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Ways to Reduce Overfitting ====&lt;br /&gt;
&lt;br /&gt;
* A large number of different methods have been developed:&lt;br /&gt;
** Weight-decay: keep weights small / zero.&lt;br /&gt;
** Weight-sharing: insist that many weights have the same value.&lt;br /&gt;
** Early stopping: Make a fake test set. While training the net, stop as soon as performance on the fake test set starts getting worse.&lt;br /&gt;
** Model averaging: Train many different neural nets and average them.&lt;br /&gt;
** Bayesian fitting of neural nets: really just a fancy form of model averaging.&lt;br /&gt;
** Dropout: Randomly omit a hidden unit.&lt;br /&gt;
** Generative pre-training: (Will be discussed towards the end of the course.)&lt;br /&gt;
&lt;br /&gt;
== Learning to Predict the Next Word ==&lt;br /&gt;
&lt;br /&gt;
=== Diversion into Cognitive Science ===&lt;br /&gt;
&lt;br /&gt;
There has been a long debate in cognitive science between two rival theories of what it means to have a &amp;#039;&amp;#039;concept&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
* The feature theory: A concept is a set of semantic features&lt;br /&gt;
** Good for explaining similarities between concepts.&lt;br /&gt;
** It&amp;#039;s convenient -- a concept is a vector of feature activities.&lt;br /&gt;
* The structuralist theory: The meaning of a concept lies in its relationships to other concepts.&lt;br /&gt;
** So conceptual knowledge is best expressed as a relational graph.&lt;br /&gt;
** Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.&lt;br /&gt;
&lt;br /&gt;
Both sides are wrong.&lt;br /&gt;
&lt;br /&gt;
* These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.&lt;br /&gt;
* We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just &amp;quot;seeing&amp;quot; the answer with no conscious intervening steps.&lt;br /&gt;
&lt;br /&gt;
==== Localist and Distributed Representations of Concepts ====&lt;br /&gt;
&lt;br /&gt;
The obvious way to implement a relational graph in a neural net is to treat a neuron as a node in the graph and a connection as a binary relationship. But this &amp;quot;localist&amp;quot; method will not work:&lt;br /&gt;
&lt;br /&gt;
* We need many different types of relationship and the connections in a neural net do not have discrete labels.&lt;br /&gt;
* We need ternary relationships as well as binary ones, e.g. A is between B and C&lt;br /&gt;
&lt;br /&gt;
The right way to implement relational knowledge in a neural net is still an open issue.&lt;br /&gt;
&lt;br /&gt;
* But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a &amp;quot;distributed representation&amp;quot;. It&amp;#039;s a many-to-many mapping between concepts and neurons.&lt;br /&gt;
&lt;br /&gt;
=== The Softmax Output Function (Another Diversion) ===&lt;br /&gt;
&lt;br /&gt;
The squared error measure has some drawbacks. Is there a different cost function that works better?&lt;br /&gt;
&lt;br /&gt;
Yes, force the outputs to represent a probability distribution across discrete alternatives.&lt;br /&gt;
&lt;br /&gt;
==== Softmax ====&lt;br /&gt;
&lt;br /&gt;
The output units in a softmax group use a non-local non-linearity. Let $y_i$ be the output of a neuron in a softmax group, and $z_i$ (called the &amp;#039;&amp;#039;logit&amp;#039;&amp;#039;) the input of the same neuron:&lt;br /&gt;
&lt;br /&gt;
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j} }$$&lt;br /&gt;
&lt;br /&gt;
Note that $\sum y_i = 1$, so we force the $y_i$ to represent a probability distribution. We can show:&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$&lt;br /&gt;
&lt;br /&gt;
The right cost function to use with softmax is the negative log probability of the right answer. This is called the &amp;#039;&amp;#039;Cross Entropy&amp;#039;&amp;#039; cost function. ($t_j$ is the target value.)&lt;br /&gt;
&lt;br /&gt;
$$C = - \sum_{j} t_j \log y_j$$&lt;br /&gt;
&lt;br /&gt;
$C$ has a very big gradient when the target value is 1 and the output is almost zero.&lt;br /&gt;
* A value of 0.000001 is much better than 0.000000001 (one-in-a-million chance vs one-in-a-billion chance).&lt;br /&gt;
* The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$.&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial C}{\partial z_i} = \sum_{j} \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$&lt;br /&gt;
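&lt;br /&gt;
Both identities can be verified numerically (a sketch; the logits below are arbitrary):&lt;br /&gt;

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift by the max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])          # arbitrary logits
t = np.array([0.0, 1.0, 0.0])          # one-hot target
y = softmax(z)
analytic = y - t                       # dC/dz_i = y_i - t_i

C = lambda v: -np.sum(t * np.log(softmax(v)))   # cross-entropy cost
eps = 1e-6
I = np.eye(3)
numeric = np.array([(C(z + eps * I[i]) - C(z - eps * I[i])) / (2 * eps)
                    for i in range(3)])
```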
&lt;br /&gt;
&lt;br /&gt;
=== Neuro-Probabilistic Language Models ===&lt;br /&gt;
&lt;br /&gt;
* We cannot identify phonemes perfectly in noisy speech.&lt;br /&gt;
* People use their understanding of the meaning of the utterance to hear the right words.&lt;br /&gt;
** &amp;quot;We do this unconsciously when we wreck a nice beach&amp;quot;&lt;br /&gt;
** We are very good at it.&lt;br /&gt;
* This means speech recognizers have to know which words are likely to come next and which are not.&lt;br /&gt;
** Fortunately words can be predicted quite well without full understanding.&lt;br /&gt;
&lt;br /&gt;
==== The Standard &amp;quot;Trigram&amp;quot; Method ====&lt;br /&gt;
&lt;br /&gt;
* Take a huge amount of text and count the frequencies of all triples of words.&lt;br /&gt;
&lt;br /&gt;
* Use these frequencies to make bets on the relative probabilities of words given the previous two words.&lt;br /&gt;
&lt;br /&gt;
* Until very recently this was the state-of-the-art.&lt;br /&gt;
** We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.&lt;br /&gt;
** We have to &amp;quot;back-off&amp;quot; to digrams when the count for a trigram is too small. The probability is not zero just because the count is zero!&lt;br /&gt;
&lt;br /&gt;
* A trigram model doesn&amp;#039;t understand the similarities between words.&lt;br /&gt;
** The sentence &amp;quot;The cat got squashed in the garden on friday&amp;quot; should help us predict words in the sentence &amp;quot;The dog got flattened in the yard on monday&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.&lt;br /&gt;
** Using a feature representation also allows a context that contains many more previous words, e.g. 10.&lt;br /&gt;
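&lt;br /&gt;
The counting step of the trigram method fits in a few lines (a toy corpus; a real model would also need the back-off to smaller contexts described above):&lt;br /&gt;

```python
from collections import Counter

def trigram_probs(corpus):
    """Count word triples and turn the counts into conditional
    probabilities P(w3 | w1, w2) -- the standard trigram method."""
    tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bi = Counter(zip(corpus, corpus[1:]))
    return {(a, b, c): n / bi[(a, b)] for (a, b, c), n in tri.items()}

corpus = "the cat sat on the mat the cat ate the fish".split()
p = trigram_probs(corpus)
# "the cat" occurs twice, followed once by "sat" and once by "ate",
# so each continuation gets probability 0.5.
```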
&lt;br /&gt;
* Yoshua Bengio&amp;#039;s neural net for predicting the next word&lt;br /&gt;
&lt;br /&gt;
== Object Recognition ==&lt;br /&gt;
&lt;br /&gt;
=== Why Object Recognition is Difficult ===&lt;br /&gt;
&lt;br /&gt;
* Segmentation: Real scenes are cluttered with other objects.&lt;br /&gt;
* Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.&lt;br /&gt;
* Deformation: Objects can deform in a variety of non-affine ways.&lt;br /&gt;
* Affordances: Object classes are things defined by how they are used.&lt;br /&gt;
* Viewpoint: Changes in viewpoint cause changes in images that standard learning methods cannot cope with.&lt;br /&gt;
&lt;br /&gt;
=== Achieving Viewpoint Invariance ===&lt;br /&gt;
&lt;br /&gt;
* Viewpoint invariance is one of the main difficulties in making computers perceive.&lt;br /&gt;
* Several different approaches:&lt;br /&gt;
** Use redundant invariant features.&lt;br /&gt;
** Put a box around the object and use normalized pixels.&lt;br /&gt;
** Use replicated features with pooling. This is called &amp;#039;&amp;#039;convolutional neural nets&amp;#039;&amp;#039; (in later lectures).&lt;br /&gt;
** Use a hierarchy of parts that have explicit poses relative to the camera (in later lectures).&lt;br /&gt;
&lt;br /&gt;
==== Invariant Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* Extract a large, redundant set of features that are invariant under transformations.&lt;br /&gt;
** e.g. pair of roughly parallel lines with a red dot between them (this is what baby herring gulls use to know where to peck for food).&lt;br /&gt;
* With enough invariant features, there is only one way to assemble them into an object.&lt;br /&gt;
* But for recognition, we must avoid forming features from parts of different objects.&lt;br /&gt;
&lt;br /&gt;
==== Judicious Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* Put a box around the object and use it as a coordinate frame for a set of normalized pixels.&lt;br /&gt;
** This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.&lt;br /&gt;
** The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, ...&lt;br /&gt;
&lt;br /&gt;
* But choosing the box is difficult because of:&lt;br /&gt;
** Segmentation errors, occlusion, unusual orientations.&lt;br /&gt;
&lt;br /&gt;
* We need to recognize the shape to get the box right!&lt;br /&gt;
&lt;br /&gt;
==== Brute Force Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* When training the recognizer, use well-segmented, upright images to fit the correct box.&lt;br /&gt;
&lt;br /&gt;
* At test time try all possible boxes in a range of positions and scales.&lt;br /&gt;
&lt;br /&gt;
** This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.&lt;br /&gt;
** It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Digit Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== The Replicated Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* This is currently the dominant approach for neural networks.&lt;br /&gt;
&lt;br /&gt;
* Use many different copies of the same feature detector with different positions.&lt;br /&gt;
&lt;br /&gt;
** Could also replicate across scale and orientation but it is tricky and expensive.&lt;br /&gt;
** Replication greatly reduces the number of free parameters to be learned.&lt;br /&gt;
&lt;br /&gt;
* Use several different feature types, each with its own map of replicated detectors.&lt;br /&gt;
&lt;br /&gt;
==== Backpropagation with Weight Constraints ====&lt;br /&gt;
&lt;br /&gt;
* It&amp;#039;s easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.&lt;br /&gt;
&lt;br /&gt;
* We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints - so if the weights started off satisfying the constraints, they will continue to satisfy them.&lt;br /&gt;
&lt;br /&gt;
To constrain $w_1 = w_2$, we need $\Delta w_1 = \Delta w_2$.&lt;br /&gt;
&lt;br /&gt;
Compute $\partial E / \partial w_1$ and $\partial E / \partial w_2$, then use $\partial E / \partial w_1 + \partial E / \partial w_2$ for both $w_1$ and $w_2$.&lt;br /&gt;
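&lt;br /&gt;
As a tiny sketch of that rule (the gradient values below are made up):&lt;br /&gt;

```python
def tied_update(w1, w2, g1, g2, eps=0.1):
    """Keep two weights tied (w1 == w2): compute both gradients and
    apply their sum, dE/dw1 + dE/dw2, to each weight."""
    g = g1 + g2
    return w1 - eps * g, w2 - eps * g

# Starting from equal weights, the constraint is preserved after the update.
w1, w2 = 0.3, 0.3
w1, w2 = tied_update(w1, w2, g1=0.5, g2=-0.1, eps=0.1)
```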
&lt;br /&gt;
==== What Does Replicating the Feature Detectors Achieve? ====&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Equivariant&amp;#039;&amp;#039;&amp;#039; activities: Replicated features do &amp;#039;&amp;#039;not&amp;#039;&amp;#039; make the neural activities invariant to translation. The activities are equivariant.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Invariant&amp;#039;&amp;#039;&amp;#039; knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.&lt;br /&gt;
&lt;br /&gt;
==== Pooling the Outputs of Replicated Feature Detectors ====&lt;br /&gt;
&lt;br /&gt;
* Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.&lt;br /&gt;
** This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.&lt;br /&gt;
** Taking the maximum of the four works slightly better.&lt;br /&gt;
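&lt;br /&gt;
Pooling four neighboring detector outputs, in either flavor (a minimal sketch on a made-up activation map):&lt;br /&gt;

```python
import numpy as np

def pool2x2(a, mode="max"):
    """Pool each non-overlapping 2x2 block of neighboring detector
    outputs into a single value (max or average)."""
    h, w = a.shape
    blocks = a[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

a = np.array([[1., 2., 0., 1.],
              [3., 4., 1., 0.],
              [0., 0., 5., 6.],
              [0., 0., 7., 8.]])
pooled = pool2x2(a)        # max pooling: [[4., 1.], [0., 8.]]
```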
&lt;br /&gt;
* Problem: After several levels of pooling, we have lost information about the precise positions of things.&lt;br /&gt;
** This makes it impossible to use the precise spatial relationships between high-level parts for recognition.&lt;br /&gt;
&lt;br /&gt;
==== LeNet ====&lt;br /&gt;
&lt;br /&gt;
* Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:&lt;br /&gt;
** Many hidden layers.&lt;br /&gt;
** Many maps of replicated units in each layer.&lt;br /&gt;
** Pooling of the outputs of nearby replicated units.&lt;br /&gt;
** A wide net that can cope with several characters at once even if they overlap.&lt;br /&gt;
** A clever way of training a complete system, not just a recognizer.&lt;br /&gt;
&lt;br /&gt;
* This net was used for reading ~10% of the checks in North America.&lt;br /&gt;
&lt;br /&gt;
* Look at the impressive demos of LeNet at http://yann.lecun.com&lt;br /&gt;
&lt;br /&gt;
==== Priors and Prejudice ====&lt;br /&gt;
&lt;br /&gt;
* We can put our prior knowledge about the task into the network by designing appropriate:&lt;br /&gt;
** Connectivity&lt;br /&gt;
** Weight constraints&lt;br /&gt;
** Neuron activation functions&lt;br /&gt;
&lt;br /&gt;
* This is less intrusive than hand-designing the features&lt;br /&gt;
** But it still prejudices the network towards the particular way of solving the problem that we had in mind.&lt;br /&gt;
&lt;br /&gt;
* Alternatively, we can use our prior knowledge to create a whole lot more training data.&lt;br /&gt;
** This may require a lot of work (Hofman &amp;amp; Tresp, 1993). Real data + synthetic data.&lt;br /&gt;
** It may make learning take much longer.&lt;br /&gt;
** It allows optimization to discover clever ways of using the multi-layer network that we did not think of - and we may never fully understand how it does it.&lt;br /&gt;
&lt;br /&gt;
==== The Brute Force Approach ====&lt;br /&gt;
&lt;br /&gt;
* LeNet uses knowledge about the invariances to design:&lt;br /&gt;
** the local connectivity&lt;br /&gt;
** the weight-sharing&lt;br /&gt;
** the pooling&lt;br /&gt;
&lt;br /&gt;
* This achieves about 80 errors on the 10,000 MNIST test cases.&lt;br /&gt;
** This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008).&lt;br /&gt;
&lt;br /&gt;
* Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:&lt;br /&gt;
** For each training image, they produce many new training examples by applying many different transformations.&lt;br /&gt;
** They can then train a large, deep, dumb net on a GPU without much overfitting.&lt;br /&gt;
** They achieved about 35 errors.&lt;br /&gt;
&lt;br /&gt;
* With model averaging they can now get about 25 errors (which is about the same as human error rate).&lt;br /&gt;
&lt;br /&gt;
==== How to Detect a Significant Drop in the Error Rate ====&lt;br /&gt;
&lt;br /&gt;
* Is 30 errors in 10,000 test cases significantly better than 40 errors?&lt;br /&gt;
** It all depends on the particular errors.&lt;br /&gt;
** The McNemar test uses the particular errors and can be much more powerful than a test that just uses the number of errors.&lt;br /&gt;
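&lt;br /&gt;
A sketch of the exact (binomial) form of the McNemar test; the disagreement counts below are hypothetical, not from the actual experiments:&lt;br /&gt;

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar test. Only the n = b + c disagreements matter:
    b = cases model A got wrong but B got right, c = the reverse.
    Under the null hypothesis each disagreement is a fair coin flip."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical split: B fixes 15 of A's errors and introduces 5 new ones.
p = mcnemar_exact_p(15, 5)             # ~0.041, significant at the 5% level
```

The same 10-error gap spread as, say, 25 fixes vs 15 new errors would not be significant, which is why the particular errors matter.&lt;br /&gt;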
&lt;br /&gt;
=== Convolutional Nets for Object Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== From Hand-Written Digits to 3D Objects ====&lt;br /&gt;
&lt;br /&gt;
* Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:&lt;br /&gt;
** A hundred times as many classes (1000 vs 10)&lt;br /&gt;
** A hundred times as many pixels (256x256 color vs 28x28 gray)&lt;br /&gt;
** Two dimensional image of three-dimensional scene.&lt;br /&gt;
** Cluttered scenes requiring segmentation.&lt;br /&gt;
** Multiple objects in each image.&lt;br /&gt;
&lt;br /&gt;
* Will the same type of convolutional neural network work?&lt;br /&gt;
&lt;br /&gt;
==== The ILSVRC-2012 Competition on ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* The dataset has 1.2 million high-resolution training images.&lt;br /&gt;
&lt;br /&gt;
* The classification task:&lt;br /&gt;
** Get the &amp;quot;correct&amp;quot; class in your top 5 bets. There are 1000 classes.&lt;br /&gt;
&lt;br /&gt;
* The localization task:&lt;br /&gt;
** For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.&lt;br /&gt;
&lt;br /&gt;
* Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...&lt;br /&gt;
** These computer vision systems use complicated multi-stage pipelines.&lt;br /&gt;
** The early stages are typically hand-tuned by optimizing a few parameters.&lt;br /&gt;
&lt;br /&gt;
==== Error Rates on the ILSVRC-2012 Competition ====&lt;br /&gt;
&lt;br /&gt;
* University of Tokyo: 26.1% (classification)&lt;br /&gt;
* Oxford University Computer Vision Group: 26.9% (classification)&lt;br /&gt;
* INRIA(French national research institute in CS) + XRCE(Xerox Research Center Europe): 27.0% (classification)&lt;br /&gt;
* University of Amsterdam: 29.5% (classification)&lt;br /&gt;
* University of Toronto (&amp;#039;&amp;#039;&amp;#039;Alex Krizhevsky&amp;#039;&amp;#039;&amp;#039;): &amp;#039;&amp;#039;&amp;#039;16.4%&amp;#039;&amp;#039;&amp;#039; (classification)&lt;br /&gt;
&lt;br /&gt;
==== A Neural Network for ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its &amp;#039;&amp;#039;architecture&amp;#039;&amp;#039; was:&lt;br /&gt;
** 7 hidden layers not counting some max pooling layers.&lt;br /&gt;
** The early layers were convolutional.&lt;br /&gt;
** The last two layers were globally connected.&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;activation functions&amp;#039;&amp;#039; were:&lt;br /&gt;
** Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.&lt;br /&gt;
** Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.&lt;br /&gt;
&lt;br /&gt;
==== Tricks that Significantly Improve Generalization ====&lt;br /&gt;
&lt;br /&gt;
* Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.&lt;br /&gt;
** At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;dropout&amp;quot; to regularize the weights in the globally connected layers (which contain most of the parameters).&lt;br /&gt;
** Dropout means that half of the hidden units in a layer are randomly removed for each training example.&lt;br /&gt;
** This stops hidden units from relying too much on other hidden units.&lt;br /&gt;
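&lt;br /&gt;
A sketch of the idea for a single layer of activities (illustrative only, not the actual implementation):&lt;br /&gt;
&lt;br /&gt;
```python
import random

def dropout(activations, p_drop=0.5, train=True):
    # Training: each hidden unit is removed (zeroed) with probability p_drop.
    # Test time: keep every unit but scale by (1 - p_drop) so the expected
    # input to the next layer matches what it saw during training.
    if not train:
        return [a * (1.0 - p_drop) for a in activations]
    return [a if random.random() >= p_drop else 0.0 for a in activations]
```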
&lt;br /&gt;
==== The Hardware Required for Alex Krizhevsky&amp;#039;s Net ====&lt;br /&gt;
&lt;br /&gt;
* He uses a very efficient implementation of convolutional nets on two Nvidia GTX 580 Graphics Processor Units (over 1000 fast little cores)&lt;br /&gt;
** GPUs are very good for matrix-matrix multiplies.&lt;br /&gt;
** GPUs have very high bandwidth to memory.&lt;br /&gt;
** This allows him to train the network in a week.&lt;br /&gt;
** It also makes it quick to combine results from 10 patches at test time.&lt;br /&gt;
&lt;br /&gt;
* We can spread a network over many cores if we can communicate the states fast enough.&lt;br /&gt;
&lt;br /&gt;
* As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.&lt;br /&gt;
&lt;br /&gt;
==== Finding Roads in High-Resolution Images ====&lt;br /&gt;
&lt;br /&gt;
* Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images.&lt;br /&gt;
** It takes a large image patch and predicts a binary road label for the central 16x16 pixels.&lt;br /&gt;
** There is lots of labeled training data available for this task.&lt;br /&gt;
&lt;br /&gt;
* The task is hard for many reasons:&lt;br /&gt;
** Occlusion by buildings, trees, and cars.&lt;br /&gt;
** Shadows and lighting changes.&lt;br /&gt;
** Minor viewpoint changes.&lt;br /&gt;
&lt;br /&gt;
* The worst problems are incorrect labels:&lt;br /&gt;
** Badly registered maps&lt;br /&gt;
** Arbitrary decisions about what counts as a road (vs laneway etc.)&lt;br /&gt;
&lt;br /&gt;
== Mini-Batch Gradient Descent ==&lt;br /&gt;
&lt;br /&gt;
=== Overview ===&lt;br /&gt;
&lt;br /&gt;
* The Error Surface: Lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl where vertical cross-sections are parabolas and horizontal cross-sections are ellipses.&lt;br /&gt;
&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated. But locally, a piece of a quadratic bowl is usually a very good approximation.&lt;br /&gt;
&lt;br /&gt;
* Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.&lt;br /&gt;
** The gradient is big in the direction in which we only want to travel a small distance, small in the direction in which we want to travel a large distance.&lt;br /&gt;
&lt;br /&gt;
* If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges.&lt;br /&gt;
&lt;br /&gt;
* What we would like to achieve: move quickly in directions with small but consistent gradients, and move slowly in directions with big but inconsistent gradients.&lt;br /&gt;
&lt;br /&gt;
==== Stochastic Gradient Descent ====&lt;br /&gt;
&lt;br /&gt;
* If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.&lt;br /&gt;
** So, instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.&lt;br /&gt;
** The extreme version of this approach updates weights after each case. It&amp;#039;s called &amp;quot;online&amp;quot; learning.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches are usually better than online.&lt;br /&gt;
** Less computation is used updating the weights.&lt;br /&gt;
** Computing the gradient for many cases simultaneously uses matrix-matrix multiplies which are very efficient, especially on GPUs.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches need to be balanced for classes (one way to approximate this is to pick points randomly).&lt;br /&gt;
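&lt;br /&gt;
The scheme above can be sketched for a single linear neuron (a toy illustration; the learning rate, batch size, and epoch count are arbitrary choices):&lt;br /&gt;
&lt;br /&gt;
```python
import random

def minibatch_sgd(data, lr=0.1, batch=2, epochs=1000):
    # Train a linear neuron y = w*x + b with squared error, updating
    # the weights after each mini-batch rather than after a full sweep
    # or after every single case.
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)   # crude way to keep mini-batches balanced
        for i in range(0, len(data), batch):
            chunk = data[i:i + batch]
            gw = sum((w * x + b - t) * x for x, t in chunk) / len(chunk)
            gb = sum((w * x + b - t) for x, t in chunk) / len(chunk)
            w -= lr * gw
            b -= lr * gb
    return w, b
```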
&lt;br /&gt;
==== Two types of learning algorithm ====&lt;br /&gt;
&lt;br /&gt;
* If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).&lt;br /&gt;
&lt;br /&gt;
* For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.&lt;br /&gt;
&lt;br /&gt;
==== Basic Mini-Batch Gradient Descent Algorithm ====&lt;br /&gt;
&lt;br /&gt;
* Guess an initial learning rate.&lt;br /&gt;
** If the error keeps getting worse or oscillates wildly, reduce the learning rate.&lt;br /&gt;
** If the error is falling fairly consistently but slowly, increase the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Write a simple program to automate this way of adjusting the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Toward the end of mini-batch learning, it nearly always helps to turn down the learning rate.&lt;br /&gt;
** This removes fluctuations in the final weights caused by the variations between mini-batches.&lt;br /&gt;
&lt;br /&gt;
* Turn down the learning rate when the error stops decreasing.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Note&amp;#039;&amp;#039;&amp;#039;: Compute the error in a separate validation set.&lt;br /&gt;
&lt;br /&gt;
=== Tricks for Mini-Batch Gradient Descent ===&lt;br /&gt;
&lt;br /&gt;
==== Initializing The Weights ====&lt;br /&gt;
&lt;br /&gt;
* If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.&lt;br /&gt;
** So they can never learn to be different features.&lt;br /&gt;
** We break symmetry by initializing the weights to have small random values.&lt;br /&gt;
&lt;br /&gt;
* If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.&lt;br /&gt;
** We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to $1 / \sqrt{\text{fan-in}}$.&lt;br /&gt;
&lt;br /&gt;
* We can also scale the learning rate the same way.&lt;br /&gt;
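&lt;br /&gt;
A sketch of fan-in-scaled random initialization (the function name and the scale parameter are made up for illustration):&lt;br /&gt;
&lt;br /&gt;
```python
import random

def init_weights(fan_in, scale=1.0):
    # Standard deviation shrinks as 1/sqrt(fan_in), so units with a
    # big fan-in start with proportionally smaller incoming weights.
    return [random.gauss(0.0, scale / fan_in ** 0.5) for _ in range(fan_in)]
```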
&lt;br /&gt;
==== Shifting the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, shifting the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.&lt;br /&gt;
&lt;br /&gt;
* The hyperbolic tangent (which is $2\times \text{logistic} - 1$) produces hidden activations that are roughly zero mean. &lt;br /&gt;
** In this respect it&amp;#039;s better than the logistic.&lt;br /&gt;
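&lt;br /&gt;
Written out precisely (with $\sigma$ denoting the logistic function), the shorthand hides a rescaled argument:&lt;br /&gt;
&lt;br /&gt;
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = \frac{2}{1 + e^{-2z}} - 1 = 2 \sigma(2z) - 1$$&lt;br /&gt;
&lt;br /&gt;
so the hyperbolic tangent is the logistic stretched to the range $(-1, 1)$ and steepened by a factor of 2 in $z$.&lt;br /&gt;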
&lt;br /&gt;
==== Scaling the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, scaling the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.&lt;br /&gt;
&lt;br /&gt;
==== A More Thorough Method: Decorrelate the Input Components ====&lt;br /&gt;
&lt;br /&gt;
* For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.&lt;br /&gt;
&lt;br /&gt;
* There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.&lt;br /&gt;
** Drop the principal components with the smallest eigenvalues ― &amp;#039;&amp;#039;&amp;#039;This achieves some dimensionality reduction&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.&lt;br /&gt;
&lt;br /&gt;
* For a circular error surface, the gradient points straight towards the minimum.&lt;br /&gt;
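&lt;br /&gt;
A compact sketch of the PCA recipe above (illustrative; the function and parameter names are made up, and NumPy is assumed):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def pca_whiten(X, keep=None, eps=1e-8):
    # Center, rotate onto the principal components, optionally drop the
    # components with the smallest eigenvalues, then divide each remaining
    # component by the square root of its eigenvalue.
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    evals, evecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(evals)[::-1]             # largest first
    evals, evecs = evals[order], evecs[:, order]
    if keep is not None:
        evals, evecs = evals[:keep], evecs[:, :keep]
    return (Xc @ evecs) / np.sqrt(evals + eps)
```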
&lt;br /&gt;
&lt;br /&gt;
==== Common Problems That Occur in Multilayer Networks ====&lt;br /&gt;
&lt;br /&gt;
* If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.&lt;br /&gt;
** The error derivatives for the hidden units will all become tiny and the error will not decrease.&lt;br /&gt;
** This is usually a plateau, but people often mistake it for a local minimum.&lt;br /&gt;
&lt;br /&gt;
* In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.&lt;br /&gt;
** The network finds this strategy quickly and may take a long time to improve on it by making use of the input.&lt;br /&gt;
** This is another plateau that looks like a local minimum.&lt;br /&gt;
&lt;br /&gt;
==== Be Careful about Turning Down the Learning Rate ====&lt;br /&gt;
&lt;br /&gt;
* Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.&lt;br /&gt;
** So we get a quick win&lt;br /&gt;
** But then we get slower learning.&lt;br /&gt;
&lt;br /&gt;
* Don&amp;#039;t turn down the learning rate too soon!&lt;br /&gt;
&lt;br /&gt;
==== Four Ways to Speed up Mini-Batch Learning ====&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;momentum&amp;quot;&lt;br /&gt;
** Instead of using the gradient to change the &amp;#039;&amp;#039;position&amp;#039;&amp;#039; of the weight &amp;quot;particle&amp;quot;, use it to change the &amp;#039;&amp;#039;velocity&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
* Use separate adaptive learning rates for each parameter.&lt;br /&gt;
** Slowly adjust the rate using the consistency of the gradient for that parameter.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;rmsprop&amp;#039;&amp;#039;&amp;#039;: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. &lt;br /&gt;
** This is the mini-batch version of just using the sign of the gradient.&lt;br /&gt;
&lt;br /&gt;
* Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)&lt;br /&gt;
** Adapt it to work for neural nets.&lt;br /&gt;
** Adapt it to work for mini-batches.&lt;br /&gt;
&lt;br /&gt;
=== The Momentum Method ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.&lt;br /&gt;
# The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.&lt;br /&gt;
# Its momentum makes it keep going in the previous direction.&lt;br /&gt;
&lt;br /&gt;
* It damps oscillations in directions of high curvature by combining gradients with opposite signs.&lt;br /&gt;
&lt;br /&gt;
* It builds up speed in directions with a gentle but consistent gradient.&lt;br /&gt;
&lt;br /&gt;
==== Equations ====&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(t) = \alpha \mathbb{v}(t-1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The effect of the gradient is to increment the previous velocity. The velocity also decays by a factor $\alpha$, which is slightly less than 1.&lt;br /&gt;
&lt;br /&gt;
The weight change is equal to the current velocity.&lt;br /&gt;
&lt;br /&gt;
$$\Delta \mathbb{w}(t) = \mathbb{v} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \mathbb{v} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \Delta \mathbb{w} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The weight change can be expressed in terms of the previous weight change and the current gradient.&lt;br /&gt;
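&lt;br /&gt;
For a single weight the update pair can be sketched as follows (the particular learning rate and $\alpha$ are placeholders):&lt;br /&gt;
&lt;br /&gt;
```python
def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    # v(t) = alpha * v(t-1) - lr * dE/dw(t); the weight then
    # moves by the current velocity.
    v = alpha * v - lr * grad
    return w + v, v
```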
&lt;br /&gt;
==== Behavior ====&lt;br /&gt;
&lt;br /&gt;
* If the error surface is a tilted plane, the ball reaches a terminal velocity $\mathbb{v}(\infty)$.&lt;br /&gt;
** If the momentum is close to 1, this is much faster than simple gradient descent.&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(\infty) = \frac{1}{1 - \alpha} \Big( - \epsilon \frac{\partial E}{\partial \mathbb{w}} \Big)$$&lt;br /&gt;
&lt;br /&gt;
* At the beginning of learning there may be very large gradients.&lt;br /&gt;
** So it pays to use a small momentum (e.g. 0.5).&lt;br /&gt;
** Once the large gradients have disappeared and the weights are stuck in a ravine the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99)&lt;br /&gt;
&lt;br /&gt;
* This allows us to learn at a rate that would cause divergent oscillations without the momentum.&lt;br /&gt;
&lt;br /&gt;
==== Better Type of Momentum (Nesterov 1983) ====&lt;br /&gt;
&lt;br /&gt;
* The standard momentum method &amp;#039;&amp;#039;first&amp;#039;&amp;#039; computes the gradient at the current location and &amp;#039;&amp;#039;then&amp;#039;&amp;#039; takes a big jump in the direction of the updated accumulated gradient.&lt;br /&gt;
* Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.&lt;br /&gt;
** Inspired by the Nesterov method for optimizing convex functions.&lt;br /&gt;
* &amp;#039;&amp;#039;First&amp;#039;&amp;#039; make a big jump in the direction of the previous accumulated gradient.&lt;br /&gt;
* &amp;#039;&amp;#039;Then&amp;#039;&amp;#039; measure the gradient where you end up and make a correction.&lt;br /&gt;
** It&amp;#039;s better to correct a mistake &amp;#039;&amp;#039;after&amp;#039;&amp;#039; you have made it!&lt;br /&gt;
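&lt;br /&gt;
The two orderings differ only in where the gradient is measured. A sketch of the Nesterov-style step (illustrative names):&lt;br /&gt;
&lt;br /&gt;
```python
def nesterov_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    # First jump in the direction of the accumulated velocity,
    # then measure the gradient at the look-ahead point and correct.
    lookahead = w + alpha * v
    v = alpha * v - lr * grad_fn(lookahead)
    return w + v, v
```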
&lt;br /&gt;
=== Adaptive Learning Rates for Each Connection ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
* In a multilayer net, the appropriate learning rates can vary widely between weights:&lt;br /&gt;
** The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.&lt;br /&gt;
** The fan-in of a unit determines the size of the &amp;quot;overshoot&amp;quot; effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.&lt;br /&gt;
&lt;br /&gt;
* So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.&lt;br /&gt;
&lt;br /&gt;
==== One Way to Determine the Individual Learning Rates ====&lt;br /&gt;
&lt;br /&gt;
$$\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$$&lt;br /&gt;
&lt;br /&gt;
* Start with a local gain of 1 for every weight. i.e. $g_{ij}(0) = 1$&lt;br /&gt;
* Increase the local gain if the gradient for that weight does not change sign.&lt;br /&gt;
* Use small additive increases and multiplicative decreases.&lt;br /&gt;
** This ensures that big gains decay rapidly when oscillations start.&lt;br /&gt;
** If the gradient is totally random, the gain will hover around 1 when we increase by &amp;#039;&amp;#039;plus&amp;#039;&amp;#039; $\delta$ half the time and decrease by &amp;#039;&amp;#039;times&amp;#039;&amp;#039; $1 - \delta$ half the time.&lt;br /&gt;
&lt;br /&gt;
$$\text{if } \Big( \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) \Big) &amp;gt; 0$$&lt;br /&gt;
$$\text{then } g_{ij}(t) = g_{ij}(t - 1) + 0.05$$&lt;br /&gt;
$$\text{else } g_{ij}(t) = g_{ij}(t - 1) * 0.95$$&lt;br /&gt;
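&lt;br /&gt;
In code, with the gain clipped to one of the reasonable ranges suggested in the next subsection (a sketch, not from the lecture):&lt;br /&gt;
&lt;br /&gt;
```python
def update_gain(gain, grad, prev_grad, lo=0.1, hi=10.0):
    # Additive increase while the gradient keeps its sign,
    # multiplicative decrease when it flips; big gains therefore
    # decay rapidly when oscillation starts.
    if grad * prev_grad > 0:
        gain = gain + 0.05
    else:
        gain = gain * 0.95
    return max(lo, min(hi, gain))
```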
&lt;br /&gt;
==== Tricks for Making Adaptive Learning Rates Work Better ====&lt;br /&gt;
&lt;br /&gt;
* Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].&lt;br /&gt;
&lt;br /&gt;
* Use full batch learning or very big mini-batches.&lt;br /&gt;
** This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini batch.&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates can be combined with momentum.&lt;br /&gt;
** Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates only deal with axis-aligned effects.&lt;br /&gt;
** Momentum does not care about the alignment of the axes.&lt;br /&gt;
&lt;br /&gt;
=== RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude ===&lt;br /&gt;
&lt;br /&gt;
==== RPROP: Using only the sign of the gradient ====&lt;br /&gt;
&lt;br /&gt;
* The magnitude of the gradient can be very different for different weights and can change during learning.&lt;br /&gt;
** This makes it hard to choose a single global learning rate.&lt;br /&gt;
&lt;br /&gt;
* For &amp;#039;&amp;#039;full batch learning&amp;#039;&amp;#039;, we can deal with this variation by only using the sign of the gradient.&lt;br /&gt;
** The weight updates are all of the same magnitude.&lt;br /&gt;
** This escapes from plateaus with tiny gradients quickly.&lt;br /&gt;
&lt;br /&gt;
* RPROP: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.&lt;br /&gt;
** Increase the step size for a weight &amp;#039;&amp;#039;multiplicatively&amp;#039;&amp;#039; (e.g. times 1.2) if the signs of its last two gradients agree.&lt;br /&gt;
** Otherwise decrease the step size multiplicatively (e.g. times 0.5).&lt;br /&gt;
** Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster&amp;#039;s advice).&lt;br /&gt;
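&lt;br /&gt;
A per-weight sketch of the rule (the factors 1.2 and 0.5 and the step-size limits follow the values above):&lt;br /&gt;
&lt;br /&gt;
```python
def rprop_step(w, step, grad, prev_grad,
               up=1.2, down=0.5, min_step=1e-6, max_step=50.0):
    # Adapt the step size from the agreement of the last two gradient
    # signs, then move by the step in the direction opposite the gradient.
    if grad * prev_grad > 0:
        step = min(step * up, max_step)      # signs agree: grow
    elif grad * prev_grad != 0:
        step = max(step * down, min_step)    # sign flipped: shrink
    if grad > 0:
        w = w - step
    elif grad != 0:
        w = w + step
    return w, step
```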
&lt;br /&gt;
==== Why RPROP Does Not Work with Mini-Batches ====&lt;br /&gt;
&lt;br /&gt;
* The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.&lt;br /&gt;
** Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.&lt;br /&gt;
** We want this weight to stay roughly where it is.&lt;br /&gt;
&lt;br /&gt;
* RPROP would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale), so the weight would grow a lot.&lt;br /&gt;
* Is there a way to combine:&lt;br /&gt;
** The robustness of RPROP.&lt;br /&gt;
** The efficiency of mini-batches.&lt;br /&gt;
** The effective averaging of gradients over mini-batches.&lt;br /&gt;
&lt;br /&gt;
==== RMSPROP: Mini-Batch Version of RPROP ====&lt;br /&gt;
&lt;br /&gt;
* RPROP is equivalent to using the gradient but also dividing by the size of the gradient.&lt;br /&gt;
** The problem with mini-batch RPROP is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?&lt;br /&gt;
&lt;br /&gt;
* RMSPROP: Keep a moving average of the squared gradient for each weight.&lt;br /&gt;
$$MeanSquare(w,t) = 0.9 MeanSquare(w, t-1) + 0.1 (\partial E / \partial w (t))^2$$&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=470</id>
		<title>Neural Networks (Geoffrey Hinton Course)</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Neural_Networks_(Geoffrey_Hinton_Course)&amp;diff=470"/>
		<updated>2017-01-15T03:46:25Z</updated>

		<summary type="html">&lt;p&gt;Tedyun: /* Adaptive Learning Rates for Each Connection */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Some Simple Models of Neurons ==&lt;br /&gt;
&lt;br /&gt;
$y$ output, $x_i$ input.&lt;br /&gt;
&lt;br /&gt;
=== Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$y = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$w_i$ weights, $b$ bias&lt;br /&gt;
&lt;br /&gt;
=== Binary Threshold Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq \theta$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
Or, equivalently,&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = 1$ if $z \geq 0$, $0$ otherwise.&lt;br /&gt;
&lt;br /&gt;
=== Rectified Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = z$ if $z &amp;gt; 0$, $0$ otherwise. (linear above zero, decision at zero.)&lt;br /&gt;
&lt;br /&gt;
=== Sigmoid Neurons ===&lt;br /&gt;
&lt;br /&gt;
Give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
=== Stochastic Binary Neurons ===&lt;br /&gt;
&lt;br /&gt;
Same equations as logistic units, but outputs $1$ (=spike) or $0$ randomly based on the probability. They treat the output of the logistic as the &amp;lt;em&amp;gt;probability&amp;lt;/em&amp;gt; of producing a spike in a short time window.&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$&lt;br /&gt;
&lt;br /&gt;
$P(s = 1) = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
We can do a similar trick for rectified linear units - in this case the output is treated as the Poisson rate for spikes.&lt;br /&gt;
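&lt;br /&gt;
The models above can be sketched directly (a compact illustration, not library code):&lt;br /&gt;
&lt;br /&gt;
```python
import math
import random

def linear(x, w, b):
    return b + sum(xi * wi for xi, wi in zip(x, w))

def binary_threshold(x, w, b):
    return 1 if linear(x, w, b) >= 0 else 0

def rectified_linear(x, w, b):
    return max(0.0, linear(x, w, b))

def sigmoid(x, w, b):
    return 1.0 / (1.0 + math.exp(-linear(x, w, b)))

def stochastic_binary(x, w, b):
    # Spike with probability equal to the logistic output.
    return 1 if random.random() >= 1.0 - sigmoid(x, w, b) else 0
```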
&lt;br /&gt;
== Types of Learning ==&lt;br /&gt;
&lt;br /&gt;
=== Supervised Learning ===&lt;br /&gt;
Learn to predict an output when given an input vector.&lt;br /&gt;
* Regression: The target output is a real number or a whole vector of real numbers.&lt;br /&gt;
* Classification: The target output is a class label.&lt;br /&gt;
&lt;br /&gt;
==== How Supervised Learning Typically Works ====&lt;br /&gt;
# Start by choosing a model-class: $y = f(x;W)$&lt;br /&gt;
#* A model-class $f$ is a way of using some numerical parameters $W$ to map each input vector $x$ into a predicted output $y$.&lt;br /&gt;
# Learning usually means adjusting the parameters to reduce the discrepancy between the target output, $t$, on each training case and the actual output, $y$, produced by the model.&lt;br /&gt;
#* For regression $\frac{1}{2}(y-t)^2$ is often a sensible measure of the discrepancy.&lt;br /&gt;
#* For classification there are other measures that are generally more sensible (they also work better).&lt;br /&gt;
&lt;br /&gt;
=== Reinforcement Learning ===&lt;br /&gt;
Learn to select an action to maximize payoff.&lt;br /&gt;
&lt;br /&gt;
* The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.&lt;br /&gt;
** The goal in selecting each action is to maximize the expected sum of the future rewards.&lt;br /&gt;
** We usually use a discount factor for delayed rewards so that we don&amp;#039;t have to look too far into the future.&lt;br /&gt;
* Reinforcement learning is difficult because:&lt;br /&gt;
** The rewards are typically delayed so it&amp;#039;s hard to know where we went wrong (or right).&lt;br /&gt;
** A scalar reward does not supply much information.&lt;br /&gt;
* Typically you can&amp;#039;t learn millions of parameters using reinforcement learning (you can with supervised/unsupervised learning). Typically you learn dozens or thousands of parameters.&lt;br /&gt;
* Will not be covered in this course.&lt;br /&gt;
&lt;br /&gt;
=== Unsupervised Learning ===&lt;br /&gt;
Discover a good internal representation of the input.&lt;br /&gt;
&lt;br /&gt;
* For about 40 years unsupervised learning was largely ignored by the machine learning community (except for clustering).&lt;br /&gt;
* It is hard to say what the aim of unsupervised learning is:&lt;br /&gt;
** One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.&lt;br /&gt;
** You can compute the distance to a surface by using the disparity between two images. But you don&amp;#039;t want to learn to compute disparities by stubbing your toe thousands of times.&lt;br /&gt;
* Other goals:&lt;br /&gt;
** Providing a compact, low-dimensional representation of the input.&lt;br /&gt;
*** High-dimensional inputs typically live on or near a low-dimensional manifold (or several such manifolds)&lt;br /&gt;
*** Principal Component Analysis is a widely used linear method for finding a low-dimensional representation.&lt;br /&gt;
** Providing an economical high-dimensional representation of the input in terms of learned features.&lt;br /&gt;
*** Binary features&lt;br /&gt;
*** Real-valued features that are nearly all zero&lt;br /&gt;
** Finding sensible clusters in the input&lt;br /&gt;
*** This is an example of a &amp;lt;em&amp;gt;very&amp;lt;/em&amp;gt; sparse code in which only one of the features is non-zero.&lt;br /&gt;
&lt;br /&gt;
== Neural Network Architectures ==&lt;br /&gt;
&lt;br /&gt;
# Feed-forward architecture: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.&lt;br /&gt;
# Recurrent neural network: information can flow around in cycles.&lt;br /&gt;
# Symmetrically connected network: weights are the same in both directions between two units.&lt;br /&gt;
&lt;br /&gt;
===  Feed-forward Neural Networks ===&lt;br /&gt;
&lt;br /&gt;
* The commonest type of neural network&lt;br /&gt;
* The first layer is the input and the last layer is the output&lt;br /&gt;
* Called &amp;quot;deep&amp;quot; neural networks if there is more than one hidden layer.&lt;br /&gt;
* They compute a series of transformations that change the similarities between cases - the activities of the neurons in each layer are a non-linear function of the activities in the layer below.&lt;br /&gt;
&lt;br /&gt;
=== Recurrent Networks ===&lt;br /&gt;
&lt;br /&gt;
* Have directed cycles in their connection graph.&lt;br /&gt;
* Have complicated dynamics — difficult to train.&lt;br /&gt;
* More biologically realistic.&lt;br /&gt;
* Very natural way to model &amp;lt;em&amp;gt;sequential&amp;lt;/em&amp;gt; data&lt;br /&gt;
* Have the ability to remember information in their hidden state for a long time (but hard to train to use this ability).&lt;br /&gt;
&lt;br /&gt;
=== Symmetrically Connected Networks ===&lt;br /&gt;
&lt;br /&gt;
* Like recurrent networks, but the connections between units are symmetrical (same weight in both directions).&lt;br /&gt;
* Much easier to analyze than recurrent networks (John Hopfield, et al)&lt;br /&gt;
* More restricted in what they can do, because they obey an energy function. For example, they cannot model cycles.&lt;br /&gt;
&lt;br /&gt;
== The First Generation of Neural Networks ==&lt;br /&gt;
&lt;br /&gt;
=== Standard Paradigm for Statistical Pattern Recognition ===&lt;br /&gt;
&lt;br /&gt;
* Convert the raw input vector into a vector of feature activations (using hand-written programs based on common-sense to define the features).&lt;br /&gt;
* Learn how to weight each of the feature activations to get a single scalar quantity.&lt;br /&gt;
* If this quantity is above some threshold, decide that the input vector is a positive example of the target class.&lt;br /&gt;
&lt;br /&gt;
=== Perceptrons ===&lt;br /&gt;
&lt;br /&gt;
input units =&amp;gt; (by hand-coded programs) =&amp;gt; feature units =&amp;gt; decision unit&lt;br /&gt;
&lt;br /&gt;
* Popularized by Frank Rosenblatt in the early 1960s.&lt;br /&gt;
* Lots of grand claims were made for what they could learn to do.&lt;br /&gt;
* In 1969, Minsky and Papert showed Perceptrons&amp;#039; limitations (&amp;#039;&amp;#039;&amp;#039;Group Invariance Theorem&amp;#039;&amp;#039;&amp;#039;). Many people thought these limitations applied to all neural network models.&lt;br /&gt;
* Perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.&lt;br /&gt;
* Decision units in perceptrons are &amp;#039;&amp;#039;&amp;#039;binary threshold neurons&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Perceptron Learning Procedure ====&lt;br /&gt;
&lt;br /&gt;
# Add an extra component with value 1 to each input vector (&amp;quot;bias&amp;quot; weight) and forget about the threshold. (bias) = -(threshold)&lt;br /&gt;
# Pick training cases using any policy that ensures that every training case will keep getting picked.&lt;br /&gt;
#* If the output unit is correct, leave its weights alone.&lt;br /&gt;
#* If the output unit incorrectly outputs a zero, add the input vector to the weight vector.&lt;br /&gt;
#* If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.&lt;br /&gt;
# This is guaranteed to find a set of weights that gets the right answer for all the training cases &amp;#039;&amp;#039;if any such set exists&amp;#039;&amp;#039;.&lt;br /&gt;
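&lt;br /&gt;
The procedure is short enough to state as code (a sketch; the trailing 1 on each input vector is the bias component from step 1, and the fixed sweep order stands in for a fair picking policy):&lt;br /&gt;
&lt;br /&gt;
```python
def perceptron_train(cases, epochs=50):
    # cases: list of (input vector with a trailing 1 for the bias, target 0 or 1)
    n = len(cases[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in cases:
            y = 1 if sum(xi * wi for xi, wi in zip(x, w)) >= 0 else 0
            if y == 0 and t == 1:
                w = [wi + xi for wi, xi in zip(w, x)]   # add the input vector
            elif y == 1 and t == 0:
                w = [wi - xi for wi, xi in zip(w, x)]   # subtract it
    return w
```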
&lt;br /&gt;
==== Geometrical View of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* Weight space (each point corresponds to a particular setting for all the weights)&lt;br /&gt;
* Each training case defines a hyperplane through the origin (if we eliminate the threshold/bias). That hyperplane is perpendicular to the &amp;#039;&amp;#039;input&amp;#039;&amp;#039; vector.&lt;br /&gt;
** The weights must lie on one side of this hyperplane to get the answer right.&lt;br /&gt;
* To get all the training cases right we need to find a point on the right side of all the planes (there may not be any such point).&lt;br /&gt;
* If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. This means the problem is &amp;#039;&amp;#039;&amp;#039;convex&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Limitations of Perceptrons ====&lt;br /&gt;
&lt;br /&gt;
* If you are allowed to choose the features by hand and if you use enough features, you can do almost anything.&lt;br /&gt;
* But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.&lt;br /&gt;
** A binary threshold output unit cannot even tell if two single bit features are the same!&lt;br /&gt;
*** Positive cases (same): $(1,1) \rightarrow 1$, $(0,0) \rightarrow 1$&lt;br /&gt;
*** Negative cases (different): $(1,0) \rightarrow 0$, $(0,1) \rightarrow 0$&lt;br /&gt;
*** The four input-output pairs give four inequalities that are impossible to satisfy.&lt;br /&gt;
** Imagine &amp;#039;data-space&amp;#039; in which the axes correspond to components of an input vector.&lt;br /&gt;
*** Each input vector is a point in this space, a weight vector defines a plane in data-space.&lt;br /&gt;
*** The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.&lt;br /&gt;
*** There are many cases where we cannot separate positive and negative cases only by a hyperplane. Those cases are called &amp;#039;&amp;#039;&amp;#039;not linearly separable&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Cannot discriminate simple patterns under translation with wrap-around.&lt;br /&gt;
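&lt;br /&gt;
Writing out the four constraints for the single-bit equality example above (weights $w_1, w_2$ and threshold $\theta$) makes the contradiction explicit:&lt;br /&gt;
&lt;br /&gt;
$$w_1 + w_2 \ge \theta, \qquad 0 \ge \theta \qquad \text{(positive cases)}$$&lt;br /&gt;
$$w_1 &lt; \theta, \qquad w_2 &lt; \theta \qquad \text{(negative cases)}$$&lt;br /&gt;
&lt;br /&gt;
Adding the first pair gives $w_1 + w_2 \ge 2\theta$, while adding the second pair gives $w_1 + w_2 &lt; 2\theta$, so no weights and threshold can satisfy all four.&lt;br /&gt;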
&lt;br /&gt;
&lt;br /&gt;
==== Conclusion ====&lt;br /&gt;
&lt;br /&gt;
* Networks without hidden units are very limited in the input-output mappings they can learn to model. More layers of linear units do not help, it&amp;#039;s still linear. Fixed output non-linearities are not enough.&lt;br /&gt;
* We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?&lt;br /&gt;
** We need an efficient way of adapting all the weights, not just the last layer. This is hard.&lt;br /&gt;
** Learning the weights going into hidden units is equivalent to learning features.&lt;br /&gt;
** This is difficult because nobody is telling us directly what the hidden units should do.&lt;br /&gt;
&lt;br /&gt;
== Linear Neurons ==&lt;br /&gt;
&lt;br /&gt;
=== Learning Overview ===&lt;br /&gt;
&lt;br /&gt;
$y = \sum_{i} w_i x_i = \mathbf{w}^T \mathbf{x}$&lt;br /&gt;
&lt;br /&gt;
$y$: Neuron&amp;#039;s estimate of the desired output, $\mathbf{w}$: weight vector, $\mathbf{x}$: input vector&lt;br /&gt;
&lt;br /&gt;
* Perceptron convergence procedure doesn&amp;#039;t work for more complex networks (because the problem is not &amp;#039;&amp;#039;convex&amp;#039;&amp;#039;). Multi-layer neural networks do not use the perceptron learning procedure.&lt;br /&gt;
* Instead of showing that the weights get closer to a good set of weights, we show that the actual output values get closer to the target values.&lt;br /&gt;
* The aim of learning is to minimize the error summed over all training cases. The error is the squared difference between the desired output and the actual output.&lt;br /&gt;
&lt;br /&gt;
* Why don&amp;#039;t we solve it analytically?&lt;br /&gt;
** Scientific answer: We want a method that real neurons could use.&lt;br /&gt;
** Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.&lt;br /&gt;
&lt;br /&gt;
=== Learning Procedure ===&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;delta-rule&amp;#039;&amp;#039; for learning:&lt;br /&gt;
** $\Delta w_i = \epsilon x_i (t - y)$, $\epsilon$: learning rate, $t$: target value&lt;br /&gt;
** $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$, $E$: error&lt;br /&gt;
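&lt;br /&gt;
A minimal sketch of the delta-rule in code (the toy data set and learning rate are made up; the targets are generated from the true weights $(2, -3)$, which online learning recovers):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the delta-rule for a linear neuron: Delta w_i = eps * x_i * (t - y).
def delta_rule(cases, eps=0.05, epochs=200):
    n = len(cases[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in cases:                            # online: one case at a time
            y = sum(wi * xi for wi, xi in zip(w, x))  # linear neuron output
            w = [wi + eps * xi * (t - y) for wi, xi in zip(w, x)]
    return w

# Targets are generated from the true weights (2, -3).
cases = [((1.0, 0.0), 2.0), ((0.0, 1.0), -3.0), ((1.0, 1.0), -1.0), ((2.0, 1.0), 1.0)]
w = delta_rule(cases)
```
&lt;br /&gt;
Because the toy data are consistent with a single weight vector, the updates shrink the error on every case and the learned weights approach $(2, -3)$.&lt;br /&gt;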
&lt;br /&gt;
=== The Error Surface for Linear Neurons ===&lt;br /&gt;
&lt;br /&gt;
* The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl.&lt;br /&gt;
** Vertical cross-sections are parabolas.&lt;br /&gt;
** Horizontal cross-sections are ellipses.&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated.&lt;br /&gt;
* The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.&lt;br /&gt;
* The simplest kind of online learning zig-zags around the direction of steepest descent. This travels perpendicular to the constraint line of that particular training case.&lt;br /&gt;
&lt;br /&gt;
== Logistic Neurons ==&lt;br /&gt;
&lt;br /&gt;
$z = b + \sum_{i} x_i w_i$, $y = \frac{1}{1 + e^{-z}}$&lt;br /&gt;
&lt;br /&gt;
* These give a real-valued output that is a smooth and bounded function of their total input.&lt;br /&gt;
* The derivatives:&lt;br /&gt;
** $\frac{\partial z}{\partial w_i} = x_i$&lt;br /&gt;
** $\frac{\partial z}{\partial x_i} = w_i$&lt;br /&gt;
** $\frac{dy}{dz} = y(1-y)$&lt;br /&gt;
** By the chain rule, $\frac{\partial y}{\partial w_i} = x_i y (1-y)$.&lt;br /&gt;
** By the chain rule, $\frac{\partial E}{\partial w_i} = \sum_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum_{n} x_i^n y^n (1-y^n) (t^n - y^n)$.&lt;br /&gt;
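&lt;br /&gt;
The chain of derivatives above is easy to check numerically. This sketch compares the analytic gradient with central finite differences for one training case (the weights, inputs, and target are arbitrary):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Analytic gradient of E = 0.5 * (t - y)**2 for one training case, using
# the chain of derivatives above: dE/dw_i = -x_i * y * (1 - y) * (t - y)
def grad_analytic(w, x, t):
    y = logistic(sum(wi * xi for wi, xi in zip(w, x)))
    return [-xi * y * (1.0 - y) * (t - y) for xi in x]

# Central finite differences as a check on the analytic formula.
def grad_numeric(w, x, t, h=1e-6):
    def error(wv):
        y = logistic(sum(wi * xi for wi, xi in zip(wv, x)))
        return 0.5 * (t - y) ** 2
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += h
        wm[i] -= h
        g.append((error(wp) - error(wm)) / (2.0 * h))
    return g

w, x, t = [0.3, -0.8], [1.2, 0.5], 1.0
ga = grad_analytic(w, x, t)
gn = grad_numeric(w, x, t)
```
&lt;br /&gt;
The two gradients agree to within the accuracy of the finite-difference approximation.&lt;br /&gt;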
&lt;br /&gt;
== The Backpropagation Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== Other Ideas ===&lt;br /&gt;
&lt;br /&gt;
==== Learning by Perturbing Weights ====&lt;br /&gt;
&lt;br /&gt;
* The idea occurs to everyone who knows about evolution.&lt;br /&gt;
* Randomly perturb one weight and see if it improves performance. If so, save the change.&lt;br /&gt;
** This is a form of reinforcement learning.&lt;br /&gt;
** &amp;#039;&amp;#039;&amp;#039;Very inefficient&amp;#039;&amp;#039;&amp;#039;. We need to do multiple forward passes on a representative set of training cases just to change one weight.&lt;br /&gt;
** Towards the end of learning, large weight perturbations will nearly always make things worse.&lt;br /&gt;
&lt;br /&gt;
==== Learning by Using Perturbations ====&lt;br /&gt;
&lt;br /&gt;
* We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.&lt;br /&gt;
* Not any better, because we need lots of trials on each training case to see the effect of changing one weight through the noise created by all the changes to other weights.&lt;br /&gt;
* A better idea: Randomly perturb the &amp;#039;&amp;#039;activities&amp;#039;&amp;#039; of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights.&lt;br /&gt;
&lt;br /&gt;
=== The Idea behind Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* We don&amp;#039;t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.&lt;br /&gt;
* We can compute error derivatives for all the hidden units efficiently at the same time.&lt;br /&gt;
** Once we have the error derivatives for the hidden activities, it&amp;#039;s easy to get the error derivatives for the weights going into a hidden unit.&lt;br /&gt;
&lt;br /&gt;
=== Sketch of the Backpropagation Algorithm ===&lt;br /&gt;
&lt;br /&gt;
* First convert the discrepancy between each output and its target value into an error derivative.&lt;br /&gt;
* Then compute error derivatives in each hidden layer from error derivatives in the layer above.&lt;br /&gt;
* Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.&lt;br /&gt;
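&lt;br /&gt;
The three steps can be sketched for a tiny net with one layer of logistic hidden units and a linear output unit (a toy illustration under squared error, not any particular implementation):&lt;br /&gt;
&lt;br /&gt;
```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One gradient-descent step on a single training case, following the
# three steps sketched above: output derivative, then hidden-layer
# derivatives, then weight derivatives.
def backprop_step(x, t, W1, w2, eps=0.1):
    # forward pass: logistic hidden units, linear output
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(len(x)))) for j in range(len(W1))]
    y = sum(w2[j] * h[j] for j in range(len(h)))
    dE_dy = y - t                                   # step 1: dE/dy at the output
    dE_dh = [dE_dy * w2[j] for j in range(len(h))]  # step 2: dE/dh for hidden units
    # step 3: dE/dw for the incoming weights, then a gradient step
    w2 = [w2[j] - eps * dE_dy * h[j] for j in range(len(h))]
    W1 = [[W1[j][i] - eps * dE_dh[j] * h[j] * (1.0 - h[j]) * x[i]
           for i in range(len(x))] for j in range(len(W1))]
    return W1, w2, 0.5 * (t - y) ** 2

random.seed(0)
W1 = [[random.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(3)]
w2 = [random.uniform(-1.0, 1.0) for _ in range(3)]
errs = []
for _ in range(50):
    W1, w2, e = backprop_step([1.0, 0.5], 2.0, W1, w2)
    errs.append(e)
```
&lt;br /&gt;
Repeated steps on the single toy case drive the squared error toward zero.&lt;br /&gt;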
&lt;br /&gt;
=== Using the Derivatives Computed by Backpropagation ===&lt;br /&gt;
&lt;br /&gt;
* The backpropagation algorithm is an efficient way of computing the error derivative $\partial E / \partial w$ for every weight on a single training case.&lt;br /&gt;
* To get a fully specified learning procedure, we still need to make a lot of other decisions about how to use these error derivatives:&lt;br /&gt;
** Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?&lt;br /&gt;
** Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?&lt;br /&gt;
&lt;br /&gt;
==== Optimization Issues ====&lt;br /&gt;
&lt;br /&gt;
* How often to update the weights:&lt;br /&gt;
** Online: after each training case.&lt;br /&gt;
** Full batch: after a full sweep through the training data.&lt;br /&gt;
** Mini-batch: after a small sample of training cases. This is typically used when training big nets on big data sets.&lt;br /&gt;
* How much to update: (will be discussed further later)&lt;br /&gt;
** Use a fixed learning rate?&lt;br /&gt;
** Adapt the global learning rate?&lt;br /&gt;
** Adapt the learning rate on each connection separately?&lt;br /&gt;
** Don&amp;#039;t use steepest descent?&lt;br /&gt;
&lt;br /&gt;
==== Generalization Issue (Overfitting) ====&lt;br /&gt;
&lt;br /&gt;
* The downside of using powerful models.&lt;br /&gt;
* The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise:&lt;br /&gt;
** The target values may be unreliable (usually only a minor worry).&lt;br /&gt;
** There is &amp;#039;&amp;#039;&amp;#039;sampling error&amp;#039;&amp;#039;&amp;#039;. There will be accidental regularities just because of the particular training cases that were chosen.&lt;br /&gt;
* When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible it can model the sampling error really well. &amp;#039;&amp;#039;This is a disaster&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
==== Ways to Reduce Overfitting ====&lt;br /&gt;
&lt;br /&gt;
* A large number of different methods have been developed:&lt;br /&gt;
** Weight-decay: keep weights small / zero.&lt;br /&gt;
** Weight-sharing: insist that many weights have the same value.&lt;br /&gt;
** Early stopping: Make a fake test set. While training the net, stop as soon as the performance on the fake test set starts getting worse.&lt;br /&gt;
** Model averaging: Train many different neural nets and average them.&lt;br /&gt;
** Bayesian fitting of neural nets: really just a fancy form of model averaging.&lt;br /&gt;
** Dropout: Randomly omit a hidden unit.&lt;br /&gt;
** Generative pre-training: (Will be discussed towards the end of the course.)&lt;br /&gt;
&lt;br /&gt;
== Learning to Predict the Next Word ==&lt;br /&gt;
&lt;br /&gt;
=== Diversion into Cognitive Science ===&lt;br /&gt;
&lt;br /&gt;
There has been a long debate in cognitive science between two rival theories of what it means to have a &amp;#039;&amp;#039;concept&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
* The feature theory: A concept is a set of semantic features&lt;br /&gt;
** Good for explaining similarities between concepts.&lt;br /&gt;
** It&amp;#039;s convenient -- a concept is a vector of feature activities.&lt;br /&gt;
* The structuralist theory: The meaning of a concept lies in its relationships to other concepts.&lt;br /&gt;
** So conceptual knowledge is best expressed as a relational graph.&lt;br /&gt;
** Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations.&lt;br /&gt;
&lt;br /&gt;
Both sides are wrong.&lt;br /&gt;
&lt;br /&gt;
* These two theories need not be rivals. A neural net can use vectors of semantic features to implement a relational graph.&lt;br /&gt;
* We may use explicit rules for conscious, deliberate reasoning, but we do a lot of commonsense, analogical reasoning by just &amp;quot;seeing&amp;quot; the answer with no conscious intervening steps.&lt;br /&gt;
&lt;br /&gt;
==== Localist and Distributed Representations of Concepts ====&lt;br /&gt;
&lt;br /&gt;
The obvious way to implement a relational graph in a neural net is to treat a neuron as a node in the graph and a connection as a binary relationship. But this &amp;quot;localist&amp;quot; method will not work:&lt;br /&gt;
&lt;br /&gt;
* We need many different types of relationship and the connections in a neural net do not have discrete labels.&lt;br /&gt;
* We need ternary relationships as well as binary ones, e.g. A is between B and C.&lt;br /&gt;
&lt;br /&gt;
The right way to implement relational knowledge in a neural net is still an open issue.&lt;br /&gt;
&lt;br /&gt;
* But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a &amp;quot;distributed representation&amp;quot;. It&amp;#039;s a many-to-many mapping between concepts and neurons.&lt;br /&gt;
&lt;br /&gt;
=== The Softmax Output Function (Another Diversion) ===&lt;br /&gt;
&lt;br /&gt;
The squared error measure has some drawbacks. Is there a different cost function that works better?&lt;br /&gt;
&lt;br /&gt;
Yes, force the outputs to represent a probability distribution across discrete alternatives.&lt;br /&gt;
&lt;br /&gt;
==== Softmax ====&lt;br /&gt;
&lt;br /&gt;
The output units in a softmax group use a non-local non-linearity. Let $y_i$ be the output of a neuron in a softmax group, and $z_i$ (called &amp;#039;&amp;#039;logit&amp;#039;&amp;#039;) the input of the same neuron:&lt;br /&gt;
&lt;br /&gt;
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j} }$$&lt;br /&gt;
&lt;br /&gt;
Note that $\sum y_i = 1$, so we force the $y_i$ to represent a probability distribution. We can show:&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$&lt;br /&gt;
&lt;br /&gt;
The right cost function to use with softmax is the negative log probability of the right answer. This is called the &amp;#039;&amp;#039;Cross Entropy&amp;#039;&amp;#039; cost function. ($t_j$ is the target value.)&lt;br /&gt;
&lt;br /&gt;
$$C = - \sum_{j} t_j \log y_j$$&lt;br /&gt;
&lt;br /&gt;
$C$ has a very big gradient when the target value is 1 and the output is almost zero.&lt;br /&gt;
* A value of 0.000001 is much better than 0.000000001 (one-in-a-million chance vs one-in-a-billion chance).&lt;br /&gt;
* The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$.&lt;br /&gt;
&lt;br /&gt;
$$\frac{\partial C}{\partial z_i} = \sum_{j} \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$&lt;br /&gt;
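&lt;br /&gt;
The derivative $\partial C / \partial z_i = y_i - t_i$ is easy to verify numerically (a sketch; the logits and the one-hot target are arbitrary):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def softmax(zs):
    m = max(zs)                            # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(zs, ts):
    return -sum(t * math.log(y) for t, y in zip(ts, softmax(zs)))

z = [2.0, -1.0, 0.5]                       # arbitrary logits
t = [0.0, 1.0, 0.0]                        # one-hot target
y = softmax(z)
analytic = [yi - ti for yi, ti in zip(y, t)]   # dC/dz_i = y_i - t_i
h = 1e-6
numeric = []
for i in range(len(z)):
    zp, zm = list(z), list(z)
    zp[i] += h
    zm[i] -= h
    numeric.append((cross_entropy(zp, t) - cross_entropy(zm, t)) / (2.0 * h))
```
&lt;br /&gt;
The outputs sum to one, and the finite-difference gradient matches $y_i - t_i$.&lt;br /&gt;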
&lt;br /&gt;
&lt;br /&gt;
=== Neuro-Probabilistic Language Models ===&lt;br /&gt;
&lt;br /&gt;
* We cannot identify phonemes perfectly in noisy speech.&lt;br /&gt;
* People use their understanding of the meaning of the utterance to hear the right words.&lt;br /&gt;
** &amp;quot;We do this unconsciously when we wreck a nice beach&amp;quot;&lt;br /&gt;
** We are very good at it.&lt;br /&gt;
* This means speech recognizers have to know which words are likely to come next and which are not.&lt;br /&gt;
** Fortunately words can be predicted quite well without full understanding.&lt;br /&gt;
&lt;br /&gt;
==== The Standard &amp;quot;Trigram&amp;quot; Method ====&lt;br /&gt;
&lt;br /&gt;
* Take a huge amount of text and count the frequencies of all triples of words.&lt;br /&gt;
&lt;br /&gt;
* Use these frequencies to make bets on the relative probabilities of words given the previous two words.&lt;br /&gt;
&lt;br /&gt;
* Until very recently this was the state-of-the-art.&lt;br /&gt;
** We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.&lt;br /&gt;
** We have to &amp;quot;back off&amp;quot; to bigrams when the count for a trigram is too small. The probability is not zero just because the count is zero!&lt;br /&gt;
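&lt;br /&gt;
A toy sketch of trigram counting with back-off (the corpus and the back-off rule are deliberately simplistic; real systems use much larger corpora and smoothed estimates):&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter

# Count all trigrams, bigrams, and unigrams in a toy corpus.
corpus = "the cat sat on the mat the cat ate the rat".split()
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
bi = Counter(zip(corpus, corpus[1:]))
uni = Counter(corpus)

def p_next(w1, w2, w3, min_count=1):
    # P(w3 | w1 w2); back off to the bigram estimate when the trigram
    # count is too small (the probability is not zero just because the
    # count is zero). The min_count threshold is illustrative.
    if tri[(w1, w2, w3)] >= min_count:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    return bi[(w2, w3)] / uni[w2]
```
&lt;br /&gt;
An unseen trigram still gets a non-zero probability from the bigram counts.&lt;br /&gt;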
&lt;br /&gt;
* A trigram model doesn&amp;#039;t understand the similarities between words.&lt;br /&gt;
** The sentence &amp;quot;The cat got squashed in the garden on friday&amp;quot; should help us predict words in the sentence &amp;quot;The dog got flattened in the yard on monday&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
* To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.&lt;br /&gt;
** Using a feature representation also allows a context that contains many more previous words, e.g. 10.&lt;br /&gt;
&lt;br /&gt;
* Yoshua Bengio&amp;#039;s neural net for predicting the next word&lt;br /&gt;
&lt;br /&gt;
== Object Recognition ==&lt;br /&gt;
&lt;br /&gt;
=== Why Object Recognition is Difficult ===&lt;br /&gt;
&lt;br /&gt;
* Segmentation: Real scenes are cluttered with other objects.&lt;br /&gt;
* Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.&lt;br /&gt;
* Deformation: Objects can deform in a variety of non-affine ways.&lt;br /&gt;
* Affordances: Object classes are things defined by how they are used.&lt;br /&gt;
* Viewpoint: Changes in viewpoint cause changes in images that standard learning methods cannot cope with.&lt;br /&gt;
&lt;br /&gt;
=== Achieving Viewpoint Invariance ===&lt;br /&gt;
&lt;br /&gt;
* Viewpoint invariance is one of the main difficulties in making computers perceive.&lt;br /&gt;
* Several different approaches:&lt;br /&gt;
** Use redundant invariant features.&lt;br /&gt;
** Put a box around the object and use normalized pixels.&lt;br /&gt;
** Use replicated features with pooling. This is called &amp;#039;&amp;#039;convolutional neural nets&amp;#039;&amp;#039; (in later lectures).&lt;br /&gt;
** Use a hierarchy of parts that have explicit poses relative to the camera (in later lectures).&lt;br /&gt;
&lt;br /&gt;
==== Invariant Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* Extract a large, redundant set of features that are invariant under transformations.&lt;br /&gt;
** e.g. pair of roughly parallel lines with a red dot between them (this is what baby herring gulls use to know where to peck for food).&lt;br /&gt;
* With enough invariant features, there is only one way to assemble them into an object.&lt;br /&gt;
* But for recognition, we must avoid forming features from parts of different objects.&lt;br /&gt;
&lt;br /&gt;
==== Judicious Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* Put a box around the object and use it as a coordinate frame for a set of normalized pixels.&lt;br /&gt;
** This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.&lt;br /&gt;
** The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, ...&lt;br /&gt;
&lt;br /&gt;
* But choosing the box is difficult because of:&lt;br /&gt;
** Segmentation errors, occlusion, unusual orientations.&lt;br /&gt;
&lt;br /&gt;
* We need to recognize the shape to get the box right!&lt;br /&gt;
&lt;br /&gt;
==== Brute Force Normalization Approach ====&lt;br /&gt;
&lt;br /&gt;
* When training the recognizer, use well-segmented, upright images to fit the correct box.&lt;br /&gt;
&lt;br /&gt;
* At test time try all possible boxes in a range of positions and scales.&lt;br /&gt;
&lt;br /&gt;
** This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.&lt;br /&gt;
** It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.&lt;br /&gt;
&lt;br /&gt;
=== Convolutional Nets for Digit Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== The Replicated Feature Approach ====&lt;br /&gt;
&lt;br /&gt;
* This is currently the dominant approach for neural networks.&lt;br /&gt;
&lt;br /&gt;
* Use many different copies of the same feature detector with different positions.&lt;br /&gt;
&lt;br /&gt;
** Could also replicate across scale and orientation but it is tricky and expensive.&lt;br /&gt;
** Replication greatly reduces the number of free parameters to be learned.&lt;br /&gt;
&lt;br /&gt;
* Use several different feature types, each with its own map of replicated detectors.&lt;br /&gt;
&lt;br /&gt;
==== Backpropagation with Weight Constraints ====&lt;br /&gt;
&lt;br /&gt;
* It&amp;#039;s easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.&lt;br /&gt;
&lt;br /&gt;
* We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints - so if the weights started off satisfying the constraints, they will continue to satisfy them.&lt;br /&gt;
&lt;br /&gt;
To constrain $w_1 = w_2$, we need $\Delta w_1 = \Delta w_2$.&lt;br /&gt;
&lt;br /&gt;
Compute $\partial E / \partial w_1$ and $\partial E / \partial w_2$, then use $\partial E / \partial w_1 + \partial E / \partial w_2$ for both $w_1$ and $w_2$.&lt;br /&gt;
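&lt;br /&gt;
The tied-weight update can be sketched on a toy error function (the function $E(w_1, w_2) = (w_1 + 2 w_2 - 3)^2$ is made up for the example):&lt;br /&gt;
&lt;br /&gt;
```python
# Gradient descent with the tied-weight constraint w1 = w2: compute both
# gradients, then apply their sum to each weight. The toy error function
# E(w1, w2) = (w1 + 2*w2 - 3)**2 is made up for the example.
def grads(w1, w2):
    d = 2.0 * (w1 + 2.0 * w2 - 3.0)
    return d * 1.0, d * 2.0          # dE/dw1, dE/dw2

w1 = w2 = 0.0                        # start off satisfying the constraint
eps = 0.02
for _ in range(100):
    g1, g2 = grads(w1, w2)
    g = g1 + g2                      # the same update for both weights
    w1 -= eps * g
    w2 -= eps * g
```
&lt;br /&gt;
Because both weights always receive the same update, $w_1 = w_2$ is preserved throughout learning while the error is minimized.&lt;br /&gt;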
&lt;br /&gt;
==== What Does Replicating the Feature Detectors Achieve? ====&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Equivariant&amp;#039;&amp;#039;&amp;#039; activities: Replicated features do &amp;#039;&amp;#039;not&amp;#039;&amp;#039; make the neural activities invariant to translation. The activities are equivariant.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Invariant&amp;#039;&amp;#039;&amp;#039; knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.&lt;br /&gt;
&lt;br /&gt;
==== Pooling the Outputs of Replicated Feature Detectors ====&lt;br /&gt;
&lt;br /&gt;
* Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.&lt;br /&gt;
** This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.&lt;br /&gt;
** Taking the maximum of the four works slightly better.&lt;br /&gt;
&lt;br /&gt;
* Problem: After several levels of pooling, we have lost information about the precise positions of things.&lt;br /&gt;
** This makes it impossible to use the precise spatial relationships between high-level parts for recognition.&lt;br /&gt;
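&lt;br /&gt;
A sketch of pooling four neighbouring detector outputs in 2x2 blocks (function and variable names are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
# Pool each 2x2 block of a feature map into one output: average the four
# neighbouring detectors, or take their max, which tends to work slightly
# better in practice.
def pool2x2(fmap, mode="max"):
    out = []
    for r in range(0, len(fmap), 2):
        row = []
        for c in range(0, len(fmap[0]), 2):
            block = [fmap[r][c], fmap[r][c + 1],
                     fmap[r + 1][c], fmap[r + 1][c + 1]]
            row.append(max(block) if mode == "max" else sum(block) / 4.0)
        out.append(row)
    return out

fmap = [[1, 2, 0, 0],
        [3, 4, 0, 1],
        [0, 0, 5, 6],
        [0, 2, 7, 8]]
```
&lt;br /&gt;
Max pooling keeps only the strongest response in each block, which is how the precise position of a feature inside the block gets lost.&lt;br /&gt;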
&lt;br /&gt;
==== LeNet ====&lt;br /&gt;
&lt;br /&gt;
* Yann LeCun and his collaborators developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:&lt;br /&gt;
** Many hidden layers.&lt;br /&gt;
** Many maps of replicated units in each layer.&lt;br /&gt;
** Pooling of the outputs of nearby replicated units.&lt;br /&gt;
** A wide net that can cope with several characters at once even if they overlap.&lt;br /&gt;
** A clever way of training a complete system, not just a recognizer.&lt;br /&gt;
&lt;br /&gt;
* This net was used for reading ~10% of the checks in North America.&lt;br /&gt;
&lt;br /&gt;
* Look at the impressive demos of LeNet at http://yann.lecun.com&lt;br /&gt;
&lt;br /&gt;
==== Priors and Prejudice ====&lt;br /&gt;
&lt;br /&gt;
* We can put our prior knowledge about the task into the network by designing appropriate:&lt;br /&gt;
** Connectivity&lt;br /&gt;
** Weight constraints&lt;br /&gt;
** Neuron activation functions&lt;br /&gt;
&lt;br /&gt;
* This is less intrusive than hand-designing the features&lt;br /&gt;
** But it still prejudices the network towards the particular way of solving the problem that we had in mind.&lt;br /&gt;
&lt;br /&gt;
* Alternatively, we can use our prior knowledge to create a whole lot more training data.&lt;br /&gt;
** This may require a lot of work (Hofman &amp;amp; Tresp, 1993). Real data + synthetic data.&lt;br /&gt;
** It may make learning take much longer.&lt;br /&gt;
** It allows optimization to discover clever ways of using the multi-layer network that we did not think of - and we may never fully understand how it does it.&lt;br /&gt;
&lt;br /&gt;
==== The Brute Force Approach ====&lt;br /&gt;
&lt;br /&gt;
* LeNet uses knowledge about the invariances to design:&lt;br /&gt;
** the local connectivity&lt;br /&gt;
** the weight-sharing&lt;br /&gt;
** the pooling&lt;br /&gt;
&lt;br /&gt;
* This achieves about 80 errors.&lt;br /&gt;
** This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008).&lt;br /&gt;
&lt;br /&gt;
* Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:&lt;br /&gt;
** For each training image, they produce many new training examples by applying many different transformations.&lt;br /&gt;
** They can then train a large, deep, dumb net on a GPU without much overfitting.&lt;br /&gt;
** They achieved about 35 errors.&lt;br /&gt;
&lt;br /&gt;
* With model averaging they can now get about 25 errors (which is about the same as human error rate).&lt;br /&gt;
&lt;br /&gt;
==== How to Detect a Significant Drop in the Error Rate ====&lt;br /&gt;
&lt;br /&gt;
* Is 30 errors in 10,000 test cases significantly better than 40 errors?&lt;br /&gt;
** It all depends on the particular errors.&lt;br /&gt;
** The McNemar test uses the particular errors and can be much more powerful than a test that just uses the number of errors.&lt;br /&gt;
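&lt;br /&gt;
The McNemar statistic uses only the discordant cases, i.e. those on which exactly one of the two models is wrong (a sketch; the split of the 40 vs 30 errors below is hypothetical):&lt;br /&gt;
&lt;br /&gt;
```python
# The McNemar test uses only the discordant cases:
#   b = cases model A got right and model B got wrong
#   c = cases model A got wrong and model B got right
def mcnemar_chi2(b, c):
    # chi-squared statistic with continuity correction, 1 degree of freedom
    return (abs(b - c) - 1) ** 2 / float(b + c)

# Hypothetical split of 40 (model A) vs 30 (model B) errors: 29 errors
# shared, 11 unique to A, 1 unique to B.
chi2 = mcnemar_chi2(1, 11)
```
&lt;br /&gt;
Here the statistic is 6.75, above the 5% critical value of 3.84 for one degree of freedom, so this particular split would be significant; a more even split of the disagreements might not be, even with the same error totals.&lt;br /&gt;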
&lt;br /&gt;
=== Convolutional Nets for Object Recognition ===&lt;br /&gt;
&lt;br /&gt;
==== From Hand-Written Digits to 3D Objects ====&lt;br /&gt;
&lt;br /&gt;
* Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:&lt;br /&gt;
&lt;br /&gt;
** Hundred times as many classes (1000 vs 10)&lt;br /&gt;
** Hundred times as many pixels (256x256 color vs 28x28 gray)&lt;br /&gt;
** Two dimensional image of three-dimensional scene.&lt;br /&gt;
** Cluttered scenes requiring segmentation.&lt;br /&gt;
** Multiple objects in each image.&lt;br /&gt;
&lt;br /&gt;
* Will the same type of convolutional neural network work?&lt;br /&gt;
&lt;br /&gt;
==== The ILSVRC-2012 Competition on ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* The dataset has 1.2 million high-resolution training images.&lt;br /&gt;
&lt;br /&gt;
* The classification task:&lt;br /&gt;
** Get the &amp;quot;correct&amp;quot; class in your top 5 bets. There are 1000 classes.&lt;br /&gt;
&lt;br /&gt;
* The localization task:&lt;br /&gt;
** For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.&lt;br /&gt;
&lt;br /&gt;
* Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, XRCE, ...&lt;br /&gt;
** Computer vision systems use complicated multi-stage systems.&lt;br /&gt;
** The early stages are typically hand-tuned by optimizing a few parameters.&lt;br /&gt;
&lt;br /&gt;
==== Error Rates on the ILSVRC-2012 Competition ====&lt;br /&gt;
&lt;br /&gt;
* University of Tokyo: 26.1% (classification)&lt;br /&gt;
* Oxford University Computer Vision Group: 26.9% (classification)&lt;br /&gt;
* INRIA(French national research institute in CS) + XRCE(Xerox Research Center Europe): 27.0% (classification)&lt;br /&gt;
* University of Amsterdam: 29.5% (classification)&lt;br /&gt;
* University of Toronto (&amp;#039;&amp;#039;&amp;#039;Alex Krizhevsky&amp;#039;&amp;#039;&amp;#039;): &amp;#039;&amp;#039;&amp;#039;16.4%&amp;#039;&amp;#039;&amp;#039; (classification)&lt;br /&gt;
&lt;br /&gt;
==== A Neural Network for ImageNet ====&lt;br /&gt;
&lt;br /&gt;
* Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its &amp;#039;&amp;#039;architecture&amp;#039;&amp;#039; was:&lt;br /&gt;
** 7 hidden layers not counting some max pooling layers.&lt;br /&gt;
** The early layers were convolutional.&lt;br /&gt;
** The last two layers were globally connected.&lt;br /&gt;
&lt;br /&gt;
* The &amp;#039;&amp;#039;activation functions&amp;#039;&amp;#039; were:&lt;br /&gt;
** Rectified linear units in every hidden layer. These train much faster and are more expressive than logistic units.&lt;br /&gt;
** Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.&lt;br /&gt;
&lt;br /&gt;
==== Tricks that Significantly Improve Generalization ====&lt;br /&gt;
&lt;br /&gt;
* Train on random 224x224 patches from the 256x256 images to get more data. Also use left-right reflections of the images.&lt;br /&gt;
** At test time, combine the opinions from ten different patches: The four 224x224 corner patches plus the central 224x224 patch plus the reflections of those five patches.&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;dropout&amp;quot; to regularize the weights in the globally connected layers (which contain most of the parameters).&lt;br /&gt;
** Dropout means that half of the hidden units in a layer are randomly removed for each training example.&lt;br /&gt;
** This stops hidden units from relying too much on other hidden units.&lt;br /&gt;
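&lt;br /&gt;
A sketch of dropout at training and test time (halving the outputs at test time is the usual compensation for keeping units with probability 0.5; names are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
import random

# Sketch of dropout as described above: for each training case, each
# hidden unit is kept with probability p_keep = 0.5. At test time all
# units are used with their outputs halved to compensate.
def dropout(activities, p_keep=0.5, rng=random):
    return [0.0 if rng.random() >= p_keep else a for a in activities]

def at_test_time(activities, p_keep=0.5):
    return [a * p_keep for a in activities]

random.seed(0)
h = [0.9, 0.1, 0.7, 0.4]
h_train = dropout(h)     # a random half of the units are zeroed
h_test = at_test_time(h) # all units kept, outputs halved
```
&lt;br /&gt;
Each training case therefore sees a different random half of the hidden layer, so no unit can rely on the others being present.&lt;br /&gt;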
&lt;br /&gt;
==== The Hardware Required for Alex Krizhevsky&amp;#039;s Net ====&lt;br /&gt;
&lt;br /&gt;
* He uses a very efficient implementation of convolutional nets on two Nvidia GTX 580 Graphics Processor Units (over 1000 fast little cores)&lt;br /&gt;
** GPUs are very good for matrix-matrix multiplies.&lt;br /&gt;
** GPUs have very high bandwidth to memory.&lt;br /&gt;
** This allows him to train the network in a week.&lt;br /&gt;
** It also makes it quick to combine results from 10 patches at test time.&lt;br /&gt;
&lt;br /&gt;
* We can spread a network over many cores if we can communicate the states fast enough.&lt;br /&gt;
&lt;br /&gt;
* As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned (i.e. pre Oct 2012) computer vision systems.&lt;br /&gt;
&lt;br /&gt;
==== Finding Roads in High-Resolution Images ====&lt;br /&gt;
&lt;br /&gt;
* Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images.&lt;br /&gt;
** It takes a large image patch and predicts a binary road label for the central 16x16 pixels.&lt;br /&gt;
** There is lots of labeled training data available for this task.&lt;br /&gt;
&lt;br /&gt;
* The task is hard for many reasons:&lt;br /&gt;
** Occlusion by buildings, trees, and cars.&lt;br /&gt;
** Shadows, lighting changes.&lt;br /&gt;
** Minor viewpoint changes.&lt;br /&gt;
&lt;br /&gt;
* The worst problems are incorrect labels:&lt;br /&gt;
** Badly registered maps&lt;br /&gt;
** Arbitrary decisions about what counts as a road (vs laneway etc.)&lt;br /&gt;
&lt;br /&gt;
== Mini-Batch Gradient Descent ==&lt;br /&gt;
&lt;br /&gt;
=== Overview ===&lt;br /&gt;
&lt;br /&gt;
* The Error Surface: Lies in a space with a horizontal axis for each weight and one vertical axis for the error.&lt;br /&gt;
** For a linear neuron with a squared error, it is a quadratic bowl where vertical cross-sections are parabolas and horizontal cross-sections are ellipses.&lt;br /&gt;
&lt;br /&gt;
* For multi-layer, non-linear nets the error surface is much more complicated. But locally, a piece of a quadratic bowl is usually a very good approximation.&lt;br /&gt;
&lt;br /&gt;
* Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.&lt;br /&gt;
** The gradient is big in the direction in which we only want to travel a small distance, small in the direction in which we want to travel a large distance.&lt;br /&gt;
&lt;br /&gt;
* If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges.&lt;br /&gt;
&lt;br /&gt;
* What we would like to achieve: move quickly in directions with small but consistent gradients, and move slowly in directions with big but inconsistent gradients.&lt;br /&gt;
&lt;br /&gt;
==== Stochastic Gradient Descent ====&lt;br /&gt;
&lt;br /&gt;
* If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.&lt;br /&gt;
** So, instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.&lt;br /&gt;
** The extreme version of this approach updates weights after each case. It&amp;#039;s called &amp;quot;online&amp;quot; learning.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches are usually better than online.&lt;br /&gt;
** Less computation is used updating the weights.&lt;br /&gt;
** Computing the gradient for many cases simultaneously uses matrix-matrix multiplies which are very efficient, especially on GPUs.&lt;br /&gt;
&lt;br /&gt;
* Mini-batches need to be balanced for classes (one way to approximate this is to pick points randomly).&lt;br /&gt;
&lt;br /&gt;
==== Two types of learning algorithm ====&lt;br /&gt;
&lt;br /&gt;
* If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).&lt;br /&gt;
&lt;br /&gt;
* For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.&lt;br /&gt;
&lt;br /&gt;
==== Basic Mini-Batch Gradient Descent Algorithm ====&lt;br /&gt;
&lt;br /&gt;
* Guess an initial learning rate.&lt;br /&gt;
** If the error keeps getting worse or oscillates wildly, reduce the learning rate.&lt;br /&gt;
** If the error is falling fairly consistently but slowly, increase the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Write a simple program to automate this way of adjusting the learning rate.&lt;br /&gt;
&lt;br /&gt;
* Toward the end of mini-batch learning, it nearly always helps to turn down the learning rate.&lt;br /&gt;
** This removes fluctuations in the final weights caused by the variations between mini-batches.&lt;br /&gt;
&lt;br /&gt;
* Turn down the learning rate when the error stops decreasing.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Note&amp;#039;&amp;#039;&amp;#039;: Compute the error in a separate validation set.&lt;br /&gt;
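&lt;br /&gt;
The recipe above can be sketched in plain NumPy. The linear model, squared error, and the halve-on-worsening schedule are illustrative assumptions, not part of the lecture.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=10, epochs=50):
    """Mini-batch gradient descent on a linear model with squared error."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])   # small random initial weights
    n = X.shape[0]
    prev_err = np.inf
    for epoch in range(epochs):
        perm = rng.permutation(n)                 # shuffle so batches are roughly balanced
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
        err = np.mean((X @ w - y) ** 2)           # in practice, monitor a validation set
        if err > prev_err:                        # error got worse: turn the rate down
            lr *= 0.5
        prev_err = err
    return w
```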
&lt;br /&gt;
=== Tricks for Mini-Batch Gradient Descent ===&lt;br /&gt;
&lt;br /&gt;
==== Initializing The Weights ====&lt;br /&gt;
&lt;br /&gt;
* If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.&lt;br /&gt;
** So they can never learn to be different features.&lt;br /&gt;
** We break symmetry by initializing the weights to have small random values.&lt;br /&gt;
&lt;br /&gt;
* If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.&lt;br /&gt;
** We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to $1 / \sqrt{\text{fan-in}}$.&lt;br /&gt;
&lt;br /&gt;
* We can also scale the learning rate the same way.&lt;br /&gt;
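&lt;br /&gt;
A minimal sketch of this initialization; the Gaussian distribution is an illustrative choice.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def init_weights(fan_in, fan_out, rng=None):
    """Small random weights scaled by 1/sqrt(fan_in): units with a big fan-in
    get smaller incoming weights, so simultaneous changes do not overshoot."""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.normal(scale=1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
```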
&lt;br /&gt;
==== Shifting the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, shifting the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.&lt;br /&gt;
&lt;br /&gt;
* The hyperbolic tangent (which is $2\times \text{logistic} - 1$) produces hidden activations that are roughly zero mean. &lt;br /&gt;
** In this respect it&amp;#039;s better than the logistic.&lt;br /&gt;
&lt;br /&gt;
==== Scaling the Inputs ====&lt;br /&gt;
&lt;br /&gt;
* When using steepest descent, scaling the input values makes a big difference.&lt;br /&gt;
** It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.&lt;br /&gt;
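&lt;br /&gt;
Shifting and scaling together are the usual standardization step. A sketch; computing the statistics on the training set only is standard practice, not stated in the notes.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def standardize(X_train, X_test):
    """Shift each input component to zero mean and scale it to unit variance,
    using statistics from the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0                       # leave constant components alone
    return (X_train - mean) / std, (X_test - mean) / std
```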
&lt;br /&gt;
==== A More Thorough Method: Decorrelate the Input Components ====&lt;br /&gt;
&lt;br /&gt;
* For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.&lt;br /&gt;
&lt;br /&gt;
* There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.&lt;br /&gt;
** Drop the principal components with the smallest eigenvalues: &amp;#039;&amp;#039;&amp;#039;this achieves some dimensionality reduction&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
** Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.&lt;br /&gt;
&lt;br /&gt;
* For a circular error surface, the gradient points straight towards the minimum.&lt;br /&gt;
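&lt;br /&gt;
The PCA-based decorrelation described above, as a sketch (function and variable names are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def pca_whiten(X, k):
    """Decorrelate inputs with PCA: keep the k leading components and divide
    each by the square root of its eigenvalue, which (for a linear neuron)
    turns an axis-aligned elliptical error surface into a circular one."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    proj = Xc @ eigvecs[:, order]
    return proj / np.sqrt(eigvals[order])
```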
&lt;br /&gt;
&lt;br /&gt;
==== Common Problems That Occur in Multilayer Networks ====&lt;br /&gt;
&lt;br /&gt;
* If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.&lt;br /&gt;
** The error derivatives for the hidden units will all become tiny and the error will not decrease.&lt;br /&gt;
** This is usually a plateau, but people often mistake it for a local minimum.&lt;br /&gt;
&lt;br /&gt;
* In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.&lt;br /&gt;
** The network finds this strategy quickly and may take a long time to improve on it by making use of the input.&lt;br /&gt;
** This is another plateau that looks like a local minimum.&lt;br /&gt;
&lt;br /&gt;
==== Be Careful about Turning Down the Learning Rate ====&lt;br /&gt;
&lt;br /&gt;
* Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.&lt;br /&gt;
** So we get a quick win.&lt;br /&gt;
** But then we get slower learning.&lt;br /&gt;
&lt;br /&gt;
* Don&amp;#039;t turn down the learning rate too soon!&lt;br /&gt;
&lt;br /&gt;
==== Four Ways to Speed up Mini-Batch Learning ====&lt;br /&gt;
&lt;br /&gt;
* Use &amp;quot;momentum&amp;quot;&lt;br /&gt;
** Instead of using the gradient to change the &amp;#039;&amp;#039;position&amp;#039;&amp;#039; of the weight &amp;quot;particle&amp;quot;, use it to change the &amp;#039;&amp;#039;velocity&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
* Use separate adaptive learning rates for each parameter.&lt;br /&gt;
** Slowly adjust the rate using the consistency of the gradient for that parameter.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;rmsprop&amp;#039;&amp;#039;&amp;#039;: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. &lt;br /&gt;
** This is the mini-batch version of just using the sign of the gradient.&lt;br /&gt;
&lt;br /&gt;
* Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)&lt;br /&gt;
** Adapt it to work for neural nets.&lt;br /&gt;
** Adapt it to work for mini-batches.&lt;br /&gt;
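&lt;br /&gt;
Of these, rmsprop admits a particularly short sketch; the decay constant 0.9, step size, and epsilon are common defaults assumed here, not values from the lecture.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def rmsprop_update(w, grad, mean_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One rmsprop step: keep a running average of the squared gradients for
    each weight and divide the learning rate by its square root."""
    mean_sq = decay * mean_sq + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_sq) + eps)
    return w, mean_sq
```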
&lt;br /&gt;
=== The Momentum Method ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.&lt;br /&gt;
# The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.&lt;br /&gt;
# Its momentum makes it keep going in the previous direction.&lt;br /&gt;
&lt;br /&gt;
* It damps oscillations in directions of high curvature by combining gradients with opposite signs.&lt;br /&gt;
&lt;br /&gt;
* It builds up speed in directions with a gentle but consistent gradient.&lt;br /&gt;
&lt;br /&gt;
==== Equations ====&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(t) = \alpha \mathbb{v}(t-1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The effect of the gradient is to increment the previous velocity. The velocity also decays by $\alpha$, which is slightly less than 1.&lt;br /&gt;
&lt;br /&gt;
The weight change is equal to the current velocity.&lt;br /&gt;
&lt;br /&gt;
$$\Delta \mathbb{w}(t) = \mathbb{v} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \mathbb{v} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
$$ = \alpha \Delta \mathbb{w} (t - 1) - \epsilon \frac{\partial E}{\partial \mathbb{w}} (t)$$&lt;br /&gt;
&lt;br /&gt;
The weight change can be expressed in terms of the previous weight change and the current gradient.&lt;br /&gt;
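&lt;br /&gt;
A direct transcription of these equations; the function name and default values are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    """Classical momentum: the gradient increments a velocity that decays by
    alpha; the weight change equals the current velocity."""
    v = alpha * v - lr * grad
    w = w + v
    return w, v
```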
&lt;br /&gt;
==== Behavior ====&lt;br /&gt;
&lt;br /&gt;
* If the error surface is a tilted plane, the ball reaches a terminal velocity $\mathbb{v}(\infty)$.&lt;br /&gt;
** If the momentum is close to 1, this is much faster than simple gradient descent.&lt;br /&gt;
&lt;br /&gt;
$$\mathbb{v}(\infty) = \frac{1}{1 - \alpha} \Big( - \epsilon \frac{\partial E}{\partial \mathbb{w}} \Big)$$&lt;br /&gt;
&lt;br /&gt;
* At the beginning of learning there may be very large gradients.&lt;br /&gt;
** So it pays to use a small momentum (e.g. 0.5).&lt;br /&gt;
** Once the large gradients have disappeared and the weights are stuck in a ravine the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99)&lt;br /&gt;
&lt;br /&gt;
* This allows us to learn at a rate that would cause divergent oscillations without the momentum.&lt;br /&gt;
&lt;br /&gt;
==== Better Type of Momentum (Nesterov 1983) ====&lt;br /&gt;
&lt;br /&gt;
* The standard momentum method &amp;#039;&amp;#039;first&amp;#039;&amp;#039; computes the gradient at the current location and &amp;#039;&amp;#039;then&amp;#039;&amp;#039; takes a big jump in the direction of the updated accumulated gradient.&lt;br /&gt;
* Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.&lt;br /&gt;
** Inspired by the Nesterov method for optimizing convex functions.&lt;br /&gt;
* &amp;#039;&amp;#039;First&amp;#039;&amp;#039; make a big jump in the direction of the previous accumulated gradient.&lt;br /&gt;
* &amp;#039;&amp;#039;Then&amp;#039;&amp;#039; measure the gradient where you end up and make a correction.&lt;br /&gt;
** It&amp;#039;s better to correct a mistake &amp;#039;&amp;#039;after&amp;#039;&amp;#039; you have made it!&lt;br /&gt;
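&lt;br /&gt;
A sketch of this jump-then-correct update. It needs a function that returns the gradient at an arbitrary point; names and defaults are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    """Nesterov-style momentum: first jump along the previous accumulated
    velocity, then measure the gradient at the new point and correct."""
    lookahead = w + alpha * v                  # the big jump
    v = alpha * v - lr * grad_fn(lookahead)    # correction measured after the jump
    return w + v, v
```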
&lt;br /&gt;
=== Adaptive Learning Rates for Each Connection ===&lt;br /&gt;
&lt;br /&gt;
==== Intuition ====&lt;br /&gt;
&lt;br /&gt;
* In a multilayer net, the appropriate learning rates can vary widely between weights:&lt;br /&gt;
** The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.&lt;br /&gt;
** The fan-in of a unit determines the size of the &amp;quot;overshoot&amp;quot; effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.&lt;br /&gt;
&lt;br /&gt;
* So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.&lt;br /&gt;
&lt;br /&gt;
==== One Way to Determine the Individual Learning Rates ====&lt;br /&gt;
&lt;br /&gt;
$$\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}$$&lt;br /&gt;
&lt;br /&gt;
* Start with a local gain of 1 for every weight, i.e. $g_{ij}(0) = 1$.&lt;br /&gt;
* Increase the local gain if the gradient for that weight does not change sign.&lt;br /&gt;
* Use small additive increases and multiplicative decreases.&lt;br /&gt;
** This ensures that big gains decay rapidly when oscillations start.&lt;br /&gt;
** If the gradient is totally random, the gain will hover around 1, since we add $\delta$ half the time and multiply by $1 - \delta$ half the time.&lt;br /&gt;
&lt;br /&gt;
$$\text{if } \Big( \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) \Big) &amp;gt; 0$$&lt;br /&gt;
$$\text{then } g_{ij}(t) = g_{ij}(t - 1) + 0.05$$&lt;br /&gt;
$$\text{else } g_{ij}(t) = g_{ij}(t - 1) * 0.95$$&lt;br /&gt;
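&lt;br /&gt;
The sign-comparison rule above, vectorized over all weights; the clipping to [0.1, 10] is one of the reasonable ranges these notes mention.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def update_gains(gains, grad, prev_grad, delta=0.05):
    """Additive increase while the gradient keeps its sign, multiplicative
    decrease when it flips; gains are clipped to a reasonable range."""
    agree = np.sign(grad) == np.sign(prev_grad)
    gains = np.where(agree, gains + delta, gains * (1 - delta))
    return np.clip(gains, 0.1, 10.0)
```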
&lt;br /&gt;
==== Tricks for Making Adaptive Learning Rates Work Better ====&lt;br /&gt;
&lt;br /&gt;
* Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].&lt;br /&gt;
&lt;br /&gt;
* Use full batch learning or very big mini-batches.&lt;br /&gt;
** This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates can be combined with momentum.&lt;br /&gt;
** Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).&lt;br /&gt;
&lt;br /&gt;
* Adaptive learning rates only deal with axis-aligned effects.&lt;br /&gt;
** Momentum does not care about the alignment of the axes.&lt;br /&gt;
&lt;br /&gt;
=== RMSPROP: Divide The Gradient By Running Average of Its Recent Magnitude ===&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
</feed>