Difference between revisions of "Machine Learning"

Revision as of 15:52, 22 May 2016

Types of Machine Learning

Supervised Learning
- Regression Problem: Continuous-valued output.
- Classification Problem: Discrete-valued output.
Unsupervised Learning
- Clustering

Linear Regression

Terminologies

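$x^{(i)}_j$ the value of feature $j$ in the $i$-th training example (so $x^{(i)}$ is the feature vector of example $i$)
$y^{(i)}$ the outcome (target value) of the $i$-th training example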
$h_\theta(x)$ the hypothesis
$J(\theta)$ the cost function
$\alpha$ the learning rate
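
To show how these symbols fit together, here is a minimal sketch of batch gradient descent for linear regression. It assumes the usual squared-error cost $J(\theta) = \frac{1}{2m} \sum_i (h_\theta(x^{(i)}) - y^{(i)})^2$, which this page does not spell out. The function names and the toy data are illustrative only.

```python
# Minimal sketch of batch gradient descent for linear regression.
# Function names and the toy data are illustrative assumptions.

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x, with x[0] = 1 as the intercept term."""
    return sum(t * xj for t, xj in zip(theta, x))

def cost(theta, X, y):
    """Assumed squared-error cost J(theta) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(X)
    return sum((h(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)

def gradient_step(theta, X, y, alpha):
    """One simultaneous update theta_j := theta_j - alpha * dJ/dtheta_j."""
    m = len(X)
    errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
    grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grad)]

# Toy data generated from y = 1 + 2 * x_1; each row is [1, x_1].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = [0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)
```

On this toy data the loop should drive `theta` toward `[1, 2]`. A much larger $\alpha$ would make the updates diverge, which is why the learning rate has to be chosen with care.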

Advanced Optimization Algorithms

Besides gradient descent, numerical computing offers more advanced algorithms for minimizing the cost function. Each of the algorithms below needs only code that computes the cost function $J(\theta)$ and its partial derivatives $\frac{\partial}{\partial \theta_i} J(\theta)$.

Conjugate gradient
BFGS
L-BFGS

Advantages

No need to manually pick $\alpha$ (the learning rate in gradient descent)
Often faster than gradient descent

Disadvantages

More complex to implement and debug than gradient descent
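
The single routine these optimizers expect might look like the sketch below. It returns both $J(\theta)$ and the list of partial derivatives. The squared-error cost is an assumed example, and `cost_and_gradient` and `numerical_gradient` are hypothetical names. The finite-difference check is a common debugging aid, not something this page prescribes.

```python
# Sketch: one routine returning J(theta) and its partial derivatives.
# The squared-error cost and all names here are illustrative assumptions.

def cost_and_gradient(theta, X, y):
    """Return (J(theta), [dJ/dtheta_0, ..., dJ/dtheta_n])."""
    m = len(X)
    errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
              for xi, yi in zip(X, y)]
    J = sum(e * e for e in errors) / (2 * m)
    grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j in range(len(theta))]
    return J, grad

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central-difference approximation, used to check the analytic gradient."""
    approx = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += eps
        down = list(theta)
        down[j] -= eps
        J_up = cost_and_gradient(up, X, y)[0]
        J_down = cost_and_gradient(down, X, y)[0]
        approx.append((J_up - J_down) / (2 * eps))
    return approx

# Toy data from y = 1 + 2 * x_1; each row is [1, x_1].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

J, grad = cost_and_gradient([0.0, 0.0], X, y)
check = numerical_gradient([0.0, 0.0], X, y)
```

If SciPy is available, a function of this shape can typically be handed to `scipy.optimize.minimize` with `jac=True` and `method` set to `"CG"`, `"BFGS"`, or `"L-BFGS-B"`. No learning rate is passed in that call.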

Classification Problem

Logistic Regression

$h_\theta(x) = 1 / (1 + e^{-\theta^T x})$. Note $f(z) = 1 / (1 + e^{-z})$ is called the sigmoid function / logistic function.
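$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right]$, where $m$ is the number of training examples. For a single example $(x, y)$ this reduces to $- y \cdot \log h_\theta(x) - (1 - y) \cdot \log (1-h_\theta(x))$. This comes from maximum likelihood estimation in statistics.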
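
As a rough illustration, the hypothesis and cost can be written in a few lines of plain Python. The sketch averages the per-example cost over a small training set. The names `sigmoid`, `h`, and `cost` and the toy data are assumptions made for the example. The branch inside `sigmoid` only avoids floating-point overflow for large negative $z$ and does not change the function.

```python
import math

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def h(theta, x):
    """Hypothesis h_theta(x) = sigmoid(theta^T x), with x[0] = 1 as intercept."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def cost(theta, X, y):
    """Average of -y*log(h) - (1-y)*log(1-h) over the training examples."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = h(theta, xi)
        total += -yi * math.log(p) - (1 - yi) * math.log(1 - p)
    return total / len(X)

# Toy data (illustrative): label 1 when x_1 is large, 0 otherwise.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

baseline = cost([0.0, 0.0], X, y)    # every prediction is 0.5
better = cost([-4.5, 2.0], X, y)     # hand-picked boundary at x_1 = 2.25
```

With $\theta = 0$ every prediction is 0.5, so the cost equals $\log 2$. Parameters that separate the two classes give a lower cost.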

Multi-class Classification

One-vs-all (one-vs-rest): For $n$-class classification, train a logistic regression classifier $h_\theta^{(i)} (x)$ for each class $i \in \{1, \ldots, n\}$ to predict the probability that $y = i$. To make a prediction on a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)} (x)$.
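
The prediction step can be sketched as follows, assuming the $n$ parameter vectors have already been trained. The values in `thetas` are hand-picked for illustration rather than learned, and `predict_one_vs_all` is a hypothetical name.

```python
import math

def sigmoid(z):
    """Logistic function; adequate for the moderate z values used here."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Return (class index starting at 1, list of h_theta^(i)(x) values)."""
    probs = [sigmoid(sum(t * xj for t, xj in zip(theta, x)))
             for theta in thetas]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best + 1, probs

# Hypothetical parameters for a 3-class problem; x = [1, x_1, x_2].
thetas = [
    [1.0, -2.0, -2.0],   # class 1: x_1 and x_2 both small
    [-1.0, 2.0, -2.0],   # class 2: x_1 large
    [-1.0, -2.0, 2.0],   # class 3: x_2 large
]

label_a, _ = predict_one_vs_all(thetas, [1.0, 0.0, 0.0])
label_b, _ = predict_one_vs_all(thetas, [1.0, 3.0, 0.0])
label_c, probs_c = predict_one_vs_all(thetas, [1.0, 0.0, 3.0])
```

The $n$ outputs come from independently trained classifiers, so they need not sum to 1. Only their ranking matters for the prediction.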

Cocktail Party Problem

Algorithm
- [W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1).*x)*x');
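
Read as matrix algebra, the argument passed to `svd` can be written out as below. This assumes each column of `x` holds one sample $x^{(j)}$ and that there are $m$ columns. The page does not state the layout, so treat both as assumptions.

```latex
\bigl(\operatorname{repmat}(\cdot) \circ x\bigr)\, x^{T}
  \;=\; \sum_{j=1}^{m} \bigl\lVert x^{(j)} \bigr\rVert^{2} \, x^{(j)} \, {x^{(j)}}^{T}
```

Here $\circ$ is the element-wise product (`.*`). The `repmat(sum(x.*x, 1), size(x, 1), 1)` term copies each sample's squared norm down its column, so the product weights every outer product $x^{(j)} {x^{(j)}}^{T}$ by $\lVert x^{(j)} \rVert^{2}$ before the decomposition.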
