Machine Learning (Andrew Ng Course)

Types of Machine Learning

  • Supervised Learning
    • Regression Problem: Continuous valued output.
    • Classification Problem: Discrete valued output.
  • Unsupervised Learning
    • Clustering

Linear Regression

Terminologies

  • $x^{(i)}$ the feature vector of the $i$-th training example ($x^{(i)}_j$ is its $j$-th feature)
  • $y^{(i)}$ the outcome (target value) of the $i$-th training example
  • $h_\theta(x)$ the hypothesis
  • $J(\theta)$ the cost function
  • $\alpha$ the learning rate
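
A minimal Octave sketch of the squared-error cost and batch gradient descent for linear regression, assuming X is the $m \times (n+1)$ design matrix with a leading column of ones and y the $m \times 1$ vector of outcomes (variable names and iteration count are illustrative):

  alpha = 0.01;                                     % learning rate
  theta = zeros(size(X, 2), 1);                     % initialize parameters
  m = length(y);
  for iter = 1:1500                                 % batch gradient descent
    h = X * theta;                                  % h_theta(x) = theta' * x for all examples
    J = (1 / (2 * m)) * sum((h - y) .^ 2);          % squared-error cost J(theta)
    theta = theta - alpha * (1 / m) * X' * (h - y); % simultaneous update of all theta_j
  end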

Advanced Optimization Algorithms

There are advanced algorithms (from numerical computing) for minimizing the cost function other than gradient descent. For all of the following algorithms, all we need to supply is a function that computes the cost $J(\theta)$ and its partial derivatives $\frac{\partial}{\partial \theta_j} J(\theta)$ (see the Octave sketch after the list below).

  1. Conjugate gradient
  2. BFGS
  3. L-BFGS
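
In Octave such an optimizer can be called through fminunc; a minimal sketch, assuming X is the design matrix, y the labels, and costFunction(theta, X, y) a user-supplied function returning both the cost and the gradient (one such function is sketched under Logistic Regression below):

  % costFunction must return [J, grad] for a given theta (hypothetical helper)
  options = optimset('GradObj', 'on', 'MaxIter', 400);
  initial_theta = zeros(size(X, 2), 1);
  [theta, J_min] = fminunc(@(t) costFunction(t, X, y), initial_theta, options);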

Advantages

  • No need to manually pick $\alpha$ (the learning rate in gradient descent)
  • Often faster than gradient descent

Disadvantages

  • More complex

Classification Problem

Logistic Regression

  • $h_\theta(x) = 1 / (1 + e^{-\theta^T x})$. Note $f(z) = 1 / (1 + e^{-z})$ is called the sigmoid function / logistic function.
  • $J(\theta) = - y \cdot \log h_\theta(x) - (1 - y) \cdot \log (1-h_\theta(x))$ is the cost for a single training example; the overall cost is its average over the $m$ examples. It comes from Maximum Likelihood Estimation in statistics (see the Octave sketch below).
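
A minimal Octave sketch of the vectorized cost and gradient, in the [J, grad] form that fminunc expects (X is the $m \times (n+1)$ design matrix, y the $m \times 1$ vector of 0/1 labels):

  function [J, grad] = costFunction(theta, X, y)
    m = length(y);
    h = 1 ./ (1 + exp(-X * theta));                       % sigmoid of X * theta
    J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)); % logistic cost
    grad = (1 / m) * X' * (h - y);                        % partial derivatives
  end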

Multi-class Classification

  • One-vs-all (one-vs-rest): For $n$-class classification, train a logistic regression classifier $h_\theta^{(i)} (x)$ for each class $i = 1, \ldots, n$ to predict the probability that $y = i$. To make a prediction on a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)} (x)$.
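
A minimal Octave sketch of the prediction step, assuming Theta is a matrix whose $i$-th row holds the parameters $\theta^{(i)}$ of the classifier for class $i$ (hypothetical variable names):

  probs = 1 ./ (1 + exp(-X * Theta'));      % one column per class: column i holds h_theta^(i)(x)
  [p_max, prediction] = max(probs, [], 2);  % for each example, pick the most probable class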

Cocktail Party Problem

  • Algorithm: the one-line source-separation (ICA-style) demo from the lecture, where x holds the mixed recordings
    • [W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1).*x)*x');

Overfitting

Terminologies

  • "underfitting" or "high bias": not fitting the training set well
  • "overfitting" or "high variance": too many features, fails to generalize to new examples

Regularization

  • Modify the cost function to penalize large parameters. $J(\theta) = \frac{1}{2m} \big[ \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j = 1}^n \theta_j^2 \big]$. $\lambda$ is the regularization parameter. Note that the index $j$ starts from $1$, which means we don't penalize the constant term (by convention).

Regularized Linear Regression

Gradient Descent

For a learning rate $\alpha > 0$ and a regularization parameter $\lambda > 0$,

  • $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$
  • $\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for $j > 0$
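
The two updates above can be written in vectorized Octave form; a minimal sketch, where theta(1) holds $\theta_0$ and is not regularized:

  grad = (1 / m) * X' * (X * theta - y);                   % unregularized gradient
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end); % add regularization, skipping theta_0
  theta = theta - alpha * grad;                            % simultaneous update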

Normal Equation

  • Normal equation: $\theta = (X^T X)^{-1} X^T y$
  • Normal equation with regularization: $\theta = (X^T X + \lambda K)^{-1} X^T y$, where $K$ is a diagonal matrix whose first diagonal entry is $0$ and the rest of the diagonal is $1$.

Note that while $X^T X$ may not be invertible, $X^T X + \lambda K$ is always invertible for $\lambda > 0$.
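
A minimal Octave sketch of the regularized normal equation (pinv is used out of habit; for $\lambda > 0$ the matrix is invertible, so inv would also work):

  K = eye(size(X, 2));                        % (n+1) x (n+1) identity
  K(1, 1) = 0;                                % don't regularize the constant term theta_0
  theta = pinv(X' * X + lambda * K) * X' * y;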

Regularized Logistic Regression

Cost function $J(\theta) = - \frac{1}{m} \sum_{i = 1}^m \big[ y^{(i)} \cdot \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \cdot \log (1-h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m} \sum_{j = 1}^n \theta_j^2$

Gradient Descent

  • $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$
  • $\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for $j > 0$
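
These are the same update rules as for regularized linear regression, only with the sigmoid hypothesis. A minimal Octave sketch of the regularized cost and gradient in the [J, grad] form expected by fminunc (theta(1) holds $\theta_0$ and is not regularized):

  function [J, grad] = costFunctionReg(theta, X, y, lambda)
    m = length(y);
    h = 1 ./ (1 + exp(-X * theta));                          % sigmoid hypothesis
    J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
        + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);       % regularization term, skipping theta_0
    grad = (1 / m) * X' * (h - y);
    grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);
  end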

Neural Network

TODO.

Debugging Learning Algorithm

When a learning algorithm makes unacceptably large errors in its predictions (on a new data set), what can you do?

  • Get more training examples
  • Try smaller sets of features
  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing $\lambda$
  • Try increasing $\lambda$

Diagnostics

Machine Learning Diagnostic: A test that you can run to gain insight into what is/isn't working with a learning algorithm, and to gain guidance as to how best to improve its performance.

Split the data into training/test sets and use the test-set error as a diagnostic.
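
A minimal Octave sketch of a random 70/30 split (the proportions are illustrative):

  m = size(X, 1);
  idx = randperm(m);                    % shuffle the examples before splitting
  m_train = round(0.7 * m);
  Xtrain = X(idx(1:m_train), :);        ytrain = y(idx(1:m_train));
  Xtest  = X(idx(m_train+1:end), :);    ytest  = y(idx(m_train+1:end));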

Model Selection

Suppose we want to choose the degree $d$ of a polynomial regression model.

Split the data into three sets:

  • Training set (e.g. 60%)
  • Cross Validation set (e.g. 20%)
  • Test set (e.g. 20%)

For each $d$, train the model on the training set and compute the cross-validation error $J_{cv}(\Theta^{(d)})$ on the cross validation set. Pick the $d$ for which this cross-validation error is smallest. Finally, report the test set error $J_{test}(\Theta^{(d)})$ as the estimated generalization error of the chosen model.
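
A minimal Octave sketch for a single-feature regression, fitting each degree with the normal equation and picking the one with the lowest cross-validation error (Xtrain/ytrain and Xcv/ycv are hypothetical column vectors obtained from the split described above):

  max_d = 10;
  cv_err = zeros(max_d, 1);
  for d = 1:max_d
    Ptrain = [ones(size(Xtrain)), bsxfun(@power, Xtrain, 1:d)];  % [1, x, x^2, ..., x^d]
    Pcv    = [ones(size(Xcv)),    bsxfun(@power, Xcv,    1:d)];
    theta  = pinv(Ptrain' * Ptrain) * Ptrain' * ytrain;          % normal-equation fit
    cv_err(d) = (1 / (2 * numel(ycv))) * sum((Pcv * theta - ycv) .^ 2);
  end
  [best_err, best_d] = min(cv_err);                              % degree with the smallest J_cv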

Regularization and Bias/Variance

Learning Curves

Machine Learning System Design

Support Vector Machine

SVM Libraries

e.g. the liblinear and libsvm packages

Choosing Kernel

Not all similarity functions make valid kernels. A kernel needs to satisfy "Mercer's Theorem" so that the SVM packages' optimizations run correctly and do not diverge.

Examples: polynomial kernel, string kernel, chi-square kernel, histogram intersection kernel, ...
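
The most commonly used valid kernel in the course is the Gaussian (RBF) kernel; a minimal Octave sketch of the similarity function between an example x and a landmark l (both column vectors):

  function sim = gaussianKernel(x, l, sigma)
    sim = exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2)); % close to 1 when x is near l, close to 0 when far
  end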

Logistic Regression vs SVM

$n$ = number of features ($x \in \mathbb{R}^{n+1}$), $m$ = number of training examples

  • If $n$ is large (relative to $m$): Use logistic regression or SVM without a kernel ("linear kernel").
  • If $n$ is small, $m$ is intermediate: Use SVM with Gaussian kernel.
  • If $n$ is small, $m$ is large: Create/add more features, then use logistic regression or SVM without a kernel.
  • Note: A neural network is likely to work well in most of these settings, but may be slower to train.

Unsupervised Learning

Clustering

K-means, random initialization
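
A minimal Octave sketch of K-means, assuming X is an $m \times n$ data matrix and K the number of clusters (empty clusters are not handled; max_iters is illustrative):

  max_iters = 10;
  m = size(X, 1);
  ridx = randperm(m);
  centroids = X(ridx(1:K), :);                     % random initialization from the examples
  for iter = 1:max_iters
    % cluster assignment step: index of the closest centroid for each example
    dists = zeros(m, K);
    for k = 1:K
      delta = X - repmat(centroids(k, :), m, 1);
      dists(:, k) = sum(delta .^ 2, 2);
    end
    [min_dist, idx] = min(dists, [], 2);
    % move centroid step: mean of the points assigned to each centroid
    for k = 1:K
      centroids(k, :) = mean(X(idx == k, :), 1);
    end
  end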

Dimensionality Reduction

Principal Component Analysis (PCA)
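
A minimal Octave sketch of the PCA recipe from the course: mean-normalize the data, take the SVD of the covariance matrix, and project onto the top k components (k is illustrative):

  X = bsxfun(@minus, X, mean(X));   % mean normalization (feature scaling may also be needed)
  [m, n] = size(X);
  Sigma = (1 / m) * (X' * X);       % n x n covariance matrix
  [U, S, V] = svd(Sigma);           % columns of U are the principal components
  k = 2;
  Z = X * U(:, 1:k);                % projected data (m x k)
  X_approx = Z * U(:, 1:k)';        % approximate reconstruction in n dimensions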

Anomaly Detection

Gradient Descent with Large Datasets

  • Batch gradient descent: Use all $m$ examples in each iteration
  • Stochastic gradient descent: Use 1 example in each iteration
  • Mini-batch gradient descent: Use $b$ examples in each iteration ($b$ is usually a number between 2 and 100)
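
A minimal Octave sketch of mini-batch gradient descent for linear regression (alpha, b, and the number of epochs are illustrative; setting b = 1 gives stochastic gradient descent, b = m gives batch gradient descent):

  alpha = 0.01;  b = 10;  num_epochs = 10;
  [m, n] = size(X);
  theta = zeros(n, 1);
  for epoch = 1:num_epochs
    p = randperm(m);                           % shuffle the examples each epoch
    X = X(p, :);  y = y(p);
    for i = 1:b:m
      batch = i:min(i + b - 1, m);             % indices of the current mini-batch
      Xb = X(batch, :);  yb = y(batch);
      grad = (1 / numel(batch)) * Xb' * (Xb * theta - yb);
      theta = theta - alpha * grad;            % one update per mini-batch
    end
  end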