Difference between revisions of "Machine Learning"

From TedYunWiki
Jump to navigation Jump to search
Line 98: Line 98:
  
 
Machine Learning Diagnostic: A test that you can run to gain insight what is/isn't working with a learning algorithm, and gain guidance as to how best to improve its performance.
 
Machine Learning Diagnostic: A test that you can run to gain insight what is/isn't working with a learning algorithm, and gain guidance as to how best to improve its performance.
 +
 +
Split the data into training/test sets, use the test set as a diagnostic.
 +
 +
=== Model Selection ===
 +
 +
Want to choose a degree of polynomial $d$
 +
 +
* Training set
 +
* Validation set
 +
* Test set

Revision as of 19:30, 5 June 2016

Types of Machine Learning

  • Supervised Learning
    • Regression Problem: Continuous valued output.
    • Classification Problem: Discrete valued output.
  • Unsupervised Learning
    • Clustering

Linear Regression

Terminologies

  • $x^{(i)}_j$ feature vectors
  • $y^{(i)}$ outcomes
  • $h_\theta(x)$ the hypothesis
  • $J(\theta)$ the cost function
  • $\alpha$ the learning rate

Advanced Optimization Algorithms

There are advanced algorithms (from numerical computing) to minimize the cost function other than the gradient descent. For all of the following algorithms all we need to supply to the algorithm is a code to compute the function $J(\theta)$ (the cost function) and the partial derivatives of the cost function $\frac{\partial}{\partial \theta_i} J(\theta)$.

  1. Conjugate gradient
  2. BFGS
  3. L-BFGS

Advantages

  • No need to manually pick $\alpha$ (the learning rate in gradient descent)
  • Often faster than gradient descent

Disadvantages

  • More complex

Classification Problem

Logistic Regression

  • $h_\theta(x) = 1 / (1 + e^{-\theta^T x})$. Note $f(z) = 1 / (1 + e^{-z})$ is called the sigmoid function / logistic function.
  • $J(\theta) = - y \cdot \log h_\theta(x) - (1 - y) \cdot \log (1-h_\theta(x))$. This comes from Maximum Likelihood Estimation in Statistics.

Multi-class Classification

  • One-vs-all (one-vs-rest): For $n$-class classification, train a logistic regression classifier $h_\theta^{(i)} (x)$ for each class $i = 1, \ldots, n$ to predict the probability that $y = i$. To make a prediction on a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)} (x)$.

Cocktail Party Problem

  • Algorithm
    • [W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1).*x)*x');

Overfitting

Terminologies

  • "underfitting" or "high bias": not fitting the training set well
  • "overfitting" or "high variance": too many features, fails to generalize to new examples

Regularization

  • Modify the cost function to penalize large parameters. $J(\theta) = \frac{1}{2m} \big[ \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j = 1}^n \theta_j^2 \big]$. $\lambda$ is the regularization parameter. Note that the index $j$ starts from $1$ which means we don't penalize the constant term (by convention).

Regularized Linear Regression

Gradient Descent

For a learning rate $\alpha > 0$ and a regularization parameter $\lambda > 0$,

  • $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$
  • $\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for $j > 0$

Normal Equation

  • Normal equation: $\theta = (X^T X)^{-1} X^T y$
  • Normal equation with regularization: $\theta = (X^T X + \lambda K)^{-1} X^T y$, where $K$ is a diagonal matrix whose first diagonal entry is $0$ and the rest of the diagonal is $1$.

Note that while $X^T X$ may not be invertible, $X^T X + \lambda K$ is always invertible for $\lambda > 0$.

Regularized Logistic Regression

Cost function $J(\theta) = - \frac{1}{m} \sum_{i = 1}^m \big[ y^{(i)} \cdot \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \cdot \log (1-h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m} \sum_{j = 1}^n \theta_j^2$

Gradient Descent

  • $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$
  • $\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for $j > 0$

Neural Network

TODO.

Debugging Learning Algorithm

When a learning algorithm makes unacceptably large errors in its predictions (on a new data set), what can you do?

  • Get more training examples
  • Try smaller sets of features
  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing $\lambda$
  • Try increasing $\lambda$

Diagnostics

Machine Learning Diagnostic: A test that you can run to gain insight what is/isn't working with a learning algorithm, and gain guidance as to how best to improve its performance.

Split the data into training/test sets, use the test set as a diagnostic.

Model Selection

Want to choose a degree of polynomial $d$

  • Training set
  • Validation set
  • Test set