* [[Machine Learning (Andrew Ng Course)]]
* [[Neural Networks (Geoffrey Hinton Course)]]

== Types of Machine Learning ==

* Supervised Learning
** Regression Problem: Continuous valued output.
** Classification Problem: Discrete valued output.
* Unsupervised Learning
** Clustering
 
 
 
== Linear Regression ==
 
 
 
=== Terminology ===
 
 
 
* $x^{(i)}$: the feature vector of the $i$-th training example ($x^{(i)}_j$ denotes its $j$-th feature)
* $y^{(i)}$: the outcome (target value) of the $i$-th training example
* $h_\theta(x)$: the hypothesis
* $J(\theta)$: the cost function
* $\alpha$: the learning rate
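To make the notation concrete, here is a minimal Octave sketch that evaluates the hypothesis and the usual (unregularized) squared-error cost $J(\theta) = \frac{1}{2m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$, consistent with the regularized form given later. The variable names (<code>X</code>, <code>y</code>, <code>theta</code>) and the function name are illustrative, not taken from the notes above.

<pre>
% Illustrative sketch (not from the original notes).
% X: m x (n+1) design matrix whose first column is all ones (x_0 = 1)
% y: m x 1 vector of outcomes, theta: (n+1) x 1 parameter vector
function J = linearRegressionCost(X, y, theta)
  m = length(y);                           % number of training examples
  h = X * theta;                           % h_theta(x^(i)) for every example at once
  J = (1 / (2 * m)) * sum((h - y) .^ 2);   % squared-error cost
end
</pre>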
 
 
 
=== Advanced Optimization Algorithms ===
 
 
 
There are advanced algorithms (from numerical optimization) for minimizing the cost function other than '''gradient descent'''. For all of the following algorithms, the only code we need to supply computes the cost function $J(\theta)$ and its partial derivatives $\frac{\partial}{\partial \theta_j} J(\theta)$ (a sketch of this interface appears at the end of this subsection).
 
 
 
# Conjugate gradient
# BFGS
# L-BFGS
 
 
 
Advantages
 
* No need to manually pick $\alpha$ (the learning rate in gradient descent)
 
* Often faster than gradient descent
 
 
 
Disadvantages
 
* More complex
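As a sketch of the interface mentioned above: the cost function returns both $J(\theta)$ and its gradient, and the optimizer is told that the gradient is supplied. The example below uses Octave's <code>fminunc</code> with a toy cost $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$; both the optimizer choice and the toy cost are illustrative assumptions, not taken from the notes above.

<pre>
% costFunction.m -- toy cost J(theta) = (theta(1) - 5)^2 + (theta(2) - 5)^2, minimized at (5, 5).
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);   % partial derivative w.r.t. theta_1
  gradient(2) = 2 * (theta(2) - 5);   % partial derivative w.r.t. theta_2
end

% At the Octave prompt: 'GradObj' tells the optimizer that we supply the gradient,
% so no learning rate alpha has to be picked by hand.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
</pre>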
 
 
 
== Classification Problem ==
 
 
 
=== Logistic Regression ===
 
 
 
* $h_\theta(x) = 1 / (1 + e^{-\theta^T x})$. Note $f(z) = 1 / (1 + e^{-z})$ is called the sigmoid function (logistic function).
* $J(\theta) = - \frac{1}{m} \sum_{i = 1}^m \big[ y^{(i)} \cdot \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \cdot \log (1-h_\theta(x^{(i)})) \big]$; a single example $(x, y)$ contributes the cost $- y \cdot \log h_\theta(x) - (1 - y) \cdot \log (1-h_\theta(x))$. This comes from maximum likelihood estimation in statistics.
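A minimal Octave sketch of these two formulas; the function names <code>sigmoid</code> and <code>logisticCost</code> and the variable layout are illustrative assumptions.

<pre>
% sigmoid.m -- logistic function f(z) = 1 / (1 + e^(-z)), applied element-wise.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

% logisticCost.m -- average cross-entropy cost for logistic regression (illustrative sketch).
% X: m x (n+1) design matrix (first column all ones), y: m x 1 vector of 0/1 labels.
function J = logisticCost(X, y, theta)
  m = length(y);
  h = sigmoid(X * theta);                                    % h_theta(x^(i)) for each example
  J = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));   % averaged over the m examples
end
</pre>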
 
 
 
=== Multi-class Classification ===
 
 
 
* One-vs-all (one-vs-rest): For $n$-class classification, train a logistic regression classifier $h_\theta^{(i)} (x)$ for each class $i = 1, \ldots, n$ to predict the probability that $y = i$. To make a prediction on a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)} (x)$.
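A sketch of the prediction step, assuming the $n$ classifiers have already been trained and their parameter vectors stacked as the rows of a matrix <code>allTheta</code> (a naming convention assumed here, not used in the notes above), and that <code>sigmoid</code> is defined as in the previous sketch.

<pre>
% Illustrative sketch of one-vs-all prediction.
% allTheta: n x (n+1) matrix; row i holds the parameters theta^(i) of the classifier for class i.
% X: m x (n+1) design matrix (first column all ones).
function p = predictOneVsAll(allTheta, X)
  probs = sigmoid(X * allTheta');   % m x n matrix; entry (k, i) estimates P(y = i | x^(k))
  [~, p] = max(probs, [], 2);       % for each example, pick the class with the largest probability
end
</pre>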
 
 
 
=== Cocktail Party Problem ===
 
* Algorithm (a one-line Octave expression for separating the mixed audio sources; <code>x</code> holds the mixed recordings):
** <code>[W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1).*x)*x');</code>
 
 
 
== Overfitting ==
 
 
 
=== Terminology ===
 
* "underfitting" or "high bias": not fitting the training set well
 
* "overfitting" or "high variance": too many features, fails to generalize to new examples
 
 
 
=== Regularization ===
 
* Modify the cost function to penalize large parameters: $J(\theta) = \frac{1}{2m} \big[ \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j = 1}^n \theta_j^2 \big]$, where $\lambda$ is the regularization parameter. Note that the index $j$ starts from $1$, which means we don't penalize the constant term $\theta_0$ (by convention).
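A minimal Octave sketch of this regularized cost (the function name and variable layout are illustrative assumptions):

<pre>
% Illustrative sketch of the regularized squared-error cost.
% X: m x (n+1) design matrix (first column all ones), y: m x 1 outcomes,
% theta: (n+1) x 1 parameters, lambda: regularization parameter.
function J = regularizedLinearCost(X, y, theta, lambda)
  m = length(y);
  h = X * theta;
  J = (1 / (2 * m)) * (sum((h - y) .^ 2) + lambda * sum(theta(2:end) .^ 2));  % theta(1), i.e. theta_0, is not penalized
end
</pre>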
 
 
 
=== Regularized Linear Regression ===
 
 
 
==== Gradient Descent ====
 
For a learning rate $\alpha > 0$ and a regularization parameter $\lambda > 0$,
 
 
 
* $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$
 
* $\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for $j > 0$
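One iteration of these updates, vectorized in Octave; this is a sketch that assumes <code>X</code>, <code>y</code>, <code>theta</code>, <code>alpha</code>, and <code>lambda</code> are already defined as in the sketches above.

<pre>
% One gradient descent step for regularized linear regression (illustrative sketch).
m = length(y);
grad = (1 / m) * X' * (X * theta - y);   % unregularized part of the gradient, all components at once
reg = (lambda / m) * theta;
reg(1) = 0;                              % do not shrink theta_0
theta = theta - alpha * (grad + reg);    % simultaneous update of every theta_j
</pre>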
 
 
 
==== Normal Equation ====
 
 
 
* Normal equation: $\theta = (X^T X)^{-1} X^T y$
 
* Normal equation '''with regularization''': $\theta = (X^T X + \lambda K)^{-1} X^T y$, where $K$ is a diagonal matrix whose first diagonal entry is $0$ and the rest of the diagonal is $1$.
 
 
 
Note that while $X^T X$ may not be invertible, $X^T X + \lambda K$ is ''always'' invertible for $\lambda > 0$.
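In Octave the regularized normal equation can be written directly; the sketch below assumes <code>X</code>, <code>y</code>, and <code>lambda</code> as above and uses <code>pinv</code> purely as a numerical safeguard.

<pre>
% Regularized normal equation (illustrative sketch).
% X: m x (n+1) design matrix (first column all ones), y: m x 1 outcomes.
K = eye(size(X, 2));
K(1, 1) = 0;                                  % first diagonal entry 0: the constant term is not regularized
theta = pinv(X' * X + lambda * K) * X' * y;   % theta = (X'X + lambda*K)^(-1) X' y
</pre>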
 
 
 
=== Regularized Logistic Regression ===
 
 
 
Cost function $J(\theta) = - \frac{1}{m} \sum_{i = 1}^m \big[ y^{(i)} \cdot \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \cdot \log (1-h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m} \sum_{j = 1}^n \theta_j^2$
 
 
 
==== Gradient Descent ====
 
 
 
* $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$
 
* $\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for $j > 0$
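Putting the pieces together, here is a sketch of a cost-plus-gradient function in the shape expected by the optimizers from the Advanced Optimization Algorithms section (the function name is illustrative, and <code>sigmoid</code> is assumed to be defined as in the earlier sketch).

<pre>
% costFunctionReg.m -- regularized logistic regression cost and gradient (illustrative sketch).
% X: m x (n+1) design matrix (first column all ones), y: m x 1 vector of 0/1 labels.
function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  J = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)) ...
      + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
  grad = (1 / m) * X' * (h - y);                              % same form as in the linear case
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);    % theta_0 is not regularized
end
</pre>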
 
