Machine Learning
Types of Machine Learning
- Supervised Learning
  - Regression Problem: continuous-valued output.
  - Classification Problem: discrete-valued output.
- Unsupervised Learning
  - Clustering
Linear Regression
Terminology
- $x^{(i)}$: the feature vector of the $i$-th training example; $x^{(i)}_j$ denotes its $j$-th feature
- $y^{(i)}$: the outcome (target value) of the $i$-th training example
- $h_\theta(x)$: the hypothesis
- $J(\theta)$: the cost function
- $\alpha$: the learning rate
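To make these pieces concrete, below is a minimal Octave sketch of batch gradient descent for linear regression, assuming the usual hypothesis $h_\theta(x) = \theta^T x$ and squared-error cost $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$; the variable names (X, y, theta, alpha, num_iters) are illustrative, not from the original notes.

    % X: m-by-(n+1) design matrix (first column all ones), y: m-by-1 vector of outcomes
    % theta: (n+1)-by-1 parameter vector, alpha: learning rate, num_iters: number of iterations
    function [theta, J_history] = gradient_descent(X, y, theta, alpha, num_iters)
      m = length(y);
      J_history = zeros(num_iters, 1);
      for iter = 1:num_iters
        h = X * theta;                                          % hypothesis for all m examples at once
        theta = theta - (alpha / m) * (X' * (h - y));           % simultaneous update of every theta_j
        J_history(iter) = sum((X * theta - y) .^ 2) / (2 * m);  % cost after the update
      end
    end

A quick sanity check is that J_history decreases on every iteration; if it grows, $\alpha$ is too large.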
Advanced Optimization Algorithms
There are advanced algorithms (from numerical computing) for minimizing the cost function other than gradient descent. For all of the following algorithms, all we need to supply is code that computes the cost function $J(\theta)$ and its partial derivatives $\frac{\partial}{\partial \theta_j} J(\theta)$; see the Octave sketch at the end of this section.
- Conjugate gradient
- BFGS
- L-BFGS
Advantages
- No need to manually pick $\alpha$ (the learning rate in gradient descent)
- Often faster than gradient descent
Disadvantages
- More complex
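As a sketch of how such a routine is called in practice, Octave's built-in fminunc accepts exactly this pair (cost plus gradient). The linear-regression cost below and the names cost_function, initial_theta, X, y are illustrative assumptions, not the only option.

    % in cost_function.m: return J(theta) and its partial derivatives
    function [J, grad] = cost_function(theta, X, y)
      m = length(y);
      h = X * theta;                           % linear regression hypothesis
      J = sum((h - y) .^ 2) / (2 * m);         % cost J(theta)
      grad = (X' * (h - y)) / m;               % vector of partial derivatives dJ/dtheta_j
    end

    % the optimizer chooses its own step sizes, so no learning rate alpha is supplied
    options = optimset('GradObj', 'on', 'MaxIter', 400);
    initial_theta = zeros(size(X, 2), 1);
    [theta, J_min] = fminunc(@(t) cost_function(t, X, y), initial_theta, options);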
Classification Problem
Logistic Regression
- $h_\theta(x) = 1 / (1 + e^{-\theta^T x})$. Note $f(z) = 1 / (1 + e^{-z})$ is called the sigmoid function / logistic function.
- $J(\theta) = - y \cdot \log h_\theta(x) - (1 - y) \cdot \log (1-h_\theta(x))$ is the cost for a single training example; the overall cost averages it over the $m$ training examples. This form comes from maximum likelihood estimation in statistics.
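The two formulas above translate directly into Octave. A minimal vectorized sketch over the $m$ training examples (the names X, y, theta and the file layout are assumptions):

    % in sigmoid.m: the logistic function, applied elementwise
    function g = sigmoid(z)
      g = 1 ./ (1 + exp(-z));
    end

    % in logistic_cost.m: averaged cost and gradient for logistic regression
    function [J, grad] = logistic_cost(theta, X, y)
      m = length(y);
      h = sigmoid(X * theta);                              % h_theta(x) for all examples
      J = sum(-y .* log(h) - (1 - y) .* log(1 - h)) / m;   % per-example cost, averaged
      grad = (X' * (h - y)) / m;                           % partial derivatives dJ/dtheta_j
    end

This [J, grad] pair can be handed to fminunc exactly as in the advanced-optimization sketch above.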
Multi-class Classification
- One-vs-all (one-vs-rest): For $n$-class classification, convert the problem into $n$ binary classification problems, where the $i$-th one predicts whether the outcome is class $i \in \{1, \ldots, n\}$ or not. To classify a new input $x$, pick the class $i$ whose classifier gives the largest $h_\theta^{(i)}(x)$ (see the sketch below).
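A sketch of the resulting prediction step, using the sigmoid function defined above and assuming the $n$ trained parameter vectors are stacked as rows of a matrix all_theta (this layout and the names are assumptions, not from the original):

    % X: m-by-(d+1) design matrix, all_theta: n-by-(d+1), one row of parameters per class
    probs = sigmoid(X * all_theta');              % m-by-n matrix of h_theta^{(i)}(x) values
    [max_prob, predictions] = max(probs, [], 2);  % predictions(k): class picked for example k

Each example is assigned to the class whose classifier outputs the highest value.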
Cocktail Party Problem
- Separating overlapping audio sources (e.g., two speakers recorded by two microphones); an unsupervised learning problem, typically solved with independent component analysis.
- Algorithm (a one-line Octave/MATLAB expression; here x presumably holds the mixed recordings, one signal per row)
  - [W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1).*x)*x');