Machine Learning - Revision history

Tedyun: Replaced content with "* Machine Learning (Andrew Ng Course) * Neural Networks (Geoffrey Hinton Course)"

2016-10-30T21:20:28Z

Replaced content with "* Machine Learning (Andrew Ng Course) * Neural Networks (Geoffrey Hinton Course)"

Tedyun: /* Gradient Descent with Large Datasets */

2016-07-16T23:29:53Z

Gradient Descent with Large Datasets

Tedyun at 23:25, 16 July 2016

2016-07-16T23:25:45Z

Tedyun at 22:41, 3 July 2016

2016-07-03T22:41:51Z

Tedyun: /* Logistic Regression vs SVM */

2016-06-19T21:41:11Z

Logistic Regression vs SVM

Tedyun: /* Logistic Regression vs SVM */

2016-06-19T21:40:50Z

Logistic Regression vs SVM

Tedyun: /* Logistic Regression vs SVM */

2016-06-19T21:39:05Z

Logistic Regression vs SVM

Tedyun: /* Support Vector Machine */

2016-06-19T21:38:04Z

Support Vector Machine

Tedyun: /* Support Vector Machine */

2016-06-19T21:31:18Z

Support Vector Machine

Tedyun: /* Support Vector Machine */

2016-06-19T21:24:18Z

Support Vector Machine

← Older revision		Revision as of 21:20, 30 October 2016
Line 1:		Line 1:
−	== Types of Machine Learning ==	+	* [[Machine Learning (Andrew Ng Course)]]
−		+	* [[Neural Networks (Geoffrey Hinton Course)]]
−	* Supervised Learning
−	** Regression Problem: Continuous valued output.
−	** Classification Problem: Discrete valued output.
−	* Unsupervised Learning
−	** Clustering
−
−	== Linear Regression ==
−
−	=== Terminologies ===
−
−	* $x^{(i)}_j$ feature vectors
−	* $y^{(i)}$ outcomes
−	* $h_\theta(x)$ the hypothesis
−	* $J(\theta)$ the cost function
−	* $\alpha$ the learning rate
−
−	=== Advanced Optimization Algorithms ===
−
−	There are advanced algorithms (from numerical computing) to minimize the cost function other than the '''gradient descent'''. For all of the following algorithms all we need to supply to the algorithm is a code to compute the function $J(\theta)$ (the cost function) and the partial derivatives of the cost function $\frac{\partial}{\partial \theta_i} J(\theta)$.
−
−	# Conjugate gradient
−	# BFGS
−	# L-BFGS
−
−	Advantages
−	* No need to manually pick $\alpha$ (the learning rate in gradient descent)
−	* Often faster than gradient descent
−
−	Disadvantages
−	* More complex
−
−	== Classification Problem ==
−
−	=== Logistic Regression ===
−
−	* $h_\theta(x) = 1 / (1 + e^{-\theta^T x})$. Note $f(z) = 1 / (1 + e^{-z})$ is called the sigmoid function / logistic function.
−	* $J(\theta) = - y \cdot \log h_\theta(x) - (1 - y) \cdot \log (1-h_\theta(x))$. This comes from Maximum Likelihood Estimation in Statistics.
−
−	=== Multi-class Classification ===
−
−	* One-vs-all (one-vs-rest): For $n$-class classification, train a logistic regression classifier $h_\theta^{(i)} (x)$ for each class $i = 1, \ldots, n$ to predict the probability that $y = i$. To make a prediction on a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)} (x)$.
−
−	=== Cocktail Party Problem ===
−	* Algorithm
−	** [W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1).*x)*x');
−
−	== Overfitting ==
−
−	=== Terminologies ===
−	* "underfitting" or "high bias": not fitting the training set well
−	* "overfitting" or "high variance": too many features, fails to generalize to new examples
−
−	=== Regularization ===
−	* Modify the cost function to penalize large parameters. $J(\theta) = \frac{1}{2m} \big[ \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j = 1}^n \theta_j^2 \big]$. $\lambda$ is the regularization parameter. Note that the index $j$ starts from $1$ which means we don't penalize the constant term (by convention).
−
−	=== Regularized Linear Regression ===
−
−	==== Gradient Descent ====
−	For a learning rate $\alpha > 0$ and a regularization parameter $\lambda > 0$,
−
−	* $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$
−	* $\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for $j > 0$
−
−	==== Normal Equation ====
−
−	* Normal equation: $\theta = (X^T X)^{-1} X^T y$
−	* Normal equation '''with regularization''': $\theta = (X^T X + \lambda K)^{-1} X^T y$, where $K$ is a diagonal matrix whose first diagonal entry is $0$ and the rest of the diagonal is $1$.
−
−	Note that while $X^T X$ may not be invertible, $X^T X + \lambda K$ is ''always'' invertible for $\lambda > 0$.
−
−	=== Regularized Logistic Regression ===
−
−	Cost function $J(\theta) = - \frac{1}{m} \sum_{i = 1}^m \big[ y^{(i)} \cdot \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \cdot \log (1-h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m} \sum_{j = 1}^n \theta_j^2$
−
−	==== Gradient Descent ====
−
−	* $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$
−	* $\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for $j > 0$
−
−	== Neural Network ==
−
−	TODO.
−
−	== Debugging Learning Algorithm ==
−
−	When a learning algorithm makes unacceptably large errors in its predictions (on a new data set), what can you do?
−
−	* Get more training examples
−	* Try smaller sets of features
−	* Try getting additional features
−	* Try adding polynomial features
−	* Try decreasing $\lambda$
−	* Try increasing $\lambda$
−
−	=== Diagnostics ===
−
−	Machine Learning Diagnostic: A test that you can run to gain insight what is/isn't working with a learning algorithm, and gain guidance as to how best to improve its performance.
−
−	Split the data into training/test sets, use the test set as a diagnostic.
−
−	=== Model Selection ===
−
−	Suppose we want to choose a degree of polynomial $d$ of a regression model.
−
−	Split the data into three sets:
−
−	* Training set (e.g. 60%)
−	* Cross Validation set (e.g. 20%)
−	* Test set (e.g. 20%)
−
−	For each $d$, train the model with the training set, and compute the cross-validation error $J_{cv}(\Theta^{(d)})$ in the cross validation set. Pick $d$ where this cross validation error is the smallest. Finally, report the test set error $J_{test}(\Theta^{(d)})$ as the estimated error rate of the model.
−
−	=== Regularizaton and Bias/Variance ===
−
−	=== Learning Curves ===
−
−	== Machine Learning System Design ==
−
−	== Support Vector Machine ==
−
−	=== SVM Libraries ===
−
−	liblinear, libsvm package
−
−	=== Choosing Kernel ===
−
−	Not all similarity functions make valid kernels. Need to satisfy "Mercer's Theorem" to make sure SVM packages' optimizations run correctly and do not diverge.
−
−	Example: polynomial kernel, string kernel, chi-suqare kernel, histogram intersection kernel, ...
−
−	=== Logistic Regression vs SVM ===
−
−	$n$ = number of features ($x \in \mathbb{R}^{n+1}$), $m$ = number of training examples
−
−	* If $n$ large (relative to $m$): Use logistic regression or SVM without a kernel ("linear kernel").
−	* If $n$ is small, $m$ is intermediate: Use SVM with Gaussian kernel.
−	* If $n$ is small, $m$ is large: Create/add more features, then use logistic regression or SVM without a kernel.
−	* Note: Neural network likely to work well for most of these settings, but may be slower to train.
−
−	== Unsupervised Learning ==
−
−	=== Clustering ===
−
−	K-means, random initialization
−
−	=== Dimensionality Reduction ===
−
−	Principal Component Analysis (PCA)
−
−	=== Anomaly Detection ===
−
−	== Gradient Descent with Large Datasets ==
−
−	* Batch gradient descent: Use all $m$ examples in each iteration
−
−	* Stochastic gradient descent: Use 1 example in each iteration
−
−	* Mini-batch gradient descent: Use $b$ examples in each iteration ($b$ is usually a number between 2 and 100)
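
The three variants listed in the removed section differ only in how many examples feed each parameter update. The following is a minimal sketch, not the course's own code: it assumes a one-feature linear regression with squared-error cost, and the function name `minibatch_gradient_descent`, the step size `alpha = 0.2`, and the toy data are all illustrative choices.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size, alpha=0.2, epochs=2000, seed=0):
    """Fit y ~ theta0 + theta1 * x with mini-batch gradient descent.

    batch_size == len(xs) reduces to batch gradient descent;
    batch_size == 1 reduces to stochastic gradient descent.
    """
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of the squared-error cost over this batch only.
            errors = [(theta0 + theta1 * xs[i]) - ys[i] for i in batch]
            grad0 = sum(errors) / len(batch)
            grad1 = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            # Simultaneous update of both parameters.
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1
    return theta0, theta1


# Toy, noise-free data generated from y = 1 + 2x (an assumption for illustration).
xs = [i / 10 for i in range(20)]
ys = [1.0 + 2.0 * x for x in xs]

theta_batch = minibatch_gradient_descent(xs, ys, batch_size=len(xs))  # all m examples
theta_mini = minibatch_gradient_descent(xs, ys, batch_size=5)         # b = 5
theta_sgd = minibatch_gradient_descent(xs, ys, batch_size=1)          # 1 example
```

On noise-free data like this, all three settings should approach the same parameters; with noisy data the stochastic and mini-batch variants typically hover near the minimum rather than settling exactly on it.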

@@ Line 154: / Line 154: @@
 == Gradient Descent with Large Datasets ==
-- Batch gradient descent: Use all $m$ examples in each iteration
+* Batch gradient descent: Use all $m$ examples in each iteration
-- Stochastic gradient descent: Use 1 example in each iteration
+* Stochastic gradient descent: Use 1 example in each iteration
-- Mini-batch gradient descent: Use $b$ examples in each iteration ($b$ is usually a number between 2 and 100)
+* Mini-batch gradient descent: Use $b$ examples in each iteration ($b$ is usually a number between 2 and 100)

← Older revision		Revision as of 23:25, 16 July 2016
Line 151:		Line 151:

	=== Anomaly Detection ===		=== Anomaly Detection ===
		+
		+	== Gradient Descent with Large Datasets ==
		+
		+	- Batch gradient descent: Use all $m$ examples in each iteration
		+
		+	- Stochastic gradient descent: Use 1 example in each iteration
		+
		+	- Mini-batch gradient descent: Use $b$ examples in each iteration ($b$ is usually a number between 2 and 100)

← Older revision		Revision as of 22:41, 3 July 2016
Line 139:		Line 139:
	* If $n$ is small, $m$ is large: Create/add more features, then use logistic regression or SVM without a kernel.		* If $n$ is small, $m$ is large: Create/add more features, then use logistic regression or SVM without a kernel.
	* Note: Neural network likely to work well for most of these settings, but may be slower to train.		* Note: Neural network likely to work well for most of these settings, but may be slower to train.
		+
		+	== Unsupervised Learning ==
		+
		+	=== Clustering ===
		+
		+	K-means, random initialization
		+
		+	=== Dimensionality Reduction ===
		+
		+	Principal Component Analysis (PCA)
		+
		+	=== Anomaly Detection ===

@@ Line 133: / Line 133: @@
 === Logistic Regression vs SVM ===
-$n$ = number of features ($x \in R^{n+1}$), $m$ = number of training examples
+$n$ = number of features ($x \in \mathbb{R}^{n+1}$), $m$ = number of training examples
 * If $n$ large (relative to $m$): Use logistic regression or SVM without a kernel ("linear kernel").

← Older revision		Revision as of 21:40, 19 June 2016
Line 138:		Line 138:
	* If $n$ is small, $m$ is intermediate: Use SVM with Gaussian kernel.		* If $n$ is small, $m$ is intermediate: Use SVM with Gaussian kernel.
	* If $n$ is small, $m$ is large: Create/add more features, then use logistic regression or SVM without a kernel.		* If $n$ is small, $m$ is large: Create/add more features, then use logistic regression or SVM without a kernel.
		+	* Note: Neural network likely to work well for most of these settings, but may be slower to train.

← Older revision		Revision as of 21:38, 19 June 2016
Line 130:		Line 130:

	Example: polynomial kernel, string kernel, chi-suqare kernel, histogram intersection kernel, ...		Example: polynomial kernel, string kernel, chi-suqare kernel, histogram intersection kernel, ...
		+
		+	=== Logistic Regression vs SVM ===
		+
		+	$n$ = number of features, $m$ = number of training examples
		+
		+	* If $n$ large (relative to $m$): Use logistic regression or SVN without a kernel ("linear kernel").
		+	* If $n$ is small, $m$ is intermediate: Use SVM with Gaussian kernel.
		+	* If $n$ is small, $m$ is large: Create/add more features, then use logistic regression or SVM without a kernel.

← Older revision		Revision as of 21:31, 19 June 2016
Line 128:		Line 128:

	Not all similarity functions make valid kernels. Need to satisfy "Mercer's Theorem" to make sure SVM packages' optimizations run correctly and do not diverge.		Not all similarity functions make valid kernels. Need to satisfy "Mercer's Theorem" to make sure SVM packages' optimizations run correctly and do not diverge.
		+
		+	Example: polynomial kernel, string kernel, chi-suqare kernel, histogram intersection kernel, ...

← Older revision		Revision as of 21:24, 19 June 2016
Line 124:		Line 124:

	liblinear, libsvm package		liblinear, libsvm package
		+
		+	=== Choosing Kernel ===
		+
		+	Not all similarity functions make valid kernels. Need to satisfy "Mercer's Theorem" to make sure SVM packages' optimizations run correctly and do not diverge.
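
The Mercer condition mentioned in the added text can be probed numerically: for a valid kernel, the Gram matrix $K$ on any finite set of points must satisfy $c^T K c \ge 0$ for every coefficient vector $c$. The sketch below is an illustration under stated assumptions, not part of the wiki page: the helper names (`finds_mercer_violation` and friends), the sample points, and the tolerance are hypothetical, and the search can only expose a violation, never prove a kernel valid.

```python
import math
import random


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) similarity, a standard valid kernel."""
    return math.exp(-((x - z) ** 2) / (2 * sigma ** 2))


def negative_squared_distance(x, z):
    """A similarity function that is NOT a valid kernel."""
    return -((x - z) ** 2)


def gram_matrix(kernel, points):
    """K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(a, b) for b in points] for a in points]


def quadratic_form(K, c):
    """Compute c^T K c."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))


def finds_mercer_violation(kernel, points, trials=200, seed=0, tol=1e-9):
    """Return True if some tested vector c gives c^T K c < 0.

    Tries the all-ones vector first, then random vectors. A False result
    only means no violation was found on these points.
    """
    rng = random.Random(seed)
    K = gram_matrix(kernel, points)
    candidates = [[1.0] * len(points)]
    candidates += [[rng.uniform(-1, 1) for _ in points] for _ in range(trials)]
    return any(quadratic_form(K, c) < -tol for c in candidates)


points = [0.0, 1.0, 2.0, 3.0]
gaussian_violates = finds_mercer_violation(gaussian_kernel, points)            # False
neg_dist_violates = finds_mercer_violation(negative_squared_distance, points)  # True
```

For the negative squared distance, the all-ones vector already gives $c^T K c = -40$ on these four points, so the Gram matrix is not positive semidefinite and the function cannot serve as a kernel.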