<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.tedyun.com/index.php?action=history&amp;feed=atom&amp;title=Machine_Learning_%28Andrew_Ng_Course%29</id>
	<title>Machine Learning (Andrew Ng Course) - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.tedyun.com/index.php?action=history&amp;feed=atom&amp;title=Machine_Learning_%28Andrew_Ng_Course%29"/>
	<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Machine_Learning_(Andrew_Ng_Course)&amp;action=history"/>
	<updated>2026-04-25T19:43:39Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.35.10</generator>
	<entry>
		<id>https://wiki.tedyun.com/index.php?title=Machine_Learning_(Andrew_Ng_Course)&amp;diff=338&amp;oldid=prev</id>
		<title>Tedyun: Created page with &quot;== Types of Machine Learning ==  * Supervised Learning ** Regression Problem: Continuous valued output. ** Classification Problem: Discrete valued output. * Unsupervised Learn...&quot;</title>
		<link rel="alternate" type="text/html" href="https://wiki.tedyun.com/index.php?title=Machine_Learning_(Andrew_Ng_Course)&amp;diff=338&amp;oldid=prev"/>
		<updated>2016-10-30T21:18:53Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;== Types of Machine Learning ==  * Supervised Learning ** Regression Problem: Continuous valued output. ** Classification Problem: Discrete valued output. * Unsupervised Learn...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;== Types of Machine Learning ==&lt;br /&gt;
&lt;br /&gt;
* Supervised Learning&lt;br /&gt;
** Regression Problem: Continuous valued output.&lt;br /&gt;
** Classification Problem: Discrete valued output.&lt;br /&gt;
* Unsupervised Learning&lt;br /&gt;
** Clustering&lt;br /&gt;
&lt;br /&gt;
== Linear Regression ==&lt;br /&gt;
&lt;br /&gt;
=== Terminologies ===&lt;br /&gt;
&lt;br /&gt;
* $x^{(i)}$ the feature vector of the $i$-th training example ($x^{(i)}_j$ denotes its $j$-th feature)&lt;br /&gt;
* $y^{(i)}$ outcomes&lt;br /&gt;
* $h_\theta(x)$ the hypothesis&lt;br /&gt;
* $J(\theta)$ the cost function&lt;br /&gt;
* $\alpha$ the learning rate&lt;br /&gt;
&lt;br /&gt;
=== Advanced Optimization Algorithms ===&lt;br /&gt;
&lt;br /&gt;
There are advanced algorithms (from numerical computing) other than &amp;#039;&amp;#039;&amp;#039;gradient descent&amp;#039;&amp;#039;&amp;#039; for minimizing the cost function. For all of the following algorithms, all we need to supply is code that computes the cost function $J(\theta)$ and its partial derivatives $\frac{\partial}{\partial \theta_i} J(\theta)$.&lt;br /&gt;
&lt;br /&gt;
# Conjugate gradient&lt;br /&gt;
# BFGS&lt;br /&gt;
# L-BFGS&lt;br /&gt;
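As a sketch of how such an optimizer is used (assuming SciPy; the course itself works in Octave with fminunc, so the library and names here are illustrative), we hand it only the cost and its gradient:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data with an exact linear fit y = 1 + 2x (assumed example data)
X = np.c_[np.ones(5), np.arange(5.0)]   # design matrix with intercept column
y = 1.0 + 2.0 * np.arange(5.0)

def cost(theta):
    # J(theta) = (1 / 2m) * sum of squared residuals
    r = X @ theta - y
    return r @ r / (2 * len(y))

def grad(theta):
    # partial derivatives of J with respect to each theta_j
    return X.T @ (X @ theta - y) / len(y)

# BFGS needs only the cost and its gradient; no learning rate to pick.
res = minimize(cost, np.zeros(2), jac=grad, method="BFGS")
print(np.round(res.x, 4))   # approximately [1. 2.]
```

L-BFGS is available through the same interface via method="L-BFGS-B".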
&lt;br /&gt;
Advantages&lt;br /&gt;
* No need to manually pick $\alpha$ (the learning rate in gradient descent)&lt;br /&gt;
* Often faster than gradient descent&lt;br /&gt;
&lt;br /&gt;
Disadvantages&lt;br /&gt;
* More complex&lt;br /&gt;
&lt;br /&gt;
== Classification Problem ==&lt;br /&gt;
&lt;br /&gt;
=== Logistic Regression ===&lt;br /&gt;
&lt;br /&gt;
* $h_\theta(x) = 1 / (1 + e^{-\theta^T x})$. Note $f(z) = 1 / (1 + e^{-z})$ is called the sigmoid function / logistic function.&lt;br /&gt;
* $J(\theta) = - y \cdot \log h_\theta(x) - (1 - y) \cdot \log (1-h_\theta(x))$ (the cost for a single training example; the full cost averages this over all $m$ examples). This form comes from Maximum Likelihood Estimation in statistics.&lt;br /&gt;
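A minimal sketch of these two definitions (assuming NumPy; the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    # logistic function f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # hypothesis h_theta(x) = sigmoid(theta^T x)
    return sigmoid(theta @ x)

def cost_one(theta, x, y):
    # cost of a single example: -y*log(h) - (1-y)*log(1-h)
    p = h(theta, x)
    return -y * np.log(p) - (1.0 - y) * np.log(1.0 - p)

theta = np.array([0.0, 1.0])
x = np.array([1.0, 0.0])                   # x_0 = 1 is the intercept term
print(h(theta, x))                         # 0.5, since sigmoid(0) = 0.5
print(round(cost_one(theta, x, 1.0), 4))   # 0.6931, i.e. log(2)
```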
&lt;br /&gt;
=== Multi-class Classification ===&lt;br /&gt;
&lt;br /&gt;
* One-vs-all (one-vs-rest): For $n$-class classification, train a logistic regression classifier $h_\theta^{(i)} (x)$ for each class $i = 1, \ldots, n$ to predict the probability that $y = i$. To make a prediction on a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)} (x)$.&lt;br /&gt;
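The prediction rule above can be sketched as follows (assuming NumPy; the parameter vectors are hypothetical stand-ins for trained classifiers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained parameter vectors, one row per class (illustrative values)
thetas = np.array([[1.0, -2.0],
                   [0.0,  0.5],
                   [-1.0, 1.0]])

def predict(x):
    # Evaluate every one-vs-rest classifier and pick the class whose
    # estimated probability that y = i is highest.
    scores = sigmoid(thetas @ x)
    return int(np.argmax(scores))

x = np.array([1.0, 3.0])
# scores = sigmoid([-5, 1.5, 2]); the third classifier wins
print(predict(x))   # 2
```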
&lt;br /&gt;
=== Cocktail Party Problem ===&lt;br /&gt;
* Algorithm&lt;br /&gt;
** &lt;code&gt;[W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1).*x)*x&amp;#039;);&lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
== Overfitting ==&lt;br /&gt;
&lt;br /&gt;
=== Terminologies ===&lt;br /&gt;
* &amp;quot;underfitting&amp;quot; or &amp;quot;high bias&amp;quot;: not fitting the training set well&lt;br /&gt;
* &amp;quot;overfitting&amp;quot; or &amp;quot;high variance&amp;quot;: fitting the training set too closely (often due to too many features), so the model fails to generalize to new examples&lt;br /&gt;
&lt;br /&gt;
=== Regularization ===&lt;br /&gt;
* Modify the cost function to penalize large parameters. $J(\theta) = \frac{1}{2m} \big[ \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j = 1}^n \theta_j^2 \big]$. $\lambda$ is the regularization parameter. Note that the index $j$ starts from $1$ which means we don&amp;#039;t penalize the constant term (by convention).&lt;br /&gt;
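A sketch of this regularized cost for linear regression (assuming NumPy; the data and values are illustrative):

```python
import numpy as np

def reg_cost(theta, X, y, lam):
    # J = (1/2m) [ sum of squared residuals + lambda * sum_{j from 1} theta_j^2 ]
    m = len(y)
    r = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
    return (r @ r + penalty) / (2 * m)

X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0])
theta = np.array([0.0, 1.0])
print(reg_cost(theta, X, y, 0.0))   # 0.0 -- exact fit, no penalty
print(reg_cost(theta, X, y, 2.0))   # 0.5 -- penalty 2*1 divided by 2m = 4
```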
&lt;br /&gt;
=== Regularized Linear Regression ===&lt;br /&gt;
&lt;br /&gt;
==== Gradient Descent ====&lt;br /&gt;
For a learning rate $\alpha &amp;gt; 0$ and a regularization parameter $\lambda &amp;gt; 0$,&lt;br /&gt;
&lt;br /&gt;
* $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$&lt;br /&gt;
* $\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for $j &amp;gt; 0$&lt;br /&gt;
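The two update rules above can be sketched as one vectorized step (assuming NumPy; with $\lambda = 0$ this reduces to plain gradient descent):

```python
import numpy as np

def gd_step(theta, X, y, alpha, lam):
    # One regularized gradient-descent step; theta_0 is not shrunk.
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    shrink = np.ones_like(theta)
    shrink[1:] = 1.0 - alpha * lam / m
    return theta * shrink - alpha * grad

# Toy data with exact fit y = 3x (assumed example data)
X = np.c_[np.ones(4), np.arange(4.0)]
y = 3.0 * np.arange(4.0)
theta = np.zeros(2)
for _ in range(5000):
    theta = gd_step(theta, X, y, alpha=0.1, lam=0.0)
print(np.round(theta, 3))   # approximately [0. 3.]
```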
&lt;br /&gt;
==== Normal Equation ====&lt;br /&gt;
&lt;br /&gt;
* Normal equation: $\theta = (X^T X)^{-1} X^T y$&lt;br /&gt;
* Normal equation &amp;#039;&amp;#039;&amp;#039;with regularization&amp;#039;&amp;#039;&amp;#039;: $\theta = (X^T X + \lambda K)^{-1} X^T y$, where $K$ is a diagonal matrix whose first diagonal entry is $0$ and the rest of the diagonal is $1$.&lt;br /&gt;
&lt;br /&gt;
Note that while $X^T X$ may not be invertible, $X^T X + \lambda K$ is &amp;#039;&amp;#039;always&amp;#039;&amp;#039; invertible for $\lambda &amp;gt; 0$.&lt;br /&gt;
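A sketch of the regularized normal equation (assuming NumPy; np.linalg.solve is used rather than forming the inverse explicitly):

```python
import numpy as np

def normal_eq(X, y, lam):
    # theta = (X^T X + lambda*K)^{-1} X^T y, with K = diag(0, 1, ..., 1)
    n = X.shape[1]
    K = np.eye(n)
    K[0, 0] = 0.0   # do not regularize the constant term
    return np.linalg.solve(X.T @ X + lam * K, X.T @ y)

# Toy data with exact fit y = 3x (assumed example data)
X = np.c_[np.ones(4), np.arange(4.0)]
y = 3.0 * np.arange(4.0)
print(np.round(normal_eq(X, y, 0.0), 6))   # [0. 3.] -- ordinary least squares
```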
&lt;br /&gt;
=== Regularized Logistic Regression ===&lt;br /&gt;
&lt;br /&gt;
Cost function $J(\theta) = - \frac{1}{m} \sum_{i = 1}^m \big[ y^{(i)} \cdot \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \cdot \log (1-h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m} \sum_{j = 1}^n \theta_j^2$&lt;br /&gt;
&lt;br /&gt;
==== Gradient Descent ====&lt;br /&gt;
&lt;br /&gt;
* $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$&lt;br /&gt;
* $\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for $j &amp;gt; 0$&lt;br /&gt;
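The same update with the sigmoid hypothesis can be sketched as follows (assuming NumPy; the separable toy data is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd_step(theta, X, y, alpha, lam):
    # Same update shape as regularized linear regression, but h is the sigmoid.
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    shrink = np.ones_like(theta)
    shrink[1:] = 1.0 - alpha * lam / m
    return theta * shrink - alpha * grad

# Toy data: y = 1 exactly when the feature is positive (assumed example)
X = np.c_[np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])]
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = logistic_gd_step(theta, X, y, alpha=0.5, lam=0.1)
preds = (sigmoid(X @ theta) > 0.5).astype(float)
print(preds)   # matches y
```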
&lt;br /&gt;
== Neural Network ==&lt;br /&gt;
&lt;br /&gt;
TODO.&lt;br /&gt;
&lt;br /&gt;
== Debugging Learning Algorithm ==&lt;br /&gt;
&lt;br /&gt;
When a learning algorithm makes unacceptably large errors in its predictions (on a new data set), what can you do?&lt;br /&gt;
&lt;br /&gt;
* Get more training examples&lt;br /&gt;
* Try smaller sets of features&lt;br /&gt;
* Try getting additional features&lt;br /&gt;
* Try adding polynomial features&lt;br /&gt;
* Try decreasing $\lambda$&lt;br /&gt;
* Try increasing $\lambda$&lt;br /&gt;
&lt;br /&gt;
=== Diagnostics ===&lt;br /&gt;
&lt;br /&gt;
Machine Learning Diagnostic: a test that you can run to gain insight into what is/isn&amp;#039;t working with a learning algorithm, and guidance as to how best to improve its performance.&lt;br /&gt;
&lt;br /&gt;
Split the data into training/test sets, use the test set as a diagnostic.&lt;br /&gt;
&lt;br /&gt;
=== Model Selection ===&lt;br /&gt;
&lt;br /&gt;
Suppose we want to choose the degree $d$ of a polynomial regression model.&lt;br /&gt;
&lt;br /&gt;
Split the data into three sets:&lt;br /&gt;
&lt;br /&gt;
* Training set (e.g. 60%)&lt;br /&gt;
* Cross validation set (e.g. 20%)&lt;br /&gt;
* Test set (e.g. 20%)&lt;br /&gt;
&lt;br /&gt;
For each $d$, train the model on the training set and compute the cross-validation error $J_{cv}(\Theta^{(d)})$. Pick the $d$ with the smallest cross-validation error. Finally, report the test set error $J_{test}(\Theta^{(d)})$ as the estimated generalization error of the model.&lt;br /&gt;
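The selection loop can be sketched as follows (assuming NumPy; polyfit stands in for training, and the split follows the 60/20/20 example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = x ** 2 + 0.05 * rng.standard_normal(30)   # the true model is quadratic

# 60/20/20 split: shuffle indices once, then slice (illustrative split)
idx = rng.permutation(30)
tr, cv = idx[:18], idx[18:24]   # test slice idx[24:] reserved for final report

def cv_error(d):
    # Fit a degree-d polynomial on the training set,
    # return mean squared error on the cross-validation set.
    coef = np.polyfit(x[tr], y[tr], d)
    r = np.polyval(coef, x[cv]) - y[cv]
    return r @ r / len(cv)

best_d = min(range(1, 6), key=cv_error)
print(best_d)
```

A linear fit ($d = 1$) underfits the quadratic data badly, so the chosen degree is at least 2.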
&lt;br /&gt;
=== Regularization and Bias/Variance ===&lt;br /&gt;
&lt;br /&gt;
=== Learning Curves ===&lt;br /&gt;
&lt;br /&gt;
== Machine Learning System Design ==&lt;br /&gt;
&lt;br /&gt;
== Support Vector Machine ==&lt;br /&gt;
&lt;br /&gt;
=== SVM Libraries ===&lt;br /&gt;
&lt;br /&gt;
e.g. the liblinear and libsvm packages&lt;br /&gt;
&lt;br /&gt;
=== Choosing Kernel ===&lt;br /&gt;
&lt;br /&gt;
Not all similarity functions make valid kernels: a kernel needs to satisfy &amp;quot;Mercer&amp;#039;s Theorem&amp;quot; to make sure SVM packages&amp;#039; optimizations run correctly and do not diverge.&lt;br /&gt;
&lt;br /&gt;
Examples: polynomial kernel, string kernel, chi-square kernel, histogram intersection kernel, ...&lt;br /&gt;
&lt;br /&gt;
=== Logistic Regression vs SVM ===&lt;br /&gt;
&lt;br /&gt;
$n$ = number of features ($x \in \mathbb{R}^{n+1}$), $m$ = number of training examples&lt;br /&gt;
&lt;br /&gt;
* If $n$ is large (relative to $m$): Use logistic regression or SVM without a kernel (&amp;quot;linear kernel&amp;quot;).&lt;br /&gt;
* If $n$ is small, $m$ is intermediate: Use SVM with Gaussian kernel.&lt;br /&gt;
* If $n$ is small, $m$ is large: Create/add more features, then use logistic regression or SVM without a kernel.&lt;br /&gt;
* Note: a neural network is likely to work well in most of these settings, but may be slower to train.&lt;br /&gt;
&lt;br /&gt;
== Unsupervised Learning ==&lt;br /&gt;
&lt;br /&gt;
=== Clustering ===&lt;br /&gt;
&lt;br /&gt;
K-means, random initialization&lt;br /&gt;
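A minimal K-means sketch with random initialization from the data points (assuming NumPy; the two-blob data is illustrative):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Random initialization: pick k distinct data points as centroids,
    # then alternate assignment and centroid-update steps.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance from every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):   # keep a centroid that lost all points
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated blobs (assumed toy data)
X = np.vstack([np.zeros((5, 2)), 10.0 + np.zeros((5, 2))])
c, labels = kmeans(X, 2)
print(sorted(np.round(c[:, 0]).tolist()))   # [0.0, 10.0]
```

In practice, the course recommends running K-means from several random initializations and keeping the lowest-cost result.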
&lt;br /&gt;
=== Dimensionality Reduction ===&lt;br /&gt;
&lt;br /&gt;
Principal Component Analysis (PCA)&lt;br /&gt;
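A PCA sketch via the SVD of the centered data matrix (assuming NumPy; the toy data lies along the line y = x, so one direction carries all the variance):

```python
import numpy as np

def pca(X, k):
    # Center the data, then project onto the top-k right singular
    # vectors of the data matrix (the principal directions).
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T          # projected data, k dimensions

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
Z = pca(X, 1)
print(Z.shape)   # (4, 1)
```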
&lt;br /&gt;
=== Anomaly Detection ===&lt;br /&gt;
&lt;br /&gt;
== Gradient Descent with Large Datasets ==&lt;br /&gt;
&lt;br /&gt;
* Batch gradient descent: Use all $m$ examples in each iteration&lt;br /&gt;
&lt;br /&gt;
* Stochastic gradient descent: Use 1 example in each iteration&lt;br /&gt;
&lt;br /&gt;
* Mini-batch gradient descent: Use $b$ examples in each iteration ($b$ is usually a number between 2 and 100)&lt;/div&gt;</summary>
		<author><name>Tedyun</name></author>
	</entry>
</feed>