Taxonomy

There are many kinds of machine learning models, and a good taxonomy helps us make sense of the field. A taxonomy can be built along several criteria:

  • supervised / unsupervised

  • parametric / non-parametric

  • discriminative / generative

Supervised vs Unsupervised

The easiest way to create a taxonomy is to look at the input and the objective of the learning task.

Supervised Learning

In supervised learning, the input data set is \(\mathcal{D} = \left\{ \left(\boldsymbol{x} _i, y_i \right) \right\}_{i=1}^N\), where \(\boldsymbol{x}_i\) is called the features, attributes, or covariates, and \(y_i\) is called the label. The objective is to predict \(y_i\) by learning the relation between \(\boldsymbol{x}_i\) and \(y_i\); in other words, it aims to build a mapping from the feature space \(\mathcal{X}\) to the label space \(\mathcal{Y}\), given a training set \(\mathcal{D} = \left\{ (\boldsymbol{x}_i , y_i) \right\}\) with \(\boldsymbol{x}_i \in \mathcal{X} , y_i \in \mathcal{Y}\).

When \(y_i\) is a real-valued variable, i.e., \(\mathcal{Y} = \mathbb{R}\), the learning task is called regression. When \(y_i\) is categorical, i.e., \(\mathcal{Y} = \left\{ 1, \ldots, C \right\}\), the learning task is called classification.

A model uses a function \(f(\boldsymbol{x} ; \boldsymbol{\theta})\), parameterized by \(\boldsymbol{\theta}\), to predict \(\hat{y}\). In regression, this function returns the predicted value of \(y\); in classification, it returns a per-class score \(f_{y=c}(\boldsymbol{x}; \boldsymbol{\theta})\) for each class \(c\), and the prediction is the class with the highest score.
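
To make this concrete, here is a minimal NumPy sketch of \(f(\boldsymbol{x}; \boldsymbol{\theta})\), using linear models purely as an illustration; the shapes, weights, and data below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                  # one observation with 5 features

# Regression: f(x; theta) returns the predicted real value y_hat.
w, b = rng.normal(size=5), 0.0          # theta = (w, b), illustrative values
y_hat = w @ x + b                       # a real number

# Classification: f(x; theta) returns one score per class c,
# and the prediction is the class with the highest score.
W, b_vec = rng.normal(size=(3, 5)), np.zeros(3)   # C = 3 classes
scores = W @ x + b_vec                  # f_{y=c}(x; theta) for c = 0, 1, 2
y_hat_class = np.argmax(scores)         # predicted class
```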

Learning

There is a loss function \(\ell\) penalizing the model for predicting \(\hat{y}\) when the ground truth is \(y\). Learning is essentially minimizing the empirical loss

\[ \boldsymbol{\theta}^{*}=\underset{\boldsymbol{\theta}}{\operatorname{argmin}} \frac{1}{N} \sum_{i} \ell\left(f\left(\boldsymbol{x}_i ; \boldsymbol{\theta}\right), y_{i}\right) \]

Note

  • The coefficient \(\frac{1}{N}\) is often omitted since it is a constant term and does not affect the optimization process.

  • The loss function \(\ell\) is usually related to the specific task, but is often not exactly the same as the task metric we evaluate with (e.g., error rate, F1 score), since such metrics are typically non-differentiable and hard to optimize directly.
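
As a concrete (if simplistic) instance of the objective above, the sketch below minimizes the empirical squared loss of a linear model by plain gradient descent; the synthetic data, learning rate, and step count are all arbitrary choices for illustration.

```python
import numpy as np

# Empirical risk minimization with squared loss and a linear model
# f(x; theta) = x @ theta, on synthetic data.
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)   # synthetic labels

theta = np.zeros(d)
lr = 0.1
for _ in range(500):
    residual = X @ theta - y              # f(x_i; theta) - y_i for all i
    grad = (2 / N) * (X.T @ residual)     # gradient of (1/N) * sum_i (f - y_i)^2
    theta -= lr * grad

print(theta)  # close to theta_true
```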

Assumptions

  • observations \((\boldsymbol{x}_i ,y_i)\) are drawn i.i.d. from a joint probability distribution \(p(\boldsymbol{x} ,y)\)

  • the joint distribution is the same in the training and test data.

The ultimate goal is to minimize the expected loss, a.k.a. the risk:

\[ \operatorname{E}_{(\boldsymbol{x} , y) \sim p(\boldsymbol{x} ,y)}\left[\ell \left(f \left( \boldsymbol{x} ; \boldsymbol{\theta}\right), y \right)\right] \]

When the training set \(\mathcal{D}\) is a good representative of the underlying distribution \(p(\boldsymbol{x} , y)\), the empirical loss serves as a proxy for the risk.
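
The gap between the two can be seen numerically. In the sketch below, a known synthetic \(p(\boldsymbol{x}, y)\) lets us estimate the risk with a very large sample and compare it with the empirical loss on a small training set; the distribution and the fixed predictor are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    """Sample n observations from a known joint distribution p(x, y)."""
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)   # y = 2x + unit Gaussian noise
    return x, y

f = lambda x: 2.0 * x                  # a fixed predictor; no learning here
loss = lambda x, y: (f(x) - y) ** 2    # squared loss

x_train, y_train = draw(50)            # a small training set
x_big, y_big = draw(1_000_000)         # huge sample: Monte Carlo estimate of the risk

print(loss(x_train, y_train).mean())   # empirical loss, fluctuates around 1.0
print(loss(x_big, y_big).mean())       # ~ the risk, which is exactly 1.0 here
```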

Discriminative vs Generative

Models for classification can be divided into two types.

  • discriminative

    • probabilistic: directly estimate the class probability \(p(y \mid \boldsymbol{x} ; \boldsymbol{\theta} )\)

    • non-probabilistic: directly optimize loss of \(f(\boldsymbol{x} ; \boldsymbol{\theta})\)

  • generative

    • estimate the joint density \(p(y, \boldsymbol{x} ; \boldsymbol{\theta} )\) by estimating the per-class density \(p(\boldsymbol{x} \mid y)\) and the class prior \(p(y)\), then estimate the class probability as

    \[ p(y \mid \boldsymbol{x} ; \boldsymbol{\theta}) = \frac{p( y, \boldsymbol{x} ; \boldsymbol{\theta})}{p(\boldsymbol{x})} \propto p(y, \boldsymbol{x} ; \boldsymbol{\theta} ) \]

When to use which? If \(\operatorname{dim}(\boldsymbol{x})\) is large but only a few components of \(\boldsymbol{x}\) are helpful for discriminating \(y\) (sparse signal), then modeling \(p(\boldsymbol{x} \mid y)\) or \(p(\boldsymbol{x})\) may be computationally intractable and unnecessary. It is better to use discriminative models to directly estimate \(p(y\mid \boldsymbol{x} ; \boldsymbol{\theta})\).

On the other hand, if \(p(y\mid \boldsymbol{x} ; \boldsymbol{\theta} )\) is hard to estimate while \(p(\boldsymbol{x} \mid y)\) is easy to estimate (e.g. Gaussian), then we prefer generative models.
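
As a sketch of the generative recipe, the code below implements a Gaussian naive Bayes classifier: it estimates \(p(\boldsymbol{x} \mid y)\) as a product of per-feature Gaussians plus the prior \(p(y)\), then scores classes by \(\log p(y) + \log p(\boldsymbol{x} \mid y)\). The data and parameter choices are synthetic, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data with Gaussian features.
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

classes = [0, 1]
priors = [np.mean(y == c) for c in classes]          # p(y)
means = [X[y == c].mean(axis=0) for c in classes]    # per-class Gaussian
stds = [X[y == c].std(axis=0) for c in classes]      # density parameters

def log_gauss(x, mu, sigma):
    """Log density of a Gaussian, evaluated per feature."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def predict(x):
    # log p(y, x) = log p(y) + sum_j log p(x_j | y)   (naive independence)
    log_joint = [np.log(priors[c]) + log_gauss(x, means[c], stds[c]).sum()
                 for c in classes]
    return int(np.argmax(log_joint))                 # argmax_y p(y | x)

print(predict(np.array([-0.8, -1.2])))   # expected: 0
print(predict(np.array([+1.1, +0.9])))   # expected: 1
```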

Note

Generative classification models can be seen as unsupervised learning (density estimation) in the service of a supervised objective.

Unsupervised Learning

In unsupervised learning, the input data set is \(\mathcal{D} = \left\{ \boldsymbol{x} _i \right\}_{i=1}^N\), i.e., there are no labels. The objective is to find interesting patterns in the data, a.k.a. knowledge discovery. There can be many tasks, depending on what patterns we are interested in. The two most common tasks are clustering and dimensionality reduction.

Let the objective function be \(J(\mathcal{D}; \boldsymbol{\theta})\). We can also add a regularization term, so that the optimization becomes

\[ \min_{\boldsymbol{\theta}} \left\{ J(\mathcal{D} ; \boldsymbol{\theta} ) + \lambda R(\boldsymbol{\theta} ) \right\} \]

Note

The function \(J\) may or may not decompose over the observations \(\boldsymbol{x}_i\). If it does, we can write \(J(\mathcal{D}; \boldsymbol{\theta}) = \sum _i J(\boldsymbol{x}_i ; \boldsymbol{\theta} )\).

What is \(J\)? It depends on the pattern we are interested in finding.
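
One concrete example: in clustering, a standard choice of \(J\) is the k-means distortion, where \(\boldsymbol{\theta}\) is the set of centroids and \(J\) does decompose over observations. The minimal sketch below (with \(\lambda = 0\), i.e., no regularizer) runs a few Lloyd's-algorithm updates on synthetic data.

```python
import numpy as np

def J(X, centroids):
    """K-means distortion: sum_i min_k ||x_i - mu_k||^2, a sum over observations."""
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),    # two synthetic blobs
               rng.normal(+2.0, 1.0, size=(50, 2))])

centroids = X[rng.choice(len(X), size=2, replace=False)]  # random initialization
for _ in range(10):
    # Assign each point to its nearest centroid, then recompute the centroids.
    labels = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    print(J(X, centroids))   # the objective is non-increasing across updates
```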

The four tasks mentioned above are summarized below.

| Task | Input | Objective | Application |
| --- | --- | --- | --- |
| Regression | \(\mathcal{D} = \left\{ \left(\boldsymbol{x} _i, y_i \right) \right\}_{i=1}^N\), where \(y_i\in \mathbb{R}\) | given a new \(\boldsymbol{x}_i\), predict \(\hat y_i\) | predict home price |
| Classification | \(\mathcal{D} = \left\{ \left(\boldsymbol{x} _i, y_i \right) \right\}_{i=1}^N\), where \(y_i\) is categorical | given a new \(\boldsymbol{x}_i\), predict \(\hat y_i\) | predict disease |
| Clustering | \(\mathcal{D} = \left\{ \boldsymbol{x} _i \right\}_{i=1}^N\), no labels | estimate the number of clusters \(C\) and determine which cluster each observation \(i\) belongs to | cluster users into groups |
| Dimensionality reduction (representation learning) | \(\mathcal{D} = \left\{ \boldsymbol{x} _i \right\}_{i=1}^N\), no labels | reduce the dimensionality by projecting the data to a lower-dimensional subspace which captures the essence of the data | compression, visualization, downstream learning tasks |

There are other unsupervised learning tasks, such as

  • discovering graph structure

  • matrix completion

  • generation / density estimation

Semi-supervised Learning

Besides these, there are also semi-supervised learning tasks, where there is a large amount of unlabeled data \(\boldsymbol{X} _U\) and a small amount of labeled data \(\boldsymbol{X}_L\). The objective is to learn a representation from the unlabeled data and use it to improve the performance of a supervised learner on the labeled data.
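
A toy version of that recipe, with PCA standing in for the representation learner and plain least squares as the supervised learner; all data here is synthetic and the choice of 5 components is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

X_U = rng.normal(size=(1000, 20))   # lots of unlabeled data
X_L = rng.normal(size=(15, 20))     # a little labeled data
y_L = rng.normal(size=15)

# Step 1: learn a representation from the unlabeled data (PCA via SVD).
mu = X_U.mean(axis=0)
_, _, Vt = np.linalg.svd(X_U - mu, full_matrices=False)
project = lambda X: (X - mu) @ Vt[:5].T       # keep the top 5 components

# Step 2: fit a supervised learner on the labeled data in that representation.
Z_L = project(X_L)
w, *_ = np.linalg.lstsq(Z_L, y_L, rcond=None)

y_hat = project(X_L) @ w   # predictions in the learned representation
```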

Note

  • Supervised learning usually doesn’t optimize the loss we really care about.

  • The boundary between supervised and unsupervised learning is fuzzy. They are somehow “differently supervised”.

Parametric vs Non-parametric

Whether the model has a fixed number of parameters is another criterion for creating a taxonomy.

Parametric

Parametric models have a fixed number of parameters.

  • pros: faster to use

  • cons: stronger assumptions

Non-parametric

In non-parametric models, the number of parameters grows with the amount of training data; the sketch after this list contrasts the two.

  • pros: flexible, no or few assumptions

  • cons: computationally intractable for large datasets
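
The contrast in one sketch: the linear model below stores \(d + 1\) parameters no matter how large \(N\) gets, while \(k\)-nearest-neighbors must keep all \(N\) training points around, so its effective "parameter count" grows with the data. The data and \(k\) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 2
X = rng.normal(size=(N, d))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # synthetic binary labels

# Parametric: a linear fit stores exactly d + 1 numbers, however large N is.
# (Plain least squares on 0/1 labels, used here only for brevity.)
w, *_ = np.linalg.lstsq(np.hstack([X, np.ones((N, 1))]), y, rcond=None)

# Non-parametric: k-NN's "model" is the training set itself; memory and
# prediction cost both grow with N.
def knn_predict(x, k=5):
    nearest = np.argsort(((X - x) ** 2).sum(axis=1))[:k]
    return y[nearest].mean().round()

x_new = np.array([0.5, 0.5])
print((np.append(x_new, 1.0) @ w).round())   # parametric prediction
print(knn_predict(x_new))                    # non-parametric prediction
```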

Table 8.1 in the book Machine Learning: A Probabilistic Perspective summarizes the models as below.

| Model | Classif/Regr | Discr/Gen | Param/Non |
| --- | --- | --- | --- |
| Discriminant analysis | Classif | Gen | Param |
| Naive Bayes classifier | Classif | Gen | Param |
| Tree-augmented Naive Bayes classifier | Classif | Gen | Param |
| Linear regression | Regr | Discrim | Param |
| Logistic regression | Classif | Discrim | Param |
| Sparse linear/logistic regression | Both | Discrim | Param |
| Multilayer perceptron (MLP) / Neural network | Both | Discrim | Param |
| Conditional random field (CRF) | Classif | Discrim | Param |
| \(K\) nearest neighbor classifier | Classif | Gen | Non |
| (Infinite) Mixture discriminant analysis | Classif | Gen | Non |
| Classification and regression trees (CART) | Both | Discrim | Non |
| Boosted model | Both | Discrim | Non |
| Sparse kernelized lin/logreg (SKLR) | Both | Discrim | Non |
| Support vector machine (SVM) | Both | Discrim | Non |
| Gaussian processes (GP) | Both | Discrim | Non |
| Smoothing splines | Both | Discrim | Non |

Others

  • Reinforcement Learning

  • Multi-agent Learning