K-nearest neighbors

\(K\)-nearest neighbors is a non-parametric method. It is simply the application of nearest-neighbor smoothing: a prediction at a point is formed from the nearest observed points.

It can be used for both classification and regression. Nearest-neighbor methods can also be applied to a graph with vertex attributes.

Classification from Data Matrix

If the input is a data matrix, the \(K\)-nearest neighbors method looks at the \(K\) points in the training set \(\mathcal{D}\) that are nearest to the test input \(\boldsymbol{x}_i\), denoted \(N_K \left( \boldsymbol{x}_i, \mathcal{D} \right)\). To define ‘nearest’, we need a distance measure \(d(\boldsymbol{x}_i, \boldsymbol{x}_j)\), such as Euclidean distance.

To predict the class of observation \(\boldsymbol{x}\), use

\[\mathbb{P} (y=c\vert \boldsymbol{x}, \mathcal{D}, K) = \frac{1}{K} \sum_{j \in N_K (\boldsymbol{x}, \mathcal{D})} \mathbb{I}\left( y_j =c \right)\]

The predicted label is the class with the largest estimated probability, i.e. the majority class among the \(K\) neighbors. Validation data is used to choose the hyper-parameter \(K\).
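
As an illustration, here is a minimal sketch of this estimator in Python, assuming Euclidean distance and NumPy arrays; the function names are hypothetical and chosen only for this example.

```python
import numpy as np

def knn_predict_proba(x, X_train, y_train, K, classes):
    """Estimate P(y=c | x, D, K) as the fraction of the K nearest
    training points (Euclidean distance) that belong to class c."""
    dists = np.linalg.norm(X_train - x, axis=1)      # d(x, x_j) for every training point
    nearest = np.argsort(dists)[:K]                  # indices of N_K(x, D)
    return {c: np.mean(y_train[nearest] == c) for c in classes}

def knn_predict(x, X_train, y_train, K, classes):
    """Predict the majority class among the K nearest neighbors."""
    proba = knn_predict_proba(x, X_train, y_train, K, classes)
    return max(proba, key=proba.get)

# toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.95, 1.05]), X_train, y_train, K=3, classes=[0, 1]))  # -> 1
```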

Regression on Graph

If the input is a graph \(G=(V, E)\) with vertex attributes \(\boldsymbol{x} _v\), then for a new vertex \(i\) with unknown attribute but known edges, we can predict its attribute value by the average of those of its neighbors \(\mathscr{N}_i\),

\[ \hat{\boldsymbol{x} }_i = \frac{\sum_{j \in \mathscr{N}_{i}} \boldsymbol{x} _{j}}{\left|\mathscr{N} _{i}\right|} \]

Note that there is no hyper-parameter \(K\): the average is taken over all neighbors of \(i\).
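
A minimal sketch of this neighbor-averaging rule, using NetworkX for the graph and a plain dictionary for the observed vertex attributes (the function name and the toy graph are only illustrative):

```python
import numpy as np
import networkx as nx

def neighbor_average(G, attrs, i):
    """Predict the attribute of vertex i as the mean attribute of its neighbors."""
    nbrs = list(G.neighbors(i))
    return np.mean([attrs[j] for j in nbrs], axis=0)

# toy usage: a path graph 0-1-2, where vertex 1 has an unknown attribute
G = nx.Graph([(0, 1), (1, 2)])
attrs = {0: 2.0, 2: 4.0}
print(neighbor_average(G, attrs, 1))   # -> 3.0
```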

Tuning

To choose an optimal \(K\), we can use a train-test split.

Fig. 80 Random train-test split for different ratios [Wang 2021]

As \(K\) increases from 1, the test error typically first decreases and then increases. The error bars (variability across random splits) also widen.
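
For example, a small sketch of this tuning loop using scikit-learn's `KNeighborsClassifier` on a random train-test split; the Iris data and the grid of \(K\) values are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# fit one K-NN classifier per candidate K and report its test error
for K in [1, 3, 5, 10, 20, 50]:
    clf = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    print(K, 1 - clf.score(X_test, y_test))
```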

Pros

  • Can express complex, non-linear, non-parametric boundaries

  • Very fast training

  • Simple, yet with good performance in practice

  • Reasonably good interpretability

Cons

  • It is an example of memory-based (instance-based) learning: it stores all training points, so memory and prediction cost grow with the size of the training set.

  • Not among the best classifiers in terms of accuracy

  • Standardization is often necessary, since the distance measure is sensitive to feature scales

  • Poor performance in high-dimensional settings (curse of dimensionality)

  • In the graph case, it may not be clear how best to evaluate the average when \(\boldsymbol{x} _j\) is not observed for one or more neighbors \(j \in \mathscr{N}_i\) (a sketch of the two workarounds follows this list).

    • One solution to this dilemma is to redefine the average in terms of only those neighbors \(j \in \mathscr{N}_i\) for which \(\boldsymbol{x} _j\) is observed.

    • Alternatively, we might impute values for the unobserved \(\boldsymbol{x} _j\), substituting, for instance, the overall average of the observed attributes, \(\frac{1}{\left\vert V_{obs} \right\vert}\sum _{k \in V_{obs}} \boldsymbol{x} _k\), where \(V_{obs}\) denotes the set of vertices with observed attributes.

    • Although neither of these approaches is entirely satisfactory, imputation is likely to work better in practice, particularly when the relative number of missing observations is not small.
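
A minimal sketch of the two workarounds, restricting the average to observed neighbors versus imputing the overall observed mean; the function name, flag, and toy graph are hypothetical.

```python
import numpy as np
import networkx as nx

def neighbor_average_missing(G, attrs, i, impute=True):
    """Neighbor-average prediction for vertex i when some neighbor
    attributes are unobserved (i.e. missing from `attrs`)."""
    nbrs = list(G.neighbors(i))
    if impute:
        # substitute the overall mean of observed attributes for unobserved neighbors
        global_mean = np.mean(list(attrs.values()))
        vals = [attrs.get(j, global_mean) for j in nbrs]
    else:
        # restrict the average to neighbors whose attribute is observed
        vals = [attrs[j] for j in nbrs if j in attrs]
    return np.mean(vals) if vals else None

# toy usage: vertex 1 has neighbors 0 (observed) and 2 (unobserved)
G = nx.Graph([(0, 1), (1, 2), (2, 3)])
attrs = {0: 2.0, 3: 6.0}
print(neighbor_average_missing(G, attrs, 1, impute=False))  # -> 2.0
print(neighbor_average_missing(G, attrs, 1, impute=True))   # -> (2.0 + 4.0) / 2 = 3.0
```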