본문 바로가기

Study/Lecture - Basic

W8.L1-2. K-Means Clustering and Gaussian Mixture Model - K-Means

| Unsupervised Learning

Discovering clusters

Discovering latent factors

Discovering graph structures ...

 

Finding the representative topic words from text data

Finding the latent image from facial data ...

 

 

Clustering Problems:

Latent variable of classes

Optimal assignment to the latent classes

뭔가 latent 한 vector 를 공유하고 있는 것들의 묶음이구나 ... 

 

 

| K-Means Algorithm

https://upload.wikimedia.org/wikipedia/commons/e/e5/KMeans-Gaussian-data.svg

 

잠재적인 내부 동력이 K 개 정도 있겠구나!

- Setup K number of centroids (or prototypes)

- and cluster data points by the distance from the points to the nearest centroid

 

 

Formally,

Minimize $J$ by optimizing

 

$$ J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \left\| x_n - \mu_{k} \right\|^2 $$

 

$ r_{nk} $ : the assignment of data points to clusters (클러스터 j 에 속하면 1 아니면 0)

$ \mu_{k} $ : the location of centroids 

 

위 둘을 변형시켜가며 $ J $ 를 최소화하는 과정

두 파라미터의 최적값을 찾기 위해서는 ... Iterative optimization ! 

 

 

| Expectation and Maximization

Expectation 과 Maximization 을 번갈아가며 수행 !

 

Expectation

- Expectation of the log-likelihood given the parameters

- Assign the data points to the nearest centroid ($ r_{nk} $)

 

Maximization

- Maximization of the parameters with respect to the log-likelihood

- Update the centroid positions given the assignments ($\mu_{k}$)

 

 

$ r_{nk} $

- $ r_{nk}=\left\{ 0, 1\right\} $

- Discrete variable

- Logical choice: the nearest centroid $ \mu_{k} $ for a data point of $ x_{n} $

 

 

$ \mu_{k} $

$$
\begin{align*}
\frac{dJ}{d\mu_{k}}&=\frac{d}{d\mu_{k}}\sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \left\| x_n - \mu_{k} \right\|^2 \\
&= \frac{d}{d\mu_{k}} \sum_{n=1}^{N}r_{nk}\left\| x_n - \mu_{k} \right\| ^2 \\
&= \sum_{n=1}^{N} \left\{ -2r_{nk}(x_n - \mu_{k}) \right\} \\
&= -2(-\sum_{n=1}^{N} r_{nk}\mu_{k} + \sum_{n=1}^{N} r_{nk}x_{n}) \\
&= 0 \\
\end{align*}
$$

 

$ \therefore \mu_{k} = \frac{\sum_{n=1}^{N} r_{nk}x_n }{\sum_{n=1}^{N} r_{nk}}$ ( Mean value of the assigned data )

 

 

| Properties of K-Means Algorithm

# of clusters in uncertain

Initial location of centroids

- Some initial locations might not result in the reasonable results

Limitation of distance metrics

- Euclidean distance is very limited knowledge of information

Hard clustering

- Hard assignment of data points to clusters (0 or 1)

 

 

 

 

Reference
문일철 교수님 강의 

https://www.youtube.com/watch?v=mnUcZbT5E28&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=48
https://www.youtube.com/watch?v=mnUcZbT5E28&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=49