

W4.L1-6. Logistic Regression - Decision Boundary, Gradient Method

 

| Introduction

Classification can also be done without the Conditional Independence assumption of Naive Bayes.

One such method is Logistic Regression!

 

 

| Optimal Classification and Bayes Risk

The solid line (non-linear decision boundary) is better than the dashed line (linear)!

 

 

| Logistic function

 

Sigmoid function

1. Bounded

2. Differentiable

3. Real function

4. Defined for all real inputs

5. With positive derivative (increasing function)

- Examples: tanh, logistic, ...

 

Logistic function

- Sigmoid function

- Easy to calculate derivative

Logistic regression fitting

 

$X\theta = log(\frac{P(Y \mid X)}{1-P(Y \mid X)})$  →  $P(Y \mid X) = \frac{e^{X\theta}}{1+e^{X\theta}}$
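To make the relation concrete, here is a minimal NumPy sketch (function names are my own, not from the lecture) checking that the logistic function is exactly the inverse of the log-odds:

```python
import numpy as np

def logistic(z):
    """P(Y | X) = e^z / (1 + e^z) = 1 / (1 + e^{-z}), where z = X @ theta."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """The log-odds: log(p / (1 - p)), i.e. the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

z = np.linspace(-3.0, 3.0, 7)
print(np.allclose(logit(logistic(z)), z))   # True: the two functions invert each other
```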

 

| Logistic Regression

Logistic Regression is a probabilistic classifier that predicts a binomial or multinomial outcome

Given the Bernoulli experiment

 

 

Previous Example

 

$$ P(D \mid \theta) = \theta ^{\ a_H} \cdot (1-\theta)^{\ a_T} $$

 

​$P(H)=\theta$

 

$P(T)=1-\theta$
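For reference, a tiny sketch of this coin-flip likelihood; the counts $a_H$, $a_T$ and the value of $\theta$ below are illustrative assumptions, not from the lecture:

```python
# Coin-flip likelihood P(D | theta) = theta^{a_H} * (1 - theta)^{a_T}
a_H, a_T = 7, 3          # assumed counts of heads and tails
theta = 0.7
print(theta ** a_H * (1 - theta) ** a_T)   # likelihood of the data at this theta
```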

 

 

 

Logistic regression given X and Y

 

$$ P(y \mid x) = \mu(x)^y \cdot (1-\mu(x))^{1-y} $$
$$ \mu(x) = \frac{1}{1+e^{-\theta^{T} x}} $$

 

 $y=1$ :   $P(y=1\mid x) = \mu(x)$  

 

$y=0$ :   $P(y=0\mid x) = 1-\mu(x)$ 

 

$\mu(x)$ → modeled with the logistic function
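A minimal sketch of this Bernoulli model, assuming a parameter vector `theta` and a feature vector `x` (the names and toy values are hypothetical):

```python
import numpy as np

def mu(x, theta):
    """mu(x) = 1 / (1 + exp(-theta^T x)), the modeled P(y = 1 | x)."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

def p_y_given_x(y, x, theta):
    """P(y | x) = mu(x)^y * (1 - mu(x))^(1 - y) for y in {0, 1}."""
    m = mu(x, theta)
    return m ** y * (1.0 - m) ** (1 - y)

theta = np.array([0.5, -1.0])        # hypothetical parameters
x = np.array([1.0, 2.0])             # hypothetical features (first entry can act as a bias)
print(p_y_given_x(1, x, theta))      # equals mu(x)
print(p_y_given_x(0, x, theta))      # equals 1 - mu(x)
```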

 

 

Now, let's estimate $\theta$ with MLE!

 

Maximum Conditional Likelihood Estimation (MCLE)

 

$P(D \mid \theta)$ : previously, the data D was simply thought of as the HEAD/TAIL outcomes

 

$P(Y_i \mid X_i \ ; \theta)$ : now a condition $X$ is added (HEAD or TAIL comes up given some condition X)

( $\theta$ is the parameter we need to infer, and N is the number of instances )

 

$P(Y_i \mid X_i \ ; \theta) = \mu(X_i)^{Y_i} \cdot (1-\mu(X_i))^{1-Y_i} $

(Model the probability as $\mu(X_i)$ when $Y_i = 1$ and as $1-\mu(X_i)$ when $Y_i = 0$, and solve from there.)

 

 

$$
\begin{align*}
\hat{\theta}=argmax_{\theta}P(D\mid \theta)&=argmax_{\theta}\prod_{1\leqslant i \leqslant N}P(Y_i \mid X_i\ ; \ \theta) \\
&=argmax_{\theta}\ log(\prod_{1\leqslant i \leqslant N} P(Y_i \mid X_i\ ; \ \theta)) \\
&=argmax_{\theta} \sum_{1\leqslant i \leqslant N} log(P(Y_i \mid X_i\ ; \ \theta))
\end{align*}
$$

 

$$
\begin{align*}
log\left \{ P(Y_i \mid X_i\ ; \ \theta) \right \} &= log\left \{ \mu(X_i)^{Y_i} \cdot (1-\mu(X_i))^{1-Y_i}\ \right \} \\
&= Y_i \cdot log\left \{ \mu(X_i) \right \} + (1-Y_i) \cdot log\left \{ 1-\mu(X_i) \right \} \\
&= Y_i \cdot log\left \{ \frac{\mu(X_i)}{1-\mu(X_i)} \right \}+log\left \{ 1-\mu(X_i) \right \} \\
&= Y_i X_i \theta + log\left \{ 1-\mu(X_i) \right \} \\
&= Y_iX_i\theta - log(1+e^{X_i\theta}) \\
\end{align*}
$$

 

 

Putting it together ...

$$ \hat{\theta}=argmax_{\theta} \sum_{i=1}^{N}\left \{ Y_i X_i \theta - log(1+e^{X_i\theta}) \right \} $$
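As a sanity check, a small sketch (the toy data and function names are assumptions) that evaluates this simplified objective and verifies it equals the direct conditional log-likelihood $\sum_i log(P(Y_i \mid X_i\ ;\ \theta))$:

```python
import numpy as np

def objective(theta, X, Y):
    """sum_i { Y_i * X_i theta - log(1 + exp(X_i theta)) }."""
    z = X @ theta
    return np.sum(Y * z - np.log1p(np.exp(z)))

def log_likelihood(theta, X, Y):
    """Direct sum_i log P(Y_i | X_i; theta) with mu = 1 / (1 + exp(-X theta))."""
    m = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return np.sum(Y * np.log(m) + (1 - Y) * np.log(1 - m))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Y = rng.integers(0, 2, size=5)
theta = rng.normal(size=3)
print(np.allclose(objective(theta, X, Y), log_likelihood(theta, X, Y)))   # True
```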

 

To find the maximum, take the derivative.

First, take the partial derivative with respect to a single component $\theta_{j}$.

→ The other components $\theta_{1},\ \theta_{2}, \ ... $ (everything except $\theta_{j}$) drop out, leaving only the $X_{i, j}$ term

 

 

$$
\begin{align*}
\frac{\partial }{\partial \theta_{j}}\  \left \{ \sum_{i=1}^{N}\left \{ Y_i X_i \theta - log(1+e^{X_i\theta}) \right \} \right \} &= \left \{ \sum_{i=1}^{N} Y_i X_{i,j} \right \} + \left \{ \sum_{i=1}^{N} -\frac{1}{1+e^{X_i\theta}} \cdot e^{X_i\theta} \cdot X_{i,j} \right \} \\
&= \sum_{i=1}^{N} X_{i,j}\cdot\left ( Y_i-\frac{e^{X_i\theta}}{1+e^{X_i\theta}} \right ) \\
&= \sum_{i=1}^{N} X_{i,j} \cdot \left ( \  Y_i-P(Y_i=1 \mid X_i \ ; \theta) \ \right )
\end{align*}
$$
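The same gradient in vectorized form, with a finite-difference check; the toy data and the `eps` value are assumptions for illustration:

```python
import numpy as np

def gradient(theta, X, Y):
    """For each j: sum_i X_{i,j} * (Y_i - P(Y_i = 1 | X_i; theta))."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # P(Y_i = 1 | X_i; theta)
    return X.T @ (Y - p)                     # one entry per theta_j

# Finite-difference check of the first component
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3)); Y = rng.integers(0, 2, size=5); theta = rng.normal(size=3)
f = lambda t: np.sum(Y * (X @ t) - np.log1p(np.exp(X @ t)))
eps = 1e-6
e0 = np.eye(3)[0]
numeric = (f(theta + eps * e0) - f(theta - eps * e0)) / (2 * eps)
print(np.isclose(gradient(theta, X, Y)[0], numeric))   # True
```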

 

Setting the derivative to zero ...

 

$$ \sum_{i=1}^{N} X_{i,j} \cdot \left ( \  Y_i-P(Y_i=1 \mid X_i \ ; \theta) \ \right ) = 0 $$

 

 

This equation has no neat closed-form solution ("not closed form") → an approximation is needed

 

One such approximation method is the "Gradient Descent" method!

 

 

 

 

| Gradient Descent / Ascent

 

From the Taylor expansion ...

A method for iteratively moving toward an optimum of an 'infinitely differentiable function' by choosing a direction and a step size

 

$$ f(x)=f(a)+\frac{f'(a)}{1!}(x-a)+O(\left \| x-a \right \|^2) $$
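A minimal 1D sketch of the idea: repeatedly step along the direction suggested by the first-order term. The function $f(x) = (x-3)^2$, starting point, and step size below are illustrative assumptions, not from the lecture:

```python
# Gradient descent on f(x) = (x - 3)^2, whose derivative is f'(x) = 2 (x - 3).
x, h = 0.0, 0.1                # assumed starting point and step size (learning rate)
for _ in range(100):
    x = x - h * 2 * (x - 3)    # move against the gradient (descent); use + for ascent
print(round(x, 4))             # approaches the minimizer x = 3
```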

 

Example. Prof. Moon (http://seslab.kaist.ac.kr/xe2/page_GBex27)

 

 

 

| Finding $\theta$ with Gradient Ascent (Logistic Regression)

 

$$ \hat{\theta}=argmax_{\theta} \sum_{i=1}^{N} log(P(Y_i \mid X_i\ ; \ \theta)) $$

 

 

Previously derived equations

Partial derivative with respect to each individual $\theta_j$

 

$$ f(\theta) = \sum_{i=1}^{N} log\left \{ P(Y_i \mid X_i\ ; \ \theta) \right \} $$

 

$$ \frac{\partial f(\theta)}{\partial \theta_j}=\sum_{i=1}^{N} X_{i,j} \cdot \left ( \  Y_i-P(Y_i=1 \mid X_i \ ; \theta) \ \right ) $$

 

Gradient Ascent

 

$$
\begin{align*}
\theta_{j}^{\ t+1} \leftarrow \theta_{j}^{\ t}+h \frac{\partial f(\theta^{t})}{\partial \theta_j} &= \theta_{j}^{\ t}+ h \cdot \left \{ \sum_{i=1}^{N} X_{i,j} \cdot \left ( \  Y_i-P(Y_i=1 \mid X_i \ ; \theta) \ \right ) \right \} \\
&= \theta_{j}^{\ t}+\frac{h}{C}\left \{ \sum_{i=1}^{N}X_{i,j}\left ( Y_i-\frac{e^{X_i\theta^{t}}}{1+e^{X_i\theta^t}} \right ) \right \}
\end{align*}
$$
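Putting the update rule into a full-batch gradient-ascent loop; the synthetic data, step size `h`, and iteration count below are assumptions for illustration:

```python
import numpy as np

def fit_logistic_regression(X, Y, h=0.01, n_iters=2000):
    """Gradient ascent on the conditional log-likelihood:
    theta^{t+1} = theta^t + h * X^T (Y - P(Y = 1 | X; theta^t))."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # P(Y_i = 1 | X_i; theta^t)
        theta += h * (X.T @ (Y - p))             # full-batch ascent step
    return theta

# Toy data: the label depends on the second feature; the first column is a bias term.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
Y = (X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)
theta = fit_logistic_regression(X, Y)
pred = (1.0 / (1.0 + np.exp(-(X @ theta))) > 0.5).astype(float)
print("train accuracy:", (pred == Y).mean())
```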

 

 

| Linear Regression with Gradient Descent

 

When the data is very large, computing the matrix inverse is difficult.

In that case, we can usually approximate the solution with Gradient Descent.

 

$$
\begin{align*}
\hat{\theta} &=argmin_{\theta}(f-\hat{f})^2 = argmin_{\theta}(Y-X\theta)^2 \\
&=argmin_{\theta} \sum_{i=1}^{N}(Y^{i}-\sum_{j=1}^{d}X_{j}^{\ i}\cdot \theta_{j})^2
\end{align*}
$$

 

$$ f(\theta) = \sum_{i=1}^{N}(Y^{i}-\sum_{j=1}^{d}X_{j}^{\ i}\cdot \theta_{j})^2 $$

$$ \frac{\partial }{\partial \theta_k}\left \{ \sum_{i=1}^{N}(Y^{i}-\sum_{j=1}^{d}X_{j}^{\ i}\cdot \theta_{j})^2 \right \} = -\sum_{i=1}^{N}2\cdot(Y^i-\sum_{j=1}^{d}X_{j}^{\ i}\theta_{j})X_{k}^{\ i} $$

$$ \theta_{k}^{\ t+1} \leftarrow \theta_{k}^{\ t}-h\cdot \frac{\partial f(\theta^{t})}{\partial \theta_k^{\ t}} = \theta_{k}^{\ t}+ h \cdot \sum_{i=1}^{N}2\cdot(Y^i-\sum_{j=1}^{d}X_{j}^{\ i}\theta_{j})X_{k}^{\ i} $$
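An analogous gradient-descent sketch for linear regression; again, the synthetic data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

def fit_linear_regression_gd(X, Y, h=0.001, n_iters=5000):
    """Gradient descent on sum_i (Y^i - X^i theta)^2:
    theta^{t+1} = theta^t + h * 2 * X^T (Y - X theta^t)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = Y - X @ theta            # Y^i - sum_j X_j^i theta_j
        theta += h * 2 * (X.T @ residual)   # step against the squared-error gradient
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
true_theta = np.array([1.0, 2.0, -0.5])
Y = X @ true_theta + 0.1 * rng.normal(size=100)
print(fit_linear_regression_gd(X, Y))   # close to [1.0, 2.0, -0.5]
```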

 

 

 

 

 

 

 

Reference
Prof. Il-Chul Moon's lectures
https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=15

https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=16

https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=17

https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=18

https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=19

https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=20