| Introduction
Naive Bayes requires the conditional independence assumption. Can we do classification without making it?
One such classifier is Logistic Regression!
| Optimal Classification and Bayes Risk
The solid line (non-linear) is better than the dashed line (linear)!
| Logistic function
Sigmoid function
1. Bounded
2. Differentiable
3. Real function
4. Defined for all real inputs
5. With positive derivative (monotonically increasing)
- Examples: tanh, logistic, ...
Logistic function
- A sigmoid function
- Its derivative is easy to compute
Logistic regression fitting
$X\theta = log(\frac{P(Y \mid X)}{1-P(Y \mid X)})$ → $P(Y \mid X) = \frac{e^{X\theta}}{1+e^{X\theta}}$
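Below is a minimal numeric sketch (NumPy, with made-up values) checking that the logistic function inverts this log-odds relation: pushing $X\theta$ through the sigmoid and then taking $log(\frac{p}{1-p})$ recovers the original values.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])       # stand-in values for X @ theta
p = sigmoid(z)                        # P(Y | X) under the logistic model
log_odds = np.log(p / (1.0 - p))      # recovers z up to floating-point error
print(p)
print(log_odds)
```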
| Logistic Regression
Logistic Regression is a probabilistic classifier that predicts a binomial or multinomial outcome
Given the Bernoulli experiment
Previous Example
$$ P(D \mid \theta) = \theta ^{\ a_H} \cdot (1-\theta)^{\ a_T} $$
$P(H)=\theta$
$P(T)=1-\theta$
Logistic regression given X and Y
$$ P(y \mid x) = \mu(x)^y \cdot (1-\mu(x))^{1-y} $$
$$ \mu(x) = \frac{1}{1+e^{-\theta^{T} x}} $$
$y=1$ : $P(y=1\mid x) = \mu(x)$
$y=0$ : $P(y=0\mid x) = 1-\mu(x)$
$\mu(x)$ is modeled with the logistic function
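As a quick illustration, here is a small sketch (NumPy; `theta` and `x` are made-up values) of the Bernoulli model $P(y \mid x) = \mu(x)^y \cdot (1-\mu(x))^{1-y}$ with $\mu(x)$ given by the logistic function.

```python
import numpy as np

def mu(x, theta):
    """Logistic model for P(y = 1 | x)."""
    return 1.0 / (1.0 + np.exp(-x @ theta))

def p_y_given_x(y, x, theta):
    """Bernoulli likelihood: mu(x)^y * (1 - mu(x))^(1 - y)."""
    m = mu(x, theta)
    return m ** y * (1.0 - m) ** (1 - y)

theta = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])              # first entry acts as a bias feature
print(p_y_given_x(1, x, theta), p_y_given_x(0, x, theta))  # the two sum to 1
```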
Now, let's estimate $\theta$ with MLE!
Maximum Conditional Likelihood Estimation (MCLE)
$P(D \mid \theta)$ : previously, D was simply the observed HEAD/TAIL outcomes
$P(Y_i \mid X_i \ ; \theta)$ : now a condition $X$ is added (given some condition X, the outcome is either HEAD or TAIL)
( $\theta$ is the parameter we need to infer, and N is the number of instances )
$P(Y_i \mid X_i \ ; \theta) = \mu(X_i)^{Y_i} \cdot (1-\mu(X_i))^{1-Y_i} $
(Model this with probability $\mu(X_i)$ when $Y_i = 1$ and $1-\mu(X_i)$ when $Y_i = 0$, and work it out.)
$$
\begin{align*}
\hat{\theta}=argmax_{\theta}P(D\mid \theta)&=argmax_{\theta}\prod_{1\leqslant i \leqslant N}P(Y_i \mid X_i\ ; \ \theta) \\
&=argmax_{\theta}\ log(\prod_{1\leqslant i \leqslant N} P(Y_i \mid X_i\ ; \ \theta)) \\
&=argmax_{\theta} \sum_{1\leqslant i \leqslant N} log(P(Y_i \mid X_i\ ; \ \theta))
\end{align*}
$$
$$
\begin{align*}
log\left \{ P(Y_i \mid X_i\ ; \ \theta) \right \} &= log\left \{ \mu(X_i)^{Y_i} \cdot (1-\mu(X_i))^{1-Y_i}\ \right \} \\
&= Y_i \cdot log\left \{ \mu(X_i) \right \} + (1-Y_i) \cdot log\left \{ 1-\mu(X_i) \right \} \\
&= Y_i \cdot log\left \{ \frac{\mu(X_i)}{1-\mu(X_i)} \right \}+log\left \{ 1-\mu(X_i) \right \} \\
&= Y_i X_i \theta + log\left \{ 1-\mu(X_i) \right \} \\
&= Y_iX_i\theta - log(1+e^{X_i\theta}) \\
\end{align*}
$$
Putting it together ...
$$ \hat{\theta}=argmax_{\theta} \sum_{i=1}^{N}\left \{ Y_i X_i \theta - log(1+e^{X_i\theta}) \right \} $$
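A short sketch of this simplified conditional log-likelihood (NumPy; the toy `X` and `Y` are my own). At $\theta = 0$ every instance contributes $log(0.5)$, which makes an easy sanity check.

```python
import numpy as np

def log_likelihood(theta, X, Y):
    """sum_i { Y_i * (X_i theta) - log(1 + exp(X_i theta)) }"""
    z = X @ theta
    return np.sum(Y * z - np.log1p(np.exp(z)))

X = np.array([[1.0, 0.5],
              [1.0, -1.2],
              [1.0, 2.0]])            # first column is the bias term
Y = np.array([1.0, 0.0, 1.0])
print(log_likelihood(np.zeros(2), X, Y))   # 3 * log(0.5) ≈ -2.079
```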
To find the maximum, let's differentiate.
First, take the partial derivative with respect to a single component $\theta_{j}$.
→ All components other than $\theta_{j}$ ($\theta_{1},\ \theta_{2}, \ ...$) vanish, so only the $X_{i,j}$ term remains
$$
\begin{align*}
\frac{\partial }{\partial \theta_{j}}\ \left \{ \sum_{i=1}^{N}\left \{ Y_i X_i \theta - log(1+e^{X_i\theta}) \right \} \right \} &= \left \{ \sum_{i=1}^{N} Y_i X_{i,j} \right \} + \left \{ \sum_{i=1}^{N} -\frac{1}{1+e^{X_i\theta}} \cdot e^{X_i\theta} \cdot X_{i,j} \right \} \\
&= \sum_{i=1}^{N} X_{i,j}\cdot\left ( Y_i-\frac{e^{X_i\theta}}{1+e^{X_i\theta}} \right ) \\
&= \sum_{i=1}^{N} X_{i,j} \cdot \left ( \ Y_i-P(Y_i=1 \mid X_i \ ; \theta) \ \right )
\end{align*}
$$
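The derivation can be checked numerically: the sketch below (toy data of my own) compares the analytic partial derivative $\sum_{i} X_{i,j}\,( Y_i-P(Y_i=1 \mid X_i ; \theta) )$ against a finite-difference estimate of the log-likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, Y):
    z = X @ theta
    return np.sum(Y * z - np.log1p(np.exp(z)))

def gradient(theta, X, Y):
    """Component j is sum_i X_{i,j} * (Y_i - mu(X_i))."""
    return X.T @ (Y - sigmoid(X @ theta))

X = np.array([[1.0, 0.3], [1.0, -0.7], [1.0, 1.5]])
Y = np.array([1.0, 0.0, 1.0])
theta, eps, j = np.array([0.2, -0.4]), 1e-6, 1
e_j = np.eye(2)[j]                    # unit vector along theta_j
numeric = (log_likelihood(theta + eps * e_j, X, Y)
           - log_likelihood(theta - eps * e_j, X, Y)) / (2 * eps)
print(gradient(theta, X, Y)[j], numeric)   # the two should agree closely
```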
Setting the derivative to 0 ...
$$ \sum_{i=1}^{N} X_{i,j} \cdot \left ( \ Y_i-P(Y_i=1 \mid X_i \ ; \theta) \ \right ) = 0 $$
This does not reduce to a clean closed-form solution ("not closed form") → an approximation is needed
One such approximation method is the "Gradient Descent" method!
| Gradient Descent / Ascent
From Taylor Expansion ...
A way to find how to iteratively move (direction and step size) on an infinitely differentiable function
$$ f(x)=f(a)+\frac{f'(a)}{1!}(x-a)+O(\left \| x-a \right \|^2) $$
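A one-dimensional sketch of the same idea (the target function and step size `h` are illustrative choices): repeatedly moving a small step in the direction of the derivative makes the first-order Taylor term $f'(a)(x-a)$ positive, so $f$ increases.

```python
def f(x):
    return -(x - 3.0) ** 2            # toy concave function, maximum at x = 3

def f_prime(x):
    return -2.0 * (x - 3.0)

x, h = 0.0, 0.1                       # starting point and step size
for _ in range(100):
    x = x + h * f_prime(x)            # gradient ascent: move along the derivative
print(x)                              # converges toward 3.0
```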
| Finding $\theta$ with Gradient Ascent (Logistic Regression)
$$ \hat{\theta}=argmax_{\theta} \sum_{i=1}^{N} log(P(Y_i \mid X_i\ ; \ \theta)) $$
Previously derived equations
Partial derivative with respect to each individual $\theta_j$
$$ f(\theta) = \sum_{i=1}^{N} log\left \{ P(Y_i \mid X_i\ ; \ \theta) \right \} $$
$$ \frac{\partial f(\theta)}{\partial \theta_j}=\sum_{i=1}^{N} X_{i,j} \cdot \left ( \ Y_i-P(Y_i=1 \mid X_i \ ; \theta) \ \right ) $$
Gradient Ascent
$$
\begin{align*}
\theta_{j}^{\ t+1} \leftarrow \theta_{j}^{\ t}+h \frac{\partial f(\theta^{t})}{\partial \theta_j} &= \theta_{j}^{\ t}+ h \cdot \left \{ \sum_{i=1}^{N} X_{i,j} \cdot \left ( \ Y_i-P(Y_i=1 \mid X_i \ ; \theta) \ \right ) \right \} \\
&= \theta_{j}^{\ t}+\frac{h}{C}\left \{ \sum_{i=1}^{N}X_{i,j}\left ( Y_i-\frac{e^{X_i\theta^{t}}}{1+e^{X_i\theta^t}} \right ) \right \}
\end{align*}
$$
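Putting the pieces together, here is a sketch of the gradient ascent loop for logistic regression (NumPy; the synthetic data, learning rate `h`, and iteration count are illustrative, and the update is written in vector form so all $\theta_j$ are updated at once).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # bias + one feature
true_theta = np.array([-1.0, 2.0])
Y = (rng.uniform(size=N) < sigmoid(X @ true_theta)).astype(float)

theta, h = np.zeros(2), 0.01
for _ in range(2000):
    grad = X.T @ (Y - sigmoid(X @ theta))   # sum_i X_{i,j} (Y_i - mu(X_i)) for every j
    theta = theta + h * grad                # gradient ascent step
print(theta)                                # roughly recovers true_theta
```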
| Linear Regression with Gradient Descent
When the dataset is very large, computing the matrix inverse is difficult
In that case, gradient descent can usually be used as an approximation
$$
\begin{align*}
\hat{\theta} &=argmin_{\theta}(f-\hat{f})^2 = argmin_{\theta}(Y-X\theta)^2 \\
&=argmin_{\theta} \sum_{i=1}^{N}(Y^{i}-\sum_{j=1}^{d}X_{j}^{\ i}\cdot \theta_{j})^2
\end{align*}
$$
$$ f(\theta) = \sum_{i=1}^{N}(Y^{i}-\sum_{j=1}^{d}X_{j}^{\ i}\cdot \theta_{j})^2 $$
$$ \frac{\partial }{\partial \theta_k}\left \{ \sum_{i=1}^{N}(Y^{i}-\sum_{j=1}^{d}X_{j}^{\ i}\cdot \theta_{j})^2 \right \} = -\sum_{i=1}^{N}2\cdot(Y^i-\sum_{j=1}^{d}X_{j}^{\ i}\theta_{j})X_{k}^{\ i} $$
$$ \theta_{k}^{\ t+1} \leftarrow \theta_{k}^{\ t}-h\cdot \frac{\partial f(\theta^{t})}{\partial \theta_k^{\ t}} = \theta_{k}^{\ t}+ h \cdot \sum_{i=1}^{N}2\cdot(Y^i-\sum_{j=1}^{d}X_{j}^{\ i}\theta_{j})X_{k}^{\ i} $$
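The same pattern for linear regression, as a sketch (the synthetic data, learning rate `h`, and iteration count are illustrative): the update avoids computing $(X^TX)^{-1}$ entirely.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])    # bias + one feature
true_theta = np.array([0.5, -1.5])
Y = X @ true_theta + 0.1 * rng.normal(size=N)            # noisy linear data

theta, h = np.zeros(2), 0.001
for _ in range(1000):
    grad = -2.0 * X.T @ (Y - X @ theta)     # derivative of the squared error
    theta = theta - h * grad                # gradient descent step
print(theta)                                # close to the least-squares solution
```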
Reference
Lectures by Professor 문일철
https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=15
https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=16
https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=17
https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=18
https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=19
https://www.youtube.com/watch?v=oNTXMgqCv6E&list=PLbhbGI_ppZISMV4tAWHlytBqNq1-lb8bz&index=20
'Study > Lecture - Basic' 카테고리의 다른 글
W4.StatQuest. MLE, Gaussian Naive Bayes (0) | 2023.05.06 |
---|---|
W4.L7-8. Logistic Regression - Naive Bayes to Logistic Regression (0) | 2023.05.06 |
W3.L3-4. Naive Bayes Classifier - Naive Bayes Classifier (0) | 2023.04.30 |
W3.L2. Naive Bayes Classifier - Conditional Independence (0) | 2023.04.30 |
W3.L1. Naive Bayes Classifier - Optimal Classification (0) | 2023.04.28 |