Linear Regression (1) - OLS Estimator | Python for Process Innovation

Linear Regression (1) - OLS Estimator

Tags: #Statistics #Python #Statsmodels #scikit-learn

Update: 2019-01-29

Frequentist Linear Regression

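To organize my notes on Bayesian linear regression effectively, I first want to briefly revisit frequentist regression analysis. At my level, the statlect material was relatively approachable and also walks through the derivations in detail, which I appreciated. Using it as a reference, I will summarize only the parts I need.

A linear regression model is a kind of conditional model: it models the conditional probability distribution of the output given the input. This contrasts with a generative model, which models the joint distribution of input and output.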

$P(y|X;\beta)$

From the frequentist point of view, the main goal is to estimate the model parameters (here, $\beta$). In other words, this approach presupposes that some fixed, optimal parameter value exists.

OLS estimator

Rewriting the conditional model above as the familiar regression equation gives

$y_i=x_i\beta+\epsilon_i$

In matrix form, this becomes

$y=X\beta+\epsilon$

Here $X$ is called the design matrix.
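As a small illustration of how the scalar and matrix forms line up, here is a minimal sketch on simulated data. The sample size, coefficient values, and noise scale are made-up assumptions for the example, not values from this post.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed (arbitrary choice)

N, K = 100, 2                         # hypothetical sample size and number of regressors
X = rng.normal(size=(N, K))           # design matrix: row i is x_i
beta = np.array([0.5, -1.0])          # made-up "true" coefficients
eps = rng.normal(scale=0.1, size=N)   # error term

y = X @ beta + eps                    # matrix form: y = X beta + eps

# Row i of the matrix form is the scalar equation y_i = x_i beta + eps_i
i = 3
print(np.isclose(y[i], X[i] @ beta + eps[i]))
print(X.shape, y.shape)
```

Each row of `X` holds the inputs for one observation, so stacking the $N$ scalar equations gives the single matrix equation.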

The most widely known way to estimate $\beta$ is the OLS (Ordinary Least Squares) estimator. The OLS estimator finds the $\beta$ that minimizes the sum of squared residuals:

$\hat{\beta}=\arg\min_\beta\sum_{i=1}^N\epsilon_i^2$

This requires the following assumptions to hold:

Linear relationship between input and output
Full rank design matrix
...
...
...
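The full-rank assumption can be checked numerically. The sketch below uses two small hand-made matrices (assumptions for illustration only) and a hypothetical helper `has_full_column_rank`; it relies on `np.linalg.matrix_rank`.

```python
import numpy as np

# A design matrix whose two columns are linearly independent
X_ok = np.array([[1.0, 2.0],
                 [1.0, 3.0],
                 [1.0, 5.0]])

# Append a third column equal to the sum of the first two,
# so the columns become linearly dependent
X_bad = np.column_stack([X_ok, X_ok[:, 0] + X_ok[:, 1]])


def has_full_column_rank(X):
    """Return True if the columns of X are linearly independent."""
    return np.linalg.matrix_rank(X) == X.shape[1]


print(has_full_column_rank(X_ok))   # True
print(has_full_column_rank(X_bad))  # False
```

When the check fails, $X^{\intercal}X$ is singular and the least-squares solution is not unique.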

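To minimize the SSE (sum of squared errors, i.e. the residual sum of squares), we look for the $\hat{\beta}$ at which it attains its minimum:

$\begin{aligned} SSE&=\sum_{i=1}^N(y_i-x_i\beta)^2 \\&=(y-X\beta)^{\intercal}(y-X\beta) \\&=y^{\intercal}y -y^{\intercal}X\beta -\beta^{\intercal}X^{\intercal}y +\beta^{\intercal}X^{\intercal}X\beta \end{aligned}$

Since $y^{\intercal}X\beta$ is a scalar, it equals its transpose $\beta^{\intercal}X^{\intercal}y$, so the two middle terms combine into $-2\beta^{\intercal}X^{\intercal}y$. Setting the gradient with respect to $\beta$ to zero gives

$\nabla_\beta{SSE}=-2X^{\intercal}y+2X^{\intercal}X\beta=0$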

$(X^{\intercal}X)\beta=X^{\intercal}y$

If $X$ has full (column) rank, the inverse of $(X^{\intercal}X)$ exists, so

$\hat{\beta}=(X^{\intercal}X)^{-1}X^{\intercal}y$
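The closed-form solution can be sketched directly in NumPy. The data below is simulated with made-up coefficients (an assumption for illustration), and the result is compared against `np.linalg.lstsq` as an independent check. Solving the normal equations with `np.linalg.solve` is used here instead of forming the inverse explicitly, which is generally the more stable choice.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                  # simulated design matrix
beta_true = np.array([1.0, -2.0, 0.5])         # made-up coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=200)

# Normal equations: (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Independent check with NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)


def sse(b):
    """Sum of squared residuals for a candidate coefficient vector b."""
    r = y - X @ b
    return r @ r


# Gradient of the SSE evaluated at beta_hat; it should vanish at the minimum
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

print(np.allclose(beta_hat, beta_lstsq))   # True
print(np.allclose(grad, 0, atol=1e-8))     # True
print(sse(beta_hat) < sse(beta_hat + 0.01))  # True: moving away increases the SSE
```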

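For the regression obtained this way to be meaningful, additional assumptions are needed about $\epsilon$, and therefore about the behavior of the estimator of $\beta$ (OLS or otherwise). These vary by model, but under the most commonly used one, the NLRM (Normal Linear Regression Model), they are:

Multivariate normal distribution of $\epsilon$
Diagonal covariance matrix of $\epsilon$
Equal diagonal entries in the covariance matrix of $\epsilon$

In other words, the error terms follow a multivariate normal distribution, are independent, and are homoscedastic.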

$\epsilon\sim{N}(0,\sigma^2I)$

The expectation of $\epsilon$ is therefore 0,

$E(\epsilon)=0$

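Treating $X$ as fixed (that is, conditioning on $X$), the term $X\beta$ is a constant and contributes no variance, so the expectation and variance of $y$ are:

$E(y)=X\beta+E(\epsilon)=X\beta$

$Var(y)=Var(X\beta+\epsilon)=Var(\epsilon)=\sigma^2I$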
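These two moments can be checked by simulation. The following is a minimal sketch that holds a small design matrix fixed and repeatedly draws $\epsilon\sim N(0,\sigma^2I)$; the dimensions, coefficients, and $\sigma$ are made-up assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 4, 2.0                      # hypothetical sample size and error std. dev.
X = rng.normal(size=(N, 2))            # design matrix, held fixed across draws
beta = np.array([1.0, 3.0])            # made-up coefficients

draws = 200_000
# Each row is one independent draw of eps ~ N(0, sigma^2 I)
eps = rng.normal(scale=sigma, size=(draws, N))
Y = X @ beta + eps                     # broadcasts X beta over all draws

mean_gap = Y.mean(axis=0) - X @ beta   # should be close to the zero vector
cov_y = np.cov(Y, rowvar=False)        # should be close to sigma^2 * I

print(np.round(mean_gap, 2))
print(np.round(cov_y, 1))
```

The sample mean of `Y` should sit near $X\beta$, and the sample covariance near $\sigma^2I$, up to simulation noise.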

Python implementation

Carrying this out with Python's Statsmodels and scikit-learn libraries looks roughly as follows.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

df = sm.datasets.get_rdataset("Duncan", "carData").data

X = df[['income', 'education']]
y = df['prestige']

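# Statsmodels (no constant column is added to X, so the model has no intercept)
ols_sm = sm.OLS(y, X).fit()

# scikit-learn (fit_intercept=False to match the Statsmodels model above)
ols_sk = LinearRegression(fit_intercept=False).fit(X, y)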

ols_sm.params

"""
income       0.548272
education    0.495751
dtype: float64
"""

ols_sk.coef_

"""
array([0.54827173, 0.49575132])
"""

Statsmodels also provides a summary report like the one below. Some of the test statistics in the report are already familiar to me, while others are new. I plan to study them and write them up one by one when I get the chance.

ols_sm.summary()

"""
                            OLS Regression Results
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.944
Method:                 Least Squares   F-statistic:                     377.6
Date:                Sun, 27 Jan 2019   Prob (F-statistic):           5.30e-28
Time:                        23:16:08   Log-Likelihood:                -180.04
No. Observations:                  45   AIC:                             364.1
Df Residuals:                      43   BIC:                             367.7
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
income         0.5483      0.116      4.743      0.000       0.315       0.781
education      0.4958      0.093      5.343      0.000       0.309       0.683
==============================================================================
Omnibus:                        0.724   Durbin-Watson:                   1.356
Prob(Omnibus):                  0.696   Jarque-Bera (JB):                0.366
Skew:                           0.219   Prob(JB):                        0.833
Kurtosis:                       3.055   Cond. No.                         5.50
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
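As a rough guide to the `coef`, `std err`, and `t` columns, the sketch below reproduces the standard nonrobust OLS formulas in NumPy: $\hat{\sigma}^2=SSE/(N-K)$ and $\widehat{Var}(\hat{\beta})=\hat{\sigma}^2(X^{\intercal}X)^{-1}$. It runs on simulated data with made-up coefficients (an assumption, chosen only to mirror the 45 observations and 2 regressors above), so the numbers will not match the report.

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 45, 2                               # same shape as the report above
X = rng.normal(size=(N, K))                # simulated regressors, no intercept
y = X @ np.array([0.5, 0.5]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y               # "coef"

resid = y - X @ beta_hat
df_resid = N - K                           # "Df Residuals"
sigma2_hat = resid @ resid / df_resid      # estimate of the error variance

std_err = np.sqrt(sigma2_hat * np.diag(XtX_inv))  # "std err"
t_stat = beta_hat / std_err                       # "t"

print(df_resid)
print(beta_hat, std_err, t_stat)
```

The p-values and confidence intervals in the report then follow from a t distribution with `df_resid` degrees of freedom.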
