Data Analysis/Data Analysis Basics
EDA and Linear Regression: Boston Housing Price Data
박째롱
2022. 2. 18. 17:15
1) Library & Data Import
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/yoonkt200/FastCampusDataset/master/BostonHousing2.csv")
In [3]:
df.head()
Out[3]:
TOWN | LON | LAT | CMEDV | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nahant | -70.955 | 42.2550 | 24.0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 |
1 | Swampscott | -70.950 | 42.2875 | 21.6 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 |
2 | Swampscott | -70.936 | 42.2830 | 34.7 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
3 | Marblehead | -70.928 | 42.2930 | 33.4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 |
4 | Marblehead | -70.922 | 42.2980 | 36.2 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 |
Feature Description
- TOWN : town name
- LON, LAT : longitude and latitude
- CMEDV : corrected median home value for the town
- CRIM : per-capita crime rate
- ZN : proportion of residential land zoned for large lots
- INDUS : proportion of non-retail business acres per town
- CHAS : whether the tract bounds the Charles River (dummy variable)
- NOX : nitric oxide concentration
- RM : average number of rooms per dwelling
- AGE : proportion of units built before 1940
- DIS : weighted distances to five Boston employment centers
- RAD : index of accessibility to radial highways
- TAX : property-tax rate per $10,000
- PTRATIO : pupil-teacher ratio by town
- B : 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents (see the sketch after this list)
- LSTAT : proportion of lower-status population
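As a side note on the B formula: it can be checked, and shown to be non-invertible, in a couple of lines. A minimal sketch (my addition; b_index and bk_candidates are hypothetical helper names):

import numpy as np

def b_index(bk):
    # B = 1000 * (Bk - 0.63)**2, as defined above
    return 1000 * (bk - 0.63) ** 2

def bk_candidates(b):
    # inverting gives two candidates: Bk = 0.63 +/- sqrt(B / 1000)
    r = np.sqrt(b / 1000)
    return 0.63 - r, 0.63 + r

print(b_index(0.0))          # ~396.9 -- the value many rows show above
print(bk_candidates(396.9))  # (0.0, 1.26); only 0.0 is a valid proportion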
2) EDA
In [4]:
df.shape # 506 rows, 17 columns
Out[4]:
(506, 17)
In [5]:
df.isnull().sum() # no null values
Out[5]:
TOWN 0
LON 0
LAT 0
CMEDV 0
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
dtype: int64
In [6]:
df.info() # all columns numeric except TOWN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TOWN 506 non-null object
1 LON 506 non-null float64
2 LAT 506 non-null float64
3 CMEDV 506 non-null float64
4 CRIM 506 non-null float64
5 ZN 506 non-null float64
6 INDUS 506 non-null float64
7 CHAS 506 non-null int64
8 NOX 506 non-null float64
9 RM 506 non-null float64
10 AGE 506 non-null float64
11 DIS 506 non-null float64
12 RAD 506 non-null int64
13 TAX 506 non-null int64
14 PTRATIO 506 non-null float64
15 B 506 non-null float64
16 LSTAT 506 non-null float64
dtypes: float64(13), int64(3), object(1)
memory usage: 67.3+ KB
Exploring the CMEDV Feature
In [7]:
df['CMEDV'].describe()
Out[7]:
count 506.000000
mean 22.528854
std 9.182176
min 5.000000
25% 17.025000
50% 21.200000
75% 25.000000
max 50.000000
Name: CMEDV, dtype: float64
In [8]:
df['CMEDV'].hist(bins=50)
Out[8]:
<AxesSubplot:>
In [9]:
df.boxplot(column = 'CMEDV')
Out[9]:
<AxesSubplot:>
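One thing worth checking (my addition, not in the original notebook): the max sits exactly at 50.0, which in the classic Boston data is a known censoring cap.

# Count rows sitting exactly at the 50.0 ceiling.
# (The classic MEDV column has 16 such rows; the corrected CMEDV may differ slightly.)
print((df['CMEDV'] == 50.0).sum())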
Examining the Explanatory Variables
In [10]:
numerical_columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
fig = plt.figure(figsize = (16,20)) # reserve the figure area
ax = plt.gca() # get the current Axes
df[numerical_columns].hist(ax=ax)
plt.show()
C:\Users\user\AppData\Local\Temp/ipykernel_12376/1584473939.py:4: UserWarning: To output multiple subplots, the figure containing the passed axes is being cleared
df[numerical_columns].hist(ax=ax)
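The warning appears because DataFrame.hist wants to create its own subplot grid but was handed a single Axes. A warning-free variant (a minimal sketch, same output):

df[numerical_columns].hist(figsize=(16, 20)) # let pandas build the subplot grid itself
plt.tight_layout()
plt.show()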
In [11]:
cols = ['CMEDV','CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
corr = df[cols].corr(method='pearson')
In [12]:
fig = plt.figure(figsize=(16,20))
ax = fig.gca()
sns.set(font_scale = 1.5) # seaborn font-size setting
hm = sns.heatmap(corr.values, annot = True, fmt='.2f', annot_kws = {'size':15}, yticklabels=cols, xticklabels=cols, ax=ax, cmap='Blues')
# annot: write the value in each cell
plt.tight_layout()
plt.show()
- CMEDV has an absolute correlation of 0.7 or higher with RM and LSTAT
- I'm personally curious about CRIM and B as well (see the quick check below)
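A quick numeric companion to the heatmap (my addition, reusing the corr frame computed above): the correlations with CMEDV, sorted.

print(corr['CMEDV'].drop('CMEDV').sort_values())
# LSTAT should come out strongly negative, RM strongly positive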
In [13]:
plt.plot('RM', 'CMEDV', data=df, linestyle = 'none', marker='o', markersize=3, color='blue', alpha=0.5)
plt.title('RM-CMEDV',fontsize=15)
plt.xlabel('RM',fontsize=10)
plt.ylabel('CMEDV',fontsize=10)
plt.show()
- The average number of rooms and the town's home prices show a clear positive correlation!
In [14]:
plt.plot('LSTAT', 'CMEDV', data=df, linestyle = 'none', marker='o', markersize=3, color='blue', alpha=0.5)
plt.title('LSTAT-CMEDV',fontsize=15)
plt.xlabel('LSTAT',fontsize=10)
plt.ylabel('CMEDV',fontsize=10)
plt.show()
- The proportion of lower-status residents and the town's home prices show a clear negative correlation!
In [16]:
plt.plot('B', 'CRIM', data=df, linestyle = 'none', marker='o', markersize=3, color='blue', alpha=0.5)
plt.title('B-CRIM',fontsize=15)
plt.xlabel('B',fontsize=10)
plt.ylabel('CRIM',fontsize=10)
plt.show()
- The B index and the crime rate show no clear correlation.
- Even in towns with a high B index, crime rates are densely clustered at very low values.
In [17]:
df['TOWN'].value_counts()
Out[17]:
Cambridge 30
Boston Savin Hill 23
Lynn 22
Boston Roxbury 19
Newton 18
..
Medfield 1
Dover 1
Lincoln 1
Sherborn 1
Nahant 1
Name: TOWN, Length: 92, dtype: int64
In [18]:
df['TOWN'].value_counts().hist(bins=30)
Out[18]:
<AxesSubplot:>
In [19]:
fig = plt.figure(figsize = (20,20))
ax = fig.gca()
sns.boxplot(x='CMEDV', y='TOWN', data=df, ax=ax)
Out[19]:
<AxesSubplot:xlabel='CMEDV', ylabel='TOWN'>
In [20]:
fig = plt.figure(figsize = (16,20))
ax = fig.gca()
sns.boxplot(x='CRIM', y='TOWN', data=df, ax=ax)
Out[20]:
<AxesSubplot:xlabel='CRIM', ylabel='TOWN'>
In [21]:
fig=plt.figure(figsize=(16,20))
ax=fig.gca()
sns.boxplot(x='B', y='TOWN', data=df, ax=ax)
Out[21]:
<AxesSubplot:xlabel='B', ylabel='TOWN'>
3) Predicting Home Prices: Regression Analysis
- Feature standardization
In [22]:
from sklearn.preprocessing import StandardScaler
# rescale each feature to mean 0, variance 1
scaler = StandardScaler()
scale_columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
df[scale_columns] = scaler.fit_transform(df[scale_columns])
In [23]:
df.head()
Out[23]:
TOWN | LON | LAT | CMEDV | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nahant | -70.955 | 42.2550 | 24.0 | -0.419782 | 0.284830 | -1.287909 | -0.272599 | -0.144217 | 0.413672 | -0.120013 | 0.140214 | -0.982843 | -0.666608 | -1.459000 | 0.441052 | -1.075562 |
1 | Swampscott | -70.950 | 42.2875 | 21.6 | -0.417339 | -0.487722 | -0.593381 | -0.272599 | -0.740262 | 0.194274 | 0.367166 | 0.557160 | -0.867883 | -0.987329 | -0.303094 | 0.441052 | -0.492439 |
2 | Swampscott | -70.936 | 42.2830 | 34.7 | -0.417342 | -0.487722 | -0.593381 | -0.272599 | -0.740262 | 1.282714 | -0.265812 | 0.557160 | -0.867883 | -0.987329 | -0.303094 | 0.396427 | -1.208727 |
3 | Marblehead | -70.928 | 42.2930 | 33.4 | -0.416750 | -0.487722 | -1.306878 | -0.272599 | -0.835284 | 1.016303 | -0.809889 | 1.077737 | -0.752922 | -1.106115 | 0.113032 | 0.416163 | -1.361517 |
4 | Marblehead | -70.922 | 42.2980 | 36.2 | -0.412482 | -0.487722 | -1.306878 | -0.272599 | -0.835284 | 1.228577 | -0.511180 | 1.077737 | -0.752922 | -1.106115 | 0.113032 | 0.441052 | -1.026501 |
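A sanity check worth running here (my addition): after fit_transform, every scaled column should have mean ~0 and standard deviation ~1.

print(df[scale_columns].mean().round(6)) # all ~0
print(df[scale_columns].std().round(6))  # ~1.001, since pandas uses ddof=1 while StandardScaler uses ddof=0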
- Splitting the dataset
In [24]:
from sklearn.model_selection import train_test_split
x = df[scale_columns]
y = df['CMEDV']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=33)
In [25]:
x_train.shape
Out[25]:
(404, 13)
- Training the regression model
In [27]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error # for MSE
from math import sqrt # for RMSE
In [28]:
#train regression model
lr = linear_model.LinearRegression()
model = lr.fit(x_train, y_train)
In [29]:
print(lr.coef_)
[-0.95549078 1.18690662 0.22303997 0.76659756 -1.78400866 2.83991455
-0.05556583 -3.28406695 2.84479571 -2.33740727 -1.77815381 0.79772973
-4.17382086]
In [32]:
plt.rcParams['figure.figsize'] = [12,16] # rcParams sets figure size, line color, width, etc.
coefs_series = pd.Series(lr.coef_, index=scale_columns) # label each coefficient with its feature name
In [35]:
ax = coefs_series.plot.barh() # y-axis now shows the feature names instead of bare indices
ax.set_title('feature coef graph') # rough look at each feature's coefficient
ax.set_xlabel('coef')
ax.set_ylabel('x_features')
plt.show()
Interpreting the Results
- R2 score (1 - SSE/SST): the regression model's explanatory power (closer to 1 is better)
- RMSE (Root Mean Squared Error): average prediction error in the target's units (lower is better)
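To make the two formulas concrete, here is a hand-rolled sketch of both metrics (my addition, equivalent to the model.score and sqrt(mean_squared_error) calls below):

import numpy as np

def r2_manual(y_true, y_pred):
    sse = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    sst = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
    return 1 - sse / sst

def rmse_manual(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))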
In [37]:
print(model.score(x_train, y_train))
0.7490284664199387
In [38]:
print(model.score(x_test, y_test))
0.7009342135321552
In [40]:
y_predictions = lr.predict(x_train)
print(sqrt(mean_squared_error(y_train, y_predictions)))
4.672162734008589
In [41]:
y_predictions = lr.predict(x_test)
print(sqrt(mean_squared_error(y_test, y_predictions)))
4.614951784913307
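A diagnostic I'd add at this point (not in the original notebook): a predicted-vs-actual scatter on the test set; points hugging the y = x line indicate good fits.

plt.scatter(y_test, lr.predict(x_test), alpha=0.5)
plt.plot([5, 50], [5, 50], color='red') # y = x reference line over CMEDV's observed range
plt.xlabel('actual CMEDV')
plt.ylabel('predicted CMEDV')
plt.show()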
- Feature significance testing
In [43]:
import statsmodels.api as sm
x_train = sm.add_constant(x_train)
model = sm.OLS(y_train, x_train).fit()
model.summary()
C:\Users\user\anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
x = pd.concat(x[::order], 1)
Out[43]:
Dep. Variable: | CMEDV | R-squared: | 0.749 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.741 |
Method: | Least Squares | F-statistic: | 89.54 |
Date: | Fri, 18 Feb 2022 | Prob (F-statistic): | 2.61e-108 |
Time: | 16:19:58 | Log-Likelihood: | -1196.1 |
No. Observations: | 404 | AIC: | 2420. |
Df Residuals: | 390 | BIC: | 2476. |
Df Model: | 13 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 22.4800 | 0.238 | 94.635 | 0.000 | 22.013 | 22.947 |
CRIM | -0.9555 | 0.299 | -3.192 | 0.002 | -1.544 | -0.367 |
ZN | 1.1869 | 0.353 | 3.362 | 0.001 | 0.493 | 1.881 |
INDUS | 0.2230 | 0.470 | 0.475 | 0.635 | -0.700 | 1.147 |
CHAS | 0.7666 | 0.238 | 3.227 | 0.001 | 0.300 | 1.234 |
NOX | -1.7840 | 0.512 | -3.482 | 0.001 | -2.791 | -0.777 |
RM | 2.8399 | 0.326 | 8.723 | 0.000 | 2.200 | 3.480 |
AGE | -0.0556 | 0.410 | -0.135 | 0.892 | -0.862 | 0.751 |
DIS | -3.2841 | 0.491 | -6.695 | 0.000 | -4.248 | -2.320 |
RAD | 2.8448 | 0.650 | 4.375 | 0.000 | 1.566 | 4.123 |
TAX | -2.3374 | 0.717 | -3.259 | 0.001 | -3.748 | -0.927 |
PTRATIO | -1.7782 | 0.312 | -5.700 | 0.000 | -2.391 | -1.165 |
B | 0.7977 | 0.293 | 2.725 | 0.007 | 0.222 | 1.373 |
LSTAT | -4.1738 | 0.405 | -10.317 | 0.000 | -4.969 | -3.378 |
Omnibus: | 167.528 | Durbin-Watson: | 1.913 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 769.057 |
Skew: | 1.774 | Prob(JB): | 1.00e-167 |
Kurtosis: | 8.753 | Cond. No. | 9.63 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- R-squared: 74.9%
- Every feature except INDUS and AGE is significant (see the p-value sketch after this list)
- Multicollinearity check: a VIF above 10 is generally taken as problematic
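The significance claim can be read straight off the fitted statsmodels result; a quick sketch:

pvals = model.pvalues.drop('const')
print(pvals[pvals > 0.05]) # expect INDUS and AGE, matching the summary above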
In [46]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(x_train.values, i) for i in range(x_train.shape[1])]
In [47]:
vif.head()
Out[47]:
VIF Factor | |
---|---|
0 | 1.008141 |
1 | 1.731807 |
2 | 2.222212 |
3 | 3.857543 |
4 | 1.076206 |
In [48]:
vif["feature"] = x_train.columns
vif.round(1)
Out[48]:
VIF Factor | feature | |
---|---|---|
0 | 1.0 | const |
1 | 1.7 | CRIM |
2 | 2.2 | ZN |
3 | 3.9 | INDUS |
4 | 1.1 | CHAS |
5 | 4.4 | NOX |
6 | 1.9 | RM |
7 | 3.1 | AGE |
8 | 4.1 | DIS |
9 | 6.9 | RAD |
10 | 8.6 | TAX |
11 | 1.7 | PTRATIO |
12 | 1.3 | B |
13 | 2.8 | LSTAT |
+) Regression Without INDUS and AGE
In [54]:
scale_columns2 = ['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
df[scale_columns2] = scaler.fit_transform(df[scale_columns2]) # these columns were already standardized above, so this is effectively a no-op
df[scale_columns2].head()
Out[54]:
CRIM | ZN | CHAS | NOX | RM | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.419782 | 0.284830 | -0.272599 | -0.144217 | 0.413672 | 0.140214 | -0.982843 | -0.666608 | -1.459000 | 0.441052 | -1.075562 |
1 | -0.417339 | -0.487722 | -0.272599 | -0.740262 | 0.194274 | 0.557160 | -0.867883 | -0.987329 | -0.303094 | 0.441052 | -0.492439 |
2 | -0.417342 | -0.487722 | -0.272599 | -0.740262 | 1.282714 | 0.557160 | -0.867883 | -0.987329 | -0.303094 | 0.396427 | -1.208727 |
3 | -0.416750 | -0.487722 | -0.272599 | -0.835284 | 1.016303 | 1.077737 | -0.752922 | -1.106115 | 0.113032 | 0.416163 | -1.361517 |
4 | -0.412482 | -0.487722 | -0.272599 | -0.835284 | 1.228577 | 1.077737 | -0.752922 | -1.106115 | 0.113032 | 0.441052 | -1.026501 |
In [56]:
x = df[scale_columns2]
y = df['CMEDV']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=33)
model = lr.fit(x_train, y_train)
In [57]:
print(model.score(x_train,y_train))
print(model.score(x_test,y_test))
y_predictions = lr.predict(x_train)
print(sqrt(mean_squared_error(y_train, y_predictions)))
y_predictions = lr.predict(x_test)
print(sqrt(mean_squared_error(y_test, y_predictions)))
0.7488676412541904
0.7016523248519184
4.673659479471091
4.609407785787356
In [58]:
x_train = sm.add_constant(x_train)
model = sm.OLS(y_train, x_train).fit()
model.summary()
C:\Users\user\anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
x = pd.concat(x[::order], 1)
Out[58]:
Dep. Variable: | CMEDV | R-squared: | 0.749 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.742 |
Method: | Least Squares | F-statistic: | 106.3 |
Date: | Fri, 18 Feb 2022 | Prob (F-statistic): | 2.77e-110 |
Time: | 16:53:36 | Log-Likelihood: | -1196.2 |
No. Observations: | 404 | AIC: | 2416. |
Df Residuals: | 392 | BIC: | 2464. |
Df Model: | 11 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 22.4803 | 0.237 | 94.849 | 0.000 | 22.014 | 22.946 |
CRIM | -0.9602 | 0.299 | -3.216 | 0.001 | -1.547 | -0.373 |
ZN | 1.1713 | 0.349 | 3.361 | 0.001 | 0.486 | 1.857 |
CHAS | 0.7745 | 0.235 | 3.294 | 0.001 | 0.312 | 1.237 |
NOX | -1.7406 | 0.475 | -3.661 | 0.000 | -2.675 | -0.806 |
RM | 2.8183 | 0.317 | 8.888 | 0.000 | 2.195 | 3.442 |
DIS | -3.3051 | 0.453 | -7.293 | 0.000 | -4.196 | -2.414 |
RAD | 2.7583 | 0.616 | 4.481 | 0.000 | 1.548 | 3.968 |
TAX | -2.1810 | 0.634 | -3.438 | 0.001 | -3.428 | -0.934 |
PTRATIO | -1.7631 | 0.306 | -5.762 | 0.000 | -2.365 | -1.161 |
B | 0.7936 | 0.292 | 2.721 | 0.007 | 0.220 | 1.367 |
LSTAT | -4.1755 | 0.378 | -11.047 | 0.000 | -4.919 | -3.432 |
Omnibus: | 167.125 | Durbin-Watson: | 1.912 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 764.205 |
Skew: | 1.771 | Prob(JB): | 1.14e-166 |
Kurtosis: | 8.732 | Cond. No. | 7.60 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- Not much difference in explanatory power.
- Crime rate, nitric oxide concentration, distance to the Boston employment centers (a bigger effect than I expected), property tax, pupil-teacher ratio, and the proportion of lower-status residents (the biggest effect) have negative coefficients on Boston home prices.
- Residential land ratio, bounding the Charles River, number of rooms, highway accessibility, and the B index have positive coefficients.
In [59]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
# widen the notebook container to fit the window