Data Analysis/Data Analysis Basics
EDA and Linear Regression: Boston Housing Price Data
박째롱
2022. 2. 18. 17:15
1) Library & Data Import
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/yoonkt200/FastCampusDataset/master/BostonHousing2.csv")
In [3]:
df.head()
Out[3]:
TOWN | LON | LAT | CMEDV | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nahant | -70.955 | 42.2550 | 24.0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 |
1 | Swampscott | -70.950 | 42.2875 | 21.6 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 |
2 | Swampscott | -70.936 | 42.2830 | 34.7 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
3 | Marblehead | -70.928 | 42.2930 | 33.4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 |
4 | Marblehead | -70.922 | 42.2980 | 36.2 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 |
Feature Description
- TOWN : town name
- LON, LAT : longitude and latitude
- CMEDV : corrected median home value for the town
- CRIM : per-capita crime rate
- ZN : proportion of residential land zoned for large lots
- INDUS : proportion of non-retail business acres per town
- CHAS : whether the tract bounds the Charles River (dummy variable)
- NOX : nitric oxide concentration
- RM : average number of rooms per dwelling
- AGE : proportion of units built before 1940
- DIS : weighted distances to five Boston employment centers
- RAD : index of accessibility to radial highways
- TAX : property-tax rate per $10,000
- PTRATIO : pupil-teacher ratio by town
- B : 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents (see the sketch after this list)
- LSTAT : proportion of lower-status population
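As a side note on the B formula: it can be checked, and shown to be non-invertible, in a couple of lines. A minimal sketch (my addition; b_index and bk_candidates are hypothetical helper names):

import numpy as np

def b_index(bk):
    # B = 1000 * (Bk - 0.63)**2, as defined above
    return 1000 * (bk - 0.63) ** 2

def bk_candidates(b):
    # inverting gives two candidates: Bk = 0.63 +/- sqrt(B / 1000)
    r = np.sqrt(b / 1000)
    return 0.63 - r, 0.63 + r

print(b_index(0.0))          # ~396.9 -- the value many rows show above
print(bk_candidates(396.9))  # (0.0, 1.26); only 0.0 is a valid proportion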
2) EDA
In [4]:
df.shape # 506 rows, 17 columns
Out[4]:
(506, 17)
In [5]:
df.isnull().sum() # no null values
Out[5]:
TOWN 0
LON 0
LAT 0
CMEDV 0
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
dtype: int64
In [6]:
df.info() # all columns numeric except TOWN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TOWN 506 non-null object
1 LON 506 non-null float64
2 LAT 506 non-null float64
3 CMEDV 506 non-null float64
4 CRIM 506 non-null float64
5 ZN 506 non-null float64
6 INDUS 506 non-null float64
7 CHAS 506 non-null int64
8 NOX 506 non-null float64
9 RM 506 non-null float64
10 AGE 506 non-null float64
11 DIS 506 non-null float64
12 RAD 506 non-null int64
13 TAX 506 non-null int64
14 PTRATIO 506 non-null float64
15 B 506 non-null float64
16 LSTAT 506 non-null float64
dtypes: float64(13), int64(3), object(1)
memory usage: 67.3+ KB
Exploring the CMEDV Feature
In [7]:
df['CMEDV'].describe()
Out[7]:
count 506.000000
mean 22.528854
std 9.182176
min 5.000000
25% 17.025000
50% 21.200000
75% 25.000000
max 50.000000
Name: CMEDV, dtype: float64
In [8]:
df['CMEDV'].hist(bins=50)
Out[8]:
<AxesSubplot:>
In [9]:
df.boxplot(column = 'CMEDV')
Out[9]:
<AxesSubplot:>
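One thing worth checking (my addition, not in the original notebook): the max sits exactly at 50.0, which in the classic Boston data is a known censoring cap.

# Count rows sitting exactly at the 50.0 ceiling.
# (The classic MEDV column has 16 such rows; the corrected CMEDV may differ slightly.)
print((df['CMEDV'] == 50.0).sum())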
Examining the Explanatory Variables
In [10]:
numerical_columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
fig = plt.figure(figsize = (16,20)) # reserve the figure area
ax = plt.gca() # get the current Axes
df[numerical_columns].hist(ax=ax)
plt.show()
C:\Users\user\AppData\Local\Temp/ipykernel_12376/1584473939.py:4: UserWarning: To output multiple subplots, the figure containing the passed axes is being cleared
df[numerical_columns].hist(ax=ax)
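The warning appears because DataFrame.hist wants to create its own subplot grid but was handed a single Axes. A warning-free variant (a minimal sketch, same output):

df[numerical_columns].hist(figsize=(16, 20)) # let pandas build the subplot grid itself
plt.tight_layout()
plt.show()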
In [11]:
cols = ['CMEDV','CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
corr = df[cols].corr(method='pearson')
In [12]:
fig = plt.figure(figsize=(16,20))
ax = fig.gca()
sns.set(font_scale = 1.5) # seaborn font-size setting
hm = sns.heatmap(corr.values, annot = True, fmt='.2f', annot_kws = {'size':15}, yticklabels=cols, xticklabels=cols, ax=ax, cmap='Blues')
# annot: write the value in each cell
plt.tight_layout()
plt.show()
- CMEDV has an absolute correlation of 0.7 or higher with RM and LSTAT
- I'm personally curious about CRIM and B as well (see the quick check below)
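A quick numeric companion to the heatmap (my addition, reusing the corr frame computed above): the correlations with CMEDV, sorted.

print(corr['CMEDV'].drop('CMEDV').sort_values())
# LSTAT should come out strongly negative, RM strongly positive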
In [13]:
plt.plot('RM', 'CMEDV', data=df, linestyle = 'none', marker='o', markersize=3, color='blue', alpha=0.5)
plt.title('RM-CMEDV',fontsize=15)
plt.xlabel('RM',fontsize=10)
plt.ylabel('CMEDV',fontsize=10)
plt.show()
- The average number of rooms and the town's home prices show a clear positive correlation!
In [14]:
plt.plot('LSTAT', 'CMEDV', data=df, linestyle = 'none', marker='o', markersize=3, color='blue', alpha=0.5)
plt.title('LSTAT-CMEDV',fontsize=15)
plt.xlabel('LSTAT',fontsize=10)
plt.ylabel('CMEDV',fontsize=10)
plt.show()
- The proportion of lower-status residents and the town's home prices show a clear negative correlation!
In [16]:
plt.plot('B', 'CRIM', data=df, linestyle = 'none', marker='o', markersize=3, color='blue', alpha=0.5)
plt.title('B-CRIM',fontsize=15)
plt.xlabel('B',fontsize=10)
plt.ylabel('CRIM',fontsize=10)
plt.show()
- The B index and the crime rate show no clear correlation.
- Even in towns with a high B index, crime rates are densely clustered at very low values.
In [17]:
df['TOWN'].value_counts()
Out[17]:
Cambridge 30
Boston Savin Hill 23
Lynn 22
Boston Roxbury 19
Newton 18
..
Medfield 1
Dover 1
Lincoln 1
Sherborn 1
Nahant 1
Name: TOWN, Length: 92, dtype: int64
In [18]:
df['TOWN'].value_counts().hist(bins=30)
Out[18]:
<AxesSubplot:>
In [19]:
fig = plt.figure(figsize = (20,20))
ax = fig.gca()
sns.boxplot(x='CMEDV', y='TOWN', data=df, ax=ax)
Out[19]:
<AxesSubplot:xlabel='CMEDV', ylabel='TOWN'>
In [20]:
fig = plt.figure(figsize = (16,20))
ax = fig.gca()
sns.boxplot(x='CRIM', y='TOWN', data=df, ax=ax)
Out[20]:
<AxesSubplot:xlabel='CRIM', ylabel='TOWN'>
In [21]:
fig=plt.figure(figsize=(16,20))
ax=fig.gca()
sns.boxplot(x='B', y='TOWN', data=df, ax=ax)
Out[21]:
<AxesSubplot:xlabel='B', ylabel='TOWN'>
3) Predicting Home Prices: Regression Analysis
- Feature standardization
In [22]:
from sklearn.preprocessing import StandardScaler
# rescale each feature to mean 0, variance 1
scaler = StandardScaler()
scale_columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
df[scale_columns] = scaler.fit_transform(df[scale_columns])
In [23]:
df.head()
Out[23]:
TOWN | LON | LAT | CMEDV | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nahant | -70.955 | 42.2550 | 24.0 | -0.419782 | 0.284830 | -1.287909 | -0.272599 | -0.144217 | 0.413672 | -0.120013 | 0.140214 | -0.982843 | -0.666608 | -1.459000 | 0.441052 | -1.075562 |
1 | Swampscott | -70.950 | 42.2875 | 21.6 | -0.417339 | -0.487722 | -0.593381 | -0.272599 | -0.740262 | 0.194274 | 0.367166 | 0.557160 | -0.867883 | -0.987329 | -0.303094 | 0.441052 | -0.492439 |
2 | Swampscott | -70.936 | 42.2830 | 34.7 | -0.417342 | -0.487722 | -0.593381 | -0.272599 | -0.740262 | 1.282714 | -0.265812 | 0.557160 | -0.867883 | -0.987329 | -0.303094 | 0.396427 | -1.208727 |
3 | Marblehead | -70.928 | 42.2930 | 33.4 | -0.416750 | -0.487722 | -1.306878 | -0.272599 | -0.835284 | 1.016303 | -0.809889 | 1.077737 | -0.752922 | -1.106115 | 0.113032 | 0.416163 | -1.361517 |
4 | Marblehead | -70.922 | 42.2980 | 36.2 | -0.412482 | -0.487722 | -1.306878 | -0.272599 | -0.835284 | 1.228577 | -0.511180 | 1.077737 | -0.752922 | -1.106115 | 0.113032 | 0.441052 | -1.026501 |
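A sanity check worth running here (my addition): after fit_transform, every scaled column should have mean ~0 and standard deviation ~1.

print(df[scale_columns].mean().round(6)) # all ~0
print(df[scale_columns].std().round(6))  # ~1.001, since pandas uses ddof=1 while StandardScaler uses ddof=0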
- Splitting the dataset
In [24]:
from sklearn.model_selection import train_test_split
x = df[scale_columns]
y = df['CMEDV']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=33)
In [25]:
x_train.shape
Out[25]:
(404, 13)
- Training the regression model
In [27]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error # for MSE
from math import sqrt # for RMSE
In [28]:
#train regression model
lr = linear_model.LinearRegression()
model = lr.fit(x_train, y_train)
In [29]:
print(lr.coef_)
[-0.95549078 1.18690662 0.22303997 0.76659756 -1.78400866 2.83991455
-0.05556583 -3.28406695 2.84479571 -2.33740727 -1.77815381 0.79772973
-4.17382086]
In [32]:
plt.rcParams['figure.figsize'] = [12,16] # rcParams sets figure size, line color, width, etc.
coefs_series = pd.Series(lr.coef_, index=scale_columns) # label each coefficient with its feature name
In [35]:
ax = coefs_series.plot.barh() # y-axis now shows the feature names instead of bare indices
ax.set_title('feature coef graph') # rough look at each feature's coefficient
ax.set_xlabel('coef')
ax.set_ylabel('x_features')
plt.show()
Interpreting the Results
- R2 score (1 - SSE/SST): the regression model's explanatory power (closer to 1 is better)
- RMSE (Root Mean Squared Error): average prediction error in the target's units (lower is better)
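To make the two formulas concrete, here is a hand-rolled sketch of both metrics (my addition, equivalent to the model.score and sqrt(mean_squared_error) calls below):

import numpy as np

def r2_manual(y_true, y_pred):
    sse = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    sst = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
    return 1 - sse / sst

def rmse_manual(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))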
In [37]:
print(model.score(x_train, y_train))
0.7490284664199387
In [38]:
print(model.score(x_test, y_test))
0.7009342135321552
In [40]:
y_predictions = lr.predict(x_train)
print(sqrt(mean_squared_error(y_train, y_predictions)))
4.672162734008589
In [41]:
y_predictions = lr.predict(x_test)
print(sqrt(mean_squared_error(y_test, y_predictions)))
4.614951784913307
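A diagnostic I'd add at this point (not in the original notebook): a predicted-vs-actual scatter on the test set; points hugging the y = x line indicate good fits.

plt.scatter(y_test, lr.predict(x_test), alpha=0.5)
plt.plot([5, 50], [5, 50], color='red') # y = x reference line over CMEDV's observed range
plt.xlabel('actual CMEDV')
plt.ylabel('predicted CMEDV')
plt.show()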
- Feature significance testing
In [43]:
import statsmodels.api as sm
x_train = sm.add_constant(x_train)
model = sm.OLS(y_train, x_train).fit()
model.summary()
C:\Users\user\anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
x = pd.concat(x[::order], 1)
Out[43]:
Dep. Variable: | CMEDV | R-squared: | 0.749 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.741 |
Method: | Least Squares | F-statistic: | 89.54 |
Date: | Fri, 18 Feb 2022 | Prob (F-statistic): | 2.61e-108 |
Time: | 16:19:58 | Log-Likelihood: | -1196.1 |
No. Observations: | 404 | AIC: | 2420. |
Df Residuals: | 390 | BIC: | 2476. |
Df Model: | 13 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 22.4800 | 0.238 | 94.635 | 0.000 | 22.013 | 22.947 |
CRIM | -0.9555 | 0.299 | -3.192 | 0.002 | -1.544 | -0.367 |
ZN | 1.1869 | 0.353 | 3.362 | 0.001 | 0.493 | 1.881 |
INDUS | 0.2230 | 0.470 | 0.475 | 0.635 | -0.700 | 1.147 |
CHAS | 0.7666 | 0.238 | 3.227 | 0.001 | 0.300 | 1.234 |
NOX | -1.7840 | 0.512 | -3.482 | 0.001 | -2.791 | -0.777 |
RM | 2.8399 | 0.326 | 8.723 | 0.000 | 2.200 | 3.480 |
AGE | -0.0556 | 0.410 | -0.135 | 0.892 | -0.862 | 0.751 |
DIS | -3.2841 | 0.491 | -6.695 | 0.000 | -4.248 | -2.320 |
RAD | 2.8448 | 0.650 | 4.375 | 0.000 | 1.566 | 4.123 |
TAX | -2.3374 | 0.717 | -3.259 | 0.001 | -3.748 | -0.927 |
PTRATIO | -1.7782 | 0.312 | -5.700 | 0.000 | -2.391 | -1.165 |
B | 0.7977 | 0.293 | 2.725 | 0.007 | 0.222 | 1.373 |
LSTAT | -4.1738 | 0.405 | -10.317 | 0.000 | -4.969 | -3.378 |
Omnibus: | 167.528 | Durbin-Watson: | 1.913 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 769.057 |
Skew: | 1.774 | Prob(JB): | 1.00e-167 |
Kurtosis: | 8.753 | Cond. No. | 9.63 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- R-squared: 74.9%
- Every feature except INDUS and AGE is significant (see the p-value sketch after this list)
- Multicollinearity check: a VIF above 10 is generally taken as problematic
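The significance claim can be read straight off the fitted statsmodels result; a quick sketch:

pvals = model.pvalues.drop('const')
print(pvals[pvals > 0.05]) # expect INDUS and AGE, matching the summary above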
In [46]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(x_train.values, i) for i in range(x_train.shape[1])]
In [47]:
vif.head()
Out[47]:
VIF Factor | |
---|---|
0 | 1.008141 |
1 | 1.731807 |
2 | 2.222212 |
3 | 3.857543 |
4 | 1.076206 |
In [48]:
vif["feature"] = x_train.columns
vif.round(1)
Out[48]:
VIF Factor | feature | |
---|---|---|
0 | 1.0 | const |
1 | 1.7 | CRIM |
2 | 2.2 | ZN |
3 | 3.9 | INDUS |
4 | 1.1 | CHAS |
5 | 4.4 | NOX |
6 | 1.9 | RM |
7 | 3.1 | AGE |
8 | 4.1 | DIS |
9 | 6.9 | RAD |
10 | 8.6 | TAX |
11 | 1.7 | PTRATIO |
12 | 1.3 | B |
13 | 2.8 | LSTAT |
+) Regression Without INDUS and AGE
In [54]:
scale_columns2 = ['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
df[scale_columns2] = scaler.fit_transform(df[scale_columns2]) # these columns were already standardized above, so this is effectively a no-op
df[scale_columns2].head()
Out[54]:
CRIM | ZN | CHAS | NOX | RM | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.419782 | 0.284830 | -0.272599 | -0.144217 | 0.413672 | 0.140214 | -0.982843 | -0.666608 | -1.459000 | 0.441052 | -1.075562 |
1 | -0.417339 | -0.487722 | -0.272599 | -0.740262 | 0.194274 | 0.557160 | -0.867883 | -0.987329 | -0.303094 | 0.441052 | -0.492439 |
2 | -0.417342 | -0.487722 | -0.272599 | -0.740262 | 1.282714 | 0.557160 | -0.867883 | -0.987329 | -0.303094 | 0.396427 | -1.208727 |
3 | -0.416750 | -0.487722 | -0.272599 | -0.835284 | 1.016303 | 1.077737 | -0.752922 | -1.106115 | 0.113032 | 0.416163 | -1.361517 |
4 | -0.412482 | -0.487722 | -0.272599 | -0.835284 | 1.228577 | 1.077737 | -0.752922 | -1.106115 | 0.113032 | 0.441052 | -1.026501 |
In [56]:
x = df[scale_columns2]
y = df['CMEDV']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=33)
model = lr.fit(x_train, y_train)
In [57]:
print(model.score(x_train,y_train))
print(model.score(x_test,y_test))
y_predictions = lr.predict(x_train)
print(sqrt(mean_squared_error(y_train, y_predictions)))
y_predictions = lr.predict(x_test)
print(sqrt(mean_squared_error(y_test, y_predictions)))
0.7488676412541904
0.7016523248519184
4.673659479471091
4.609407785787356
In [58]:
x_train = sm.add_constant(x_train)
model = sm.OLS(y_train, x_train).fit()
model.summary()
C:\Users\user\anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
x = pd.concat(x[::order], 1)
Out[58]:
Dep. Variable: | CMEDV | R-squared: | 0.749 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.742 |
Method: | Least Squares | F-statistic: | 106.3 |
Date: | Fri, 18 Feb 2022 | Prob (F-statistic): | 2.77e-110 |
Time: | 16:53:36 | Log-Likelihood: | -1196.2 |
No. Observations: | 404 | AIC: | 2416. |
Df Residuals: | 392 | BIC: | 2464. |
Df Model: | 11 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 22.4803 | 0.237 | 94.849 | 0.000 | 22.014 | 22.946 |
CRIM | -0.9602 | 0.299 | -3.216 | 0.001 | -1.547 | -0.373 |
ZN | 1.1713 | 0.349 | 3.361 | 0.001 | 0.486 | 1.857 |
CHAS | 0.7745 | 0.235 | 3.294 | 0.001 | 0.312 | 1.237 |
NOX | -1.7406 | 0.475 | -3.661 | 0.000 | -2.675 | -0.806 |
RM | 2.8183 | 0.317 | 8.888 | 0.000 | 2.195 | 3.442 |
DIS | -3.3051 | 0.453 | -7.293 | 0.000 | -4.196 | -2.414 |
RAD | 2.7583 | 0.616 | 4.481 | 0.000 | 1.548 | 3.968 |
TAX | -2.1810 | 0.634 | -3.438 | 0.001 | -3.428 | -0.934 |
PTRATIO | -1.7631 | 0.306 | -5.762 | 0.000 | -2.365 | -1.161 |
B | 0.7936 | 0.292 | 2.721 | 0.007 | 0.220 | 1.367 |
LSTAT | -4.1755 | 0.378 | -11.047 | 0.000 | -4.919 | -3.432 |
Omnibus: | 167.125 | Durbin-Watson: | 1.912 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 764.205 |
Skew: | 1.771 | Prob(JB): | 1.14e-166 |
Kurtosis: | 8.732 | Cond. No. | 7.60 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- Not much difference in explanatory power.
- Crime rate, nitric oxide concentration, distance to the Boston employment centers (a bigger effect than I expected), property tax, pupil-teacher ratio, and the proportion of lower-status residents (the biggest effect) have negative coefficients on Boston home prices.
- Residential land ratio, bounding the Charles River, number of rooms, highway accessibility, and the B index have positive coefficients.
In [59]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
# widen the notebook container to fit the window