Data Analysis / Data Analysis Basics
Classification: Pokemon Dataset
박째롱
2022. 2. 25. 16:48
[Pokemon Classification]¶
0. Overview¶
We will run a classification analysis on the Pokemon dataset.
- Supervised learning (Logistic Regression): predict whether a Pokemon is legendary ('Legendary' 0 or 1)
- Unsupervised learning (K-Means Clustering): cluster analysis of Pokemon
1. Library & Data Import¶
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Import the Pokemon dataset
df = pd.read_csv("https://raw.githubusercontent.com/yoonkt200/FastCampusDataset/master/Pokemon.csv")
In [3]:
plt.rcParams['figure.figsize'] = (8, 6)  # default figure size
plt.rcParams['font.family'] = 'Malgun Gothic'  # font that renders Korean labels correctly
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))  # widen the notebook container
In [4]:
df.head()
Out[4]:
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
Feature Description¶
- Name : Pokemon name
- Type 1/2 : primary/secondary type
- Total : sum of all base stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed)
- HP : HP stat
- Attack : Attack stat
- Defense : Defense stat
- Sp. Atk : Sp. Atk (Special Attack) stat
- Sp. Def : Sp. Def (Special Defense) stat
- Speed : Speed stat
- Generation : generation the Pokemon was introduced in
- Legendary : whether the Pokemon is legendary
2. EDA (Exploratory Data Analysis)¶
2-1. Basic Information¶
[Exploring the full dataset]
In [5]:
# Dimension
df.shape
Out[5]:
(800, 13)
In [6]:
# Data types and non-null counts
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 # 800 non-null int64
1 Name 800 non-null object
2 Type 1 800 non-null object
3 Type 2 414 non-null object
4 Total 800 non-null int64
5 HP 800 non-null int64
6 Attack 800 non-null int64
7 Defense 800 non-null int64
8 Sp. Atk 800 non-null int64
9 Sp. Def 800 non-null int64
10 Speed 800 non-null int64
11 Generation 800 non-null int64
12 Legendary 800 non-null bool
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB
In [7]:
# Missing values
df.isnull().sum()
Out[7]:
# 0
Name 0
Type 1 0
Type 2 386
Total 0
HP 0
Attack 0
Defense 0
Sp. Atk 0
Sp. Def 0
Speed 0
Generation 0
Legendary 0
dtype: int64
[Exploring individual features]
- "Legendary" (classification target)
In [8]:
# Count per class
df['Legendary'].value_counts()
Out[8]:
False 735
True 65
Name: Legendary, dtype: int64
- "Generation" (Pokemon generation)
In [9]:
# Count per generation
df['Generation'].value_counts().sort_index()
Out[9]:
1 166
2 106
3 160
4 121
5 165
6 82
Name: Generation, dtype: int64
In [10]:
# Count per generation, visualized
df['Generation'].value_counts().sort_index().plot.bar()
plt.title('Count per Generation')
Out[10]:
Text(0.5, 1.0, 'Count per Generation')
- "Type 1" & "Type 2" (Pokemon types)
In [11]:
# Count per Type 1 category, visualized
df['Type 1'].value_counts().plot.bar()
Out[11]:
<AxesSubplot:>
In [12]:
# Count per Type 2 category, visualized
df['Type 2'].value_counts().plot.bar()
Out[12]:
<AxesSubplot:>
In [13]:
# Number of distinct Type 2 values (excluding NaN)
len(df[df['Type 2'].notnull()]['Type 2'].unique())
Out[13]:
18
2-2. Exploring Feature Distributions¶
- We examine each feature's distribution over the whole dataset.
- We also check how each distribution differs across the classes of the target, 'Legendary'.
[Base stat distributions]
In [14]:
# Base stat distributions over all Pokemon
stat = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
sns.boxplot(data=df[stat])
plt.title("Base Stat Distributions")
plt.show()
In [15]:
# Base stat distributions by Legendary status
fig = plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
sns.boxplot(data=df[df['Legendary']==1][stat])
plt.title('Legendary = True')
plt.subplot(1, 2, 2)
sns.boxplot(data=df[df['Legendary']==0][stat])
plt.title('Legendary = False')
fig.suptitle('Base Stat Distributions', fontsize=20)
plt.show()
- Legendary = True shows overall higher values on every stat than Legendary = False.
- Attack and Sp. Atk in particular appear relatively high.
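The gap can be quantified with a grouped mean as well; a minimal sketch on a hypothetical miniature of the DataFrame (illustrative values only, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the Pokemon DataFrame (values are illustrative only)
df = pd.DataFrame({
    'Attack':    [49, 62, 100, 150, 180],
    'Sp. Atk':   [65, 80, 122, 154, 160],
    'Legendary': [False, False, False, True, True],
})

# Mean of each stat by Legendary status quantifies the boxplot gap
means = df.groupby('Legendary')[['Attack', 'Sp. Atk']].mean()
print(means)
```

The same `groupby('Legendary')[stat].mean()` call applied to the full dataset yields one row per class, per stat.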
[Distribution of Total]
In [16]:
# Distribution of Total over all Pokemon
df['Total'].hist(bins=50)
plt.title('Distribution of Total')
Out[16]:
Text(0.5, 1.0, 'Distribution of Total')
In [17]:
# Distribution of Total by Legendary status
sns.histplot(data=df, x='Total', hue="Legendary", bins=50)
plt.title("Distribution of Total by Legendary Status")
plt.show()
In [18]:
# Distribution of Total by generation
sns.boxplot(data=df, x='Generation', y='Total')
plt.title("Distribution of Total by Generation")
plt.show()
In [19]:
# Distribution of Total by generation and Legendary status
sns.boxplot(data=df, x='Generation', y='Total', hue='Legendary')
plt.title("Distribution of Total by Generation and Legendary Status")
plt.show()
In [20]:
# Distribution of Total by Type 1
sns.boxplot(data=df, x='Type 1', y='Total')
plt.title('Distribution of Total by Type 1')
plt.xticks(rotation=45)
plt.show()
[Type 1 & 2 Distributions]
In [21]:
# Type 1 distribution
df['Type 1'].value_counts(sort=False).sort_index().plot.bar()
plt.title("Type 1 Distribution")
Out[21]:
Text(0.5, 1.0, 'Type 1 Distribution')
In [22]:
# Type 1 distribution by Legendary status
T1_Total = pd.DataFrame(df['Type 1'].value_counts().sort_index())
T1_NotLeg = pd.DataFrame(df[df['Legendary']==0].groupby('Type 1').size())
T1_count = pd.concat([T1_Total, T1_NotLeg], axis=1)
T1_count.columns = ['Total', 'Not Legend']
T1_count['Legend'] = T1_count['Total'] - T1_count['Not Legend']
T1_count
Out[22]:
Total | Not Legend | Legend | |
---|---|---|---|
Bug | 69 | 69 | 0 |
Dark | 31 | 29 | 2 |
Dragon | 32 | 20 | 12 |
Electric | 44 | 40 | 4 |
Fairy | 17 | 16 | 1 |
Fighting | 27 | 27 | 0 |
Fire | 52 | 47 | 5 |
Flying | 4 | 2 | 2 |
Ghost | 32 | 30 | 2 |
Grass | 70 | 67 | 3 |
Ground | 32 | 28 | 4 |
Ice | 24 | 22 | 2 |
Normal | 98 | 96 | 2 |
Poison | 28 | 28 | 0 |
Psychic | 57 | 43 | 14 |
Rock | 44 | 40 | 4 |
Steel | 27 | 23 | 4 |
Water | 112 | 108 | 4 |
In [23]:
T1_count[['Not Legend','Legend']].plot.bar()
Out[23]:
<AxesSubplot:>
In [24]:
# Type 2 distribution
df['Type 2'].value_counts(sort=False).sort_index().plot.bar()
plt.title('Type 2 Distribution')
Out[24]:
Text(0.5, 1.0, 'Type 2 Distribution')
In [25]:
# Type 2 distribution by Legendary status
T2_total = pd.DataFrame(df['Type 2'].value_counts().sort_index())
T2_NotLeg = pd.DataFrame(df[df['Legendary']==0].groupby('Type 2').size())
T2_Count = pd.concat([T2_total,T2_NotLeg],axis=1)
T2_Count.columns = ['Total', 'Not Legend']
T2_Count['Legend'] = T2_Count['Total']-T2_Count['Not Legend']
T2_Count
Out[25]:
Total | Not Legend | Legend | |
---|---|---|---|
Bug | 3 | 3 | 0 |
Dark | 20 | 19 | 1 |
Dragon | 18 | 14 | 4 |
Electric | 6 | 5 | 1 |
Fairy | 23 | 21 | 2 |
Fighting | 26 | 22 | 4 |
Fire | 12 | 9 | 3 |
Flying | 97 | 84 | 13 |
Ghost | 14 | 13 | 1 |
Grass | 25 | 25 | 0 |
Ground | 35 | 34 | 1 |
Ice | 14 | 11 | 3 |
Normal | 4 | 4 | 0 |
Poison | 34 | 34 | 0 |
Psychic | 33 | 28 | 5 |
Rock | 14 | 14 | 0 |
Steel | 22 | 21 | 1 |
Water | 14 | 13 | 1 |
In [26]:
T2_Count[['Not Legend', 'Legend']].plot.bar()
plt.title("Type 2 Distribution by Legendary Status")
Out[26]:
Text(0.5, 1.0, 'Type 2 Distribution by Legendary Status')
[Generation Distribution]
In [27]:
Gen_NotLeg = pd.DataFrame(df[df['Legendary']==0].groupby('Generation').size())
Gen_Leg = pd.DataFrame(df[df['Legendary']==1].groupby('Generation').size())
Gen_Count = pd.concat([Gen_NotLeg, Gen_Leg], axis=1)
Gen_Count.columns = ['Not Legend', 'Legend']
Gen_Count['Total'] = Gen_Count['Not Legend']+Gen_Count['Legend']
Gen_Count
Out[27]:
Not Legend | Legend | Total | |
---|---|---|---|
Generation | |||
1 | 160 | 6 | 166 |
2 | 101 | 5 | 106 |
3 | 142 | 18 | 160 |
4 | 108 | 13 | 121 |
5 | 150 | 15 | 165 |
6 | 74 | 8 | 82 |
In [28]:
Gen_Count[['Not Legend', 'Legend']].plot.bar()
Out[28]:
<AxesSubplot:xlabel='Generation'>
3. Classification Analysis (Supervised Learning): Logistic Regression¶
3-1. Data Preprocessing¶
[Changing Data Types]¶
In [29]:
df.head()
Out[29]:
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
In [30]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 # 800 non-null int64
1 Name 800 non-null object
2 Type 1 800 non-null object
3 Type 2 414 non-null object
4 Total 800 non-null int64
5 HP 800 non-null int64
6 Attack 800 non-null int64
7 Defense 800 non-null int64
8 Sp. Atk 800 non-null int64
9 Sp. Def 800 non-null int64
10 Speed 800 non-null int64
11 Generation 800 non-null int64
12 Legendary 800 non-null bool
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB
- 'Legendary' : bool (True/False) -> needs converting to int (0/1)
- 'Generation' : a categorical label, not a numeric quantity -> will be one-hot encoded in a later step (pd.get_dummies accepts the int column directly)
In [31]:
df['Legendary'] = df['Legendary'].astype(int)
df['Generation'] = df['Generation'].astype(int)
preprocessed_df= df[['Type 1','Type 2', 'Total', 'HP', 'Attack', 'Defense',
'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']]
preprocessed_df.head()
Out[31]:
Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | 0 |
1 | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | 0 |
2 | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | 0 |
3 | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | 0 |
4 | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | 0 |
[Multi-Label Encoding: Type]¶
- Combine Type 1 and Type 2 into a single variable, then apply multi-label encoding.
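Before applying it to the full table, here is how scikit-learn's `MultiLabelBinarizer` turns type lists into indicator columns, on toy lists (hypothetical rows, not the real dataset):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy type lists mimicking the combined Type column
types = [['Grass', 'Poison'], ['Fire'], ['Water', 'Flying']]

mlb = MultiLabelBinarizer()
# fit_transform yields one 0/1 indicator column per distinct label, sorted
encoded = pd.DataFrame(mlb.fit_transform(types), columns=mlb.classes_)
print(encoded)
```

A Pokemon with two types simply gets a 1 in both of its type columns, which is what distinguishes this from plain one-hot encoding.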
In [32]:
# Build a Type list combining Type 1 and Type 2
def make_list(x1, x2):
    type_list = [x1]
    if pd.notnull(x2):  # Type 2 may be NaN; pd.notnull is safer than an identity check against np.nan
        type_list.append(x2)
    return type_list
preprocessed_df['Type'] = preprocessed_df.apply(lambda x: make_list(x['Type 1'], x['Type 2']), axis=1)
preprocessed_df.head()
Out[32]:
Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | 0 | [Grass, Poison] |
1 | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | 0 | [Grass, Poison] |
2 | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | 0 | [Grass, Poison] |
3 | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | 0 | [Grass, Poison] |
4 | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | 0 | [Fire] |
In [33]:
del preprocessed_df['Type 1']
del preprocessed_df['Type 2']
In [34]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
preprocessed_df = preprocessed_df.join(pd.DataFrame(mlb.fit_transform(preprocessed_df.pop('Type')),columns=mlb.classes_))
In [35]:
preprocessed_df.head()
Out[35]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | Bug | ... | Ghost | Grass | Ground | Ice | Normal | Poison | Psychic | Rock | Steel | Water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
[One-Hot Encoding - Generation]¶
In [36]:
preprocessed_df = pd.get_dummies(preprocessed_df, columns = ['Generation'])
preprocessed_df.head()
Out[36]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Legendary | Bug | Dark | ... | Psychic | Rock | Steel | Water | Generation_1 | Generation_2 | Generation_3 | Generation_4 | Generation_5 | Generation_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 rows × 32 columns
Feature Standardization¶
In [37]:
preprocessed_df.describe()
Out[37]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Legendary | Bug | Dark | ... | Psychic | Rock | Steel | Water | Generation_1 | Generation_2 | Generation_3 | Generation_4 | Generation_5 | Generation_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 800.00000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.00000 | 800.000000 | 800.00000 | ... | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.00000 | 800.000000 | 800.00000 | 800.000000 | 800.000000 | 800.000000 |
mean | 435.10250 | 69.258750 | 79.001250 | 73.842500 | 72.820000 | 71.902500 | 68.277500 | 0.08125 | 0.090000 | 0.06375 | ... | 0.112500 | 0.072500 | 0.061250 | 0.157500 | 0.20750 | 0.132500 | 0.20000 | 0.151250 | 0.206250 | 0.102500 |
std | 119.96304 | 25.534669 | 32.457366 | 31.183501 | 32.722294 | 27.828916 | 29.060474 | 0.27339 | 0.286361 | 0.24446 | ... | 0.316178 | 0.259476 | 0.239938 | 0.364499 | 0.40577 | 0.339246 | 0.40025 | 0.358517 | 0.404865 | 0.303494 |
min | 180.00000 | 1.000000 | 5.000000 | 5.000000 | 10.000000 | 20.000000 | 5.000000 | 0.00000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
25% | 330.00000 | 50.000000 | 55.000000 | 50.000000 | 49.750000 | 50.000000 | 45.000000 | 0.00000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
50% | 450.00000 | 65.000000 | 75.000000 | 70.000000 | 65.000000 | 70.000000 | 65.000000 | 0.00000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
75% | 515.00000 | 80.000000 | 100.000000 | 90.000000 | 95.000000 | 90.000000 | 90.000000 | 0.00000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
max | 780.00000 | 255.000000 | 190.000000 | 230.000000 | 194.000000 | 230.000000 | 180.000000 | 1.00000 | 1.000000 | 1.00000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 32 columns
- Standardization adjusts for scale differences between features.
In [38]:
# Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scale_columns = ['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def','Speed']
preprocessed_df[scale_columns] = scaler.fit_transform(preprocessed_df[scale_columns])
preprocessed_df.head()
Out[38]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Legendary | Bug | Dark | ... | Psychic | Rock | Steel | Water | Generation_1 | Generation_2 | Generation_3 | Generation_4 | Generation_5 | Generation_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.976765 | -0.950626 | -0.924906 | -0.797154 | -0.239130 | -0.248189 | -0.801503 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | -0.251088 | -0.362822 | -0.524130 | -0.347917 | 0.219560 | 0.291156 | -0.285015 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 0.749845 | 0.420917 | 0.092448 | 0.293849 | 0.831146 | 1.010283 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 1.583957 | 0.420917 | 0.647369 | 1.577381 | 1.503891 | 1.729409 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | -1.051836 | -1.185748 | -0.832419 | -0.989683 | -0.392027 | -0.787533 | -0.112853 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 rows × 32 columns
[Train/Test Set Split]¶
In [39]:
from sklearn.model_selection import train_test_split
x = preprocessed_df.loc[:,preprocessed_df.columns !='Legendary']
y = preprocessed_df['Legendary']
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.25, random_state=1)
In [40]:
x_train.shape, x_test.shape
Out[40]:
((600, 31), (200, 31))
3-2. Training a Logistic Regression Model¶
[Model Training]¶
In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# 1. Fit on the training set
logit = LogisticRegression(random_state=1)
logit.fit(x_train, y_train)
# 2. Predict on the test set
y_pred = logit.predict(x_test)
[Model Evaluation]¶
In [42]:
# Classification result
print("Accuracy : %.3f" % accuracy_score(y_test, y_pred))
print("Precision : %.3f" % precision_score(y_test, y_pred))
print("Recall : %.3f" % recall_score(y_test, y_pred))
print("F1 : %.3f" % f1_score(y_test, y_pred))
Accuracy : 0.930
Precision : 0.615
Recall : 0.471
F1 : 0.533
- Accuracy is 93%, but Precision, Recall, and the F1 score are much lower.
- This is the accuracy paradox caused by class imbalance in the training data -> the imbalance needs to be corrected.
- We confirm this with the confusion matrix.
In [43]:
from sklearn.metrics import confusion_matrix
# Print Confusion Matrix
confu = confusion_matrix(y_true = y_test, y_pred = y_pred)
plt.figure(figsize=(4,3))
sns.heatmap(confu, annot=True, annot_kws={'size':15}, cmap='OrRd', fmt='.10g')
plt.title('Confusion Matrix')
plt.show()
3-3. Adjusting for Class Imbalance¶
In [44]:
preprocessed_df['Legendary'].value_counts()
Out[44]:
0 735
1 65
Name: Legendary, dtype: int64
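The 1:1 undersampling performed next discards many majority-class rows; an alternative that keeps all data is logistic regression's `class_weight='balanced'` option, which reweights the loss instead. A minimal sketch on synthetic imbalanced data (not the Pokemon features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic ~9:1 imbalanced binary problem standing in for Legendary vs. not
X, y = make_classification(n_samples=800, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

plain = LogisticRegression(random_state=1).fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight='balanced', random_state=1).fit(X_tr, y_tr)

# Reweighting typically trades some precision for better minority-class recall
print("recall (plain)   : %.3f" % recall_score(y_te, plain.predict(X_te)))
print("recall (balanced): %.3f" % recall_score(y_te, balanced.predict(X_te)))
```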
[Sampling Positives and Negatives at a 1:1 Ratio]¶
In [45]:
positive_random_idx = preprocessed_df[preprocessed_df['Legendary']==1].sample(65, random_state=22).index.tolist()
negative_random_idx = preprocessed_df[preprocessed_df['Legendary']==0].sample(65, random_state=22).index.tolist()
[Train/Test Set Split]¶
In [46]:
random_idx = positive_random_idx+negative_random_idx
x = preprocessed_df.loc[random_idx, preprocessed_df.columns !='Legendary']
y = preprocessed_df['Legendary'][random_idx]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
In [47]:
x_train.shape, x_test.shape
Out[47]:
((97, 31), (33, 31))
[Model Retraining]¶
In [48]:
# 1. Fit on the training set
logit2 = LogisticRegression(random_state=0)
logit2.fit(x_train, y_train)
# 2. Predict on the test set
y_pred2 = logit2.predict(x_test)
[Model Re-evaluation]¶
In [49]:
#3. Classification Result
print("Accuracy : %.2f"% accuracy_score(y_test,y_pred2))
print("Precision : %.2f"% precision_score(y_test,y_pred2))
print("Recall : %.2f"% recall_score(y_test,y_pred2))
print("F1 : %.2f"% f1_score(y_test,y_pred2))
Accuracy : 0.82
Precision : 0.75
Recall : 0.94
F1 : 0.83
In [50]:
#Confusion Matrix
confu2 = confusion_matrix(y_true=y_test, y_pred=y_pred2)
plt.figure(figsize=(4,3))
sns.heatmap(confu2, annot=True, annot_kws={'size':15}, cmap='OrRd', fmt='.10g')
plt.title('Confusion Matrix')
plt.show()
4. Cluster Analysis (Unsupervised Learning: K-Means Clustering)¶
4-1. K-Means Cluster Analysis¶
(1) 2-D Clustering¶
In [51]:
from sklearn.cluster import KMeans
# k-means train & Elbow method
x = preprocessed_df[['Attack', 'Defense']]
k_list = []
cost_list = []
for k in range (1,8):
kmeans = KMeans(n_clusters=k).fit(x)
inertia = kmeans.inertia_ #inertia : sum of squared distances of samples to their closest cluster center.
print("k:", k, "| cost:", inertia)
k_list.append(k)
cost_list.append(inertia)
plt.figure(figsize=(8,6))
plt.plot(k_list, cost_list)
k: 1 | cost: 1599.9999999999998
k: 2 | cost: 853.3477298974243
k: 3 | cost: 642.4259559026589
k: 4 | cost: 480.49450250321513
k: 5 | cost: 403.8367296740125
k: 6 | cost: 343.53968485943227
k: 7 | cost: 295.10797374168703
Out[51]:
[<matplotlib.lines.Line2D at 0x23f4c4dd5b0>]
- By the elbow rule, 4 clusters looks most appropriate.
- We fit with n_clusters=4 and append each point's cluster assignment to the original dataset.
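The elbow rule is a heuristic; the silhouette score gives a second opinion on k. A small sketch on toy 2-D blobs (hypothetical data, not the standardized Pokemon stats):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: three well-separated 2-D blobs, so k=3 should score best
x = np.vstack([rng.normal(c, 0.3, size=(60, 2)) for c in (0, 3, 6)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
    scores[k] = silhouette_score(x, labels)
    print("k:", k, "| silhouette: %.3f" % scores[k])
```

Higher is better (max 1.0); unlike inertia, the silhouette score does not decrease monotonically with k, so its peak suggests a candidate k directly.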
In [52]:
# k-means fitting and predict
kmeans = KMeans(n_clusters=4).fit(x)
cluster_num = pd.Series(kmeans.predict(x))
preprocessed_df['cluster_num'] = cluster_num.values
preprocessed_df.head()
Out[52]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Legendary | Bug | Dark | ... | Rock | Steel | Water | Generation_1 | Generation_2 | Generation_3 | Generation_4 | Generation_5 | Generation_6 | cluster_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.976765 | -0.950626 | -0.924906 | -0.797154 | -0.239130 | -0.248189 | -0.801503 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | -0.251088 | -0.362822 | -0.524130 | -0.347917 | 0.219560 | 0.291156 | -0.285015 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0.749845 | 0.420917 | 0.092448 | 0.293849 | 0.831146 | 1.010283 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1.583957 | 0.420917 | 0.647369 | 1.577381 | 1.503891 | 1.729409 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
4 | -1.051836 | -1.185748 | -0.832419 | -0.989683 | -0.392027 | -0.787533 | -0.112853 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 33 columns
In [53]:
print(preprocessed_df['cluster_num'].value_counts().sort_index())
0 309
1 253
2 110
3 128
Name: cluster_num, dtype: int64
[Cluster Visualization]
In [54]:
plt.scatter(preprocessed_df[preprocessed_df['cluster_num']==0]['Attack'],
            preprocessed_df[preprocessed_df['cluster_num']==0]['Defense'],
            s=50, c='red', alpha=0.5, label='Pokemon Group 1')
plt.scatter(preprocessed_df[preprocessed_df['cluster_num']==1]['Attack'],
            preprocessed_df[preprocessed_df['cluster_num']==1]['Defense'],
            s=50, c='green', alpha=0.5, label='Pokemon Group 2')
plt.scatter(preprocessed_df[preprocessed_df['cluster_num']==2]['Attack'],
            preprocessed_df[preprocessed_df['cluster_num']==2]['Defense'],
            s=50, c='blue', alpha=0.5, label='Pokemon Group 3')
plt.scatter(preprocessed_df[preprocessed_df['cluster_num']==3]['Attack'],
            preprocessed_df[preprocessed_df['cluster_num']==3]['Defense'],
            s=50, c='yellow', alpha=0.5, label='Pokemon Group 4')
plt.title('Pokemon Clustering by Attack & Defense')
plt.xlabel('Attack')
plt.ylabel('Defense')
plt.legend()
plt.show()
(2) Multi-dimensional Clustering¶
In [55]:
from sklearn.cluster import KMeans
# K-Means train & Elbow method
x=preprocessed_df[['HP','Attack','Defense','Sp. Atk', 'Sp. Def', 'Speed']]
k_list=[]
cost_list=[]
for k in range (1,15) :
kmeans = KMeans(n_clusters=k).fit(x)
inertia = kmeans.inertia_
print("k:",k,"| cost:", inertia)
k_list.append(k)
cost_list.append(inertia)
plt.figure(figsize=(8,6))
plt.plot(k_list, cost_list)
k: 1 | cost: 4799.999999999997
k: 2 | cost: 3275.3812330305987
k: 3 | cost: 2863.2667087778505
k: 4 | cost: 2566.99730904829
k: 5 | cost: 2328.1703679900997
k: 6 | cost: 2184.347139933859
k: 7 | cost: 2068.7487035521403
k: 8 | cost: 1961.2094162381982
k: 9 | cost: 1859.1836088009418
k: 10 | cost: 1773.3020306872008
k: 11 | cost: 1704.9590743177353
k: 12 | cost: 1645.7546401112147
k: 13 | cost: 1575.6760121444784
k: 14 | cost: 1535.1999592854825
Out[55]:
[<matplotlib.lines.Line2D at 0x23f4c9f69a0>]
- 5 clusters looks most appropriate.
In [56]:
# k-means fitting and predict
kmeans = KMeans(n_clusters=5).fit(x)
cluster_num = pd.Series(kmeans.predict(x))
preprocessed_df['cluster_num'] = cluster_num.values
preprocessed_df.head()
Out[56]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Legendary | Bug | Dark | ... | Rock | Steel | Water | Generation_1 | Generation_2 | Generation_3 | Generation_4 | Generation_5 | Generation_6 | cluster_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.976765 | -0.950626 | -0.924906 | -0.797154 | -0.239130 | -0.248189 | -0.801503 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | -0.251088 | -0.362822 | -0.524130 | -0.347917 | 0.219560 | 0.291156 | -0.285015 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 4 |
2 | 0.749845 | 0.420917 | 0.092448 | 0.293849 | 0.831146 | 1.010283 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 4 |
3 | 1.583957 | 0.420917 | 0.647369 | 1.577381 | 1.503891 | 1.729409 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 |
4 | -1.051836 | -1.185748 | -0.832419 | -0.989683 | -0.392027 | -0.787533 | -0.112853 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 33 columns
[Visualizing Cluster Profiles]
- We examine each feature's distribution by cluster.
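The boxplots that follow can also be condensed into a single table of per-cluster means via groupby; a minimal sketch on a hypothetical miniature frame (illustrative values only):

```python
import pandas as pd

# Toy stand-in: standardized stats with an assigned cluster label
toy = pd.DataFrame({
    'HP':          [-1.0, -0.8, 0.9, 1.1],
    'Attack':      [-0.9, -0.7, 1.0, 1.2],
    'cluster_num': [0, 0, 1, 1],
})

# One row per cluster, one column per stat: a compact cluster profile
profile = toy.groupby('cluster_num').mean()
print(profile)
```

On the real data, `preprocessed_df.groupby('cluster_num')[stat].mean()` summarizes all six stat boxplots in one table.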
In [57]:
# HP
sns.boxplot(x='cluster_num', y='HP', data=preprocessed_df)
plt.title('HP Distribution by Cluster')
plt.show()
In [58]:
# Attack
sns.boxplot(x='cluster_num', y='Attack', data=preprocessed_df)
plt.title('Attack Distribution by Cluster')
plt.show()
In [59]:
# Defense
sns.boxplot(x='cluster_num', y='Defense', data=preprocessed_df)
plt.title('Defense Distribution by Cluster')
plt.show()
In [61]:
# Sp. Atk
sns.boxplot(x='cluster_num', y='Sp. Atk', data=preprocessed_df)
plt.title('Sp. Atk Distribution by Cluster')
plt.show()
In [62]:
# Sp. Def
sns.boxplot(x='cluster_num', y='Sp. Def', data=preprocessed_df)
plt.title('Sp. Def Distribution by Cluster')
plt.show()
In [63]:
# Speed
sns.boxplot(x='cluster_num', y='Speed', data=preprocessed_df)
plt.title('Speed Distribution by Cluster')
plt.show()