Data Analysis / Data Analysis Basics
Classification: Pokemon Dataset
박째롱
2022. 2. 25. 16:48
[Pokemon Classification]¶
0. Overview¶
We will run a classification analysis on the Pokemon dataset.
- Supervised learning (Logistic Regression): predict whether a Pokemon is legendary ('Legendary' 0 or 1)
- Unsupervised learning (K-Means Clustering): cluster analysis of Pokemon
1. Library & Data Import¶
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Import the Pokemon dataset
df = pd.read_csv("https://raw.githubusercontent.com/yoonkt200/FastCampusDataset/master/Pokemon.csv")
In [3]:
plt.rcParams['figure.figsize'] = (8, 6)  # default figure size
plt.rcParams['font.family'] = 'Malgun Gothic'  # font that renders Korean labels correctly
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))  # widen the notebook container
In [4]:
df.head()
Out[4]:
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
Feature Description¶
- Name : Pokemon name
- Type 1/2 : primary/secondary type
- Total : sum of all base stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed)
- HP : HP stat
- Attack : Attack stat
- Defense : Defense stat
- Sp. Atk : Sp. Atk (Special Attack) stat
- Sp. Def : Sp. Def (Special Defense) stat
- Speed : Speed stat
- Generation : generation the Pokemon was introduced in
- Legendary : whether the Pokemon is legendary
2. EDA (Exploratory Data Analysis)¶
2-1. Basic Information¶
[Exploring the full dataset]
In [5]:
# Dimension
df.shape
Out[5]:
(800, 13)
In [6]:
# Data types and non-null counts
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 # 800 non-null int64
1 Name 800 non-null object
2 Type 1 800 non-null object
3 Type 2 414 non-null object
4 Total 800 non-null int64
5 HP 800 non-null int64
6 Attack 800 non-null int64
7 Defense 800 non-null int64
8 Sp. Atk 800 non-null int64
9 Sp. Def 800 non-null int64
10 Speed 800 non-null int64
11 Generation 800 non-null int64
12 Legendary 800 non-null bool
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB
In [7]:
# Missing values
df.isnull().sum()
Out[7]:
# 0
Name 0
Type 1 0
Type 2 386
Total 0
HP 0
Attack 0
Defense 0
Sp. Atk 0
Sp. Def 0
Speed 0
Generation 0
Legendary 0
dtype: int64
[Exploring individual features]
- "Legendary" (classification target)
In [8]:
# Count per class
df['Legendary'].value_counts()
Out[8]:
False 735
True 65
Name: Legendary, dtype: int64
- "Generation" (Pokemon generation)
In [9]:
# Count per generation
df['Generation'].value_counts().sort_index()
Out[9]:
1 166
2 106
3 160
4 121
5 165
6 82
Name: Generation, dtype: int64
In [10]:
# Count per generation, visualized
df['Generation'].value_counts().sort_index().plot.bar()
plt.title('Count per Generation')
Out[10]:
Text(0.5, 1.0, 'Count per Generation')
- "Type 1" & "Type 2" (Pokemon types)
In [11]:
# Count per Type 1 category, visualized
df['Type 1'].value_counts().plot.bar()
Out[11]:
<AxesSubplot:>
In [12]:
# Count per Type 2 category, visualized
df['Type 2'].value_counts().plot.bar()
Out[12]:
<AxesSubplot:>
In [13]:
# Number of distinct Type 2 values (excluding NaN)
len(df[df['Type 2'].notnull()]['Type 2'].unique())
Out[13]:
18
2-2. Exploring Feature Distributions¶
- We examine each feature's distribution over the whole dataset.
- We also check how each distribution differs across the classes of the target, 'Legendary'.
[Base stat distributions]
In [14]:
# Base stat distributions over all Pokemon
stat = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
sns.boxplot(data=df[stat])
plt.title("Base Stat Distributions")
plt.show()
In [15]:
# Base stat distributions by Legendary status
fig = plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
sns.boxplot(data=df[df['Legendary']==1][stat])
plt.title('Legendary = True')
plt.subplot(1, 2, 2)
sns.boxplot(data=df[df['Legendary']==0][stat])
plt.title('Legendary = False')
fig.suptitle('Base Stat Distributions', fontsize=20)
plt.show()
- Legendary = True shows overall higher values on every stat than Legendary = False.
- Attack and Sp. Atk in particular appear relatively high.
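The gap can be quantified with a grouped mean as well; a minimal sketch on a hypothetical miniature of the DataFrame (illustrative values only, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the Pokemon DataFrame (values are illustrative only)
df = pd.DataFrame({
    'Attack':    [49, 62, 100, 150, 180],
    'Sp. Atk':   [65, 80, 122, 154, 160],
    'Legendary': [False, False, False, True, True],
})

# Mean of each stat by Legendary status quantifies the boxplot gap
means = df.groupby('Legendary')[['Attack', 'Sp. Atk']].mean()
print(means)
```

The same `groupby('Legendary')[stat].mean()` call applied to the full dataset yields one row per class, per stat.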
[Distribution of Total]
In [16]:
# Distribution of Total over all Pokemon
df['Total'].hist(bins=50)
plt.title('Distribution of Total')
Out[16]:
Text(0.5, 1.0, 'Distribution of Total')
In [17]:
# Distribution of Total by Legendary status
sns.histplot(data=df, x='Total', hue="Legendary", bins=50)
plt.title("Distribution of Total by Legendary Status")
plt.show()
In [18]:
# Distribution of Total by generation
sns.boxplot(data=df, x='Generation', y='Total')
plt.title("Distribution of Total by Generation")
plt.show()
In [19]:
# Distribution of Total by generation and Legendary status
sns.boxplot(data=df, x='Generation', y='Total', hue='Legendary')
plt.title("Distribution of Total by Generation and Legendary Status")
plt.show()
In [20]:
# Distribution of Total by Type 1
sns.boxplot(data=df, x='Type 1', y='Total')
plt.title('Distribution of Total by Type 1')
plt.xticks(rotation=45)
plt.show()
[Type 1 & 2 Distributions]
In [21]:
# Type 1 distribution
df['Type 1'].value_counts(sort=False).sort_index().plot.bar()
plt.title("Type 1 Distribution")
Out[21]:
Text(0.5, 1.0, 'Type 1 Distribution')
In [22]:
# Type 1 distribution by Legendary status
T1_Total = pd.DataFrame(df['Type 1'].value_counts().sort_index())
T1_NotLeg = pd.DataFrame(df[df['Legendary']==0].groupby('Type 1').size())
T1_count = pd.concat([T1_Total, T1_NotLeg], axis=1)
T1_count.columns = ['Total', 'Not Legend']
T1_count['Legend'] = T1_count['Total'] - T1_count['Not Legend']
T1_count
Out[22]:
Total | Not Legend | Legend | |
---|---|---|---|
Bug | 69 | 69 | 0 |
Dark | 31 | 29 | 2 |
Dragon | 32 | 20 | 12 |
Electric | 44 | 40 | 4 |
Fairy | 17 | 16 | 1 |
Fighting | 27 | 27 | 0 |
Fire | 52 | 47 | 5 |
Flying | 4 | 2 | 2 |
Ghost | 32 | 30 | 2 |
Grass | 70 | 67 | 3 |
Ground | 32 | 28 | 4 |
Ice | 24 | 22 | 2 |
Normal | 98 | 96 | 2 |
Poison | 28 | 28 | 0 |
Psychic | 57 | 43 | 14 |
Rock | 44 | 40 | 4 |
Steel | 27 | 23 | 4 |
Water | 112 | 108 | 4 |
In [23]:
T1_count[['Not Legend','Legend']].plot.bar()
Out[23]:
<AxesSubplot:>
In [24]:
# Type 2 distribution
df['Type 2'].value_counts(sort=False).sort_index().plot.bar()
plt.title('Type 2 Distribution')
Out[24]:
Text(0.5, 1.0, 'Type 2 Distribution')
In [25]:
# Type 2 distribution by Legendary status
T2_total = pd.DataFrame(df['Type 2'].value_counts().sort_index())
T2_NotLeg = pd.DataFrame(df[df['Legendary']==0].groupby('Type 2').size())
T2_Count = pd.concat([T2_total,T2_NotLeg],axis=1)
T2_Count.columns = ['Total', 'Not Legend']
T2_Count['Legend'] = T2_Count['Total']-T2_Count['Not Legend']
T2_Count
Out[25]:
Total | Not Legend | Legend | |
---|---|---|---|
Bug | 3 | 3 | 0 |
Dark | 20 | 19 | 1 |
Dragon | 18 | 14 | 4 |
Electric | 6 | 5 | 1 |
Fairy | 23 | 21 | 2 |
Fighting | 26 | 22 | 4 |
Fire | 12 | 9 | 3 |
Flying | 97 | 84 | 13 |
Ghost | 14 | 13 | 1 |
Grass | 25 | 25 | 0 |
Ground | 35 | 34 | 1 |
Ice | 14 | 11 | 3 |
Normal | 4 | 4 | 0 |
Poison | 34 | 34 | 0 |
Psychic | 33 | 28 | 5 |
Rock | 14 | 14 | 0 |
Steel | 22 | 21 | 1 |
Water | 14 | 13 | 1 |
In [26]:
T2_Count[['Not Legend', 'Legend']].plot.bar()
plt.title("Type 2 Distribution by Legendary Status")
Out[26]:
Text(0.5, 1.0, 'Type 2 Distribution by Legendary Status')
[Generation Distribution]
In [27]:
Gen_NotLeg = pd.DataFrame(df[df['Legendary']==0].groupby('Generation').size())
Gen_Leg = pd.DataFrame(df[df['Legendary']==1].groupby('Generation').size())
Gen_Count = pd.concat([Gen_NotLeg, Gen_Leg], axis=1)
Gen_Count.columns = ['Not Legend', 'Legend']
Gen_Count['Total'] = Gen_Count['Not Legend']+Gen_Count['Legend']
Gen_Count
Out[27]:
Not Legend | Legend | Total | |
---|---|---|---|
Generation | |||
1 | 160 | 6 | 166 |
2 | 101 | 5 | 106 |
3 | 142 | 18 | 160 |
4 | 108 | 13 | 121 |
5 | 150 | 15 | 165 |
6 | 74 | 8 | 82 |
In [28]:
Gen_Count[['Not Legend', 'Legend']].plot.bar()
Out[28]:
<AxesSubplot:xlabel='Generation'>
3. Classification Analysis (Supervised Learning): Logistic Regression¶
3-1. Data Preprocessing¶
[Changing Data Types]¶
In [29]:
df.head()
Out[29]:
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
In [30]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 # 800 non-null int64
1 Name 800 non-null object
2 Type 1 800 non-null object
3 Type 2 414 non-null object
4 Total 800 non-null int64
5 HP 800 non-null int64
6 Attack 800 non-null int64
7 Defense 800 non-null int64
8 Sp. Atk 800 non-null int64
9 Sp. Def 800 non-null int64
10 Speed 800 non-null int64
11 Generation 800 non-null int64
12 Legendary 800 non-null bool
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB
- 'Legendary' : bool (True/False) -> needs converting to int (0/1)
- 'Generation' : a categorical label, not a numeric quantity -> will be one-hot encoded in a later step (pd.get_dummies accepts the int column directly)
In [31]:
df['Legendary'] = df['Legendary'].astype(int)
df['Generation'] = df['Generation'].astype(int)
preprocessed_df= df[['Type 1','Type 2', 'Total', 'HP', 'Attack', 'Defense',
'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']]
preprocessed_df.head()
Out[31]:
Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | 0 |
1 | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | 0 |
2 | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | 0 |
3 | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | 0 |
4 | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | 0 |
[Multi-Label Encoding: Type]¶
- Combine Type 1 and Type 2 into a single variable, then apply multi-label encoding.
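Before applying it to the full table, here is how scikit-learn's `MultiLabelBinarizer` turns type lists into indicator columns, on toy lists (hypothetical rows, not the real dataset):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy type lists mimicking the combined Type column
types = [['Grass', 'Poison'], ['Fire'], ['Water', 'Flying']]

mlb = MultiLabelBinarizer()
# fit_transform yields one 0/1 indicator column per distinct label, sorted
encoded = pd.DataFrame(mlb.fit_transform(types), columns=mlb.classes_)
print(encoded)
```

A Pokemon with two types simply gets a 1 in both of its type columns, which is what distinguishes this from plain one-hot encoding.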
In [32]:
# Build a Type list combining Type 1 and Type 2
def make_list(x1, x2):
    type_list = [x1]
    if pd.notnull(x2):  # Type 2 may be NaN; pd.notnull is safer than an identity check against np.nan
        type_list.append(x2)
    return type_list
preprocessed_df['Type'] = preprocessed_df.apply(lambda x: make_list(x['Type 1'], x['Type 2']), axis=1)
preprocessed_df.head()
Out[32]:
Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | 0 | [Grass, Poison] |
1 | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | 0 | [Grass, Poison] |
2 | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | 0 | [Grass, Poison] |
3 | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | 0 | [Grass, Poison] |
4 | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | 0 | [Fire] |
In [33]:
del preprocessed_df['Type 1']
del preprocessed_df['Type 2']
In [34]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
preprocessed_df = preprocessed_df.join(pd.DataFrame(mlb.fit_transform(preprocessed_df.pop('Type')),columns=mlb.classes_))
In [35]:
preprocessed_df.head()
Out[35]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | Bug | ... | Ghost | Grass | Ground | Ice | Normal | Poison | Psychic | Rock | Steel | Water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
[One-Hot Encoding - Generation]¶
In [36]:
preprocessed_df = pd.get_dummies(preprocessed_df, columns = ['Generation'])
preprocessed_df.head()
Out[36]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Legendary | Bug | Dark | ... | Psychic | Rock | Steel | Water | Generation_1 | Generation_2 | Generation_3 | Generation_4 | Generation_5 | Generation_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 rows × 32 columns
Feature Standardization¶
In [37]:
preprocessed_df.describe()
Out[37]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Legendary | Bug | Dark | ... | Psychic | Rock | Steel | Water | Generation_1 | Generation_2 | Generation_3 | Generation_4 | Generation_5 | Generation_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 800.00000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.00000 | 800.000000 | 800.00000 | ... | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.00000 | 800.000000 | 800.00000 | 800.000000 | 800.000000 | 800.000000 |
mean | 435.10250 | 69.258750 | 79.001250 | 73.842500 | 72.820000 | 71.902500 | 68.277500 | 0.08125 | 0.090000 | 0.06375 | ... | 0.112500 | 0.072500 | 0.061250 | 0.157500 | 0.20750 | 0.132500 | 0.20000 | 0.151250 | 0.206250 | 0.102500 |
std | 119.96304 | 25.534669 | 32.457366 | 31.183501 | 32.722294 | 27.828916 | 29.060474 | 0.27339 | 0.286361 | 0.24446 | ... | 0.316178 | 0.259476 | 0.239938 | 0.364499 | 0.40577 | 0.339246 | 0.40025 | 0.358517 | 0.404865 | 0.303494 |
min | 180.00000 | 1.000000 | 5.000000 | 5.000000 | 10.000000 | 20.000000 | 5.000000 | 0.00000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
25% | 330.00000 | 50.000000 | 55.000000 | 50.000000 | 49.750000 | 50.000000 | 45.000000 | 0.00000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
50% | 450.00000 | 65.000000 | 75.000000 | 70.000000 | 65.000000 | 70.000000 | 65.000000 | 0.00000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
75% | 515.00000 | 80.000000 | 100.000000 | 90.000000 | 95.000000 | 90.000000 | 90.000000 | 0.00000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
max | 780.00000 | 255.000000 | 190.000000 | 230.000000 | 194.000000 | 230.000000 | 180.000000 | 1.00000 | 1.000000 | 1.00000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 32 columns
- Standardization adjusts for scale differences between features.
In [38]:
# Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scale_columns = ['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def','Speed']
preprocessed_df[scale_columns] = scaler.fit_transform(preprocessed_df[scale_columns])
preprocessed_df.head()
Out[38]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Legendary | Bug | Dark | ... | Psychic | Rock | Steel | Water | Generation_1 | Generation_2 | Generation_3 | Generation_4 | Generation_5 | Generation_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.976765 | -0.950626 | -0.924906 | -0.797154 | -0.239130 | -0.248189 | -0.801503 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | -0.251088 | -0.362822 | -0.524130 | -0.347917 | 0.219560 | 0.291156 | -0.285015 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 0.749845 | 0.420917 | 0.092448 | 0.293849 | 0.831146 | 1.010283 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 1.583957 | 0.420917 | 0.647369 | 1.577381 | 1.503891 | 1.729409 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | -1.051836 | -1.185748 | -0.832419 | -0.989683 | -0.392027 | -0.787533 | -0.112853 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 rows × 32 columns
[Train/Test Set Split]¶
In [39]:
from sklearn.model_selection import train_test_split
x = preprocessed_df.loc[:,preprocessed_df.columns !='Legendary']
y = preprocessed_df['Legendary']
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.25, random_state=1)
In [40]:
x_train.shape, x_test.shape
Out[40]:
((600, 31), (200, 31))
3-2. Training a Logistic Regression Model¶
[Model Training]¶
In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# 1. Fit on the training set
logit = LogisticRegression(random_state=1)
logit.fit(x_train, y_train)
# 2. Predict on the test set
y_pred = logit.predict(x_test)
[Model Evaluation]¶
In [42]:
# Classification result
print("Accuracy : %.3f" % accuracy_score(y_test, y_pred))
print("Precision : %.3f" % precision_score(y_test, y_pred))
print("Recall : %.3f" % recall_score(y_test, y_pred))
print("F1 : %.3f" % f1_score(y_test, y_pred))
Accuracy : 0.930
Precision : 0.615
Recall : 0.471
F1 : 0.533
- Accuracy is 93%, but Precision, Recall, and the F1 score are much lower.
- This is the accuracy paradox caused by class imbalance in the training data -> the imbalance needs to be corrected.
- We confirm this with the confusion matrix.
In [43]:
from sklearn.metrics import confusion_matrix
# Print Confusion Matrix
confu = confusion_matrix(y_true = y_test, y_pred = y_pred)
plt.figure(figsize=(4,3))
sns.heatmap(confu, annot=True, annot_kws={'size':15}, cmap='OrRd', fmt='.10g')
plt.title('Confusion Matrix')
plt.show()
3-3. Adjusting for Class Imbalance¶
In [44]:
preprocessed_df['Legendary'].value_counts()
Out[44]:
0 735
1 65
Name: Legendary, dtype: int64
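The 1:1 undersampling performed next discards many majority-class rows; an alternative that keeps all data is logistic regression's `class_weight='balanced'` option, which reweights the loss instead. A minimal sketch on synthetic imbalanced data (not the Pokemon features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic ~9:1 imbalanced binary problem standing in for Legendary vs. not
X, y = make_classification(n_samples=800, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

plain = LogisticRegression(random_state=1).fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight='balanced', random_state=1).fit(X_tr, y_tr)

# Reweighting typically trades some precision for better minority-class recall
print("recall (plain)   : %.3f" % recall_score(y_te, plain.predict(X_te)))
print("recall (balanced): %.3f" % recall_score(y_te, balanced.predict(X_te)))
```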
[Sampling Positives and Negatives at a 1:1 Ratio]¶
In [45]:
positive_random_idx = preprocessed_df[preprocessed_df['Legendary']==1].sample(65, random_state=22).index.tolist()
negative_random_idx = preprocessed_df[preprocessed_df['Legendary']==0].sample(65, random_state=22).index.tolist()
[Train/Test Set Split]¶
In [46]:
random_idx = positive_random_idx+negative_random_idx
x = preprocessed_df.loc[random_idx, preprocessed_df.columns !='Legendary']
y = preprocessed_df['Legendary'][random_idx]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
In [47]:
x_train.shape, x_test.shape
Out[47]:
((97, 31), (33, 31))
[Model Retraining]¶
In [48]:
# 1. Fit on the training set
logit2 = LogisticRegression(random_state=0)
logit2.fit(x_train, y_train)
# 2. Predict on the test set
y_pred2 = logit2.predict(x_test)
[Model Re-evaluation]¶
In [49]:
#3. Classification Result
print("Accuracy : %.2f"% accuracy_score(y_test,y_pred2))
print("Precision : %.2f"% precision_score(y_test,y_pred2))
print("Recall : %.2f"% recall_score(y_test,y_pred2))
print("F1 : %.2f"% f1_score(y_test,y_pred2))
Accuracy : 0.82
Precision : 0.75
Recall : 0.94
F1 : 0.83
In [50]:
#Confusion Matrix
confu2 = confusion_matrix(y_true=y_test, y_pred=y_pred2)
plt.figure(figsize=(4,3))
sns.heatmap(confu2, annot=True, annot_kws={'size':15}, cmap='OrRd', fmt='.10g')
plt.title('Confusion Matrix')
plt.show()
4. Cluster Analysis (Unsupervised Learning: K-Means Clustering)¶
4-1. K-Means Cluster Analysis¶
(1) 2-D Clustering¶
In [51]:
from sklearn.cluster import KMeans
# k-means train & Elbow method
x = preprocessed_df[['Attack', 'Defense']]
k_list = []
cost_list = []
for k in range (1,8):
kmeans = KMeans(n_clusters=k).fit(x)
inertia = kmeans.inertia_ #inertia : sum of squared distances of samples to their closest cluster center.
print("k:", k, "| cost:", inertia)
k_list.append(k)
cost_list.append(inertia)
plt.figure(figsize=(8,6))
plt.plot(k_list, cost_list)
k: 1 | cost: 1599.9999999999998
k: 2 | cost: 853.3477298974243
k: 3 | cost: 642.4259559026589
k: 4 | cost: 480.49450250321513
k: 5 | cost: 403.8367296740125
k: 6 | cost: 343.53968485943227
k: 7 | cost: 295.10797374168703
Out[51]:
[<matplotlib.lines.Line2D at 0x23f4c4dd5b0>]
- By the elbow rule, 4 clusters looks most appropriate.
- We fit with n_clusters=4 and append each point's cluster assignment to the original dataset.
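The elbow rule is a heuristic; the silhouette score gives a second opinion on k. A small sketch on toy 2-D blobs (hypothetical data, not the standardized Pokemon stats):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: three well-separated 2-D blobs, so k=3 should score best
x = np.vstack([rng.normal(c, 0.3, size=(60, 2)) for c in (0, 3, 6)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
    scores[k] = silhouette_score(x, labels)
    print("k:", k, "| silhouette: %.3f" % scores[k])
```

Higher is better (max 1.0); unlike inertia, the silhouette score does not decrease monotonically with k, so its peak suggests a candidate k directly.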
In [52]:
# k-means fitting and predict
kmeans = KMeans(n_clusters=4).fit(x)
cluster_num = pd.Series(kmeans.predict(x))
preprocessed_df['cluster_num'] = cluster_num.values
preprocessed_df.head()
Out[52]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Legendary | Bug | Dark | ... | Rock | Steel | Water | Generation_1 | Generation_2 | Generation_3 | Generation_4 | Generation_5 | Generation_6 | cluster_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.976765 | -0.950626 | -0.924906 | -0.797154 | -0.239130 | -0.248189 | -0.801503 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | -0.251088 | -0.362822 | -0.524130 | -0.347917 | 0.219560 | 0.291156 | -0.285015 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0.749845 | 0.420917 | 0.092448 | 0.293849 | 0.831146 | 1.010283 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1.583957 | 0.420917 | 0.647369 | 1.577381 | 1.503891 | 1.729409 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
4 | -1.051836 | -1.185748 | -0.832419 | -0.989683 | -0.392027 | -0.787533 | -0.112853 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 33 columns
In [53]:
print(preprocessed_df['cluster_num'].value_counts().sort_index())
0 309
1 253
2 110
3 128
Name: cluster_num, dtype: int64
[Cluster Visualization]
In [54]:
plt.scatter(preprocessed_df[preprocessed_df['cluster_num']==0]['Attack'],
            preprocessed_df[preprocessed_df['cluster_num']==0]['Defense'],
            s=50, c='red', alpha=0.5, label='Pokemon Group 1')
plt.scatter(preprocessed_df[preprocessed_df['cluster_num']==1]['Attack'],
            preprocessed_df[preprocessed_df['cluster_num']==1]['Defense'],
            s=50, c='green', alpha=0.5, label='Pokemon Group 2')
plt.scatter(preprocessed_df[preprocessed_df['cluster_num']==2]['Attack'],
            preprocessed_df[preprocessed_df['cluster_num']==2]['Defense'],
            s=50, c='blue', alpha=0.5, label='Pokemon Group 3')
plt.scatter(preprocessed_df[preprocessed_df['cluster_num']==3]['Attack'],
            preprocessed_df[preprocessed_df['cluster_num']==3]['Defense'],
            s=50, c='yellow', alpha=0.5, label='Pokemon Group 4')
plt.title('Pokemon Clustering by Attack & Defense')
plt.xlabel('Attack')
plt.ylabel('Defense')
plt.legend()
plt.show()
(2) Multi-dimensional Clustering¶
In [55]:
from sklearn.cluster import KMeans
# K-Means train & Elbow method
x=preprocessed_df[['HP','Attack','Defense','Sp. Atk', 'Sp. Def', 'Speed']]
k_list=[]
cost_list=[]
for k in range (1,15) :
kmeans = KMeans(n_clusters=k).fit(x)
inertia = kmeans.inertia_
print("k:",k,"| cost:", inertia)
k_list.append(k)
cost_list.append(inertia)
plt.figure(figsize=(8,6))
plt.plot(k_list, cost_list)
k: 1 | cost: 4799.999999999997
k: 2 | cost: 3275.3812330305987
k: 3 | cost: 2863.2667087778505
k: 4 | cost: 2566.99730904829
k: 5 | cost: 2328.1703679900997
k: 6 | cost: 2184.347139933859
k: 7 | cost: 2068.7487035521403
k: 8 | cost: 1961.2094162381982
k: 9 | cost: 1859.1836088009418
k: 10 | cost: 1773.3020306872008
k: 11 | cost: 1704.9590743177353
k: 12 | cost: 1645.7546401112147
k: 13 | cost: 1575.6760121444784
k: 14 | cost: 1535.1999592854825
Out[55]:
[<matplotlib.lines.Line2D at 0x23f4c9f69a0>]
- 5 clusters looks most appropriate.
In [56]:
# k-means fitting and predict
kmeans = KMeans(n_clusters=5).fit(x)
cluster_num = pd.Series(kmeans.predict(x))
preprocessed_df['cluster_num'] = cluster_num.values
preprocessed_df.head()
Out[56]:
Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Legendary | Bug | Dark | ... | Rock | Steel | Water | Generation_1 | Generation_2 | Generation_3 | Generation_4 | Generation_5 | Generation_6 | cluster_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.976765 | -0.950626 | -0.924906 | -0.797154 | -0.239130 | -0.248189 | -0.801503 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | -0.251088 | -0.362822 | -0.524130 | -0.347917 | 0.219560 | 0.291156 | -0.285015 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 4 |
2 | 0.749845 | 0.420917 | 0.092448 | 0.293849 | 0.831146 | 1.010283 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 4 |
3 | 1.583957 | 0.420917 | 0.647369 | 1.577381 | 1.503891 | 1.729409 | 0.403635 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 |
4 | -1.051836 | -1.185748 | -0.832419 | -0.989683 | -0.392027 | -0.787533 | -0.112853 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 33 columns
[Visualizing Cluster Profiles]
- We examine each feature's distribution by cluster.
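The boxplots that follow can also be condensed into a single table of per-cluster means via groupby; a minimal sketch on a hypothetical miniature frame (illustrative values only):

```python
import pandas as pd

# Toy stand-in: standardized stats with an assigned cluster label
toy = pd.DataFrame({
    'HP':          [-1.0, -0.8, 0.9, 1.1],
    'Attack':      [-0.9, -0.7, 1.0, 1.2],
    'cluster_num': [0, 0, 1, 1],
})

# One row per cluster, one column per stat: a compact cluster profile
profile = toy.groupby('cluster_num').mean()
print(profile)
```

On the real data, `preprocessed_df.groupby('cluster_num')[stat].mean()` summarizes all six stat boxplots in one table.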
In [57]:
# HP
sns.boxplot(x='cluster_num', y='HP', data=preprocessed_df)
plt.title('HP Distribution by Cluster')
plt.show()
In [58]:
# Attack
sns.boxplot(x='cluster_num', y='Attack', data=preprocessed_df)
plt.title('Attack Distribution by Cluster')
plt.show()
In [59]:
# Defense
sns.boxplot(x='cluster_num', y='Defense', data=preprocessed_df)
plt.title('Defense Distribution by Cluster')
plt.show()
In [61]:
# Sp. Atk
sns.boxplot(x='cluster_num', y='Sp. Atk', data=preprocessed_df)
plt.title('Sp. Atk Distribution by Cluster')
plt.show()
In [62]:
# Sp. Def
sns.boxplot(x='cluster_num', y='Sp. Def', data=preprocessed_df)
plt.title('Sp. Def Distribution by Cluster')
plt.show()
In [63]:
# Speed
sns.boxplot(x='cluster_num', y='Speed', data=preprocessed_df)
plt.title('Speed Distribution by Cluster')
plt.show()