Must-have Kaggle: Part 2, Chapter 7, Binary Classification

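#0 Overview

The goal of the competition is to use 23 categorical features to predict the probability that each record belongs to target value 1.

The competition also has three distinctive characteristics:

1) The data is artificially generated.

2) The meaning of each feature and of the target is unknown, so nothing can be inferred from background knowledge. The approach has to rely purely on the data.

3) All of the provided data is categorical. The target is also categorical and has only two values, so this is a binary classification problem.

Goal: master the methods for encoding categorical data.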

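#1 Notes by Page

p241

EDA flow

Set up the Kaggle environment > Explore the data and build a feature summary > Visualize the data (target distribution; binary, nominal, ordinal, and date feature distributions) > Summarize the analysis and plan the modeling strategy

p242 Setting the index with index_col = 'id'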

import pandas as pd

data_path = '/kaggle/input/cat-in-the-dat/'

train = pd.read_csv(data_path + 'train.csv', index_col = 'id')
test = pd.read_csv(data_path + 'test.csv', index_col = 'id')
submission = pd.read_csv(data_path + 'sample_submission.csv', index_col = 'id')

This makes the given column the index of the DataFrame instead of an ordinary column.
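To see what index_col changes, here is a minimal sketch on a made-up three-row CSV. The CSV text and the names default_index and id_index are assumptions for illustration, not the competition files.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for train.csv.
csv_text = "id,bin_0,target\n0,0,0\n1,1,0\n2,0,1\n"

# Without index_col, 'id' stays an ordinary column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col='id', the 'id' column becomes the row index.
id_index = pd.read_csv(io.StringIO(csv_text), index_col='id')

print(default_index.columns.tolist())
print(id_index.columns.tolist())
print(id_index.index.name)
```

Keeping id in the index means it will not be picked up as a feature by accident later on.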

p243

df.shape returns the number of rows and columns.

train.head().T swaps rows and columns (a transpose), which makes a wide table easier to read.
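A small sketch of both calls on a toy frame. The frame contents are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns.
toy = pd.DataFrame({'bin_0': [0, 1, 0], 'nom_0': ['Red', 'Blue', 'Green']})

n_rows, n_cols = toy.shape   # (rows, columns)
transposed = toy.head().T    # features become rows, records become columns

print(n_rows, n_cols)
print(transposed)
```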

p245 Skeleton of a function that summarizes the features

def resumetable(df):
    print(f"Dataset shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes, columns=['Data Type'])
    summary = summary.reset_index()
    summary = summary.rename(columns = {'index': 'Feature'})
    summary['Missing Values'] = df.isnull().sum().values
    summary['Unique Values'] = df.nunique().values
    # df.loc looks rows up by index label, so this relies on the ids 0, 1, 2 existing
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values

    return summary

resumetable(train)


p246 Feature types

1. Binary features: bin_0~

2. Nominal features: nom_0~

3. Ordinal features: ord_0~

4. Other columns: day, month, target

Binary features whose values are T/F or Y/N should be converted to 1/0.
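One way to do that conversion is Series.map with an explicit dictionary. This is a minimal sketch on toy data. Which columns actually hold T/F and which hold Y/N is an assumption here, so check the unique values first.

```python
import pandas as pd

# Toy frame; the column names and their T/F, Y/N values are assumed.
toy = pd.DataFrame({'bin_3': ['T', 'F', 'T'], 'bin_4': ['Y', 'N', 'N']})

# Map each letter to 1 or 0; any value missing from the dict would become NaN.
toy['bin_3'] = toy['bin_3'].map({'F': 0, 'T': 1})
toy['bin_4'] = toy['bin_4'].map({'N': 0, 'Y': 1})

print(toy)
```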

p247 For nominal features, the number of unique values gives a hint about what kind of feature it is.
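A quick sketch of counting unique values per column with nunique. The toy values are made up. The point is only that a column with a handful of repeated labels and a column where almost every value is different show up very differently in the counts.

```python
import pandas as pd

# Hypothetical values: nom_0 repeats a few labels, nom_5 is different on every row.
toy = pd.DataFrame({
    'nom_0': ['Red', 'Blue', 'Red', 'Green'],
    'nom_5': ['a1f', '9c2', '77b', 'e04'],
})

unique_counts = toy.nunique()  # number of distinct values in each column
print(unique_counts)
```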

p248 Print the unique values to see what each feature looks like

for i in range(3):
    feature = 'ord_' + str(i)
    print(f'{feature} unique values: {train[feature].unique()}')

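p249 Ordinal features whose values are letters are encoded in alphabetical order.

In other words, ordinal features must be encoded so that the numbers follow the order of their unique values.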
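The two cases can be sketched as follows. The column names, the values, and the order in ord_1_order are assumptions for illustration. For a real column, write the order out after inspecting its unique values.

```python
import pandas as pd

# Toy frame with one "meaningful order" column and one alphabetical column.
toy = pd.DataFrame({
    'ord_1': ['Novice', 'Master', 'Expert'],
    'ord_3': ['c', 'a', 'b'],
})

# Meaningful order: write the mapping by hand (this order is assumed).
ord_1_order = {'Novice': 0, 'Expert': 1, 'Master': 2}
toy['ord_1'] = toy['ord_1'].map(ord_1_order)

# Alphabetical order: sort the unique values and number them from 0.
letters = sorted(toy['ord_3'].unique())
toy['ord_3'] = toy['ord_3'].map({letter: i for i, letter in enumerate(letters)})

print(toy)
```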

p250 Look at the target distribution to see how imbalanced the data is.
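Before plotting, the same information can be read off as plain numbers with value_counts. A minimal sketch on a made-up target column (the 7:3 split is invented, not the competition's ratio):

```python
import pandas as pd

# Hypothetical target column with seven 0s and three 1s.
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 0, 1], name='target')

# normalize=True returns proportions instead of raw counts.
ratios = target.value_counts(normalize=True)
print(ratios)
```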

p252 Plotting the target distribution

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

mpl.rc('font', size=15)
plt.figure(figsize = (7,6))

ax = sns.countplot(x='target', data = train)
ax.set_title('Target Distribution')

print(ax.patches)

rectangle = ax.patches[0]
print('Bar height:', rectangle.get_height())
print('Bar width:', rectangle.get_width())
print('x position of the left edge of the bar:', rectangle.get_x())

p253 Function that writes the percentage above each bar

# Write the ratio at the top of each bar
def write_percent(ax, total_size):
    for patch in ax.patches:
        height = patch.get_height()
        width = patch.get_width()
        left_coord = patch.get_x()
        percent = height/total_size * 100 # share of all rows in this bar

        # Place the text centered just above the bar
        ax.text(x=left_coord + width/2.0,
               y=height + total_size * 0.001,
               s=f'{percent:1.1f}%',
               ha='center')
        
plt.figure(figsize=(7,6))

ax = sns.countplot(x='target', data=train)
write_percent(ax, len(train))
ax.set_title('Target Distribution')

p254

Code that plots each binary feature's distribution split by target value

import matplotlib.gridspec as gridspec

mpl.rc('font', size=12)
grid = gridspec.GridSpec(3, 2)
plt.figure(figsize=(10, 16))
plt.subplots_adjust(wspace=0.4, hspace=0.3)

bin_features = ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']

for idx, feature in enumerate(bin_features):
    ax = plt.subplot(grid[idx])
    
    sns.countplot(x=feature,
                 data=train,
                 hue='target',
                 palette='pastel',
                 ax=ax)
    
    ax.set_title(f'{feature} Distribution by Target')
    write_percent(ax, len(train))

p258 Nominal features with few unique values are all one-hot encoded, because there are not many categories and their order can be ignored.
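A minimal sketch of one-hot encoding with pd.get_dummies on toy data. The color values are made up, and this is only one of several ways to do it (scikit-learn's OneHotEncoder is another).

```python
import pandas as pd

# Hypothetical nominal column with three categories.
toy = pd.DataFrame({'nom_0': ['Red', 'Blue', 'Green', 'Red']})

# One new 0/1 column per category; the original column is dropped.
encoded = pd.get_dummies(toy, columns=['nom_0'], dtype=int)
print(encoded)
```

Each row has exactly one 1, so no order between categories is implied.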

p259 Building a cross-tabulation against target with the crosstab function

pd.crosstab(train['nom_0'], train['target'])

crosstab = pd.crosstab(train['nom_0'], train['target'], normalize='index') * 100
crosstab
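To make normalize='index' concrete, here is a small worked example on made-up data. With normalize='index', each row is divided by its own row total, so every row sums to 100 after multiplying by 100.

```python
import pandas as pd

# Hypothetical data: Red has 3 zeros and 1 one, Blue has 1 zero and 1 one.
toy = pd.DataFrame({
    'nom_0': ['Red', 'Red', 'Red', 'Red', 'Blue', 'Blue'],
    'target': [0, 0, 0, 1, 0, 1],
})

counts = pd.crosstab(toy['nom_0'], toy['target'])                             # raw counts
percents = pd.crosstab(toy['nom_0'], toy['target'], normalize='index') * 100  # row percentages

print(counts)
print(percents)
```

Column 1 of percents is the share of target 1 within each category, which is the value the point plot draws.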

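# get_crosstab is called below but is not defined in these notes; this
# definition is an assumption reconstructed from the crosstab call above.
def get_crosstab(df, feature):
    crosstab = pd.crosstab(df[feature], df['target'], normalize='index') * 100
    crosstab = crosstab.reset_index()  # turn the feature index into a column
    return crosstab

def plot_pointplot(ax, feature, crosstab):
    ax2 = ax.twinx()  # second y-axis sharing the same x-axis
    ax2 = sns.pointplot(x=feature, y=1, data=crosstab, order=crosstab[feature].values, color='black', ax=ax2)
    ax2.set_ylim(crosstab[1].min()-5, crosstab[1].max()*1.1)
    ax2.set_ylabel("Target 1 Ratio(%)")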
    
def plot_cat_dist_with_true_ratio(df, features, num_rows, num_cols, size=(15, 20)):
    plt.figure(figsize=size)
    grid = gridspec.GridSpec(num_rows, num_cols)
    plt.subplots_adjust(wspace=0.45, hspace=0.3)
    
    for idx, feature in enumerate(features):
        ax = plt.subplot(grid[idx])
        crosstab = get_crosstab(df, feature)
        
        sns.countplot(x = feature, data = df, order = crosstab[feature].values, color='skyblue', ax=ax)
        
        write_percent(ax, len(df))
        
        plot_pointplot(ax, feature, crosstab)
        
        ax.set_title(f'{feature} Distribution')
        
nom_features = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4']
plot_cat_dist_with_true_ratio(train, nom_features, num_rows=3, num_cols=2)

p267 Distribution of the date features
