NLP1: Spooky Author Identification

#0 Location

https://www.kaggle.com/c/spooky-author-identification

#1 Understanding the problem

1. Spooky Author Identification

This is a classification problem over authors of horror (spooky) fiction.

Share code and discuss insights to identify horror authors from their writings

In short: classify horror authors by their writing, while sharing code and discussing insights along the way.

2. Description

The description itself reads like a horror story.

As I scurried across the candlelit chamber, manuscripts in hand, I thought I'd made it. Nothing would be able to hurt me anymore. Little did I know there was one last fright lurking around the corner. DING! My phone pinged me with a disturbing notification. It was Will, the scariest of Kaggle moderators, sharing news of another data leak. "ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn!" I cried as I clumsily dropped my crate of unbound, spooky books. Pages scattered across the chamber floor. How will I ever figure out how to put them back together according to the authors who wrote them? Or are they lost, forevermore? Wait, I thought... I know, machine learning! In this year's Halloween playground competition, you're challenged to predict the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and HP Lovecraft. We're encouraging you (with cash prizes!) to share your insights in the competition's discussion forum and code in Kernels. We've designated prizes to reward authors of kernels and discussion threads that are particularly valuable to the community. Click the "Prizes" tab on this overview page to learn more. Getting Started New to Kernels or working with natural language data? We've put together some starter kernels in Python and R to help you hit the ground running.

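As I scurried across the candlelit room with the manuscripts in hand, I thought I had finally made it, and that nothing could hurt me anymore. What I did not know was that one last fright was lurking around the corner. (The narrator is startled, drops the manuscripts, and the authors' pages get mixed together.)

Roughly, the narrator then decides to use machine learning to sort the pages back by author.

So this competition is about writing code that classifies text by each author's writing style.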

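3. Evaluation criteria

Submissions are scored with multi-class logarithmic loss.

For each excerpt you submit a predicted probability for each of the three authors, and the score is computed from those probabilities by the log loss formula.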

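N is the number of observations in the test set (the excerpts) and M is the number of class labels (the three authors).

If observation i belongs to class j, y is 1 (for the true author, the log of the predicted probability is multiplied by 1 and used as is).

If observation i does not belong to class j, y is 0, so that term drops out entirely.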
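The formula itself did not survive in these notes. Assuming the competition uses the standard multi-class log loss, and using i for observations and j for classes (index names chosen here for illustration), it can be written as:

```latex
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})
```

Here p_ij is the predicted probability that observation i belongs to class j. Because exactly one y_ij per observation is 1, each excerpt contributes only -log of the probability assigned to its true author.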

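The probabilities submitted for an observation do not need to sum to 1.

They are rescaled before scoring anyway (each row is divided by its row sum).

Because a predicted probability of 0 would give log 0, an extreme value, predicted probabilities are clipped to a fixed range before the log is taken.

The submission is a CSV with one row per excerpt: the excerpt id followed by the probability for each of the three authors (columns id, EAP, HPL, MWS).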
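A minimal sketch of the scoring steps described above, using only the standard library. The clipping bound `EPS = 1e-15`, the rescale-then-clip order, and the helper names are assumptions made for illustration, not the competition's actual scoring code.

```python
import math

EPS = 1e-15  # assumed clipping bound


def rescale_and_clip(row):
    """Divide a row by its sum, then clip each value into [EPS, 1 - EPS]."""
    total = sum(row)
    scaled = [p / total for p in row] if total > 0 else row
    return [min(max(p, EPS), 1 - EPS) for p in scaled]


def log_loss_rows(rows, true_idx):
    """Average of -log(probability given to the true class) over all rows."""
    losses = []
    for row, t in zip(rows, true_idx):
        probs = rescale_and_clip(row)
        losses.append(-math.log(probs[t]))
    return sum(losses) / len(losses)


# Hypothetical rows in EAP, HPL, MWS order; the first does not sum to 1.
rows = [[2.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
true_idx = [0, 1]  # true authors: EAP, then HPL

print(rescale_and_clip(rows[0]))
print(log_loss_rows(rows, true_idx))
```

The first row is rescaled to 0.5, 0.25, 0.25 before scoring, and the zeros in the second row are lifted to the clipping bound so that the logarithm stays finite.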

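4. Insights that can be drawn from the evaluation criteria

1) Importance of probability prediction: because the metric is multi-class log loss, it is not enough to guess the right author; predicting an accurate probability for each class is what matters.

2) Danger of overconfidence: putting a high probability on the wrong class leaves very little for the true class, and -log of that small probability becomes a very large penalty, so be careful not to be overly certain.

3) The probabilities need not sum to 1: they are rescaled anyway, but making them sum to 1 keeps the model's output easy to interpret.

After expressing the similarity to each author as a numeric score, a softmax function can turn the scores into proportions that sum to 1.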

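4) Consider class imbalance: check how many samples each author has, see whether the amounts are balanced, and take that into account when training the model.

5) What can be done to improve accuracy? Feature engineering should produce features that reflect each author's characteristics well, based on the vocabulary, style, or grammatical traits of the text. Model ensembling and regularization are also worth considering.

6) Tune the model to the evaluation metric as closely as possible. This can differ somewhat from tuning for accuracy.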
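Insights 2) and 3) can be illustrated with a small sketch. The scores and probability rows below are made-up numbers, and `softmax` is a hand-written helper rather than a library call.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarity scores for one excerpt, in EAP, HPL, MWS order.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs, sum(probs))

# Suppose the true author is HPL (index 1). Both predictions below favor
# EAP, but one is overconfident and the other is hedged.
overconfident = [0.98, 0.01, 0.01]
hedged = [0.50, 0.30, 0.20]
loss_overconfident = -math.log(overconfident[1])
loss_hedged = -math.log(hedged[1])
print(loss_overconfident, loss_hedged)
```

Both predictions are equally wrong in terms of accuracy, but the overconfident one pays a much larger log loss, which is why tuning for this metric differs from tuning for accuracy.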

#2 Insights gathered along the way

1. Author research

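EAP - Edgar Allan Poe: An American writer of poems and short stories whose work centers on mystery and on eerie, macabre tales. His most famous work is probably the poem "The Raven", and he is also widely credited as a pioneer of the detective fiction genre.

HPL - H.P. Lovecraft: Best known for his horror fiction. His most acclaimed stories revolve around the fictional mythos of the infamous creature "Cthulhu", a hybrid chimera with the head of an octopus, a human body, and wings on its back.

MWS - Mary Shelley: Seems to have been involved in a wide range of literary work as a novelist, dramatist, travel writer, and biographer. She is most famous for the classic tale "Frankenstein", in which the scientist Frankenstein, the "Modern Prometheus", creates the monster that came to be associated with his name.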

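=> Insights

1) Identify the characteristics of each author's style and themes.

EAP may show mystery, gloom, and grotesque elements, as well as the logical deduction and planted clues typical of detective fiction.

HPL (Lovecraft) is known for the bizarre and overwhelming world of the Cthulhu mythos.

MWS writes about science and ethics and about human nature, and also writes emotional stories.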

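2) Set up criteria for feature engineering.

Each author's genre differs subtly.

=> LDA (Latent Dirichlet Allocation) can be used to identify and compare the genre (topic) of each text.

What is LDA?

It is a topic modeling technique: a probabilistic model used to discover hidden topics in a collection of documents.

sklearn also provides it under the name LatentDirichletAllocation.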

Beyond that, it is also worth taking into account that words from a unique fictional universe, such as Cthulhu, may appear only in one particular author's writing.
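One simple way to look for such author-specific words is to collect each author's vocabulary and subtract everyone else's. This is a sketch on toy rows; `exclusive_words` is a hypothetical helper, and it splits on whitespace only, so punctuation handling is left out.

```python
from collections import defaultdict


def exclusive_words(texts, authors):
    """Map each author to the set of lowercase tokens no other author uses."""
    vocab = defaultdict(set)
    for text, author in zip(texts, authors):
        vocab[author].update(text.lower().split())
    result = {}
    for author, words in vocab.items():
        others = set().union(*(w for a, w in vocab.items() if a != author))
        result[author] = words - others
    return result


# Toy rows standing in for train['text'] and train['author'].
texts = [
    "cthulhu waits in the deep",
    "the raven sat in the dark",
    "the creature wept in the dark",
]
authors = ["HPL", "EAP", "MWS"]

unique = exclusive_words(texts, authors)
print(unique)
```

On the real data, rare exclusive words are often just noise, so a minimum count per word would probably be needed before using them as features.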

2. Data EDA

Using plotly, a word-frequency bar chart looks like this.

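import pandas as pd
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

# Assumption: the competition's train.csv (id, text, author) is in the working directory.
train = pd.read_csv('./train.csv')

# Count every whitespace-separated token across all excerpts.
all_words = train['text'].str.split(expand=True).unstack().value_counts()

# Skip the two most frequent tokens and plot the next 48 (ranks 3 to 50).
data = [go.Bar(
    x = all_words.index.values[2:50],
    y = all_words.values[2:50],
    # The color array must use the same slice as x and y.
    marker = dict(colorscale='Jet', color=all_words.values[2:50]),
    text = 'Word counts'
)]

layout = go.Layout(
    title='Top 50 (Uncleaned) Word frequencies in the training dataset'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')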

Code applying TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words='english' drops words on the built-in English stop-word list (words that carry little meaning).
# max_features caps the vocabulary at the 5000 most frequent terms.
# This returns a vectorizer; its fit_transform returns a sparse matrix.
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

def get_top_n_words(texts, num):
    # Compute the TF-IDF matrix
    tfidf_matrix = tfidf.fit_transform(texts)

    # Extract the list of words
    feature_names = tfidf.get_feature_names_out()
    # Sum the TF-IDF scores of each word over all documents
    sum_tfidf = tfidf_matrix.sum(axis=0)

    # print(type(sum_tfidf))
    # numpy.matrix.A1 returns the matrix flattened into a 1-D ndarray.
    sum_tfidf_array = sum_tfidf.A1

    # print(sum_tfidf_array)

    # Build a dict mapping each word to its summed TF-IDF score
    tfidf_scores = dict(zip(feature_names, sum_tfidf_array))
    # Take the top n words by summed TF-IDF score
    top_n_words = sorted(tfidf_scores.items(), key=lambda x: x[1], reverse=True)[:num]
    return top_n_words


top_n = 20  # extract the top 20 words

#3 First hand-copied solution

A simple baseline: TfidfVectorizer, which vectorizes text on the tf-idf principle, combined with LogisticRegression.
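The combination can be sketched roughly as below. This is not the code from the notebook linked in this section; the six sentences and their labels are invented toy data, and `max_iter=1000` is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the real inputs would be train['text'] and train['author'].
texts = [
    "the raven tapped at my chamber door",
    "the detective reasoned about the purloined letter",
    "cthulhu sleeps in the sunken city",
    "a nameless horror rose from the sea",
    "the creature i created filled me with despair",
    "misery followed the wretched creature",
]
authors = ["EAP", "EAP", "HPL", "HPL", "MWS", "MWS"]

# TF-IDF features feed a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, authors)

# predict_proba columns follow model.classes_ (sorted label order),
# which here matches the submission header: EAP, HPL, MWS.
print(model.classes_)
probs = model.predict_proba(["the horror beneath the sea"])
print(probs.round(3))
```

Since the metric is log loss, `predict_proba` is the output to submit, not the hard labels from `predict`.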

Code I wrote: https://www.kaggle.com/code/imjh99/nlp-1/edit

Result:

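#4 Trying BERT

The approach in #3 does not perform well, so I want to try a BERT model, which is good at extracting features from text.

This has to be implemented with a pretrained BERT model and BertForSequenceClassification. (Still studying this.)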

import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
import random
import numpy as np
import torch
import math
import csv

# Fix random seeds for reproducibility
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

# Label mapping
author2label = {'EAP':0, 'HPL':1, 'MWS':2}
label2author = {0:'EAP', 1:'HPL', 2:'MWS'}

# 1. Load and prepare the data
train_df = pd.read_csv('./train.csv')   # id, text, author
test_df = pd.read_csv('./test.csv')     # id, text (no labels; used for the final submission)

# train/validation split
train_texts, eval_texts, train_authors, eval_authors = train_test_split(
    train_df['text'].tolist(), train_df['author'].tolist(), test_size=0.2, random_state=42
)

# Encode labels as integers
train_labels = [author2label[a] for a in train_authors]
eval_labels = [author2label[a] for a in eval_authors]

train_dataset = Dataset.from_dict({'text': train_texts, 'label': train_labels})
eval_dataset = Dataset.from_dict({'text': eval_texts, 'label': eval_labels})

# submission dataset (test data, no labels)
submission_dataset = Dataset.from_dict({'id': test_df['id'].tolist(), 'text': test_df['text'].tolist()})

# 2. Load the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# 3. Preprocess the data (tokenization function)
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)
submission_dataset = submission_dataset.map(tokenize_function, batched=True)

# Set the output format to torch tensors
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
eval_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
submission_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

# 4. Load the model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# 5. Set the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy='epoch',   # note: newer transformers releases rename this argument to eval_strategy
    save_strategy='no',    # adjust as needed
    logging_strategy='epoch',
    load_best_model_at_end=False
)

# 6. Multi-class log loss function
def multiclass_log_loss(y_true, y_pred_proba):
    # y_true: list of integer labels
    # y_pred_proba: predicted probabilities, shape = (N, 3)
    eps = 1e-15
    N = len(y_true)
    # Convert y_true to one-hot
    y_true_oh = np.zeros((N, 3))
    for i, lbl in enumerate(y_true):
        y_true_oh[i, lbl] = 1.0

    # Renormalize the probabilities row by row
    row_sums = y_pred_proba.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero
    y_pred_proba = y_pred_proba / row_sums

    # Clipping
    y_pred_proba = np.clip(y_pred_proba, eps, 1 - eps)

    logloss = -np.sum(y_true_oh * np.log(y_pred_proba)) / N
    return logloss

# 7. Define the metric function (called by the Trainer)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # logits -> probabilities via softmax
    probs = torch.softmax(torch.tensor(logits), dim=-1).numpy()
    return {"logloss": multiclass_log_loss(labels, probs)}

# 8. Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# 9. Train the model
trainer.train()

# 10. Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation Results: {eval_results}")

# Predict on the test dataset
predictions = trainer.predict(submission_dataset)
logits = predictions.predictions
probs = torch.softmax(torch.tensor(logits), dim=-1).numpy()

# Write the final submission file
# Submission format: id, EAP, HPL, MWS
# Each row holds the EAP/HPL/MWS probabilities for that excerpt
# Renormalize and clip the probabilities
eps = 1e-15
probs_sums = probs.sum(axis=1, keepdims=True)
probs_sums[probs_sums == 0] = 1.0
probs_norm = probs / probs_sums
probs_norm = np.clip(probs_norm, eps, 1 - eps)

submission_data = []
submission_data.append(["id","EAP","HPL","MWS"])
for i, doc_id in enumerate(test_df['id'].tolist()):
    p_eap, p_hpl, p_mws = probs_norm[i]
    submission_data.append([doc_id, p_eap, p_hpl, p_mws])

with open('submission.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(submission_data)

print("Submission file created: submission.csv")

On my laptop, training alone takes 11 hours...
