[16-2] Word2Vec Using Konlpy


또르르's 개발 Story

Boostcamp AI Tech U stage / Practice

또르르21 2021. 2. 16. 01:36

1️⃣ Setup

Install konlpy.

konlpy implements a variety of Korean morphological analyzers as classes.

!pip install konlpy

Import konlpy and the other modules needed for this exercise.

from tqdm import tqdm
from konlpy.tag import Okt
from torch import nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from collections import defaultdict

import torch
import copy
import numpy as np

2️⃣ Data preprocessing

Inspect the data and preprocess it into the form Word2Vec needs.

The train_data and the words to embed (test_words) are shown below.

train_data = [
  "정말 맛있습니다. 추천합니다.",
  "기대했던 것보단 별로였네요.",
  "다 좋은데 가격이 너무 비싸서 다시 가고 싶다는 생각이 안 드네요.",
  "완전 최고입니다! 재방문 의사 있습니다.",
  "음식도 서비스도 다 만족스러웠습니다.",
  "위생 상태가 좀 별로였습니다. 좀 더 개선되기를 바랍니다.",
  "맛도 좋았고 직원분들 서비스도 너무 친절했습니다.",
  "기념일에 방문했는데 음식도 분위기도 서비스도 다 좋았습니다.",
  "전반적으로 음식이 너무 짰습니다. 저는 별로였네요.",
  "위생에 조금 더 신경 썼으면 좋겠습니다. 조금 불쾌했습니다."       
]

test_words = ["음식", "맛", "서비스", "위생", "가격"]

Tokenize the sentences with the Okt tokenizer (formerly called Twitter) provided by the KoNLPy package.

tokenizer = Okt()

def make_tokenized(data):
  tokenized = []
  for sent in tqdm(data):
    # morphs() splits a sentence into morphemes; stem=True reduces each one to its base form
    tokens = tokenizer.morphs(sent, stem=True)
    tokenized.append(tokens)

  return tokenized

train_tokenized = make_tokenized(train_data)

word_count counts how many times each token appears in the tokenized data. After sorting, it is a list of (token, count) pairs in descending order of count.

word_count = defaultdict(int)

for tokens in tqdm(train_tokenized):
  for token in tokens:
    word_count[token] += 1

word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)

>>> print(list(word_count))

[('.', 14), ('도', 7), ('이다', 4), ('좋다', 4), ('별로', 3), ('다', 3), ('이', 3), ('너무', 3), ('음식', 3), ('서비스', 3), ('하다', 2), ('방문', 2), ('위생', 2), ('좀', 2), ('더', 2), ('에', 2), ('조금', 2), ('정말', 1), ('맛있다', 1), ('추천', 1), ('기대하다', 1), ('것', 1), ('보단', 1), ('가격', 1), ('비싸다', 1), ('다시', 1), ('가다', 1), ('싶다', 1), ('생각', 1), ('안', 1), ('드네', 1), ('요', 1), ('완전', 1), ('최고', 1), ('!', 1), ('재', 1), ('의사', 1), ('있다', 1), ('만족스럽다', 1), ('상태', 1), ('가', 1), ('개선', 1), ('되다', 1), ('기르다', 1), ('바라다', 1), ('맛', 1), ('직원', 1), ('분들', 1), ('친절하다', 1), ('기념일', 1), ('분위기', 1), ('전반', 1), ('적', 1), ('으로', 1), ('짜다', 1), ('저', 1), ('는', 1), ('신경', 1), ('써다', 1), ('불쾌하다', 1)]
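
The counting and sorting steps above can also be sketched with collections.Counter from the standard library. This is an alternative formulation, not the code used in this post. The token lists are a hand-written toy sample (assumed for illustration, not real Okt output), and count_tokens is a hypothetical helper name.

```python
from collections import Counter

def count_tokens(tokenized_sentences):
    """Return (token, count) pairs sorted by descending count."""
    counter = Counter()
    for tokens in tokenized_sentences:
        counter.update(tokens)  # adds 1 for every token occurrence
    # most_common() sorts by count, highest first; ties keep first-seen order
    return counter.most_common()

# Toy sample standing in for train_tokenized (assumed, not real tokenizer output)
toy_tokenized = [
    ["음식", "도", "좋다", "."],
    ["서비스", "도", "좋다", "."],
    ["음식", "별로", "."],
]
toy_counts = count_tokens(toy_tokenized)
print(toy_counts)
```

Like the stable sorted(...) call in the post, this keeps tokens with equal counts in the order they were first seen.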

w2i is the vocabulary: a dict whose keys are words and whose values are word indices.

w2i = {}

for pair in tqdm(word_count):
  if pair[0] not in w2i:
    w2i[pair[0]] = len(w2i)

>>> print(w2i)

{'.': 0, '도': 1, '이다': 2, '좋다': 3, '별로': 4, '다': 5, '이': 6, '너무': 7, '음식': 8, '서비스': 9, '하다': 10, '방문': 11, '위생': 12, '좀': 13, '더': 14, '에': 15, '조금': 16, '정말': 17, '맛있다': 18, '추천': 19, '기대하다': 20, '것': 21, '보단': 22, '가격': 23, '비싸다': 24, '다시': 25, '가다': 26, '싶다': 27, '생각': 28, '안': 29, '드네': 30, '요': 31, '완전': 32, '최고': 33, '!': 34, '재': 35, '의사': 36, '있다': 37, '만족스럽다': 38, '상태': 39, '가': 40, '개선': 41, '되다': 42, '기르다': 43, '바라다': 44, '맛': 45, '직원': 46, '분들': 47, '친절하다': 48, '기념일': 49, '분위기': 50, '전반': 51, '적': 52, '으로': 53, '짜다': 54, '저': 55, '는': 56, '신경': 57, '써다': 58, '불쾌하다': 59}
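
Because word_count is sorted by frequency, the most frequent token receives index 0. The sketch below wraps the same construction in a function and adds a reverse lookup for turning indices back into tokens. The names build_vocab and toy_i2w are hypothetical and do not appear in this post, and the (token, count) list is a toy sample assumed for illustration.

```python
def build_vocab(sorted_counts):
    """Map each token to its rank in a (token, count) list sorted by frequency."""
    vocab = {}
    for token, _count in sorted_counts:
        if token not in vocab:        # same guard as the loop in the post
            vocab[token] = len(vocab)
    return vocab

# Toy (token, count) pairs, already sorted by descending count (assumed data)
toy_sorted_counts = [(".", 3), ("음식", 2), ("도", 2), ("별로", 1)]
toy_w2i = build_vocab(toy_sorted_counts)
toy_i2w = {index: token for token, index in toy_w2i.items()}  # reverse lookup

print(toy_w2i)     # {'.': 0, '음식': 1, '도': 2, '별로': 3}
print(toy_i2w[1])  # 음식
```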

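3️⃣ Implementing the Dataset and model classes

Word2Vec comes in two variants: CBOW and Skip-gram.

CBOW (Continuous Bag-of-Words)

The model is trained to predict the center word from its surrounding (context) words.
The one-hot vector of each context word is projected through the embedding layer to obtain its embedding vector. These embeddings are combined by element-wise addition and then passed through a linear transformation, which produces a vector of the same size as the one-hot vector of the center word to be predicted. The loss is computed between this vector and the one-hot vector of the center word.
Example) I am going to school & window size: 2
- Input (context words): "I", "am", "to", "school"
- Output (center word): "going"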
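
The example above can be reproduced in a few lines of plain Python. The sketch mirrors the windowing rule used by the CBOWDataset class defined later in this post, which keeps a position only when a full window fits on both sides. cbow_pairs is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
def cbow_pairs(tokens, window_size=2):
    """Return (context_words, center_word) pairs for positions with a full window."""
    pairs = []
    for i, center in enumerate(tokens):
        # keep the position only if window_size tokens exist on each side
        if i - window_size >= 0 and i + window_size < len(tokens):
            context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
            pairs.append((context, center))
    return pairs

cbow_example = cbow_pairs("I am going to school".split(), window_size=2)
print(cbow_example)  # [(['I', 'am', 'to', 'school'], 'going')]
```

With five tokens and a window size of 2, only "going" has two neighbors on each side, so the sentence yields a single training sample.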

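Skip-gram

The model is trained to predict the surrounding (context) words from the center word.
The one-hot vector of the center word is projected through the embedding layer to obtain its embedding vector. This vector is passed through a linear transformation, which produces a vector of the same size as the one-hot vectors of the context words to be predicted. The loss is computed separately against the one-hot vector of each context word.
Example) I am going to school & window size: 2
- Input (center word): "going"
- Output (context words): "I", "am", "to", "school"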
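
For Skip-gram, the same window is flattened into one (center, context) pair per context word. The sketch below follows the windowing rule of the SkipGramDataset class defined later in this post. skipgram_pairs is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (center_word, context_word) pairs, one per context position."""
    pairs = []
    for i, center in enumerate(tokens):
        # keep the position only if window_size tokens exist on each side
        if i - window_size >= 0 and i + window_size < len(tokens):
            context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
            pairs.extend((center, word) for word in context)
    return pairs

skipgram_example = skipgram_pairs("I am going to school".split(), window_size=2)
print(skipgram_example)
# [('going', 'I'), ('going', 'am'), ('going', 'to'), ('going', 'school')]
```

Each qualifying center word therefore contributes 2 * window_size training samples, compared with one sample per center word for CBOW.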

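1) Defining the Datasets

Define a Dataset for each of the two models.

In the CBOW Dataset, $x$ holds the context word indices and $y$ holds the center word index.

In the Skip-gram Dataset, $x$ holds the center word index repeated 2*window_size times (once per context word) and $y$ holds the context word indices.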

class CBOWDataset(Dataset):
  def __init__(self, train_tokenized, window_size=2):   # each window spans 2*window_size + 1 tokens
    self.x = []
    self.y = []

    for tokens in tqdm(train_tokenized):
      token_ids = [w2i[token] for token in tokens]    # convert each token to its index
      for i, token_id in enumerate(token_ids):
        # keep only positions that have a full window on both sides
        if i-window_size >= 0 and i+window_size < len(token_ids):
          self.x.append(token_ids[i-window_size:i] + token_ids[i+1:i+window_size+1])
          self.y.append(token_id)

    self.x = torch.LongTensor(self.x)  # (number of samples, 2 * window_size)
    self.y = torch.LongTensor(self.y)  # (number of samples)

  def __len__(self):
    return self.x.shape[0]

  def __getitem__(self, idx):
    return self.x[idx], self.y[idx]

class SkipGramDataset(Dataset):
  def __init__(self, train_tokenized, window_size=2):
    self.x = []
    self.y = []

    for tokens in tqdm(train_tokenized):
      token_ids = [w2i[token] for token in tokens]    # convert each token to its index
      for i, token_id in enumerate(token_ids):
        # keep only positions that have a full window on both sides
        if i-window_size >= 0 and i+window_size < len(token_ids):
          self.y += (token_ids[i-window_size:i] + token_ids[i+1:i+window_size+1])
          self.x += [token_id] * 2 * window_size    # repeat the center word once per context word

    self.x = torch.LongTensor(self.x)  # (number of samples)
    self.y = torch.LongTensor(self.y)  # (number of samples)

  def __len__(self):
    return self.x.shape[0]

  def __getitem__(self, idx):
    return self.x[idx], self.y[idx]

Create a Dataset object for each model.

cbow_set = CBOWDataset(train_tokenized)
skipgram_set = SkipGramDataset(train_tokenized)

print(list(skipgram_set))

2) Implementing the model classes

Implement the two Word2Vec models in turn.

self.embedding : a layer that embeds a one-hot vector of size vocab_size into a dim-dimensional vector.
self.linear : a layer that maps the embedding vector back to size vocab_size.

class CBOW(nn.Module):
  def __init__(self, vocab_size, dim):
    super(CBOW, self).__init__()
    self.embedding = nn.Embedding(vocab_size, dim, sparse=True)
    self.linear = nn.Linear(dim, vocab_size)

  # B: batch size, W: window size, d_w: word embedding size, V: vocab size
  def forward(self, x):  # x: (B, 2W)
    embeddings = self.embedding(x)  # (B, 2W, d_w)
    embeddings = torch.sum(embeddings, dim=1)  # (B, d_w)
    output = self.linear(embeddings)  # (B, V)
    return output

class SkipGram(nn.Module):
  def __init__(self, vocab_size, dim):
    super(SkipGram, self).__init__()
    self.embedding = nn.Embedding(vocab_size, dim, sparse=True)
    self.linear = nn.Linear(dim, vocab_size)

  # B: batch size, W: window size, d_w: word embedding size, V: vocab size
  def forward(self, x): # x: (B)
    embeddings = self.embedding(x)  # (B, d_w)
    output = self.linear(embeddings)  # (B, V)
    return output
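
Written as equations, the two forward passes for a single sample look roughly as follows, together with the softmax cross-entropy loss that nn.CrossEntropyLoss computes for one example. The symbols are notation introduced here for illustration, not taken from the post: $E$ is the embedding matrix, $U$ and $b$ are the weight and bias of self.linear, $c_1, \dots, c_{2W}$ are the context word indices, $w$ is the center word index, and $o$ is one context word index.

```latex
\begin{aligned}
\text{CBOW:}\quad
  & h = \sum_{j=1}^{2W} E_{c_j} \in \mathbb{R}^{d_w}, \qquad
    z = U h + b \in \mathbb{R}^{V}, \qquad
    \mathcal{L} = -\log \frac{\exp(z_w)}{\sum_{k=1}^{V} \exp(z_k)} \\
\text{Skip-gram:}\quad
  & h = E_{w} \in \mathbb{R}^{d_w}, \qquad
    z = U h + b \in \mathbb{R}^{V}, \qquad
    \mathcal{L} = -\log \frac{\exp(z_o)}{\sum_{k=1}^{V} \exp(z_k)}
\end{aligned}
```

The only structural difference is the input side: CBOW sums $2W$ embedding rows before the linear layer, while Skip-gram uses a single row and is trained once per context word.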

Create both models.

cbow = CBOW(vocab_size=len(w2i), dim=256)
skipgram = SkipGram(vocab_size=len(w2i), dim=256)

4️⃣ Training the models

Set the hyperparameters as follows and create the DataLoader objects.

batch_size = 4
learning_rate = 5e-4
num_epochs = 5
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

cbow_loader = DataLoader(cbow_set, batch_size=batch_size)
skipgram_loader = DataLoader(skipgram_set, batch_size=batch_size)

First, train the CBOW model.

The loss function is CrossEntropyLoss and the optimizer is SGD.

cbow.train()
cbow = cbow.to(device)
optim = torch.optim.SGD(cbow.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

for e in range(1, num_epochs+1):
  print("#" * 50)
  print(f"Epoch: {e}")
  for batch in tqdm(cbow_loader):
    x, y = batch
    x, y = x.to(device), y.to(device) # (B, 2W), (B)
    output = cbow(x)  # (B, V)

    optim.zero_grad()
    loss = loss_function(output, y)
    loss.backward()
    optim.step()

    print(f"Train loss: {loss.item()}")

print("Finished.")
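
The loss used in this loop, nn.CrossEntropyLoss, takes raw scores (logits) and target class indices, applies a log-softmax, and by default averages the negative log-likelihood over the batch. The plain-Python sketch below reproduces that calculation for illustration only. cross_entropy is a hypothetical helper rather than part of the post, and the logits are made-up values.

```python
import math

def cross_entropy(logits_batch, targets):
    """Mean over the batch of -log softmax(logits)[target]."""
    total = 0.0
    for logits, target in zip(logits_batch, targets):
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[target]
    return total / len(targets)

# Equal scores for every class give a loss of log(V)
V = 60  # vocabulary size printed earlier in this post (indices 0..59)
uniform_loss = cross_entropy([[0.0] * V], [8])
print(round(uniform_loss, 4))  # 4.0943
```

A model that scores all 60 vocabulary entries equally would sit at about 4.09, which gives a rough reference point for reading the printed training losses.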

Next, train the Skip-gram model.

The loss function is CrossEntropyLoss and the optimizer is SGD.

skipgram.train()
skipgram = skipgram.to(device)
optim = torch.optim.SGD(skipgram.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

for e in range(1, num_epochs+1):
  print("#" * 50)
  print(f"Epoch: {e}")
  for batch in tqdm(skipgram_loader):
    x, y = batch
    x, y = x.to(device), y.to(device) # (B), (B)
    output = skipgram(x)  # (B, V)

    optim.zero_grad()
    loss = loss_function(output, y)
    loss.backward()
    optim.step()

    print(f"Train loss: {loss.item()}")

print("Finished.")

5️⃣ Test

Use each trained model to inspect the word embeddings of the test words.

The CBOW embeddings are printed as follows.

for word in test_words:
  input_id = torch.LongTensor([w2i[word]]).to(device)
  emb = cbow.embedding(input_id)

  print(f"Word: {word}")
  print(emb.squeeze(0))

Word: 음식
tensor([ 0.7871, -0.3074,  0.4734,  0.4573,  1.2907,  2.1604, -1.0026,  1.3560,
        -1.6656, -0.5797, -0.1133,  1.8729,  0.0372,  0.6972,  0.6047, -1.1175,
        -0.8053, -0.1157, -0.0836,  0.2930,  0.8856, -0.3225,  0.1877, -0.1684,
	...
         2.5086, -0.5354, -3.1007, -0.2636, -1.1620,  1.2101, -0.1410,  0.5012,
         0.0395, -0.4021, -0.2023, -0.6053, -1.1326,  0.3702, -1.0018, -1.1885,
        -1.2665, -1.3071,  0.0562, -0.4163, -0.4033, -1.2125, -0.2763,  1.3753],
       device='cuda:0', grad_fn=<SqueezeBackward1>)
Word: 맛
tensor([-1.2578e+00,  1.3162e+00, -1.2269e+00, -9.4777e-01,  8.7179e-01,
        -4.9126e-01, -4.0730e-01,  3.1209e-01, -7.5379e-01,  2.2323e+00,
         1.2187e+00, -8.5848e-01, -3.4187e-01, -1.0128e+00, -9.4951e-02,
	...
        -1.7632e-01, -7.0913e-01,  6.9236e-01,  1.8028e+00, -7.7268e-01,
        -4.5270e-01, -7.8029e-01,  1.1691e+00, -9.5050e-01, -1.1641e+00,
        -1.2241e-01,  4.3775e-01,  6.7700e-01, -2.1887e-01, -5.8933e-02,
         6.1698e-01], device='cuda:0', grad_fn=<SqueezeBackward1>)
Word: 서비스
tensor([-1.2396e-01,  2.8120e+00, -8.4324e-01, -1.5637e-01,  4.5518e-01,
         3.9645e-01,  6.0823e-01, -6.6805e-01,  5.6258e-01, -1.5786e+00,
        -2.5151e+00, -9.8164e-01,  1.9362e-01, -7.7830e-01,  3.4442e-01,
       	...
        -2.4733e-01,  1.4230e+00,  8.6770e-02, -1.4932e+00,  3.1906e-03,
         1.0010e+00,  8.7868e-01, -8.6652e-01,  2.4954e-02,  1.1432e+00,
         5.9686e-02, -7.1561e-01,  1.4564e+00,  8.5984e-02,  1.7097e+00,
         5.4189e-01], device='cuda:0', grad_fn=<SqueezeBackward1>)
Word: 위생
tensor([ 9.0346e-01,  1.3885e+00,  3.1470e-01,  6.4606e-03,  8.0281e-01,
        -2.4418e+00, -1.0277e+00,  1.1815e+00, -6.9252e-01, -4.1188e-01,
        -1.7976e-01,  1.8906e-01,  1.0705e+00, -1.3597e+00, -9.8227e-02,
        ...
        -1.1619e+00, -3.4318e-01,  1.4190e-01,  2.6423e-01, -1.9008e-01,
         6.1568e-01,  1.3538e-01,  1.3466e+00, -5.8256e-01, -7.1663e-01,
        -6.8595e-01, -1.2774e-01, -4.8664e-01,  5.2942e-01,  5.2076e-01,
        -1.8389e-01], device='cuda:0', grad_fn=<SqueezeBackward1>)
Word: 가격
tensor([-1.7174, -0.2584, -1.2358, -0.0502,  0.9441, -1.4775, -0.0181,  0.2480,
        -0.3489,  1.4455, -0.8877, -1.0397,  0.6699, -0.7992,  0.7862, -1.0841,
        -0.5846, -0.0729, -0.0404, -0.0876, -0.9881,  1.6955, -0.9078,  0.3367,
        ...
        -0.1070, -0.4423,  0.2717,  0.9903, -0.5689,  0.6526, -1.1282, -1.3012,
         0.5867, -0.6132, -1.3153,  0.8662, -0.0330, -0.0179, -0.4557, -0.6534,
        -0.5898,  0.6272, -0.2705, -0.7067, -0.8818,  0.1508, -0.3256, -1.0571],
       device='cuda:0', grad_fn=<SqueezeBackward1>)

The Skip-gram embeddings are inspected the same way. Here only the largest component of each embedding is printed.

for word in test_words:
  input_id = torch.LongTensor([w2i[word]]).to(device)
  # torch.FloatTensor holds 32-bit floating-point values; torch.LongTensor holds 64-bit signed integers
  emb = skipgram.embedding(input_id)

  print(f"Word: {word}")
  print(max(emb.squeeze(0)))

Word: 음식
tensor(3.2107, device='cuda:0', grad_fn=<UnbindBackward>)
Word: 맛
tensor(2.8692, device='cuda:0', grad_fn=<UnbindBackward>)
Word: 서비스
tensor(2.7438, device='cuda:0', grad_fn=<UnbindBackward>)
Word: 위생
tensor(3.1363, device='cuda:0', grad_fn=<UnbindBackward>)
Word: 가격
tensor(2.8667, device='cuda:0', grad_fn=<UnbindBackward>)
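
Raw embedding values are hard to interpret on their own. A common way to compare two word vectors is cosine similarity. The post itself does not do this, so the sketch below is an optional addition. It uses plain Python lists, the vectors are made-up toy values rather than trained embeddings, and cosine_similarity is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)  # close to 1.0
sim_ac = cosine_similarity(vec_a, vec_c)  # 0.0
print(sim_ab, sim_ac)
```

To apply it to a trained model, one option would be to convert a row of the embedding matrix to a list first, for example cbow.embedding.weight[w2i[word]].detach().cpu().tolist(). With only ten training sentences, the resulting similarities should not be expected to be meaningful.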
