
[20-1] HuggingFace's Transformers - BERT

또르르21 2021. 2. 20. 00:32

HuggingFace provides a variety of models based on the Transformer architecture.

HuggingFace's various models and their usage are documented at the links below.

 

https://huggingface.co/transformers/index.html

 


https://github.com/huggingface/transformers

 


https://huggingface.co/models

 


 

1️⃣ Setup

 

Install transformers:

!pip install transformers

Import the required modules:

from transformers import BertConfig, BertTokenizer, BertModel, BertForSequenceClassification, BertForMaskedLM

from torch import nn

from tqdm import tqdm


import torch

 

2️⃣ Loading BERT

 

You can load a pre-trained BERT's config, tokenizer, and model separately.

bert_name = 'bert-base-uncased'   # base-size BERT; uncased means text is lowercased

The from_pretrained function loads weights trained on large-scale data.

If you instantiate the class without from_pretrained, you get a model with the same architecture but untrained (randomly initialized) weights.

Note that the model and tokenizer must be loaded from the same checkpoint (small, base, large, etc.), since each tokenizer is paired with the data its model was trained on.

config = BertConfig.from_pretrained(bert_name)         # holds the model architecture and hyperparameters

tokenizer = BertTokenizer.from_pretrained(bert_name)    # must come from the same checkpoint as the model

model = BertModel.from_pretrained(bert_name)
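
As an aside, the Auto classes resolve the right architecture from the checkpoint name, which makes it harder to accidentally pair a tokenizer with the wrong model; a minimal sketch, equivalent to the Bert* calls above:

from transformers import AutoConfig, AutoTokenizer, AutoModel

# all three objects are guaranteed to come from the same 'bert-base-uncased' checkpoint
config = AutoConfig.from_pretrained(bert_name)
tokenizer = AutoTokenizer.from_pretrained(bert_name)
model = AutoModel.from_pretrained(bert_name)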

The config has the following structure:

>>> config


BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,		# postion_embedding의 max가 512개 (최대 512 token까지밖에 못 집어넣음)
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,			# attention layer 12개
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.3.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

The tokenizer shows its vocab_size and special_tokens.

>>> tokenizer		# note the special_tokens


PreTrainedTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Printing the model shows each attention block with its in_features and out_features, as well as the final layers whose output you would fine-tune.

>>> model


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)				# segment embedding (separates sentences)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
     
     ...
     
      (11): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

 

3️⃣ Using the Tokenizer

 

Here is an example sentence:

sentence = "I want to go home."

1) token -> ids

Passing the sentence through the tokenizer produces the following fields:

  • input_ids : indices defined in the (pre-trained) vocab => 101 '[CLS]' is automatically prepended and 102 '[SEP]' appended
  • token_type_ids : sentence index; all 0 here because there is only one sentence
  • attention_mask : 1 for real tokens, 0 for padding, so that padded positions can be ignored

output = tokenizer(sentence)


>>> output

{'input_ids': [101, 1045, 2215, 2000, 2175, 2188, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

Alternatively, you can call the tokenize function directly:

tokenized = tokenizer.tokenize(sentence)
>>> tokenized   				# no indexing, no special token

['i', 'want', 'to', 'go', 'home', '.']		# uncased model, so I -> i

You can also inspect the vocabulary:

vocab = tokenizer.get_vocab()


>>> print(len(vocab))

30522

The [CLS] and [SEP] tokens sit at the following indices:

>>> print(vocab['[CLS]'])

101


>>> print(vocab['[SEP]'])

102

tokenizer.convert_tokens_to_ids makes it easy to get the token_ids.

Note that the [CLS] and [SEP] tokens are not included here.

token_ids = tokenizer.convert_tokens_to_ids(tokenized)


>>> print(token_ids)

[1045, 2215, 2000, 2175, 2188, 1012]

tokenizer.encode returns token_ids that do include the [CLS] and [SEP] tokens.

token_ids = tokenizer.encode(sentence)


>>> print(token_ids)

[101, 1045, 2215, 2000, 2175, 2188, 1012, 102]

2) ids -> token

Conversely, you can turn the tokens back into a string. Since the model is uncased, the original capitalization is not recovered:

sentence = tokenizer.convert_tokens_to_string(tokenized)


>>> print(sentence)

i want to go home .

Similarly, convert_ids_to_tokens turns ids back into tokens.

tokens = tokenizer.convert_ids_to_tokens(token_ids)


>>> print(tokens)

['[CLS]', 'i', 'want', 'to', 'go', 'home', '.', '[SEP]']

convert_tokens_to_string then joins them into a single string:

sentence = tokenizer.convert_tokens_to_string(tokens)


>>> print(sentence)

[CLS] i want to go home . [SEP]
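
tokenizer.decode does both steps in one call, going straight from ids to a string; a minimal sketch (the output below is what I would expect for this input, since decode also cleans up spaces before punctuation):

decoded = tokenizer.decode(token_ids)


>>> print(decoded)

[CLS] i want to go home. [SEP]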

3) Two sentences

When there are two sentences, token_type_ids distinguishes them with 0s and 1s: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1].

>>> tokenizer("I want to go home.", "Me too.")

{'input_ids': [101, 1045, 2215, 2000, 2175, 2188, 1012, 102, 2033, 2205, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

 

4️⃣ Data Preprocessing

 

Preprocess the sample data into a form BERT can consume.

data = [
  "I want to go home.",
  "My dog's name is Max.",
  "Natural Language Processing is my favorite research field.",
  "Welcome. How can I help you?",
  "Shoot for the moon. Even if you miss, you'll land among the stars."
]

First, run the tokenizer to convert each sentence to ids and track the maximum length:

max_len = 0

batch = []


for sent in tqdm(data):

  token_ids = tokenizer.encode(sent)
  
  max_len = max(max_len, len(token_ids))
  
  batch.append(token_ids)

Then pad every sequence up to the maximum length:

pad_id = tokenizer.pad_token_id    # id of the [PAD] token (0 for BERT)


for i, token_ids in enumerate(tqdm(batch)):

  if len(token_ids) < max_len:
  
    batch[i] = token_ids + [pad_id] * (max_len - len(token_ids))

batch = torch.LongTensor(batch)


>>> print(batch)

tensor([[ 101, 1045, 2215, 2000, 2175, 2188, 1012,  102,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 2026, 3899, 1005, 1055, 2171, 2003, 4098, 1012,  102,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 3019, 2653, 6364, 2003, 2026, 5440, 2470, 2492, 1012,  102,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 6160, 1012, 2129, 2064, 1045, 2393, 2017, 1029,  102,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 5607, 2005, 1996, 4231, 1012, 2130, 2065, 2017, 3335, 1010, 2017,
         1005, 2222, 2455, 2426, 1996, 3340, 1012,  102]])
         
         
>>> print(batch.shape)

torch.Size([5, 20])

Create an attention mask the same shape as the batch (1 for real tokens, 0 for padding):

batch_mask = (batch != pad_id).float()


>>> print(batch_mask)

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1.]])
         
         
>>> print(batch_mask.shape)

torch.Size([5, 20])
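
For reference, the tokenizer can do the padding and masking in a single call; a minimal sketch (the resulting tensors should match the batch and batch_mask we built by hand above):

encoded = tokenizer(data, padding=True, return_tensors='pt')

# encoded['input_ids'] corresponds to batch, encoded['attention_mask'] to batch_mask
print(encoded['input_ids'].shape)        # torch.Size([5, 20])
print(encoded['attention_mask'].shape)   # torch.Size([5, 20])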

 

 

5️⃣ Using BERT

 

Feed the batch and attention mask into the BERT model:

outputs = model(input_ids=batch, attention_mask=batch_mask)


>>> outputs


BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[ 0.0350,  0.3950, -0.0622,  ..., -0.0456,  0.2563,  0.5969],
                                                        [ 0.2861,  0.5091, -0.1923,  ...,  0.0136,  0.4276,  0.4311],
                                                        [ 0.1017,  0.3032,  1.1099,  ..., -0.0641, -0.0841,  0.3642],
                                                        ...,
                                                        [-0.0647,  0.2084,  0.2231,  ...,  0.3165, -0.1867,  0.1380],
                                                        [ 0.1437,  0.3288,  0.3981,  ...,  0.0445, -0.2554,  0.2308],
                                                        [ 0.2338,  0.2403,  0.3440,  ...,  0.0508, -0.2114,  0.0998]],
                                               
                                                       [[-0.1119,  0.2266, -0.2985,  ..., -0.2968,  0.5495,  0.5525],
                                                        [-0.0327,  0.1727, -0.3103,  ..., -0.1726,  0.7786,  0.2142],
                                                        [ 0.6370,  0.3274,  0.1777,  ..., -1.0487,  0.7025,  0.0337],
                                                        ...,
                                                        [-0.2139, -0.0164,  0.1756,  ...,  0.1646, -0.0521, -0.0087],
                                                        [-0.2137, -0.0305,  0.1925,  ...,  0.1855,  0.0185, -0.0185],
                                                        [-0.4911, -0.2284, -0.0021,  ...,  0.5878,  0.5304, -0.3678]],
                                               
                                                       [[-0.0133,  0.0811, -0.5912,  ..., -0.1440,  0.1487,  0.6923],
                                                        [-0.0363,  0.0629, -1.0613,  ..., -0.4580,  0.3329,  0.2224],
                                                        [-0.6297,  0.2951,  0.1461,  ..., -0.6709, -0.2904, -0.0189],
                                                        ...,
                                                        [ 0.0750, -0.1738,  0.0185,  ..., -0.1565, -0.3160,  0.2773],
                                                        [ 0.1279, -0.0480,  0.0221,  ..., -0.1945, -0.3751,  0.2981],
                                                        [ 0.1724, -0.0383,  0.0592,  ...,  0.0196, -0.3513,  0.3146]],
                                               
                                                       [[-0.3544, -0.0152, -0.1947,  ..., -0.3146,  0.1046,  0.5122],
                                                        [ 0.2056,  0.2471,  0.0427,  ...,  0.1570,  0.1739,  0.2585],
                                                        [-0.6213, -0.1445,  0.1371,  ...,  0.2898,  0.0139, -0.0427],
                                                        ...,
                                                        [ 0.0088,  0.0541,  0.4920,  ...,  0.3901, -0.0534,  0.0487],
                                                        [-0.1122, -0.0189,  0.4724,  ...,  0.4794, -0.1016, -0.0111],
                                                        [-0.2063, -0.7242,  0.0756,  ...,  0.6130, -0.1388, -0.0284]],
                                               
                                                       [[-0.0181,  0.0981,  0.0771,  ..., -0.2977, -0.1505,  0.6382],
                                                        [ 0.5404,  0.0063, -0.4138,  ..., -0.1347,  0.7193,  0.1365],
                                                        [ 0.9114,  0.4606,  0.1483,  ..., -0.0142, -0.0662, -0.0710],
                                                        ...,
                                                        [ 0.7115,  0.4008,  1.0012,  ..., -0.4991,  0.4392, -0.2713],
                                                        [-0.0758, -0.6092,  0.2912,  ...,  0.2968, -0.0182, -0.4343],
                                                        [ 0.1991,  0.5681,  0.4349,  ...,  0.0724, -0.6479,  0.2521]]],
                                                      grad_fn=<NativeLayerNormBackward>)),
                                              ('pooler_output',
                                               tensor([[-0.8837, -0.3119, -0.7145,  ..., -0.4839, -0.5941,  0.9018],
                                                       [-0.8932, -0.4881, -0.9386,  ..., -0.8893, -0.7298,  0.9271],
                                                       [-0.9098, -0.5320, -0.9271,  ..., -0.8821, -0.6992,  0.9104],
                                                       [-0.9680, -0.6045, -0.9794,  ..., -0.9240, -0.8610,  0.9598],
                                                       [-0.8507, -0.4667, -0.9166,  ..., -0.8087, -0.7228,  0.8411]],
                                                      grad_fn=<TanhBackward>))])

Looking closely, 'last_hidden_state' sits at index 0 of the outputs.

So we take the final layer's hidden states:

# B: batch size, L: max length, d_h: hidden size

last_hidden_states = outputs[0]  # (B, L, d_h)


>>> print(last_hidden_states.shape)

torch.Size([5, 20, 768])

pooler_output (outputs[1]) is the hidden state vector at the [CLS] position, passed through an additional linear layer and a Tanh non-linearity.

pooler_output = outputs[1]


>>> print(pooler_output.shape)

torch.Size([5, 768])
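
You can sanity-check this by running the model's own pooler module over the last hidden states; I would expect the result to match pooler_output exactly:

repooled = model.pooler(last_hidden_states)      # dense + Tanh applied at the [CLS] position

print(torch.allclose(repooled, pooler_output))   # expected: True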

For sentence-level classification, we use the "[CLS]" token.

num_classes = 10    # number of classes

sent_linear = nn.Linear(config.hidden_size, num_classes)    # hidden_size taken from the config

Sentence classification with BERT assumes that the leading [CLS] token carries the encoded information of the entire sequence.

So we keep the batch dimension (0th) as-is and take only the first position along the second (length) dimension.

# take only the [CLS] token's encoding

# [CLS] sits at the first position along the length dimension of (B, L, d_h)

cls_output = last_hidden_states[:, 0, :]


>>> print(cls_output)

tensor([[ 0.0350,  0.3950, -0.0622,  ..., -0.0456,  0.2563,  0.5969],
        [-0.1119,  0.2266, -0.2985,  ..., -0.2968,  0.5495,  0.5525],
        [-0.0133,  0.0811, -0.5912,  ..., -0.1440,  0.1487,  0.6923],
        [-0.3544, -0.0152, -0.1947,  ..., -0.3146,  0.1046,  0.5122],
        [-0.0181,  0.0981,  0.0771,  ..., -0.2977, -0.1505,  0.6382]],
       grad_fn=<SliceBackward>)
       
       
>>> print(cls_output.shape)  

torch.Size([5, 768])

Send the [CLS] vector through the linear layer to classify into 10 classes:

sent_output = sent_linear(cls_output)


>>> print(sent_output)

tensor([[ 0.1730, -0.3463, -0.2265, -0.0191, -0.1322, -0.3607, -0.1373, -0.1125,
          0.2349,  0.3238],
        [ 0.3057, -0.5396, -0.1199,  0.2707,  0.1432, -0.5335, -0.3624,  0.0758,
          0.2298,  0.3593],
        [ 0.2556, -0.2744, -0.1431,  0.0627, -0.0405, -0.4138, -0.4265,  0.0114,
          0.2790,  0.3270],
        [ 0.3045, -0.1953, -0.4695,  0.4246,  0.0690, -0.0995, -0.1821, -0.1829,
          0.0689,  0.2625],
        [ 0.0471, -0.2544, -0.0842,  0.1946,  0.1609,  0.1369, -0.0738, -0.1213,
          0.4147,  0.1887]], grad_fn=<AddmmBackward>)


>>> print(sent_output.shape)  

torch.Size([5, 10])			  # last dimension is now 10 (one logit per class)
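
To actually fine-tune this head, you would compare sent_output against labels with cross-entropy; a minimal sketch with made-up labels:

labels = torch.randint(0, num_classes, (sent_output.shape[0],))   # hypothetical labels for the 5 sentences

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(sent_output, labels)   # scalar loss that backpropagates through the head and BERT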

For token-level classification, we use the hidden states of the entire sequence.

num_classes = 50

token_linear = nn.Linear(config.hidden_size, num_classes)
token_output = token_linear(last_hidden_states)


>>> print(token_output)

tensor([[[-0.4240, -0.0809, -0.1891,  ..., -0.4091, -0.0597, -0.1531],
         [ 0.1173, -0.1280, -0.5075,  ..., -0.1931, -0.3523,  0.1603],
         [-0.3172, -0.1473, -0.4104,  ...,  0.0065, -0.1860,  0.4348],
         ...,
         [-0.0823, -0.0853, -0.1976,  ..., -0.1266,  0.2178,  0.1725],
         [ 0.1148, -0.1549,  0.2117,  ..., -0.1581,  0.0653,  0.4681],
         [ 0.1700, -0.1526,  0.2520,  ..., -0.1332,  0.0881,  0.4569]],

        [[-0.2749,  0.1350,  0.0296,  ..., -0.0090,  0.0654, -0.1307],
         [ 0.2562,  0.0345, -0.4457,  ..., -0.0153,  0.0034,  0.0024],
         [-0.1136, -0.3402,  0.3856,  ...,  0.1139,  0.0576,  0.1821],
         ...,
         [-0.0569, -0.0408,  0.0691,  ..., -0.1416,  0.4455,  0.2434],
         [-0.0157, -0.0271,  0.1235,  ..., -0.1180,  0.4439,  0.2506],
         [-0.0440,  0.2745, -0.4567,  ..., -0.1834,  0.6945, -0.0269]],

        [[-0.3818,  0.2000, -0.0656,  ..., -0.1287,  0.0346, -0.2580],
         [-0.3653,  0.0806, -0.2898,  ..., -0.4338, -0.0799, -0.6714],
         [-0.4184,  0.2781, -0.3370,  ..., -0.2542, -0.1669, -0.2853],
         ...,
         [-0.0364,  0.0521, -0.0686,  ..., -0.2326,  0.1584, -0.0324],
         [-0.0170,  0.1039, -0.1363,  ..., -0.2243,  0.1475, -0.0528],
         [ 0.0137,  0.0961, -0.0712,  ..., -0.1326,  0.2231,  0.0328]],

        [[-0.2572,  0.0130,  0.0312,  ..., -0.1893,  0.0650, -0.2823],
         [-0.1158, -0.1213, -0.2284,  ..., -0.3106, -0.0287, -0.0872],
         [-0.6114,  0.0659, -0.2104,  ...,  0.1462, -0.1411, -0.0926],
         ...,
         [-0.0275, -0.0704,  0.1242,  ..., -0.1996,  0.2381,  0.2470],
         [ 0.0416,  0.0063,  0.0567,  ..., -0.2011,  0.3007,  0.2860],
         [-0.0611,  0.2284, -0.3293,  ..., -0.1689,  0.3786, -0.0570]],

        [[-0.2710, -0.1798,  0.4292,  ..., -0.0402,  0.0454, -0.2510],
         [-0.3849, -0.1658,  0.0374,  ..., -0.4138, -0.1476, -0.0957],
         [-0.3024, -0.2653, -0.2197,  ..., -0.3227, -0.1199, -0.1087],
         ...,
         [-0.0071, -0.0239,  0.5017,  ..., -0.1317,  0.0427, -0.3688],
         [-0.4590, -0.1514, -0.0962,  ...,  0.1622,  0.2298, -0.3776],
         [-0.2516, -0.2136,  0.2195,  ...,  0.1512,  0.0119,  0.0284]]],
       grad_fn=<AddBackward0>)
       
       
>>> print(token_output.shape)

torch.Size([5, 20, 50])
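
Token-level training should skip the pad positions; a sketch of one common approach, with hypothetical per-token labels, using batch_mask to zero out the loss at padding:

token_labels = torch.randint(0, num_classes, batch.shape)   # hypothetical labels, shape (B, L)

loss_fn = nn.CrossEntropyLoss(reduction='none')
per_token_loss = loss_fn(token_output.reshape(-1, num_classes), token_labels.reshape(-1))   # (B*L,)
loss = (per_token_loss * batch_mask.reshape(-1)).sum() / batch_mask.sum()   # average over real tokens only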

 

6️⃣ BERT Model Variants

 

HuggingFace also provides BERT models with various task-specific heads.

 

https://huggingface.co/transformers/model_doc/bert.html

 


 

  •  BertForSequenceClassification

seq_model = BertForSequenceClassification.from_pretrained(bert_name)

By default, BertForSequenceClassification performs binary classification, with 2 output classes on its final classifier (configurable via num_labels).

>>> seq_model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        
        ...
        
        (11): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)		# classification head with 2 output features
)
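
If you pass labels, the head computes the loss internally; a sketch with hypothetical binary labels for our 5 sentences:

labels = torch.LongTensor([0, 1, 0, 1, 0])   # hypothetical binary labels

seq_outputs = seq_model(input_ids=batch, attention_mask=batch_mask, labels=labels)

print(seq_outputs.loss)           # cross-entropy loss computed by the head
print(seq_outputs.logits.shape)   # torch.Size([5, 2])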

 

  • BertForMaskedLM

lm_model = BertForMaskedLM.from_pretrained(bert_name, config=config)

Since BertForMaskedLM is a language model, it outputs a vocab-sized vector at each position

(because it has to predict a word from the vocabulary).

>>> lm_model

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        
        ...
        
        (11): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=30522, bias=True)			 # vocab-sized output (predicting the word)
    )
  )
)
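
As a quick check of the MLM head, you can mask a token and see what the model predicts there; a minimal sketch (no output shown, since the predicted word depends on the checkpoint):

masked = tokenizer("I want to go [MASK].", return_tensors='pt')

lm_output = lm_model(**masked)   # logits shape: (1, seq_len, vocab_size)
mask_pos = (masked['input_ids'][0] == tokenizer.mask_token_id).nonzero()[0].item()
pred_id = lm_output.logits[0, mask_pos].argmax().item()

print(tokenizer.convert_ids_to_tokens([pred_id]))   # the model's guess for [MASK]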