[20-1] HuggingFace's Transformers - BERT
HuggingFace provides a wide range of Transformer-based models.
The available models and their usage are documented at the links below.
https://huggingface.co/transformers/index.html
https://github.com/huggingface/transformers
1️⃣ Setup
Install transformers:
!pip install transformers
Import the required modules:
from transformers import *
from torch import nn
from tqdm import tqdm
import torch
2️⃣ Loading BERT
You can load the config, tokenizer, and model of a pre-trained BERT separately.
bert_name = 'bert-base-uncased' # "base"-sized BERT; "uncased" means the text is lowercased
The from_pretrained function loads a model that was trained on large-scale data.
If you instantiate the class without from_pretrained, you get the same architecture but with untrained (randomly initialized) weights.
Note that the model and tokenizer must be loaded from the same checkpoint (small, base, large, etc.), since each tokenizer is trained on the same data as its model.
config = BertConfig.from_pretrained(bert_name) # holds the model architecture and hyperparameters
tokenizer = BertTokenizer.from_pretrained(bert_name) # must match the model checkpoint (small/base/large are trained on the same data as their models)
model = BertModel.from_pretrained(bert_name)
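As a side note, the Auto* classes can resolve the right config/tokenizer/model class from the checkpoint name, which helps keep the tokenizer and model consistent. A minimal sketch using the same checkpoint as above:
from transformers import AutoConfig, AutoTokenizer, AutoModel

auto_config = AutoConfig.from_pretrained(bert_name)       # resolves to BertConfig
auto_tokenizer = AutoTokenizer.from_pretrained(bert_name) # resolves to a BERT tokenizer
auto_model = AutoModel.from_pretrained(bert_name)         # resolves to BertModel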
The config has the following structure:
>>> config
BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512, # postion_embedding의 max가 512개 (최대 512 token까지밖에 못 집어넣음)
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12, # attention layer 12개
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.3.2",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
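The fields above are also available as attributes on the config object (config.hidden_size is used this way in the classification example below):
>>> config.hidden_size
768
>>> config.num_hidden_layers
12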
The tokenizer shows its vocab_size and special_tokens.
>>> tokenizer # note the special_tokens field
PreTrainedTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
The model repr shows each attention module with its in_features and out_features, and how the final output used for fine-tuning is produced.
>>> model
BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768) # segment embedding (separates the two sentences)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(1): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
...
(11): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
3️⃣ Using the Tokenizer
The example sentence is as follows:
sentence = "I want to go home."
1) token -> ids
Running the tokenizer on the example sentence produces the following fields:
- input_ids : indices defined in the (pre-trained) vocab; 101 '[CLS]' is automatically prepended and 102 '[SEP]' appended
- token_type_ids : the sentence index of each token; all 0 here because there is a single sentence
- attention_mask : 1 for real tokens and 0 for padding, so that padded positions can be masked out
output = tokenizer(sentence)
>>> output
{'input_ids': [101, 1045, 2215, 2000, 2175, 2188, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
Alternatively, you can call the tokenize function directly:
tokenized = tokenizer.tokenize(sentence)
>>> tokenized # no indexing, no special tokens
['i', 'want', 'to', 'go', 'home', '.'] # "I" becomes "i" because the model is uncased
You can also inspect the vocabulary:
vocab = tokenizer.get_vocab()
>>> print(len(vocab))
30522
The [CLS] and [SEP] tokens sit at the following indices:
>>> print(vocab['[CLS]'])
101
>>> print(vocab['[SEP]'])
102
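The same ids are also exposed directly as tokenizer attributes:
>>> tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id
(101, 102, 0)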
Using tokenizer.convert_tokens_to_ids you can get the token_ids directly.
The [CLS] and [SEP] tokens are not included here.
token_ids = tokenizer.convert_tokens_to_ids(tokenized)
>>> print(token_ids)
[1045, 2215, 2000, 2175, 2188, 1012]
Using tokenizer.encode returns token_ids that include the [CLS] and [SEP] tokens.
token_ids = tokenizer.encode(sentence)
>>> print(token_ids)
[101, 1045, 2215, 2000, 2175, 2188, 1012, 102]
2) ids -> token
Conversely, you can also go back toward the original text.
sentence = tokenizer.convert_tokens_to_string(tokenized)
>>> print(sentence)
i want to go home .
Similarly, the convert_ids_to_tokens function turns ids back into tokens.
tokens = tokenizer.convert_ids_to_tokens(token_ids)
>>> print(tokens)
['[CLS]', 'i', 'want', 'to', 'go', 'home', '.', '[SEP]']
The convert_tokens_to_string function then produces a plain string.
sentence = tokenizer.convert_tokens_to_string(tokens)
>>> print(sentence)
[CLS] i want to go home . [SEP]
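tokenizer.decode goes straight from ids to a string (the exact whitespace clean-up around punctuation may differ slightly between versions); a small sketch:
print(tokenizer.decode(token_ids))                            # expected: [CLS] i want to go home. [SEP]
print(tokenizer.decode(token_ids, skip_special_tokens=True))  # expected: i want to go home.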
3) Two sentences
With two sentences, token_type_ids distinguishes them with 0s and 1s: 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1].
>>> tokenizer("I want to go home.", "Me too.")
{'input_ids': [101, 1045, 2215, 2000, 2175, 2188, 1012, 102, 2033, 2205, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
4️⃣ Data Preprocessing
We preprocess some sample data into a form that can be fed to BERT.
data = [
"I want to go home.",
"My dog's name is Max.",
"Natural Language Processing is my favorite research field.",
"Welcome. How can I help you?",
"Shoot for the moon. Even if you miss, you'll land among the stars."
]
First, run the tokenizer to convert each sentence into token ids and record the maximum length:
max_len = 0
batch = []
for sent in tqdm(data):
    token_ids = tokenizer.encode(sent)
    max_len = max(max_len, len(token_ids))
    batch.append(token_ids)
Then pad every sequence to the maximum length with the [PAD] token id:
pad_id = tokenizer._convert_token_to_id('[PAD]')  # id of the [PAD] token (equivalently: tokenizer.pad_token_id)
for i, token_ids in enumerate(tqdm(batch)):
    if len(token_ids) < max_len:
        batch[i] = token_ids + [pad_id] * (max_len - len(token_ids))
batch = torch.LongTensor(batch)
>>> print(batch)
tensor([[ 101, 1045, 2215, 2000, 2175, 2188, 1012, 102, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[ 101, 2026, 3899, 1005, 1055, 2171, 2003, 4098, 1012, 102, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[ 101, 3019, 2653, 6364, 2003, 2026, 5440, 2470, 2492, 1012, 102, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[ 101, 6160, 1012, 2129, 2064, 1045, 2393, 2017, 1029, 102, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[ 101, 5607, 2005, 1996, 4231, 1012, 2130, 2065, 2017, 3335, 1010, 2017,
1005, 2222, 2455, 2426, 1996, 3340, 1012, 102]])
>>> print(batch.shape)
torch.Size([5, 20])
We also build an attention mask with the same shape as the batch:
batch_mask = (batch != pad_id).float()
>>> print(batch_mask)
tensor([[1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1.]])
>>> print(batch_mask.shape)
torch.Size([5, 20])
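As an alternative to the manual loop above, recent tokenizer versions can batch-encode, pad, and build the attention mask in a single call; a minimal sketch:
encoded = tokenizer(data, padding=True, return_tensors='pt')
print(encoded['input_ids'].shape)       # torch.Size([5, 20])
print(encoded['attention_mask'].shape)  # torch.Size([5, 20])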
5️⃣ Running BERT
Feed the batch and the attention mask into the BERT model.
outputs = model(input_ids=batch, attention_mask=batch_mask)
>>> outputs
BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
tensor([[[ 0.0350, 0.3950, -0.0622, ..., -0.0456, 0.2563, 0.5969],
[ 0.2861, 0.5091, -0.1923, ..., 0.0136, 0.4276, 0.4311],
[ 0.1017, 0.3032, 1.1099, ..., -0.0641, -0.0841, 0.3642],
...,
[-0.0647, 0.2084, 0.2231, ..., 0.3165, -0.1867, 0.1380],
[ 0.1437, 0.3288, 0.3981, ..., 0.0445, -0.2554, 0.2308],
[ 0.2338, 0.2403, 0.3440, ..., 0.0508, -0.2114, 0.0998]],
[[-0.1119, 0.2266, -0.2985, ..., -0.2968, 0.5495, 0.5525],
[-0.0327, 0.1727, -0.3103, ..., -0.1726, 0.7786, 0.2142],
[ 0.6370, 0.3274, 0.1777, ..., -1.0487, 0.7025, 0.0337],
...,
[-0.2139, -0.0164, 0.1756, ..., 0.1646, -0.0521, -0.0087],
[-0.2137, -0.0305, 0.1925, ..., 0.1855, 0.0185, -0.0185],
[-0.4911, -0.2284, -0.0021, ..., 0.5878, 0.5304, -0.3678]],
[[-0.0133, 0.0811, -0.5912, ..., -0.1440, 0.1487, 0.6923],
[-0.0363, 0.0629, -1.0613, ..., -0.4580, 0.3329, 0.2224],
[-0.6297, 0.2951, 0.1461, ..., -0.6709, -0.2904, -0.0189],
...,
[ 0.0750, -0.1738, 0.0185, ..., -0.1565, -0.3160, 0.2773],
[ 0.1279, -0.0480, 0.0221, ..., -0.1945, -0.3751, 0.2981],
[ 0.1724, -0.0383, 0.0592, ..., 0.0196, -0.3513, 0.3146]],
[[-0.3544, -0.0152, -0.1947, ..., -0.3146, 0.1046, 0.5122],
[ 0.2056, 0.2471, 0.0427, ..., 0.1570, 0.1739, 0.2585],
[-0.6213, -0.1445, 0.1371, ..., 0.2898, 0.0139, -0.0427],
...,
[ 0.0088, 0.0541, 0.4920, ..., 0.3901, -0.0534, 0.0487],
[-0.1122, -0.0189, 0.4724, ..., 0.4794, -0.1016, -0.0111],
[-0.2063, -0.7242, 0.0756, ..., 0.6130, -0.1388, -0.0284]],
[[-0.0181, 0.0981, 0.0771, ..., -0.2977, -0.1505, 0.6382],
[ 0.5404, 0.0063, -0.4138, ..., -0.1347, 0.7193, 0.1365],
[ 0.9114, 0.4606, 0.1483, ..., -0.0142, -0.0662, -0.0710],
...,
[ 0.7115, 0.4008, 1.0012, ..., -0.4991, 0.4392, -0.2713],
[-0.0758, -0.6092, 0.2912, ..., 0.2968, -0.0182, -0.4343],
[ 0.1991, 0.5681, 0.4349, ..., 0.0724, -0.6479, 0.2521]]],
grad_fn=<NativeLayerNormBackward>)),
('pooler_output',
tensor([[-0.8837, -0.3119, -0.7145, ..., -0.4839, -0.5941, 0.9018],
[-0.8932, -0.4881, -0.9386, ..., -0.8893, -0.7298, 0.9271],
[-0.9098, -0.5320, -0.9271, ..., -0.8821, -0.6992, 0.9104],
[-0.9680, -0.6045, -0.9794, ..., -0.9240, -0.8610, 0.9598],
[-0.8507, -0.4667, -0.9166, ..., -0.8087, -0.7228, 0.8411]],
grad_fn=<TanhBackward>))])
Looking closely, index 0 of the outputs holds the 'last_hidden_state'.
So we take the output of the last hidden layer.
# B: batch size, L: max length, d_h: hidden size
last_hidden_states = outputs[0] # (B, L, d_h)
>>> print(last_hidden_states.shape)
torch.Size([5, 20, 768])
The pooler output (outputs[1]) takes only the hidden state at the [CLS] position and passes it through an additional linear layer and a non-linear unit (Tanh).
pooler_output = outputs[1]
>>> print(pooler_output.shape)
torch.Size([5, 768])
For sentence-level classification we use the "[CLS]" token.
num_classes = 10 # number of classes to predict
sent_linear = nn.Linear(config.hidden_size, num_classes) # hidden_size taken from the config
BERT's sentence classification assumes the leading [CLS] token carries the encoding of the entire sequence.
So we keep the batch dimension (dim 0) as-is and take only the first position along the length dimension (dim 1).
# take only the encoding of the [CLS] token:
# it sits at the first position of the length dimension in (B, L, d_h)
cls_output = last_hidden_states[:, 0, :]
>>> print(cls_output)
tensor([[ 0.0350, 0.3950, -0.0622, ..., -0.0456, 0.2563, 0.5969],
[-0.1119, 0.2266, -0.2985, ..., -0.2968, 0.5495, 0.5525],
[-0.0133, 0.0811, -0.5912, ..., -0.1440, 0.1487, 0.6923],
[-0.3544, -0.0152, -0.1947, ..., -0.3146, 0.1046, 0.5122],
[-0.0181, 0.0981, 0.0771, ..., -0.2977, -0.1505, 0.6382]],
grad_fn=<SliceBackward>)
>>> print(cls_output.shape)
torch.Size([5, 768])
We pass the [CLS] vector through the linear layer to classify into the 10 classes.
sent_output = sent_linear(cls_output)
>>> print(sent_output)
tensor([[ 0.1730, -0.3463, -0.2265, -0.0191, -0.1322, -0.3607, -0.1373, -0.1125,
0.2349, 0.3238],
[ 0.3057, -0.5396, -0.1199, 0.2707, 0.1432, -0.5335, -0.3624, 0.0758,
0.2298, 0.3593],
[ 0.2556, -0.2744, -0.1431, 0.0627, -0.0405, -0.4138, -0.4265, 0.0114,
0.2790, 0.3270],
[ 0.3045, -0.1953, -0.4695, 0.4246, 0.0690, -0.0995, -0.1821, -0.1829,
0.0689, 0.2625],
[ 0.0471, -0.2544, -0.0842, 0.1946, 0.1609, 0.1369, -0.0738, -0.1213,
0.4147, 0.1887]], grad_fn=<AddmmBackward>)
>>> print(sent_output.shape)
torch.Size([5, 10]) # the last dimension is now num_classes = 10
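To show how these logits would be trained, here is a minimal sketch with made-up (hypothetical) labels and a standard cross-entropy loss:
labels = torch.randint(0, num_classes, (sent_output.size(0),))  # dummy labels, one per sentence
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(sent_output, labels)  # sent_output: (B, num_classes), labels: (B,)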
For token-level classification we use the hidden states of the entire sequence.
num_classes = 50
token_linear = nn.Linear(config.hidden_size, num_classes)
token_output = token_linear(last_hidden_states)
>>> print(token_output)
tensor([[[-0.4240, -0.0809, -0.1891, ..., -0.4091, -0.0597, -0.1531],
[ 0.1173, -0.1280, -0.5075, ..., -0.1931, -0.3523, 0.1603],
[-0.3172, -0.1473, -0.4104, ..., 0.0065, -0.1860, 0.4348],
...,
[-0.0823, -0.0853, -0.1976, ..., -0.1266, 0.2178, 0.1725],
[ 0.1148, -0.1549, 0.2117, ..., -0.1581, 0.0653, 0.4681],
[ 0.1700, -0.1526, 0.2520, ..., -0.1332, 0.0881, 0.4569]],
[[-0.2749, 0.1350, 0.0296, ..., -0.0090, 0.0654, -0.1307],
[ 0.2562, 0.0345, -0.4457, ..., -0.0153, 0.0034, 0.0024],
[-0.1136, -0.3402, 0.3856, ..., 0.1139, 0.0576, 0.1821],
...,
[-0.0569, -0.0408, 0.0691, ..., -0.1416, 0.4455, 0.2434],
[-0.0157, -0.0271, 0.1235, ..., -0.1180, 0.4439, 0.2506],
[-0.0440, 0.2745, -0.4567, ..., -0.1834, 0.6945, -0.0269]],
[[-0.3818, 0.2000, -0.0656, ..., -0.1287, 0.0346, -0.2580],
[-0.3653, 0.0806, -0.2898, ..., -0.4338, -0.0799, -0.6714],
[-0.4184, 0.2781, -0.3370, ..., -0.2542, -0.1669, -0.2853],
...,
[-0.0364, 0.0521, -0.0686, ..., -0.2326, 0.1584, -0.0324],
[-0.0170, 0.1039, -0.1363, ..., -0.2243, 0.1475, -0.0528],
[ 0.0137, 0.0961, -0.0712, ..., -0.1326, 0.2231, 0.0328]],
[[-0.2572, 0.0130, 0.0312, ..., -0.1893, 0.0650, -0.2823],
[-0.1158, -0.1213, -0.2284, ..., -0.3106, -0.0287, -0.0872],
[-0.6114, 0.0659, -0.2104, ..., 0.1462, -0.1411, -0.0926],
...,
[-0.0275, -0.0704, 0.1242, ..., -0.1996, 0.2381, 0.2470],
[ 0.0416, 0.0063, 0.0567, ..., -0.2011, 0.3007, 0.2860],
[-0.0611, 0.2284, -0.3293, ..., -0.1689, 0.3786, -0.0570]],
[[-0.2710, -0.1798, 0.4292, ..., -0.0402, 0.0454, -0.2510],
[-0.3849, -0.1658, 0.0374, ..., -0.4138, -0.1476, -0.0957],
[-0.3024, -0.2653, -0.2197, ..., -0.3227, -0.1199, -0.1087],
...,
[-0.0071, -0.0239, 0.5017, ..., -0.1317, 0.0427, -0.3688],
[-0.4590, -0.1514, -0.0962, ..., 0.1622, 0.2298, -0.3776],
[-0.2516, -0.2136, 0.2195, ..., 0.1512, 0.0119, 0.0284]]],
grad_fn=<AddBackward0>)
>>> print(token_output.shape)
torch.Size([5, 20, 50])
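A token-level loss can be sketched the same way, with hypothetical per-token labels and padded positions mapped to the ignore index so they do not contribute to the loss:
token_labels = torch.randint(0, num_classes, batch.shape)  # dummy (B, L) labels
token_labels[batch_mask == 0] = -100                       # exclude padded positions
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(token_output.view(-1, num_classes), token_labels.view(-1))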
6️⃣ Other BERT Models
The library also provides models with various heads added on top of BERT.
https://huggingface.co/transformers/model_doc/bert.html
- BertForSequenceClassification
seq_model = BertForSequenceClassification.from_pretrained(bert_name)
By default, BertForSequenceClassification performs binary classification, i.e. its final classifier has 2 output classes.
>>> seq_model
BertForSequenceClassification(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
...
(11): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(dropout): Dropout(p=0.1, inplace=False)
(classifier): Linear(in_features=768, out_features=2, bias=True) # classification head with 2 output features
)
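The number of classes can be changed via num_labels, and passing labels returns a loss along with the logits (the newly initialized classifier head would still need fine-tuning). A hedged usage sketch with dummy labels:
seq_model10 = BertForSequenceClassification.from_pretrained(bert_name, num_labels=10)
labels = torch.randint(0, 10, (batch.size(0),))  # dummy labels
out = seq_model10(input_ids=batch, attention_mask=batch_mask, labels=labels)
print(out.loss)          # scalar classification loss
print(out.logits.shape)  # torch.Size([5, 10])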
- BertForMaskedLM
lm_model = BertForMaskedLM.from_pretrained(bert_name, config=config)
Because BertForMaskedLM is a (masked) language model, it outputs a vector the size of the vocabulary
(it has to predict the masked word).
>>> lm_model
BertForMaskedLM(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
...
(11): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
)
(cls): BertOnlyMLMHead(
(predictions): BertLMPredictionHead(
(transform): BertPredictionHeadTransform(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(decoder): Linear(in_features=768, out_features=30522, bias=True) # vocab-sized output (the head predicts a word)
)
)
)
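As a small usage sketch (with a made-up example sentence), we can mask one word and take the argmax over the vocabulary at the [MASK] position:
masked = tokenizer("I want to go [MASK].", return_tensors='pt')
with torch.no_grad():
    logits = lm_model(**masked).logits  # (1, L, vocab_size)
mask_pos = (masked['input_ids'] == tokenizer.mask_token_id).nonzero()[0, 1]
pred_id = logits[0, mask_pos].argmax().item()
print(tokenizer.convert_ids_to_tokens(pred_id))  # a plausible word, e.g. "home"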