A Complete Implementation of the Attention Mechanism: Bahdanau vs. Luong Attention

Jan 24, 2025

Read time: 1 minute

Why Attention Is Needed

The fixed-length context vector in Seq2Seq becomes an information bottleneck for long sentences. The attention mechanism solves this by letting the decoder access all of the encoder's hidden states at every time step.

1. Bahdanau Attention (Additive Attention)

Equations

e_ij = V_a^T tanh(W_a s_{i-1} + U_a h_j)

α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_k exp(e_ik)

c_i = Σ_j α_ij * h_j

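s_{i-1}: the decoder's hidden state at the previous time step
h_j: the encoder's j-th hidden state
W_a, U_a, V_a: learnable weights (W_a and U_a are matrices; V_a is a vector)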
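
To make the three equations above concrete, here is a minimal sketch that evaluates them for a single decoder step in plain Python, without PyTorch. The helper names (`matvec`, `softmax`, `additive_attention`) and all numeric values are assumptions chosen for illustration, not part of the original post.

```python
import math


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Return (alpha, context) for one decoder step.

    s_prev: previous decoder state s_{i-1}
    H:      list of encoder states h_1..h_T
    """
    Ws = matvec(W_a, s_prev)  # W_a s_{i-1}, shared by every encoder position
    scores = []
    for h in H:
        Uh = matvec(U_a, h)  # U_a h_j
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))  # e_ij
    alpha = softmax(scores)  # alpha_ij, normalized over j
    dim = len(H[0])
    context = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(dim)]  # c_i
    return alpha, context


# Toy values for illustration only: hidden size 2, three encoder states.
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]
s_prev = [0.5, -0.5]
H = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

alpha, context = additive_attention(s_prev, H, W_a, U_a, v_a)
print(alpha, context)
```

Note that `W_a s_{i-1}` is computed once and reused for every encoder position; the PyTorch version gets the same effect through broadcasting.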

PyTorch Implementation

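import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)
        self.U = nn.Linear(hidden_size, hidden_size, bias=False)
        self.V = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (1, batch, hidden); encoder_outputs: (batch, seq_len, hidden)
        # Project s_{i-1} with W_a -> (batch, 1, hidden)
        s_proj = self.W(decoder_hidden.permute(1, 0, 2))
        # Project every h_j with U_a -> (batch, seq_len, hidden)
        h_proj = self.U(encoder_outputs)
        # Scores: V_a^T tanh(W_a s + U_a h); s_proj broadcasts over seq_len -> (batch, seq_len, 1)
        scores = self.V(torch.tanh(s_proj + h_proj))
        # Softmax over the sequence dimension gives the attention weights
        attn_weights = F.softmax(scores, dim=1)
        # Context vector = Σ α * h -> (batch, 1, hidden)
        context = torch.bmm(attn_weights.permute(0, 2, 1), encoder_outputs)

        return context, attn_weights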

Integrating Attention into the Decoder

class BahdanauAttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attention = BahdanauAttention(hidden_size)
        # GRU input = embedding + context (2 * hidden_size)
        self.gru = nn.GRU(hidden_size*2, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, decoder_input, decoder_hidden, encoder_outputs):
        # decoder_input: (batch, 1) token indices -> embedded: (batch, 1, hidden)
        embedded = self.embedding(decoder_input)
        # Compute the context vector with attention
        context, attn_weights = self.attention(decoder_hidden, encoder_outputs)
        # Concatenate the embedding and the context along the feature dimension
        rnn_input = torch.cat((embedded, context), dim=2)
        output, hidden = self.gru(rnn_input, decoder_hidden)
        # Log-probabilities over the output vocabulary: (batch, output_size)
        output = F.log_softmax(self.out(output.squeeze(1)), dim=1)
        return output, hidden, attn_weights

2. Luong Attention (Multiplicative Attention)

Key Differences

Bahdanau: uses the decoder's previous hidden state (s_{i-1})
Luong: uses the decoder's current hidden state (h_t)
Luong's score is simpler and cheaper to compute (dot-product based)

Equations (General Scoring)

score(h_t, h̄_s) = h_t^T W_a h̄_s

a_t(s) = softmax_s(score(h_t, h̄_s))

c_t = Σ_s a_t(s) * h̄_s
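
As a rough illustration of the general score, the sketch below evaluates h_t^T W_a h̄_s and the resulting context vector in plain Python. The function names and the toy vectors are assumptions made for this example. With W_a set to the identity matrix, the general score reduces to a plain dot product, which makes the numbers easy to check by hand.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def general_score(h_t, W_a, h_s):
    """score(h_t, h_s) = h_t^T W_a h_s."""
    Wh = [sum(w * x for w, x in zip(row, h_s)) for row in W_a]
    return sum(a * b for a, b in zip(h_t, Wh))


def luong_attention(h_t, encoder_states, W_a):
    """Return (weights, context) for the current decoder state h_t."""
    scores = [general_score(h_t, W_a, h_s) for h_s in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h_s[d] for w, h_s in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy values for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
h_t = [1.0, 2.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights, context = luong_attention(h_t, encoder_states, identity)
print(weights, context)
```

Here the raw scores are 1, 2, and 3, so the third encoder state receives the largest weight.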

PyTorch Implementation

class LuongAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attn = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, decoder_output, encoder_outputs):
        # decoder_output: (batch, 1, hidden); encoder_outputs: (batch, seq_len, hidden)
        # 'general' score: h_t^T (W_a h̄_s) -> (batch, 1, seq_len)
        keys = self.attn(encoder_outputs)
        scores = torch.bmm(decoder_output, keys.transpose(1, 2))
        # Softmax over the sequence dimension
        attn_weights = F.softmax(scores, dim=2)
        # Context vector: weighted sum of encoder outputs -> (batch, 1, hidden)
        context = torch.bmm(attn_weights, encoder_outputs)
        return context, attn_weights

Practice: A Date Format Converter

To see the effect of attention, let's build a model that converts dates in various formats into a standard format:

import random

# Input formats (input → target)
# "June 18, 2025"      →  "2025-06-18"
# "18 Jun 2025"        →  "2025-06-18"
# "2025/06/18"         →  "2025-06-18"
# "06-18-2025"         →  "2025-06-18"

# Train for 5,000 iterations, sampling one random pair per iteration
# (raw_data, tensor_from_string, train, and evaluate are defined outside this snippet)
n_iters = 5000
for iteration in range(1, n_iters+1):
    pair = random.choice(raw_data)
    input_tensor = tensor_from_string(input_char_lang, pair[0])
    target_tensor = tensor_from_string(output_char_lang, pair[1])
    loss = train(input_tensor, target_tensor, encoder, decoder, optimizers, criterion)

# Result
evaluate("September 20, 2025", encoder, decoder)
# Output: 2025-09-20 ✅
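
The post does not show how `raw_data` is built, so the following is only one possible sketch of generating (input, target) pairs in the four formats listed above. `INPUT_FORMATS`, `make_pair`, `make_dataset`, the date range, and the dataset size are all hypothetical choices. It also assumes the default English month names from `strftime`, and note that `%d` zero-pads the day (for example "June 05, 2025").

```python
import random
from datetime import date, timedelta

# strftime patterns matching the four example input formats (an assumption).
INPUT_FORMATS = ["%B %d, %Y", "%d %b %Y", "%Y/%m/%d", "%m-%d-%Y"]
TARGET_FORMAT = "%Y-%m-%d"


def make_pair(d, fmt):
    """Render one date as (input string in fmt, target string in ISO form)."""
    return d.strftime(fmt), d.strftime(TARGET_FORMAT)


def make_dataset(n, seed=0, start=date(2000, 1, 1), span_days=365 * 30):
    """Sample n random dates and render each in a randomly chosen input format."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    pairs = []
    for _ in range(n):
        d = start + timedelta(days=rng.randrange(span_days))
        pairs.append(make_pair(d, rng.choice(INPUT_FORMATS)))
    return pairs


# Hypothetical stand-in for the raw_data used in the training loop.
raw_data = make_dataset(5000)
print(raw_data[:3])
```

Each element is a 2-tuple, so `pair[0]` and `pair[1]` in the training loop would map to the input string and the target string respectively.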

Attention Heatmap Visualization

Visualizing the attention weights shows which parts of the input the model focuses on. For example, while the decoder emits "09", the weights concentrate on the input characters of "September".
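
As a dependency-free sketch of the idea, the code below renders an attention matrix as text and reports, for each output character, the input character that received the largest weight. The function names and the weight values are made up for illustration; they are not results from the model in this post. A plotting library (for example matplotlib's `imshow`) is the more common way to draw the actual heatmap.

```python
def strongest_alignment(input_chars, output_chars, weights):
    """For each output char, return (output char, most-attended input char, weight).

    weights[i][j] is the attention of output step i on input position j.
    """
    result = []
    for out_ch, row in zip(output_chars, weights):
        j = max(range(len(row)), key=row.__getitem__)  # argmax over input positions
        result.append((out_ch, input_chars[j], row[j]))
    return result


def render_heatmap(input_chars, output_chars, weights, shades=" .:*#"):
    """Render the weight matrix as text: one row per output char, darker = larger."""
    lines = ["  " + "".join(input_chars)]
    for out_ch, row in zip(output_chars, weights):
        cells = "".join(
            shades[min(int(w * len(shades)), len(shades) - 1)] for w in row
        )
        lines.append(out_ch + " " + cells)
    return "\n".join(lines)


# Made-up weights for illustration only (each row sums to 1).
input_chars = list("Sep 2")
output_chars = list("09")
weights = [
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.10, 0.30, 0.50, 0.05, 0.05],
]

print(render_heatmap(input_chars, output_chars, weights))
print(strongest_alignment(input_chars, output_chars, weights))
```

To use it with the decoders above, the per-step `attn_weights` tensors would first need to be collected and converted to nested lists, one row per generated character.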

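Key Takeaways

Attention removes the information bottleneck of the fixed-length context vector
Bahdanau attention is additive; Luong attention is multiplicative
Visualizing attention weights makes the model more interpretable

Self-Attention and the Transformer architecture, the next step in the evolution of attention, are covered in the Udemy course "Attention, Transformer, BERT, GPT 완벽 마스터", with full implementations up through Multi-Head Attention and Positional Encoding.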