SANA: Linear Attention으로 초고속 고해상도 이미지 생성

TL;DR: SANA는 Linear Attention과 효율적인 토큰 압축을 통해 1024×1024 이미지를 0.6초 만에 생성합니다. DiT 대비 100배 이상 빠르면서 동등한 품질을 유지하는 획기적인 아키텍처입니다.

1. 소개: 속도와 품질의 트레이드오프 극복

1.1 기존 Diffusion 모델의 속도 문제

고해상도 이미지 생성은 계산 비용이 막대합니다:

모델	해상도	생성 시간	GPU 메모리
Stable Diffusion XL	1024²	~8초	16GB
PixArt-α	1024²	~5초	12GB
DALL-E 3	1024²	~12초	-
DiT-XL/2	512²	~4초	20GB

핵심 병목:

Transformer의 Self-Attention: $O(n^2)$ 복잡도
1024×1024 이미지 → 4096 패치 → 1,600만 쌍의 attention 연산!

1.2 SANA의 해결책

SANA (Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers)

핵심 혁신:
1. Linear Attention: O(n²) → O(n)
2. Deep Compression Encoder: 8× → 32× 압축
3. Mix-FFN: 지역적 정보 보존
4. Triton 커스텀 커널: 하드웨어 최적화

결과: 20배 이상 빠른 생성 속도!

2. Linear Attention의 이론적 배경

2.1 Standard Self-Attention 복습

기존 Transformer의 Self-Attention:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

계산 복잡도 분석:

python

def standard_attention(Q, K, V):
    """
    Q, K, V: [batch, seq_len, dim]
    복잡도: O(n² × d)
    """
    d_k = Q.shape[-1]

    # Step 1: QK^T 계산 - O(n² × d)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # scores: [batch, n, n] - n²개의 원소!

    # Step 2: Softmax - O(n²)
    attn_weights = F.softmax(scores, dim=-1)

    # Step 3: V와 곱 - O(n² × d)
    output = torch.matmul(attn_weights, V)

    return output

# 1024×1024 이미지의 경우:
# n = (1024/16)² = 4096 패치
# n² = 16,777,216 연산!

2.2 Linear Attention의 핵심 아이디어

Softmax를 커널 함수로 근사:

$\text{softmax}(QK^T) \approx \phi(Q) \cdot \phi(K)^T$

핵심 통찰: 연산 순서를 바꾸면 복잡도가 줄어듭니다!

Standard: (Q × K^T) × V → O(n² × d) + O(n² × d) = O(n²d)
Linear: Q × (K^T × V) → O(n × d × d) + O(n × d × d) = O(nd²)

n >> d 일 때 (고해상도 이미지):
n² vs n × d²
4096² vs 4096 × 128²
16M vs 67M → 거의 비슷!

하지만 n이 더 커지면:
8192² vs 8192 × 128²
67M vs 134M → Linear가 훨씬 효율적!

2.3 SANA의 Linear Attention 구현

python

class LinearAttention(nn.Module):
    """
    SANA의 Linear Attention 구현
    """
    def __init__(self, dim, num_heads=8, qk_dim=64):
        super().__init__()
        self.num_heads = num_heads
        self.qk_dim = qk_dim
        self.scale = qk_dim ** -0.5

        # Q, K는 낮은 차원으로 projection
        self.q_proj = nn.Linear(dim, num_heads * qk_dim)
        self.k_proj = nn.Linear(dim, num_heads * qk_dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

        # Feature map (kernel function)
        self.feature_map = nn.Sequential(
            nn.Linear(qk_dim, qk_dim),
            nn.ReLU()  # φ(x) = ReLU(Wx)
        )

    def forward(self, x):
        B, N, C = x.shape

        # Q, K, V 계산
        q = self.q_proj(x).view(B, N, self.num_heads, self.qk_dim)
        k = self.k_proj(x).view(B, N, self.num_heads, self.qk_dim)
        v = self.v_proj(x).view(B, N, self.num_heads, -1)

        # Feature map 적용 (커널 근사)
        q = self.feature_map(q)  # φ(Q)
        k = self.feature_map(k)  # φ(K)

        # Linear Attention: Q × (K^T × V)
        # Step 1: K^T × V - O(n × d_k × d_v)
        kv = torch.einsum('bnhk,bnhv->bhkv', k, v)

        # Step 2: Q × (K^T × V) - O(n × d_k × d_v)
        out = torch.einsum('bnhk,bhkv->bnhv', q, kv)

        # Normalization (numerical stability)
        normalizer = torch.einsum('bnhk,bhk->bnh', q, k.sum(dim=1))
        out = out / (normalizer.unsqueeze(-1) + 1e-6)

        # Reshape and project
        out = out.reshape(B, N, -1)
        out = self.out_proj(out)

        return out

3. Deep Compression AutoEncoder (DC-AE)

3.1 기존 VAE의 한계

Stable Diffusion의 VAE:

압축률: 8× (512→64, 1024→128)
잠재 공간 크기: 여전히 큼 (128²×4 = 65,536 토큰)

3.2 SANA의 32× 압축

SANA DC-AE:
이미지 (1024×1024×3)
↓ 32× 압축
잠재 표현 (32×32×32)
= 1,024 토큰 (기존 대비 64배 감소!)

vs Stable Diffusion:
이미지 (1024×1024×3)
↓ 8× 압축
잠재 표현 (128×128×4)
= 16,384 토큰

3.3 DC-AE 아키텍처

python

class DeepCompressionAutoEncoder(nn.Module):
    """
    SANA의 32× 압축 AutoEncoder
    """
    def __init__(
        self,
        in_channels=3,
        latent_channels=32,
        base_channels=128
    ):
        super().__init__()

        # Encoder: 32× 다운샘플링 (5번의 2× 다운샘플)
        self.encoder = nn.Sequential(
            # 1024 → 512
            ConvBlock(in_channels, base_channels, stride=2),
            ResBlock(base_channels),

            # 512 → 256
            ConvBlock(base_channels, base_channels * 2, stride=2),
            ResBlock(base_channels * 2),

            # 256 → 128
            ConvBlock(base_channels * 2, base_channels * 4, stride=2),
            ResBlock(base_channels * 4),

            # 128 → 64
            ConvBlock(base_channels * 4, base_channels * 8, stride=2),
            ResBlock(base_channels * 8),

            # 64 → 32
            ConvBlock(base_channels * 8, latent_channels, stride=2),
        )

        # Decoder: 32× 업샘플링
        self.decoder = nn.Sequential(
            # 32 → 64
            UpConvBlock(latent_channels, base_channels * 8),
            ResBlock(base_channels * 8),

            # 64 → 128
            UpConvBlock(base_channels * 8, base_channels * 4),
            ResBlock(base_channels * 4),

            # 128 → 256
            UpConvBlock(base_channels * 4, base_channels * 2),
            ResBlock(base_channels * 2),

            # 256 → 512
            UpConvBlock(base_channels * 2, base_channels),
            ResBlock(base_channels),

            # 512 → 1024
            UpConvBlock(base_channels, in_channels),
        )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        z = self.encode(x)
        recon = self.decode(z)
        return recon, z

3.4 고압축에서 품질 유지하기

python

class DCAutoEncoderLoss(nn.Module):
    """
    DC-AE 학습을 위한 다중 손실 함수
    """
    def __init__(self):
        super().__init__()
        self.perceptual = LPIPS()
        self.discriminator = PatchGAN()

    def forward(self, x, recon, z):
        # 1. Reconstruction Loss
        l1_loss = F.l1_loss(recon, x)

        # 2. Perceptual Loss (더 중요!)
        perceptual_loss = self.perceptual(recon, x)

        # 3. Adversarial Loss
        real_pred = self.discriminator(x)
        fake_pred = self.discriminator(recon)
        adv_loss = F.binary_cross_entropy_with_logits(
            fake_pred, torch.ones_like(fake_pred)
        )

        # 4. Latent Regularization (KL divergence)
        kl_loss = 0.5 * (z.pow(2) - 1).mean()

        # 가중치 조합
        total_loss = (
            l1_loss * 1.0 +
            perceptual_loss * 0.5 +
            adv_loss * 0.1 +
            kl_loss * 0.0001
        )

        return total_loss

4. Mix-FFN: 지역적 정보 보존

4.1 Global Attention의 문제

Linear Attention은 효율적이지만:

지역적 패턴 포착이 약함
이미지의 공간적 구조 무시 가능성

4.2 Mix-FFN 설계

python

class MixFFN(nn.Module):
    """
    Mix-FFN: FFN에 Depthwise Convolution 추가
    지역적 정보와 전역적 정보를 동시에 처리
    """
    def __init__(self, dim, hidden_dim=None, kernel_size=3):
        super().__init__()
        hidden_dim = hidden_dim or dim * 4

        self.fc1 = nn.Linear(dim, hidden_dim)

        # Depthwise Convolution: 지역적 정보 처리
        self.dwconv = nn.Conv2d(
            hidden_dim, hidden_dim,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=hidden_dim  # Depthwise!
        )

        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):
        """
        x: [B, N, C] where N = H × W
        """
        B, N, C = x.shape

        # Linear projection
        x = self.fc1(x)

        # Reshape for conv: [B, N, C] → [B, C, H, W]
        x = x.transpose(1, 2).view(B, -1, H, W)

        # Depthwise convolution (지역적 패턴)
        x = self.dwconv(x)
        x = self.act(x)

        # Reshape back: [B, C, H, W] → [B, N, C]
        x = x.flatten(2).transpose(1, 2)

        # Final projection
        x = self.fc2(x)

        return x

4.3 SANA 블록 전체 구조

python

class SANABlock(nn.Module):
    """
    SANA Transformer Block:
    Linear Attention + Mix-FFN + AdaLN
    """
    def __init__(self, dim, num_heads, mlp_ratio=4, qk_dim=64):
        super().__init__()

        # Normalization
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)

        # Linear Attention
        self.attn = LinearAttention(dim, num_heads, qk_dim)

        # Mix-FFN
        self.ffn = MixFFN(dim, dim * mlp_ratio)

        # AdaLN modulation
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(dim, 6 * dim)
        )

    def forward(self, x, c, H, W):
        """
        x: [B, N, C] - 패치 토큰
        c: [B, C] - 조건 임베딩 (timestep + text)
        """
        # AdaLN parameters
        shift_attn, scale_attn, gate_attn, shift_ffn, scale_ffn, gate_ffn = \
            self.adaLN_modulation(c).unsqueeze(1).chunk(6, dim=-1)

        # Linear Attention with AdaLN
        x_norm = self.norm1(x) * (1 + scale_attn) + shift_attn
        x = x + gate_attn * self.attn(x_norm)

        # Mix-FFN with AdaLN
        x_norm = self.norm2(x) * (1 + scale_ffn) + shift_ffn
        x = x + gate_ffn * self.ffn(x_norm, H, W)

        return x

5. 전체 SANA 아키텍처

5.1 모델 구성

python

class SANA(nn.Module):
    """
    SANA: Linear Diffusion Transformer for High-Resolution Image Synthesis
    """
    def __init__(
        self,
        image_size=1024,
        patch_size=32,  # DC-AE compression
        latent_channels=32,
        hidden_dim=1152,
        depth=28,
        num_heads=16,
        mlp_ratio=4,
        text_dim=4096  # T5-XXL
    ):
        super().__init__()

        self.image_size = image_size
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2  # 1024개

        # Patch embedding
        self.patch_embed = nn.Linear(latent_channels * patch_size * patch_size, hidden_dim)

        # Positional embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, hidden_dim))

        # Timestep embedding
        self.time_embed = nn.Sequential(
            SinusoidalPosEmb(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

        # Text projection
        self.text_proj = nn.Linear(text_dim, hidden_dim)

        # SANA Blocks
        self.blocks = nn.ModuleList([
            SANABlock(hidden_dim, num_heads, mlp_ratio)
            for _ in range(depth)
        ])

        # Cross-Attention blocks (매 4번째 블록)
        self.cross_attn_blocks = nn.ModuleList([
            CrossAttention(hidden_dim, num_heads)
            for _ in range(depth // 4)
        ])

        # Final layer
        self.final_norm = nn.LayerNorm(hidden_dim)
        self.final_proj = nn.Linear(hidden_dim, latent_channels * patch_size * patch_size)

    def forward(self, z, t, text_emb, text_mask=None):
        """
        z: [B, C, H, W] - 잠재 표현
        t: [B] - timestep
        text_emb: [B, L, D] - 텍스트 임베딩
        """
        B, C, H, W = z.shape

        # Patchify
        x = z.view(B, C, H // self.patch_size, self.patch_size,
                      W // self.patch_size, self.patch_size)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, C * self.patch_size ** 2)
        x = self.patch_embed(x)

        # Add positional embedding
        x = x + self.pos_embed

        # Timestep conditioning
        t_emb = self.time_embed(t)

        # Text pooling for AdaLN
        text_pooled = text_emb.mean(dim=1)
        c = t_emb + self.text_proj(text_pooled)

        # Process through blocks
        patch_H, patch_W = H // self.patch_size, W // self.patch_size
        cross_attn_idx = 0

        for i, block in enumerate(self.blocks):
            x = block(x, c, patch_H, patch_W)

            # Cross-attention every 4 blocks
            if (i + 1) % 4 == 0 and cross_attn_idx < len(self.cross_attn_blocks):
                x = self.cross_attn_blocks[cross_attn_idx](x, text_emb, text_mask)
                cross_attn_idx += 1

        # Final projection
        x = self.final_norm(x)
        x = self.final_proj(x)

        # Unpatchify
        x = x.view(B, patch_H, patch_W, C, self.patch_size, self.patch_size)
        x = x.permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W)

        return x

5.2 Cross-Attention 구현

python

class CrossAttention(nn.Module):
    """
    텍스트 조건을 위한 Cross-Attention
    (Linear Attention이 아닌 표준 Attention 사용)
    """
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.norm = nn.LayerNorm(dim)
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, dim * 2)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x, text_emb, text_mask=None):
        B, N, C = x.shape
        _, L, _ = text_emb.shape

        # Normalize
        x_norm = self.norm(x)

        # Q from image, K,V from text
        q = self.q_proj(x_norm).view(B, N, self.num_heads, self.head_dim)
        kv = self.kv_proj(text_emb).view(B, L, 2, self.num_heads, self.head_dim)
        k, v = kv.unbind(dim=2)

        # Attention
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale

        if text_mask is not None:
            attn = attn.masked_fill(~text_mask.unsqueeze(1).unsqueeze(2), float('-inf'))

        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)

        return x + self.out_proj(out)

6. Triton 커널 최적화

6.1 왜 커스텀 커널이 필요한가?

PyTorch 기본 구현의 한계:

Linear Attention은 표준 연산이 아님
중간 텐서 메모리 오버헤드
GPU 활용률 비효율

6.2 Triton Linear Attention 커널

python

import triton
import triton.language as tl

@triton.jit
def linear_attention_kernel(
    Q_ptr, K_ptr, V_ptr, O_ptr,
    stride_qb, stride_qh, stride_qn, stride_qk,
    stride_kb, stride_kh, stride_kn, stride_kk,
    stride_vb, stride_vh, stride_vn, stride_vd,
    stride_ob, stride_oh, stride_on, stride_od,
    N, D_K, D_V,
    BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr, BLOCK_V: tl.constexpr
):
    """
    Triton 커널로 구현한 Linear Attention
    메모리 효율적인 online computation
    """
    # Block indices
    batch_idx = tl.program_id(0)
    head_idx = tl.program_id(1)

    # Initialize accumulators for KV
    kv_acc = tl.zeros([BLOCK_K, BLOCK_V], dtype=tl.float32)
    k_sum = tl.zeros([BLOCK_K], dtype=tl.float32)

    # First pass: compute K^T @ V
    for n_start in range(0, N, BLOCK_N):
        n_offs = n_start + tl.arange(0, BLOCK_N)
        mask_n = n_offs < N

        # Load K and V
        k_ptrs = K_ptr + batch_idx * stride_kb + head_idx * stride_kh + \
                 n_offs[:, None] * stride_kn + tl.arange(0, BLOCK_K)[None, :] * stride_kk
        v_ptrs = V_ptr + batch_idx * stride_vb + head_idx * stride_vh + \
                 n_offs[:, None] * stride_vn + tl.arange(0, BLOCK_V)[None, :] * stride_vd

        k = tl.load(k_ptrs, mask=mask_n[:, None], other=0.0)
        v = tl.load(v_ptrs, mask=mask_n[:, None], other=0.0)

        # Apply feature map (ReLU)
        k = tl.maximum(k, 0)

        # Accumulate KV
        kv_acc += tl.dot(k.T, v)
        k_sum += tl.sum(k, axis=0)

    # Second pass: Q @ (K^T @ V)
    for n_start in range(0, N, BLOCK_N):
        n_offs = n_start + tl.arange(0, BLOCK_N)
        mask_n = n_offs < N

        # Load Q
        q_ptrs = Q_ptr + batch_idx * stride_qb + head_idx * stride_qh + \
                 n_offs[:, None] * stride_qn + tl.arange(0, BLOCK_K)[None, :] * stride_qk
        q = tl.load(q_ptrs, mask=mask_n[:, None], other=0.0)
        q = tl.maximum(q, 0)  # Feature map

        # Compute output
        out = tl.dot(q, kv_acc)

        # Normalize
        normalizer = tl.dot(q, k_sum[:, None]) + 1e-6
        out = out / normalizer

        # Store
        o_ptrs = O_ptr + batch_idx * stride_ob + head_idx * stride_oh + \
                 n_offs[:, None] * stride_on + tl.arange(0, BLOCK_V)[None, :] * stride_od
        tl.store(o_ptrs, out, mask=mask_n[:, None])

6.3 성능 비교

python

def benchmark_attention():
    """
    Linear Attention 구현별 성능 비교
    """
    B, N, H, D = 4, 4096, 16, 64  # 1024×1024 이미지

    results = {}

    # 1. Standard Attention (baseline)
    standard_time = measure_time(standard_attention, B, N, H, D)
    results["Standard Attention"] = standard_time

    # 2. PyTorch Linear Attention
    pytorch_linear_time = measure_time(pytorch_linear_attention, B, N, H, D)
    results["PyTorch Linear"] = pytorch_linear_time

    # 3. Triton Linear Attention
    triton_linear_time = measure_time(triton_linear_attention, B, N, H, D)
    results["Triton Linear"] = triton_linear_time

    return results

# 결과 (A100 기준):
# Standard Attention: 15.2ms
# PyTorch Linear: 4.8ms (3.2x faster)
# Triton Linear: 2.1ms (7.2x faster)

7. 학습 및 추론

7.1 학습 파이프라인

python

class SANATrainer:
    def __init__(self, config):
        self.model = SANA(**config.model)
        self.dc_ae = DeepCompressionAutoEncoder()
        self.text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-xxl")

        self.optimizer = torch.optim.AdamW(
            self.model.parameters(),
            lr=config.lr,
            betas=(0.9, 0.999),
            weight_decay=0.01
        )

        self.scheduler = DDPMScheduler(
            num_train_timesteps=1000,
            beta_schedule="scaled_linear"
        )

    def train_step(self, images, captions):
        # 1. Encode images with DC-AE
        with torch.no_grad():
            latents = self.dc_ae.encode(images)

        # 2. Encode text
        with torch.no_grad():
            text_emb = self.text_encoder(captions).last_hidden_state

        # 3. Sample noise and timesteps
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, 1000, (images.shape[0],))

        # 4. Add noise
        noisy_latents = self.scheduler.add_noise(latents, noise, timesteps)

        # 5. Predict noise
        pred_noise = self.model(noisy_latents, timesteps, text_emb)

        # 6. Compute loss
        loss = F.mse_loss(pred_noise, noise)

        # 7. Backprop
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        self.optimizer.step()

        return loss.item()

7.2 빠른 추론

python

class SANAPipeline:
    def __init__(self, model_path):
        self.model = SANA.from_pretrained(model_path)
        self.dc_ae = DeepCompressionAutoEncoder.from_pretrained(model_path)
        self.text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-xxl")
        self.scheduler = DDIMScheduler(num_train_timesteps=1000)

        # Compile for speed (PyTorch 2.0)
        self.model = torch.compile(self.model, mode="max-autotune")

    @torch.no_grad()
    def generate(
        self,
        prompt: str,
        negative_prompt: str = "",
        num_inference_steps: int = 20,  # DDIM은 적은 스텝 가능
        guidance_scale: float = 7.5,
        height: int = 1024,
        width: int = 1024,
        seed: int = None
    ):
        if seed is not None:
            torch.manual_seed(seed)

        # Text encoding
        text_emb = self.encode_text(prompt)
        if guidance_scale > 1.0:
            uncond_emb = self.encode_text(negative_prompt)
            text_emb = torch.cat([uncond_emb, text_emb])

        # Initial noise (32× 압축된 크기)
        latent_h, latent_w = height // 32, width // 32
        latents = torch.randn(1, 32, latent_h, latent_w, device="cuda")

        # DDIM denoising
        self.scheduler.set_timesteps(num_inference_steps)

        for t in self.scheduler.timesteps:
            latent_input = torch.cat([latents] * 2) if guidance_scale > 1.0 else latents

            # Predict noise
            noise_pred = self.model(latent_input, t, text_emb)

            # CFG
            if guidance_scale > 1.0:
                uncond_pred, cond_pred = noise_pred.chunk(2)
                noise_pred = uncond_pred + guidance_scale * (cond_pred - uncond_pred)

            # DDIM step
            latents = self.scheduler.step(noise_pred, t, latents).prev_sample

        # Decode with DC-AE
        images = self.dc_ae.decode(latents)
        images = (images / 2 + 0.5).clamp(0, 1)

        return images

    def encode_text(self, text):
        tokens = self.tokenizer(text, return_tensors="pt", padding=True, truncation=True)
        return self.text_encoder(**tokens.to("cuda")).last_hidden_state

8. 실험 결과

8.1 속도 비교

모델	해상도	스텝	생성 시간	속도 향상
DiT-XL/2	512²	50	4.2초	1×
PixArt-α	1024²	50	5.1초	-
SANA-0.6B	1024²	20	0.6초	8.5×
SANA-1.6B	1024²	20	0.9초	5.7×

8.2 품질 비교

FID Score (COCO 2014):

모델	FID↓	CLIP Score↑	파라미터
SDXL	8.92	0.31	2.6B
PixArt-α	7.32	0.32	600M
SANA-0.6B	7.85	0.31	600M
SANA-1.6B	6.91	0.33	1.6B

8.3 효율성 분석

=== 메모리 사용량 (1024×1024 생성) ===

DiT-XL/2 (외삽): ~35GB (불가능)
PixArt-α: 12GB
SANA-0.6B: 6GB
SANA-1.6B: 10GB

=== 연산량 (TFLOPs) ===

DiT-XL/2 (512²): 118 TFLOPs
SANA-0.6B (1024²): 48 TFLOPs (2.5× 적음!)
SANA-1.6B (1024²): 95 TFLOPs (1.2× 적음)

9. 한계 및 향후 연구

9.1 현재 한계

SANA의 한계:

1. 32× 압축의 트레이드오프
- 세밀한 디테일 손실 가능
- 얼굴, 손 등 복잡한 영역에서 품질 저하

2. Linear Attention의 표현력
- 복잡한 공간 관계 모델링 약함
- Mix-FFN으로 부분 보완하지만 완전하지 않음

3. 학습 데이터 의존성
- 여전히 대규모 고품질 데이터 필요

9.2 향후 연구 방향

python

future_directions = {
    "adaptive_compression": {
        "idea": "영역별 다른 압축률 적용",
        "benefit": "중요 영역은 높은 해상도 유지"
    },
    "hybrid_attention": {
        "idea": "Linear + Standard Attention 동적 전환",
        "benefit": "효율성과 표현력의 균형"
    },
    "video_extension": {
        "idea": "시간 축 Linear Attention",
        "benefit": "초고속 비디오 생성"
    },
    "distillation": {
        "idea": "더 작은 모델로 지식 증류",
        "benefit": "모바일/엣지 디바이스 배포"
    }
}

10. 결론

10.1 SANA의 핵심 기여

기여	설명
Linear Attention	O(n²) → O(n)으로 확장성 혁신
DC-AE	32× 압축으로 토큰 수 64배 감소
Mix-FFN	지역적 정보 보존
Triton 커널	하드웨어 수준 최적화

10.2 실용적 의미

SANA가 가능하게 한 것:

1. 실시간 이미지 생성
- 1024² 이미지를 1초 미만에 생성
- 인터랙티브 애플리케이션 가능

2. 리소스 민주화
- 6GB GPU로 고해상도 생성
- 개인 PC에서도 실행 가능

3. 비용 절감
- 클라우드 비용 90% 이상 절감
- API 서비스 비용 최소화

4. 새로운 애플리케이션
- 실시간 이미지 편집
- 게임 내 동적 텍스처 생성
- AR/VR 콘텐츠

참고문헌

Xie, E., et al. (2024). SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer. arXiv:2410.10629
Katharopoulos, A., et al. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020
Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023
Chen, J., et al. (2023). PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. arXiv:2310.00426
Tillet, P., et al. (2019). Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. MLSys 2019

Tags: #SANA #Linear-Attention #DiT #Efficient-Diffusion #DC-AE #High-Resolution #Image-Generation #Triton #Mix-FFN

이 글의 실험 코드는 첨부된 Jupyter Notebook에서 확인할 수 있습니다.