Data & Analytics

코호트 분석 SQL로 직접 구현하기: 리텐션의 모든 것

GA4 없이도 코호트 분석 가능합니다. 가입월별 리텐션, N일 리텐션 SQL로 직접 구현.

코호트 분석 SQL로 직접 구현하기: 리텐션의 모든 것

코호트 분석 SQL로 직접 구현하기: 리텐션의 모든 것

GA4 없이도 코호트 분석 가능합니다. 가입월별 리텐션, N일 리텐션 SQL로 직접 구현.

TL;DR

  • 코호트(Cohort): 특정 시점에 같은 행동을 한 사용자 그룹
  • 리텐션: 코호트가 시간이 지나도 서비스를 계속 사용하는 비율
  • SQL로 구현: Window 함수와 CTE로 GA4 없이 직접 분석 가능
  • 핵심 인사이트: 어떤 코호트가 잔존율이 높은지, 언제 이탈이 심한지 파악

1. 코호트 분석이란?

왜 코호트 분석이 필요한가?

잘못된 분석:

sql
이번 달 DAU: 10,000명 → "좋아!"

올바른 분석:

sql
1월 가입자 중 3개월 후 활성 사용자: 20%
2월 가입자 중 3개월 후 활성 사용자: 35%
→ "2월에 뭔가 바뀌었구나!"

코호트 분석은 시간에 따른 사용자 행동 변화를 추적합니다.

코호트의 종류

코호트 기준예시
가입 시점2024년 1월 가입자
첫 구매 시점첫 구매가 2024년 Q1인 고객
유입 채널구글 광고로 유입된 사용자
첫 사용 기능프리미엄 기능을 먼저 사용한 사용자

2. 기본 테이블 구조

sql
-- 사용자 테이블
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    created_at TIMESTAMP,    -- 가입일
    signup_channel VARCHAR(50)
);

-- 활동 로그 테이블
CREATE TABLE user_activities (
    id INT PRIMARY KEY,
    user_id INT,
    activity_type VARCHAR(50),
    created_at TIMESTAMP
);

-- 주문 테이블
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    user_id INT,
    amount DECIMAL(10, 2),
    created_at TIMESTAMP
);

3. 월별 가입 코호트 리텐션

가장 기본적인 코호트 분석

sql
WITH user_cohorts AS (
    -- 각 사용자의 가입월 (코호트)
    SELECT
        user_id,
        DATE_TRUNC('month', created_at) as cohort_month
    FROM users
),
user_activities_monthly AS (
    -- 각 사용자의 월별 활동 여부
    SELECT DISTINCT
        user_id,
        DATE_TRUNC('month', created_at) as activity_month
    FROM user_activities
),
cohort_data AS (
    -- 코호트와 활동 데이터 결합
    SELECT
        uc.cohort_month,
        ua.activity_month,
        -- 가입 후 몇 개월 차인지
        DATE_PART('year', ua.activity_month) * 12 + DATE_PART('month', ua.activity_month)
        - DATE_PART('year', uc.cohort_month) * 12 - DATE_PART('month', uc.cohort_month)
        as months_since_signup,
        COUNT(DISTINCT uc.user_id) as active_users
    FROM user_cohorts uc
    JOIN user_activities_monthly ua ON uc.user_id = ua.user_id
    WHERE ua.activity_month >= uc.cohort_month
    GROUP BY uc.cohort_month, ua.activity_month
),
cohort_sizes AS (
    -- 각 코호트의 총 사용자 수
    SELECT
        cohort_month,
        COUNT(*) as cohort_size
    FROM user_cohorts
    GROUP BY cohort_month
)
SELECT
    cd.cohort_month,
    cs.cohort_size,
    cd.months_since_signup,
    cd.active_users,
    ROUND(cd.active_users * 100.0 / cs.cohort_size, 2) as retention_rate
FROM cohort_data cd
JOIN cohort_sizes cs ON cd.cohort_month = cs.cohort_month
ORDER BY cd.cohort_month, cd.months_since_signup;

결과:

cohort_monthcohort_sizemonths_since_signupactive_usersretention_rate
2024-01-01100001000100.00
2024-01-011000145045.00
2024-01-011000232032.00
2024-01-011000328028.00
2024-02-01120001200100.00
2024-02-011200160050.00

4. 피벗 테이블 형태로 보기

코호트 리텐션 매트릭스

sql
WITH user_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', created_at) as cohort_month
    FROM users
),
user_activities_monthly AS (
    SELECT DISTINCT
        user_id,
        DATE_TRUNC('month', created_at) as activity_month
    FROM user_activities
),
retention_data AS (
    SELECT
        uc.cohort_month,
        DATE_PART('year', ua.activity_month) * 12 + DATE_PART('month', ua.activity_month)
        - DATE_PART('year', uc.cohort_month) * 12 - DATE_PART('month', uc.cohort_month)
        as month_number,
        COUNT(DISTINCT uc.user_id) as users
    FROM user_cohorts uc
    LEFT JOIN user_activities_monthly ua
        ON uc.user_id = ua.user_id
        AND ua.activity_month >= uc.cohort_month
    GROUP BY uc.cohort_month, month_number
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(*) as size
    FROM user_cohorts
    GROUP BY cohort_month
)
SELECT
    TO_CHAR(r.cohort_month, 'YYYY-MM') as cohort,
    cs.size as cohort_size,
    MAX(CASE WHEN month_number = 0 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M0",
    MAX(CASE WHEN month_number = 1 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M1",
    MAX(CASE WHEN month_number = 2 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M2",
    MAX(CASE WHEN month_number = 3 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M3",
    MAX(CASE WHEN month_number = 4 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M4",
    MAX(CASE WHEN month_number = 5 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M5",
    MAX(CASE WHEN month_number = 6 THEN ROUND(users * 100.0 / cs.size, 1) END) as "M6"
FROM retention_data r
JOIN cohort_sizes cs ON r.cohort_month = cs.cohort_month
GROUP BY r.cohort_month, cs.size
ORDER BY r.cohort_month;

결과 (리텐션 매트릭스):

cohortcohort_sizeM0M1M2M3M4M5M6
2024-011000100453228252322
2024-0212001005038322927-
2024-03150010055423633--
2024-041100100483530---

5. N일 리텐션 (Day 1, Day 7, Day 30)

모바일 앱에서 많이 사용하는 지표

sql
WITH user_first_activity AS (
    -- 각 사용자의 첫 활동일
    SELECT
        user_id,
        DATE(MIN(created_at)) as first_activity_date
    FROM user_activities
    GROUP BY user_id
),
daily_activities AS (
    -- 사용자별 활동 날짜
    SELECT DISTINCT
        user_id,
        DATE(created_at) as activity_date
    FROM user_activities
),
retention_flags AS (
    SELECT
        ufa.user_id,
        ufa.first_activity_date,
        -- Day N 활동 여부 체크
        MAX(CASE WHEN da.activity_date = ufa.first_activity_date + INTERVAL '1 day'
            THEN 1 ELSE 0 END) as day_1,
        MAX(CASE WHEN da.activity_date = ufa.first_activity_date + INTERVAL '7 days'
            THEN 1 ELSE 0 END) as day_7,
        MAX(CASE WHEN da.activity_date = ufa.first_activity_date + INTERVAL '14 days'
            THEN 1 ELSE 0 END) as day_14,
        MAX(CASE WHEN da.activity_date = ufa.first_activity_date + INTERVAL '30 days'
            THEN 1 ELSE 0 END) as day_30
    FROM user_first_activity ufa
    LEFT JOIN daily_activities da ON ufa.user_id = da.user_id
    GROUP BY ufa.user_id, ufa.first_activity_date
)
SELECT
    DATE_TRUNC('week', first_activity_date) as cohort_week,
    COUNT(*) as cohort_size,
    ROUND(SUM(day_1) * 100.0 / COUNT(*), 2) as day_1_retention,
    ROUND(SUM(day_7) * 100.0 / COUNT(*), 2) as day_7_retention,
    ROUND(SUM(day_14) * 100.0 / COUNT(*), 2) as day_14_retention,
    ROUND(SUM(day_30) * 100.0 / COUNT(*), 2) as day_30_retention
FROM retention_flags
GROUP BY DATE_TRUNC('week', first_activity_date)
ORDER BY cohort_week;

결과:

cohort_weekcohort_sizeday_1_retentionday_7_retentionday_14_retentionday_30_retention
2024-01-0150042.0025.0018.0012.00
2024-01-0852045.0028.0020.0014.00
2024-01-1548040.0022.0015.0010.00

6. 구매 코호트: 재구매율 분석

첫 구매 후 재구매 패턴

sql
WITH first_purchase AS (
    -- 각 고객의 첫 구매
    SELECT
        user_id,
        DATE_TRUNC('month', MIN(created_at)) as first_purchase_month
    FROM orders
    GROUP BY user_id
),
monthly_purchases AS (
    -- 고객별 월별 구매 여부
    SELECT DISTINCT
        user_id,
        DATE_TRUNC('month', created_at) as purchase_month
    FROM orders
),
repurchase_data AS (
    SELECT
        fp.first_purchase_month as cohort,
        DATE_PART('year', mp.purchase_month) * 12 + DATE_PART('month', mp.purchase_month)
        - DATE_PART('year', fp.first_purchase_month) * 12 - DATE_PART('month', fp.first_purchase_month)
        as months_since_first,
        COUNT(DISTINCT fp.user_id) as customers
    FROM first_purchase fp
    JOIN monthly_purchases mp ON fp.user_id = mp.user_id
    WHERE mp.purchase_month >= fp.first_purchase_month
    GROUP BY fp.first_purchase_month, months_since_first
),
cohort_sizes AS (
    SELECT first_purchase_month, COUNT(*) as size
    FROM first_purchase
    GROUP BY first_purchase_month
)
SELECT
    TO_CHAR(rd.cohort, 'YYYY-MM') as cohort,
    cs.size,
    rd.months_since_first,
    rd.customers,
    ROUND(rd.customers * 100.0 / cs.size, 2) as repurchase_rate
FROM repurchase_data rd
JOIN cohort_sizes cs ON rd.cohort = cs.first_purchase_month
WHERE rd.months_since_first <= 6
ORDER BY rd.cohort, rd.months_since_first;

7. 채널별 코호트 비교

어떤 채널의 사용자가 더 오래 남는가?

sql
WITH user_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', created_at) as cohort_month,
        signup_channel
    FROM users
),
user_activities_monthly AS (
    SELECT DISTINCT
        user_id,
        DATE_TRUNC('month', created_at) as activity_month
    FROM user_activities
),
channel_retention AS (
    SELECT
        uc.signup_channel,
        DATE_PART('year', ua.activity_month) * 12 + DATE_PART('month', ua.activity_month)
        - DATE_PART('year', uc.cohort_month) * 12 - DATE_PART('month', uc.cohort_month)
        as month_number,
        COUNT(DISTINCT uc.user_id) as active_users,
        COUNT(DISTINCT CASE WHEN ua.activity_month = uc.cohort_month
            THEN uc.user_id END) as cohort_size
    FROM user_cohorts uc
    LEFT JOIN user_activities_monthly ua
        ON uc.user_id = ua.user_id
        AND ua.activity_month >= uc.cohort_month
    GROUP BY uc.signup_channel, month_number
)
SELECT
    signup_channel,
    month_number,
    active_users,
    ROUND(active_users * 100.0 / NULLIF(MAX(cohort_size) OVER (PARTITION BY signup_channel), 0), 2) as retention_rate
FROM channel_retention
WHERE month_number BETWEEN 0 AND 6
ORDER BY signup_channel, month_number;

결과:

signup_channelmonth_numberactive_usersretention_rate
google_ads05000100.00
google_ads1200040.00
google_ads3100020.00
organic03000100.00
organic1180060.00
organic3120040.00

→ Organic 유입이 유료 광고보다 리텐션이 2배!

8. 롤링 리텐션 vs 클래식 리텐션

클래식 리텐션

"정확히 Day 7에 접속한 사용자"

롤링 리텐션 (더 관대함)

"Day 7 이후 언제든 접속한 사용자"

sql
WITH user_first_activity AS (
    SELECT
        user_id,
        DATE(MIN(created_at)) as first_date
    FROM user_activities
    GROUP BY user_id
),
daily_activities AS (
    SELECT DISTINCT user_id, DATE(created_at) as activity_date
    FROM user_activities
),
retention_comparison AS (
    SELECT
        ufa.user_id,
        ufa.first_date,
        -- 클래식: 정확히 Day 7
        MAX(CASE WHEN da.activity_date = ufa.first_date + 7 THEN 1 ELSE 0 END) as classic_d7,
        -- 롤링: Day 7 이후 아무 때나
        MAX(CASE WHEN da.activity_date >= ufa.first_date + 7 THEN 1 ELSE 0 END) as rolling_d7
    FROM user_first_activity ufa
    LEFT JOIN daily_activities da ON ufa.user_id = da.user_id
    GROUP BY ufa.user_id, ufa.first_date
)
SELECT
    DATE_TRUNC('week', first_date) as cohort_week,
    COUNT(*) as users,
    ROUND(SUM(classic_d7) * 100.0 / COUNT(*), 2) as classic_d7_retention,
    ROUND(SUM(rolling_d7) * 100.0 / COUNT(*), 2) as rolling_d7_retention
FROM retention_comparison
GROUP BY DATE_TRUNC('week', first_date)
ORDER BY cohort_week;

9. LTV by 코호트

코호트별 고객 생애 가치

sql
WITH user_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', created_at) as cohort_month
    FROM users
),
user_revenue AS (
    SELECT
        user_id,
        SUM(amount) as total_revenue,
        COUNT(*) as order_count
    FROM orders
    GROUP BY user_id
)
SELECT
    TO_CHAR(uc.cohort_month, 'YYYY-MM') as cohort,
    COUNT(DISTINCT uc.user_id) as users,
    COUNT(DISTINCT ur.user_id) as paying_users,
    ROUND(COUNT(DISTINCT ur.user_id) * 100.0 / COUNT(DISTINCT uc.user_id), 2) as conversion_rate,
    ROUND(AVG(ur.total_revenue), 2) as avg_revenue_per_paying_user,
    ROUND(SUM(ur.total_revenue) / COUNT(DISTINCT uc.user_id), 2) as revenue_per_user
FROM user_cohorts uc
LEFT JOIN user_revenue ur ON uc.user_id = ur.user_id
GROUP BY uc.cohort_month
ORDER BY uc.cohort_month;

10. 실전 팁

1. 코호트 크기 필터링

작은 코호트는 통계적으로 불안정합니다.

sql
HAVING COUNT(*) >= 100  -- 최소 100명 이상

2. 시각화를 위한 데이터 포맷

sql
-- 히트맵 시각화용 long format
SELECT
    cohort_month,
    month_number,
    retention_rate
FROM retention_data
ORDER BY cohort_month, month_number;

3. 이탈률 = 100 - 리텐션

sql
SELECT
    cohort,
    month_number,
    retention_rate,
    100 - retention_rate as churn_rate,
    LAG(retention_rate) OVER (PARTITION BY cohort ORDER BY month_number) - retention_rate as monthly_churn
FROM retention_matrix;

결론

분석 유형질문SQL 핵심
월별 코호트언제 가입한 사용자가 잘 남는가?DATE_TRUNC + month 차이 계산
N일 리텐션Day 1/7/30에 몇 %가 돌아오는가?INTERVAL + CASE WHEN
채널별 비교어떤 채널이 LTV가 높은가?GROUP BY channel
롤링 리텐션이후 언제든 돌아오는 비율>= 비교

코호트 분석 핵심:

  • 시간에 따른 사용자 행동을 추적
  • 평균이 아닌 그룹별 트렌드 파악
  • 제품 개선의 효과를 정량적으로 측정

References

  1. Amplitude - Retention Analysis Guide
  2. Mixpanel - Cohort Analysis Best Practices
  3. Mode Analytics - SQL for Cohort Analysis