Data & Analytics

"We Need Python for This" — Handling Pivot, JSON, UTM, RFM All in SQL

Learn practical patterns to handle Pivot, JSON parsing, UTM extraction, and RFM segmentation with a single SQL query instead of 100 lines of Python.

"We Need Python for This" — Handling Pivot, JSON, UTM, RFM All in SQL

"We Need Python for This" — Handling Pivot, JSON, UTM, RFM All in SQL

Python 100 Lines vs SQL One Query

"We need to build ETL in Python" — every time I hear this, I wonder: Set up Airflow, manage dependencies, configure schedulers... Do we really need all that?

80% of data processing tasks in real work can be done with SQL alone.

This article covers patterns for handling Pivot, JSON parsing, UTM extraction, and RFM segmentation — tasks commonly thought to "need Python" — with just SQL.

1. SQL vs Python: When to Choose What

When SQL is Better

  • Aggregation: DB engine parallel processing beats pandas
  • Pivot/Unpivot: Native support in BigQuery, Snowflake
  • JSON parsing: Built-in functions like JSON_EXTRACT, JSON_VALUE
  • Regex extraction: REGEXP_EXTRACT for UTM, etc.
  • Large joins: Distributed processing, broadcast joins

When You Need Python

  • External APIs: REST APIs, web scraping
  • ML models: Predictions, clustering
  • Complex NLP: Tokenization, morphological analysis
  • Stateful processing: Cumulative calculations dependent on previous rows

Cost Comparison

ItemPython + AirflowSQL Scheduled Query
InfrastructureEC2/GCE neededServerless (pay per query)
Development2-3 days2-4 hours
MaintenanceHigh (dependencies)Low (just SQL)
DebuggingLog analysisImmediate with EXPLAIN

2. Pivot — Transform Rows to Report Format

CASE WHEN Method (Universal)

sql
SELECT
  category,
  COALESCE(SUM(CASE WHEN MONTH(order_date) = 1 THEN revenue END), 0) AS jan,
  COALESCE(SUM(CASE WHEN MONTH(order_date) = 2 THEN revenue END), 0) AS feb,
  COALESCE(SUM(CASE WHEN MONTH(order_date) = 3 THEN revenue END), 0) AS mar
FROM orders
GROUP BY category;

Key: Use COALESCE for NULL defense. NULL + number = NULL can distort aggregations.

BigQuery PIVOT Operator

sql
SELECT * FROM (
  SELECT category, EXTRACT(MONTH FROM order_date) AS month, revenue
  FROM orders
  WHERE order_date >= '2024-01-01' AND order_date < '2024-04-01'
)
PIVOT (
  SUM(revenue)
  FOR month IN (1 AS jan, 2 AS feb, 3 AS mar)
)
ORDER BY category;

Native PIVOT provides 10-30% performance improvement over CASE WHEN.

3. JSON Parsing — Handling Web Event Logs

Basic Parsing

sql
SELECT
  user_id,
  JSON_EXTRACT_SCALAR(event_params, '$.page_url') AS page_url,
  JSON_EXTRACT_SCALAR(event_params, '$.campaign') AS campaign,
  JSON_EXTRACT_SCALAR(event_params, '$.device') AS device
FROM events
WHERE DATE(event_timestamp) = '2024-01-15';

Performance: Parse Once with CTE

sql
WITH parsed AS (
  SELECT *, JSON_EXTRACT_SCALAR(event_params, '$.page_url') AS page_url
  FROM events
)
SELECT * FROM parsed WHERE page_url LIKE '%/products%';

Nested JSON Arrays

sql
WITH order_items AS (
  SELECT order_id, item_json
  FROM orders, UNNEST(JSON_EXTRACT_ARRAY(items, '$')) AS item_json
)
SELECT
  order_id,
  JSON_EXTRACT_SCALAR(item_json, '$.product_id') AS product_id,
  CAST(JSON_EXTRACT_SCALAR(item_json, '$.qty') AS INT64) AS quantity
FROM order_items;

4. UTM Extraction — Automate Channel Analysis

Regex Extraction

sql
WITH url_normalized AS (
  SELECT session_id,
    LOWER(REGEXP_REPLACE(
      REGEXP_REPLACE(landing_page, r'#.*$', ''),
      r'%20', ' '
    )) AS clean_url
  FROM sessions
)
SELECT
  session_id,
  REGEXP_EXTRACT(clean_url, r'utm_source=([^&]+)') AS utm_source,
  REGEXP_EXTRACT(clean_url, r'utm_medium=([^&]+)') AS utm_medium,
  REGEXP_EXTRACT(clean_url, r'utm_campaign=([^&]+)') AS utm_campaign
FROM url_normalized;

Channel Grouping

sql
CASE
  WHEN utm_medium IN ('cpc', 'ppc', 'paidsearch') THEN 'Paid Search'
  WHEN utm_medium IN ('email', 'newsletter') THEN 'Email'
  WHEN utm_source IN ('facebook', 'instagram') AND utm_medium = 'cpc' THEN 'Paid Social'
  WHEN utm_source IS NULL THEN 'Direct'
  ELSE 'Other'
END AS channel_group

5. RFM Segmentation — Auto Customer Classification

What is RFM

  • R (Recency): When was the last purchase?
  • F (Frequency): How often?
  • M (Monetary): How much?

RFM in SQL

sql
WITH customer_metrics AS (
  SELECT
    customer_id,
    DATE_DIFF(CURRENT_DATE(), MAX(order_date), DAY) AS recency_days,
    COUNT(DISTINCT order_id) AS frequency,
    SUM(order_amount) AS monetary
  FROM orders
  WHERE order_status = 'completed'
    AND order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
  GROUP BY customer_id
),
rfm_scores AS (
  SELECT
    customer_id,
    recency_days, frequency, monetary,
    NTILE(5) OVER (ORDER BY recency_days DESC) AS r_score,
    NTILE(5) OVER (ORDER BY frequency ASC) AS f_score,
    NTILE(5) OVER (ORDER BY monetary ASC) AS m_score
  FROM customer_metrics
)
SELECT
  customer_id,
  CONCAT(r_score, f_score, m_score) AS rfm_score,
  CASE
    WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions'
    WHEN f_score >= 4 AND m_score >= 3 THEN 'Loyal Customers'
    WHEN r_score >= 4 AND f_score <= 2 THEN 'New Customers'
    WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk'
    WHEN r_score = 1 AND f_score <= 2 THEN 'Lost'
    ELSE 'Others'
  END AS segment
FROM rfm_scores;

6. Production ETL — Build Data Marts with Scheduled Queries

ELT Paradigm

ETL became ELT (Extract-Load-Transform). In BigQuery, Snowflake, load raw data first, then transform with SQL.

Layered Architecture

  1. Raw: Store as-is
  2. Clean: NULL handling, JSON parsing, UTM extraction
  3. Aggregate: Daily/weekly aggregations
  4. Mart: Final tables per business domain

7. Common Mistakes

Missing NULL Handling

sql
SELECT
  SUM(COALESCE(revenue, 0)) AS total,
  COUNTIF(revenue IS NULL) AS null_count
FROM orders;

Ignoring Timezones

sql
SELECT DATE(event_timestamp, 'Asia/Seoul') AS date FROM events;

Conclusion

Before saying "we need Python", check if SQL can do it first.

A well-written SQL query beats 100 lines of Python:

  • Easier to maintain
  • Better performance
  • Lower cost