(Big)Query complexity: Raw GA4 export vs GA4Dataform

Written by
Krisztian Korpa

You finally got your GA4 data in BigQuery. You write your first query. And then you meet UNNEST.

If you’ve worked with the GA4 BigQuery export for more than 5 minutes, you know the feeling. The raw export is incredibly powerful, but “powerful” and “easy to work with” are vastly different. Between the nested arrays, the multiple source fields, and the quirks of how GA4 exports data, there’s a lot of plumbing you need to handle before you can answer even basic reporting questions.

In this article, I’ll walk through several patterns where the raw export gets tricky, and show how GA4Dataform (both Core and Premium) handles each one. The goal is not to say “raw is bad” (it’s not), but to make it clear what you’re signing up for depending on where you query from.

Disclaimer: I will talk about several Premium (paid) features, which you may deem a bit sales-y, and you would be right. But I hope you can find value in the examples and still take away some knowledge even if we don’t give you the exact code. Thanks for your understanding!

The three options

Before we get into the examples, here’s what each layer gives you and where it sits in the stack.

Raw GA4 export: event-level rows with nested event_params, items, and user_properties arrays. Sharded daily tables (events_YYYYMMDD) with optional fresh_ and intraday_ variants. Traffic sources live in multiple places (collected_traffic_source, session_traffic_source_last_click, traffic_source) and they all mean different things. No partitioning, no clustering, no incrementality logic. You get everything, and you’re responsible for making sense of it.

GA4Dataform Core (free): two main output tables, ga4_events and ga4_sessions. Events are flattened (parameters extracted to proper columns). Sessions are pre-built with attribution, device, geo, and landing page data. The output tables are date-partitioned and clustered, so BigQuery can prune efficiently. Incrementality is handled automatically with a date checkpoint and an is_final flag. To combine event-level data with session context, you LEFT JOIN the two tables on session_id.

GA4Dataform Premium (paid): everything in Core, plus session_totals and session_params on the ga4_sessions table. This means event counts, revenue sums, and custom parameter aggregations are already on the session row. Many reporting queries that previously required a join to ga4_events can now run on ga4_sessions alone. It also adds configurable channel groupings, so you can define your own attribution rules with a simpler config. For an extensive list of features, check this page.

UNNEST and event parameters

This is the one everyone runs into first. GA4 stores event parameters as a repeated RECORD (basically an array of key-value pairs), where each value can be a string_value, int_value, float_value, or double_value. To get a single parameter out, you write a scalar subquery:

SELECT
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
FROM `project.dataset.events_*`

Two parameters, two subqueries. A typical production query might extract 10-15 parameters, and each one is the same boilerplate.

It gets worse with numeric values. The value event parameter stores its value in string_value on some hits and int_value on others (depending on how your tracking is set up). If you want a reliable numeric extraction, you end up with something like:

(SELECT COALESCE(
    value.int_value,
    value.double_value,
    SAFE_CAST(value.string_value AS FLOAT64)
  ) FROM UNNEST(event_params) WHERE key = 'value') AS event_value

GA4Dataform handles this in the transformation step. Every standard GA4 parameter is extracted to a flat column under event_params.*, with type-safe handling built in. For custom parameters, you add them to a config array:

CUSTOM_EVENT_PARAMS_ARRAY: [
  { name: "lead_score", type: "int" },
  { name: "signup_method", type: "string" },
  { name: "cart_value", type: "decimal" }
]

The type flag controls the extraction logic: "int" reads int_value, "string" coalesces all value types to string, and "decimal" tries int_value, double_value, and float_value with fallbacks. These end up in event_params_custom.* as flat columns.

The same pattern applies to user_properties and items (item-scoped custom dimensions). In the raw export, each requires its own UNNEST. In GA4Dataform, it’s a config array.
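For illustration, pulling a user property out of the raw export needs the same scalar-subquery pattern, just against a different array (crm_id here is a hypothetical custom user property, not a standard GA4 field):

-- Raw: user_properties requires its own UNNEST
SELECT
  user_pseudo_id,
  (SELECT value.string_value FROM UNNEST(user_properties)
    WHERE key = 'crm_id') AS crm_id
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260301' AND '20260307'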

With that transformation in place, this landing page query on the raw export:

-- Raw: UNNEST + ARRAY_AGG + regex
WITH landing_pages AS (
  SELECT
    CONCAT(
      user_pseudo_id,
      CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
    ) AS session_id,
    ARRAY_AGG(
      (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location')
      IGNORE NULLS ORDER BY event_timestamp LIMIT 1
    )[SAFE_OFFSET(0)] AS landing_page_url
  FROM `project.dataset.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260301' AND '20260307'
    AND event_name = 'page_view'
  GROUP BY session_id
)
SELECT
  NET.HOST(landing_page_url) AS hostname,
  REGEXP_EXTRACT(landing_page_url, r'^https?://[^/?]+(\/[^?]*)') AS page_path,
  COUNT(*) AS sessions
FROM landing_pages
GROUP BY 1, 2
ORDER BY sessions DESC

…becomes this:

-- GA4Dataform: pre-parsed landing page
SELECT
  landing_page.page_hostname AS hostname,
  landing_page.page_path,
  COUNT(*) AS sessions
FROM `project.dataset.ga4_sessions`
WHERE session_date BETWEEN '2026-03-01' AND '2026-03-07'
GROUP BY 1, 2
ORDER BY sessions DESC

It’s not just shorter: the hostname and path are already parsed, so no regex is needed.

Channel groupings

At some point you’ll want to know which channels are driving traffic. The raw export gives you session_traffic_source_last_click.cross_channel_campaign.default_channel_group, which is Google’s built-in grouping that uses last non-direct attribution (which is the closest you can get to the GA4 UI). It works fine. But what if you need to customize it?

Maybe you want to split “Paid Search” into Google vs Bing. Or you want your branded campaigns in their own channel instead of lumped into “Paid Search”. Or (and this one is increasingly relevant) you want to separate AI categories from regular organic/referral traffic.

With the raw export, you’re writing a CASE statement. And not a small one:

-- Raw: manual channel grouping (abbreviated)
SELECT
  session_id,
  CASE
    WHEN source IS NULL AND medium IS NULL THEN 'Direct'
    WHEN source = 'google' AND REGEXP_CONTAINS(medium, r'^(cpc|ppc|paid)$')
      AND shopping_campaign THEN 'Paid Shopping' -- a flag you'd derive upstream
    WHEN source = 'google' AND REGEXP_CONTAINS(medium, r'^(cpc|ppc|paid)$')
      THEN 'Paid Search'
    WHEN source = 'bing' AND REGEXP_CONTAINS(medium, r'^(cpc|ppc|paid)$')
      THEN 'Paid Search'
    WHEN REGEXP_CONTAINS(source, r'facebook|instagram|twitter|linkedin')
      AND REGEXP_CONTAINS(medium, r'^(cpc|ppc|paid|retargeting)')
      THEN 'Paid Social'
    WHEN REGEXP_CONTAINS(source, r'facebook|instagram|twitter|linkedin')
      THEN 'Organic Social'
    WHEN medium = 'organic' THEN 'Organic Search'
    WHEN medium = 'referral' THEN 'Referral'
    WHEN medium = 'email' THEN 'Email'
    -- ... another 10+ conditions for video, display, affiliates, SMS, push, etc.
    ELSE '(Other)'
  END AS channel_group
FROM ...

Unless you wrap this in a reusable SQL UDF or JavaScript function, you will have to maintain and update this CASE statement everywhere you create groupings.

GA4Dataform Core ships with a maintained default channel grouping that closely mirrors Google’s logic but adds one useful extra: AI referral detection. The module includes a curated list of AI source patterns (ChatGPT, Claude, Gemini, Perplexity, Copilot, and more). When EXTRA_CHANNEL_GROUPS is enabled, traffic from these sources gets its own “Organic AI” channel instead of being lumped into generic “Referral”.
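For a sense of what that saves you, a hand-rolled version of AI referral detection in raw SQL might look like this sketch (the source patterns are illustrative, not GA4Dataform’s actual curated list):

-- Illustrative: manual AI referral detection
CASE
  WHEN medium = 'referral'
    AND REGEXP_CONTAINS(source, r'chatgpt|openai|claude|gemini|perplexity|copilot')
    THEN 'Organic AI'
  WHEN medium = 'referral' THEN 'Referral'
  ELSE '(Other)'
END AS channel_group

And that pattern list needs updating every time a new AI tool starts sending meaningful referral traffic.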

The grouping is available on both session_source.default_channel_grouping and last_non_direct_traffic_source.default_channel_grouping in the ga4_sessions table.

Premium takes this further with CUSTOM_CHANNEL_GROUPING. Instead of writing plain SQL, you define your rules in a config array:

CUSTOM_CHANNEL_GROUPING: [{
  groupingName: 'business_channels',
  useDefaultChannelGroupingAsFallback: true,
  channelDefinitions: [
    {
      name: 'Google Paid Search',
      criteria: `source = 'google'
        AND category = 'SOURCE_CATEGORY_SEARCH'
        AND regexp_contains(medium, r"^(cpc|ppc|paid)$")`
    },
    {
      name: 'Bing Paid Search',
      criteria: `source = 'bing'
        AND category = 'SOURCE_CATEGORY_SEARCH'
        AND regexp_contains(medium, r"^(cpc|ppc|paid)$")`
    },
    {
      name: 'Branded Paid',
      criteria: `regexp_contains(campaign, r"branded")
        AND regexp_contains(medium, r"^(cpc|ppc|paid)$")`
    },
    {
      name: 'AI Referral',
      criteria: `category = 'SOURCE_CATEGORY_AI'
        AND medium = 'referral'`
    }
  ]
}]

First match wins. Anything that doesn’t match falls back to the default channel grouping (because useDefaultChannelGroupingAsFallback is true). You can even reference custom columns from your data model in the criteria, like device category or geo region.

The result shows up as a column on ga4_sessions alongside the default grouping. You only need to maintain the categorization logic.

You can also reuse the same JavaScript function in a reporting (aggregated) table, which means your channel groupings are standardized across your whole data pipeline.

Cost optimization

This one matters not just for scaling data volume, but for query readability as well. The raw GA4 export uses date-sharded tables (events_20260301, events_20260302, …). When you query with _TABLE_SUFFIX BETWEEN '20260301' AND '20260307', BigQuery resolves which tables to scan based on string matching. It works, but it’s the least efficient storage pattern BigQuery offers.

GA4Dataform’s output tables use native date partitioning (event_date / session_date) and clustering.

Partitioning means WHERE session_date BETWEEN '2026-03-01' AND '2026-03-07' triggers proper partition pruning (not string-based suffix matching). Clustering means filtering on event_name or session_id reads even less data. For high-traffic properties, the difference in bytes scanned can be significant.
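If you were building this layout yourself, the equivalent BigQuery DDL would look something like this (an illustrative sketch with a placeholder table name, not the package’s actual table definition):

-- Sketch: date-partitioned, clustered storage layout
CREATE TABLE `project.dataset.my_ga4_events`
(
  event_date DATE,
  event_name STRING,
  session_id STRING,
  event_timestamp INT64
)
PARTITION BY event_date
CLUSTER BY event_name, session_id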

A lot of session-level reporting questions require combining data from ga4_sessions (for attribution, device, geo) with aggregated data from ga4_events (for event counts, revenue). With Core, that’s a LEFT JOIN on session_id:

-- Core: join to count events per session
WITH events AS (
  SELECT
    session_id,
    COUNT(*) AS event_count,
    SUM(IF(event_name = 'purchase', ecommerce.purchase_revenue, 0)) AS purchase_revenue
  FROM `project.dataset.ga4_events`
  WHERE event_date BETWEEN '2026-03-01' AND '2026-03-07'
  GROUP BY session_id
)

SELECT
  s.session_source.default_channel_grouping AS channel,
  AVG(e.event_count) AS avg_events_per_session,
  SUM(e.purchase_revenue) AS total_revenue
FROM `project.dataset.ga4_sessions` AS s
LEFT JOIN events AS e ON s.session_id = e.session_id
WHERE s.session_date BETWEEN '2026-03-01' AND '2026-03-07'
GROUP BY 1

Even though both tables are clustered by session_id (which makes the join more efficient), every time this query runs it scans both tables. That’s not a huge concern with a proper reporting layer, but it is unnecessary when we can pre-aggregate the counts and sums at the session level.

Premium’s CUSTOM_SESSION_TOTALS pre-computes these aggregations onto the session row:

CUSTOM_SESSION_TOTALS: {
  eventsToCount: ['page_view', 'purchase', 'add_to_cart'],
  sumFields: [
    {
      name: 'ecommerce.purchase_revenue',
      eventFilter: 'purchase',
      renameTo: 'purchase_revenue'
    }
  ],
  uniqueFields: [
    {
      name: 'ecommerce.transaction_id',
      eventFilter: 'purchase',
      renameTo: 'unique_transactions'
    }
  ]
}

Now the same query reads one table:

-- Premium: no join needed
SELECT
  session_source.default_channel_grouping AS channel,
  AVG(totals.count.events) AS avg_events_per_session,
  SUM(totals.sum.purchase_revenue) AS total_revenue
FROM `project.dataset.ga4_sessions`
WHERE session_date BETWEEN '2026-03-01' AND '2026-03-07'
GROUP BY 1

Event counts, revenue sums, distinct transaction counts, and boolean flags (like totals.has_event.purchase) are all on the session row. For dashboard-heavy setups where the same base data gets sliced ten different ways, scanning one smaller table instead of joining two on every query is where the real savings are.

Custom session parameters

Here’s where the complexity gap is widest. Say you fire a membership_tier event parameter on login, and you want it as a session-level dimension so you can break down any report by membership tier.

With the raw export, you need to build sessions yourself and aggregate the parameter:

-- Raw: manual sessionization + UNNEST
SELECT
  CONCAT(
    user_pseudo_id,
    CAST((SELECT value.int_value FROM UNNEST(event_params)
      WHERE key = 'ga_session_id') AS STRING)
  ) AS session_id,
  ARRAY_AGG(
    (SELECT value.string_value FROM UNNEST(event_params)
      WHERE key = 'membership_tier')
    IGNORE NULLS ORDER BY event_timestamp LIMIT 1
  )[SAFE_OFFSET(0)] AS membership_tier
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260301' AND '20260307'
GROUP BY session_id

You’re constructing session IDs by hand, UNNESTing event parameters, and using ARRAY_AGG with ordering and null handling to pick the first non-null value. And this is for one parameter. Need three session-level dimensions? Triple the boilerplate.

With Core, sessions exist and event parameters are flat, so the query is cleaner:

-- Core: sessions exist, but custom params aren't on them
SELECT
  e.session_id,
  ARRAY_AGG(
    e.event_params_custom.membership_tier
    IGNORE NULLS ORDER BY e.event_timestamp LIMIT 1
  )[SAFE_OFFSET(0)] AS membership_tier
FROM `project.dataset.ga4_events` AS e
WHERE e.event_date BETWEEN '2026-03-01' AND '2026-03-07'
GROUP BY e.session_id

Better. No UNNEST, no manual session ID construction. But you still need to scan the events table and write the aggregation logic. And if you want to join this back to ga4_sessions for attribution context, that’s another join.

Premium handles it with one line of config:

CUSTOM_SESSION_PARAMS_ARRAY: [
  {
    name: "event_params_custom.membership_tier",
    pick: "first",
    renameTo: "membership_tier",
    description: "First membership tier captured in session"
  }
]

The value shows up as session_params_custom.membership_tier on ga4_sessions. No SQL, no aggregation logic, no extra joins.

The pick option controls how the value is aggregated: "first" and "last" take the chronologically first or last non-null value, "max" and "min" do what you’d expect, and "boolean" tells you whether the parameter was present at all. You can also filter by event name:

{
  name: "event_params_custom.lead_score",
  pick: "max",
  eventFilter: "generate_lead",
  renameTo: "max_lead_score"
}

This picks the highest lead_score value, but only from generate_lead events. The same thing in raw SQL would be a MAX(CASE WHEN event_name = 'generate_lead' THEN ... END) wrapped around a subquery with UNNEST. In Core, it’s the same pattern minus the UNNEST. In Premium, it’s a config object.
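For comparison, the raw-export version of that eventFilter example would look something like this sketch:

-- Raw: max lead_score per session, only from generate_lead events
SELECT
  CONCAT(
    user_pseudo_id,
    CAST((SELECT value.int_value FROM UNNEST(event_params)
      WHERE key = 'ga_session_id') AS STRING)
  ) AS session_id,
  MAX(CASE
    WHEN event_name = 'generate_lead' THEN
      (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'lead_score')
  END) AS max_lead_score
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260301' AND '20260307'
GROUP BY session_id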

Closing thoughts

The GA4 BigQuery export is one of the best things Google has done for the analytics community. Free, raw, event-level data in a warehouse that can handle it. But between the nested arrays, the type inconsistencies, the sharded tables, and the reality of building session-level dimensions from event-level data, there’s a lot of accidental complexity sitting between you and your reporting.

GA4Dataform sits on top of it and handles the plumbing so you can focus on the questions that actually matter to your business.

If you have any questions or suggestions about this topic, feel free to reach out!

If you are interested in taking your GA4 BigQuery data modelling to the next level, take a look at our Premium offering!

Krisztian Korpa

Published at March 23, 2026
