How to label job execution in Dataform?

Written by
Simon Breton
Reading progress
Photo by Siora Photography on Unsplash

When you start modeling GA4 data in BigQuery with Dataform, one of the main challenges you’ll quickly encounter is monitoring and controlling query costs. Setting up a cost dashboard becomes essential to track spend and identify queries to optimize. Ideally, this dashboard should allow you to drill down by repository, workspace, workflow, and even individual queries—showing exactly when each query was executed and how much it cost. That’s where Google Cloud Platform’s labeling functionality becomes especially useful. Labels are key-value pairs that you can attach to GCP resources, including BigQuery tables and jobs, to help you organize, filter, and manage them more effectively.

In this article, we’ll focus specifically on labeling jobs. While labeling datasets, tables, or views is built into Dataform, well-documented, and easy to use, job labeling is less straightforward. Although it’s also well documented for native GCP features like BigQuery, it’s more complex to implement for not-yet-fully-integrated tools like Dataform. In this article, we’ll explore how to label executions in order to better measure the cost of queries run by Dataform.

if you’re not familiar with INFORMATION_SCHEMA.JOBS and how Dataform executions are reflected in this table I would encourage you to read the small extra section added at the end of this article. Otherwise keep reading.

Built-in Dataform Labeling

Although it’s not documented, Dataform passes labels to jobs. You can find these labels by querying the INFORMATION_SCHEMA.JOBS table for any Dataform job:

Labels column with key-value pairs
Labels column with key-value pairs

These labels return standard information such as project numbers, execution id, repository id, location etc. They can be useful in some circumstances but come with several important limitations:

  • You can’t customize them or pass your own values.
  • Because it’s undocumented, it’s unclear exactly how it works or under what conditions we can rely on it.

Let’s explore other job labeling options provided by BigQuery to see if any are compatible with Dataform.

Jobs Labeling in BigQuery

It’s possible to label jobs with SQL using SET @@query_label :

  SET @@query_label = "ga4dataform:testing_label";

Let’s take a look at how it works with multi-statement SQL script using the following query:

SET @@query_label = "ga4dataform:testing_label";
CREATE OR REPLACE TABLE `mydataset.mytable` (
  id INT64,
  name STRING,
  score INT64,
  grade STRING
);

SELECT * FROM `mydataset.mytable`;

-- Step 2: Insert fake data using a CTE with CASE logic
INSERT INTO `mydataset.mytable`
WITH fake_data AS (
  SELECT * FROM UNNEST([
    STRUCT(1 AS id, 'Alice' AS name, 82 AS score),
    STRUCT(2 AS id, 'Bob' AS name, 67 AS score),
    STRUCT(3 AS id, 'Charlie' AS name, 90 AS score),
    STRUCT(4 AS id, 'Diana' AS name, 58 AS score)
  ])
),
graded_data AS (
  SELECT
    id,
    name,
    score,
    CASE
      WHEN score >= 85 THEN 'A'
      WHEN score >= 70 THEN 'B'
      WHEN score >= 60 THEN 'C'
      ELSE 'F'
    END AS grade
  FROM fake_data
)
SELECT * FROM graded_data;

SET @@query_label is added at the beginning of the query and will be applied to all operations running within it. By querying INFORMATION_SCHEMA.JOBS we can validate that indeed label is correctly passed to all the intermediate jobs:

Using SET @@query_label to pass label to query

SET @@query_label is BigQuery-session-scoped. It means that labels will be assigned to all future jobs of the session. BigQuery-Session can be created, terminated, attached to specific query etc. but only using bq command line and API. To the best of my knowledge, there is no way to manage BigQuery sessions in Dataform or via any SQL command. Let’s see if we can still make it work within Dataform.

Using SET @@query_label in Dataform

It is not immediately obvious how to use SET @@query_label in Dataform. How is BigQuery-Session handled? Let’s add it to a Dataform model to see what’s happening:

SET @@query_label in Dataform

SET @@query_label is passed in the pre_operations block. Let’s look at INFORMATION_SCHEMA.JOBS output:

SET @@query_label output in Dataform

What do we learn from the execution of this query in Dataform:

  • Labels are applied to all operations running after the label has been set.
  • BigQuery-Session is not defined (session_id column is empty)
  • Incidentally we can observe that Dataform built-in labeling is not passed to all the jobs.

Even though SET @@query_label is not an entirely satisfying solution at this point it definitely gets us closer to our goal. We now need to find a way to pass our labels across all jobs executed.

Extract Labels from Script statement_type

INFORMATION_SCHEMA.JOBS is reporting on Dataform execution at two levels: Individual query/job or entire SQL script. statement_type with value Script returns the entire SQL script executed with all the stored procedures and/or multi-step queries. We can then query string field and filter based on the entire @@query_label = “ga4dataform:testing_label”; statement:

SELECT 
    * 
FROM 
    `region-eu.INFORMATION_SCHEMA.JOBS` 
WHERE 
    REGEXP_CONTAINS(query, r' @@query_label = "ga4dataform:testing_label")

Model BigQuery-Sessions

It is also possible to model BigQuery-Sessions using parent_job_id and pass labels to all intermediate queries and operations using window function:

SELECT
  creation_time,
  parent_job_id,
  job_id,
  statement_type,
  session_info,
  MAX((SELECT value FROM UNNEST(labels) WHERE key = 'ga4dataform')) OVER (PARTITION BY parent_job_id) AS set_query_labels,
  total_bytes_processed
FROM
  `region-eu.INFORMATION_SCHEMA.JOBS`
WHERE
  DATE(creation_time) >= '2025-05-01'
  AND parent_job_id = 'dataform-gcp-668e761e1fc9ecf3e19aa921574e8693b60c9dba0af30e72971966e277f0bf84-0'
order by creation_time asc
Pass label to all jobs of parent job id

Summary

If you are orchestrating Dataform execution yourself you should use bq command line or BigQuery API to properly label job execution. If you are working with Dataform only and want to fully and efficiently capture processing cost then using SET @@query_label is mandatory:

  • Add SET @@query_label in each model in pre_operations block
  • Model BigQuery-Sessions or use Script statement_type to return full total processed cost from INFORMATION_SCHEMA.JOBS table

I do think that Dataform should offer a way to label workflow execution the same way we can label datasets, tables and views. Hopefully this article will become obsolete soon!

What’s next for GA4Dataform

  • Labeling tables and jobs is being added to Dataform Core version 17 as this blog post is being written.
  • A BigQuery cost monitoring dashboard will be included as one of the first features of GA4Dataform Premium, leveraging some of the elements exposed in this article.


Information schema jobs and Dataform

Here are a few tips to navigate between INFORMATION_SCHEMA.JOBS and Dataform jobs:

When statement_type != ‘script’ :

  • Dataform Job Id egual parent_job_id
  • parent_job_id will be associated with many job_id to get the details of each execution of the SQL Script
  • query field return the detail of the part of the SQL Script executed for each individual job
  • Sum of total_bytes_processed grouped by parent_job_id is equal to total_bytes_processed where statement_type = ‘script’

When statement_type = ‘script’ :

  • parent_job_id is null
  • Dataform Job Id egual job_id
  • Only one row per Dataform execution
  • query field return the entire SQL Script with all stored procedures and/or multi-step queries
  • total_bytes_processed is egual to the sum of all individual jobs where statement_type != ‘script’
Simon Breton

Published at May 14, 2025

Continue Reading