# dbt Conventions

## Model conventions

Models follow standard prefixes reflecting their layer in the data flow.
### Staging (`stg_`)

Modular building blocks from source data.

- File naming: `stg_{source}__{entity}.sql`
- `contract: enforced: true` — required on all staging models
- Uniqueness test — required: either a single-column `unique` test or `dbt_utils.unique_combination_of_columns` for composite keys
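As a sketch, the required contract and single-column uniqueness test on a hypothetical staging model (the model, source, and column names here are illustrative, not from this repo) might look like:

```yaml
# stg_powerschool__students.yml — hypothetical example
models:
  - name: stg_powerschool__students
    config:
      contract:
        enforced: true # required on all staging models
    columns:
      - name: student_id
        data_type: int64
        data_tests:
          - unique # single-column uniqueness test
          - not_null
```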
### Intermediate (`int_`)

Layers of logic with clear and specific purposes, preparing staging models to join into the entities we want.

- Folder structure: subdirectories by area of business concern
- File naming: `int_{business_concern}__{entity}_{verb}.sql`
- Business concerns: `assessments`, `surveys`, `people`
- Verbs: `pivot`, `unpivot`, `rollup`
- Uniqueness test — required: either a single-column `unique` test or `dbt_utils.unique_combination_of_columns` for composite keys
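For a hypothetical pivoted model keyed by more than one column (the model and column names below are illustrative), the composite-key variant of the required uniqueness test would be:

```yaml
# int_assessments__responses_pivot.yml — hypothetical example
models:
  - name: int_assessments__responses_pivot
    data_tests:
      - dbt_utils.unique_combination_of_columns:
          arguments:
            combination_of_columns:
              - student_id
              - term
```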
### Marts / Extracts (`rpt_`)

Wide, rich views of the entities our organization cares about, or extracts consumed by reporting tools and applications.

- `contract: enforced: true` — required on all marts and extract models. These are the last stop before data reaches an external reporting tool (Tableau, PowerSchool, Google Sheets, etc.). Schema changes break downstream exposures and must be made deliberately.
- Uniqueness test — required: either a single-column `unique` test or `dbt_utils.unique_combination_of_columns` for composite keys
## SQL conventions

- Use `union_dataset_join_clause()` for any query that joins two CTEs built from unioned regional datasets. Pass the CTE aliases to prevent cross-region row matching:

  ```sql
  {{ union_dataset_join_clause(left_alias="left_cte", right_alias="right_cte") }}
  -- expands to:
  regexp_extract(left_cte._dbt_source_relation, r'(kipp\w+)_')
      = regexp_extract(right_cte._dbt_source_relation, r'(kipp\w+)_')
  ```
  To extract a human-readable region label from `_dbt_source_relation`:

  ```sql
  initcap(regexp_extract(s._dbt_source_relation, r'kipp(\w+)_')) as region
  ```
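  A minimal sketch of the macro in context, joining two unioned CTEs on a business key plus the region guard (the model, CTE, and column names here are illustrative):

  ```sql
  -- hypothetical models; the macro call is the pattern being shown
  with
      enrollments as (select * from {{ ref("stg_demo__enrollments") }}),
      grades as (select * from {{ ref("stg_demo__grades") }})

  select
      enrollments.student_id,
      grades.gpa,
  from enrollments
  inner join grades
      on enrollments.student_id = grades.student_id
      -- prevents cross-region matches between the unioned datasets
      and {{ union_dataset_join_clause(left_alias="enrollments", right_alias="grades") }}
  ```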
- No `GROUP BY` without aggregation — use `DISTINCT` instead. `DISTINCT` requires a comment explaining why it is necessary.
- No `ORDER BY` in `SELECT` statements — ordering belongs in the reporting layer, not in dbt models.
- No `GROUP BY ALL` — never use in production models. Always list grouping columns explicitly; `GROUP BY ALL` obscures intent and breaks silently when upstream columns are added or removed.
- No `SELECT *` in the final `SELECT` of `rpt_` or mart models — list columns explicitly. Wildcards obscure column sourcing, create ambiguity when both sides of a join share a column name, and leave the contract declaration as the only readable spec. Pass-through CTEs (`select * from ref(...)`) are fine.
- Filter conditions: `ON` vs `WHERE` — row-filter conditions on the preserved table belong in `WHERE`, not `ON`. For `INNER JOIN` the result is identical but intent is hidden; for `LEFT JOIN`, a filter in `ON` keeps non-matching left rows whereas `WHERE` eliminates them — a silent semantic change. Exception: for `FULL JOIN`, conditions in `ON` that reference only one side are intentional and cannot be moved to `WHERE`.
- Timezone-aware today — use `{{ var("local_timezone") }}` so `current_date` reflects Eastern time rather than UTC:

  ```sql
  current_date('{{ var("local_timezone") }}')
  ```
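  To make the `ON`-vs-`WHERE` rule above concrete, a sketch with an illustrative `left join` (table and column names are hypothetical):

  ```sql
  -- a filter on the preserved (left) table placed in ON does not remove rows:
  -- inactive students still appear, just with null assessment columns
  select s.student_id, a.score
  from students as s
  left join assessments as a
      on s.student_id = a.student_id
      and s.is_active

  -- the same filter in WHERE actually removes inactive students
  select s.student_id, a.score
  from students as s
  left join assessments as a
      on s.student_id = a.student_id
  where s.is_active
  ```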
- Removing diacritical marks — to normalize names with accented characters:

  ```sql
  regexp_replace(normalize(name_col, NFD), r'\pM', '')
  ```
- Time travel — query a table as it existed at a point in time:

  ```sql
  select *
  from my_table
  for system_time as of timestamp('2025-08-22 23:59:59')
  ```
- New external sources — before building, stage them first:

  ```shell
  uv run dbt run-operation stage_external_sources \
      --vars "{'ext_full_refresh': 'true'}" \
      --args 'select: [model_name]'
  ```
## ref() and source()

Use `ref()` to reference other dbt models; use `source()` for raw source tables declared in a `sources:` YAML file:

```sql
{{ ref("stg_amplify__benchmark_student_summary") }}
{{ source("amplify", "src_amplify__benchmark_student_summary") }}
```
## BigQuery scalar functions

Shared UDFs in the `functions` dataset:

| Function | Returns |
|---|---|
| `functions.current_academic_year()` | Current academic year integer |
| `functions.date_to_sy(date_col)` | Academic year of a given date |
| `functions.region_join(left_col, right_col)` | Boolean equivalent of `union_dataset_join_clause()` |

```sql
select
    functions.current_academic_year() as academic_year,
    functions.date_to_sy(att_date) as att_academic_year,
from my_table
where
    functions.region_join(co._dbt_source_relation, gpa._dbt_source_relation)
```
## Model properties file

Every model must have a corresponding `[model_name].yml` properties file. Use the `scripts/dbt-yaml.py` wrapper to generate and update model YAML — it handles column ordering and data type inference automatically:

```shell
uv run scripts/dbt-yaml.py --select stg_my_model kipptaf
# add --dev to target your personal dev dataset instead of prod
```

Or generate the raw scaffold manually and save the console output as the `.yml` file:

```shell
uv run dbt run-operation generate_model_yaml \
    --args '{"model_names": ["model_name"]}'
```
Fill in the scaffold:

```yaml
models:
  - name: model_name
    config:
      contract:
        enforced: false # keep false while building; remove this line before merging
    columns: # required for contracted models
      - name: column_name
        data_type: string | int64 | date | ... # required for contracted models
        data_tests: # optional column-level tests
          - not_null
          - accepted_values:
              values: [...]
    data_tests: # optional model-level tests
      - dbt_utils.unique_combination_of_columns: # use for composite keys
          arguments:
            combination_of_columns:
              - column_a
              - column_b
          config:
            store_failures: true
```
## Exposures

Every external tool that consumes our data must have a dbt exposure defined in the consuming project (typically `src/dbt/kipptaf/models/exposures/`). Exposures make the dependency graph explicit and power Dagster asset lineage.

All exposures require a `name`, `label`, `type`, `owner`, `depends_on` (listing every model the tool uses), and a `url` linking to the external tool/workbook/sheet:

```yaml
exposures:
  - name: exposure_name_snake_case
    label: Human Readable Title
    type: dashboard | notebook | analysis | ml | application
    owner:
      name: Data Team
    depends_on:
      - ref("rpt_tableau__some_model")
      - ref("another_model")
    url: https://... # optional
    config:
      meta:
        dagster:
          kinds:
            - tableau # or: googlesheets, powerschool, etc.
            - ... # additional kinds
```
Tableau dashboards that refresh on a schedule must additionally include the Tableau workbook LSID and a `cron_schedule` under `asset.metadata`. Workbooks without a scheduled refresh can omit the `asset` block:

```yaml
config:
  meta:
    dagster:
      kinds:
        - tableau
      asset:
        metadata:
          id: <tableau-workbook-lsid-uuid>
          cron_schedule: "0 7 * * *" # omit entirely if no scheduled refresh
```