Adding a Google Sheet source to dbt
1. Create your sheet
Data Integration / Google Sheets / In.
2. Add a named range
This should encompass the entirety of the tab.
3. Create (or update) a sources-drive.yml
file
Under the appropriate folder in the dbt project, make sure to replace the following variables:
Variable | Description |
---|---|
SOURCE_NAME | The name of the collection for your source tables |
SOURCE_TABLE_NAME | The name of the source table, as it will appear in the database |
SHEET_URL | The full URL of the Google Sheet, including https://... |
NAMED_RANGE | The name of the range you defined in step 2. |
version: 2
sources:
- name: SOURCE_NAME
schema: |
{% if env_var('DBT_CLOUD_ENVIRONMENT_TYPE', '') == 'dev' -%}z_dev_{%- endif -%}
{{ project_name }}_SOURCE_NAME
tables:
- name: src_SOURCE_NAME__SOURCE_TABLE_NAME
external:
options:
format: GOOGLE_SHEETS
uris:
- SHEET_URL
sheet_range: NAMED_RANGE
skip_leading_rows: 1
meta:
dagster:
asset_key:
- "{{ project_name }}"
- SOURCE_NAME
- SOURCE_TABLE_NAME
Tip
If updating an existing sources-drive.yml
, you will only need to add new
records under the table
attribute.
4. Create a staging model
Create a simple select *
statement--this will help catch unexpected changes to
the table schema--and add any calculated fields you require.
select *, spam + 1 as eggs, from {{ source("SOURCE_NAME", "SOURCE_TABLE_NAME") }}
5. Create a corresponding properties file for the staging model
File name: ../properties/STAGING_MODEL_NAME.yml
Variable | Description |
---|---|
models[].name |
The exact name of your staging model |
columns[].name |
Column names are case-sensitive |
data_type |
BigQuery Data Type Reference |
version: 2
models:
- name: STAGING_MODEL_NAME
config:
contract:
enforced: true
tags: google_sheet
columns:
- name: COLUMN_NAME_1
data_type: COLUMN_TYPE_1
- name: COLUMN_NAME_2
data_type: COLUMN_TYPE_2
...
- name: COLUMN_NAME_N
data_type: COLUMN_TYPE_N
6. Test your changes
Update the external source definition
dbt run-operation stage_external_sources --vars "ext_full_refresh: true" --args "select: SOURCE_NAME.SOURCE_TABLE_NAME"