Automate your dbt workflows with AI agents

Buster helps data teams eliminate repetitive data engineering work with AI agents. Agents can update dbt docs, maintain data quality, and more—all defined in simple YAML.

Auto-update docs

Data diffs

Sync schema changes

Flag breaking changes

Incremental model updates

JSON schema sync

Ensure test coverage

Null spike alerts

Convention enforcer

Model cleanup

Have an agent update dbt docs on every pull request. Keep documentation in sync with your models.

.github/agents/docs-updater.yaml

name: docs-updater
description: Auto-update documentation for changed models

triggers:
  - type: pull_request
    on_changed_files: "models/**/*.sql"

tools:
  preset: standard

prompt: |
  # Task: You are tasked with keeping dbt model documentation accurate and up-to-date when models change.

  ## Goal: Generate clear, accurate documentation that reflects the **current state of the data**—not change history.

  ## Approach
  1. **Profile the data:** Use retrieve_metadata to understand row counts, column types, null rates, and distributions.
  2. **Describe what you observe:** Document patterns, not absolutes. Use phrases like "At time of documentation" and "Current data shows".
  3. **Stay efficient:** For existing docs, only update what changed. Preserve good existing context.
  4. **Validate and commit:** Run `dbt parse` to ensure valid YAML, then commit changes to the PR.

  ## Output
  - Model descriptions with purpose, grain, and approximate row count
  - Column descriptions with business meaning and key patterns
  - All changes validated and committed to the PR branch

Trusted by data teams at top companies

Real results from modern data teams

Buster saves data teams hundreds of hours every month. It automates repetitive tasks, ensures data quality at scale, and keeps dbt projects up-to-date.

“Buster saves our data team hundreds of hours of work every month.”

Jonathon Northrup

Data Engineer, Angel Studios

"Buster helps us keep our dbt project clean, documented, and up-to-date.”

Jen Eutsler

Data Engineer, SchoolAI

“Buster frees me up from things I always had to do, so I can focus on longer term goals.”

Landen Bailey

Senior Data Engineer, Redo

"A lot of data teams think AI tools don’t actually work. Buster is legit, for real for real."

Alex Ahlstrom

Director of Data, Angel Studios

"Buster’s understanding of our dbt project has blown my mind. It deeply understands our data models."

Cale Anderson

Data Engineer, Remi

Getting started is simple and easy

MY-DBT-PROJECT
├── analyses
├── macros
├── models
├── agents
│   ├── dbt-docs-updater.yml
│   └── weekly-model-cleanup.yml
├── seeds
├── snapshots
├── tests
├── dbt_project.yml
└── README.md

Step 1

Add an agent to your repo

Agents are simple YAML configs with a prompt. Start with a template from our library of proven agents, or write your own.

Step 2

Test it locally

Test your agent locally or use our dry-run mode. Preview exactly which decisions and actions your agent makes before deploying.

Step 3

Deploy your agent

Your agent will run automatically on trigger events, as defined in your YAML file. View detailed logs to audit an agent’s work.

Spin up a custom agent for any data engineering task

Below are a few examples of agents we've seen data teams use:

Auto-update docs

Profiles changed models on every PR, updates documentation accordingly, and commits updates back to the branch.

Detect upstream changes

Runs nightly to catch schema changes in source tables, update staging models, and open a fix PR by morning.

Flag breaking changes

Detects model changes in PRs, finds impacted downstream dependencies, and comments with an impact report.

Model cleanup

Runs weekly to find unused models and over-materialized tables, then opens a PR with model optimizations.

Ensure test coverage

Profiles new models in PRs, generates missing tests, and commits them to prevent untested code.

Null spike alerts

Checks specified columns every hour and sends Slack alerts when null rates spike above baseline.

JSON schema sync

Scans specified JSON columns every 6 hours for new fields or changes, updates staging model extractions, and opens a PR with the adapted SQL.

Convention enforcer

Detects naming violations and policy breaches in PRs, refactors changes to match specified conventions, and commits back to the branch.

name: data-diff-checker
description: Compare PR vs production data output

triggers:
  - type: pull_request
    on_changed_files: "models/**/*.sql"

tools:
  preset: standard

prompt: |
  # Task: You are tasked with comparing data output between PR and production to catch logic errors before merge.

  ## Goal: Find **unexpected differences** that indicate bugs—not just any differences, but ones that don't align with the intended change.

  ## Approach
  1. **Understand the change first:** Read the SQL diff and PR description to know what *should* change.
  2. **Compare systematically:** Start with row counts and key metrics, then drill into anomalies.
  3. **Investigate and explain:** Don't just report numbers—connect findings back to SQL changes and explain the root cause.
  4. **Classify by severity:** Row explosions and data loss are critical. Expected changes from refactors are low priority.

  ## Output
  - Critical issues first (row explosions, broken joins, data loss)
  - Show before/after with sample records
  - Root cause analysis tied to SQL diff
  - Specific fix recommendations

name: auto-schema-sync
description: Update staging models when schemas change

triggers:
  - type: event
    event_name: schema_change

tools:
  preset: standard

prompt: |
  # Task: You maintain staging models so they always mirror upstream schemas, applying the team’s naming and typing conventions **exactly**.

  ## Goal: Update affected staging models immediately to prevent pipeline failures—catch schema drift before dbt runs break.

  ## Approach
  1. **Understand the change:** Review the schema change event details (table, columns added/changed/removed).
  2. **Update staging models:** Add new columns with proper naming conventions, update type casts, handle removed columns defensively.
  3. **Validate changes:** Run dbt parse/compile to ensure models are valid.
  4. **Open PR:** Commit changes with clear description of what changed and why.

  ## Output
  - Updated staging model SQL with new/changed columns
  - Updated YAML documentation
  - Validated dbt project
  - PR ready for review

name: incremental-doctor
description: Diagnose incremental model failures

triggers:
  - type: event
    event_name: dbt_run_failed
    source: dbt_cloud
    filters:
      materialization: incremental

tools:
  preset: safe

prompt: |
  # Task: You are tasked with diagnosing why an incremental model failed and recommending a fix.

  ## Goal: Identify the root cause quickly and provide actionable solutions—both immediate and permanent.

  ## Approach
  1. **Understand the symptoms:** Read error message, check recent changes, review model config.
  2. **Pattern recognition:** Common issues: duplicates, schema changes, late data, wrong unique_key.
  3. **Validate hypothesis:** Run diagnostic queries to confirm the root cause.
  4. **Recommend fixes:** Immediate: full-refresh to unblock. Permanent: code change to prevent recurrence.

  ## Output
  - Clear diagnosis with root cause
  - Sample data showing the issue
  - Immediate fix to unblock
  - Permanent solution to prevent recurrence
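
For an agent like the null spike alerts example above, a config could look something like the following minimal sketch. It mirrors the schema used by the configs on this page; the hourly cron, the safe tools preset, and the alerting details are illustrative assumptions rather than documented defaults.

name: null-spike-alerts
description: Alert when null rates spike above baseline in key columns

triggers:
  - type: scheduled
    cron: "0 * * * *"  # Every hour (assumed cadence)

tools:
  preset: safe  # Assumed: read-only checks, no commits needed

prompt: |
  # Task: You are tasked with monitoring null rates in specified columns and alerting the team when they spike.

  ## Goal: Alert only when null rates rise meaningfully above their recent baseline, not on normal fluctuation.

  ## Approach
  1. **Profile the columns:** Use retrieve_metadata to get current null rates for the monitored columns.
  2. **Compare to baseline:** Flag columns whose null rate sits well above recent typical levels.
  3. **Investigate briefly:** Check recent model or source changes that could explain the spike.
  4. **Notify:** Send a Slack alert summarizing the affected columns, the size of the spike, and the likely cause.

  ## Output
  - Slack alert listing affected models and columns with baseline vs. current null rates
  - Likely root cause and a suggested next step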

Trigger agents on any PR, event, or schedule

Easily define when agents should trigger and run. Trigger agents on pull requests, on events like schema changes, or on a scheduled cadence.

Pull request triggers

Trigger agents when PRs are opened, updated, or labeled. Filter by changed file paths to target specific model layers.

Event-based triggers

Trigger agents when data stack events occur: schema changes in source tables, test failures from dbt Cloud, or custom webhooks from your systems.

Scheduled jobs

Schedule daily audits, weekly reports, or custom cron tasks. Batch process accumulated changes efficiently.

triggers:
  - type: pull_request
    events: [opened, synchronize]
    paths:
      include: ["models/**/*.sql"]
      exclude: ["models/staging/temp_*"]
    conditions:
      - type: pr_labels
        none_of: ["wip", "skip-review"

triggers:
  - type: event
    event_name: schema_change_detected
    source: fivetran
    filters:
      schema: "raw.salesforce"
      change_type: ["column_modified","column_added", "column_removed"

triggers:
  - type: scheduled
    cron: "0 9 * * 1"  # Every Monday at 9am
    timezone: "America/New_York"
    only_on_weekdays: true

Complete visibility into every run

Never wonder what an agent did or why. Every run is fully logged with complete transparency—files accessed, queries executed, reasoning process, and actions taken.

| ID | Agent | Status | Trigger | Last run | Duration |
| --- | --- | --- | --- | --- | --- |
| run_asdfghjklzxcvbnm123456 | dbt-docs-updater | Completed | pr_checks.yml | Oct 21, 2025, 4:00 PM | 3m 18s |
| run_fghyujklqwer1234abcdxyz | dbt-test-generator | Completed | post_merge.yml | Oct 20, 2025, 1:00 PM | 4m 8s |
| run_fghyujklqwer1234abcdxyz | dbt-test-generator | Completed | post_merge.yml | Oct 20, 2025, 12:08 PM | 1m 57s |
| run_ynhwertghjkf67asdlkfjhqw | dbt-docs-updater | Completed | scheduled_checks.yml | Oct 18, 2025, 11:00 AM | 3m 43s |
| run_bjwnxfqhlpdt2focvwefklkqz | dbt-breaking-change-reviewer | Completed | pr_checks.yml | Oct 18, 2025, 12:02 PM | 6m 19s |
| run_cmgvazqbgrh443aoiuoqxjkjh | upstream-change-impact-reviewer | Completed | upstream_pr_checks.yml | Oct 17, 2025, 2:30 PM | 2m 46s |
| run_stxzpqazjfgh8nmdqzlkdweqex | feature-branch-reviewer | Completed | scheduled_analysis.yml | Oct 18, 2025, 11:15 AM | 1m 32s |
| run_mnaxvqzjkbhs8fmobgxlqhjzrt | upstream-change-impact-reviewer | Completed | upstream_pr_checks.yml | Oct 19, 2025, 4:00 PM | 3m 10s |
| run_qzvhnpcfthau5girohfjzjuygf | upstream-change-impact-reviewer | Completed | upstream_pr_checks.yml | Oct 20, 2025, 9:45 AM | 2m 20s |
| run_xcghbqplpwae3uoknwlzhtxjkl | upstream-change-impact-reviewer | Failed | upstream_pr_checks.yml | Oct 21, 2025, 1:00 PM | 5m 15s |
| run_jzdeqacokljk4ioyuxqjvmkzrf | dbt-docs-updater | Completed | pr_checks.yml | Oct 22, 2025, 6:30 PM | 4m 5s |
| run_abc123xyz456def789ghi | data-processor | Completed | data_cleaning.yml | Oct 23, 2025, 2:15 PM | 10m 12s |
| run_zyx987cba654vut432sqr | report-generator | Completed | annual_report.yml | Oct 23, 2025, 3:45 PM | 2m 30s |
| run_abcd5678efgh1234ijkl | user-notification-service | Completed | send_notifications.yml | Oct 23, 2025, 4:00 PM | 5m 1s |
| run_ijklmnopqrs9876tuvw | data-sync | Completed | sync_records.yml | Oct 23, 2025, 4:30 PM | 7m 45s |
| run_stuvwx1234yz5678abcd | image-processor | Completed | resize_images.yml | Oct 24, 2025, 1:00 PM | 3m 20s |
| run_efghijklmnopqrs7890abc | api-monitor | Completed | check_health.yml | Oct 24, 2025, 2:30 PM | 15m 5s |
| run_mnopqr1234abcdef5678 | data-archiver | Completed | archive_old_data.yml | Oct 24, 2025, 3:10 PM | 8m 35s |
| run_qrstuvwxyz1234567890 | log-analyzer | Completed | analyze_logs.yml | Oct 24, 2025, 4:00 PM | 6m 22s |
| run_abcdefg12345hijklmnop | data-visualizer | Completed | generate_charts.yml | Oct 24, 2025, 5:15 PM | 12m 7s |

Buster documents your dbt project, so agents deeply understand it

sales_order_detail.yml

version: 2

models:
  - name: sales_order_detail
    description: |

      Individual line items representing products sold within each sales order.
      
      Purpose: Line-item transaction table enabling revenue analysis, product performance tracking, discount effectiveness measurement, and basket composition analysis. Foundation for calculating revenue metrics, product-level profitability, and customer purchasing patterns. Used extensively by metrics models for calculating CLV, average order value, gross profit, and product-specific KPIs.
      
      Contents: One row per product line item on a sales order. Composite key: (salesOrderID, salesOrderDetailID). Scale: ~121K line items across ~31K orders spanning Sept 2022 to July 2025 (date-shifted to align with current date).
      
      Lineage: Direct pass-through from stg_sales_order_detail, which sources from sales.salesorderdetail. Staging layer calculates lineTotal field and applies date shifting to modifiedDate.
      
      Patterns:
      - Order simplicity: Most orders contain few items (avg 3.9 items per order). Single-item orders are extremely common, representing the dominant purchasing pattern.
      - Quantity concentration: 58% of line items are quantity 1, 71% are quantity 1-2. Bulk purchases (qty >10) represent <3% but can reach qty 44.
      - Product concentration: Top 10 products (out of 259) account for 20% of line items. Product 870 alone appears in 3.7% of all line items.
      - Discount sparsity: 97% of line items have no discount (unitPriceDiscount = 0). When discounts apply, they're typically 2%, 5%, 10%, 15%, or 20%.
      - Special offer dominance: 95% use specialOfferID = 1 (likely "No Discount" baseline offer), making non-promotional sales the norm.
      - Carrier tracking: 45% of line items have null carrierTrackingNumber, suggesting orders not yet shipped or using ship methods without tracking.
      - Price distribution: Highly skewed - median unit price $54.94, but ranges from $1.37 to $3578.27. High-value items (>$2000) appear in ~6% of line items.
      - Line total pattern: Log-normal distribution with median $183.94, mean $989.34. Most line items are modest value, but tails extend to $22K+ for high-quantity luxury purchases.
      
      Usage Guidance:
      Foundational fact table for sales analytics. Essential for calculating revenue totals, analyzing product performance, measuring discount impact, and understanding purchasing behavior. Most revenue metrics aggregate lineTotal; product analysis groups by productID; discount analysis filters or segments by unitPriceDiscount or specialOfferID. For customer behavior analysis, aggregate to order level first via salesOrderID to avoid over-counting multi-item orders. For product profitability, join to product table for cost data then calculate margin (lineTotal - cost). When analyzing average order value, aggregate line items by order first to get order-level totals.
      
      Critical Context:
      - lineTotal is calculated in staging as (unitPrice * orderQty * (1 - unitPriceDiscount)) and represents net revenue after discounts but before taxes/freight. This is the primary revenue metric field.
      - All dates shifted forward using shift_date() macro to make dataset feel current (max date aligns with March 28, 2025). Historical patterns span ~3 years.
      - Null carrierTrackingNumber doesn't indicate data quality issue - reflects legitimate business states (orders not shipped yet, certain ship methods, or in-store pickup).
      - salesOrderDetailID is unique within entire table (not just within order) - serves as primary key alone, though conceptually represents line item number within order.
      - unitPrice reflects actual selling price at time of sale (may differ from product.listPrice due to negotiated pricing, promotions, or price changes over time).
      - High orderQty outliers (>20) typically involve accessories or components sold in bulk, not bikes.
      - No line items exist without corresponding order in sales_order_header - referential integrity is clean.

    relationships:
      - name: sales_order_header
        description: >
          Business relationship: Every line item belongs to exactly one sales order. Order header provides order-level context (customer, dates, shipping, totals, status) that applies to all line items within that order. Join to get customer attribution, order timing, territory assignment, shipping details, and order-level calculated fields (purchase context filters, consultation level, etc.).
          Join considerations: Many-to-one from detail to header. Each salesOrderID in details appears in header exactly once. Each order in header typically has multiple detail rows (avg 3.9 line items per order, but distribution is right-skewed with many single-item orders).
          Coverage: 100% of line items match to header. Clean referential integrity - no orphaned details.
          Cardinality notes: Standard fact-to-dimension pattern. When joining, expect row count to remain same (detail-level grain preserved). When aggregating metrics from details, group by salesOrderID first to get order-level aggregates before further analysis to avoid over-representing multi-item orders.
        source_col: salesOrderID
        ref_col: salesOrderID
        cardinality

Deep Model Understanding

Buster deploys dozens of agents in parallel to index your dbt project and explore your repo.

Grounded in Metadata

Agents specialize in retrieving and traversing dbt metadata, data profiling metrics, and lineage.

Optimized for AI tools

Agents document nuance, edge cases, and how models should actually be used in analysis.

Buster is built with enterprise-grade security practices. This includes state-of-the-art encryption, safe and reliable infrastructure partners, and independently verified security controls.

SOC 2 Type II compliant

Buster has undergone a Service Organization Controls audit (SOC 2 Type II).

HIPAA compliant

Privacy & security measures to ensure that PHI is appropriately safeguarded.

Permissions & governance

Provision users, enforce permissions, & implement robust governance.

IP protection policy

Neither Buster nor our model partners train models on customer data.

Self-hosted deployment

Deploy in your own air-gapped environment.

Secure connections

SSL and pass-through OAuth available.

Simple pricing that scales with you

Start

Free

Free for everyone

For individual data professionals exploring AI automation

Unlimited team members

Up to 3 active agents

Bring your own API keys

Pro

$2,400

per month, billed monthly

For data teams automating dbt workflows at scale

Unlimited team members

Unlimited agents

Up to 4,000 runs per month

Full platform access

Enterprise

Contact us

Custom pricing

For unique compliance needs and large-scale dbt operations

Unlimited team members

Unlimited agents

Unlimited runs

Full platform access

Custom pricing & SLA

Advanced security

Frequently asked questions

How do I get started with Buster?

Getting started takes about 10 minutes. Check out our Quickstart guide to see how.

What kinds of tasks can Buster handle?

Buster excels at repetitive data engineering workflows. Anything you might instruct a data engineer teammate to do for you, Buster can automate. You can see a few examples here.

How does usage-based pricing work?

The Pro plan includes 4,000 agent runs per month. One run = one agent execution, regardless of duration. If you consistently exceed 4,000 runs per month, we’ll discuss an Enterprise agreement.

How does Buster work with my existing tools?

Buster integrates directly with your stack through native connections. It works with dbt Cloud and dbt Core, all major data warehouses, GitHub, and Slack. You can see all of our integrations here.

Is Buster secure?

Yes. Buster is SOC 2 compliant and is built with enterprise-grade security practices. Agents have read-only warehouse access and run in isolated sandboxes with ephemeral containers destroyed after each run.

How does Buster use my data?

We never train models on your data. All warehouse data remains in isolated sandboxes that are destroyed after each run completes. Enterprise customers can self-host for complete data control.

Ready to automate your dbt workflows? Create your first agent in 10 minutes.