mirror of https://github.com/BillyOutlast/posthog.git synced 2026-02-04 03:01:23 +01:00

PostHog Dagster DAGs

This directory contains Dagster data pipelines (DAGs) for PostHog. Dagster is a data orchestration framework that allows us to define, schedule, and monitor data workflows.

What is Dagster?

Dagster is an open-source data orchestration tool designed to help you define and execute data pipelines. Key concepts include:

Assets: Data artifacts that your pipelines produce and consume (e.g., tables, files)
Ops: Individual units of computation (functions)
Jobs: Collections of ops that are executed together
Resources: Shared infrastructure and connections (e.g. database connections)
Schedules: Time-based triggers for jobs
Sensors: Event-based triggers for jobs

Project Structure

locations/: Main Dagster definition files (split by team) that define assets, jobs, schedules, sensors, and resources
common.py: Shared utilities and resources
Individual DAG files (e.g., exchange_rate.py, deletes.py, person_overrides.py)
tests/: Tests for the DAGs

Cloud access for PostHog employees

Ask someone in #team-infrastructure or #team-clickhouse to add you to Dagster Cloud. You might also want to join the #dagster-posthog Slack channel.

Adding a New Team

To set up a new team with their own Dagster definitions and Slack alerts, follow these steps:

Create a new definitions file in locations/<team_name>.py:

import dagster

from dags import my_module  # Import your DAGs

from . import resources  # Import shared resources (if needed)

defs = dagster.Definitions(
    assets=[
        # List your assets here
        my_module.my_asset,
    ],
    jobs=[
        # List your jobs here
        my_module.my_job,
    ],
    schedules=[
        # List your schedules here
        my_module.my_schedule,
    ],
    resources=resources,  # Include shared resources (ClickHouse, S3, Slack, etc.)
)

Examples: See locations/analytics_platform.py (simple) or locations/web_analytics.py (complex with conditional schedules)

Register the location in the workspace (for local development):

Add your module to .dagster_home/workspace.yaml:
```
load_from:
  - python_module: dags.locations.your_team
```
Note: Only add locations that should run locally. Heavy operations should remain commented out.

Configure production deployment:

For PostHog employees, add the new location to the Dagster configuration in the charts repository (see config/dagster/).

Sample PR: https://github.com/PostHog/charts/pull/6366

Add team to the JobOwners enum in common/common.py:

from enum import Enum


class JobOwners(str, Enum):
    TEAM_ANALYTICS_PLATFORM = "team-analytics-platform"
    TEAM_YOUR_TEAM = "team-your-team"  # Add your team here (alphabetically sorted)
    # ... other teams

Add Slack channel mapping in slack_alerts.py:

notification_channel_per_team = {
    JobOwners.TEAM_ANALYTICS_PLATFORM.value: "#alerts-analytics-platform",
    JobOwners.TEAM_YOUR_TEAM.value: "#alerts-your-team",  # Add mapping here (alphabetically sorted)
    # ... other teams
}

Create the Slack channel (if it doesn't exist) and ensure the Alertmanager/Max Slack bot is invited to the channel
Apply owner tags to your team's assets and jobs (see next section)

How Slack alerts work

The notify_slack_on_failure sensor (defined in slack_alerts.py) monitors all job failures across all code locations
Alerts are only sent in production (when the CLOUD_DEPLOYMENT environment variable is set)
Each team has a dedicated Slack channel where their alerts are routed based on job ownership
Failed jobs send a message to the appropriate team channel with a link to the Dagster run
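
The routing rules above can be sketched as a small pure function. This is not the code in slack_alerts.py, only an illustration of the described behaviour. The function name resolve_alert_channel, the CHANNEL_PER_TEAM copy of the mapping, and the "owner" tag key are assumptions made for this example, as is returning None for an unknown owner.

```python
import os
from typing import Mapping, Optional

# Hypothetical copy of the team-to-channel mapping; the real one lives in slack_alerts.py.
CHANNEL_PER_TEAM = {
    "team-analytics-platform": "#alerts-analytics-platform",
    "team-your-team": "#alerts-your-team",
}


def resolve_alert_channel(
    run_tags: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the Slack channel for a failed run, or None if no alert should be sent."""
    # Alerts are only sent in production, signalled by CLOUD_DEPLOYMENT being set.
    if not environ.get("CLOUD_DEPLOYMENT"):
        return None
    # Assumption: job ownership is stored under an "owner" tag on the run.
    owner = run_tags.get("owner")
    if owner is None:
        return None
    # Assumption: runs with an unmapped owner produce no alert in this sketch.
    return CHANNEL_PER_TEAM.get(owner)
```

Passing the environment in as an argument keeps the function easy to test without touching the real process environment.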

Consecutive Failure Thresholds

Some jobs are configured to only alert after multiple consecutive failures to avoid alert fatigue. Configure this in slack_alerts.py:

CONSECUTIVE_FAILURE_THRESHOLDS = {
    "web_pre_aggregate_current_day_hourly_job": 3,  # Alert after 3 consecutive failures
    "your_job_name": 2,  # Add your threshold here
}
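
To make the threshold semantics concrete, here is a minimal sketch of how a "consecutive failures" check could work. The function should_alert and the newest-first list of status strings are hypothetical, and the default threshold of 1 for unlisted jobs is an assumption. The actual sensor may obtain and count run history differently.

```python
from typing import Sequence

# Copy of the mapping shown above, trimmed to one entry for the example.
CONSECUTIVE_FAILURE_THRESHOLDS = {
    "web_pre_aggregate_current_day_hourly_job": 3,
}


def should_alert(job_name: str, recent_statuses: Sequence[str]) -> bool:
    """Decide whether the latest failure should trigger an alert.

    recent_statuses lists run outcomes newest first,
    e.g. ["FAILURE", "FAILURE", "SUCCESS"].
    """
    # Assumption: jobs without an entry alert on the first failure.
    threshold = CONSECUTIVE_FAILURE_THRESHOLDS.get(job_name, 1)
    # Count the unbroken streak of failures starting from the newest run.
    streak = 0
    for status in recent_statuses:
        if status != "FAILURE":
            break
        streak += 1
    return streak >= threshold
```

With a threshold of 3, two failures followed by an older success do not alert, while three failures in a row do.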

Disabling Notifications

To disable Slack notifications for a specific job, add the disable_slack_notifications tag:

import dagster


@dagster.job(tags={"disable_slack_notifications": "true"})  # the value is the string "true"
def quiet_job():
    pass

Testing Alerts Locally

When running Dagster locally (with DEBUG=1), the Slack resource is replaced with a dummy resource, so no actual notifications are sent. This prevents test alerts from being sent to production Slack channels during development.

To test the alert routing logic, write unit tests in tests/test_slack_alerts.py.
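
A unit test for alerting typically swaps the Slack client for a stand-in that records messages. The sketch below shows that pattern in isolation. DummySlack, its post_message method, and notify_failure are hypothetical names for this example and are not the resource or helpers defined in this directory.

```python
from typing import Optional


class DummySlack:
    """Stand-in for a Slack client that records messages instead of sending them."""

    def __init__(self) -> None:
        self.sent = []

    def post_message(self, channel: str, text: str) -> None:
        self.sent.append({"channel": channel, "text": text})


def notify_failure(slack, channel: Optional[str], job_name: str, run_url: str) -> None:
    """Post a failure message with a link to the run, unless no channel was resolved."""
    if channel is None:
        return
    slack.post_message(channel=channel, text=f"Job {job_name} failed: {run_url}")


def test_failure_is_posted_to_team_channel() -> None:
    slack = DummySlack()
    notify_failure(slack, "#alerts-your-team", "your_job_name", "http://localhost:3000/runs/abc")
    assert slack.sent == [
        {
            "channel": "#alerts-your-team",
            "text": "Job your_job_name failed: http://localhost:3000/runs/abc",
        }
    ]


def test_no_channel_means_no_message() -> None:
    slack = DummySlack()
    notify_failure(slack, None, "your_job_name", "http://localhost:3000/runs/abc")
    assert slack.sent == []
```

pytest collects the two test_ functions automatically when the file name starts with test_.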

Local Development

Environment Setup

Dagster uses the DAGSTER_HOME environment variable to determine where to store instance configuration, logs, and other local artifacts. Set this to the .dagster_home directory at the top of this repository:

export DAGSTER_HOME=$(pwd)/.dagster_home

You can add this to your shell profile if you always want Dagster to store its local artifacts there, or to your local .env file, which dagster dev detects automatically.
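
If Dagster complains about its home directory on startup, a small preflight check can help narrow down the cause. The helper below is a hypothetical sketch, not part of this repository. It assumes Dagster expects DAGSTER_HOME to be an absolute path to an existing directory.

```python
import os
from pathlib import Path
from typing import Mapping


def check_dagster_home(environ: Mapping[str, str] = os.environ) -> Path:
    """Return DAGSTER_HOME as a Path, or raise with a hint if it looks unusable."""
    value = environ.get("DAGSTER_HOME")
    if not value:
        raise RuntimeError("DAGSTER_HOME is not set; try: export DAGSTER_HOME=$(pwd)/.dagster_home")
    home = Path(value)
    # Assumption: a relative path is rejected, so flag it early.
    if not home.is_absolute():
        raise RuntimeError(f"DAGSTER_HOME must be an absolute path, got {value!r}")
    if not home.is_dir():
        raise RuntimeError(f"DAGSTER_HOME is not an existing directory: {home}")
    return home
```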

Running the Development Server

(Recommended) The Dagster development server starts automatically if you are using the top-level local development script:

./bin/start.sh

To run only the Dagster development server locally:

export DAGSTER_HOME=$(pwd)/.dagster_home
export DEBUG=1 # Important: Set DEBUG=1 when running locally to use local resources
dagster dev --workspace $DAGSTER_HOME/workspace.yaml

The Dagster UI will be available at http://localhost:3000 by default, where you can:

Browse assets, jobs, and schedules
Manually trigger job runs
View execution logs and status
Debug pipeline issues

Adding New DAGs

When adding a new DAG:

Create a new Python file for your DAG
Define your assets, ops, and jobs
Import and register them in the relevant file in dags/locations/
Add appropriate tests in the tests/ directory

Running Tests

Tests are implemented using pytest. The following command will run all DAG tests:

# From the project root
pytest dags/

To run a specific test file:

pytest dags/tests/test_exchange_rate.py

To run a specific test:

pytest dags/tests/test_exchange_rate.py::test_name

Add -v for verbose output:

pytest -v dags/tests/test_exchange_rate.py

Web Analytics Pre-Aggregated Tables

Note: For materializing web analytics preaggregated tables locally (e.g., during development or testing), you may want to use a higher partition count to process more data in a single run:

DAGSTER_WEB_PREAGGREGATED_MAX_PARTITIONS_PER_RUN=3000 DEBUG=1 dagster dev -m dags.definitions

This will allow backfills to process up to 3000 partitions per run instead of the default, significantly reducing the number of individual runs needed for large historical backfills.
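
The effect on run count is simple arithmetic: roughly the number of partitions divided by the per-run limit, rounded up. The helper below is a hypothetical illustration, not code from this repository, and the 730-partition figure is just an example of two years of daily partitions.

```python
import math


def runs_needed(total_partitions: int, max_partitions_per_run: int) -> int:
    """Approximate number of backfill runs for a given per-run partition limit."""
    if max_partitions_per_run < 1:
        raise ValueError("max_partitions_per_run must be at least 1")
    return math.ceil(total_partitions / max_partitions_per_run)


if __name__ == "__main__":
    # Two years of daily partitions under three different limits.
    for limit in (1, 100, 3000):
        print(limit, runs_needed(730, limit))
```

With a limit of 3000, all 730 partitions fit in a single run. With a limit of 1, the same backfill needs 730 runs.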

Testing Concurrency Limits Locally

To test job concurrency limits (useful for jobs like web_analytics_daily_job that use backfill policies), you need to configure a dagster.yaml file with concurrency settings. This is especially important for asset backfills which create __ASSET_JOB runs that can overwhelm your system if not properly limited.

Setup

Create the Dagster home directory and configuration file:

mkdir -p .dagster_home

Create .dagster_home/dagster.yaml with the following content:

run_coordinator:
  module: dagster._core.run_coordinator.queued_run_coordinator
  class: QueuedRunCoordinator
  config:
    dequeue_interval_seconds: 5

run_launcher:
  module: dagster._core.launcher.default_run_launcher
  class: DefaultRunLauncher

concurrency:
  runs:
    max_concurrent_runs: 10 # Overall instance limit
    tag_concurrency_limits:
      # Limit specific job types
      - key: 'dagster/job_name'
        value: 'web_analytics_daily_job'
        limit: 1
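
To build intuition for how these two limits interact, here is a toy model of a single dequeue pass. It is a simplified sketch written for this README, not Dagster's run coordinator, and the dequeue function and its list-based inputs are assumptions of the example.

```python
from collections import Counter
from typing import List

MAX_CONCURRENT_RUNS = 10
# Mirrors tag_concurrency_limits above: at most one run per listed job name.
JOB_NAME_LIMITS = {"web_analytics_daily_job": 1}


def dequeue(queued: List[str], running: List[str]) -> List[str]:
    """Return the job names from the queue that may start now, in queue order."""
    active = Counter(running)
    total = len(running)
    started = []
    for job_name in queued:
        # The overall instance limit applies to every run.
        if total >= MAX_CONCURRENT_RUNS:
            break
        limit = JOB_NAME_LIMITS.get(job_name)
        if limit is not None and active[job_name] >= limit:
            continue  # this run stays QUEUED until a slot frees up
        started.append(job_name)
        active[job_name] += 1
        total += 1
    return started
```

In this model, three queued web_analytics_daily_job runs start one at a time, while runs of other jobs are held back only by the overall limit of 10.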

Run Dagster with the configuration:

```bash
# Force a small number of partitions per run so that a backfill creates multiple runs
export DAGSTER_HOME=$(pwd)/.dagster_home && DAGSTER_WEB_PREAGGREGATED_MAX_PARTITIONS_PER_RUN=1 DEBUG=1 dagster dev -m dags.definitions
```

Testing

In the Dagster UI, navigate to your assets (e.g., web analytics assets)
Start a backfill for several days (e.g., 3-5 days)
Check the "Runs" page - you should observe:
- Only 1 run in STARTED/STARTING status at a time for the same concurrency group
- Other runs waiting in QUEUED status
- Runs progressing sequentially: QUEUED → STARTED → SUCCESS

Production Configuration

For production deployments, configure similar concurrency settings in your dagster.yaml. For PostHog employees, these settings live in our charts repo: https://github.com/PostHog/charts/blob/master/config/dagster
