diff --git a/.claude/agents/access-control.md b/.claude/agents/access-control.md index ab7bbd5012..9b1c13b911 100644 --- a/.claude/agents/access-control.md +++ b/.claude/agents/access-control.md @@ -363,7 +363,7 @@ RESOURCE_INHERITANCE_MAP = { The `AccessControlViewSetMixin` automatically adds these endpoints: -``` +```text GET /api/projects/{project_id}/{resource}/{id}/access_controls/ POST /api/projects/{project_id}/{resource}/{id}/access_controls/ DELETE /api/projects/{project_id}/{resource}/{id}/access_controls/ diff --git a/.config/.markdownlint-cli2.jsonc b/.config/.markdownlint-cli2.jsonc index 13e94871eb..4f80c4fdcc 100644 --- a/.config/.markdownlint-cli2.jsonc +++ b/.config/.markdownlint-cli2.jsonc @@ -1,8 +1,17 @@ { + "gitignore": true, "config": { "MD013": false, "MD033": false, "MD034": false, - "MD036": false + "MD036": false, + "MD041": false, + "MD024": false, + "MD001": false, + "MD025": false, + "MD026": false, + "MD028": false, + "MD045": false, + "MD029": false } } diff --git a/.github/ISSUE_TEMPLATE/sprint_planning_retro.md b/.github/ISSUE_TEMPLATE/sprint_planning_retro.md index 8499eb11fd..5bbda08431 100644 --- a/.github/ISSUE_TEMPLATE/sprint_planning_retro.md +++ b/.github/ISSUE_TEMPLATE/sprint_planning_retro.md @@ -15,7 +15,7 @@ title: Sprint 1.n.0 m/2 - Jan 1 to Jan 12 For your team sprint planning copy this template into a comment below for each team. 
-``` +```text # Team ___ **Support hero:** ___ @@ -40,5 +40,4 @@ For your team sprint planning copy this template into a comment below for each t ### Low priority / side quests - - ``` diff --git a/.github/workflows/ci-frontend.yml b/.github/workflows/ci-frontend.yml index a71383b538..f8e40d59fc 100644 --- a/.github/workflows/ci-frontend.yml +++ b/.github/workflows/ci-frontend.yml @@ -52,6 +52,9 @@ jobs: - tsconfig.*.json - webpack.config.js - stylelint* + - '**/*.md' + - '**/*.mdx' + - .config/.markdownlint-cli2.jsonc frontend-format: name: Frontend formatting @@ -110,6 +113,10 @@ jobs: if: needs.changes.outputs.frontend == 'true' run: pnpm --filter=@posthog/frontend lint:js -f github + - name: Lint markdown files + if: needs.changes.outputs.frontend == 'true' + run: pnpm exec markdownlint-cli2 --config .config/.markdownlint-cli2.jsonc "**/*.{md,mdx}" + frontend-toolbar-checks: name: Frontend toolbar checks needs: [changes] diff --git a/README.md b/README.md index 21b0c47c0e..fb995e8718 100644 --- a/README.md +++ b/README.md @@ -41,8 +41,8 @@ Best of all, all of this is free to use with a [generous monthly free tier](http - [PostHog is an all-in-one, open source platform for building successful products](#posthog-is-an-all-in-one-open-source-platform-for-building-successful-products) - [Table of Contents](#table-of-contents) - [Getting started with PostHog](#getting-started-with-posthog) - - [PostHog Cloud (Recommended)](#posthog-cloud-recommended) - - [Self-hosting the open-source hobby deploy (Advanced)](#self-hosting-the-open-source-hobby-deploy-advanced) + - [PostHog Cloud (Recommended)](#posthog-cloud-recommended) + - [Self-hosting the open-source hobby deploy (Advanced)](#self-hosting-the-open-source-hobby-deploy-advanced) - [Setting up PostHog](#setting-up-posthog) - [Learning more about PostHog](#learning-more-about-posthog) - [Contributing](#contributing) @@ -106,7 +106,7 @@ Need _absolutely πŸ’―% FOSS_? 
Check out our [posthog-foss](https://github.com/Po The pricing for our paid plan is completely transparent and available on [our pricing page](https://posthog.com/pricing). -## We’re hiring! +## We're hiring! Hedgehog working on a Mission Control Center diff --git a/common/hogvm/README.md b/common/hogvm/README.md index 7f3ce80e85..a12266fc6b 100644 --- a/common/hogvm/README.md +++ b/common/hogvm/README.md @@ -6,7 +6,7 @@ A HogVM is a πŸ¦” that runs Hog bytecode. It's purpose is to locally evaluate Ho Hog Bytecode is a compact representation of a subset of the Hog AST nodes. It follows a certain structure: -``` +```python 1 + 2 # [_H, op.INTEGER, 2, op.INTEGER, 1, op.PLUS] 1 and 2 # [_H, op.INTEGER, 2, op.INTEGER, 1, op.AND, 2] 1 or 2 # [_H, op.INTEGER, 2, op.INTEGER, 1, op.OR, 2] diff --git a/docker/temporal/dynamicconfig/README.md b/docker/temporal/dynamicconfig/README.md index 67354e3915..7e1be8f008 100644 --- a/docker/temporal/dynamicconfig/README.md +++ b/docker/temporal/dynamicconfig/README.md @@ -12,30 +12,30 @@ constraints. 
There are only three types of constraint: Please use the following format: -``` +```yaml testGetBoolPropertyKey: - - value: false - - value: true - constraints: - namespace: "global-samples-namespace" - - value: false - constraints: - namespace: "samples-namespace" + - value: false + - value: true + constraints: + namespace: 'global-samples-namespace' + - value: false + constraints: + namespace: 'samples-namespace' testGetDurationPropertyKey: - - value: "1m" - constraints: - namespace: "samples-namespace" - taskQueueName: "longIdleTimeTaskqueue" + - value: '1m' + constraints: + namespace: 'samples-namespace' + taskQueueName: 'longIdleTimeTaskqueue' testGetFloat64PropertyKey: - - value: 12.0 - constraints: - namespace: "samples-namespace" + - value: 12.0 + constraints: + namespace: 'samples-namespace' testGetMapPropertyKey: - - value: - key1: 1 - key2: "value 2" - key3: - - false - - key4: true - key5: 2.0 + - value: + key1: 1 + key2: 'value 2' + key3: + - false + - key4: true + key5: 2.0 ``` diff --git a/docs/S3_QUERY_CACHE_SETUP.md b/docs/S3_QUERY_CACHE_SETUP.md index f9b780f00f..fc00aa8da8 100644 --- a/docs/S3_QUERY_CACHE_SETUP.md +++ b/docs/S3_QUERY_CACHE_SETUP.md @@ -4,7 +4,7 @@ 1 **Object Tagging**: Each S3 object gets tags: -``` +```text ttl_days=1 # Calculated TTL in days team_id=123 # Team identifier ``` diff --git a/ee/benchmarks/README.md b/ee/benchmarks/README.md index ef7bdcca25..22c3a98687 100644 --- a/ee/benchmarks/README.md +++ b/ee/benchmarks/README.md @@ -45,7 +45,7 @@ CLICKHOUSE_HOST=X CLICKHOUSE_USER=X CLICKHOUSE_PASSWORD=X CLICKHOUSE_DATABASE=po You'll probably want to be running one test, with quick iteration. Running e.g.: -``` +```bash asv run --config ee/benchmarks/asv.conf.json --bench track_lifecycle --quick ``` diff --git a/ee/hogai/README.md b/ee/hogai/README.md index 059ff1636e..08b9b7a850 100644 --- a/ee/hogai/README.md +++ b/ee/hogai/README.md @@ -156,12 +156,15 @@ NOTE: this won't extend query types generation. 
For that, talk to the Max AI tea - Add a new formatter class in `query_executor/format.py` that implements query result formatting for AI consumption (see below, point 3) - Add formatting logic to `_compress_results()` method in `query_executor/query_executor.py`: + ```python elif isinstance(query, YourNewAssistantQuery | YourNewQuery): return YourNewResultsFormatter(query, response["results"]).format() ``` + - Add example prompts for your query type in `query_executor/prompts.py`, this explains to the LLM the query results formatting - Update `_get_example_prompt()` method in `query_executor/nodes.py` to handle your new query type: + ```python if isinstance(viz_message.answer, YourNewAssistantQuery): return YOUR_NEW_EXAMPLE_PROMPT @@ -169,6 +172,7 @@ NOTE: this won't extend query types generation. For that, talk to the Max AI tea 2. **Update the root node** (`@ee/hogai/graph/root/`): - Add your new query type to the `MAX_SUPPORTED_QUERY_KIND_TO_MODEL` mapping in `nodes.py:57`: + ```python MAX_SUPPORTED_QUERY_KIND_TO_MODEL: dict[str, type[SupportedQueryTypes]] = { "TrendsQuery": TrendsQuery, diff --git a/frontend/src/queries/README.md b/frontend/src/queries/README.md index 48ac9e9e6b..6d35571f9d 100644 --- a/frontend/src/queries/README.md +++ b/frontend/src/queries/README.md @@ -1,16 +1,16 @@ # Queries - `Query/` - - Generic component that routes internally to the right node. - - `` + - Generic component that routes internally to the right node. + - `` - `QueryEditor/` - - Generic JSON editor - - `` + - Generic JSON editor + - `` - `nodes/` - - The folders in this directory (`EventsNode/`, `DataTable/`, etc) contain React components that display queries of that specific `kind`. - - Basically everything in `nodes/DataTable/` expects your query to be of kind `DataTable`. - - The top level component, `DataTable.tsx`, always exports the component `DataTable({ query, setQuery })` - - There are various sub-components as needed, e.g. ``, ``. 
Some of them depend on a logic, likely `dataNodeLogic`, being in a `BindLogic` context, so read the source. + - The folders in this directory (`EventsNode/`, `DataTable/`, etc) contain React components that display queries of that specific `kind`. + - Basically everything in `nodes/DataTable/` expects your query to be of kind `DataTable`. + - The top level component, `DataTable.tsx`, always exports the component `DataTable({ query, setQuery })` + - There are various sub-components as needed, e.g. ``, ``. Some of them depend on a logic, likely `dataNodeLogic`, being in a `BindLogic` context, so read the source. - `examples.ts` - Various examples used in storybook - `query.ts` - make API calls to fetch data for any query - `schema.json` - JSON schema, used for query editor, built with `pnpm -w schema:build` diff --git a/frontend/src/scenes/max/README.md b/frontend/src/scenes/max/README.md index 8c6ed09894..fd855a1526 100644 --- a/frontend/src/scenes/max/README.md +++ b/frontend/src/scenes/max/README.md @@ -4,13 +4,13 @@ Scene logics can expose a `maxContext` selector to provide relevant context to M To do so: -1. Import the necessary types and helpers: +1. Import the necessary types and helpers: ```typescript import { MaxContextInput, createMaxContextHelpers } from 'scenes/max/maxTypes' ``` -2. Add a `maxContext` selector that returns MaxContextInput[]: +2. Add a `maxContext` selector that returns MaxContextInput[]: ```typescript selectors({ @@ -26,7 +26,8 @@ To do so: }) ``` -3. For multiple context items: +3. 
For multiple context items: + ```typescript maxContext: [ (s) => [s.insight, s.events], diff --git a/frontend/src/scenes/web-analytics/contributing.md b/frontend/src/scenes/web-analytics/contributing.md index 071a25de4c..cc0ece965b 100644 --- a/frontend/src/scenes/web-analytics/contributing.md +++ b/frontend/src/scenes/web-analytics/contributing.md @@ -40,8 +40,8 @@ Some web analytics features are present in the [toolbar](https://posthog.com/doc ## More resources - Clickhouse - - PostHog maintains a [Clickhouse manual](https://posthog.com/handbook/engineering/clickhouse) - - Clickhouse has a [video course](https://learn.clickhouse.com/visitor_class_catalog/category/116050), which has been recommended by some team members - - You can skip the videos that are about e.g. migrating from another tool to Clickhouse - - [Designing Data-Intensive Applications](https://dataintensive.net/) is a great book about distributed systems, and chapter 3 introduces OLAP / columnar databases. - - If you already know what an OLAP database is, you'd probably get more out of the Clickhouse course than this book. This book is good at introducing concepts but won't touch on Clickhouse specifically. + - PostHog maintains a [Clickhouse manual](https://posthog.com/handbook/engineering/clickhouse) + - Clickhouse has a [video course](https://learn.clickhouse.com/visitor_class_catalog/category/116050), which has been recommended by some team members + - You can skip the videos that are about e.g. migrating from another tool to Clickhouse + - [Designing Data-Intensive Applications](https://dataintensive.net/) is a great book about distributed systems, and chapter 3 introduces OLAP / columnar databases. + - If you already know what an OLAP database is, you'd probably get more out of the Clickhouse course than this book. This book is good at introducing concepts but won't touch on Clickhouse specifically. 
diff --git a/frontend/src/stories/Hello.stories.mdx b/frontend/src/stories/Hello.stories.mdx index 359c0afedb..e1eb055649 100644 --- a/frontend/src/stories/Hello.stories.mdx +++ b/frontend/src/stories/Hello.stories.mdx @@ -31,6 +31,6 @@ To run storybook locally, run `pnpm storybook`. It'll open on [http://localhost: To edit in the cloud, launch a new github codespace for [this repository](https://github.com/posthog/posthog), then run `pnpm i` and `pnpm storybook` -## Hot tips: +## Hot tips - When you're in a [scene story](/story/scenes-app-dashboard--edit), hit "a" and click "Story" to see its source. diff --git a/frontend/src/stories/How to build a scene.stories.mdx b/frontend/src/stories/How to build a scene.stories.mdx index 6b523c7ad4..20b18116c6 100644 --- a/frontend/src/stories/How to build a scene.stories.mdx +++ b/frontend/src/stories/How to build a scene.stories.mdx @@ -8,9 +8,9 @@ If you want to add a new scene in the PostHog App frontend, here are 7 easy step But first, you must answer one question: Does your scene depend on an `id` in the URL, like `/dashboard/:id`? -## Option A: I'm buliding a global scene that does not depend on an `id` in the URL. +## Option A: I'm building a global scene that does not depend on an `id` in the URL -### 1. Create the component, logic and styles. +### 1. Create the component, logic and styles Create a component like: `frontend/src/scenes/dashboard/Dashboards.tsx` @@ -112,9 +112,9 @@ export const appScenes: Record any> = { } ``` -## Option B: My scene depends on an `id` in the URL (`/dashboard/:id`). +## Option B: My scene depends on an `id` in the URL (`/dashboard/:id`) -### 1. Create the component, logic and styles. +### 1. 
Create the component, logic and styles Create a component like: `frontend/src/scenes/dashboard/Dashboard.tsx` diff --git a/frontend/src/stories/Missing scenes.stories.mdx b/frontend/src/stories/Missing scenes.stories.mdx index ae90203a32..68c5ef8e1d 100644 --- a/frontend/src/stories/Missing scenes.stories.mdx +++ b/frontend/src/stories/Missing scenes.stories.mdx @@ -7,16 +7,16 @@ import { Meta } from '@storybook/addon-docs' The following scenes are missing. Please help add them: - Dashboards - - List - - Dashboards + - List + - Dashboards - Insights - - Funnels - - Empty state with bad exclusion filters - - Correlation Results - - Property Correlation Results - - Skewed Funnel Results + - Funnels + - Empty state with bad exclusion filters + - Correlation Results + - Property Correlation Results + - Skewed Funnel Results - Recordings - - View a recording + - View a recording - Experiments - Data Managmenet - Persons & Groups diff --git a/infra-scripts/clitools/README.md b/infra-scripts/clitools/README.md index d0eca1e835..39bf0a068c 100644 --- a/infra-scripts/clitools/README.md +++ b/infra-scripts/clitools/README.md @@ -7,6 +7,7 @@ The primary function is to help manage and connect to PostHog toolbox pods in a 1. Ensure you have Python 3.x installed on your system 2. Clone this repository or download `toolbox.py` 3. 
Make the script executable (Unix-based systems): + ```bash chmod +x toolbox.py ``` diff --git a/infra-scripts/clitools/toolbox/README.md b/infra-scripts/clitools/toolbox/README.md index cd9b3fec99..eda5246c71 100644 --- a/infra-scripts/clitools/toolbox/README.md +++ b/infra-scripts/clitools/toolbox/README.md @@ -16,12 +16,12 @@ A command line utility to connect to PostHog toolbox pods in a Kubernetes enviro The toolbox utility uses a hybrid approach with modular functions in a package but the main entry point in the top-level script: - **Main script**: - - `toolbox.py` - Main script with argument parsing and the core workflow + - `toolbox.py` - Main script with argument parsing and the core workflow - **Support modules**: - - `toolbox/kubernetes.py` - Functions for working with Kubernetes contexts - - `toolbox/user.py` - User identification and ARN parsing - - `toolbox/pod.py` - Pod management (finding, claiming, connecting, deleting) + - `toolbox/kubernetes.py` - Functions for working with Kubernetes contexts + - `toolbox/user.py` - User identification and ARN parsing + - `toolbox/pod.py` - Pod management (finding, claiming, connecting, deleting) This structure keeps the main flow in a single script for easy understanding while separating the implementation details into modular components. 
diff --git a/livestream/README.md b/livestream/README.md index e717f87991..6046846514 100644 --- a/livestream/README.md +++ b/livestream/README.md @@ -8,18 +8,18 @@ Hog 3000 powers live event stream on PostHog: https://us.posthog.com/project/0/a ## Endpoints - - `/` - dummy placeholder - - `/served` - total number of events and users recorded - - `/stats` - number of unique users (distinct id) on a page - - `/events` - stream consumed events to the requester, it's a done through +- `/` - dummy placeholder +- `/served` - total number of events and users recorded +- `/stats` - number of unique users (distinct id) on a page +- `/events` - stream consumed events to the requester, it's done through [Server Side Event](sse-moz), it supports extra query params adding filters: - - `eventType` - event type name, - - `distinctId` - only events with a given distinctId, - - `geo` - return only coordinates guessed based on IP, - - `/debug` - dummy html for SSE testing, - - `/debug/sse/` - backend for `/debug` generating a server side events, - - `/metrics` - exposes metrics in Prometheus format - + - `eventType` - event type name, + - `distinctId` - only events with a given distinctId, + - `geo` - return only coordinates guessed based on IP, +- `/debug` - dummy html for SSE testing, +- `/debug/sse/` - backend for `/debug` generating server-side events, +- `/metrics` - exposes metrics in Prometheus format + ## Installing One needs a IP -> (lat,lng) database: @@ -37,7 +37,6 @@ go run . ``` ## Notice + If modifying fields with `//easyjson:json` comment, one must regenerate the easyjson marshaller / unmarshaller. 
It requires to install: `go install github.com/mailru/easyjson/...@latest` - -[sse-moz]: https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events diff --git a/playwright/README.md b/playwright/README.md index 39fd1a819a..f9ee374445 100644 --- a/playwright/README.md +++ b/playwright/README.md @@ -1,6 +1,6 @@ # End-to-End Testing -## `/e2e/` directory contains all the end-to-end tests. +## `/e2e/` directory contains all the end-to-end tests to run the new playwright tests, run the following command: @@ -20,7 +20,7 @@ you might need to install playwright with `pnpm --filter=@posthog/playwright exe ## Writing tests -### Flaky tests are almost always due to not waiting for the right thing. +### Flaky tests are almost always due to not waiting for the right thing Consider adding a better selector, an intermediate step like waiting for URL or page title to change, or waiting for a critical network request to complete. @@ -28,7 +28,7 @@ Consider adding a better selector, an intermediate step like waiting for URL or If you write a selector that is too loose and matches multiple elements, playwright will output all the matches. 
With a better selector for each -``` +```text Error: locator.click: Error: strict mode violation: locator('text=Set a billing limit') resolved to 2 elements: 1) Set a billing limit aka getByTestId('billing-limit-input-wrapper-product_analytics').getByRole('button', { name: 'Set a billing limit' }) 2) Set a billing limit aka getByTestId('billing-limit-input-wrapper-session_replay').getByRole('button', { name: 'Set a billing limit' }) diff --git a/posthog/api/test/__snapshots__/test_event.ambr b/posthog/api/test/__snapshots__/test_event.ambr index 5c6ac74080..4916c9484d 100644 --- a/posthog/api/test/__snapshots__/test_event.ambr +++ b/posthog/api/test/__snapshots__/test_event.ambr @@ -286,7 +286,7 @@ /* user_id:0 request:_snapshot_ */ SELECT DISTINCT nullIf(nullIf(events.mat_visible_prop, ''), 'null') AS visible_prop FROM events - WHERE and(equals(events.team_id, 99999), greaterOrEquals(toTimeZone(events.timestamp, 'UTC'), '2025-09-24 00:00:00'), lessOrEquals(toTimeZone(events.timestamp, 'UTC'), '2025-10-01 23:59:59'), isNotNull(nullIf(nullIf(events.mat_visible_prop, ''), 'null'))) + WHERE and(equals(events.team_id, 99999), greaterOrEquals(toTimeZone(events.timestamp, 'UTC'), '2025-09-25 00:00:00'), lessOrEquals(toTimeZone(events.timestamp, 'UTC'), '2025-10-02 23:59:59'), isNotNull(nullIf(nullIf(events.mat_visible_prop, ''), 'null'))) LIMIT 10 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1, @@ -322,7 +322,7 @@ /* user_id:0 request:_snapshot_ */ SELECT DISTINCT nullIf(nullIf(events.mat_test_prop, ''), 'null') AS test_prop FROM events - WHERE and(equals(events.team_id, 99999), greaterOrEquals(toTimeZone(events.timestamp, 'UTC'), '2025-09-24 00:00:00'), lessOrEquals(toTimeZone(events.timestamp, 'UTC'), '2025-10-01 23:59:59'), isNotNull(nullIf(nullIf(events.mat_test_prop, ''), 'null'))) + WHERE and(equals(events.team_id, 99999), greaterOrEquals(toTimeZone(events.timestamp, 'UTC'), '2025-09-25 00:00:00'), 
lessOrEquals(toTimeZone(events.timestamp, 'UTC'), '2025-10-02 23:59:59'), isNotNull(nullIf(nullIf(events.mat_test_prop, ''), 'null'))) LIMIT 10 SETTINGS readonly=2, max_execution_time=60, allow_experimental_object_type=1, diff --git a/posthog/clickhouse/migrations/README.md b/posthog/clickhouse/migrations/README.md index 3f291ca651..6c1ca987f2 100644 --- a/posthog/clickhouse/migrations/README.md +++ b/posthog/clickhouse/migrations/README.md @@ -60,19 +60,16 @@ This may cause lots of troubles and block migrations. The `ON CLUSTER` clause is used to specify the cluster to run the DDL statement on. By default, the `posthog` cluster is used. That cluster only includes the data nodes. - - ### Testing To re-run a migration, you'll need to delete the entry from the `infi_clickhouse_orm_migrations` table. - ## Ingestion layer We have extra nodes with a sole purpose of ingesting the data from Kafka topics into ClickHouse tables. The way to do that is to: 1. Create your data table in ClickHouse main cluster. -2. Create a writable table only on ingestion nodes: `node_roles=[NodeRole.INGESTION_SMALL]`. It should be Distributed table with your data table. If your data table is non-sharded, you should point it to one shard: `Distributed(..., cluster=settings.CLICKHOUSE_SINGLE_SHARD_CLUSTER)`. +2. Create a writable table only on ingestion nodes: `node_roles=[NodeRole.INGESTION_SMALL]`. It should be a Distributed table with your data table. If your data table is non-sharded, you should point it to one shard: `Distributed(..., cluster=settings.CLICKHOUSE_SINGLE_SHARD_CLUSTER)`. 3. Create a Kafka table in ingestion nodes: `node_roles=[NodeRole.INGESTION_SMALL]`. 4. Create materialized view between Kafka table and writable table on ingestion nodes. 
diff --git a/posthog/temporal/README.md b/posthog/temporal/README.md index 85d83060b7..effdfc288a 100644 --- a/posthog/temporal/README.md +++ b/posthog/temporal/README.md @@ -173,13 +173,13 @@ The most important rule when writing asyncio code is: **DO NOT BLOCK** the event Asyncio is not new in Python (originally introduced in 3.4, and the new keywords in 3.5), but it has not been widely adopted in PostHog (yet!). This means that there isn't much code we can re-use from the PostHog monolith within Temporal activities. In particular, Django models will issue blocking requests when using the same method calls used anywhere else in PostHog. For this reason, more often than not, some amount of work is required to bring code from other parts of PostHog into activities: - Sometimes, the library you need to use has adopted asyncio and offers methods that can be a drop-in replacement. - - For example: Django models have async methods that just append `a` to the front: `MyModel.objects.get(...)` becomes `await MyModel.objects.aget(...)`. But not all the Django model API has support for asyncio, so check the documentation for our current version of Django. + - For example: Django models have async methods that just append `a` to the front: `MyModel.objects.get(...)` becomes `await MyModel.objects.aget(...)`. But not all the Django model API has support for asyncio, so check the documentation for our current version of Django. - If the library you require doesn't support asyncio, an alternative may exist. - - For example: The popular `requests` is blocking, but multiple alternatives with asyncio support exist, like `aiohttp` and `httpx`, and generally the API is quite similar, and doesn't require many code changes. - - Another example: The `aioboto3` implements asyncio support for `boto3`. - - One more: The `aiokafka` provides consumer and producer classes with non-blocking methods to interact with Kafka. 
+ - For example: The popular `requests` is blocking, but multiple alternatives with asyncio support exist, like `aiohttp` and `httpx`, and generally the API is quite similar, and doesn't require many code changes. + - Another example: The `aioboto3` implements asyncio support for `boto3`. + - One more: The `aiokafka` provides consumer and producer classes with non-blocking methods to interact with Kafka. - If none of the above, you could get around by running blocking code in a thread pool using `concurrent.futures.ThreadPoolExecutor` or just `asyncio.to_thread`. - - Python releases the GIL on an I/O operation, so you can send that code to a different thread to avoid blocking the main thread with the asyncio event loop. + - Python releases the GIL on an I/O operation, so you can send that code to a different thread to avoid blocking the main thread with the asyncio event loop. - Similarly, if the blocking code is CPU bound, you could try using a `concurrent.futures.ProcessPoolExecutor`. - If nothing worked, you will need to re-implement the code using asyncio libraries and primitives. @@ -308,9 +308,10 @@ By default, the logger you get from `structlog.get_logger` is configured to do b > [!NOTE] > Do note that producing logs requires extra configuration to fit the `log_entries` table schema: -> * A `team_id` must be set somewhere in the context. -> * The function `resolve_log_source` in `posthog/temporal/common/logger.py` must be configured to resolve a `log_source` from your workflow's ID and type. -> That being said, we want logging to be there when you need it, but otherwise get out of the way. For this reason, writing logs to stdout will always work, regardless of whether the requirements for log production are met or not. Moreover, if the requirements for log production are not met, log production will not crash your workflows. +> +> - A `team_id` must be set somewhere in the context. 
+> - The function `resolve_log_source` in `posthog/temporal/common/logger.py` must be configured to resolve a `log_source` from your workflow's ID and type. +> That being said, we want logging to be there when you need it, but otherwise get out of the way. For this reason, writing logs to stdout will always work, regardless of whether the requirements for log production are met or not. Moreover, if the requirements for log production are not met, log production will not crash your workflows. > [!TIP] > If you don't care about log production, you can use `get_write_only_logger` from `posthog/temporal/common/logger.py` to obtain a logger that only writes to stdout. `get_produce_only_logger` works analogously. @@ -485,6 +486,6 @@ As you run workflows, you will be able to see the logs in the worker's logs, and ## Examples in PostHog -All of batch exports is built in Temporal, see some example workflows in [here](https://github.com/PostHog/posthog/tree/master/products/batch_exports/backend/temporal/destinations). +All of batch exports is built in Temporal; see [example workflows in batch exports](https://github.com/PostHog/posthog/tree/master/products/batch_exports/backend/temporal/destinations). -Examples on how to unit test temporal workflows are available [here](https://github.com/PostHog/posthog/tree/master/products/batch_exports/backend/tests/temporal). +[Examples on unit testing Temporal workflows](https://github.com/PostHog/posthog/tree/master/products/batch_exports/backend/tests/temporal) are available in the batch exports tests. diff --git a/posthog/temporal/data_imports/sources/README.md b/posthog/temporal/data_imports/sources/README.md index 5941b60449..786773847b 100644 --- a/posthog/temporal/data_imports/sources/README.md +++ b/posthog/temporal/data_imports/sources/README.md @@ -16,14 +16,19 @@ Adding a new source should be pretty simple. 
We've refactored the sources so tha **This step is REQUIRED** - without it, `@SourceRegistry.register` won't work and your source won't be discoverable. 9. **Re-run config generation** after implementing source logic: + ```bash pnpm generate:source-configs ``` + This updates `generated_configs.py` with your actual implemented source class. + 10. **Build schemas** to update types: + ```bash pnpm schema:build ``` + This ensures your source appears in frontend dropdowns and forms. ### Source file template @@ -206,7 +211,8 @@ If your source uses OAuth (SourceFieldOauthConfig): ``` 3. **Redirect URI**: Configure in external service: - ``` + + ```text https://localhost:8010/integrations/your-kind/callback ``` diff --git a/posthog/warehouse/README.md b/posthog/warehouse/README.md index 844fe92388..ad105fde81 100644 --- a/posthog/warehouse/README.md +++ b/posthog/warehouse/README.md @@ -10,13 +10,13 @@ HOMEBREW_ACCEPT_EULA=Y brew install msodbcsql18 mssql-tools18 Without this, you'll get the following error when connecting a SQL database to data warehouse: -``` +```text symbol not found in flat namespace '_bcp_batch' ``` If the issue persists, install from source without cache again -``` +```bash pip install --pre --no-binary :all: pymssql --no-cache ``` diff --git a/products/revenue_analytics/backend/views/README.md b/products/revenue_analytics/backend/views/README.md index f9e06859ba..b49db3ae54 100644 --- a/products/revenue_analytics/backend/views/README.md +++ b/products/revenue_analytics/backend/views/README.md @@ -11,7 +11,7 @@ The system follows a builder pattern where: 3. **Orchestrator** coordinates the process and builds concrete view instances 4. 
**Views** are the final HogQL queries registered in the database schema -``` +```text β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Sources │───▢│ Builders │───▢│ Orchestrator │───▢│ View Objects β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ @@ -71,7 +71,7 @@ SUPPORTED_SOURCES: list[ExternalDataSourceType] = [ Create a new directory `sources/chargebee/` with builder modules for each view type: -``` +```text sources/chargebee/ β”œβ”€β”€ __init__.py β”œβ”€β”€ charge.py @@ -226,7 +226,7 @@ The testing system follows a structured approach with dedicated test suites for #### Test Directory Structure -``` +```text sources/test/ β”œβ”€β”€ base.py # Core testing infrastructure β”œβ”€β”€ events/ # Event source tests @@ -260,13 +260,13 @@ sources/test/ **2. Source-Specific Base Tests** - `EventsSourceBaseTest`: Specialized for event-based revenue analytics - - Revenue analytics event configuration helpers - - Team base currency management - - Event clearing and setup utilities + - Revenue analytics event configuration helpers + - Team base currency management + - Event clearing and setup utilities - `StripeSourceBaseTest`: Specialized for Stripe external data sources - - Mock external data source and schema creation - - Stripe-specific test fixtures and helpers - - Currency validation and testing support + - Mock external data source and schema creation + - Stripe-specific test fixtures and helpers + - Currency validation and testing support #### Testing Guidelines for New Sources diff --git a/rust/README.md b/rust/README.md index cbc42738d4..2d9f593b47 100644 --- a/rust/README.md +++ b/rust/README.md @@ -2,7 +2,6 @@ The `posthog/rust` directory serves as PostHog's "Rust monorepo" hosting Rust libraries and service implementations. This is *not* the Rust client library for PostHog. 
- ## Catalog Some selected examples of subprojects homed in the Rust workspace. @@ -49,19 +48,18 @@ Rust based webhook management services. Includes `hook-api`, `hook-common`, `hoo Miscellaneous internal Rust libraries reused by service implementations. - ## Requirements 1. [Rust](https://www.rust-lang.org/tools/install). 2. [Docker](https://docs.docker.com/engine/install/), or [podman](https://podman.io/docs/installation) and [podman-compose](https://github.com/containers/podman-compose#installation): To set up the development stack. Other useful links for those new to Rust: + * [The Rust Programming Language](https://doc.rust-lang.org/book/index.html) * [Cargo manual](https://doc.rust-lang.org/cargo/) * [The "Rustonomicon"](https://doc.rust-lang.org/nomicon/) * [crates.io](https://crates.io/) - ## Local Development Start up and bootstrap the "top-level" `posthog` repo dev environment, including the Docker-Compose support services. Ensure that `bin/migrate` has run and `bin/start` behaves as expected. Leave the Docker services running when developing in the Rust workspace. The `bin/start` processes are typically optional for running Rust tests or the inner dev loop.
**Features:** + - **Strictly Increasing Timestamps**: All events are generated with chronologically ordered timestamps to ensure proper data sequencing - **Timestamp Verification**: Automatically validates that all generated events maintain strict temporal ordering - **Configurable Time Ranges**: Support for custom time windows and historical data generation @@ -116,6 +117,7 @@ Based on the generated data, you should see identify events for: ### Testing Workflow 1. **Generate Test Data:** + ```bash # US cluster (default) with default time range npm run generate diff --git a/rust/capture/docs/llma-capture-implementation-plan.md b/rust/capture/docs/llma-capture-implementation-plan.md index d250a4daac..790169d5e3 100644 --- a/rust/capture/docs/llma-capture-implementation-plan.md +++ b/rust/capture/docs/llma-capture-implementation-plan.md @@ -9,10 +9,12 @@ This document outlines the implementation steps for the LLM Analytics capture pi ### Phase 0: Local Development Setup #### 0.1 Routing Configuration + - [ ] Create new `/ai` endpoint in capture service - [ ] Set up routing for `/ai` endpoint to capture service #### 0.2 End-to-End Integration Tests + - [ ] Implement end-to-end integration tests for the full LLM analytics pipeline - [ ] Create test scenarios with multipart requests and blob data - [ ] Test Kafka message output and S3 storage integration @@ -21,17 +23,20 @@ This document outlines the implementation steps for the LLM Analytics capture pi ### Phase 1: HTTP Endpoint #### 1.1 HTTP Endpoint Foundation + - [ ] Implement multipart/form-data request parsing - [ ] Add server-side boundary validation - [ ] Output events with blob placeholders to Kafka - [ ] Implement error schema #### 1.2 Basic Validation + - [ ] Implement `$ai_` event name prefix validation - [ ] Validate blob part names against event properties - [ ] Prevent blob overwriting of existing properties #### 1.3 Initial Deployment + - [ ] Deploy capture-ai service to production with basic `/ai` 
endpoint - [ ] Test basic multipart parsing and Kafka output functionality - [ ] Verify endpoint responds correctly to AI events @@ -39,6 +44,7 @@ This document outlines the implementation steps for the LLM Analytics capture pi ### Phase 2: Basic S3 Uploads #### 2.1 Simple S3 Upload (per blob) + - [ ] Upload individual blobs to S3 as separate objects - [ ] Generate S3 URLs for blobs (including byte range parameters) - [ ] Store S3 blob metadata @@ -48,6 +54,7 @@ This document outlines the implementation steps for the LLM Analytics capture pi ### Phase 3: S3 Infrastructure & Deployment #### 3.1 S3 Bucket Configuration + - [ ] Set up S3 buckets for dev and production environments - [ ] Set up bucket structure with `llma/` prefix - [ ] Configure S3 lifecycle policies for retention (30d default) @@ -55,6 +62,7 @@ This document outlines the implementation steps for the LLM Analytics capture pi - [ ] Create service accounts with appropriate S3 permissions #### 3.2 Capture S3 Configuration + - [ ] Deploy capture-ai service to dev environment with S3 configuration - [ ] Deploy capture-ai service to production environment with S3 configuration - [ ] Set up IAM roles and permissions for capture-ai service @@ -64,6 +72,7 @@ This document outlines the implementation steps for the LLM Analytics capture pi ### Phase 4: Multipart File Processing #### 4.1 Multipart File Creation + - [ ] Implement multipart/mixed format - [ ] Store metadata within multipart format - [ ] Generate S3 URLs for blobs (including byte range parameters) @@ -71,6 +80,7 @@ This document outlines the implementation steps for the LLM Analytics capture pi ### Phase 5: Authorization #### 5.1 Request Signature Verification + - [ ] Implement PostHog API key authentication - [ ] Add request signature verification - [ ] Validate API key before processing multipart data @@ -80,19 +90,23 @@ This document outlines the implementation steps for the LLM Analytics capture pi ### Phase 6: Operations #### 6.1 Monitoring 
Setup + - [ ] Set up monitoring dashboards for capture-ai #### 6.2 Alerting + - [ ] Configure alerts for S3 upload failures - [ ] Set up alerts for high error rates on `/ai` endpoint - [ ] Set up alerts for high latency on `/ai` endpoint #### 6.3 Runbooks + - [ ] Create runbook for capture-ai S3 connectivity issues ### Phase 7: Compression #### 7.1 Compression Support + - [ ] Parse Content-Encoding headers from SDK requests - [ ] Implement server-side compression for uncompressed text/JSON - [ ] Add compression metadata to multipart files @@ -102,6 +116,7 @@ This document outlines the implementation steps for the LLM Analytics capture pi ### Phase 8: Schema Validation #### 8.1 Schema Validation + - [ ] Create strict schema definitions for each AI event type - [ ] Add schema validation for event payloads - [ ] Validate Content-Type headers on blob parts @@ -110,6 +125,7 @@ This document outlines the implementation steps for the LLM Analytics capture pi ### Phase 9: Limits (Optional) #### 9.1 Request Validation & Limits + - [ ] Add request size limits and validation - [ ] Add request rate limiting per team - [ ] Implement payload size limits per team @@ -117,6 +133,7 @@ This document outlines the implementation steps for the LLM Analytics capture pi ### Phase 10: Data Deletion (Optional) #### 10.1 Data Deletion (Choose One Approach) + - [ ] Option A: S3 expiry (passive) - rely on lifecycle policies - [ ] Option B: S3 delete by prefix functionality - [ ] Option C: Per-team encryption keys diff --git a/rust/capture/docs/llma-capture-overview.md b/rust/capture/docs/llma-capture-overview.md index a49f9ac0bc..100aae92b9 100644 --- a/rust/capture/docs/llma-capture-overview.md +++ b/rust/capture/docs/llma-capture-overview.md @@ -43,6 +43,7 @@ These events can be processed through the regular pipeline without blob storage: ### Future Considerations The event schema is designed to accommodate future multimodal content types, including: + - Images - Audio - Video @@ -89,6 
+90,7 @@ The `/ai` endpoint accepts multipart POST requests with the following structure: #### Request Format **Headers:** + - `Content-Type: multipart/form-data; boundary=` - Standard PostHog authentication headers @@ -109,7 +111,7 @@ The `/ai` endpoint accepts multipart POST requests with the following structure: #### Example Request Structure -``` +```http POST /ai HTTP/1.1 Content-Type: multipart/form-data; boundary=----boundary123 @@ -183,7 +185,7 @@ All blobs for an event are stored as a single multipart file in S3: #### Bucket Structure -``` +```text s3:/// llma/ / @@ -192,7 +194,8 @@ s3:/// ``` With retention prefixes: -``` + +```text s3:/// llma/ / @@ -217,6 +220,7 @@ s3:/// #### Event Property Format Properties contain S3 URLs with byte range parameters: + ```json { "event": "$ai_generation", @@ -238,13 +242,15 @@ Properties contain S3 URLs with byte range parameters: #### Example S3 paths Without retention prefix (default 30 days): -``` + +```text s3://posthog-llm-analytics/llma/123/2024-01-15/event_456_x7y9z.multipart s3://posthog-llm-analytics/llma/456/2024-01-15/event_789_a3b5c.multipart ``` With retention prefixes: -``` + +```text s3://posthog-llm-analytics/llma/30d/123/2024-01-15/event_012_m2n4p.multipart s3://posthog-llm-analytics/llma/90d/456/2024-01-15/event_345_q6r8s.multipart s3://posthog-llm-analytics/llma/1y/789/2024-01-15/event_678_t1u3v.multipart @@ -301,7 +307,8 @@ For uncompressed data received from SDKs: #### Example Headers Compressed blob part from SDK: -``` + +```http Content-Disposition: form-data; name="event.properties.$ai_input"; filename="blob_abc123" Content-Type: application/json Content-Encoding: gzip @@ -375,9 +382,11 @@ Three approaches for handling data deletion requests: - Use S3's delete by prefix functionality to remove all objects for a team - Simple to implement but requires listing and deleting potentially many objects - Example: Delete all data for team 123: - ``` + + ```bash aws s3 rm 
s3://posthog-llm-analytics/llma/123/ --recursive ``` + Or using S3 API to delete objects with prefix `llma/123/` 3. **Per-Team Encryption** @@ -432,14 +441,15 @@ The capture service enforces strict validation on incoming events: ### WarpStream-based Processing **Architecture:** + - Push entire request payloads (including large LLM content) to WarpStream, which supports large messages - A separate service consumes from WarpStream and uploads blobs to S3 - Events are then forwarded to the regular ingestion pipeline **Downsides:** + - WarpStream is less reliable than S3, reducing overall system availability - Additional transfer costs for moving data through WarpStream - Additional processing costs for the intermediate service - No meaningful batching opportunity - the service would upload files to S3 individually, same as direct upload from capture - Adds complexity and another point of failure without significant benefits - diff --git a/rust/capture/docs/llma-integration-test-suite.md b/rust/capture/docs/llma-integration-test-suite.md index 01841c9485..81f2f6a4e7 100644 --- a/rust/capture/docs/llma-integration-test-suite.md +++ b/rust/capture/docs/llma-integration-test-suite.md @@ -40,106 +40,128 @@ This document describes the high-level architecture and test scenarios for the L ### Phase 1: HTTP Endpoint #### Scenario 1.1: Basic Routing + - **Test**: Verify `/ai` endpoint is accessible and returns correct response codes - **Validation**: HTTP 200 for valid requests, proper error codes for invalid requests #### Scenario 1.2: Multipart Parsing + - **Test**: Send multipart requests with various boundary strings and blob configurations - **Validation**: All parts parsed correctly, blob data extracted without corruptionΒ§ - **Variations**: Different boundary formats, multiple blobs, mixed content types #### Scenario 1.3: Boundary Validation + - **Test**: Send requests with malformed boundaries, missing boundaries, boundary collisions - **Validation**: Appropriate error 
responses, no server crashes, proper error logging #### Scenario 1.4: Event Processing Verification + - **Test**: Send multipart request and verify event reaches PostHog query API - **Validation**: Use PostHog query API to fetch processed event, verify blob placeholders correctly inserted #### Scenario 1.5: Basic Validation + - **Test**: Send events with invalid names (not starting with `$ai_`), duplicate blob properties - **Validation**: Invalid events rejected, valid events processed, proper error messages ### Phase 2: Basic S3 Uploads #### Scenario 2.1: Individual Blob Upload + - **Test**: Upload blobs of various sizes as separate S3 objects - **Validation**: Verify each blob stored correctly, S3 URLs generated in event properties - **Variations**: Small/medium/large blobs, different content types #### Scenario 2.2: S3 URL Generation and Access + - **Test**: Verify generated S3 URLs in PostHog events point to accessible objects - **Validation**: Query PostHog API for events, extract S3 URLs, verify blobs retrievable from S3 #### Scenario 2.3: Blob Metadata Storage + - **Test**: Verify S3 object metadata is stored correctly - **Validation**: Use S3 client to inspect object metadata - Content-Type, size, team_id present #### Scenario 2.4: Team Data Isolation + - **Test**: Multiple teams uploading simultaneously - **Validation**: Verify S3 key prefixes are team-scoped, no cross-team data access, proper S3 path isolation ### Phase 3: S3 Infrastructure & Deployment #### Scenario 3.1: S3 Bucket Configuration + - **Test**: Verify S3 bucket structure and lifecycle policies - **Validation**: Use S3 client to verify correct `llma/` prefix structure, retention policies configured ### Phase 4: Multipart File Processing #### Scenario 4.1: Multipart File Creation + - **Test**: Upload events with multiple blobs, verify multipart/mixed format - **Validation**: Use S3 client to verify single S3 file contains all blobs, proper MIME boundaries, metadata preserved - **Variations**: 
2-10 blobs per event, mixed content types, different blob sizes #### Scenario 4.2: Byte Range URLs and Access + - **Test**: Verify S3 URLs in PostHog events include correct byte range parameters - **Validation**: Query PostHog API for events, verify URLs contain range parameters, use S3 client to test range requests #### Scenario 4.3: Content Type Handling + - **Test**: Mix of JSON, text, and binary blobs in single multipart file - **Validation**: Content types preserved in multipart format, correctly parsed ### Phase 5: Authorization #### Scenario 5.1: API Key Authentication + - **Test**: Send requests with valid/invalid/missing API keys - **Validation**: Valid keys accepted, invalid keys rejected with 401, proper error messages #### Scenario 5.2: Request Signature Verification + - **Test**: Test signature validation for various request formats - **Validation**: Valid signatures accepted, invalid signatures rejected #### Scenario 5.3: Pre-processing Authentication + - **Test**: Verify authentication occurs before multipart parsing - **Validation**: Invalid auth rejected immediately, no resource consumption for unauthorized requests ### Phase 7: Compression #### Scenario 7.1: Client-side Compression + - **Test**: Send pre-compressed blobs with `Content-Encoding: gzip` - **Validation**: Compressed blobs stored correctly, decompression works for retrieval #### Scenario 7.2: Server-side Compression + - **Test**: Send uncompressed JSON/text blobs - **Validation**: Server compresses before S3 storage, compression metadata preserved #### Scenario 7.3: Mixed Compression + - **Test**: Single request with both compressed and uncompressed blobs - **Validation**: Each blob handled according to its compression state ### Phase 8: Schema Validation #### Scenario 8.1: Event Schema Validation + - **Test**: Send events conforming to and violating strict schemas for each AI event type - **Validation**: Valid events accepted, invalid events rejected with detailed error messages - 
**Variations**: Missing required fields, extra properties, wrong data types #### Scenario 8.2: Content-Type Validation + - **Test**: Send blobs with various Content-Type headers - **Validation**: Supported types accepted, unsupported types handled according to policy #### Scenario 8.3: Content-Length Validation + - **Test**: Mismatched Content-Length headers and actual blob sizes - **Validation**: Mismatches detected and handled appropriately @@ -154,14 +176,17 @@ This document describes the high-level architecture and test scenarios for the L ### Edge Case Scenarios #### Scenario E.1: Malformed Requests + - **Test**: Invalid JSON, corrupted multipart data, missing required headers - **Validation**: Graceful error handling, no server crashes, proper error responses #### Scenario E.2: S3 Service Interruption + - **Test**: Simulate S3 unavailability during uploads - **Validation**: Proper error responses, retry logic works, no data loss #### Scenario E.3: Kafka Unavailability + - **Test**: Simulate Kafka unavailability during event publishing - **Validation**: Appropriate error handling, request failure communicated to client @@ -172,6 +197,7 @@ This document describes the high-level architecture and test scenarios for the L The integration test suite will be implemented in Rust to align with the capture service's existing toolchain and avoid introducing additional dependencies. 
#### Test Structure + - **Location**: `tests/integration/llma/` directory within the capture service codebase - **Framework**: Standard Rust testing framework with `tokio-test` for async operations - **Dependencies**: @@ -181,7 +207,8 @@ The integration test suite will be implemented in Rust to align with the capture - `multipart` for constructing test requests #### Test Organization -``` + +```text tests/ └── integration/ └── llma/ @@ -199,12 +226,14 @@ tests/ ### Local Test Environment Setup #### Prerequisites + - **Local PostHog Instance**: Full PostHog deployment running locally - **Local S3 Storage**: S3-compatible storage (configured via PostHog local setup) - **Capture Service**: Running with `/ai` endpoint enabled - **Test Configuration**: Environment variables for service endpoints and credentials #### Environment Configuration + ```bash # PostHog Local Instance export POSTHOG_HOST="http://localhost:8000" @@ -226,6 +255,7 @@ export LLMA_TEST_MODE="local" ### Test Execution #### Running Tests + ```bash # Run all LLMA integration tests cargo test --test llma_integration @@ -244,6 +274,7 @@ cargo test --test llma_integration -- --test-threads=1 #### Test Utilities Each test phase will include common utilities for: + - **Multipart Request Builder**: Construct multipart/form-data requests with event JSON and blob parts - **S3 Client Wrapper**: Direct S3 operations for validation and cleanup - **PostHog API Client**: Query PostHog API to verify event processing @@ -251,12 +282,14 @@ Each test phase will include common utilities for: - **Cleanup Helpers**: Remove test data from S3 and PostHog between test runs #### Test Data Management + - **Isolated Test Teams**: Each test uses unique team IDs to prevent interference - **Cleanup Between Tests**: Automatic cleanup of S3 objects and PostHog test data - **Fixture Data**: Predefined multipart requests and blob data for consistent testing - **Random Data Generation**: Configurable blob sizes and content for stress 
testing ## Phase Gating + - **Mandatory Testing**: All integration tests for a phase must pass before proceeding to implementation of the next phase - **Regression Prevention**: Previous phase tests continue to run to ensure no regression - **Incremental Validation**: Each phase builds upon validated functionality from previous phases @@ -270,11 +303,13 @@ For validating the LLM Analytics capture pipeline in production environments, th ### Configuration Requirements #### PostHog Credentials + - **Project API Key**: PostHog project private API key for authentication - **PostHog URL**: PostHog instance URL (cloud or self-hosted) - **Project ID**: PostHog project identifier for query API access #### AWS S3 Credentials + - **AWS Access Key ID**: Limited IAM user with read-only S3 access - **AWS Secret Access Key**: Corresponding secret key - **S3 Bucket Name**: Production S3 bucket name @@ -316,6 +351,7 @@ A separate script (`generate-s3-test-keys.sh`) will be implemented to generate l ### Production Test Configuration #### Environment Variables + ```bash # PostHog Configuration export POSTHOG_PROJECT_API_KEY="your_posthog_api_key" @@ -336,12 +372,14 @@ export LLMA_TEST_MODE="production" ### Production Test Execution #### Safety Measures + - **Read-Only Operations**: Production tests only read data, never write or modify - **Team Isolation**: Tests only access data for the specified team ID - **Rate Limiting**: Production tests include delays to avoid overwhelming services - **Data Validation**: Verify S3 objects exist and are accessible without downloading large payloads #### Usage Example + ```bash # Generate S3 test credentials (script to be implemented) ./generate-s3-test-keys.sh 123 posthog-llm-analytics diff --git a/rust/common/alloc/README.md b/rust/common/alloc/README.md index f35e8a6437..506e5b234d 100644 --- a/rust/common/alloc/README.md +++ b/rust/common/alloc/README.md @@ -1,6 +1,7 @@ # What is this? 
We use jemalloc everywhere we can, for any binary that we expect to run in a long-lived process. The reason for this is that our workloads are: + - multi-threaded - extremely prone to memory fragmentation (due to our heavy use of `serde_json`, or json generally) @@ -9,4 +10,5 @@ jemalloc helps reduce memory fragmentation hugely, to the point of solving produ At time of writing (2024-09-04), rust workspaces don't have good support for specifying dependencies on a per-target basis, so this crate does the work of pulling in jemalloc only when compiling for supported targets, and then exposes a simple macro to use jemalloc as the global allocator. Anyone writing a binary crate should put this macro at the top of their `main.rs`. Libraries should not make use of this crate. ## Future work -Functions could be added to this crate to, in situations where jemalloc is in use, report a set of metrics about the allocator, as well as other functionality (health/liveness, a way to specify hooks to execute when memory usage exceeds a certain threshold, etc). Right now, it's prety barebones. \ No newline at end of file + +Functions could be added to this crate to, in situations where jemalloc is in use, report a set of metrics about the allocator, as well as other functionality (health/liveness, a way to specify hooks to execute when memory usage exceeds a certain threshold, etc). Right now, it's pretty barebones.
diff --git a/rust/common/metrics/README.md b/rust/common/metrics/README.md index 4788321ecd..60e1f82131 100644 --- a/rust/common/metrics/README.md +++ b/rust/common/metrics/README.md @@ -1 +1 @@ -Ripped from rusty-hook, since it'll be used across more or less all cyclotron stuff, as well as rustyhook \ No newline at end of file +Ripped from rusty-hook, since it'll be used across more or less all cyclotron stuff, as well as rustyhook diff --git a/rust/cymbal/README.md b/rust/cymbal/README.md index 3f09527c1b..e2b7c62ccb 100644 --- a/rust/cymbal/README.md +++ b/rust/cymbal/README.md @@ -2,9 +2,10 @@ You throw 'em, we catch 'em. - ### Terms + We use a lot of terms in this and other error tracking code, with implied meanings. Here are some of them: + - **Issue**: A group of errors, representing, ideally, one bug. - **Error**: An event capable of producing an error fingerprint, letting it be grouped into an issue. May or may not have one or more stack traces. - **Fingerprint**: A unique identifier for a class of errors. Generated based on the error type and message, and the stack if we have one (with or without raw frames). Notably, multiple fingerprints might be 1 error, because e.g. our ability to process stack frames (based on available symbol sets) changes over time, or our fingerprinting heuristics get better. We do not encode this "class of errors" notion anywhere - it's just important to remember an "issue" might group multiple fingerprints that all have the same "unprocessed" stack trace, but different "processed" ones, or even just ones that were received at different times. diff --git a/rust/hook-common/README.md b/rust/hook-common/README.md index d277a6c860..65b4f13ba8 100644 --- a/rust/hook-common/README.md +++ b/rust/hook-common/README.md @@ -1,2 +1,3 @@ # hook-common + Library of common utilities used by rusty-hook.
diff --git a/rust/hook-worker/README.md b/rust/hook-worker/README.md index 9b1884aab1..f21e5a8e8c 100644 --- a/rust/hook-worker/README.md +++ b/rust/hook-worker/README.md @@ -1,2 +1,3 @@ # hook-worker + Consume and process webhook jobs diff --git a/rust/kafka-deduplicator/README.md b/rust/kafka-deduplicator/README.md index 6f478b26cc..36788e2dc3 100644 --- a/rust/kafka-deduplicator/README.md +++ b/rust/kafka-deduplicator/README.md @@ -25,7 +25,7 @@ The Kafka consumer is built as a stateful, partition-aware consumer that maintai When partitions are assigned or revoked (during rebalancing): 1. **Partition Assignment**: Creates a new RocksDB store for each assigned partition at path: `{base_path}/{topic}_{partition}/` -2. **Partition Revocation**: +2. **Partition Revocation**: - Marks the partition as "fenced" to reject new messages - Waits for in-flight messages to complete - Cleanly closes the RocksDB store @@ -42,6 +42,7 @@ When partitions are assigned or revoked (during rebalancing): ## Deduplication Strategy Events are deduplicated based on a **composite key**: + - Format: `timestamp:distinct_id:token:event_name` - Two events with the same composite key are considered duplicates - UUID is used only for Kafka partitioning, not deduplication @@ -55,7 +56,7 @@ The service includes a comprehensive checkpoint system for backup, recovery, and - **Periodic snapshots**: Creates RocksDB checkpoints at configurable intervals (default: 5 minutes) - **Point-in-time consistency**: Checkpoints capture the complete deduplication state at a specific moment - **Multi-tier storage**: Local checkpoints for fast recovery, S3 uploads for durability and scaling -- **Incremental vs Full uploads**: +- **Incremental vs Full uploads**: - **Incremental**: Upload only changed SST files since last checkpoint - **Full**: Upload complete checkpoint (every N incremental uploads, default: 10) diff --git a/rust/log-capture/README.md b/rust/log-capture/README.md index 449dd7b9d2..5c4f761086 100644 
--- a/rust/log-capture/README.md +++ b/rust/log-capture/README.md @@ -24,7 +24,7 @@ The service is configured using environment variables: Clients must authenticate by sending a valid JWT token in the Authorization header: -``` +```http Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0ZWFtX2lkIjoiMTIzNDU2Nzg5MCJ9.czOuiHUzSl8s9aJiPghhkGZP-WxI7K-I85XNY-bXRSQ ```