130 Commits

Author SHA1 Message Date
CanbiZ (MickLesk) 46c0abafc1 Update service.go 2026-03-03 13:55:54 +01:00
CanbiZ (MickLesk) d6b4349b9c Add preflight category styling and mappings
Introduce a new 'preflight' category and wire it through the UI and service logic. Updates:
- public/static/css/error-analysis.css: add .category-badge styles (preflight, proxmox, service, database, shell, build) to provide consistent color badges.
- public/static/js/error-analysis.js: add 'preflight' color to catColors so the UI can render the new category.
- service.go: register 'preflight' as a known category and reclassify exit codes 103–108 from various categories to 'preflight' (these are validation/initial checks).

This groups validation-related errors under a single category for clearer display and handling in the frontend.
2026-03-03 13:23:33 +01:00
CanbiZ (MickLesk) fd4a11c727 fix: deduplicate installing POSTs to prevent retry-induced duplicate records
When the client retries post_to_api() after a curl timeout (5s), the server
previously created a new record for every retry since installing status always
called CreateTelemetry blindly. This inflated record counts (~3x) and left
orphan records that get cleaned up as 'unknown' after 4h.

Now UpsertTelemetry checks FindRecordByExecutionID before creating. If a record
with the same execution_id already exists, the retry is treated as idempotent
and returns success without creating a duplicate.
2026-03-02 16:35:29 +01:00
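The deduplication described in fd4a11c727 can be sketched as follows. This is a minimal illustration, not the service's code: the in-memory `store` type and `upsertInstalling` are hypothetical stand-ins for the PocketBase-backed `UpsertTelemetry` / `FindRecordByExecutionID` pair named in the commit.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical in-memory stand-in for the record collection.
type store struct {
	mu      sync.Mutex
	records map[string]string // execution_id -> status
	creates int               // number of records actually created
}

func newStore() *store { return &store{records: make(map[string]string)} }

// upsertInstalling creates a record only when no record with the same
// execution_id exists. A retry of the same POST is treated as idempotent:
// it reports created=false and leaves the store unchanged.
func (s *store) upsertInstalling(executionID string) (created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.records[executionID]; exists {
		return false
	}
	s.records[executionID] = "installing"
	s.creates++
	return true
}

func main() {
	s := newStore()
	// Simulate a client that retries the same POST three times.
	for i := 0; i < 3; i++ {
		fmt.Println("created:", s.upsertInstalling("exec-123"))
	}
	fmt.Println("records created:", s.creates)
}
```

Holding the lock across the check and the create is what makes the retry safe; against a real database the same guarantee would normally come from a unique index on execution_id.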
CanbiZ (MickLesk) 7020cb0edd fix: add missing VM os_types, admin auth fallback for migration, enhanced JSON error diagnostics
- allowedOsType: add homeassistant, opnsense, openwrt, mikrotik, umbrel-os,
  pimox-haos, owncloud, turnkey-nextcloud, arch-linux (fixes invalid os_type
  rejections for VM scripts)
- EnsurePipelineField: if configured auth gets 403, auto-fallback to
  _superusers and _admins admin auth endpoints (fixes migration 403)
- JSON decode errors: increase body snippet from 200 to 2000 chars, log
  byte offset from json.SyntaxError for precise diagnosis
2026-03-02 16:24:06 +01:00
CanbiZ (MickLesk) d0ba3935e7 fix: auto-trust private IPs as proxy + increase rate limit defaults
Root cause of massive data loss: all traffic arrives from reverse proxy
10.0.1.7, but TRUSTED_PROXIES_CIDR was not set, so rate limiter saw ALL
requests as one IP → 60 RPM shared by all users → most requests rejected.

Fixes:
- Auto-trust RFC 1918/6598 private IPs (10.x, 172.16-31.x, 192.168.x,
  100.64-127.x) and loopback as reverse proxy sources for X-Forwarded-For
- This means Docker/K8s/Caddy/Nginx setups work out of the box without
  needing to configure TRUSTED_PROXIES_CIDR
- Increase default rate limit from 60→300 RPM, burst from 20→60
  (telemetry payloads are tiny, no reason for aggressive limiting)
2026-03-02 15:58:19 +01:00
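A rough sketch of the auto-trust logic from d0ba3935e7, using the standard `net/netip` package. The function names `isTrustedProxy` and `clientIP` are assumptions for illustration, and `remoteAddr` is assumed to be a bare IP with the port already stripped.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// Ranges treated as reverse-proxy sources: RFC 1918, RFC 6598, loopback.
var trustedPrefixes = []netip.Prefix{
	netip.MustParsePrefix("10.0.0.0/8"),
	netip.MustParsePrefix("172.16.0.0/12"),
	netip.MustParsePrefix("192.168.0.0/16"),
	netip.MustParsePrefix("100.64.0.0/10"),
	netip.MustParsePrefix("127.0.0.0/8"),
	netip.MustParsePrefix("::1/128"),
}

func isTrustedProxy(addr netip.Addr) bool {
	addr = addr.Unmap() // treat ::ffff:10.0.0.1 like 10.0.0.1
	for _, p := range trustedPrefixes {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// clientIP returns the address the rate limiter should key on.
// X-Forwarded-For is honoured only when the direct peer is trusted;
// the header is walked right to left and the first untrusted hop wins.
func clientIP(remoteAddr, xForwardedFor string) string {
	peer, err := netip.ParseAddr(remoteAddr)
	if err != nil || !isTrustedProxy(peer) {
		return remoteAddr
	}
	hops := strings.Split(xForwardedFor, ",")
	for i := len(hops) - 1; i >= 0; i-- {
		hop, err := netip.ParseAddr(strings.TrimSpace(hops[i]))
		if err != nil {
			break // malformed or empty header: fall back to the peer
		}
		if !isTrustedProxy(hop) {
			return hop.String()
		}
	}
	return remoteAddr
}

func main() {
	fmt.Println(clientIP("10.0.1.7", "203.0.113.9"))     // proxy is trusted
	fmt.Println(clientIP("198.51.100.4", "203.0.113.9")) // header ignored
}
```

Without this step every request is keyed on the proxy address (10.0.1.7 in the commit), which is how all users ended up sharing one rate-limit bucket.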
CanbiZ (MickLesk) 141af284e0 fix: prevent data loss from silent rejections and missing records
Root cause: ~50% of installing records were never created, causing all
subsequent configuring/validation/success/failed updates to be lost.

Server-side fixes:
- Add logging for ALL rejection points (rate limit, body too large, JSON
  parse errors, validation failures) - previously completely silent
- configuring/validation pings now create a fallback record instead of
  silently skipping when the installing record doesn't exist
- Terminal status updates (success/failed/aborted) now use full PATCH
  with all fields to fill in missing specs from fallback records
- New UpdateTelemetryFull method for complete record updates

The configuring/validation fallback creates a record with minimal data.
When the final success/failed update arrives with full specs, the full
PATCH fills in all missing fields (ct_type, disk_size, etc). Fields with
omitempty+zero values are omitted from JSON, preserving existing values.
2026-03-02 15:42:58 +01:00
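The omitempty behaviour that 141af284e0 relies on can be seen in isolation. The `statusUpdate` struct below is a hypothetical, trimmed-down payload; only the field names ct_type and disk_size come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statusUpdate is a hypothetical, reduced PATCH payload.
type statusUpdate struct {
	Status   string `json:"status"`
	CTType   int    `json:"ct_type,omitempty"`
	DiskSize int    `json:"disk_size,omitempty"`
}

// patchBody serialises the update. Zero-valued omitempty fields are
// left out of the JSON, so a PATCH built from it cannot overwrite
// values already stored for those fields.
func patchBody(u statusUpdate) string {
	b, err := json.Marshal(u)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(patchBody(statusUpdate{Status: "success", DiskSize: 8}))
	fmt.Println(patchBody(statusUpdate{Status: "failed"}))
}
```

One consequence of this design is that a legitimate zero value cannot be written through such a field, since it is indistinguishable from "not provided".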
CanbiZ (MickLesk) d4d9cafc97 fix: pipeline migration now supports PB v0.22+ (fields vs schema)
The EnsurePipelineField migration was using PocketBase's legacy 'schema' key
which was renamed to 'fields' in PB v0.22+. Since the install script fetches
the latest PB release, the migration silently failed: it read an empty schema
array, PATCHed with the wrong key, and the pipeline field was never created.
All pipeline data sent to PB was silently dropped.

Changes:
- Auto-detect schema/fields key from PB collection response
- Use correct key and field format for both PB <0.22 and >=0.22
- Add retry logic (3 attempts with backoff) for startup race conditions
- Add diagnostic logging when pipeline steps are tracked
- Log which API format (schema vs fields) is being used
2026-03-02 15:27:23 +01:00
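One way to auto-detect the key, as d4d9cafc97 describes, is to decode only the top level of the collection response and check which key is present. `detectFieldsKey` is a hypothetical helper, and the sample JSON in `main` is invented for illustration rather than copied from a PocketBase response.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// detectFieldsKey reports which top-level key of a collection response
// holds the field definitions: the newer "fields" or the legacy "schema".
func detectFieldsKey(collectionJSON []byte) (string, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(collectionJSON, &top); err != nil {
		return "", fmt.Errorf("decode collection: %w", err)
	}
	if _, ok := top["fields"]; ok {
		return "fields", nil
	}
	if _, ok := top["schema"]; ok {
		return "schema", nil
	}
	return "", errors.New("collection has neither 'fields' nor 'schema'")
}

func main() {
	for _, body := range []string{
		`{"name":"telemetry","fields":[]}`,
		`{"name":"telemetry","schema":[]}`,
	} {
		key, err := detectFieldsKey([]byte(body))
		fmt.Println(key, err)
	}
}
```

Returning an error when neither key is present avoids the silent failure the commit describes, where an empty array was read and the wrong key was PATCHed.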
CanbiZ (MickLesk) 89493fdf12 refactor: redesign detail modal with header strip, parsed error traces, pipeline
- Replace General Information section with compact header strip showing
  status badge, exit code, type, duration, and created time at a glance
- Pipeline section now always visible (shows 'not available' placeholder
  when no data instead of being hidden entirely)
- Error section redesigned: parses structured error format
  (exit_code=N | desc\n---\nlog) into separate category badge,
  explanation text, and scrollable error trace with copy button
- Also handles container pipe-delimited fallback format (|---| separator)
- Moved IDs and timestamps to collapsed Metadata section at bottom
- Removed duplicate CSS blocks (second copy of detail styles cleaned up)
- Added new CSS: header-strip, error-category-badge, error-trace-box,
  btn-copy-trace, error-trace-empty, pipeline-empty with dashed border
2026-03-02 14:37:48 +01:00
CanbiZ (MickLesk) 1bc83683a1 feat: add installation pipeline tracking (status history)
Track the progression of each installation through its phases:
installing → validation → configuring → success/failed/aborted

Server-side changes (service.go):
- Add PipelineStep type and in-memory pipelineTracker (sync.Map)
- Track each status transition with timestamp in trackPipelineStep()
- Add Pipeline (json.RawMessage) field to TelemetryOut and TelemetryStatusUpdate
- Include pipeline data in every PocketBase create/update
- Auto-cleanup tracker entries on terminal status (success/failed/aborted)
- Add EnsurePipelineField() auto-migration: creates 'pipeline' JSON field
  in PocketBase collection on startup if missing

Frontend changes (dashboard.js + dashboard.css):
- Add renderPipeline() function with visual step indicator
- Show pipeline section in record detail modal (between General Info and
  System Resources)
- Each step shows icon (checkmark/cross/pause), label, and timestamp
- Color-coded: green for passed phases, red for failed, purple for aborted
- Handles both JSON array (native) and string-wrapped JSON (fallback)
- Graceful degradation: no pipeline shown for records without data
2026-03-02 14:12:05 +01:00
CanbiZ (MickLesk) a0a17a2e17 refactor: consolidate exit codes into single source of truth, add exit code column to dashboard
- Replace duplicate exitCodeCategories + exitCodeDescriptions maps in service.go
  with unified exitCodeInfo map (single struct per code: Desc + Category)
- Add helper functions getExitCodeDescription() and getExitCodeCategory()
- Add all missing exit codes: 103-123 (validation/setup), 150-154 (systemd),
  160-162 (Python), 170-193 (databases), 200-231 (Proxmox), 232-238 (tools),
  239-249 (Node.js), 250-254 (app install/update), BSD sysexits (64-78)
- Replace ~300-line switch statement in dashboard.go with 3-line lookup
- Add 'Exit Code' column to Installation Log table (badge for failed/aborted)
- Add new error category 'build' to allowedErrorCategory
- Add missing category colors in error-analysis.js (service, database, proxmox, shell, build)
- Net reduction: ~148 lines of duplicated code removed
2026-03-02 13:42:42 +01:00
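The single-source-of-truth layout from a0a17a2e17 might look roughly like this. The names `exitCodeInfo`, `getExitCodeDescription`, and `getExitCodeCategory` come from the commit; the entries and description strings shown are illustrative placeholders, not the real table.

```go
package main

import "fmt"

// exitInfo holds everything known about one exit code.
type exitInfo struct {
	Desc     string
	Category string
}

// exitCodeInfo is the only place codes are defined.
// The descriptions below are placeholders for illustration.
var exitCodeInfo = map[int]exitInfo{
	1:   {Desc: "General error", Category: "shell"},
	2:   {Desc: "Shell builtin misuse", Category: "shell"},
	129: {Desc: "Terminated by SIGHUP", Category: "user_aborted"},
}

func getExitCodeDescription(code int) string {
	if info, ok := exitCodeInfo[code]; ok {
		return info.Desc
	}
	return fmt.Sprintf("Unknown error (exit code %d)", code)
}

func getExitCodeCategory(code int) string {
	if info, ok := exitCodeInfo[code]; ok {
		return info.Category
	}
	return "unknown"
}

func main() {
	for _, code := range []int{1, 129, 9999} {
		fmt.Println(code, getExitCodeCategory(code), getExitCodeDescription(code))
	}
}
```

Keeping description and category in one struct means a new code cannot be added to one map and forgotten in the other, which is the duplication the commit removes.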
CanbiZ (MickLesk) 56192f09bd Support "validation" status with progress pings
Treat the new "validation" status like an in-progress ping similar to "configuring": add "validation" to status counts in dashboard and PBClient analysis, include it in stuck/installing logic, and prevent creating ghost records in UpsertTelemetry by skipping new record creation when payload.Status is "validation" or "configuring". Also add "validation" to allowedStatus and improve the warning log to include the actual status.
2026-02-23 16:59:38 +01:00
CanbiZ (MickLesk) 7f5c65c4c7 fix: prevent ghost records from 'configuring' progress pings
When the container sends 'configuring' before the host's 'installing'
record was written (race condition or curl timeout), UpsertTelemetry
created a new record with ct_type=0, all-zero hardware values, and
repo_source=N/A. These ghost records were then hidden by the dashboard
(which filters repo_source='ProxmoxVE').

Changes:
- Skip record creation for 'configuring' status when no existing record
  is found (log warning instead of creating ghost)
- Add RepoSource to TelemetryStatusUpdate so PATCH updates carry
  repo_source through to PocketBase
- Final states (failed/success) still create fallback records since
  post_update_to_api sends full payloads with all fields
2026-02-23 16:45:04 +01:00
CanbiZ (MickLesk) 9c92479710 fix: add 'shell' to allowedErrorCategory
Exit codes 1/2 are categorized as 'shell' on the bash side but the Go
allowlist was missing it, causing fallback to 'unknown' for every general
shell error.
2026-02-23 16:30:48 +01:00
CanbiZ (MickLesk) 1420a104f9 Anonymize IPv4 and relax telemetry enums
Introduce IPv4 anonymization for telemetry logs by adding a regexp import and sanitizeIPs (ipv4Re) that masks the last three octets (keeps first octet as x.x) and apply it to TelemetryIn.Error after sanitizeMultiLine for GDPR compliance. Add "tool" to allowedType. Change enum validation behavior to log warnings and fall back to safe "unknown" defaults for gpu_vendor, gpu_passthrough, cpu_vendor, and error_category instead of rejecting writes; type validation still returns an error but now logs the rejection. Improves privacy and robustness of telemetry ingestion.
2026-02-23 14:24:48 +01:00
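A possible shape for the IPv4 masking in 1420a104f9. The names `sanitizeIPs` and `ipv4Re` come from the commit; the exact pattern and the output format (first octet kept, remaining three replaced by "x") are assumptions based on the description.

```go
package main

import (
	"fmt"
	"regexp"
)

// ipv4Re matches dotted-quad sequences and captures the first octet.
// It does not validate octet ranges, so strings such as four-part
// version numbers are masked as well.
var ipv4Re = regexp.MustCompile(`\b(\d{1,3})\.\d{1,3}\.\d{1,3}\.\d{1,3}\b`)

// sanitizeIPs keeps the first octet and masks the last three.
func sanitizeIPs(s string) string {
	return ipv4Re.ReplaceAllString(s, "${1}.x.x.x")
}

func main() {
	fmt.Println(sanitizeIPs("curl: (7) Failed to connect to 192.168.1.50 port 443"))
}
```

Over-matching is the safer failure mode for a privacy filter: masking a version string loses little, while missing an address leaks it.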
CanbiZ (MickLesk) 09a57824fd Add install log viewing and increase payload size
Expose full installation logs in error analysis and UI, and allow larger payloads. Includes: add InstallLog to error records and propagate it to dashboard responses; render a collapsible full-installation-log panel in recent errors with a toggle and log button in public/static/js/error-analysis.js; increase allowed error/log size (sanitizeMultiLine -> 128KB) and MaxBodyBytes -> 256KB to carry full logs; adjust timeouts/refresh TTLs for larger data sets and make various minor formatting/whitespace cleanups across dashboard.go and service.go. These changes improve debugging by preserving and surfacing install logs while tuning server limits and timeouts to accommodate larger payloads.
2026-02-23 14:16:23 +01:00
CanbiZ (MickLesk) 6337e1742b Revert "Simplify dashboard: remove analysis and add HTML"
This reverts commit 77beab2008.
2026-02-18 14:02:05 +01:00
CanbiZ (MickLesk) fc9ed9f45f Revert "Drop 'configuring' status; embed public assets"
This reverts commit 871b97501e.
2026-02-18 14:01:56 +01:00
CanbiZ (MickLesk) 871b97501e Drop 'configuring' status; embed public assets
Remove the deprecated "configuring" status across telemetry handling and validation: it is no longer treated as a separate state or included in queries/allowedStatus. Simplify UpsertTelemetry to create on "installing" and update otherwise, with a fallback create when no existing record is found. Add an embedded public filesystem (go:embed public) with serveHTMLFile helper and route handlers for /error-analysis, /script-analysis and serving /static/ assets. Also tighten defaults and schedules: reduce MaxBodyBytes (16384 -> 1024) and increase cleanup intervals (CHECK_INTERVAL 15m -> 60m, STUCK_AFTER 2h -> 24h). These changes consolidate status handling and enable serving bundled frontend assets from the binary.
2026-02-18 13:59:10 +01:00
CanbiZ (MickLesk) 77beab2008 Simplify dashboard: remove analysis and add HTML
Remove large script/error analysis plumbing and persistent script-stats code from dashboard.go (types, aggregation, PB stores and helper functions). Clean up imports (remove log and sync), tighten auto-reclassification logic (drop SIGHUP/exit 129 branch, only handle SIGINT/ctrl-c/exit 130), and change tool tracking from type "pve" to "tool". Add DashboardHTML() which returns the embedded dashboard HTML/CSS/JS. These changes simplify the dashboard codepath and embed the frontend HTML, but remove the previous detailed analysis and persistent stats machinery that other callers may have relied on.
2026-02-18 13:14:03 +01:00
CanbiZ (MickLesk) bbf817748c Treat SIGHUP (exit 129) as aborted
Add handling for SIGHUP/exit code 129 across the codebase so records from clients that reported status="failed" due to SIGHUP are reclassified as aborted. Updates include: string-matching for "sighup" alongside SIGINT/Ctrl+C in reclassification logic (dashboard, analytics, ingestion), PocketBase query filters to include/exclude SIGHUP and exit_code=129, and changing exit code 129's category/description to user_aborted (with an expanded text mentioning hangup). This unifies legacy SIGINT/SIGHUP behavior and improves accuracy of "aborted" vs "failed" classification.
2026-02-18 08:38:41 +01:00
CanbiZ (MickLesk) 171a2893a8 chore: add pull request template
- Type of Change checkboxes (bug, feature, breaking, dashboard, refactor, docs, security/privacy)
- Privacy prerequisite check (no new PII without PRIVACY.md update)
- Links to privacy documentation
2026-02-17 17:17:20 +01:00
CanbiZ (MickLesk) bbf391895b Merge branch 'main' of https://github.com/community-scripts/telemetry-service 2026-02-17 17:15:02 +01:00
CanbiZ (MickLesk) ddf1fb5434 chore: add GitHub issue templates (bug, privacy, feature request)
- Bug Report: with component dropdown (dashboard, API, data, opt-in)
- Privacy Concern: with concern type dropdown, links to PRIVACY/ROPA/TOMS docs
- Feature Request: with area dropdown (dashboard, API, alerting)
- config.yml: disable blank issues, add contact links (dashboard, privacy docs, discord)
- Blank issues disabled to ensure structured reports
2026-02-17 17:14:38 +01:00
CanbiZ (MickLesk) dcf8d17e9e Merge pull request #3 from community-scripts/feature/execution-id
Feature/execution
2026-02-17 17:11:41 +01:00
CanbiZ (MickLesk) 345c5d8180 docs: add comprehensive PRIVACY.md, link from README
- New docs/PRIVACY.md with full telemetry documentation:
  consent model, what's collected (per type), what's NOT collected,
  data processing, GDPR legal basis, how to opt out
- README: link to PRIVACY.md, ROPA.md, TOMS.md in Privacy section
2026-02-17 17:10:43 +01:00
CanbiZ (MickLesk) 5dbcabdd94 fix: rename type 'tool' to 'pve' for PVE scripts, update allowedType
- Change allowedType from 'tool' to 'pve' in validation
- Update dashboard tracking to match type='pve'
- Update TelemetryIn type comment
2026-02-17 16:55:53 +01:00
CanbiZ (MickLesk) cb5017a739 feat: add execution_id field for unique record identification
- Add ExecutionID to TelemetryIn, TelemetryOut, TelemetryStatusUpdate structs
- Add FindRecordByExecutionID() for O(1) unique-index lookups
- Update UpsertTelemetry to prefer execution_id lookup with random_id fallback
- Add execution_id sanitization in validate()
- Map execution_id in handler (in→out)
2026-02-17 16:06:06 +01:00
CanbiZ (MickLesk) b465a4c211 fix: URL-encode filter in FindRecordByRandomID, reduce stuck timeout to 1h
- FindRecordByRandomID now uses url.QueryEscape for PocketBase filter
- CLEANUP_STUCK_HOURS default lowered from 2 to 1 for faster recovery
- Add go-sqlite3 dependency (for bootstrap tooling)
2026-02-17 15:59:39 +01:00
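The URL-encoding fix in b465a4c211 comes down to escaping the whole filter expression before it goes into the query string. `filterQuery` below is a hypothetical helper; the filter syntax mirrors the `repo_source='ProxmoxVE'` style used elsewhere in this log.

```go
package main

import (
	"fmt"
	"net/url"
)

// filterQuery builds the query-string fragment for a record lookup.
// Without escaping, characters such as '=', quotes and spaces in the
// filter would corrupt the URL.
func filterQuery(randomID string) string {
	filter := fmt.Sprintf("random_id='%s'", randomID)
	return "filter=" + url.QueryEscape(filter)
}

func main() {
	fmt.Println(filterQuery("abc 123"))
}
```

Note that this only protects the URL; a quote inside `randomID` would still end the filter string early, so the value itself should be validated before it is interpolated.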
CanbiZ (MickLesk) 97f9f8d3cc Initialize empty stats store on bootstrap failure
If statsAllTime.Bootstrap fails, log the error but continue by initializing an empty stats map with a "_marker" entry set to today's date. This prevents repeated bootstrap attempts and allows the service to build stats incrementally going forward. The change acquires the stats mutex while updating the map and adds an informational log message.
2026-02-17 13:21:10 +01:00
CanbiZ (MickLesk) e332ca8dfa increase request timeout 2026-02-17 13:19:35 +01:00
CanbiZ (MickLesk) ad45294ed5 add signal and proxmox error_types 2026-02-17 12:44:03 +01:00
CanbiZ (MickLesk) 45e128e80f switch to embedded Go services, otherwise deployment gives 404 2026-02-17 12:00:10 +01:00
CanbiZ (MickLesk) c3d74fbf01 Merge pull request #2 from ls-root/refactor/inline-html-assets
refactor: extract inline HTML/assets and fix theme toggle styling
2026-02-17 11:54:20 +01:00
Finn Joshua Bartels 0609022c2c refactor: move inline HTML/assets to separate files
Move inline HTML and assets from dashboard.go into
separate files in a public directory.

Add script to inline SVGs
2026-02-17 11:36:33 +01:00
CanbiZ (MickLesk) e8c1d68967 Add configuring status and improve error analysis
Treat a new 'configuring' status alongside 'installing' across services and UI (filters, counts, badges, allowedStatus, stuck-install detection). Reclassify failed records with exit_code==0 as success (in Fetch* paths and main ingestion) and remove aggressive top-error truncation; increase error preview/full display limits. Expand exit code mappings and descriptions (many curl/apt/docker/signal/timeouts added) and enhance categorizeErrorText to detect Docker/container, resource (OOM) and signal-related errors for better error_category assignment. Misc: add new HTML/CSS for configuring badge and adjust related dashboard/error-analysis rendering.
2026-02-17 09:23:11 +01:00
CanbiZ (MickLesk) 780613f6ab Scale timeouts and increase cache TTLs
Make request and background refresh timeouts depend on the requested data range: default 120s, 300s for >=90 days and 600s for >=365 days. Simplify and extend cache TTL behavior: keep a 2m default but use a 23h TTL for queries older than 7 days (replacing the previous 5m/15m tiers). These changes reduce frequent recomputation for large/time-spanning analyses and allow more time for heavy data fetches.
2026-02-16 18:20:44 +01:00
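The tiers in 780613f6ab translate directly into two small functions. The function names are hypothetical; the thresholds and durations are the ones stated in the commit message. How a "days=0 / All Time" request is mapped is not modelled here.

```go
package main

import (
	"fmt"
	"time"
)

// timeoutForDays scales the request timeout with the requested range.
func timeoutForDays(days int) time.Duration {
	switch {
	case days >= 365:
		return 600 * time.Second
	case days >= 90:
		return 300 * time.Second
	default:
		return 120 * time.Second
	}
}

// cacheTTLForDays keeps a short TTL for recent data and a 23h TTL
// for ranges longer than 7 days.
func cacheTTLForDays(days int) time.Duration {
	if days > 7 {
		return 23 * time.Hour
	}
	return 2 * time.Minute
}

func main() {
	for _, d := range []int{1, 7, 30, 90, 365} {
		fmt.Printf("days=%d timeout=%s ttl=%s\n", d, timeoutForDays(d), cacheTTLForDays(d))
	}
}
```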
CanbiZ (MickLesk) c580032442 Update service.go 2026-02-16 18:02:16 +01:00
CanbiZ (MickLesk) 4eb75b4690 Add persistent script stats stores and usage
Introduce ScriptStatsStore to persist aggregated script telemetry (7d, 30d, all-time) backed by PocketBase collections. Implements loading, bootstrap/rebuild, incremental updates, aggregation, and PB upserts (LoadFromPB, Bootstrap, Rebuild, IncrementalUpdate, Update, writeAllToPB, BuildData). Integrate stores into main: initialize stores, serve script analysis from in-memory persistent stores for days=0/<=7/<=30 (fast response + caching), allow days=0 as All Time, and wire stores into warmupCaches to load/update and pre-warm cached script data during startup/nightly. Also: adjust cache TTL behavior for live today data, change warmupCaches signature to accept stores, add startup bootstrapping/rebuild logic (including long timeout for all-time bootstrap), add sync mutex and imports. Misc: return JSON error payloads from GitHub issue API endpoint (instead of plain text), and fix JS error parsing to prefer data.error over data.message.
2026-02-16 17:50:25 +01:00
CanbiZ (MickLesk) d51d56a7d5 Add installs/day and script created date
Parse and store script creation date, compute script age and installs-per-day, and surface that metric in the script analysis UI. ScriptInfo now includes Created (parsed from script_created), ScriptStat gains DaysOld and InstallsPerDay, and the top/bottom tables render an "Installs/Day" column (colspan adjusted). Also increased cache TTLs for larger ranges (uses 23h for >7 days/default) and limited warmup to ProxmoxVE only. Includes date parsing fallbacks and a minimum 1-day age to avoid division-by-zero.
2026-02-16 17:26:31 +01:00
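The installs-per-day metric from d51d56a7d5, including the 1-day minimum age, can be sketched like this. The function signature and the `date` helper are assumptions; only the metric and the clamp come from the commit.

```go
package main

import (
	"fmt"
	"time"
)

// date is a small helper for building UTC calendar dates.
func date(y int, m time.Month, d int) time.Time {
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

// installsPerDay returns the script age in whole days and the average
// installs per day. The age is clamped to at least 1 so a script
// created today cannot cause a division by zero.
func installsPerDay(installs int, created, now time.Time) (daysOld int, perDay float64) {
	daysOld = int(now.Sub(created).Hours() / 24)
	if daysOld < 1 {
		daysOld = 1
	}
	return daysOld, float64(installs) / float64(daysOld)
}

func main() {
	now := date(2026, 2, 16)
	fmt.Println(installsPerDay(50, date(2026, 2, 6), now)) // 10 days old
	fmt.Println(installsPerDay(7, now, now))               // created today
}
```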
CanbiZ (MickLesk) 61dafab9d7 Update .gitignore 2026-02-16 17:18:49 +01:00
CanbiZ (MickLesk) b7921e40a8 Fetch script types and include zero-usage scripts
Rename and extend script lookup to return ScriptInfo (slug + type) and add a mapping from relation IDs to display types. The API request now requests the type field and results are stored as map[string]ScriptInfo instead of map[string]bool. Updated FetchScriptAnalysisData to use the new FetchKnownScripts, adjust logging and filtering, and append zero-usage scripts (with their type) to TopScripts for 30d+ and all-time views so scripts with no telemetry are represented. Unknown type IDs fall back to the raw type value.
2026-02-16 17:15:36 +01:00
CanbiZ (MickLesk) 760dc135c3 Update dashboard.go 2026-02-16 17:12:45 +01:00
CanbiZ (MickLesk) cdf7cee0ce Update dashboard.go 2026-02-16 16:59:32 +01:00
CanbiZ (MickLesk) 77b8087585 Update dashboard.go 2026-02-16 16:12:46 +01:00
CanbiZ (MickLesk) 98d4c3cc37 perf: nightly pre-warm for all caches (dashboard+scripts+errors) with 23h TTL
- warmupCaches replaces warmupDashboardCache, now warms all 3 endpoints
- Nightly at 02:00 UTC: full warmup with 23h TTL for days>1
- Every 15min: refresh today-only data (fast, changes frequently)
- Startup: full warmup of all day ranges and repos
- 30d/AllTime data now instantly available from cache all day
2026-02-16 14:42:22 +01:00
CanbiZ (MickLesk) d6ef143f9c feat: nav links, least used scripts table, caching for /api/errors & /api/scripts TTL fix 2026-02-16 14:36:42 +01:00
CanbiZ (MickLesk) b329076b93 feat: add Script Analysis page (/script-analysis)
- New page: Most Used Scripts (Top 10, expandable) + Recent Activity (Last 10, expandable)
- Filter by 7d / 30d / All Time + Source (ProxmoxVE/ProxmoxVED/All)
- Stats cards: Total Installs, Unique Scripts, Avg Installs/Script
- Success rate bar visualization per script (green/red/purple/yellow segments)
- Search filtering on both tables
- /api/scripts endpoint with caching (stale-while-revalidate)
- Nav links added to Error Analysis page
2026-02-16 14:26:08 +01:00
CanbiZ (MickLesk) be4d0881d7 feat(error-analysis): complete exit code map + safe Issue buttons + expandable errors
Exit codes:
- Add all application-specific codes: Systemd (150-154), Python (160-162),
  PostgreSQL (170-173), MySQL (180-183), MongoDB (190-193),
  Proxmox custom (200-225, 231), Node.js (243-249)
- Add missing curl codes: 5, 8, 10, 25, 30, 75, 102, 124, 128, 134
- Sync all descriptions with client-side explain_exit_code()

Issue button fix:
- Replace inline onclick with data-attributes (data-app, data-exit, data-error, data-rate)
- Add event delegation for .issue-btn clicks
- Add escapeAttr() for safe HTML attribute encoding (handles newlines, quotes)
- Eliminates SyntaxError on complex error text

Expandable error text:
- Show first 80 chars with 'show more' / 'show less' toggle
- Works in both App Errors and Recent Errors tables
- Pre-wrapped display with scroll for long errors
2026-02-16 13:13:33 +01:00
CanbiZ (MickLesk) 7ada9dfd5f Enhance error analysis UI and exit codes
Reclassify runs showing status="failed" with exit_code=0 (and no error text) as success and skip exit_code 0 from error stats to avoid false errors. Expand and refine exit-code -> description/category mapping (curl errors, signals with names, BSD sysexits, apt cases, storage/timeout categories) for more accurate categorization. Improve error listing UI: allow wrapped error text, add short/full toggles for long errors, add escapeAttr to safely serialize attributes, change default period to "Today", and replace inline onclicks with data-attribute-driven issue buttons plus delegated click handler to avoid escaping issues. Minor CSS and layout tweaks for readability.
2026-02-16 13:08:34 +01:00
CanbiZ (MickLesk) ce1f38852c Revamp Error Analysis UI; fix PB negation
Combine Error Analysis and Failed Apps into a two-column responsive grid, update styles and interactions (new classes, hover states, sizing, truncation, badges, severity colors), and add a Deep Analysis action. Adjust client-side rendering: tighter item counts, changed text/formatting for counts and thresholds, improved accessibility/spacing and responsive behavior. In service.go, correct PocketBase filter negation to use the !~ operator for error pattern exclusions and add a clarifying comment.
2026-02-16 12:49:36 +01:00