# ADR-0009: OTel + SigNoz Monitoring
Documents the migration from Sentry to OpenTelemetry and SigNoz for observability
## Status
Accepted
Date: 2026-03-05
## Context
The monorepo initially used Sentry for error monitoring across all apps. As the platform grew to 8+ apps with self-hosted Convex backends and Kubernetes infrastructure, the monitoring requirements expanded:
- Distributed tracing: Requests flow through Traefik ingress, Next.js servers, and Convex backends. Sentry’s tracing was limited and expensive at scale.
- Metrics: Need CPU, memory, request latency, and custom business metrics. Sentry does not provide infrastructure metrics.
- Cost control: Sentry’s per-event pricing becomes unpredictable with 8 apps generating errors, performance transactions, and replays.
- Data ownership: Prefer self-hosted observability data on EU servers, consistent with the self-hosted Convex decision.
- Unified platform: Want traces, metrics, and logs in a single interface rather than switching between tools.
Key constraints:
- Must support both server-side (Node.js) and client-side (browser) instrumentation
- Should integrate with Kubernetes (pod-level metrics, auto-instrumentation)
- Must run on the existing single-node k3s cluster
- Development team is small; operational complexity must be manageable
## Decision
Migrated from Sentry to OpenTelemetry (OTel) + SigNoz with the following architecture:
### Shared Instrumentation Package
- Created `@hn-monorepo/monitoring` package for shared OTel configuration
- Provides consistent setup for all apps: trace exporters, span processors, error handlers
- Apps import and initialize monitoring in their instrumentation files
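The shared package's entry point can be sketched roughly as follows, assuming the standard `@opentelemetry/sdk-node` API; the `startMonitoring` name and the exact wiring are illustrative, not the actual package internals:

```typescript
// Illustrative sketch only: startMonitoring and its wiring are assumptions,
// not the real @hn-monorepo/monitoring internals.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

export function startMonitoring(serviceName: string): NodeSDK {
  const sdk = new NodeSDK({
    serviceName,
    // The exporter reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment
    // by default, so per-app .env.op values apply without extra code.
    traceExporter: new OTLPTraceExporter(),
    instrumentations: [getNodeAutoInstrumentations()],
  });
  sdk.start();
  return sdk;
}
```

An app's instrumentation file would then call `startMonitoring(...)` once at startup, before any request handling.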
### Server-Side Auto-Instrumentation
- OTel Operator deployed in the k3s cluster
- Pods annotated with `instrumentation.opentelemetry.io/inject-nodejs: "otel-instrumentation"` are auto-instrumented
- Captures HTTP requests, database queries, and custom spans without code changes
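The annotation sits on the pod template of each workload. In a sketch like the following, only the annotation key and value come from this ADR; the metadata and image names are placeholders:

```yaml
# Sketch of a Deployment pod template carrying the auto-instrumentation
# annotation; metadata and image names are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        instrumentation.opentelemetry.io/inject-nodejs: "otel-instrumentation"
    spec:
      containers:
        - name: example-app
          image: example-app:latest
```

The Operator then injects the Node.js auto-instrumentation into matching pods at admission time.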
### Client-Side Error Capture
- Each app exposes a `/api/otel/error` endpoint
- Client-side errors are caught by a global error boundary and sent to this endpoint
- The endpoint forwards errors to SigNoz via the OTel collector
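One way the forwarding endpoint could shape a browser error before posting it to the collector is as an OTLP/HTTP JSON log record. In this sketch, only the `/api/otel/error` route and the collector destination come from the ADR; the `ClientError` shape and `toOtlpLog` helper are hypothetical:

```typescript
// Hypothetical helper for the /api/otel/error route; the ClientError shape
// and toOtlpLog are illustrative, not the actual package API.
interface ClientError {
  message: string;
  stack?: string;
  url: string;
}

// Convert a browser error into a minimal OTLP/HTTP JSON log payload.
export function toOtlpLog(err: ClientError, serviceName: string) {
  return {
    resourceLogs: [{
      resource: {
        attributes: [{ key: "service.name", value: { stringValue: serviceName } }],
      },
      scopeLogs: [{
        scope: { name: "client-error" },
        logRecords: [{
          severityText: "ERROR",
          body: { stringValue: err.message },
          attributes: [
            { key: "exception.stacktrace", value: { stringValue: err.stack ?? "" } },
            { key: "url.full", value: { stringValue: err.url } },
          ],
        }],
      }],
    }],
  };
}

// The route handler would then POST this to the collector's logs endpoint:
// await fetch(`${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/logs`, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(toOtlpLog(err, process.env.OTEL_SERVICE_NAME!)),
// });
```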
### SigNoz Deployment
- Self-hosted in the `signoz` namespace on k3s
- Receives traces, metrics, and logs from the OTel collector
- Provides dashboards, alerting, and trace exploration via web UI
### Environment Configuration
- `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_SERVICE_NAME` configured per-app via `.env.op` files
- In Kubernetes, these are set via deployment manifests and the OTel Operator
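A per-app `.env.op` entry might look like the following; the collector hostname is an assumed in-cluster service address, not a documented value:

```shell
# Illustrative .env.op values; the collector address depends on the actual
# service name in the signoz namespace.
OTEL_SERVICE_NAME=example-app
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.signoz.svc.cluster.local:4318
```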
## Consequences
### Positive
- Vendor-neutral: OTel is an open standard; can switch backends (Jaeger, Grafana Tempo, Datadog) without changing instrumentation
- Self-hosted: No per-event pricing; fixed infrastructure cost regardless of traffic volume
- Unified observability: Traces, metrics, and logs in a single SigNoz interface
- Auto-instrumentation: OTel Operator instruments pods automatically, reducing per-app boilerplate
- Distributed tracing: Full request lifecycle visibility across ingress, app server, and Convex backend
### Negative
- More operational overhead than SaaS solutions (must maintain SigNoz, OTel Operator, collector)
- SigNoz requires non-trivial resources on the cluster (ClickHouse storage backend)
- Learning curve for OTel concepts (spans, traces, baggage, propagation)
- Client-side error capture via API endpoint is less feature-rich than Sentry’s SDK (no session replay, no breadcrumbs)
### Neutral
- Alert configuration moved from Sentry’s UI to SigNoz’s alerting system
- Error grouping and deduplication handled by SigNoz rather than Sentry’s issue system
- Local development can optionally send telemetry to the cluster’s SigNoz instance via Tailscale
## Alternatives Considered
### Alternative 1: Keep Sentry
- Description: Continue using Sentry for error monitoring, add separate tools for traces/metrics
- Pros: Excellent error grouping, session replay, breadcrumbs, mature SDK
- Cons: Per-event pricing scales poorly with 8+ apps, limited infrastructure metrics, separate tools for traces/metrics
- Why not chosen: Cost unpredictability and the need for a unified observability platform
### Alternative 2: Datadog
- Description: Comprehensive observability SaaS (APM, metrics, logs, RUM)
- Pros: Best-in-class APM, unified platform, excellent Kubernetes integration
- Cons: Expensive (per-host + per-event pricing), data leaves EU, vendor lock-in on proprietary agents
- Why not chosen: Cost prohibitive for a small team; conflicts with self-hosting philosophy
### Alternative 3: Grafana Stack (Loki + Tempo + Prometheus)
- Description: Self-hosted observability using Grafana’s open-source tools
- Pros: Extremely flexible, large community, Grafana dashboards are best-in-class
- Cons: More components to deploy and manage (4+ services), steeper learning curve, requires more cluster resources
- Why not chosen: SigNoz provides a simpler all-in-one solution with fewer moving parts
## References
- OpenTelemetry Documentation
- SigNoz Documentation
- OTel Operator for Kubernetes
- Related: ADR-0001 (Monorepo Structure)
- Related: ADR-0008 (Self-hosted Convex on k3s)
## Notes
- SigNoz uses ClickHouse as its storage backend; monitor disk usage as trace volume grows
- The `@hn-monorepo/monitoring` package abstracts OTel setup so individual apps do not need to know the backend
- When adding a new app, add the OTel annotation to its deployment manifest for automatic instrumentation
- Consider adding Synthetic Monitoring (uptime checks) as a future enhancement