Architecture Decisions

ADR-0009: OTel + SigNoz Monitoring

Documents the migration from Sentry to OpenTelemetry and SigNoz for observability

Status

Accepted

Date: 2026-03-05

Context

The monorepo initially used Sentry for error monitoring across all apps. As the platform grew to 8+ apps with self-hosted Convex backends and Kubernetes infrastructure, the monitoring requirements expanded:

  • Distributed tracing: Requests flow through Traefik ingress, Next.js servers, and Convex backends. Sentry’s tracing was limited and expensive at scale.
  • Metrics: Need CPU, memory, request latency, and custom business metrics. Sentry does not provide infrastructure metrics.
  • Cost control: Sentry’s per-event pricing becomes unpredictable with 8 apps generating errors, performance transactions, and replays.
  • Data ownership: Prefer self-hosted observability data on EU servers, consistent with the self-hosted Convex decision.
  • Unified platform: Want traces, metrics, and logs in a single interface rather than switching between tools.

Key constraints:

  • Must support both server-side (Node.js) and client-side (browser) instrumentation
  • Should integrate with Kubernetes (pod-level metrics, auto-instrumentation)
  • Must run on the existing single-node k3s cluster
  • Development team is small; operational complexity must be manageable

Decision

Migrated from Sentry to OpenTelemetry (OTel) + SigNoz with the following architecture:

Shared Instrumentation Package

  • Created @hn-monorepo/monitoring package for shared OTel configuration
  • Provides consistent setup for all apps: trace exporters, span processors, error handlers
  • Apps import and initialize monitoring in their instrumentation files

Server-Side Auto-Instrumentation

  • OTel Operator deployed in the k3s cluster
  • Pods annotated with instrumentation.opentelemetry.io/inject-nodejs: "otel-instrumentation" are auto-instrumented
  • Captures HTTP requests, database queries, and custom spans without code changes
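The annotation above lives in each app's deployment manifest. A fragment might look like this (the deployment name is illustrative; the annotation key is the OTel Operator's standard Node.js injection annotation, and its value references an Instrumentation resource named otel-instrumentation):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app   # illustrative name
spec:
  template:
    metadata:
      annotations:
        # Tells the OTel Operator to inject Node.js auto-instrumentation,
        # configured by the "otel-instrumentation" Instrumentation resource.
        instrumentation.opentelemetry.io/inject-nodejs: "otel-instrumentation"
```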

Client-Side Error Capture

  • Each app exposes a /api/otel/error endpoint
  • Client-side errors are caught by a global error boundary and sent to this endpoint
  • The endpoint forwards errors to SigNoz via the OTel collector
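The browser side of this flow can be sketched as follows. The `ErrorReport` payload shape and the function names are assumptions for illustration, not the actual implementation; only the /api/otel/error path comes from the design above:

```typescript
// Sketch of a client-side error reporter. The payload shape is
// illustrative; the server endpoint translates it into an OTel log/span
// and forwards it to the collector.
interface ErrorReport {
  message: string;
  stack?: string;
  url: string;
  timestamp: number;
}

function toErrorReport(error: Error, url: string): ErrorReport {
  return {
    message: error.message,
    stack: error.stack,
    url,
    timestamp: Date.now(),
  };
}

// Called from a global error boundary (or window.onerror).
async function reportError(error: Error): Promise<void> {
  const report = toErrorReport(error, globalThis.location?.href ?? "unknown");
  await fetch("/api/otel/error", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(report),
  });
}
```

Because the endpoint is same-origin, this also avoids CORS configuration and keeps collector details out of the browser bundle.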

SigNoz Deployment

  • Self-hosted in the signoz namespace on k3s
  • Receives traces, metrics, and logs from the OTel collector
  • Provides dashboards, alerting, and trace exploration via web UI

Environment Configuration

  • OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_SERVICE_NAME configured per-app via .env.op files
  • In Kubernetes, these are set via deployment manifests and the OTel Operator
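As a sketch, a per-app environment file might contain entries like the following (the endpoint value is a placeholder, not the actual cluster service address):

```
OTEL_SERVICE_NAME=web-app
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.signoz:4318
```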

Consequences

Positive

  • Vendor-neutral: OTel is an open standard; can switch backends (Jaeger, Grafana Tempo, Datadog) without changing instrumentation
  • Self-hosted: No per-event pricing; fixed infrastructure cost regardless of traffic volume
  • Unified observability: Traces, metrics, and logs in a single SigNoz interface
  • Auto-instrumentation: OTel Operator instruments pods automatically, reducing per-app boilerplate
  • Distributed tracing: Full request lifecycle visibility across ingress, app server, and Convex backend

Negative

  • More operational overhead than SaaS solutions (must maintain SigNoz, OTel Operator, collector)
  • SigNoz requires non-trivial resources on the cluster (ClickHouse storage backend)
  • Learning curve for OTel concepts (spans, traces, baggage, propagation)
  • Client-side error capture via API endpoint is less feature-rich than Sentry’s SDK (no session replay, no breadcrumbs)

Neutral

  • Alert configuration moved from Sentry’s UI to SigNoz’s alerting system
  • Error grouping and deduplication handled by SigNoz rather than Sentry’s issue system
  • Local development can optionally send telemetry to the cluster’s SigNoz instance via Tailscale

Alternatives Considered

Alternative 1: Keep Sentry

  • Description: Continue using Sentry for error monitoring, add separate tools for traces/metrics
  • Pros: Excellent error grouping, session replay, breadcrumbs, mature SDK
  • Cons: Per-event pricing scales poorly with 8+ apps, limited infrastructure metrics, separate tools for traces/metrics
  • Why not chosen: Cost unpredictability and the need for a unified observability platform

Alternative 2: Datadog

  • Description: Comprehensive observability SaaS (APM, metrics, logs, RUM)
  • Pros: Best-in-class APM, unified platform, excellent Kubernetes integration
  • Cons: Expensive (per-host + per-event pricing), data leaves EU, vendor lock-in on proprietary agents
  • Why not chosen: Cost prohibitive for a small team; conflicts with self-hosting philosophy

Alternative 3: Grafana Stack (Loki + Tempo + Prometheus)

  • Description: Self-hosted observability using Grafana’s open-source tools
  • Pros: Extremely flexible, large community, Grafana dashboards are best-in-class
  • Cons: More components to deploy and manage (4+ services), steeper learning curve, requires more cluster resources
  • Why not chosen: SigNoz provides a simpler all-in-one solution with fewer moving parts

Notes

  • SigNoz uses ClickHouse as its storage backend; monitor disk usage as trace volume grows
  • The @hn-monorepo/monitoring package abstracts OTel setup so individual apps do not need to know the backend
  • When adding a new app, add the OTel annotation to its deployment manifest for automatic instrumentation
  • Consider adding Synthetic Monitoring (uptime checks) as a future enhancement
HanseNexus 2026