Architecture Decisions

ADR-0009: OTel + SigNoz Monitoring

Documents the migration from Sentry to OpenTelemetry and SigNoz for observability

Status

Accepted

Date: 2026-03-05

Context

The monorepo initially used Sentry for error monitoring across all apps. As the platform grew to 8+ apps with self-hosted Convex backends and Kubernetes infrastructure, the monitoring requirements expanded:

  • Distributed tracing: Requests flow through Traefik ingress, Next.js servers, and Convex backends. Sentry’s tracing was limited and expensive at scale.
  • Metrics: Need CPU, memory, request latency, and custom business metrics. Sentry does not provide infrastructure metrics.
  • Cost control: Sentry’s per-event pricing becomes unpredictable with 8 apps generating errors, performance transactions, and replays.
  • Data ownership: Prefer self-hosted observability data on EU servers, consistent with the self-hosted Convex decision.
  • Unified platform: Want traces, metrics, and logs in a single interface rather than switching between tools.

Key constraints:

  • Must support both server-side (Node.js) and client-side (browser) instrumentation
  • Should integrate with Kubernetes (pod-level metrics, auto-instrumentation)
  • Must run on the existing single-node k3s cluster
  • Development team is small; operational complexity must be manageable

Decision

Migrated from Sentry to OpenTelemetry (OTel) + SigNoz with the following architecture:

Shared Instrumentation Package

  • Created @hn-monorepo/monitoring package for shared OTel configuration
  • Provides consistent setup for all apps: trace exporters, span processors, error handlers
  • Apps import and initialize monitoring in their instrumentation files

Server-Side Auto-Instrumentation

  • OTel Operator deployed in the k3s cluster
  • Pods annotated with instrumentation.opentelemetry.io/inject-nodejs: "otel-instrumentation" are auto-instrumented
  • Captures HTTP requests, database queries, and custom spans without code changes
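The annotation above lives in each app's deployment manifest. A fragment might look like this (the deployment name is illustrative; the annotation key is the OTel Operator's standard Node.js injection annotation, and its value references an Instrumentation resource named otel-instrumentation):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app   # illustrative name
spec:
  template:
    metadata:
      annotations:
        # Tells the OTel Operator to inject Node.js auto-instrumentation,
        # configured by the "otel-instrumentation" Instrumentation resource.
        instrumentation.opentelemetry.io/inject-nodejs: "otel-instrumentation"
```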

Client-Side Error Capture

  • Each app exposes a /api/otel/error endpoint
  • Client-side errors are caught by a global error boundary and sent to this endpoint
  • The endpoint forwards errors to SigNoz via the OTel collector
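The browser side of this flow can be sketched as follows. The `ErrorReport` payload shape and the function names are assumptions for illustration, not the actual implementation; only the /api/otel/error path comes from the design above:

```typescript
// Sketch of a client-side error reporter. The payload shape is
// illustrative; the server endpoint translates it into an OTel log/span
// and forwards it to the collector.
interface ErrorReport {
  message: string;
  stack?: string;
  url: string;
  timestamp: number;
}

function toErrorReport(error: Error, url: string): ErrorReport {
  return {
    message: error.message,
    stack: error.stack,
    url,
    timestamp: Date.now(),
  };
}

// Called from a global error boundary (or window.onerror).
async function reportError(error: Error): Promise<void> {
  const report = toErrorReport(error, globalThis.location?.href ?? "unknown");
  await fetch("/api/otel/error", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(report),
  });
}
```

Because the endpoint is same-origin, this also avoids CORS configuration and keeps collector details out of the browser bundle.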

SigNoz Deployment

  • Self-hosted in the signoz namespace on k3s
  • Receives traces, metrics, and logs from the OTel collector
  • Provides dashboards, alerting, and trace exploration via web UI

Environment Configuration

  • OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_SERVICE_NAME configured per-app via .env.op files
  • In Kubernetes, these are set via deployment manifests and the OTel Operator
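As a sketch, a per-app environment file might contain entries like the following (the endpoint value is a placeholder, not the actual cluster service address):

```
OTEL_SERVICE_NAME=web-app
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-otel-collector.signoz:4318
```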

Consequences

Positive

  • Vendor-neutral: OTel is an open standard; can switch backends (Jaeger, Grafana Tempo, Datadog) without changing instrumentation
  • Self-hosted: No per-event pricing; fixed infrastructure cost regardless of traffic volume
  • Unified observability: Traces, metrics, and logs in a single SigNoz interface
  • Auto-instrumentation: OTel Operator instruments pods automatically, reducing per-app boilerplate
  • Distributed tracing: Full request lifecycle visibility across ingress, app server, and Convex backend

Negative

  • More operational overhead than SaaS solutions (must maintain SigNoz, OTel Operator, collector)
  • SigNoz requires non-trivial resources on the cluster (ClickHouse storage backend)
  • Learning curve for OTel concepts (spans, traces, baggage, propagation)
  • Client-side error capture via API endpoint is less feature-rich than Sentry’s SDK (no session replay, no breadcrumbs)

Neutral

  • Alert configuration moved from Sentry’s UI to SigNoz’s alerting system
  • Error grouping and deduplication handled by SigNoz rather than Sentry’s issue system
  • Local development can optionally send telemetry to the cluster’s SigNoz instance via Tailscale

Alternatives Considered

Alternative 1: Keep Sentry

  • Description: Continue using Sentry for error monitoring, add separate tools for traces/metrics
  • Pros: Excellent error grouping, session replay, breadcrumbs, mature SDK
  • Cons: Per-event pricing scales poorly with 8+ apps, limited infrastructure metrics, separate tools for traces/metrics
  • Why not chosen: Cost unpredictability and the need for a unified observability platform

Alternative 2: Datadog

  • Description: Comprehensive observability SaaS (APM, metrics, logs, RUM)
  • Pros: Best-in-class APM, unified platform, excellent Kubernetes integration
  • Cons: Expensive (per-host + per-event pricing), data leaves EU, vendor lock-in on proprietary agents
  • Why not chosen: Cost prohibitive for a small team; conflicts with self-hosting philosophy

Alternative 3: Grafana Stack (Loki + Tempo + Prometheus)

  • Description: Self-hosted observability using Grafana’s open-source tools
  • Pros: Extremely flexible, large community, Grafana dashboards are best-in-class
  • Cons: More components to deploy and manage (4+ services), steeper learning curve, requires more cluster resources
  • Why not chosen: SigNoz provides a simpler all-in-one solution with fewer moving parts

Notes

  • SigNoz uses ClickHouse as its storage backend; monitor disk usage as trace volume grows
  • The @hn-monorepo/monitoring package abstracts OTel setup so individual apps do not need to know the backend
  • When adding a new app, add the OTel annotation to its deployment manifest for automatic instrumentation
  • Consider adding Synthetic Monitoring (uptime checks) as a future enhancement
HanseNexus 2026