# Monitoring

Monitor Reframe in production environments.
## Health Checks

### HTTP Health Endpoint

```bash
curl http://localhost:3000/health
```
Response:

```json
{
  "status": "healthy",
  "engines": {
    "transform": "ready",
    "generation": "ready",
    "validation": "ready"
  },
  "package": {
    "id": "swift-cbpr-mt-mx",
    "version": "2.1.2",
    "loaded_at": "2025-01-15T10:30:00Z"
  },
  "uptime_seconds": 86400
}
```
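A readiness check built on this endpoint can inspect the body rather than just the HTTP status code. A minimal Python sketch, assuming the response shape shown above (`is_healthy` and `check_health` are illustrative names, not part of Reframe):

```python
import json
from urllib.request import urlopen

def is_healthy(payload: dict) -> bool:
    """True when status is healthy and every engine reports ready."""
    return payload.get("status") == "healthy" and all(
        state == "ready" for state in payload.get("engines", {}).values()
    )

def check_health(base_url: str = "http://localhost:3000") -> bool:
    """Fetch /health and evaluate it; network errors count as unhealthy."""
    try:
        with urlopen(f"{base_url}/health", timeout=5) as resp:
            return is_healthy(json.load(resp))
    except OSError:
        return False
```

Separating the parse step (`is_healthy`) from the network call keeps the decision logic testable without a running instance.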
### Docker Health Check

```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s
```
### Kubernetes Probes

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 20
```
## Metrics

### Prometheus Endpoint

```bash
curl http://localhost:3000/metrics
```
### Available Metrics

```text
# Request metrics
reframe_requests_total{endpoint,status}
reframe_request_duration_seconds{endpoint,le}

# Transformation metrics
reframe_transformations_total{direction,type,status}
reframe_transformation_duration_seconds{direction,type,le}

# Validation metrics
reframe_validations_total{format,status}

# Generation metrics
reframe_generations_total{type}

# System metrics
reframe_uptime_seconds
reframe_package_workflows_total
```
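For quick scripting without a full Prometheus client library, the text exposition format can be parsed line by line. A rough sketch that extracts one metric family (`parse_metric` is a hypothetical helper; it assumes label values contain no spaces):

```python
def parse_metric(exposition: str, name: str) -> dict:
    """Extract samples for one metric family from Prometheus text format.

    Returns a mapping of full sample name (including labels) to value.
    """
    samples = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        # Match both bare samples and labelled ones.
        if metric == name or metric.startswith(name + "{"):
            samples[metric] = float(value)
    return samples
```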
### Prometheus Configuration

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'reframe'
    static_configs:
      - targets: ['reframe:3000']
    metrics_path: /metrics
    scrape_interval: 30s
```
### Grafana Dashboard

Key panels to include:

- **Request Rate**: `rate(reframe_requests_total[5m])`
- **Error Rate**: `rate(reframe_requests_total{status="error"}[5m])`
- **Latency P99**: `histogram_quantile(0.99, rate(reframe_request_duration_seconds_bucket[5m]))`
- **Transformations by Type**: `sum by (type) (rate(reframe_transformations_total[5m]))`
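The P99 panel relies on `histogram_quantile`, which linearly interpolates within the bucket where the target rank falls. A simplified model of that calculation for a single series of cumulative `(le, count)` pairs (an illustration of the idea, not the actual Prometheus implementation):

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Approximate quantile from cumulative histogram buckets.

    buckets: sorted (upper_bound, cumulative_count) pairs.
    Assumes samples are uniformly distributed within each bucket.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= target:
            if count == prev_count:
                return bound  # empty bucket: no interpolation possible
            # Linear interpolation between the bucket's bounds.
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```

This is also why bucket boundaries matter: a P99 estimate can never be more precise than the width of the bucket it lands in.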
## Logging

### Log Levels

| Level | Description |
|---|---|
| `error` | Errors only |
| `warn` | Warnings and errors |
| `info` | General info (recommended for production) |
| `debug` | Debug information |
| `trace` | Detailed tracing |
### Log Formats

`compact` (default):

```text
2025-01-15T10:30:00Z INFO reframe::api: Transform request direction=outgoing type=MT103
```

`json` (for log aggregation):

```json
{"timestamp":"2025-01-15T10:30:00Z","level":"INFO","target":"reframe::api","message":"Transform request","direction":"outgoing","type":"MT103"}
```

`pretty` (for development):

```text
2025-01-15T10:30:00.123Z INFO reframe::api: Transform request
  direction: outgoing
  type: MT103
```
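The `json` format lends itself to ad-hoc analysis before a full aggregation stack is in place. A sketch that tallies records by level, skipping lines that are not JSON (`count_by_level` is an illustrative helper, not a Reframe API):

```python
import json

def count_by_level(lines) -> dict:
    """Tally JSON-formatted log lines by their `level` field."""
    counts = {}
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. compact-format output)
        level = record.get("level", "UNKNOWN")
        counts[level] = counts.get(level, 0) + 1
    return counts
```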
### Log Aggregation

#### Fluentd

```
<source>
  @type tail
  path /var/log/reframe/*.log
  pos_file /var/log/fluentd/reframe.pos
  tag reframe
  <parse>
    @type json
  </parse>
</source>
```
#### Loki (Promtail)

```yaml
scrape_configs:
  - job_name: reframe
    static_configs:
      - targets:
          - localhost
        labels:
          job: reframe
          __path__: /var/log/reframe/*.log
```
## Alerting

### Key Alerts

#### High Error Rate

```yaml
- alert: ReframeHighErrorRate
  expr: rate(reframe_requests_total{status="error"}[5m]) > 0.1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: High error rate in Reframe
```

#### High Latency

```yaml
- alert: ReframeHighLatency
  expr: histogram_quantile(0.99, rate(reframe_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: High latency in Reframe
```

#### Service Down

```yaml
- alert: ReframeDown
  expr: up{job="reframe"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Reframe service is down
```

#### Package Reload Failed

```yaml
- alert: ReframePackageError
  expr: reframe_package_load_errors_total > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Package reload failed
```
## Tracing

### OpenTelemetry (if enabled)

```json
{
  "tracing": {
    "enabled": true,
    "exporter": "otlp",
    "endpoint": "http://jaeger:4317",
    "service_name": "reframe"
  }
}
```
### Trace Context

Reframe propagates trace context through the W3C Trace Context headers: `traceparent` and `tracestate`.
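The `traceparent` header follows the W3C Trace Context layout: `version-traceid-spanid-flags`. A sketch of generating one for an upstream caller starting a new trace (real tracers derive the span id from the active span instead of generating it fresh):

```python
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header: 2-hex version, 32-hex trace id,
    16-hex parent span id, 2-hex flags (01 = sampled)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"
```

Sending this header on requests into Reframe lets its spans join the caller's trace in the configured backend.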
## Dashboard Example

### Key Metrics to Display

- **Overview**
  - Requests per second
  - Error rate percentage
  - Average latency
- **Transformations**
  - By direction (outgoing/incoming)
  - By message type
  - Success/failure rate
- **System**
  - CPU usage
  - Memory usage
  - Uptime
- **Package**
  - Loaded workflows
  - Last reload time
  - Reload errors
## Best Practices

- **Set an appropriate log level**: use `info` for production
- **Use JSON logging**: easier to parse and aggregate
- **Configure alerts**: catch issues before users notice
- **Monitor latency**: P99 matters more than the average
- **Track by message type**: identify problematic transformations
- **Set retention policies**: don't fill up storage