# Monitoring
The Helm chart includes a complete observability stack: Prometheus for metrics, Grafana for visualization, and Tempo for distributed tracing.
## Enabling Monitoring

Monitoring is enabled by default. To disable it:

```bash
helm install flow-like ./helm -n flow-like \
  --set monitoring.enabled=false
```

## Components

### Prometheus

Collects metrics from all Flow-Like services and Kubernetes infrastructure.
Scrape targets:
- API Service (`/metrics` on port 9090)
- Executor Pool (`/metrics` on port 9090)
- CockroachDB (built-in metrics)
- Redis Exporter
- cAdvisor (container metrics from kubelet)
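For orientation, a scrape job for one of these targets boils down to something like the snippet below. This is a hand-written illustration, not necessarily the configuration the chart generates; the job name and target address are assumptions:

```yaml
# Illustrative Prometheus scrape job for the API service (assumed names, not the chart's generated config)
scrape_configs:
  - job_name: flow-like-api
    metrics_path: /metrics
    static_configs:
      - targets: ["flow-like-api:9090"]
```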
Access Prometheus UI:
```bash
kubectl port-forward -n flow-like svc/flow-like-prometheus 9090:9090
# Open http://localhost:9090
```

### Grafana

Pre-configured dashboards for all components.
Access Grafana:
```bash
kubectl port-forward -n flow-like svc/flow-like-grafana 3000:80
# Open http://localhost:3000
```

Default credentials:

- Username: `admin`
- Password: retrieved from the secret:
```bash
kubectl get secret -n flow-like flow-like-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d && echo
```

### Tempo

Receives OpenTelemetry traces from the API and Executor services.
Configuration:
- OTLP endpoint: `flow-like-tempo:4317` (gRPC)
- Retention: 72 hours (configurable)
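Services that emit traces point their OTLP exporter at this address. The chart wires this up for its own pods; as a sketch, the standard OpenTelemetry environment variables for an additional workload in the cluster would look like this (the endpoint value comes from the configuration above, the service name is hypothetical):

```yaml
# Standard OpenTelemetry SDK environment variables (illustrative sketch)
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://flow-like-tempo:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_SERVICE_NAME        # hypothetical service name
    value: "my-custom-worker"
```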
## Pre-built Dashboards

### System Overview

Cluster-wide resource utilization:
- CPU usage per pod/container
- Memory usage and limits
- Network I/O
- Disk usage (if applicable)
### API Service

API-specific metrics:
- Request rate (requests/sec)
- Response latency percentiles (p50, p95, p99)
- Error rate by status code
- Active connections
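As an example of the kind of query behind these panels, a request-rate or p95-latency panel might use PromQL along these lines. `http_requests_total` matches the metric used in the alert example further down; the histogram metric name is an assumption and should be checked against what the API actually exposes on `/metrics`:

```promql
# Request rate by status code
sum(rate(http_requests_total[5m])) by (status)

# p95 request latency (histogram metric name is an assumption)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```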
### Executor Pool

Execution metrics:
- Job queue depth
- Jobs in progress
- Execution duration histogram
- Success/failure rates
- Worker pool utilization
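A success/failure-rate panel could divide counters like the following; the metric name and `result` label are purely illustrative and should be replaced with the names the executor actually exports:

```promql
# Share of successful executions over the last 5 minutes
# (flow_like_executions_total and its "result" label are assumptions)
sum(rate(flow_like_executions_total{result="success"}[5m]))
  /
sum(rate(flow_like_executions_total[5m]))
```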
### CockroachDB

Database performance:
- Query rate and latency
- Transaction throughput
- Replication lag
- Storage utilization
- Node health
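CockroachDB also ships its own DB Console, which can be useful alongside the Grafana dashboard. A quick way to reach it, assuming the database service is named `flow-like-cockroachdb` (confirm the actual name with `kubectl get svc -n flow-like`):

```bash
# CockroachDB's built-in DB Console listens on HTTP port 8080 by default;
# the service name below is an assumption.
kubectl port-forward -n flow-like svc/flow-like-cockroachdb 8080:8080
# Open http://localhost:8080
```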
### Redis

Cache and queue metrics:
- Commands per second
- Memory usage
- Connected clients
- Key eviction rate
- Queue lengths
## Tracing

Distributed traces via Tempo:
- Request traces across services
- Latency breakdown by service
- Error traces
- Service dependency map
## Custom Alerts

Add custom Prometheus alerting rules in `values.yaml`:
```yaml
monitoring:
  prometheus:
    alertRules:
      groups:
        - name: flow-like
          rules:
            - alert: HighErrorRate
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                  / sum(rate(http_requests_total[5m])) > 0.05
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: High error rate detected
```

## Configuration Reference
```yaml
monitoring:
  enabled: true

  prometheus:
    image:
      repository: prom/prometheus
      tag: v2.48.0
    retention: 15d
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi

  grafana:
    image:
      repository: grafana/grafana
      tag: 10.2.2
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi

  tempo:
    enabled: true
    image:
      repository: grafana/tempo
      tag: 2.3.1
    retention: 72h
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
```

## Metrics Endpoint Security
The `/metrics` endpoints on the API and Executor services listen on a separate internal port (9090) that is not exposed outside the cluster. Only Prometheus within the cluster can scrape these endpoints.
For production, ensure:
- Metrics port (9090) is not exposed via Ingress
- Network policies restrict access to the monitoring namespace (see the sketch after this list)
- Grafana is behind authentication (SSO/OAuth recommended)
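As a sketch of the network-policy point, a NetworkPolicy along these lines limits ingress on the metrics port to Prometheus pods. The label selectors and namespace are assumptions; adapt them to the labels your release actually uses:

```yaml
# Illustrative NetworkPolicy restricting the metrics port to Prometheus pods
# (label selectors and namespace are assumptions)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-metrics-scrape
  namespace: flow-like
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: flow-like-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - protocol: TCP
          port: 9090
```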
## External Monitoring Integration

### Datadog

Add Datadog annotations to enable auto-discovery:
```yaml
api:
  podAnnotations:
    ad.datadoghq.com/api.check_names: '["prometheus"]'
    ad.datadoghq.com/api.init_configs: '[{}]'
    ad.datadoghq.com/api.instances: '[{"prometheus_url": "http://%%host%%:9090/metrics"}]'
```

### New Relic
Export to New Relic via Prometheus remote write:

```yaml
monitoring:
  prometheus:
    remoteWrite:
      - url: https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=flow-like
        bearer_token: YOUR_LICENSE_KEY
```

## Troubleshooting
### No metrics appearing

1. Check if monitoring pods are running:

   ```bash
   kubectl get pods -n flow-like -l app.kubernetes.io/component=monitoring
   ```

2. Verify Prometheus targets:

   ```bash
   kubectl port-forward -n flow-like svc/flow-like-prometheus 9090:9090
   # Open http://localhost:9090/targets
   ```

3. Check the API metrics endpoint:

   ```bash
   kubectl exec -it deployment/flow-like-api -n flow-like -- \
     curl -s localhost:9090/metrics | head -20
   ```
### Grafana dashboard not loading

1. Check Grafana logs:

   ```bash
   kubectl logs -f deployment/flow-like-grafana -n flow-like
   ```

2. Verify datasources are configured:

   ```bash
   kubectl get configmap -n flow-like flow-like-grafana-datasources -o yaml
   ```
### Traces not appearing

1. Check that Tempo is receiving data:

   ```bash
   kubectl logs -f deployment/flow-like-tempo -n flow-like
   ```

2. Verify the OTLP endpoint is reachable from the API:

   ```bash
   kubectl exec -it deployment/flow-like-api -n flow-like -- \
     nc -zv flow-like-tempo 4317
   ```