Overview

Qovery provides integrated observability to help you monitor the health, performance, and behavior of your services directly within the Qovery Console. Your observability data stays within your infrastructure with zero maintenance required.
Qovery Observability Overview
Supports Applications, Containers, and Managed Databases (Jobs support coming soon).
Qovery Observe is not yet self-service. Contact Qovery via Slack or email to get access.

Features

Service Health

Real-time service health and performance tracking

Metrics

CPU, memory, network, request latency, and error rates

Logs

12-week log retention with automatic error detection

Events

Qovery and Kubernetes events (deployments, scaling, failures)

Alerts

Proactive monitoring with customizable alerts and notifications
Observability Capabilities

Key Benefits

  • Data stays in your infrastructure: All observability data remains within your cloud
  • Zero maintenance: No configuration or management required
  • Correlated data: Metrics and logs automatically linked for faster troubleshooting

Architecture

Qovery’s observability combines open-source tools to monitor your Kubernetes infrastructure:

Data Collection

Metrics

Prometheus + Thanos collect and store metrics (CPU, memory, network)

Logs

Loki + Promtail collect and store container logs

Events

Qovery Event Logger captures Kubernetes events
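
Because the metrics live in a standard Prometheus/Thanos stack inside your cluster, you can also query them directly for ad-hoc checks. Below is a minimal sketch using the Prometheus HTTP API; the localhost:9090 endpoint assumes you have port-forwarded the query service, and the namespace selector is a placeholder.

import requests

PROM_URL = "http://localhost:9090"  # assumes a port-forwarded Prometheus/Thanos Query endpoint

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(body)
    return body["data"]["result"]

# Per-pod CPU usage over the last 5 minutes; the namespace label is a placeholder.
for series in instant_query(
    'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="my-namespace"}[5m]))'
):
    print(series["metric"].get("pod", "<unknown>"), f'{float(series["value"][1]):.3f} cores')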

Data Retention

  • Prometheus: 7-day local retention
  • Thanos: Raw metrics (15 days), 5-minute resolution (30 days), 1-hour resolution (30 days)
  • Loki: 12-week log retention
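
A small way to read these tiers together: older queries are served from progressively coarser data until the metrics expire entirely. The helper below is an illustration of that overlap only, based on the retention periods listed above.

from datetime import timedelta

def finest_available_resolution(age: timedelta) -> str | None:
    """Finest stored metrics resolution for data of the given age."""
    if age <= timedelta(days=15):
        return "raw"                 # raw samples are kept for 15 days
    if age <= timedelta(days=30):
        return "5m downsampled"      # past 15 days, only 5-minute and 1-hour samples remain
    return None                      # beyond 30 days, metrics have expired

print(finest_available_resolution(timedelta(days=3)))    # raw
print(finest_available_resolution(timedelta(days=20)))   # 5m downsampled
print(finest_available_resolution(timedelta(days=45)))   # None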

Key Features

  • Per-cluster isolation: Data protection and performance optimization
  • Automatic error detection: Custom metrics track error logs for alerting
  • High availability: Prometheus runs with 2 replicas; Thanos auto-scales 2-5 replicas

Architecture Diagram

Qovery Observability Architecture

Monitoring

Access the Monitoring tab at the service level to view real-time and historical application data.

Service Health

Monitor your service health with:
  • Event tracking: Qovery events (deployments, failures) and Kubernetes events (autoscaler triggers, OOMKilled pods, health check issues)
  • Error logging: Automatically counts error-level logs with direct navigation to errors
  • HTTP error metrics: Aggregated 499 and 5xx error rates by endpoint and status code
  • Request latency: P99 tail latency visualization (expandable to P90 and P50)
Service Health and Events
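
For intuition, the sketch below reproduces the two aggregations this view is built around, an error rate per endpoint (499 plus 5xx) and a tail-latency percentile, from a handful of made-up request records. Qovery computes these from ingress metrics; the record format here is purely illustrative.

from collections import defaultdict

request_records = [
    # (path, status_code, duration_seconds), made-up sample data
    ("/api/users", 200, 0.042),
    ("/api/users", 503, 0.950),
    ("/api/orders", 200, 0.120),
    ("/api/orders", 499, 0.030),
]

def error_rate_by_endpoint(records):
    totals, errors = defaultdict(int), defaultdict(int)
    for path, status, _ in records:
        totals[path] += 1
        if status == 499 or 500 <= status <= 599:
            errors[path] += 1
    return {path: errors[path] / totals[path] for path in totals}

def percentile(values, q):
    """Nearest-rank percentile, good enough for illustration."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, round(q * (len(ordered) - 1)))]

print(error_rate_by_endpoint(request_records))               # {'/api/users': 0.5, '/api/orders': 0.5}
print(percentile([d for _, _, d in request_records], 0.99))  # 0.95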

Resource Monitoring

Track per-pod resources:
  • CPU usage: Against configured requests and limits
  • Memory usage: Against configured requests and limits
Resource Monitoring
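
If you want to reproduce this comparison outside the Console, a single PromQL ratio does it. The sketch below assumes cAdvisor and kube-state-metrics series are available in the cluster's Prometheus and that the query endpoint is port-forwarded to localhost:9090; the namespace is a placeholder.

import requests

# Memory working set as a fraction of the configured memory limit, per pod.
query = (
    'sum by (pod) (container_memory_working_set_bytes{namespace="my-namespace"})'
    ' / sum by (pod) ('
    'kube_pod_container_resource_limits{namespace="my-namespace", resource="memory"})'
)
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"]["pod"], f'{float(series["value"][1]):.0%} of memory limit')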

Network Metrics

Monitor network-level data:
  • Request status by path and error code
  • Request duration (P50, P95, P99 percentiles)
  • Request size statistics
Metrics represent ingress traffic for services with public ports; otherwise, they reflect internal cluster traffic. Scaleway clusters currently lack internal traffic monitoring when no public port is exposed.
Network Metrics
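
The percentile charts come from request-duration histograms. As a rough sketch, the query below applies histogram_quantile to NGINX ingress controller histograms; the metric and label names are assumptions and depend on how ingress is set up on your cluster.

import requests

def duration_quantile_query(quantile: float, service: str) -> str:
    # Label names vary by ingress setup; "service" here is an assumption.
    return (
        f'histogram_quantile({quantile}, sum by (le) ('
        f'rate(nginx_ingress_controller_request_duration_seconds_bucket'
        f'{{service="{service}"}}[5m])))'
    )

for q in (0.50, 0.95, 0.99):
    resp = requests.get(
        "http://localhost:9090/api/v1/query",
        params={"query": duration_quantile_query(q, "my-service")},
        timeout=10,
    )
    result = resp.json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else float("nan")
    print(f"p{int(q * 100)}: {value:.3f}s")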

Managed Database Monitoring

For AWS managed databases (MySQL and PostgreSQL), Qovery Observe provides comprehensive monitoring of database performance and health metrics.
Managed database monitoring is currently available for AWS only. To enable it, the CloudWatch exporter must be activated on your cluster. This is not yet self-service; contact Qovery via Slack or email to enable this feature.

Overview Metrics

Monitor critical database health indicators:

CPU Usage

Track average CPU utilization across your database instances

Memory

Monitor available RAM and memory consumption patterns

Database Connections

Track active database connections in real-time

Swap Usage

Monitor swap memory usage (healthy databases should minimize swap)

Disk Queue Depth

Track outstanding disk I/O operations for performance insights

Unvacuumed Transactions

(PostgreSQL only) Monitor transactions pending cleanup operations

Query Performance

Track database query performance metrics:
  • Write Latency: Monitor write operation response times
  • Read Latency: Track read operation response times

Storage & I/O

Monitor disk performance and capacity:
  • Write IOPS: Operations per second for write operations
  • Read IOPS: Operations per second for read operations
  • Storage Available: Track remaining storage capacity percentage
These metrics help you identify performance bottlenecks, optimize query performance, and ensure your database has adequate resources.
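
Once the CloudWatch exporter is enabled, these RDS indicators are regular Prometheus series and can be checked outside the Console as well. The sketch below assumes the exporter's usual aws_rds_<metric>_<statistic> naming and a dbinstance_identifier label; verify the exact names your exporter exposes.

import requests

RDS_QUERIES = {
    "CPU utilization (%)": 'aws_rds_cpuutilization_average{dbinstance_identifier="my-db"}',
    "Freeable memory (bytes)": 'aws_rds_freeable_memory_average{dbinstance_identifier="my-db"}',
    "Database connections": 'aws_rds_database_connections_average{dbinstance_identifier="my-db"}',
    "Disk queue depth": 'aws_rds_disk_queue_depth_average{dbinstance_identifier="my-db"}',
}

for name, query in RDS_QUERIES.items():
    resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query}, timeout=10)
    result = resp.json()["data"]["result"]
    print(name, float(result[0]["value"][1]) if result else "no data")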

Controls

  • Live update toggle: Continuous chart refresh
  • Custom time frames: Select data display ranges

Logs

Access logs via the Logs tab or the Monitoring tab.
Service Logs

Log Features

Qovery collects and stores logs using Loki + Promtail with:
  • 12-week retention when observability is enabled
  • 24-hour retention without observability
  • Automatic error detection: Error-level logs are counted and highlighted
  • Log enrichment: Service ID, environment ID, and pod information
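
Since logs are stored in Loki, you can also pull them through Loki's HTTP API, for example to fetch recent error lines for a service. The sketch below assumes a port-forwarded Loki endpoint on localhost:3100; the stream selector label is a placeholder, so adapt it to the labels Promtail attaches on your cluster.

import time
import requests

LOKI_URL = "http://localhost:3100"  # assumes a port-forwarded Loki endpoint
now_ns = int(time.time() * 1e9)     # Loki timestamps are in nanoseconds

params = {
    "query": '{namespace="my-namespace"} |~ "(?i)error"',  # label is a placeholder
    "start": now_ns - int(3600 * 1e9),                     # last hour
    "end": now_ns,
    "limit": 100,
}
resp = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params, timeout=10)
resp.raise_for_status()
for stream in resp.json()["data"]["result"]:
    for timestamp, line in stream["values"]:
        print(timestamp, line)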

Filtering Capabilities

Keyword Search

Locate specific messages within log entries

Time Range

Isolate logs around deployments or incidents

Log Level

Filter by severity (error, info, debug)

Alerts

Qovery provides a built-in alerting system that proactively monitors your services and notifies you when specific conditions are met. Access alerts through the Monitoring section at the service level or via the dedicated Alerting section in the navigation menu.

Alert Management

Each service (Application or Container) has a Monitoring section that includes two tabs:
  • Dashboard: Real-time metrics visualization with graphs for CPU, memory, network, and latency
  • Alerts: View existing alerts and create new ones for this specific service
The navigation menu also includes a dedicated Alerting section where you can manage all your alerts from a centralized location:
  • Issues: View all currently fired alerts with their severity, target service, and duration. Each issue shows which alert rule triggered it and how long it has been active.
  • Alert rules: Browse and manage all configured alert rules across your services
  • Notification channel: Configure and manage notification channels (Slack, and more integrations coming soon)
Alerting Section - Issues Overview
To receive alerts in Slack, you first need to configure a notification channel. See the Slack Integration guide to set up Slack webhooks and add notification channels.
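
If you are unsure whether a webhook is wired up correctly, you can post a test message to it before attaching it to a notification channel. This is a generic Slack incoming-webhook call, not a Qovery API; the URL is a placeholder.

import requests

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook URL

resp = requests.post(
    WEBHOOK_URL,
    json={"text": "Test notification: Qovery alerting channel is reachable"},
    timeout=10,
)
resp.raise_for_status()  # Slack replies 200 with body "ok" on success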

Alert Categories

Create alerts based on these monitoring categories:

CPU

Monitor CPU usage thresholds and spikes

Memory

Track memory consumption and prevent OOM issues

HTTP Errors

Detect elevated 5xx server error rates

HTTP Latency

Alert on slow response times and performance degradation

Missing Instances

Get notified when services can’t reach minimum instance count

Instance Restarts

Track unexpected pod restarts and crashes

Auto-Scaling Limit

Alert when service reaches maximum instance limit

Creating an Alert

1. Select Alert Category

Choose the type of alert you want to create (CPU, Memory, HTTP Errors, etc.)

2. Configure Conditions

Define the trigger conditions for your alert:
  • View the underlying query powering the alert
  • Customize threshold values and duration
  • Preview how the condition evaluates against your metrics

3. Configure Alert Details

Set up the alert notification:
  • Alert name: Descriptive name for easy identification
  • Notification message: Custom text included in notifications
  • Notification channel: Select a configured Slack channel (more integrations coming soon)
Creating an Alert
Start with pre-configured alert templates and adjust thresholds based on your service’s normal behavior.
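
Taken together, the three steps collect roughly the fields sketched below. This is an illustrative data model only, not Qovery's API schema.

from dataclasses import dataclass

@dataclass
class AlertRule:
    category: str              # CPU, Memory, HTTP Errors, ...
    aggregation: str           # maximum / minimum / average
    condition: str             # above, below, equal, ...
    threshold: float
    duration_minutes: int      # how long the condition must hold
    name: str                  # descriptive name for easy identification
    message: str               # custom text included in notifications
    notification_channel: str  # a configured Slack channel

rule = AlertRule(
    category="CPU",
    aggregation="maximum",
    condition="above",
    threshold=80.0,
    duration_minutes=5,
    name="High CPU on backend",
    message="CPU above 80% for 5 minutes, check autoscaling and recent deploys",
    notification_channel="#ops-alerts",
)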

Alert Conditions Guide

Understanding how to configure alert conditions is crucial for effective monitoring. Here’s a detailed explanation of each configuration option.

Aggregation Methods (Maximum, Minimum, Average)

When monitoring metrics, you need to decide how to aggregate the data over the specified duration:
Maximum

When to use: Detect peak usage or spikes
  • CPU: Alert when any instance hits high CPU, even briefly
  • Memory: Catch memory spikes before OOM kills
  • Network Latency: Detect worst-case response times (tail latency)
Example: Alert when maximum CPU usage exceeds 80% - triggers if any pod reaches 80%, even if others are idle.

Minimum

When to use: Detect drops or missing resources
  • CPU: Identify underutilized services (cost optimization)
  • Memory: Detect memory leaks causing gradual drops in available memory
  • Network Latency: Ensure baseline performance (rarely used for latency)
Example: Alert when minimum running instances drops below 2 - triggers when you have fewer than 2 healthy pods.

Average

When to use: Monitor overall service health
  • CPU: Track average load across all instances
  • Memory: Monitor typical memory consumption
  • Network Latency: Measure average response time across requests
Example: Alert when average CPU usage exceeds 70% - triggers based on the mean CPU across all pods.
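
A quick illustration of how the aggregation choice changes the outcome, using made-up per-pod CPU samples:

cpu_by_pod = {"pod-a": 85.0, "pod-b": 35.0, "pod-c": 40.0}  # percent, made-up values

aggregates = {
    "maximum": max(cpu_by_pod.values()),                    # 85.0
    "minimum": min(cpu_by_pod.values()),                    # 35.0
    "average": sum(cpu_by_pod.values()) / len(cpu_by_pod),  # ~53.3
}
for method, value in aggregates.items():
    print(f"{method}: {value:.1f}% -> fires a '>80%' alert? {value > 80}")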

Trigger Conditions

Define when the alert should fire based on the comparison between your metric and threshold:

Above (>)

Triggers when metric exceeds the threshold.
Common use: CPU usage, memory usage, 5xx error rates, latency
Example: CPU > 80%

Below (<)

Triggers when metric falls below the threshold.
Common use: Available instances, request rate drops
Example: Running instances < 2

Equal (=)

Triggers when metric exactly matches the threshold.
Common use: Specific status codes, exact counts
Example: Failed deployments = 3

Above or Equal (≥)

Triggers when metric is greater than or equal to the threshold.
Common use: Cumulative thresholds
Example: Error count ≥ 100

Below or Equal (≤)

Triggers when metric is less than or equal to the threshold.
Common use: Resource availability, uptime percentage
Example: Available memory ≤ 20%
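
In code, the five trigger conditions are nothing more than the standard comparison operators, as this small sketch shows:

import operator

CONDITIONS = {
    "above": operator.gt,
    "below": operator.lt,
    "equal": operator.eq,
    "above_or_equal": operator.ge,
    "below_or_equal": operator.le,
}

def condition_met(condition: str, metric_value: float, threshold: float) -> bool:
    return CONDITIONS[condition](metric_value, threshold)

print(condition_met("above", 85.0, 80.0))           # True: CPU > 80%
print(condition_met("below", 3, 2))                 # False: 3 running instances is not < 2
print(condition_met("below_or_equal", 18.0, 20.0))  # True: available memory <= 20%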

Duration

The duration specifies how long the condition must be true before the alert fires. This prevents false positives from temporary spikes. Duration options:
  • Last 1 minute: Very sensitive, catches issues immediately but may have false positives
  • Last 5 minutes: Balanced approach, good for most alerts
  • Last 10 minutes: Conservative, reduces noise but may delay critical alerts
  • Last 15 minutes: Best for informational alerts or gradual trends
Shorter durations increase alert sensitivity but also increase false positives. Longer durations reduce noise but may delay critical notifications.
Alert Conditions Configuration
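
The sketch below shows why duration filters noise: the condition has to hold for every sample in the window, so a single spike does not fire the alert while sustained load does. The sample values are made up.

def fires(samples: list[float], threshold: float, window: int) -> bool:
    """True only if the last `window` samples all exceed the threshold."""
    return len(samples) >= window and all(value > threshold for value in samples[-window:])

one_spike = [40, 42, 95, 41, 43]   # brief spike, condition not sustained
sustained = [85, 88, 90, 86, 87]   # high load for the whole window

print(fires(one_spike, threshold=80, window=5))   # False
print(fires(sustained, threshold=80, window=5))   # True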

Practical Examples by Metric Type

Recommended configuration for a CPU usage alert:
  • Aggregation: Maximum (catches any pod spiking)
  • Trigger condition: Above 80%
  • Duration: Last 5 minutes
Query example:
SEND A NOTIFICATION WHEN THE MAXIMUM OF CPU
FOR [SERVICE_NAME] IS ABOVE 80 %
DURING THE LAST 5 MINUTES
Why: Maximum catches any pod hitting high CPU, 80% leaves headroom before throttling, 5 minutes filters temporary spikes.
Use the Main condition query preview to verify your alert configuration before saving. The query shows exactly what condition will trigger notifications.

Next Steps