# Qovery Observe

> Built-in observability for monitoring applications and infrastructure

## Overview

Qovery provides integrated observability to help you monitor the health, performance, and behavior of your services directly within the Qovery Console. Your observability data stays within your infrastructure with zero maintenance required.

<img src="https://mintcdn.com/qovery/h4GYArJsH08SmRTF/images/observability/overview.png?fit=max&auto=format&n=h4GYArJsH08SmRTF&q=85&s=ac7be6d006ddf98027f4e86363c0ea63" alt="Qovery Observability Overview" width="3164" height="2070" data-path="images/observability/overview.png" />

<Note>
  Supports Applications, Containers, and Managed Databases (Jobs support coming soon).
</Note>

<Info>
  Qovery Observe is not yet self-service. Contact Qovery via Slack or email to get access.
</Info>

## Features

<CardGroup cols={2}>
  <Card title="Service Health" icon="heart-pulse">
    Real-time service health and performance tracking
  </Card>

  <Card title="Metrics" icon="chart-line">
    CPU, memory, network, request latency, and error rates
  </Card>

  <Card title="Logs" icon="file-lines">
    12-week log retention with automatic error detection
  </Card>

  <Card title="Events" icon="clock">
    Qovery and Kubernetes events (deployments, scaling, failures)
  </Card>

  <Card title="Alerts" icon="bell">
    Proactive monitoring with customizable alerts and notifications
  </Card>
</CardGroup>

<img src="https://mintcdn.com/qovery/kTQizHnnz6yZZ5Tx/images/observability/capabilities.png?fit=max&auto=format&n=kTQizHnnz6yZZ5Tx&q=85&s=c6b93cfa9da29922209430d100a40e1f" alt="Observability Capabilities" width="1608" height="700" data-path="images/observability/capabilities.png" />

## Key Benefits

* **Data stays in your infrastructure**: All observability data remains within your cloud
* **Zero maintenance**: No configuration or management required
* **Correlated data**: Metrics and logs automatically linked for faster troubleshooting

## Architecture

Qovery's observability combines open-source tools to monitor your Kubernetes infrastructure:

### Data Collection

<CardGroup cols={3}>
  <Card title="Metrics" icon="chart-mixed">
    **Prometheus + Thanos** collect and store metrics (CPU, memory, network)
  </Card>

  <Card title="Logs" icon="file-lines">
    **Loki + Promtail** collect and store container logs
  </Card>

  <Card title="Events" icon="bell">
    **Qovery Event Logger** captures Kubernetes events
  </Card>
</CardGroup>

### Data Retention

* **Prometheus**: 7-day local retention
* **Thanos**: Raw metrics (15 days), 5-minute resolution (30 days), 1-hour resolution (30 days)
* **Loki**: 12-week log retention
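
To make the retention tiers concrete, the sketch below reports which stores could still hold data of a given age. Only the retention periods come from the list above; the tier labels and the `stores_holding` helper are hypothetical names chosen for illustration and are not part of Qovery.

```python
# Retention periods from the list above, in days.
# The tier labels and this helper are hypothetical, for illustration only.
RETENTION_DAYS = {
    "prometheus-local": 7,
    "thanos-raw": 15,
    "thanos-5m": 30,
    "thanos-1h": 30,
    "loki-logs": 12 * 7,  # 12 weeks
}


def stores_holding(age_days: float) -> list[str]:
    """Return the tiers whose retention window still covers data of this age."""
    return [tier for tier, days in RETENTION_DAYS.items() if age_days <= days]
```

Under these assumptions, a metric sample that is 20 days old would only be available at the downsampled 5-minute and 1-hour resolutions, while logs of the same age would still be retained.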

### Key Features

* **Per-cluster isolation**: Data protection and performance optimization
* **Automatic error detection**: Custom metrics track error logs for alerting
* **High availability**: Prometheus runs with 2 replicas; Thanos auto-scales between 2 and 5 replicas

### Architecture Diagram

<img src="https://mintcdn.com/qovery/kTQizHnnz6yZZ5Tx/images/observability/architecture-light.png?fit=max&auto=format&n=kTQizHnnz6yZZ5Tx&q=85&s=901499b1e9bf9ad4929b07194c579083" alt="Qovery Observability Architecture" className="light-mode-only" width="1640" height="1016" data-path="images/observability/architecture-light.png" />

<img src="https://mintcdn.com/qovery/kTQizHnnz6yZZ5Tx/images/observability/architecture-dark.png?fit=max&auto=format&n=kTQizHnnz6yZZ5Tx&q=85&s=4544e8824ec78ff9ea9781384eaac652" alt="Qovery Observability Architecture" className="dark-mode-only" width="1640" height="1016" data-path="images/observability/architecture-dark.png" />

## Monitoring

Access the **Monitoring** tab at the service level to view real-time and historical application data.

### Service Health

Monitor your service health with:

* **Event tracking**: Qovery events (deployments, failures) and Kubernetes events (autoscaler triggers, OOMKilled pods, health check issues)
* **Error logging**: Automatically counts error-level logs with direct navigation to errors
* **HTTP error metrics**: Aggregated 499 and 5xx error rates by endpoint and status code
* **Request latency**: P99 tail latency visualization (expandable to P90 and P50)
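
As a rough illustration of what the latency percentiles and the error rate mean, the sketch below computes them from raw samples. It assumes the nearest-rank percentile method and a flat list of status codes; Qovery's actual computation is not described here and may differ (for example, it may be derived from histograms). The function names are hypothetical.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes: list[int]) -> float:
    """Share of requests that ended in 499 or any 5xx status, as a percentage."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code == 499 or 500 <= code <= 599)
    return 100 * errors / len(status_codes)
```

P99 is dominated by the slowest 1% of requests, which is why it surfaces tail latency that P50 hides.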

<img src="https://mintcdn.com/qovery/h4GYArJsH08SmRTF/images/observability/events.png?fit=max&auto=format&n=h4GYArJsH08SmRTF&q=85&s=eccd0c024f4a874ab238ead0f0db9fc8" alt="Service Health and Events" width="3164" height="2070" data-path="images/observability/events.png" />

### Resource Monitoring

Track per-pod resources:

* **CPU usage**: Against configured requests and limits
* **Memory usage**: Against configured requests and limits
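
The comparison against requests and limits can be read as two ratios. The sketch below is a minimal illustration with made-up numbers for a hypothetical pod; `usage_ratio` is not a Qovery function.

```python
def usage_ratio(used: float, reference: float) -> float:
    """Usage as a percentage of a configured request or limit."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return 100 * used / reference


# Hypothetical pod: 350 millicores in use, 250m requested, 500m limit.
cpu_vs_request = usage_ratio(350, 250)  # above 100: using more than it requested
cpu_vs_limit = usage_ratio(350, 500)    # below 100: still under the limit
```

A pod that sits well above its request but below its limit is a hint that the request may be set too low for the workload.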

<img src="https://mintcdn.com/qovery/h4GYArJsH08SmRTF/images/observability/resources.png?fit=max&auto=format&n=h4GYArJsH08SmRTF&q=85&s=51a925654f4512bdc9be202584f0ecf0" alt="Resource Monitoring" width="3164" height="2070" data-path="images/observability/resources.png" />

### Network Metrics

Monitor network-level data:

* Request status by path and error code
* Request duration (P50, P95, P99 percentiles)
* Request size statistics

<Note>
  Metrics cover ingress traffic for services that expose a public port, and internal cluster traffic for services that do not. On Scaleway clusters, internal traffic monitoring is not currently available for services without a public port.
</Note>

<img src="https://mintcdn.com/qovery/h4GYArJsH08SmRTF/images/observability/network.png?fit=max&auto=format&n=h4GYArJsH08SmRTF&q=85&s=70379188cff76c5adb13bfd43ffea0ce" alt="Network Metrics" width="3164" height="2070" data-path="images/observability/network.png" />

### Managed Database Monitoring

For AWS managed databases (MySQL and PostgreSQL), Qovery Observe provides comprehensive monitoring of database performance and health metrics.

<Note>
  Managed database monitoring is currently available for AWS only. To enable it, the CloudWatch exporter must be activated on your cluster. This is not currently self-service; contact Qovery via Slack or email to enable this feature.
</Note>

#### Overview Metrics

Monitor critical database health indicators:

<CardGroup cols={3}>
  <Card title="CPU Usage" icon="microchip">
    Track average CPU utilization across your database instances
  </Card>

  <Card title="Memory" icon="memory">
    Monitor available RAM and memory consumption patterns
  </Card>

  <Card title="Database Connections" icon="plug">
    Track active database connections in real-time
  </Card>

  <Card title="Swap Usage" icon="arrows-rotate">
    Monitor swap memory usage (healthy databases should minimize swap)
  </Card>

  <Card title="Disk Queue Depth" icon="layer-group">
    Track outstanding disk I/O operations for performance insights
  </Card>

  <Card title="Unvacuumed Transactions" icon="broom">
    (PostgreSQL only) Monitor transactions pending cleanup operations
  </Card>
</CardGroup>

#### Query Performance

Track database query performance metrics:

* **Write Latency**: Monitor write operation response times
* **Read Latency**: Track read operation response times

#### Storage & I/O

Monitor disk performance and capacity:

* **Write IOPS**: Operations per second for write operations
* **Read IOPS**: Operations per second for read operations
* **Storage Available**: Track remaining storage capacity percentage

These metrics help you identify performance bottlenecks, optimize query performance, and ensure your database has adequate resources.

### Controls

* **Live update toggle**: Continuous chart refresh
* **Custom time frames**: Select data display ranges

## Logs

Access logs via the **Logs** tab or the Monitoring tab.

<img src="https://mintcdn.com/qovery/yDPPPWPKgIjFS10L/images/deployment/live_logs.png?fit=max&auto=format&n=yDPPPWPKgIjFS10L&q=85&s=601138982d2e163fb195110b9f0adc6f" alt="Service Logs" width="3164" height="2070" data-path="images/deployment/live_logs.png" />

### Log Features

Qovery collects and stores logs using **Loki + Promtail** with:

* **12-week retention** when observability is enabled
* **24-hour retention** without observability
* **Automatic error detection**: Error-level logs are counted and highlighted
* **Log enrichment**: Service ID, environment ID, and pod information

### Filtering Capabilities

<CardGroup cols={3}>
  <Card title="Keyword Search" icon="magnifying-glass">
    Locate specific messages within log entries
  </Card>

  <Card title="Time Range" icon="clock">
    Isolate logs around deployments or incidents
  </Card>

  <Card title="Log Level" icon="filter">
    Filter by severity (error, info, debug)
  </Card>
</CardGroup>
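
The three filters combine as a logical AND. The sketch below shows that behavior on in-memory records; the `LogEntry` fields and `filter_logs` function are hypothetical and only mirror the enrichment and filters described above, not Qovery's internal data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogEntry:
    # Field names are hypothetical; they mirror the enrichment described above.
    timestamp: datetime
    level: str
    message: str
    service_id: str
    pod: str


def filter_logs(entries, keyword=None, level=None, start=None, end=None):
    """Keep entries matching every filter that was supplied."""
    result = []
    for entry in entries:
        if keyword is not None and keyword.lower() not in entry.message.lower():
            continue
        if level is not None and entry.level != level:
            continue
        if start is not None and entry.timestamp < start:
            continue
        if end is not None and entry.timestamp > end:
            continue
        result.append(entry)
    return result
```

Narrowing the time range to the minutes around a deployment and then filtering by `error` is usually the quickest way to isolate a regression.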

## Alerts

Qovery provides a built-in alerting system that proactively monitors your services and notifies you when specific conditions are met. Access alerts through the **Monitoring** section at the service level or via the dedicated **Alerting** section in the navigation menu.

### Alert Management

Each service (Application or Container) has a **Monitoring** section that includes two tabs:

* **Dashboard**: Real-time metrics visualization with graphs for CPU, memory, network, and latency
* **Alerts**: View existing alerts and create new ones for this specific service

The navigation menu also includes a dedicated **Alerting** section where you can manage all your alerts from a centralized location:

* **Issues**: View all currently fired alerts with their severity, target service, and duration. Each issue shows which alert rule triggered it and how long it has been active.
* **Alert rules**: Browse and manage all configured alert rules across your services
* **Notification channel**: Configure and manage notification channels (Slack and Email, with more integrations coming soon)

<img src="https://mintcdn.com/qovery/h4GYArJsH08SmRTF/images/observability/issues-overview.png?fit=max&auto=format&n=h4GYArJsH08SmRTF&q=85&s=e40172c373ccb122d038e72acb7b8f49" alt="Alerting Section - Issues Overview" width="3164" height="2070" data-path="images/observability/issues-overview.png" />

<Tip>
  To receive alerts, you first need to configure a notification channel. See the [Slack Integration guide](/configuration/integrations/slack) or [Email Integration guide](/configuration/integrations/email) to set up notification channels.
</Tip>

### Alert Categories

Create alerts based on these monitoring categories:

<CardGroup cols={2}>
  <Card title="CPU" icon="microchip">
    Monitor CPU usage thresholds and spikes
  </Card>

  <Card title="Memory" icon="memory">
    Track memory consumption and prevent OOM issues
  </Card>

  <Card title="HTTP Errors" icon="triangle-exclamation">
    Detect elevated 5xx server error rates
  </Card>

  <Card title="HTTP Latency" icon="gauge-high">
    Alert on slow response times and performance degradation
  </Card>

  <Card title="Missing Instances" icon="server">
    Get notified when services can't reach minimum instance count
  </Card>

  <Card title="Instance Restarts" icon="rotate">
    Track unexpected pod restarts and crashes
  </Card>

  <Card title="Auto-Scaling Limit" icon="arrow-up-to-line">
    Alert when service reaches maximum instance limit
  </Card>
</CardGroup>

### Creating an Alert

<Steps>
  <Step title="Select Alert Category">
    Choose the type of alert you want to create (CPU, Memory, HTTP Errors, etc.)
  </Step>

  <Step title="Configure Conditions">
    Define the trigger conditions for your alert:

    * View the underlying query powering the alert
    * Customize threshold values and duration
    * Preview how the condition evaluates against your metrics
  </Step>

  <Step title="Configure Alert Details">
    Set up the alert notification:

    * **Alert name**: Descriptive name for easy identification
    * **Notification message**: Custom text included in notifications
    * **Notification channel**: Select a configured notification channel (Slack or Email, with more integrations coming soon)
  </Step>
</Steps>

<img src="https://mintcdn.com/qovery/h4GYArJsH08SmRTF/images/observability/create-alert.png?fit=max&auto=format&n=h4GYArJsH08SmRTF&q=85&s=4f270ff4216f7bdacf640aca6fb369e1" alt="Creating an Alert" width="3164" height="2070" data-path="images/observability/create-alert.png" />

<Tip>
  Start with pre-configured alert templates and adjust thresholds based on your service's normal behavior.
</Tip>

### Alert Conditions Guide

Alert conditions determine when an alert fires. The sections below explain each configuration option.

#### Aggregation Methods (Maximum, Minimum, Average)

When monitoring metrics, you need to decide how to aggregate the data over the specified duration:

<AccordionGroup>
  <Accordion title="Maximum" icon="arrow-up">
    **When to use**: Detect peak usage or spikes

    * **CPU**: Alert when any instance hits high CPU, even briefly
    * **Memory**: Catch memory spikes before OOM kills
    * **Network Latency**: Detect worst-case response times (tail latency)

    **Example**: Alert when maximum CPU usage exceeds 80%. This triggers if any pod exceeds 80%, even if the others are idle.
  </Accordion>

  <Accordion title="Minimum" icon="arrow-down">
    **When to use**: Detect drops or missing resources

    * **CPU**: Identify underutilized services (cost optimization)
    * **Memory**: Detect memory leaks causing gradual drops in available memory
    * **Network Latency**: Ensure baseline performance (rarely used for latency)

    **Example**: Alert when the minimum number of running instances drops below 2. This triggers when you have fewer than 2 healthy pods.
  </Accordion>

  <Accordion title="Average" icon="minus">
    **When to use**: Monitor overall service health

    * **CPU**: Track average load across all instances
    * **Memory**: Monitor typical memory consumption
    * **Network Latency**: Measure average response time across requests

    **Example**: Alert when average CPU usage exceeds 70%. This triggers based on the mean CPU usage across all pods.
  </Accordion>
</AccordionGroup>
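
The difference between the three methods is easiest to see on the same data. The sketch below uses made-up CPU readings for three hypothetical pods and assumes the aggregation runs over every sample in the window; how Qovery groups samples internally is not specified here.

```python
from statistics import mean

# Hypothetical CPU usage (%) for three pods of one service over the alert window.
cpu_by_pod = {
    "pod-a": [20, 25, 22],
    "pod-b": [85, 90, 40],
    "pod-c": [10, 12, 11],
}

all_samples = [value for series in cpu_by_pod.values() for value in series]

maximum = max(all_samples)   # 90: one pod spiked
minimum = min(all_samples)   # 10: the idlest reading
average = mean(all_samples)  # 35: overall load looks moderate
```

With a threshold of 80%, a Maximum alert would flag this service while an Average alert would not, which is the trade-off between catching single-pod spikes and tracking overall load.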

#### Trigger Conditions

Define when the alert should fire based on the comparison between your metric and threshold:

<CardGroup cols={2}>
  <Card title="Above (>)" icon="greater-than">
    Triggers when metric **exceeds** the threshold

    **Common use**: CPU usage, memory usage, 5xx error rates, latency

    **Example**: CPU > 80%
  </Card>

  <Card title="Below (<)" icon="less-than">
    Triggers when metric **falls below** the threshold

    **Common use**: Available instances, request rate drops

    **Example**: Running instances \< 2
  </Card>

  <Card title="Equal (=)" icon="equals">
    Triggers when metric **exactly matches** the threshold

    **Common use**: Specific status codes, exact counts

    **Example**: Failed deployments = 3
  </Card>

  <Card title="Above or Equal (≥)" icon="greater-than-equal">
    Triggers when metric is **greater than or equal to** the threshold

    **Common use**: Cumulative thresholds

    **Example**: Error count ≥ 100
  </Card>

  <Card title="Below or Equal (≤)" icon="less-than-equal">
    Triggers when metric is **less than or equal to** the threshold

    **Common use**: Resource availability, uptime percentage

    **Example**: Available memory ≤ 20%
  </Card>
</CardGroup>
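
Each trigger condition is a plain comparison between the aggregated metric and the threshold. The sketch below maps the five conditions to Python's standard comparison operators; the condition labels and the `condition_met` helper are hypothetical names, not Qovery identifiers.

```python
import operator

# Map each trigger condition to the comparison it performs.
COMPARATORS = {
    "ABOVE": operator.gt,
    "BELOW": operator.lt,
    "EQUAL": operator.eq,
    "ABOVE_OR_EQUAL": operator.ge,
    "BELOW_OR_EQUAL": operator.le,
}


def condition_met(value: float, condition: str, threshold: float) -> bool:
    """Return True when the metric value satisfies the trigger condition."""
    return COMPARATORS[condition](value, threshold)
```

The strict and non-strict variants differ only at the boundary: a value of exactly 80 does not satisfy Above 80 but does satisfy Above or Equal 80.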

#### Duration

The duration specifies how long the condition must be true before the alert fires. This prevents false positives from temporary spikes.

**Duration options:**

* **Last 1 minute**: Very sensitive, catches issues immediately but may have false positives
* **Last 5 minutes**: Balanced approach, good for most alerts
* **Last 10 minutes**: Conservative, reduces noise but may delay critical alerts
* **Last 15 minutes**: Best for informational alerts or gradual trends

<Warning>
  Shorter durations increase alert sensitivity but also increase false positives. Longer durations reduce noise but may delay critical notifications.
</Warning>
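
The effect of the duration can be sketched as a check over recent evaluations. The example below assumes one evaluation per minute, which is an assumption made for illustration; the actual evaluation interval is not documented here, and `should_fire` is a hypothetical helper.

```python
def should_fire(breaches: list[bool], duration_minutes: int) -> bool:
    """Fire only if the condition held for every one of the last N minutes.

    `breaches` holds one boolean per minute, oldest first (an assumed sampling rate).
    """
    if duration_minutes < 1:
        raise ValueError("duration must be at least 1 minute")
    if len(breaches) < duration_minutes:
        return False  # not enough history yet
    return all(breaches[-duration_minutes:])


# A two-minute spike, a recovery, then a sustained five-minute breach.
history = [False, True, True, False, True, True, True, True, True]
```

Evaluated right after the initial spike (`history[:3]`), a 1-minute duration would fire while a 5-minute duration would not; only the sustained breach at the end satisfies the 5-minute duration.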

<img src="https://mintcdn.com/qovery/h4GYArJsH08SmRTF/images/observability/alert-conditions-configuration.png?fit=max&auto=format&n=h4GYArJsH08SmRTF&q=85&s=62e538894db2eeb6c8887d48d8d5e1fa" alt="Alert Conditions Configuration" width="3164" height="2070" data-path="images/observability/alert-conditions-configuration.png" />

#### Practical Examples by Metric Type

<Tabs>
  <Tab title="CPU">
    **Recommended configuration:**

    * **Aggregation**: Maximum (catches any pod spiking)
    * **Trigger condition**: Above 80%
    * **Duration**: Last 5 minutes

    **Query example:**

    ```
    SEND A NOTIFICATION WHEN THE MAXIMUM OF CPU
    FOR [SERVICE_NAME] IS ABOVE 80 %
    DURING THE LAST 5 MINUTES
    ```

    **Why**: Maximum catches any pod hitting high CPU; 80% leaves headroom before throttling; 5 minutes filters out temporary spikes.
  </Tab>

  <Tab title="Memory">
    **Recommended configuration:**

    * **Aggregation**: Maximum (prevents OOM kills)
    * **Trigger condition**: Above 85%
    * **Duration**: Last 5 minutes

    **Query example:**

    ```
    SEND A NOTIFICATION WHEN THE MAXIMUM OF MEMORY
    FOR [SERVICE_NAME] IS ABOVE 85 %
    DURING THE LAST 5 MINUTES
    ```

    **Why**: Maximum ensures that no pod approaches the OOM threshold unnoticed; 85% provides a buffer; 5 minutes leaves time to investigate before the situation becomes critical.
  </Tab>

  <Tab title="Network Latency">
    **Recommended configuration:**

    * **Aggregation**: Maximum (P99 latency)
    * **Trigger condition**: Above 1000 ms
    * **Duration**: Last 5 minutes

    **Query example:**

    ```
    SEND A NOTIFICATION WHEN THE MAXIMUM OF LATENCY (P99)
    FOR [SERVICE_NAME] IS ABOVE 1000 MS
    DURING THE LAST 5 MINUTES
    ```

    **Why**: Maximum captures the tail latency that affects user experience; 1000 ms is a common SLA threshold; 5 minutes confirms that the issue is sustained.
  </Tab>

  <Tab title="HTTP 5xx Errors">
    **Recommended configuration:**

    * **Aggregation**: Average (across all requests)
    * **Trigger condition**: Above 5%
    * **Duration**: Last 3 minutes

    **Query example:**

    ```
    SEND A NOTIFICATION WHEN THE AVERAGE OF 5XX ERROR RATE
    FOR [SERVICE_NAME] IS ABOVE 5 %
    DURING THE LAST 3 MINUTES
    ```

    **Why**: The average 5xx error rate reflects server-side issues; 5% indicates a significant problem; 3 minutes gives faster detection of service degradation.
  </Tab>
</Tabs>

<Tip>
  Use the **Main condition query** preview to verify your alert configuration before saving. The query shows exactly what condition will trigger notifications.
</Tip>
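
The query examples above all follow the same sentence template. The sketch below rebuilds that text from its parts, which can help when documenting alert rules outside the Console. It is an illustration only: `render_condition` and its parameters are hypothetical, and the Console generates the real preview.

```python
def render_condition(aggregation: str, metric: str, service: str,
                     condition: str, threshold: float, unit: str,
                     minutes: int) -> str:
    """Build the human-readable condition text in the style of the examples above."""
    return (
        f"SEND A NOTIFICATION WHEN THE {aggregation.upper()} OF {metric.upper()}\n"
        f"FOR {service} IS {condition.upper()} {threshold:g} {unit}\n"
        f"DURING THE LAST {minutes} MINUTES"
    )


# Reproduces the CPU example: maximum CPU above 80 % over the last 5 minutes.
cpu_query = render_condition("maximum", "cpu", "[SERVICE_NAME]", "above", 80, "%", 5)
```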

## Next Steps

<CardGroup cols={2}>
  <Card title="Slack Notifications" icon="slack" href="/configuration/integrations/slack">
    Set up Slack notification channels for alerts
  </Card>

  <Card title="Email Notifications" icon="envelope" href="/configuration/integrations/email">
    Set up email notification channels for alerts
  </Card>

  <Card title="Datadog" icon="dog" href="/configuration/integrations/observability/datadog">
    Add Datadog for advanced monitoring
  </Card>

  <Card title="Kubecost" icon="dollar-sign" href="/configuration/integrations/observability/kubecost">
    Monitor and reduce Kubernetes costs
  </Card>

  <Card title="Deployment Logs" icon="file-lines" href="/configuration/deployment/logs">
    View deployment and service logs
  </Card>

  <Card title="Application Config" icon="sliders" href="/configuration/application">
    Configure health checks and ports
  </Card>
</CardGroup>
