Changefeed Monitoring Guide


CockroachDB changefeeds are long-running jobs that work as a pipeline of change events emitted from a watched table through to the downstream sink. The change events from the storage layer of CockroachDB progress through the changefeed pipeline and are encoded into messages that the changefeed job delivers to a sink, such as Kafka.

This guide provides recommendations for monitoring and alerting on changefeeds throughout the pipeline to ensure reliable operation and quick problem detection.

Tip:

For details on how changefeeds work as jobs in CockroachDB, refer to the technical overview.

Overview

To monitor changefeed jobs effectively, you need visibility into both the high-level metrics that track the overall health of a changefeed and the metrics that track each stage of the changefeed pipeline.

The changefeed pipeline contains three main sections that start at the storage layer of CockroachDB and end at the downstream sink with message delivery. The work completed in each of these pipeline sections feeds metrics that you can track and use to identify where issues could occur.

An overview of the changefeed pipeline and the metrics that are connected to each stage.

Where noted in the following sections, you can use changefeed metrics labels to measure metrics per changefeed.

You can enable the Datadog integration on your cluster to collect data and alert on selected CockroachDB metrics using the Datadog platform.

Note:

Metric names in Prometheus replace the . with _; for example, changefeed.max_behind_nanos is exported as changefeed_max_behind_nanos. In Datadog, metric names can differ; refer to the Datadog metrics list.

Overall pipeline metrics

High-level performance metrics

  • Metrics:
    • changefeed.max_behind_nanos
      • Description: The maximum lag in nanoseconds between the timestamp of the most recent resolved timestamp emitted by the changefeed and the current time. Indicates how far behind the changefeed is in processing changes.
      • Use with metrics labels (supported in v24.3.5+).
      • Investigation needed: If changefeed.max_behind_nanos is consistently increasing.
    • (now() - changefeed.checkpoint_progress)
      • Description: The progress of changefeed checkpointing. Indicates how recently the changefeed state was persisted durably. Critical for monitoring changefeed recovery capability.
      • Investigation needed: If checkpointing falls too far behind the current time.
  • Impact:
    • Slow processing of changes and updates to downstream sinks.
    • Recovery time after failures, potential for increase in change backlog.
    • Ability to resume from last known good state.
    • Resource usage during catch-up after restarts or failures.
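The two lag signals above can be wired into Prometheus alerting rules. The following is a sketch only: it assumes the dot-to-underscore metric name rewriting described earlier, assumes changefeed.checkpoint_progress is exported as a Unix timestamp in seconds (verify the unit on your cluster), and uses illustrative 10-minute thresholds that you should tune for your workload.

```yaml
groups:
  - name: changefeed-lag
    rules:
      - alert: ChangefeedMaxBehindNanosHigh
        # changefeed.max_behind_nanos exported as changefeed_max_behind_nanos.
        # 600e9 ns = 10 minutes; tune for your workload.
        expr: changefeed_max_behind_nanos > 600e9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Changefeed resolved timestamps are more than 10 minutes behind."
      - alert: ChangefeedCheckpointStale
        # Assumes changefeed.checkpoint_progress is a Unix timestamp in seconds.
        expr: (time() - changefeed_checkpoint_progress) > 600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Changefeed checkpoint has not advanced in over 10 minutes."
```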

End-to-end performance metrics

End-to-end commit latency

  • Metric: changefeed.commit_latency
  • Description: The time between when a transaction commits and when the changefeed emits the corresponding changes. This time includes rangefeed and buffer delays.
  • Use with metrics labels.
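If you graph this histogram on a dashboard, a Prometheus recording rule keeps queries cheap. A sketch, assuming the histogram buckets are exported as changefeed_commit_latency_bucket with values in nanoseconds, and that metrics labels appear as a scope label:

```yaml
groups:
  - name: changefeed-commit-latency
    rules:
      # p99 commit latency per changefeed scope over a 5-minute window.
      - record: changefeed:commit_latency:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, scope) (rate(changefeed_commit_latency_bucket[5m]))
          )
```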

Batch latency

  • Metric: changefeed.sink_batch_hist_nanos
  • Description: The end-to-end latency from the change event being put in a batch to the changefeed emitting the event message to the sink.
  • Investigation needed: If this metric has consistently high values, it indicates that there are bottlenecks in the pipeline.
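To catch sustained batch-latency problems, you can alert on a high percentile of this histogram. A sketch, assuming Prometheus-style metric names; the 30-second threshold is an illustrative placeholder:

```yaml
groups:
  - name: changefeed-batch-latency
    rules:
      - alert: ChangefeedSinkBatchLatencyHigh
        # 30e9 ns = 30 seconds; an illustrative threshold.
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(changefeed_sink_batch_hist_nanos_bucket[5m]))
          ) > 30e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 sink batch latency above 30s; check for pipeline bottlenecks."
```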

Protected timestamp age

  • Metric: jobs.changefeed.protected_age_sec
  • Description: The age of the protected timestamp record for the changefeed.
  • Investigation needed: If this metric has high values, it may indicate changefeed processing delays or resource constraints.
  • Impact: High values can block garbage collection and lead to increased disk usage and degraded performance for some workloads.
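Because a stuck protected timestamp blocks garbage collection, it is worth alerting before the age approaches your configured gc.ttlseconds. A sketch with an illustrative 4-hour threshold; pick a value below your own GC TTL:

```yaml
groups:
  - name: changefeed-protected-timestamps
    rules:
      - alert: ChangefeedProtectedTimestampOld
        # Alert well before the protected age approaches your configured
        # gc.ttlseconds; 14400 s (4 hours) is an illustrative threshold.
        expr: jobs_changefeed_protected_age_sec > 14400
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Changefeed protected timestamp is over 4 hours old; GC may be blocked."
```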

Progress tracking

  • Metrics:
    • changefeed.max_behind_nanos
    • (now() - changefeed.checkpoint_progress)
  • Description: These metrics measure how far behind the changefeed is from the current time.
  • Investigation: A growing delay indicates that changefeed processing and message emission cannot keep up with the rate of change events.

Pipeline section metrics

Rangefeed

Buffer memory usage

  • Metric: changefeed.buffer_entries.allocated_mem.rangefeed
  • Description: The current quota pool memory allocation between the rangefeed and the KV feed.
  • Impact: High memory usage may indicate backpressure.

Rangefeed buffer latency

  • Metrics:
    • changefeed.stage.rangefeed_buffer_checkpoint.latency (count/sum/bucket)
    • changefeed.stage.rangefeed_buffer_value.latency (count/sum/bucket)
  • Description: Latency within the rangefeed section of the changefeed pipeline.
  • Impact: Indicates potential scanning or changefeed catch-up issues.
  • Use with metrics labels.
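To track this stage per changefeed on a dashboard, a recording rule over the value-latency histogram works well. A sketch, assuming Prometheus-style metric names and that metrics labels are exported as a scope label:

```yaml
groups:
  - name: changefeed-rangefeed-buffer
    rules:
      # p99 latency of values moving through the rangefeed buffer, per scope.
      - record: changefeed:rangefeed_buffer_value_latency:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, scope)
              (rate(changefeed_stage_rangefeed_buffer_value_latency_bucket[5m]))
          )
```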

KV feed

  • Metric: changefeed.stage.kv_feed_wait_for_table_event.latency (count/sum/bucket)
  • Description: Latency within the processing section of the changefeed pipeline.
  • Impact: Potential bottlenecks in encoding, batching, sending data, or at the sink. Use this metric in conjunction with the downstream delivery metrics.
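For dashboarding this stage, the same histogram-quantile pattern applies. Note that a long wait for table events can simply reflect a quiet table, so this is best read as context alongside the downstream delivery metrics rather than alerted on in isolation. A sketch, assuming Prometheus-style metric names:

```yaml
groups:
  - name: changefeed-kv-feed
    rules:
      # p99 wait time for table events over a 5-minute window.
      - record: changefeed:kv_feed_wait_for_table_event_latency:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le)
              (rate(changefeed_stage_kv_feed_wait_for_table_event_latency_bucket[5m]))
          )
```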

Processing — aggregation and encoding

Aggregator buffer health

  • Metrics:
    • changefeed.buffer_entries.allocated_mem.aggregator
    • changefeed.stage.kv_feed_buffer.latency (count/sum/bucket)
  • Description: changefeed.buffer_entries.allocated_mem.aggregator reports the current quota pool memory allocation between the KV feed and the sink. changefeed.stage.kv_feed_buffer.latency reports latency within the processing section of the changefeed pipeline.
  • Use with metrics labels.

Encoding performance

  • Metric: changefeed.stage.encode.latency (count/sum/bucket)
  • Description: Latency encoding data within the processing section of the changefeed pipeline.
  • Impact: High encoding latency can create a bottleneck for the entire changefeed pipeline.
  • Use with metrics labels.
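Since slow encoding can stall the whole pipeline, an alert on a high encode-latency percentile is a reasonable guardrail. A sketch, assuming Prometheus-style metric names and a scope label for metrics labels; the 1-second threshold is illustrative:

```yaml
groups:
  - name: changefeed-encoding
    rules:
      - alert: ChangefeedEncodeLatencyHigh
        # 1e9 ns = 1 second; an illustrative threshold per changefeed scope.
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, scope) (rate(changefeed_stage_encode_latency_bucket[5m]))
          ) > 1e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 encode latency above 1s; encoding may be bottlenecking the pipeline."
```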

Sink

Sink performance

Note:

This metric is supported for webhook, Google Cloud Pub/Sub, and the latest version of the CockroachDB Kafka sink.

  • Metric: changefeed.parallel_io_queue_nanos
  • Description: The time that outgoing requests to the sink spend waiting in a queue due to in-flight requests with conflicting keys.
  • Use with metrics labels.
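One way to read this metric is as queueing time accrued per second of wall time: a sustained rise suggests key-conflict contention at the sink. A sketch, assuming the metric is exported as a counter of nanoseconds:

```yaml
groups:
  - name: changefeed-sink-queue
    rules:
      # Nanoseconds of sink-queue wait accrued per second of wall time.
      - record: changefeed:parallel_io_queue:wait_ns_per_sec_5m
        expr: rate(changefeed_parallel_io_queue_nanos[5m])
```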

Sink errors

  • Metric: changefeed.sink_errors
  • Description: The number of changefeed errors caused by the sink.
  • Impact: Indicates connectivity or downstream processing issues.
  • Use with metrics labels.
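Because any sink error warrants a look at sink availability and the changefeed job status, a simple counter-based alert is usually enough. A sketch, assuming Prometheus-style metric names:

```yaml
groups:
  - name: changefeed-sink-errors
    rules:
      - alert: ChangefeedSinkErrors
        # Fires if any sink errors were recorded in the last 10 minutes.
        expr: increase(changefeed_sink_errors[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Changefeed sink errors observed in the last 10 minutes."
```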

Downstream delivery

  • Metrics:
    • changefeed.stage.downstream_client_send.latency (count/sum/bucket)
    • changefeed.internal_retry_message_count
  • Description: changefeed.stage.downstream_client_send.latency measures latency when flushing messages from a sink's client to the downstream sink. (This includes sends that failed for most, but not all, sinks.) changefeed.internal_retry_message_count counts the messages that an aggregator node attempted to retry.
  • Impact: Indicates connectivity or downstream processing issues.
  • Scoped by changefeed_job_id.
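These two metrics pair naturally: a send-latency recording rule for dashboards plus a retry-rate alert. A sketch, assuming Prometheus-style metric names and a changefeed_job_id label:

```yaml
groups:
  - name: changefeed-downstream-delivery
    rules:
      # p99 latency of client sends to the downstream sink, per job.
      - record: changefeed:downstream_client_send_latency:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, changefeed_job_id)
              (rate(changefeed_stage_downstream_client_send_latency_bucket[5m]))
          )
      - alert: ChangefeedRetriesClimbing
        # Sustained retries point at connectivity or downstream issues.
        expr: rate(changefeed_internal_retry_message_count[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Changefeed aggregator is retrying message sends."
```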
