feat: metricsV2 + oTel + prometheus sample and Grafana dashboard#3154
feat: metricsV2 + oTel + prometheus sample and Grafana dashboard#3154csviri wants to merge 64 commits intooperator-framework:nextfrom
Conversation
c7e6ca2 to
ece63e8
Compare
88c118c to
cf9eb57
Compare
24f992f to
0438611
Compare
9014b2b to
fe4ab6a
Compare
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
…torsdk/operator/sample/metrics/MetricsHandlingSampleOperator.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
…tor/api/monitoring/Metrics.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
…operatorsdk/operator/sample/deployment.yaml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
d60193a to
cf114e3
Compare
| | Meter name (Micrometer) | Type | Tags | Description | | ||
| |------------------------------------------|---------|-----------------------------------|----------------------------------------------------------------------| | ||
| | `reconciliations.executions` | gauge | `controller.name` | Number of reconciler executions currently in progress | | ||
| | `reconciliations.active` | gauge | `controller.name` | Number of resources currently queued for reconciliation | |
There was a problem hiding this comment.
is "active" really a good name for this?
| private static final String RECONCILIATIONS_STARTED = RECONCILIATIONS + "started" + TOTAL_SUFFIX; | ||
|
|
||
| private static final String RECONCILIATIONS_EXECUTIONS_GAUGE = RECONCILIATIONS + "executions"; | ||
| private static final String RECONCILIATIONS_QUEUE_SIZE_GAUGE = RECONCILIATIONS + "active"; |
There was a problem hiding this comment.
queue would be better name
There was a problem hiding this comment.
The thing is that queue is a bit misleading, because this is actually the queue + ongoing reconiliations, that is the reason I added active. (ExecutorService has a queue). Can expand the docs on this
There was a problem hiding this comment.
Actually made an adjustment so we have now active and queue where queue is that are submitted to executor service but not started yet, the active is the number of actually running threads with reconiliation logic. This seems to be more intuitive and useful.
- deleted redundant delete events - adjusted metrics now we have queue and active reconciliations (with the intuitive seantics) Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
| ``` No newline at end of file | ||
| ``` | ||
|
|
||
| ## Metrics interface changes |
There was a problem hiding this comment.
This would technically be an API break and would require a new major version.
|
|
||
| # Check if helm is installed, download locally if not | ||
| echo -e "\n${YELLOW}Checking helm installation...${NC}" | ||
| if ! command -v helm &> /dev/null; then |
There was a problem hiding this comment.
Installing helm should be made optional. People might not want to have helm automatically installed or want to install it themselves some other way.
| import java.util.function.Consumer; | ||
| import java.util.function.Function; | ||
|
|
||
| import org.jspecify.annotations.NonNull; |
There was a problem hiding this comment.
This should not have been introduced here. Nullability should be reviewed at the SDK level, not simply here for 2 methods.
|
|
||
| \* `namespace` tag is only included when `withNamespaceAsTag()` is enabled. | ||
|
|
||
| The execution timer uses explicit SLO boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure |
There was a problem hiding this comment.
SLO acronym should be used expanded first before.
| \* `namespace` tag is only included when `withNamespaceAsTag()` is enabled. | ||
|
|
||
| The execution timer uses explicit SLO boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure | ||
| compatibility with `histogram_quantile()` queries in Prometheus. This is important when using the OTLP registry, where |
There was a problem hiding this comment.
OTLP acronym should be used expanded first.
| if (resourceEvent.getAction() == ResourceAction.ADDED) { | ||
| gauges.get(numberOfResourcesRefName(getControllerName(metadata))).incrementAndGet(); | ||
| } | ||
| var namespace = resourceEvent.getRelatedCustomResourceID().getNamespace().orElse(null); |
There was a problem hiding this comment.
Shouldn't that be in sync with what's done for MDC, i.e. use a default value instead of null for clustered resources?
|
|
||
| private void incrementCounter( | ||
| String counterName, String namespace, Map<String, Object> metadata, Tag... additionalTags) { | ||
| final var tags = new ArrayList<Tag>(2 + additionalTags.length); |
There was a problem hiding this comment.
This shouldn't be a List but a Set.
| null, | ||
| metadata, | ||
| Tag.of(EVENT, event.getClass().getSimpleName()), | ||
| Tag.of(ACTION, UNKNOWN_ACTION)); |
|
|
||
| int retryNumber = retryInfo.map(RetryInfo::getAttemptCount).orElse(0); | ||
| if (retryNumber > 0) { | ||
| incrementCounter(RECONCILIATIONS_RETRIES_NUMBER, namespace, metadata); |
There was a problem hiding this comment.
How does the retry number get propagated? Seems like the counter that is incremented is completely decorrelated from the actual retry number?
Signed-off-by: Chris Laprun <metacosm@gmail.com>



Goal of this PR is to provide a OTel + Prometheus + Grafana setup. So we:
Notes on new metrics implementation: