This is the first in a planned series of articles focusing on Prometheus, PromQL, Thanos, metrics, and observability in the k8s space. I will concentrate on less-explored areas, such as some of the challenging quirks of Prometheus's PromQL, adapting to PromQL from a relational-database SQL background, and running the Prometheus family in a memory-constrained environment.
Why does Prometheus remove metric names from results?
Prometheus's PromQL language removes metric names from the series it returns if the query performs almost any operation on the series it selects.
This name removal is often logical - for example, it no longer makes sense to return a series with the name containers_created_total if rate(...) has been applied to it, it has been divided by another value, its timestamp(...) has been obtained, it has been aggregated, etc. The exact conditions are not always well documented and can be surprising; for example, multiplying by 1 removes the metric name.
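For instance (reusing the metric name from above purely as an illustration), both of the following return anonymous series: the labels survive, but __name__ does not:

# rate() returns per-second rates rather than the original counter, so the name is dropped
rate(containers_created_total[5m])

# timestamp() replaces each value with its sample timestamp - the name is dropped again
timestamp(containers_created_total)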
TL;DR: Just tell me how to fix it
To preserve __name__ across a query, copy it to a different label name deep in the query, as soon as possible after the selector is evaluated, then restore it afterwards.
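A minimal sketch of that pattern (some_metric{job="example"} is a placeholder selector; a full worked example appears later in this article):

# 3. restore __name__ from the saved copy once the lossy operations are done
label_replace(
  # 2. an operation that would normally drop __name__ (save__name__ survives it)
  sum by (save__name__, instance) (
    # 1. copy __name__ into an ordinary label immediately after the selector
    label_replace(
      some_metric{job="example"},
      "save__name__", "$1", "__name__", "^(.+)$"
    )
  ),
  "__name__", "$1", "save__name__", "^(.+)$"
)

The saved save__name__ label stays on the output unless you aggregate it away afterwards, as the full example later in this article does.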
This won't work for range-vector selectors because Prometheus lacks label_replace for range vectors. This can be worked around to a limited extent by using subqueries, but doing so can have surprising and subtle consequences.
How is the name removed?
Prometheus removes metric names by dropping the internal __name__ label that carries the metric name from each series. The displayed metric name is just client-side pretty-printing for __name__.
That means that these are just two different ways to write exactly the same series:
series_name{label="value"}
{__name__="series_name", label="value"}You can observe this yourself by running the PromQL:
label_replace(vector(1), "__name__", "new_name", "", "")... which will produce the result
new_name{} 1When is __name__ removal undesirable?
Metric names are removed whenever any operation is performed on a series, except for last_over_time(...), first_over_time(...), or a filtering operation that does not change the series value; such filtering operations include ==, topk, etc.
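For example (process_resident_memory_bytes is used here purely as an illustration):

# keeps the metric name: values are unchanged, series are merely filtered
process_resident_memory_bytes > 100e6

# drops the metric name: bool rewrites every value to 0 or 1
process_resident_memory_bytes > bool 100e6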
Prometheus is over-enthusiastic with such removals though. Most notably, __name__ is removed when multiplying by 1, as in the common PromQL idiom for joining on an info-series and injecting RHS labels into the LHS:
series_to_filter
* on (common_labels)
group_left(labels_from_rhs_to_add_to_lhs)
group by (common_labels, labels_from_rhs_to_add_to_lhs) (
  series_to_filter_by{selector=~"matcher"}
)

This becomes a problem if series_to_filter is a selector that matches multiple series that differ only by __name__, such as when running queries that explore workload cardinality and label churn.
It can be a problem when a query intentionally wants to return, or process, series from different metrics for any reason.
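For example, the following query (assuming node_exporter's memory metrics, which share identical label sets) selects two metrics by a regex on __name__ and divides them; the division strips both names, leaving two series with the same labelset:

# fails: vector cannot contain metrics with the same labelset
{__name__=~"node_memory_MemFree_bytes|node_memory_MemAvailable_bytes"} / 1024 / 1024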
An example where __name__ removal breaks queries
A query may fail with:
vector cannot contain metrics with the same labelset

for what seems like no good reason, just because Prometheus has automatically removed the __name__.
As a contrived example, say I want to show the individual metrics with the most series emitted by individual pods:
topk(10, count by (namespace, pod, uid, __name__) ({pod=~".+"}))

This will run fine, and the metric names will be retained in the output.
But if I now want to filter this to only consider pods with a specific label, the query will fail
topk(10,
  count by (namespace, pod, uid, __name__) ({pod=~".+"})
  * on (uid)
  group_left()
  group by (uid) (
    kube_pod_labels{some_label=~".+"}
  )
)

... with
multiple matches for labels: grouping labels must ensure unique matches

which is PromQL's usual complaint when a label matching operation finds a many-to-many association between the left-hand and right-hand side labels. But no such association should exist here: only one value for each uid can exist on the right-hand side, and each (namespace, pod, uid, __name__) tuple is unique on the left-hand side.
What's happening here is that Prometheus is "helpfully" deleting __name__ for you when applying the * operation. Compare the output of:
process_resident_memory_bytes{container="prometheus"}

e.g.

process_resident_memory_bytes{container="prometheus", .....} => 693493760 @[1757552218.548]

with the output when * 1 is appended to the query:

process_resident_memory_bytes{container="prometheus"} * 1

which is:

{container="prometheus", ...} => 656150528 @[1757552297.024]

Notice that the metric name (internally, the __name__ label) has vanished.
That's why our query failed. Prometheus removed __name__, making our series tuples non-unique on (namespace, pod, uid).
This particular case can be worked around by switching to using and instead of * to filter:
topk(10,
  count by (namespace, pod, uid, __name__) ({pod=~".+"})
  and on (uid)
  group by (uid) (
    kube_pod_labels{some_label=~".+"}
  )
)

but this breaks down in more complex cases where you may need to inject labels from the RHS into the LHS using a non-empty group_left() operation, perform multiple stages of filtering, etc.
How to work around __name__ removal?
Prometheus 3 comes with the non-default --enable-feature=promql-delayed-name-removal option to delay __name__ removal. This can often, but not always, help with such cases. In particular, it'll fix many aggregations that act on __name__.
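For example, with the flag enabled, a grouping on __name__ downstream of rate() should see the real metric names rather than an empty label (a sketch, assuming Prometheus 3.x with the flag turned on; job="example" is a placeholder):

# without the flag, rate() has already stripped __name__ by the time sum groups on it;
# with promql-delayed-name-removal the names are still visible to the grouping
sum by (__name__) (rate({__name__=~".+_total", job="example"}[5m]))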
But sometimes you want to keep the name in the final output, too, particularly when introspecting your TSDB to discover patterns in the metrics themselves, identify cardinality issues, etc.
To preserve __name__ across a query, copy it to a different label name, then restore it afterwards.
For example, if I'm finding the highest-cardinality series grouped by the value of a custom namespace label exposed via kube-state-metrics, I might:
topk(10,
  count by (__name__, custom_namespace_label) (
    {namespace=~".+"}
    * on (namespace)
    group_left(custom_namespace_label)
    group by (namespace, custom_namespace_label) (
      kube_namespace_labels{custom_namespace_label=~".+"}
    )
  )
) by (custom_namespace_label)

but this will fail with
multiple matches for labels: grouping labels must ensure unique matches

because Prometheus removed __name__ during evaluation of the *.
To work around it, copy __name__ to a different label then copy it back again after the operations that would've dropped it have finished:
topk(10,
  # drop label used to save name, retain __name__
  max by (__name__, custom_namespace_label) (
    # restore name
    label_replace(
      count by (save__name__, custom_namespace_label) (
        # save name before it gets dropped
        label_replace(
          {namespace=~".+"},
          "save__name__", "$1", "__name__", "^(.+)$"
        )
        * on (namespace)
        group_left(custom_namespace_label)
        group by (namespace, custom_namespace_label) (
          kube_namespace_labels{custom_namespace_label=~".+"}
        )
      ),
      "__name__", "$1", "save__name__", "^(.+)$"
    )
  )
) by (custom_namespace_label)
What about range vectors?
Prometheus does not support label_replace over a range vector, so this workaround will not apply there.
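For instance, an attempt like the following (some_counter_total is a placeholder for any metric) is rejected at parse time:

label_replace(some_counter_total[5m], "save__name__", "$1", "__name__", "^(.+)$")

... with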
expected type instant vector in call to function "label_replace", got range vector

In most cases delayed name removal should help; at least, it'll delay the dropping of __name__ until a later phase in the query, where you can use the label_replace hack if you do need to retain the result.
In some cases you may be able to use a subquery instead, but this can have surprising and subtle consequences. At minimum, you'll always need to limit any range selector steps to the subquery step size, and use last_over_time explicitly on any instant vectors within the subquery.
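As a sketch of that approach (the metric, label, and 1m step here are placeholders; the inner last_over_time range is deliberately set equal to the subquery step):

max_over_time(
  label_replace(
    # explicit last_over_time with a range equal to the subquery step,
    # rather than relying on the default lookback window
    last_over_time(some_metric{job="example"}[1m]),
    "save__name__", "$1", "__name__", "^(.+)$"
  )[5m:1m]
)

max_over_time still drops __name__ here, but the saved save__name__ copy survives and can be restored with the label_replace trick described above.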