On this page, you can find some basic information on how to troubleshoot EDB Postgres for Kubernetes in your Kubernetes cluster deployment.
As a Kubernetes administrator, you should keep the kubectl Cheat Sheet page at hand.
Before you start
Clear information about the underlying Kubernetes system can make a real difference in a troubleshooting activity.
Make sure you know:
- the Kubernetes distribution and version you are using
- the specifications of the nodes where PostgreSQL is running
- as much as you can about the actual storage, including storage class and benchmarks you have done before going into production
- which relevant Kubernetes applications you are using in your cluster (e.g. Prometheus, Grafana, Istio, cert-manager, ...)
- the situation of continuous backup, in particular whether it is in place and working correctly: if it is not, make sure you take an emergency backup before performing any potentially disruptive operation
On top of the mandatory kubectl utility, for troubleshooting, we recommend the
following plugins/utilities to be available in your system:
- jq, a lightweight and flexible command-line JSON processor
- grep, which searches one or more input files for lines containing a match to a specified pattern. It is already available in most *nix distros. If you are on Windows OS, you can use findstr as an alternative to grep, or directly use wsl and install your preferred *nix distro and use the tools mentioned above.
In some emergency situations, you might need to take an emergency logical
backup of the main database.
The instructions you find below must be executed only in emergency situations,
and the temporary backup files must be kept under the data protection policies
that are effective in your organization. The dump file is stored
on the client machine that runs the kubectl command, so make sure that
all protections are in place and that you have enough space to store the
dump file.
The following example shows how to take a logical backup of the
cluster-example Postgres cluster, from one of its pods:
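A minimal sketch of such a backup is shown here. It assumes that the instance pod is named cluster-example-1, that PostgreSQL runs in a container named postgres, and that the database to dump is called app; all three names are assumptions to adapt to your environment, and emergency_dump is a hypothetical helper, not part of any tool.

```shell
# Hypothetical helper: dump one database from an instance pod to a local file.
# $1: pod name, $2: database name, $3: output file on the client machine.
emergency_dump() {
  pod="$1"; db="$2"; out="$3"
  # -Fc selects pg_dump's custom format; the dump is streamed through
  # kubectl exec and written on the machine that runs kubectl.
  # The container name "postgres" is an assumption.
  kubectl exec "$pod" -c postgres -- pg_dump -Fc -d "$db" > "$out"
}

# Example usage (assumed names):
# emergency_dump cluster-example-1 app app.dump
```

The helper only wraps a single kubectl exec call, so the same line can be run directly once the names are known.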
You can easily adapt the above command to back up your cluster, by providing the names of the objects you have used in your environment.
The above command issues a pg_dump command in custom format, which is the most
versatile way to take logical backups in PostgreSQL.
The next step is to restore the database. We assume that you are operating
on a new PostgreSQL cluster that has just been initialized (so the target database is still empty).
The following example shows how to restore the above logical backup in the
app database of the new-cluster-example Postgres cluster, by connecting to
the primary:
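A minimal sketch of the restore step follows. It assumes that the primary pod is named new-cluster-example-1, that PostgreSQL runs in a container named postgres, and that the restored objects should belong to a role named app; these names and the emergency_restore helper are assumptions for illustration only.

```shell
# Hypothetical helper: restore a custom-format dump into a database.
# $1: primary pod name, $2: database, $3: role to restore as, $4: dump file.
emergency_restore() {
  pod="$1"; db="$2"; role="$3"; dump="$4"
  # -i keeps stdin open so the local dump file can be streamed to pg_restore.
  # --no-owner skips the original ownership; --role sets the role used to
  # perform the restore. The container name "postgres" is an assumption.
  kubectl exec -i "$pod" -c postgres -- \
    pg_restore --no-owner --role="$role" -d "$db" --verbose < "$dump"
}

# Example usage (assumed names):
# emergency_restore new-cluster-example-1 app app app.dump
```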
The example in this section assumes that you have no other global objects
(databases and roles) to dump and restore, as per our recommendation. In case
you have multiple roles, make sure you have taken a backup of them
and that you manually restore them in the new cluster. In case you have multiple
databases, you need to repeat the above operation one database at a time, making
sure you assign the right ownership. If you are not familiar with PostgreSQL,
we advise that you do these critical operations under the guidance of
a professional support company.
The above steps might be integrated into the cnp plugin at some stage in the future.
Every resource created and controlled by EDB Postgres for Kubernetes logs to
standard output, as expected by Kubernetes, and directly in JSON
format. As a result, you should rely on the kubectl logs
command to retrieve logs from a given resource.
For more information, run kubectl logs --help.
JSON logs are great for machine reading, but hard to read for human beings.
Our recommendation is to use the jq command to improve usability. For
example, you can pipe the output of the kubectl logs command into | jq -C.
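As a small illustration of that pipe, the sketch below wraps it in a hypothetical pretty_logs helper. The helper name is an assumption; the arguments are whatever you would normally pass to kubectl logs.

```shell
# Hypothetical helper: colorized, pretty-printed JSON logs.
# All arguments are passed to kubectl logs unchanged,
# for example: pretty_logs -n <NAMESPACE> <POD>
pretty_logs() {
  # -C forces colored output; "." is the identity filter (print each entry).
  kubectl logs "$@" | jq -C .
}
```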
In the sections below, we show some examples of how to retrieve logs about different resources when troubleshooting EDB Postgres for Kubernetes.
By default, the EDB Postgres for Kubernetes operator is installed in the
postgresql-operator-system namespace in Kubernetes as a Deployment
(see the "Details about the deployment" section).
You can get a list of the operator pods by running:
Under normal circumstances, you should have one pod where the operator is
running.
In case you have set up your operator for high availability, you should have more entries.
Those pods are managed by a Deployment.
Collect the relevant information about the pod where the operator is running:
Then get the logs from the same pod by running:
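The operator checks above can be sketched as follows, assuming the default postgresql-operator-system namespace. The helper names are hypothetical, and the pod name passed to inspect_operator_pod must be taken from the pod list of your own installation.

```shell
# Default operator namespace mentioned above; change it if you installed
# the operator elsewhere.
OPERATOR_NS="postgresql-operator-system"

# List the operator pods.
list_operator_pods() {
  kubectl get pods -n "$OPERATOR_NS"
}

# Describe one operator pod, then print its logs.
# $1: pod name, as shown by list_operator_pods.
inspect_operator_pod() {
  kubectl describe pod -n "$OPERATOR_NS" "$1"
  kubectl logs -n "$OPERATOR_NS" "$1"
}
```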
Gather more information about the operator
Get logs from all pods in the EDB Postgres for Kubernetes operator Deployment (in case you have a multi-operator deployment) by running:
You can add the -f flag to the above command to follow logs in real time.
Save logs to a JSON file by running:
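One possible way to collect logs from every operator pod is to select the pods by label. The sketch below assumes the default postgresql-operator-system namespace; the label selector is deliberately left as a parameter, because the actual labels depend on your installation (you can inspect them with kubectl get pods --show-labels). The operator_logs helper is hypothetical.

```shell
OPERATOR_NS="postgresql-operator-system"

# $1: label selector matching the operator pods (an assumption: look it up
#     with: kubectl get pods -n "$OPERATOR_NS" --show-labels)
# Any further argument is passed to kubectl logs, for example -f.
operator_logs() {
  selector="$1"; shift
  # With a selector, kubectl logs prints only the last 10 lines per pod
  # by default; --tail=-1 asks for all of them.
  kubectl logs -n "$OPERATOR_NS" -l "$selector" \
    --all-containers=true --tail=-1 "$@"
}

# Follow logs in real time:
# operator_logs "<LABEL_SELECTOR>" -f
# Save logs to a file with one JSON entry per line:
# operator_logs "<LABEL_SELECTOR>" > operator-logs.json
```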
Get EDB Postgres for Kubernetes operator version by using
You can check the status of the <CLUSTER> cluster in its namespace with:
The above example reports a healthy PostgreSQL cluster of 3 instances, all in
ready state, and with <CLUSTER>-1 being the primary.
In case of unhealthy conditions, you can discover more by getting the manifest of the Cluster resource:
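Both checks can be sketched with plain kubectl get calls. This assumes that the Cluster custom resource can be addressed as cluster in your Kubernetes cluster (if another API uses the same name, the fully qualified resource name is needed instead). The helper names are hypothetical.

```shell
# $1: namespace, $2: cluster name

# One-line summary of the cluster.
cluster_summary() {
  kubectl get cluster -n "$1" "$2"
}

# Full manifest, including the status section, in YAML.
cluster_manifest() {
  kubectl get cluster -n "$1" "$2" -o yaml
}
```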
Another important piece of information to gather is the output of the
status command, as provided by the cnp plugin:
You can print more information by adding the
Besides knowing cluster status, you can also do the following things with the cnp plugin:
- Promote a replica.
- Perform a rollout restart of the cluster to apply configuration changes.
- Trigger a reconciliation loop to reload and apply configuration changes.
For more information, please see the cnp plugin documentation.
Get EDB PostgreSQL Advanced Server (EPAS) / PostgreSQL container image version:
You can also use
kubectl-cnp status -n <NAMESPACE> <CLUSTER_NAME>
to get the same information.
You can retrieve the list of instances that belong to a given PostgreSQL cluster with:
You can check if/how a pod is failing by running:
You can get all the logs for a given PostgreSQL instance with:
If you want to limit the search to the PostgreSQL process only, you can run:
The following example also adds the timestamp in a user-friendly format:
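The first steps above can be sketched as follows; the timestamp variant is not covered here. Two things are assumptions: the label selector that identifies the pods of a cluster (passed in as a parameter, since it depends on your installation), and the shape of the log entries, which are taken to carry a logger field and a record.message field. All helper names are hypothetical.

```shell
# $1: namespace, $2: label selector identifying the cluster's pods (assumption)
list_instances() {
  kubectl get pods -n "$1" -l "$2"
}

# $1: namespace, $2: pod name
describe_instance() {
  kubectl describe pod -n "$1" "$2"
}

# All logs of one PostgreSQL instance.
instance_logs() {
  kubectl logs -n "$1" "$2"
}

# Only the messages of the PostgreSQL process.
# Assumption: such entries have .logger == "postgres" and .record.message.
postgres_messages() {
  kubectl logs -n "$1" "$2" |
    jq -r 'select(.logger == "postgres") | .record.message'
}
```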
Gather and filter extra information about PostgreSQL pods
Check logs from a specific pod that has crashed:
Get FATAL errors from a specific PostgreSQL pod:
Filter PostgreSQL DB error messages in logs for a specific pod:
Get messages matching the err word from a specific pod:
Get all logs from PostgreSQL process from a specific pod:
Get pod logs filtered by fields with values and join them separated by
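A few of the filters listed above can be sketched like this. The --previous flag of kubectl logs returns the logs of the previous container instance, which is what you need after a crash. The FATAL filter assumes that the severity is stored in a record.error_severity field of each entry; that field name, like the helper names, is an assumption to verify against your own logs.

```shell
# $1: namespace, $2: pod name

# Logs of the previous (crashed) container instance of a pod.
crashed_logs() {
  kubectl logs -n "$1" "$2" --previous
}

# Log lines containing "err", case-insensitively.
err_lines() {
  kubectl logs -n "$1" "$2" | grep -i err
}

# Only FATAL entries (assumed field: .record.error_severity).
fatal_entries() {
  kubectl logs -n "$1" "$2" |
    jq -r '.record | select(.error_severity == "FATAL")'
}
```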
You can list the backups that have been created for a named cluster with:
Backup labelling was introduced in version 1.10.0 of EDB Postgres for Kubernetes, so only resources created with that version or a later one carry such a label.
Sometimes it is useful to double-check the StorageClass used by the cluster to have some more context during investigations or troubleshooting, like this:
We are taking the StorageClass from one of the cluster's pods here, since clusters are often created using the default StorageClass.
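One way to do this is to read the storage class name from a PersistentVolumeClaim used by one of the cluster's pods, then print that StorageClass. The sketch below is a minimal illustration; the PVC name must be supplied by you (kubectl get pvc -n <NAMESPACE> lists the candidates), and the helper names are hypothetical.

```shell
# $1: namespace, $2: name of a PVC used by one of the cluster's pods

# Print the name of the StorageClass requested by the PVC.
pvc_storage_class() {
  kubectl get pvc -n "$1" "$2" -o jsonpath='{.spec.storageClassName}'
}

# Print the full definition of that StorageClass (a cluster-scoped object).
show_storage_class() {
  sc=$(pvc_storage_class "$1" "$2")
  kubectl get storageclass "$sc" -o yaml
}
```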
Kubernetes nodes are where PostgreSQL pods ultimately run. It is strategically important to know as much as you can about them.
You can get the list of nodes in your Kubernetes cluster with:
Additionally, you can gather the list of nodes where the pods of a given cluster are running with:
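Both listings can be sketched with kubectl get and the -o wide output, which adds a NODE column to the pod list. The label selector that identifies the pods of a cluster is left as a parameter because it is an assumption about your installation; the helper names are hypothetical.

```shell
# All nodes in the Kubernetes cluster, with extra details.
list_nodes() {
  kubectl get nodes -o wide
}

# Pods of one PostgreSQL cluster together with the node each one runs on.
# $1: namespace, $2: label selector identifying the cluster's pods (assumption)
cluster_pod_nodes() {
  kubectl get pods -n "$1" -l "$2" -o wide
}
```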
The latter is important to understand how your pods are distributed, which is very useful if you are using affinity/anti-affinity rules and/or tolerations.
Like many native Kubernetes objects, the Cluster resource exposes
status.conditions as well. This allows one to 'wait' for a particular
event to occur instead of relying on the overall cluster health state. Available conditions as of now are:
- LastBackupSucceeded reports the status of the latest backup. It is set to True if the last backup was taken correctly, and to False otherwise.
- ContinuousArchiving reports the status of WAL archiving. It is set to True if the last WAL archival process terminated correctly, and to False otherwise.
- Ready is True when the cluster has the number of instances specified by the user and the primary instance is ready. This condition can be used in scripts to wait for the cluster to be created.
How to wait for a particular condition
- Ready (Cluster is ready or not):
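Waiting on a condition can be sketched with kubectl wait, which blocks until the named condition becomes true or the timeout expires. The sketch assumes the Cluster resource can be addressed as cluster; wait_cluster_ready is a hypothetical helper and the timeout value is only an example.

```shell
# $1: namespace, $2: cluster name, $3: timeout (for example 300s)
wait_cluster_ready() {
  # kubectl wait exits with a non-zero status if the timeout expires first,
  # so scripts can react to the failure.
  kubectl wait --for=condition=Ready "cluster/$2" -n "$1" --timeout="$3"
}

# Example usage:
# wait_cluster_ready <NAMESPACE> <CLUSTER> 300s
```

The same pattern applies to the other conditions by changing the name after condition=.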
Below is a snippet of a cluster.status that contains a failing condition.
Some common issues
Storage is full
If one or more pods in the cluster are in CrashLoopBackOff and logs
suggest this could be due to a full disk, you probably have to increase the
size of the instance's PersistentVolumeClaim. Please look at the
"Volume expansion" section in the documentation.
Pods are stuck in Pending
In case a Cluster's instance is stuck in the Pending phase, you should check
the pod's Events section to get an idea of the reasons behind this:
Some of the possible causes for this are:
- No nodes are matching the
- Tolerations are not correctly configured to match the nodes' taints
- No nodes are available at all: this could also be related to cluster-autoscaler hitting some limits, or having some temporary issues
In this case, it could also be useful to check events in the namespace:
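Both checks for a Pending pod can be sketched as follows. kubectl describe pod prints the Events section at the end of its output, and kubectl get events lists the events of the whole namespace; sorting by creation time is an optional convenience. The helper names are hypothetical.

```shell
# Details of one pod, including its Events section.
# $1: namespace, $2: pod name
pod_events() {
  kubectl describe pod -n "$1" "$2"
}

# All events in a namespace, oldest first.
# $1: namespace
namespace_events() {
  kubectl get events -n "$1" --sort-by=.metadata.creationTimestamp
}
```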
Replicas out of sync when no backup is configured
Sometimes replicas might be switched off for a bit of time due to maintenance
reasons (think of when a Kubernetes node is drained). In case your cluster
does not have backup configured, when replicas come back up, they might
require a WAL file that is not present anymore on the primary (having been
already recycled according to the WAL management policies, as mentioned in
the "postgresql section"), and fall out of synchronization.
Similarly, pg_rewind might require a WAL file that is not present
anymore in the former primary, reporting
pg_rewind: error: could not open file.
In these cases, pods cannot become ready anymore, and you are required to delete the PVC and let the operator rebuild the replica.
If you rely on dynamically provisioned Persistent Volumes, and you are confident in deleting the PV itself, you can do so with: