Dev Corner: error context stack corruption

Craig Ringer

February 23, 2018

PostgreSQL uses error context callbacks to allow code paths to annotate errors with additional information. For example, pl/pgsql uses them to add a CONTEXT message reporting the procedure that was executing at the time of the error.

But if you get it wrong when you use one in an extension or a patch to core, it can be quite hard to debug. I’d like to share some hints here for people involved in PostgreSQL’s C core and extensions.

Intro to error contexts

(If you know the postgres backend code, skip to the next heading).

Say you have a function like the following utterly contrived example:


void
my_func(bool do_it)
{
    if (!do_it)
    {
        elog(WARNING, "not doing it!");
        return;
    }

    do_the_thing();
}

and you want to report on errors that occur anywhere in it, even in code called by do_the_thing() that may be far away in different modules of PostgreSQL. So that you know that you reached that code via my_func() and what the value of do_it was.

You can add an error context callback, which pushes a function pointer + optional argument onto the head of a linked list of callbacks. The head is in a global error_context_stack. Typically the entries are stack-allocated, e.g.


struct my_func_ctx_arg
{
    bool do_it;
};
 
static void
my_func_ctx_callback(void *arg)
{
    struct my_func_ctx_arg *ctx_arg = arg;
    errcontext("during my_func(do_it=%d)", ctx_arg->do_it);
}

void
my_func(bool do_it)
{
    ErrorContextCallback myerrcontext;
    struct my_func_ctx_arg ctxinfo;

    ctxinfo.do_it = do_it;
    myerrcontext.callback = my_func_ctx_callback;
    myerrcontext.arg = &ctxinfo;
    myerrcontext.previous = error_context_stack;
    error_context_stack = &myerrcontext;

    if (!do_it)
    {
        elog(WARNING, "not doing it!");
        return;
    }

    do_the_thing();

    Assert(error_context_stack == &myerrcontext);
    error_context_stack = myerrcontext.previous;
}

It’s a bit verbose, but it gives you much more useful messages in important cases. For example


ERROR: relcache lookup for 2132 failed

isn’t that informative. But something like:


ERROR: relcache lookup for 2132 failed
CONTEXT: in my_extension_func(...) with user_callback_fn="user_func"

gives you a lot more of a hint about where to look.

There’s a bug

Fry from Futurama ponders: "Not sure if I should *facepalm* or *headdesk*"

A day in the life of a developer

OK, so we know what errcontext callbacks are. But when we added the above code, suddenly our postgres starts crashing… sometimes. Backtraces show that the crashes are usually in errfinish(), but in random and unpredictable places.


#0  0x0000000000000014 in ??
#1  0x000000000084bb88 in errfinish (dummy=) at elog.c:439
... some unrelated stack that doesn't mention my_func here ...

Much head scratching occurs. Valgrind is brought to bear, and maybe it complains about an invalid access in elog.c just before the crash, but says the pointed-to memory was not recently allocated or freed, and can’t really tell you anything more than the crash backtrace did.

You can see that the error context stack is mangled in gdb:


(gdb) set print pretty on
(gdb) p * error_context_stack
$4 = {
  previous = 0x4d430004, 
  callback = 0x4d430004, 
  arg = 0x18
}
(gdb) p *error_context_stack->previous
Cannot access memory at address 0x4d430004
(gdb)

but not why. The contents seem to vary randomly and are often null. Printing the memory around the pointer to error_context_stack doesn’t tend to reveal anything that jumps out at you. (Or didn’t to me, anyway; if you did more low level work and asm you might recognise it.)

When I present it like this, you can probably guess why. The problem is in my_func even though it doesn’t appear in any of the crashes, valgrind won’t report on it, etc. And it’s not directly in the new code, so in a larger or more complex function it might be harder to spot.


    if (!do_it)
        return; /*  --------------------- HERE */

See, on this path we failed to pop the error context stack. So error_context_stack still points inside the now-popped stack frame of myfunc, at myerrcontext. It doesn't point at the code address of myfunc so gdb won't give you a hint like myfunc+12, it's just some random stack space.

If we call ereport or elog now, they'll still work fine because the popped stack's contents aren't immediately overwritten. They'll access released stack frame contents, but valgrind doesn't check for that and won't care or complain.

If at some later stage we make calls that use up that stack space again, we may (or may not, depending on layout details, what the calls are, etc) overwrite the pointed-to memory with something else. At which point if we call ereport or elog we might crash. Or hey, we might not, if whatever's pointed to doesn't fault when treated as instructions.

Especially in an optimised binary, the crash can be unpredictable and come much later than the creation of the problem.

gcc's -fstack-protector-all won't help you either, since there's no stack-overwrite happening. Just a pointer to invalid stack frame contents.

So I thought I'd make this a bit more google-able to save the next person some hassle.
I'm sure this would be blindingly obvious to many people. But in a decent sized code base with a fair few changes across multiple modules, it gets more challenging.

The fix

The fix is trivial once you know where the problem is: pop the error context stack or, preferably, restructure the function to have a single non-error exit path, e.g.



    if (do_it)
        do_the_thing();
    else
        elog(WARNING, "not doing it!");

Yeah, in this contrived example it's hard to imagine why you'd write it any other way in the first place. But single-return isn't always worth the code contortions in more complex logic. And in some places in Pg return may be masked by macros.

Plus, I'm sure you're not the one who wrote the problem code anyway? Right?


git blame buggy_file.c

.... dammit. Yes, you were, you just forgot. Past-me, you write terrible code and your breath smells of onions.

Prevention

I wonder if a static checker could be taught to detect this issue by looking for return-paths? Or, in fact, already does? Hints welcomed, especially for something that won't spew false positives.

Debugging

I edited src/include/pg_config_manual.h to enable USE_VALGRIND. Added this to elog.h:


extern void verify_errcontext_stack(void)

and this to elog.c:


void
verify_errcontext_stack(void)
{
    ErrorContextCallback *econtext;
    for (econtext = error_context_stack;
         econtext != NULL;
         econtext = econtext->previous)
    {
        Assert(econtext != NULL);
#ifdef USE_VALGRIND
        VALGRIND_CHECK_VALUE_IS_DEFINED(econtext);
        Assert(VALGRIND_CHECK_MEM_IS_ADDRESSABLE(econtext, sizeof(ErrorContextCallback)) == 0);
        VALGRIND_CHECK_VALUE_IS_DEFINED(econtext->previous);
        if (econtext->previous != NULL)
        {
            VALGRIND_CHECK_MEM_IS_ADDRESSABLE(econtext->previous, sizeof(ErrorContextCallback));
            VALGRIND_CHECK_VALUE_IS_DEFINED(econtext->previous->callback);
            VALGRIND_CHECK_VALUE_IS_DEFINED(econtext->previous->previous);
        }
#endif
        Assert(econtext->previous == NULL
               || econtext->previous->callback != NULL);
    }
}

then scattered calls to it around elog.c.

This helped detect the fault earlier. It proved important to also build with optimisation entirely disabled:


CFLAGS="-O0 -ggdb -g3" ./configure --prefix=/home/craig/pg/10 --enable-cassert --enable-debug --enable-tap-tests

With these two changes, crashes occurred much closer to the actual callsite, relatively shortly after the problem. If you scatter calls around, particularly before and after calls to any function that sets up an error context, it'll help you narrow things down quickly.

In this Article

Intro to error contexts
There’s a bug
The fix
Prevention
Debugging

Resource Feature Callout 1

Dev Corner: error context stack corruption

Craig Ringer

Intro to error contexts

There’s a bug

The fix

Prevention

Debugging

More Blogs

Cloud Neutral Postgres Databases with Kubernetes and CloudNativePG

Scaling Breakthrough: EPAS 17 Scales Twice as Well, Thanks PostgreSQL!

How CloudNativePG Manages Physical Replication Slots for PostgreSQL in Kubernetes