The Cost of `std::make_unique`

compiler-explorer link

Intro

What I’m about to write is fairly obvious, but it’s always interesting to observe this kind of behavior in the assembly.

A common pattern for factory functions is to return by value. This approach is supported by Return Value Optimization (RVO), which ensures that the return value is constructed directly in the final memory location, skipping intermediate copies. This process works even if the destination memory is located on the heap.

Heap memory should be managed with some sort of smart pointer, and to hide naked new, we can use make_... functions.

However, while RVO and make_ functions can work together, they are not optimal in combination: the use of make_ functions forces the constructor’s arguments to reside on the stack, which means that they will be copied into the heap destination. Let’s take a look at some assembly code to illustrate this.

The experiment

This is the type that we will construct from a factory. It’s “big” just to reiterate the kind of cost we could incur.

using big_obj = std::array<int, 600>;

We will build it from this factory and return a value. Maybe we are reading data from a file to populate the result? It’s noexcept so that the compiler generates a simpler assembly.

auto big_obj_factory() noexcept -> big_obj;

std::make_unique

So we want to create a std::unique_ptr<big_obj> from big_obj_factory(). A standard solution could be this:

auto use_make_unique() -> std::unique_ptr<big_obj> {
    return std::make_unique<big_obj>(big_obj_factory());
}

The assembly shows the potential pitfall of this approach:

use_make_unique():                   # @use_make_unique()
        push    r15
        push    r14
        push    rbx
        sub     rsp, 2400            # allocate stack space for big_obj
        mov     r15, rdi
        mov     r14, rsp
        mov     rdi, r14
        call    big_obj_factory()    # call big_obj_factory with a pointer to our stack
        mov     edi, 2400
        call    operator new(unsigned long) # allocate a big_obj with new
        mov     rbx, rax
        mov     edx, 2400
        mov     rdi, rax
        mov     rsi, r14
        call    memcpy               # copy our stack big_obj on the heap 
        mov     qword ptr [r15], rbx
        mov     rax, r15
        add     rsp, 2400
        pop     rbx
        pop     r14
        pop     r15
        ret

This is what’s essentially happening: to invoke make_unique, we construct a big_obj on the stack since this is where parameters go. Then, the value is moved to the final location, and the assembly of this function is virtually identical to the previous one.

auto pretend_use_make_unique() -> std::unique_ptr<big_obj> {
    auto obj = big_obj_factory();
    return std::unique_ptr<big_obj>(new big_obj(std::move(obj)));
}

Using std::unique_ptr directly

To ensure that big_obj_factory constructs the value directly on the heap, we need to skip the stack altogether. This works, at the cost of using new:

auto use_naked_new() -> std::unique_ptr<big_obj> {
    return std::unique_ptr<big_obj>{new big_obj{big_obj_factory()}};
}

And the assembly shows that we get the desired behavior:

use_naked_new():                     # @use_naked_new()
        push    r14
        push    rbx
        push    rax
        mov     r14, rdi
        mov     edi, 2400
        call    operator new(unsigned long)   # allocate a big_obj on the heap
        mov     rbx, rax
        mov     rdi, rax                      # pass the pointer to big_obj_factory           
        call    big_obj_factory()             # call big_obj_factory to populate the heap space
        mov     qword ptr [r14], rbx
        mov     rax, r14
        add     rsp, 8
        pop     rbx
        pop     r14
        ret

Using a wrapper constructor

The issue with make_unique arises because we materialize the result of big_obj_factory on the stack, since it needs to be passed as a parameter. However, we can achieve a similar outcome to using a raw new by embedding the factory directly within the constructor, as this wrapper does. That said, this approach somewhat undermines the purpose of the factory pattern and may violate the principle of least astonishment, as it hides much of the complexity within the constructor wrapper that utilizes the factory.

struct big_obj_wrapper {
    big_obj _detail{big_obj_factory()};
};

auto use_a_wrapper() -> std::unique_ptr<big_obj_wrapper> {
    return std::make_unique<big_obj_wrapper>();
}

The assembly looks similar to use_naked_new. Note, however, the call to memset: it results from calling the default constructor of std::array. This is likely a missed optimization since compilers (allegedly) find it difficult to prove that a write to 0 is not needed.

use_a_wrapper():                     # @use_a_wrapper()
        push    r14
        push    rbx
        push    rax
        mov     r14, rdi
        mov     edi, 2400
        call    operator new(unsigned long)   # allocate a big_obj on the heap
        mov     rbx, rax
        mov     edx, 2400
        mov     rdi, rax
        xor     esi, esi
        call    memset                        # call the base constructor of big_obj (memset to zero for std::array)
        mov     rdi, rbx                      # pass the pointer to big_obj_factory 
        call    big_obj_factory()             # call big_obj_factory to populate the heap space
        mov     qword ptr [r14], rbx
        mov     rax, r14
        add     rsp, 8
        pop     rbx
        pop     r14
        ret

Final Considerations

All in all, this is kind of an unneeded optimization, something that might make sense in an embedded setting. Still, I found it interesting to analyze this behavior and see how these two features compose.

Bonus: gcc on x64 generates rep movsd, which is an inlined form of memcpy, and I think it’s cool.