GCC May “Save” You Some Recursive Functions Calls: an Analysis of a Function Call Stack Length Example
Posted on In Linux, Programming, TutorialWe know compilers like gcc can do lots smart optimization to make the program run faster. Regarding functions call optimization, gcc can do tail-call elimination to save the cost of allocating a new stack frame, and tail recursion elimination to turn a recursive function to non-recursive iterative one. gcc can even transform some recursive functions that are not tail-recursive into a tail-recursive one so that the compiler can then do tail recursion elimination. But what will happen if a function can not be optimized using tail recursion elimination because of some non-safe operations inside of the function body? In this post, we analyze one example.
Table of Contents
The C program we analyze
The prog.c
program we analyze is as follows.
#include <stdio.h>
void RecursiveFunction( int recursion_times )
{
printf("stack: %p\n", (void*)&recursion_times);
if (recursion_times <= 1) return;
return RecursiveFunction( --recursion_times );
}
int main(int argc, char* args[])
{
RecursiveFunction(4);
return 0;
}
We pass the &recursion_times
into another function which may change its value. C/C++ require each variable, including multiple instances of the same variable in recursive calls, to have distinct locations. The number of recursion_times
variables are only known during run time. So tail recursion elimination should not be done simply here by the compiler although RecursiveFunction
is tail recursive. So will the compiler just stop here and do nothing to optimize it? Let’s see by running it with different optimization levels.
The execution results
We use a script run.sh
to try different cases and disassemble the executable binary files generated by gcc.
#!/bin/bash
set -o errexit
for opt in 0 1 2 s 3 ; do
echo "Begin -O$opt"
gcc -fno-stack-protector -O$opt prog.c -o prog.$opt
./prog.$opt
objdump -d ./prog.$opt > prog.$opt.as
# gcc -fno-stack-protector -O$opt -Wa,-adhln -g prog.c > prog.$opt.list
echo "End -O$opt"
done
The -fno-stack-protector
option tells gcc not to generate stack protection code so that assembly code is cleaner to read.
I tried it on Ubuntu 18.04 with gcc 7.5.0 (gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
) and the results are as follows.
$ ./run.sh
Begin -O0
stack: 0x7fff3690668c
stack: 0x7fff3690666c
stack: 0x7fff3690664c
stack: 0x7fff3690662c
End -O0
Begin -O1
stack: 0x7ffde5213d4c
stack: 0x7ffde5213d2c
stack: 0x7ffde5213d0c
stack: 0x7ffde5213cec
End -O1
Begin -O2
stack: 0x7ffd6a09587c
stack: 0x7ffd6a09585c
stack: 0x7ffd6a09583c
stack: 0x7ffd6a09581c
End -O2
Begin -Os
stack: 0x7ffef5b6ce9c
stack: 0x7ffef5b6ce7c
stack: 0x7ffef5b6ce5c
stack: 0x7ffef5b6ce3c
End -Os
Begin -O3
stack: 0x7ffd414fb53c
stack: 0x7ffd414fb54c
stack: 0x7ffd414fb50c
stack: 0x7ffd414fb51c
End -O3
For optimization levels 0,1,2 and s, the results are consistent. For each function calls, the stack frame increases (address decreases) by 0x20
bytes. The interesting one is the result when gcc optimization level is 3. The stack frame decreases by 0x10
bytes and then increases by 0x40
bytes. Why does the stack frame wigwag? Let’s take a look at the assembly code for the RecursiveFunction
.
The generated code at optimization level 0
Let’s first take a look at the generated code in prog.0.as
for optimization level 0. It is quite straightforward and the flow follows the C code’s order. I added annotations.
000000000000064a <RecursiveFunction>:
# save old stack frame base in stack, use 0x8 bytes on stack
64a: 55 push %rbp
# new stack frame base at old stack pointer
64b: 48 89 e5 mov %rsp,%rbp
# allocate 0x10 for the new stack
64e: 48 83 ec 10 sub $0x10,%rsp
# store recursion_times into stack
652: 89 7d fc mov %edi,-0x4(%rbp)
# get &recursion_times
655: 48 8d 45 fc lea -0x4(%rbp),%rax
# call printf()
659: 48 89 c6 mov %rax,%rsi
65c: 48 8d 3d d1 00 00 00 lea 0xd1(%rip),%rdi # 734 <_IO_stdin_used+0x4>
663: b8 00 00 00 00 mov $0x0,%eax
668: e8 b3 fe ff ff callq 520 <printf@plt>
# get recursion_times to %eas
66d: 8b 45 fc mov -0x4(%rbp),%eax
# if (resursion_times <= 1) return
670: 83 f8 01 cmp $0x1,%eax
673: 7e 15 jle 68a <RecursiveFunction+0x40>
# do --resursion_times
675: 8b 45 fc mov -0x4(%rbp),%eax
678: 83 e8 01 sub $0x1,%eax
67b: 89 45 fc mov %eax,-0x4(%rbp)
# Prepare arguments and call RecursiveFunction
67e: 8b 45 fc mov -0x4(%rbp),%eax
681: 89 c7 mov %eax,%edi
# callq will store the returning address on stack, using 0x8 bytes
683: e8 c2 ff ff ff callq 64a <RecursiveFunction>
688: eb 01 jmp 68b <RecursiveFunction+0x41>
68a: 90 nop
68b: c9 leaveq
68c: c3 retq
The 0x20
-byte stack frame consists of 0x8 bytes for storing old %rbp
(at 64a), 0x10 bytes for this function’s usage (at 64e), and the 0x8 bytes used by callq
at 683. There is no surprise.
The generated code at optimization level 3
Now let’s look at the generated code by gcc under optimization level 3. Annotations are added too.
0000000000000690 <RecursiveFunction>:
# allocate a stack frame of 0x28 bytes
690: 48 83 ec 28 sub $0x28,%rsp
694: 48 8d 35 e9 00 00 00 lea 0xe9(%rip),%rsi # 784 <_IO_stdin_used+0x4>
69b: 31 c0 xor %eax,%eax
69d: 48 8d 54 24 0c lea 0xc(%rsp),%rdx
# store recursion_times into stack at 0xc(%rsp)
6a2: 89 7c 24 0c mov %edi,0xc(%rsp)
6a6: bf 01 00 00 00 mov $0x1,%edi
# call __printf_chk()
6ab: e8 90 fe ff ff callq 540 <__printf_chk@plt>
# if recursion_times > 1, goto 6c0
# so, if recursion_times <= 1, return
6b0: 8b 44 24 0c mov 0xc(%rsp),%eax
6b4: 83 f8 01 cmp $0x1,%eax
6b7: 7f 07 jg 6c0 <RecursiveFunction+0x30>
6b9: 48 83 c4 28 add $0x28,%rsp
6bd: c3 retq
6be: 66 90 xchg %ax,%ax
# --recursion_times
6c0: 83 e8 01 sub $0x1,%eax
# call __printf_chk()
6c3: 48 8d 54 24 1c lea 0x1c(%rsp),%rdx
6c8: 48 8d 35 b5 00 00 00 lea 0xb5(%rip),%rsi # 784 <_IO_stdin_used+0x4>
# store recursion_times to stack at 0xc(%rsp)
6cf: 89 44 24 0c mov %eax,0xc(%rsp)
# store recursion_times to stack at 0x1c(%rsp)
6d3: 89 44 24 1c mov %eax,0x1c(%rsp)
6d7: bf 01 00 00 00 mov $0x1,%edi
6dc: 31 c0 xor %eax,%eax
6de: e8 5d fe ff ff callq 540 <__printf_chk@plt>
# if recursion_times <= 1, return
6e3: 8b 7c 24 1c mov 0x1c(%rsp),%edi
6e7: 83 ff 01 cmp $0x1,%edi
6ea: 7e cd jle 6b9 <RecursiveFunction+0x29>
# -- recursion_times
6ec: 83 ef 01 sub $0x1,%edi
# store recursion_times into stack at 0x1c(%rsp)
6ef: 89 7c 24 1c mov %edi,0x1c(%rsp)
# call RecursiveFunction
6f3: e8 98 ff ff ff callq 690 <RecursiveFunction>
6f8: eb bf jmp 6b9 <RecursiveFunction+0x29>
6fa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
Before we get into the RecursiveFunction logic, let’s first check the optimizations applied here.
%rbp
is not used and only%rsp
is maintained. So the 2 instructionspushq %rbp
andmovq %rsp,%rbp
in the code generated at level 0 are not needed.__printf_chk()
is directly called instead ofprintf()
because theprintf()
body only containsreturn __printf_chk(...
and the tail-call elimination takes effect here. You may use thegcc -fno-stack-protector -O3 -Wa,-adhln -g prog.c
(as a technique from Generating Mixed Source and Assembly List using GCC) to verify and see the mixed source code and assembly code.
Now, let’s take a look at why the &recursion_times
wigwags.
In the body of the generated code of RecursiveFunction
, the __printf_chk()
are called twice and recursion_times
is deducted by 1 for two times. So, 2 RecursiveFunction
functions’ logic from the C code may be executed in one RecursiveFunction
procedure call in the generated optimized code. So the RecursiveFunction
is not the same any more! That is, one RecursiveFunction
C function is inlined into itself. For the 2 recursive_times
variables in one RecursiveFunction
procedure call, they are stored in 0xc(%rsp)
and 0x1c(%rsp)
so their address differs by 0x10 bytes. The stack frame size for a RecursiveFunction
procedure (the generated one) call is actually 0x28+0x8 = 0x30 (by sub
at 690 and callq
at 6f3).
The reason is recursive inlining optimization
Now it is clear the cause is the recursive inlining optimization by gcc. gcc also has options to control the optimization behavior. From gcc
manual:
max-inline-insns-recursive
max-inline-insns-recursive-auto
Specifies the maximum number of instructions an out-of-line copy of a
self-recursive inline function can grow into by performing recursive
inlining.
--param max-inline-insns-recursive applies to functions declared inline.
For functions not declared inline, recursive inlining happens only when
-finline-functions (included in -O3) is enabled; --param
max-inline-insns-recursive-auto applies instead. The default value is 450.
max-inline-recursive-depth
max-inline-recursive-depth-auto
Specifies the maximum recursion depth used for recursive inlining.
--param max-inline-recursive-depth applies to functions declared inline.
For functions not declared inline, recursive inlining happens only when
-finline-functions (included in -O3) is enabled; --param
max-inline-recursive-depth-auto applies instead. The default value is 8.
min-inline-recursive-probability
Recursive inlining is profitable only for function having deep recursion in
average and can hurt for function having little recursion depth by
increasing the prologue size or complexity of function body to other
optimizers.
When profile feedback is available (see -fprofile-generate) the actual
recursion depth can be guessed from the probability that function recurses
via a given call expression. This parameter limits inlining only to call
expressions whose probability exceeds the given threshold (in percents).
The default value is 10.
Further tries
Now, let’s use the parameters to tune gcc’s optimization parameters and see what’s the result. Let’s set min-inline-recursive-probability
to 5. To make the results clearer, I changed the initial recursion_times
to 10.
$ gcc -fno-stack-protector -O3 --param min-inline-recursive-probability=5 prog.c -o prog.3
$ ./prog.3
stack: 0x7ffdfa315a7c
stack: 0x7ffdfa315a88
stack: 0x7ffdfa315a8c
stack: 0x7ffdfa315a4c
stack: 0x7ffdfa315a58
stack: 0x7ffdfa315a5c
stack: 0x7ffdfa315a1c
stack: 0x7ffdfa315a28
stack: 0x7ffdfa315a2c
stack: 0x7ffdfa3159ec
Now 3 RecursiveFunction
s are merged together. We can verify this by checking assembly code which calls __printf_chk()
3 times.
0000000000000690 <RecursiveFunction>:
690: 48 83 ec 28 sub $0x28,%rsp
694: 48 8d 35 19 01 00 00 lea 0x119(%rip),%rsi # 7b4 <_IO_stdin_used+0x4>
69b: 31 c0 xor %eax,%eax
69d: 48 8d 54 24 0c lea 0xc(%rsp),%rdx
6a2: 89 7c 24 0c mov %edi,0xc(%rsp)
6a6: bf 01 00 00 00 mov $0x1,%edi
6ab: e8 90 fe ff ff callq 540 <__printf_chk@plt>
6b0: 8b 44 24 0c mov 0xc(%rsp),%eax
6b4: 83 f8 01 cmp $0x1,%eax
6b7: 7f 07 jg 6c0 <RecursiveFunction+0x30>
6b9: 48 83 c4 28 add $0x28,%rsp
6bd: c3 retq
6be: 66 90 xchg %ax,%ax
6c0: 83 e8 01 sub $0x1,%eax
6c3: 48 8d 54 24 18 lea 0x18(%rsp),%rdx
6c8: 48 8d 35 e5 00 00 00 lea 0xe5(%rip),%rsi # 7b4 <_IO_stdin_used+0x4>
6cf: 89 44 24 0c mov %eax,0xc(%rsp)
6d3: 89 44 24 18 mov %eax,0x18(%rsp)
6d7: bf 01 00 00 00 mov $0x1,%edi
6dc: 31 c0 xor %eax,%eax
6de: e8 5d fe ff ff callq 540 <__printf_chk@plt>
6e3: 8b 44 24 18 mov 0x18(%rsp),%eax
6e7: 83 f8 01 cmp $0x1,%eax
6ea: 7e cd jle 6b9 <RecursiveFunction+0x29>
6ec: 83 e8 01 sub $0x1,%eax
6ef: 48 8d 54 24 1c lea 0x1c(%rsp),%rdx
6f4: 48 8d 35 b9 00 00 00 lea 0xb9(%rip),%rsi # 7b4 <_IO_stdin_used+0x4>
6fb: 89 44 24 18 mov %eax,0x18(%rsp)
6ff: 89 44 24 1c mov %eax,0x1c(%rsp)
703: bf 01 00 00 00 mov $0x1,%edi
708: 31 c0 xor %eax,%eax
70a: e8 31 fe ff ff callq 540 <__printf_chk@plt>
70f: 8b 7c 24 1c mov 0x1c(%rsp),%edi
713: 83 ff 01 cmp $0x1,%edi
716: 7e a1 jle 6b9 <RecursiveFunction+0x29>
718: 83 ef 01 sub $0x1,%edi
71b: 89 7c 24 1c mov %edi,0x1c(%rsp)
71f: e8 6c ff ff ff callq 690 <RecursiveFunction>
724: eb 93 jmp 6b9 <RecursiveFunction+0x29>
726: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
72d: 00 00 00
It’s great fun, right? You may try more combinations and see how gcc generate different code.