strace prelude in C

2022-04-17T10:34:57

Much has been written about the venerable strace, one of the most important allies of the systems programmer. However, the focus is often on the specifics of the particular program being examined, not on the functioning of strace itself and the operating system environment involved in the execution of the process. Today we explore this neglected part of the system call trace.

Minimal ELF

The first encounter with strace is very intimidating: even the most apparently trivial programs are revealed to not be so under its dissection.

$ strace true |& wc
     35     213    2485
$ strace ls /sys |& wc
     81     453    5148

For this reason, texts presenting techniques for analysis and debugging instruct the reader to ignore the beginning of the output. There are indeed a number of steps involved in executing a fully-featured program built for modern operating systems, but it need not be so. Examples traditionally start with a printf, but a minimal, much smaller and simpler executable can be crafted which results in an almost empty trace.

$ as min.s -o min.o
$ ld --oformat binary min.o -o min
$ du --bytes min
124     min
$ env --ignore-environment strace ./min
execve("./min", ["./min"], 0x7ffec174db90 /* 0 vars */) = 0
exit(0)                                 = ?
+++ exited with 0 +++

In fact, under strace, this is the smallest possible output for a program that correctly terminates. After the program is assembled, it is executed under env -i/--ignore-environment so that all environment variables from the shell are cleared, reducing the size and variability of the output. The first line of the trace is the execve(2) which begins the execution of the program specified in the command line. Nelson Elhage's article on building a minimal tracing tool using ptrace(2) has an excellent explanation of the reason why this is always the first line, as well as the path that gets there and much more.

The min.s file instructs the assembler to manually create an ELF file. This and all other files presented are built on and for an x86-64 Linux machine. They inevitably display platform-specific aspects, but the concepts are still generally applicable. The exact details of generating such a file will not be covered: many resources on this subject can be found on the internet. Consult the source code and the elf(5) manual page.

The entirety of the instruction stream is encoded in four bytes in the binary:

$ objdump --disassemble=_start min.o

min.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000078 <_start>:
  78:   b0 3c                   mov    $0x3c,%al
  7a:   0f 05                   syscall

This program simply sets up the rax register and issues a system call. rax receives the value 3c₁₆ / 60₁₀, which is the code for sys_exit on x86-64. rdi, implicitly set to zero, indicates the status code.

Depending on the version of binutils, it may be necessary to use as --mx86-used-note no in order to generate an unadulterated object file (or, alternatively, to remove that section post factum with objcopy --remove-section .note.gnu.property before the linking step).

C

To start moving to more complex examples, the previous program can be rewritten in (partially non-standard and very low-level) C. The entire purpose of the new source code is to instruct the compiler to generate the same _start entry point as expected by the (now implicitly invoked) assembler, with the exact same machine code.

#include <stdnoreturn.h>

noreturn void _start() {
    __asm__("mov $0x3c,%al\nsyscall");
    __builtin_unreachable();
}

When compiled as a static executable, this source code produces the same sequence of instructions and an equivalent result when traced (it is also about 70x larger).

$ gcc -std=c17 -static -nostdlib -O2 min.c
$ objdump --disassemble=_start a.out | tail -3
0000000000401000 <_start>:
  401000:       b0 3c                   mov    $0x3c,%al
  401002:       0f 05                   syscall
$ env -i strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffd8a6c9e20 /* 0 vars */) = 0
exit(0)                                 = ?
+++ exited with 0 +++

The -nostdlib flag is used to suppress the inclusion of the C standard library facilities, since none of them are used in this program. Even though it is a very minimalistic language (compared to its more modern counterparts), the C standard library manages a number of library and operating system components, and much of the initialization of a process — prior to the execution of the main function — is done by the library.

musl

Somewhat unconventionally compared to other Unix descendants (and operating systems in general), the C library is not strictly considered part of the operating system on Linux. As such, the GNU C library (glibc) is the traditional implementation used on most systems, but a number of alternatives exist. musl is one of these implementations. Its focus on simplicity is appropriate to begin exploring the C runtime. Tracing the true C "hello, world" program compiled with musl yields:

int main(void) {}

$ musl-gcc -std=c17 -static libc.c
$ env -i strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffc71f9cfc0 /* 0 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x404158)       = 0
set_tid_address(0x404290)               = 2961014
exit_group(0)                           = ?
+++ exited with 0 +++

All changes in the output are related to the threading library and its runtime setup:

arch_prctl(2) is an x86-64 system call for managing thread state; the ARCH_SET_FS operation manipulates the FS register, used in the implementation of thread-local storage⁰.
set_tid_address(2) is used in the pthread(3) implementation. The address passed to it corresponds to the child_tid argument of clone(2) and is part of the implementation of pthread_create(3), pthread_join(3), etc.
exit_group(2) is the real system call behind exit(3): it terminates not only the caller but all threads in a process.

glibc

Tracing the same program as before, but now compiled using the GNU standard library, yields a much longer output, detailed in parts below. All of the new calls come from the standard library initialization, done by the __libc_start_main_impl function (defined using the LIBC_START_MAIN macro, for those following the code).

$ gcc -std=c17 -static -O2 libc.c
$ env -i strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffefce488e0 /* 0 vars */) = 0
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc39156920) = -1 EINVAL (Invalid argument)

arch_prctl(2) is used here again, this time performing a mysterious 0x3001 operation¹. This code corresponds to the ARCH_CET_STATUS operation, used in the implementation of Intel's Control-flow Enforcement Technology, in particular shadow stacks and indirect branch tracking. In this particular machine, the CPU does not support these features.

brk(NULL)                               = 0x232b000
brk(0x232bdc0)                          = 0x232bdc0

brk(2) controls the program break, i.e. the maximum address of the program's heap. The first call uses NULL to retrieve the current value (n.b.: this is a Linux idiosyncrasy, POSIX brk(2) returns 0 on success), while the second sets it to 0xdc0 bytes above that value. This is dynamic memory allocation at its lowest level (above the kernel), virtually equivalent to malloc(0xdc0). Here, it is done in the initialization of thread-local storage.

arch_prctl(ARCH_SET_FS, 0x232b3c0)      = 0
set_tid_address(0x232b690)              = 2965460
set_robust_list(0x232b6a0, 24)          = 0

Threading configuration is done similarly to musl. set_robust_list is also related to threading: it ensures proper handling of futex(2)es when threads terminate unexpectedly.

rseq(0x232bd60, 0x20, 0, 0x53053053)    = 0

Restartable sequences are a kernel mechanism, recently incorporated into glibc, which allow user space to use per-CPU data as is done extensively in kernel mode. This requires operating system support, since user programs are constantly preempted, which is configured via this system call. You know where to find more information by now.

uname({sysname="Linux", nodename="tjobim", ...}) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
readlink("/proc/self/exe", "/home/bbguimaraes/src/bbguimarae"..., 4096) = 71

These calls are relatively simple. The antediluvian uname(2) is used to retrieve the kernel version. prlimit64(2) retrieves the maximum stack size, to be used as the default stacksize attribute in pthread_create(3). Finally, readlink(2) is used with the Linux-specific /proc/self/exe symbolic link to discover the path of the executable.

getrandom("\x94\x07\xa0\xf7\x13\xa7\xe9\xfc", 8, GRND_NONBLOCK) = 8
brk(0x234cdc0)                          = 0x234cdc0
brk(0x234d000)                          = 0x234d000

Amusingly, this is the first time malloc(3) is called (recall the program break was set directly using brk(2) before). As part of its internal initialization, a 64-bit value is set to a random value. Then, memory is allocated in the same way as before: by retrieving the current program break and incrementing it (this is actually a library version of sbrk(2) implemented using brk(2)).

mprotect(0x4b7000, 16384, PROT_READ)    = 0
exit_group(0)                           = ?
+++ exited with 0 +++

The final action before the execution of our uneventful main function is to make part of the address space (as indicated by the RELRO ELF header) read-only².

ld

To complete our investigation of the usual strace prelude, dynamic linking is introduced. Doing so generates many system calls performed by the linker, which appear in the output since they occur as part of the execve(2) call. Somewhat surprisingly, it also changes considerably the actions performed during glibc's initialization. Details on what is covered in this section can be found in the ld.so(8) and ldconfig(8) manual pages. Charng-Da Lu's article on the ELF format also details some of the glibc linker implementation.

$ gcc -std=c17 -O2 libc.c
$ env -i strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffdbf7c30d0 /* 0 vars */) = 0
brk(NULL)                               = 0x55ccf3f95000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc92198690) = -1 EINVAL (Invalid argument)

The first few lines are familiar, except there is now a single direct call to brk(2) to retrieve the current program break.

access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)

There are many ways to instruct the linker to load additional shared libraries before beginning the execution of a dynamically-linked program. The LD_PRELOAD environment variable is perhaps the best known of them, but /etc/ld.so.preload can also be used to affect every process in the system. This particular machine does not have this file.

openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=156427, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 156427, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f576c5e2000
close(3)                                = 0

When ldconfig(8) sets up shared libraries in the system directories, it also populates /etc/ld.so.cache. This file is structured so that the shared library names registered in a dynamically-linked program (sonames) can be searched quickly when its execution begins. The cache is a binary file: its contents can be viewed using ldconfig -p/--print-cache. During initialization, the linker opens the file and mmap(2)s it into the process' address space³.

openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\324\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0@\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 80, 848) = 80
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\205vn\235\204X\261n\234|\346\340|q,\2"..., 68, 928) = 68
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=2463384, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f576c5e0000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2136752, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f576c3d6000
mprotect(0x7f576c402000, 1880064, PROT_NONE) = 0
mmap(0x7f576c402000, 1531904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2c000) = 0x7f576c402000
mmap(0x7f576c578000, 344064, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a2000) = 0x7f576c578000
mmap(0x7f576c5cd000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1f6000) = 0x7f576c5cd000
mmap(0x7f576c5d3000, 51888, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f576c5d3000
close(3)                                = 0

Here dynamic linking per se begins. The first library to be linked is naturally the C standard library. It is found via /etc/ld.so.cache and opened directly. read(2) and pread64(2) (i.e. read with offset) are used to determine the contents of the shared library. Each section is then mmap(2)ed into the process' address space with the appropriate access permissions⁴.

mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f576c3d4000
arch_prctl(ARCH_SET_FS, 0x7f576c5e1600) = 0
set_tid_address(0x7f576c5e18d0)         = 2974699
set_robust_list(0x7f576c5e18e0, 24)     = 0
rseq(0x7f576c5e1fa0, 0x20, 0, 0x53053053) = 0
mprotect(0x7f576c5cd000, 12288, PROT_READ) = 0
mprotect(0x55ccf2f54000, 4096, PROT_READ) = 0
mprotect(0x7f576c63e000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0

Another malloc(3) call causes more system memory to be requested (corresponding to the initial brk(2) calls during TLS initialization in the statically-linked program). It is followed by a series of familiar calls, only this time there are a few read-only memory sections (and they are mapped to dynamic locations).

munmap(0x7f576c5e2000, 156427)          = 0
exit_group(0)                           = ?
+++ exited with 0 +++

With the dynamic linking done, the cache is unmapped and our magnanimous main function is executed. This brings us to the end of a complete strace output.

Coda

Of course, as a programmer or system administrator, it is at this point that your job would begin in tracing the actual program under consideration, but problems sometimes occur in or are caused by the library and operating system components described here. Understanding these concepts can be useful in those situations, and also allows us to explore the many interesting steps performed to execute almost every program in our operating systems.

Each of the manual pages and articles linked throughout the text provides more information on any topic that may be of interest. The implementation of these system calls in the Linux kernel and the GNU C library is fairly approachable once one gets accustomed to the general structure of those code bases. GDB also has a useful "catchpoint" command which can stop at the beginning and end of every system call:

$ gdb a.out
Reading symbols from a.out...
(No debugging symbols found in a.out)
(gdb) catch syscall
Catchpoint 1 (any syscall)
(gdb) r
Starting program: /tmp/a.out

Catchpoint 1 (call to syscall brk), 0x00007ffff7fe8dbb in brk () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0  0x00007ffff7fe8dbb in brk () from /lib64/ld-linux-x86-64.so.2
#1  0x00007ffff7fe2ebe in _dl_sysdep_start () from /lib64/ld-linux-x86-64.so.2
#2  0x00007ffff7fe4b91 in _dl_start () from /lib64/ld-linux-x86-64.so.2
#3  0x00007ffff7fe3938 in _start () from /lib64/ld-linux-x86-64.so.2
#4  0x0000000000000001 in ?? ()
#5  0x00007fffffffe2c0 in ?? ()
#6  0x0000000000000000 in ?? ()

With some experience in analyzing optimized machine language instructions, this can be a very good way to explore the runtime behavior of a program or library. For the bibliophiles such as myself, there are many amazing books on this topic (there is a reason my reading list has more operating system books than any other category). In particular, The Linux Programming Interface by Michael Kerrisk (who happens to also be the maintainer of Linux manual pages) is a fantastic resource.

Have fun!

Notes

This can be verified with code such as the following:
```
thread_local int i;

int f(void) {
    return i;
}
```
```
f:
    mov eax, DWORD PTR fs:0xfffffffffffffffc
    ret
```
where 0xfffffffffffffffc is a complicated spelling of -4 (i.e. -sizeof(i)) in two's complement. Similarly, printing the address of i in the main thread shows that it is located four bytes below the address passed to the arch_prctl(2) call.
It is common for programs which decode system call information, such as strace and gdb, to not have support for obscure and/or recently-added calls or parameters. Sometimes, they are documented in the manual pages. The Linux and glibc source code and git logs are good sources of information to figure out their meaning when that is not the case.

These values can be seen in the ELF file:

$ objdump --all a.out | grep -A1 RELRO
   RELRO off    0x00000000000b6770 vaddr 0x00000000004b7770 paddr 0x00000000004b7770 align 2**0
         filesz 0x0000000000003890 memsz 0x0000000000003890 flags r--

The start address is rounded down to the page size (0x4b7770 & ~0xfff = 0x4b7000) and then used to calculate the new size (0x3890 + 0x770 = 0x4000 = 16384).

For readers new to strace, it is important to understand how these four calls form a single overarching operation. openat(2) returns file descriptor 3 (i.e. just above stderr/STDERR_FILENO), as indicated at the end of the line. Each of the subsequent calls uses that same descriptor: as the first argument to newfstatat(2) (a specialized version of stat(2)) and as the fifth argument to mmap(2). close(2)ing the descriptor finalizes the operation.

While not all operations follow this pattern (files may be kept open, calls may be interleaved, etc.), this is a very common occurrence and worth getting accustomed to.
The first mmap(2) call, which does not specify a file descriptor (-1), is actually from yet another malloc(3) implementation inside the linker.

assembly c debugging linux man programming strace unix