strace prelude in C
2022-04-17T10:34:57
Much has been written about the venerable
strace
, one of the most important
allies of the systems programmer. However, the focus is often on the specifics
of the particular program being examined, not on the functioning of
strace
itself and the operating system environment involved in the
execution of the process. Today we explore this neglected part of the system
call trace.
Minimal ELF
The first encounter with strace
is very intimidating: even the most
apparently trivial programs are revealed to not be so under its dissection.
$ strace true |& wc
35 213 2485
$ strace ls /sys |& wc
81 453 5148
For this reason, texts presenting techniques for analysis and debugging instruct
the reader to ignore the beginning of the output. There are indeed a number of
steps involved in executing a fully-featured program built for modern operating
systems, but it need not be so. Examples traditionally start with a
printf
, but a minimal, much smaller and simpler executable can be
crafted which results in an almost empty trace.
$ as min.s -o min.o
$ ld --oformat binary min.o -o min
$ du --bytes min
124 min
$ env --ignore-environment strace ./min
execve("./min", ["./min"], 0x7ffec174db90 /* 0 vars */) = 0
exit(0) = ?
+++ exited with 0 +++
In fact, under strace
, this is the smallest possible output for a
program that correctly terminates. After the program is assembled, it is
executed under env -i
/--ignore-environment
so that all
environment variables from the shell are cleared, reducing the size and
variability of the output. The first line of the trace is the
execve(2)
which begins the execution of the program specified in
the command line. Nelson Elhage's
article
on building a minimal tracing tool using ptrace(2)
has an excellent
explanation of the reason why this is always the first line, as well as the path
that gets there and much more.
C
To start moving to more complex examples, the previous program can be rewritten
in (partially non-standard and very low-level) C. The entire purpose of the new
source code is to instruct the compiler to generate the same _start
entry point as expected by the (now implicitly invoked) assembler, with the
exact same machine code.
#include <stdnoreturn.h>
noreturn void _start() {
__asm__("mov $0x3c,%al\nsyscall");
__builtin_unreachable();
}
When compiled as a static executable, this source code produces the same sequence of instructions and an equivalent result when traced (it is also about 70x larger).
$ gcc -std=c17 -static -nostdlib -O2 min.c
$ objdump --disassemble=_start a.out | tail -3
0000000000401000 <_start>:
401000: b0 3c mov $0x3c,%al
401002: 0f 05 syscall
$ env -i strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffd8a6c9e20 /* 0 vars */) = 0
exit(0) = ?
+++ exited with 0 +++
The -nostdlib
flag is used to suppress the inclusion of the C
standard library facilities, since none of them are used in this program. Even
though it is a very minimalistic language (compared to its more modern
counterparts), the C standard library manages a number of library and operating
system components, and much of the initialization of a process — prior to the
execution of the main
function — is done by the library.
musl
Somewhat unconventionally compared to other Unix descendants (and operating
systems in general), the C library is not strictly considered part of the
operating system on Linux. As such, the
GNU C library
(glibc
) is the traditional implementation used on most systems, but
a number of alternatives exist.
musl
is one of these
implementations. Its focus on simplicity is appropriate to begin exploring the
C runtime. Tracing the true C "hello, world" program compiled with
musl
yields:
int main(void) {}
$ musl-gcc -std=c17 -static libc.c
$ env -i strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffc71f9cfc0 /* 0 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x404158) = 0
set_tid_address(0x404290) = 2961014
exit_group(0) = ?
+++ exited with 0 +++
All changes in the output are related to the threading library and its runtime setup:
-
arch_prctl(2)
is an x86-64 system call for managing thread state; theARCH_SET_FS
operation manipulates theFS
register, used in the implementation of thread-local storage0. -
set_tid_address(2)
is used in thepthread(3)
implementation. The address passed to it corresponds to thechild_tid
argument ofclone(2)
and is part of the implementation ofpthread_create(3)
,pthread_join(3)
, etc. -
exit_group(2)
is the real system call behindexit(3)
: it terminates not only the caller but all threads in a process.
glibc
Tracing the same program as before, but now compiled using the GNU standard
library, yields a much longer output, detailed in parts below. All of the new
calls come from the standard library initialization, done by the
__libc_start_main_impl
function (defined using the
LIBC_START_MAIN
macro, for those following the code).
$ gcc -std=c17 -static -O2 libc.c
$ env -i strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffefce488e0 /* 0 vars */) = 0
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc39156920) = -1 EINVAL (Invalid argument)
arch_prctl(2)
is used here again, this time performing a mysterious
0x3001
operation1. This code
corresponds to the ARCH_CET_STATUS
operation, used in the
implementation of Intel's Control-flow
Enforcement Technology, in particular
shadow stacks and
indirect branch tracking.
In this particular machine, the CPU does not support these features.
brk(NULL) = 0x232b000
brk(0x232bdc0) = 0x232bdc0
brk(2)
controls the program break, i.e. the maximum address
of the program's heap. The first call uses NULL
to retrieve the
current value (n.b.: this is a Linux idiosyncrasy, POSIX brk(2)
returns 0
on success), while the second sets it to
0xdc0
bytes above that value. This is dynamic memory allocation at
its lowest level (above the kernel), virtually equivalent to
malloc(0xdc0)
. Here, it is done in the initialization of
thread-local storage.
arch_prctl(ARCH_SET_FS, 0x232b3c0) = 0
set_tid_address(0x232b690) = 2965460
set_robust_list(0x232b6a0, 24) = 0
Threading configuration is done similarly to musl
.
set_robust_list
is also related to threading: it ensures proper
handling of futex(2)
es when threads terminate unexpectedly.
rseq(0x232bd60, 0x20, 0, 0x53053053) = 0
Restartable sequences are a kernel mechanism, recently incorporated
into glibc
, which allow user space to use per-CPU data as is done
extensively in kernel mode. This requires operating system support, since user
programs are constantly preempted, which is configured via this system call.
You know where to find more
information by now.
uname({sysname="Linux", nodename="tjobim", ...}) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
readlink("/proc/self/exe", "/home/bbguimaraes/src/bbguimarae"..., 4096) = 71
These calls are relatively simple. The antediluvian uname(2)
is
used to retrieve the kernel version. prlimit64(2)
retrieves the
maximum stack size, to be used as the default stacksize
attribute
in pthread_create(3)
. Finally, readlink(2)
is used
with the Linux-specific /proc/self/exe
symbolic link to discover
the path of the executable.
getrandom("\x94\x07\xa0\xf7\x13\xa7\xe9\xfc", 8, GRND_NONBLOCK) = 8
brk(0x234cdc0) = 0x234cdc0
brk(0x234d000) = 0x234d000
Amusingly, this is the first time malloc(3)
is called (recall the
program break was set directly using brk(2)
before). As part of
its internal initialization, a 64-bit value is set to a random value. Then,
memory is allocated in the same way as before: by retrieving the current program
break and incrementing it (this is actually a library version of
sbrk(2)
implemented using brk(2)
).
mprotect(0x4b7000, 16384, PROT_READ) = 0
exit_group(0) = ?
+++ exited with 0 +++
The final action before the execution of our uneventful main
function is to make part of the address space (as indicated by the
RELRO
ELF header) read-only2.
ld
To complete our investigation of the usual strace
prelude, dynamic
linking is introduced. Doing so generates many system calls performed by the
linker, which appear in the output since they occur as part of the
execve(2)
call. Somewhat surprisingly, it also changes
considerably the actions performed during glibc
's initialization.
Details on what is covered in this section can be found in the
ld.so(8)
and ldconfig(8)
manual pages. Charng-Da Lu's
article on the ELF
format also details some of the glibc
linker implementation.
$ gcc -std=c17 -O2 libc.c
$ env -i strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffdbf7c30d0 /* 0 vars */) = 0
brk(NULL) = 0x55ccf3f95000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc92198690) = -1 EINVAL (Invalid argument)
The first few lines are familiar, except there is now a single direct call to
brk(2)
to retrieve the current program break.
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
There are many ways to instruct the linker to load additional shared libraries
before beginning the execution of a dynamically-linked program. The
LD_PRELOAD
environment variable is perhaps the best known of them,
but /etc/ld.so.preload
can also be used to affect every process in
the system. This particular machine does not have this file.
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=156427, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 156427, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f576c5e2000
close(3) = 0
When ldconfig(8)
sets up shared libraries in the system
directories, it also populates /etc/ld.so.cache
. This file is
structured so that the shared library names registered in a dynamically-linked
program (soname
s) can be searched quickly when its execution
begins. The cache is a binary file: its contents can be viewed using
ldconfig -p
/--print-cache
. During initialization, the
linker opens the file and mmap(2)
s it into the process' address
space3.
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\324\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0@\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 80, 848) = 80
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\205vn\235\204X\261n\234|\346\340|q,\2"..., 68, 928) = 68
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=2463384, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f576c5e0000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2136752, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f576c3d6000
mprotect(0x7f576c402000, 1880064, PROT_NONE) = 0
mmap(0x7f576c402000, 1531904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2c000) = 0x7f576c402000
mmap(0x7f576c578000, 344064, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a2000) = 0x7f576c578000
mmap(0x7f576c5cd000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1f6000) = 0x7f576c5cd000
mmap(0x7f576c5d3000, 51888, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f576c5d3000
close(3) = 0
Here dynamic linking per se begins. The first library to be linked is naturally
the C standard library. It is found via /etc/ld.so.cache
and
opened directly. read(2)
and pread64(2)
(i.e. read
with offset) are used to determine the contents of the shared library. Each
section is then mmap(2)
ed into the process' address space with the
appropriate access permissions4.
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f576c3d4000
arch_prctl(ARCH_SET_FS, 0x7f576c5e1600) = 0
set_tid_address(0x7f576c5e18d0) = 2974699
set_robust_list(0x7f576c5e18e0, 24) = 0
rseq(0x7f576c5e1fa0, 0x20, 0, 0x53053053) = 0
mprotect(0x7f576c5cd000, 12288, PROT_READ) = 0
mprotect(0x55ccf2f54000, 4096, PROT_READ) = 0
mprotect(0x7f576c63e000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
Another malloc(3)
call causes more system memory to be requested
(corresponding to the initial brk(2)
calls during TLS
initialization in the statically-linked program). It is followed by a series of
familiar calls, only this time there are a few read-only memory sections (and
they are mapped to dynamic locations).
munmap(0x7f576c5e2000, 156427) = 0
exit_group(0) = ?
+++ exited with 0 +++
With the dynamic linking done, the cache is unmapped and our magnanimous
main
function is executed. This brings us to the end of a complete
strace
output.
Coda
Of course, as a programmer or system administrator, it is at this point that your job would begin in tracing the actual program under consideration, but problems sometimes occur in or are caused by the library and operating system components described here. Understanding these concepts can be useful in those situations, and also allows us to explore the many interesting steps performed to execute almost every program in our operating systems.
Each of the manual pages and articles linked throughout the text provides more information on any topic that may be of interest. The implementation of these system calls in the Linux kernel and the GNU C library is fairly approachable once one gets accustomed to the general structure of those code bases. GDB also has a useful "catchpoint" command which can stop at the beginning and end of every system call:
$ gdb a.out
Reading symbols from a.out...
(No debugging symbols found in a.out)
(gdb) catch syscall
Catchpoint 1 (any syscall)
(gdb) r
Starting program: /tmp/a.out
Catchpoint 1 (call to syscall brk), 0x00007ffff7fe8dbb in brk () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0 0x00007ffff7fe8dbb in brk () from /lib64/ld-linux-x86-64.so.2
#1 0x00007ffff7fe2ebe in _dl_sysdep_start () from /lib64/ld-linux-x86-64.so.2
#2 0x00007ffff7fe4b91 in _dl_start () from /lib64/ld-linux-x86-64.so.2
#3 0x00007ffff7fe3938 in _start () from /lib64/ld-linux-x86-64.so.2
#4 0x0000000000000001 in ?? ()
#5 0x00007fffffffe2c0 in ?? ()
#6 0x0000000000000000 in ?? ()
With some experience in analyzing optimized machine language instructions, this can be a very good way to explore the runtime behavior of a program or library. For the bibliophiles such as myself, there are many amazing books on this topic (there is a reason my reading list has more operating system books than any other category). In particular, The Linux Programming Interface by Michael Kerrisk (who happens to also be the maintainer of Linux manual pages) is a fantastic resource.
Have fun!
Notes
-
This can be verified with code such as the following:
thread_local int i; int f(void) { return i; }
f: mov eax, DWORD PTR fs:0xfffffffffffffffc ret
where
0xfffffffffffffffc
is a complicated spelling of-4
(i.e.-sizeof(i)
) in two's complement. Similarly, printing the address ofi
in the main thread shows that it is located four bytes below the address passed to thearch_prctl(2)
call. -
It is common for programs which decode system call information, such as
strace
andgdb
, to not have support for obscure and/or recently-added calls or parameters. Sometimes, they are documented in the manual pages. The Linux andglibc
source code andgit
logs are good sources of information to figure out their meaning when that is not the case. -
These values can be seen in the ELF file:
$ objdump --all a.out | grep -A1 RELRO RELRO off 0x00000000000b6770 vaddr 0x00000000004b7770 paddr 0x00000000004b7770 align 2**0 filesz 0x0000000000003890 memsz 0x0000000000003890 flags r--
The start address is rounded down to the page size (
0x4b7770 & ~0xfff = 0x4b7000
) and then used to calculate the new size (0x3890 + 0x770 = 0x4000 = 16384
). -
For readers new to
strace
, it is important to understand how these four calls form a single overarching operation.openat(2)
returns file descriptor3
(i.e. just abovestderr
/STDERR_FILENO
), as indicated at the end of the line. Each of the subsequent calls uses that same descriptor: as the first argument tonewfstatat(2)
(a specialized version ofstat(2)
) and as the fifth argument tommap(2)
.close(2)
ing the descriptor finalizes the operation.While not all operations follow this pattern (files may be kept open, calls may be interleaved, etc.), this is a very common occurrence and worth getting accustomed to.
-
The first
mmap(2)
call, which does not specify a file descriptor (-1
), is actually from yet anothermalloc(3)
implementation inside the linker.