In this blog post we will try to understand how an inotify call travels from user space to kernel space, going from our process through the libc, the C standard library.

If we run strace in our Linux shell with a program as its argument, we will see the system calls made by that program.
strace is a Linux command that records the system calls made by a program while it runs, as they cross from user space into kernel space. It prints the name of each system call, the parameters passed to it and its return value.

We will execute strace on the program we created in the previous post about Inotify: http://www.spaghettiml.com/en/2019/01/28/how-inotify-work-part-1/

Let’s compile the program and execute it:

gcc inotify_example.c && strace ./a.out 

Now, we are interested in the function inotify_add_watch and how it is called.

...
inotify_add_watch(3, "/home/joxer/code", IN_ACCESS|IN_MODIFY|IN_ATTRIB|IN_CLOSE_WRITE|IN_CLOSE_NOWRITE|IN_OPEN|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF) = 1
...

This is the system call made when our code reaches the line inotify_add_watch( inotifyFd, bufPtr, IN_ALL_EVENTS );
We can see how the constant IN_ALL_EVENTS has been expanded into its individual flags, and that 3 is the file descriptor we are acting on.

Now, where is this system call defined? We just have to look in the kernel source code and find the file inotify_user.c, where it’s defined.
https://github.com/torvalds/linux/blob/master/fs/notify/inotify/inotify_user.c#L696

In a later article I will write about the body of the function and how it works; for now I’ll discuss how a system call gets from user space to kernel space.

Now, how does the libc make a system call into the kernel?
This depends on the CPU architecture, which determines how the kernel is entered and how parameters are passed to it.

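On 32-bit x86 systems, Linux invokes a system call through interrupt 0x80, using the instruction int $0x80.
On x86_64 systems the syscall instruction is available and is the default; int $0x80 is still supported there for 32-bit compatibility.
The arguments of the system call are passed in registers:

  • x86 (int $0x80): the system call number goes in eax and the arguments in ebx, ecx, edx, esi, edi and ebp
  • x86_64 (syscall): the system call number goes in rax and the arguments in rdi, rsi, rdx, r10, r8 and r9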


Each system call has its own unique number, which we have to put in the eax register (rax on x86_64) to invoke that system call. For 32-bit x86, the numbers of all system calls are defined in the files arch/x86/include/generated/uapi/asm/unistd_32.h and usr/include/asm/unistd_32.h

If we would like to make a system call directly from user space, we can include <unistd.h> and <sys/syscall.h> and use the function long syscall(long number, …)

This function performs the following steps:

  • It copies the arguments and the system call number into the CPU registers where the kernel expects to find them
  • It executes the trap instruction (the interrupt or syscall) to switch to kernel mode; execution control passes to the kernel, which can now execute the system call
  • The system call ends and execution returns to user mode; if the system call returned an error, the errno variable is set

Libc either wraps a system call in a function or calls it directly. More information can be found on this web page: https://sourceware.org/glibc/wiki/SyscallWrappers

If we look at the assembly code of our program, we can see how the function inotify_add_watch is called:

gcc -g file.c
objdump -d -S a.out

In the output we can see the following lines:

int wd = inotify_add_watch( inotifyFd, bufPtr, IN_ALL_EVENTS );
a02: 48 8b 8d 20 f5 ff ff mov -0xae0(%rbp),%rcx
a09: 8b 85 04 f5 ff ff mov -0xafc(%rbp),%eax
a0f: ba ff 0f 00 00 mov $0xfff,%edx
a14: 48 89 ce mov %rcx,%rsi
a17: 89 c7 mov %eax,%edi
a19: e8 62 fd ff ff callq 780 <inotify_add_watch@plt>
a1e: 89 85 08 f5 ff ff mov %eax,-0xaf8(%rbp)

This line stands out: callq 780 <inotify_add_watch@plt>, which calls this function:

0000000000000780 <inotify_add_watch@plt>:
780: ff 25 42 18 20 00 jmpq *0x201842(%rip) # 201fc8 <inotify_add_watch@GLIBC_2.4>
786: 68 07 00 00 00 pushq $0x7
78b: e9 70 ff ff ff jmpq 700 <.plt>

At address 780 we can see a jmpq instruction that jumps, through a pointer, to the glibc function linked into our program. The name of this stub has @plt as a suffix.

The PLT is the Procedure Linkage Table, a table included in our program with one entry for each function whose final address is not yet known at link time; resolving it is left to the dynamic linker when we execute our program.

Besides the PLT there’s another table, the GOT or Global Offset Table, which is used in a similar way to resolve addresses. The difference is that the PLT holds small pieces of code, while the GOT holds data: one slot for each address. The location of each slot is fixed at link time, so our program always knows where to find the reference to the function, while the address stored in the slot can change every time we run the program. The GOT is updated during the execution of the program by the dynamic linker.
Let’s use the readelf command on our executable; the flag --relocs will show us the relocation sections

$ readelf --relocs a.out

Relocation section '.rela.plt' at offset 0x608 contains 9 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000201f90 000200000007 R_X86_64_JUMP_SLO 0000000000000000 puts@GLIBC_2.2.5 + 0
000000201f98 000300000007 R_X86_64_JUMP_SLO 0000000000000000 pathconf@GLIBC_2.2.5 + 0
000000201fa0 000400000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0
000000201fa8 000500000007 R_X86_64_JUMP_SLO 0000000000000000 getcwd@GLIBC_2.2.5 + 0
000000201fb0 000600000007 R_X86_64_JUMP_SLO 0000000000000000 read@GLIBC_2.2.5 + 0
000000201fb8 000900000007 R_X86_64_JUMP_SLO 0000000000000000 inotify_init@GLIBC_2.4 + 0
000000201fc0 000a00000007 R_X86_64_JUMP_SLO 0000000000000000 malloc@GLIBC_2.2.5 + 0
000000201fc8 000b00000007 R_X86_64_JUMP_SLO 0000000000000000 inotify_add_watch@GLIBC_2.4 + 0
000000201fd0 000c00000007 R_X86_64_JUMP_SLO 0000000000000000 exit@GLIBC_2.2.5 + 0

The entry for our function, inotify_add_watch, is the one at offset 201fc8. The fields printed by readelf --relocs are the following:

  • Offset: the location where the relocation is applied, here the address of the GOT slot for the symbol
  • Info: packs the index of the symbol in the symbol table (upper 32 bits) and the relocation type (lower 32 bits)
  • Type: the relocation type as defined by the ABI (Application Binary Interface)
  • Sym. Value: the value of the symbol; it is zero here because the symbols are undefined in our executable and get resolved at run time
  • Sym. Name + Addend: the symbol name plus the addend, a constant added when computing the value to store

If we want to see the GOT section of our program, we can dump it with the command: objdump -j .got -s a.out

a.out: file format elf64-x86-64

Contents of section .got:
201f78 881d2000 00000000 00000000 00000000 .. .............
201f88 00000000 00000000 16070000 00000000 ................
201f98 26070000 00000000 36070000 00000000 &.......6.......
201fa8 46070000 00000000 56070000 00000000 F.......V.......
201fb8 66070000 00000000 76070000 00000000 f.......v.......
201fc8 86070000 00000000 96070000 00000000 ................
201fd8 00000000 00000000 00000000 00000000 ................
201fe8 00000000 00000000 00000000 00000000 ................
201ff8 00000000 00000000 ........

Running the command objdump -R a.out, we obtain the dynamic relocation records, which the dynamic linker applies when the program is executed.

a.out: file format elf64-x86-64

DYNAMIC RELOCATION RECORDS
OFFSET TYPE VALUE
0000000000201d78 R_X86_64_RELATIVE ABS+0x00000000000008b0
0000000000201d80 R_X86_64_RELATIVE ABS+0x0000000000000870
0000000000202008 R_X86_64_RELATIVE ABS+0x0000000000202008
0000000000201fd8 R_X86_64_GLOB_DAT _ITM_deregisterTMCloneTable
0000000000201fe0 R_X86_64_GLOB_DAT __libc_start_main@GLIBC_2.2.5
0000000000201fe8 R_X86_64_GLOB_DAT __gmon_start__
0000000000201ff0 R_X86_64_GLOB_DAT _ITM_registerTMCloneTable
0000000000201ff8 R_X86_64_GLOB_DAT __cxa_finalize@GLIBC_2.2.5
0000000000201f90 R_X86_64_JUMP_SLOT puts@GLIBC_2.2.5
0000000000201f98 R_X86_64_JUMP_SLOT pathconf@GLIBC_2.2.5
0000000000201fa0 R_X86_64_JUMP_SLOT printf@GLIBC_2.2.5
0000000000201fa8 R_X86_64_JUMP_SLOT getcwd@GLIBC_2.2.5
0000000000201fb0 R_X86_64_JUMP_SLOT read@GLIBC_2.2.5
0000000000201fb8 R_X86_64_JUMP_SLOT inotify_init@GLIBC_2.4
0000000000201fc0 R_X86_64_JUMP_SLOT malloc@GLIBC_2.2.5
0000000000201fc8 R_X86_64_JUMP_SLOT inotify_add_watch@GLIBC_2.4
0000000000201fd0 R_X86_64_JUMP_SLOT exit@GLIBC_2.2.5

We can see that the addresses labeled as offsets fall inside the GOT section whose content we dumped above. If we look at the line:

780: ff 25 42 18 20 00 jmpq *0x201842(%rip) # 201fc8 <inotify_add_watch@GLIBC_2.4>

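The value 201fc8 is the address of the slot in our GOT where we expect to find the address of the function inotify_add_watch from the glibc library. In the dump of the GOT that slot still holds 0x786 (the bytes 86070000 00000000 read as little-endian), which is the address of the pushq instruction right after the jmpq: with lazy binding, the first call falls through to the dynamic linker, which resolves the function and stores its real address in the slot.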

So our process gets linked to the external function in the glibc library, which makes the system call.
If we run objdump on our local libc, we will get output similar to this:
objdump -d -S /lib/x86_64-linux-gnu/libc.so.6 | grep inotify_add_watch -A 10

00000000001222d0 <inotify_add_watch>:
1222d0: b8 fe 00 00 00 mov $0xfe,%eax
1222d5: 0f 05 syscall
1222d7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
1222dd: 73 01 jae 1222e0
1222df: c3 retq
1222e0: 48 8b 0d 81 8b 2c 00 mov 0x2c8b81(%rip),%rcx # 3eae68
1222e7: f7 d8 neg %eax
1222e9: 64 89 01 mov %eax,%fs:(%rcx)
1222ec: 48 83 c8 ff or $0xffffffffffffffff,%rax
1222f0: c3 retq
1222f1: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
1222f8: 00 00 00
1222fb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

So, we have seen how our programs enter kernel space to execute a system call, and how the dynamic linker binds them at run time to functions whose addresses are not known at link time.

In the next article, I will talk about the kernel source code, to understand how Inotify works and how system calls are defined inside the Linux Kernel.

While reading the book The Linux Programming Interface, I finally dived deep into Inotify and how it works.

Basically, inotify is a notification system for events happening on files and directories, such as accesses, changes, edits and so on. It was meant to replace dnotify, an older notification system that used signals to notify about changes. Inotify is based on a file descriptor, created when you set up an inotify instance, and on read syscalls on that descriptor to receive the events for the files being watched.

The flow of calls is the following:

  • Our program uses inotify_init() to create an inotify instance
  • Then we use inotify_add_watch() to add an item to the watch list, choosing which events we want to listen to. The function returns a watch descriptor that we use to tell which watched item an event belongs to.
  • We do a read() on the inotify file descriptor returned by inotify_init(). A single read can return several events, not just one.
  • Before our application exits we close the file descriptor associated with the inotify instance.

Below is the inotify event structure, with the explanation I took from the book.

struct inotify_event {
    int      wd;     /* Watch descriptor on which event occurred */
    uint32_t mask;   /* Bits describing event that occurred */
    uint32_t cookie; /* Cookie for related events (for rename()) */
    uint32_t len;    /* Size of 'name' field */
    char     name[]; /* Optional null-terminated filename */
};

This structure contains all the data needed to understand where the event comes from. wd is the watch descriptor returned by inotify_add_watch() for the item the event occurred on; mask is a bit mask describing the event, which we test with a bitwise & against each event constant; cookie is a field used to pair rename-related events (IN_MOVED_FROM and IN_MOVED_TO); len is the size of the name field; name is the name of the file the event happened on.

Below I attach a simple program I wrote to detect when a user runs an ls command in the current directory or in the parent directory.