
Fixing Stutters in Papers Please on Linux

Tue Dec 28, 2021

Since I switched to Linux some time ago, I have had my fair share of issues with running games on Linux. To be fair, most of them were not designed to run on Linux, but the great Proton and its parent project Wine make it easy to run most games I am interested in seamlessly. More often than not, when a game refuses to start, some workaround exists to make it run.

When I wanted to play some Papers Please, I was happy to see that a native port exists, which should make it easy to run. Installing and starting the game from GOG was easy enough, but already in the main menu something was off. After starting a game it was clear that the animations were pausing every few seconds for around a second, which made it almost unplayable.

Looking at ProtonDB, people are complaining about stutters. Some report that the 64-bit version stutters, whereas the 32-bit version works. Unfortunately for me, GOG only has a download for 64-bit on Linux. One comment says there's "a half a second freeze every few seconds", which seems similar to my experience.

There has to be a reason why the client stutters, since the developers must have tested this configuration! As a side note, I am wondering if the 64-bit version worked at some point, but without a good starting point that is hard to test. So let's investigate where the pauses are coming from. Linux features a full suite of tools to debug many different aspects, from performance to correctness. I am assuming here that the stutters are not inherent in the game logic, because the 32-bit version works properly.

What's happening?

The first tool I am using is strace. In its default configuration it runs a program, records all system calls the program makes, and reports them on the console. Without any flags every call is reported, which can be overwhelming, but without a clue where to look, I need to record everything.

After some digging through the useful options, I found the -T and -tt flags, which record the time spent in each syscall and a timestamp with microsecond resolution. Since strace writes its trace to stderr, I record some data using strace -T -tt ./PapersPlease 2> log.txt and sift through the mess:

...
13:08:06.067929 pselect6(9, [8], NULL, NULL, {tv_sec=0, tv_nsec=0}, NULL) = 0 (Timeout) <0.000005>
13:08:06.067946 recvmsg(8, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000004>
13:08:06.067962 pselect6(9, [8], NULL, NULL, {tv_sec=0, tv_nsec=0}, NULL) = 0 (Timeout) <0.000005>
13:08:06.068033 getpid()                = 2541133 <0.000005>
13:08:06.068048 getpid()                = 2541133 <0.000004>
13:08:06.068062 getpid()                = 2541133 <0.000004>
13:08:06.068077 getpid()                = 2541133 <0.000004>
13:08:06.068091 getpid()                = 2541133 <0.000004>
13:08:06.068105 getpid()                = 2541133 <0.000004>
13:08:06.068119 poll([{fd=8, events=0}], 1, 0) = 0 (Timeout) <0.000005>
13:08:06.068138 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.019100>
-------------------------------------------------------------------------------
13:08:06.087261 ioctl(19, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x52, 0x10), 0x7ffce133ef70) = 0 <0.000008>
13:08:06.087288 sched_yield()           = 0 <0.000005>
13:08:06.087306 sched_yield()           = 0 <0.000005>
13:08:06.087325 sched_yield()           = 0 <0.000004>
13:08:06.087340 sched_yield()           = 0 <0.000004>
13:08:06.087357 recvmsg(8, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="6324015304j103521v3003366)3X3t1201"..., iov_len=4096}], msg_i>
13:08:06.087391 recvmsg(8, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000005>
13:08:06.087410 recvmsg(8, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000006>
13:08:06.087433 recvmsg(8, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000005>
13:08:06.087454 recvmsg(8, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000005>
...

strace records a lot of data, such as the parameters passed into the syscalls as well as the return value. For some parameters, like flags, strace knows their meaning. We can also see that the game makes at least some syscall every few milliseconds. I have highlighted the longest gap between syscalls, which originates from the previous syscall taking 19 milliseconds to complete. The first parameter to poll contains a file descriptor, which we can resolve to a path using lsof -p <pid>. In this case it resolves to /dev/nvidia0, so this is some communication with my graphics card. The delay of 19 ms is close to the magic 16.67 ms which game developers can spend on rendering a single frame to still achieve 60 fps. This could be related to v-sync and lets us filter out normal operation.
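What lsof reports for a file descriptor can also be read straight from /proc. The following is a minimal sketch, assuming Linux with /proc mounted; fd_target is a name made up for this illustration and is not part of strace or lsof.

```cpp
#include <limits.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Resolve a file descriptor of a process to the path it refers to by reading
// the symlink /proc/<pid>/fd/<fd>. Returns an empty string on failure, for
// example when the descriptor does not exist or permissions are missing.
std::string fd_target(pid_t pid, int fd) {
    const std::string link =
        "/proc/" + std::to_string(pid) + "/fd/" + std::to_string(fd);
    char buf[PATH_MAX];
    const ssize_t len = readlink(link.c_str(), buf, sizeof(buf));
    if (len < 0) {
        return {};
    }
    return std::string(buf, static_cast<std::size_t>(len));
}
```

Called with the PID of the game and descriptor 19, this should return the same path that lsof shows.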

Using this information, we can filter for polling on file descriptor 19 using grep "fd=19" log.txt:

...
13:08:05.987848 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.018904>
13:08:06.007749 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.019502>
13:08:06.028516 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.018306>
13:08:06.048071 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.018689>
13:08:06.068138 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.019100>
13:08:06.088641 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.018155>
13:08:06.108200 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.018565>
-------------------------------------------------------------------------------
13:08:07.519463 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.000005>
13:08:07.522046 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.004434>
13:08:07.539017 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.008308>
13:08:07.554338 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.012495>
13:08:07.570526 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.016243>
...

Most of the time, poll is called roughly every 19-20 ms, but this rhythm is interrupted by a 1.4 s pause. The next poll returns in effectively zero time, because the v-sync is already late and no waiting is needed to keep a consistent frame rate. A pause of 1.4 s is very noticeable for a player and consistent with the comment on ProtonDB about the half-second stuttering. The difference between 0.5 s and 1.4 s might be human perception, or the issue might be worse on my machine. Either way, we have a clue.
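Spotting such a pause by eye gets tedious in a long log, so the search can be automated. The sketch below is only an illustration: it assumes each line starts with a timestamp of the form HH:MM:SS followed by six fractional digits, and the names parse_timestamp_us, Gap and find_gaps are invented here.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Parse the leading "HH:MM:SS.uuuuuu" timestamp of an strace line into
// microseconds since midnight. Returns -1 if the line has no such prefix.
long long parse_timestamp_us(const std::string& line) {
    int h = 0, m = 0, s = 0, us = 0;
    if (std::sscanf(line.c_str(), "%2d:%2d:%2d.%6d", &h, &m, &s, &us) != 4) {
        return -1;
    }
    return ((h * 60LL + m) * 60LL + s) * 1000000LL + us;
}

struct Gap {
    std::size_t index;       // index of the line that follows the gap
    long long microseconds;  // time since the previous timestamped line
};

// Collect all gaps between consecutive timestamped lines that are longer
// than threshold_us. Lines without a timestamp (such as "...") are skipped.
std::vector<Gap> find_gaps(const std::vector<std::string>& lines,
                           long long threshold_us) {
    std::vector<Gap> gaps;
    long long previous = -1;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        const long long current = parse_timestamp_us(lines[i]);
        if (current < 0) {
            continue;
        }
        if (previous >= 0 && current - previous > threshold_us) {
            gaps.push_back({i, current - previous});
        }
        previous = current;
    }
    return gaps;
}
```

With a threshold of 100 ms, the only gap in the filtered poll lines above is the one between 13:08:06.108200 and 13:08:07.519463, which is 1.411263 s measured from call start to call start.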

During the long pause a few syscalls stand out, because I have not seen them anywhere else:

13:08:06.127285 openat(AT_FDCWD, "/dev/input", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 29 <0.000010>
13:08:06.127311 newfstatat(29, "", {st_mode=S_IFDIR|0755, st_size=720, ...}, AT_EMPTY_PATH) = 0 <0.000005>
13:08:06.127340 getdents64(29, 0x55fb3ef36880 /* 36 entries */, 32768) = 1128 <0.000014>
13:08:06.127369 stat("/dev/input/event10", {st_mode=S_IFCHR|0660, st_rdev=makedev(0xd, 0x4a), ...}) = 0 <0.000005>
13:08:06.127388 openat(AT_FDCWD, "/dev/input/event10", O_RDONLY) = 32 <0.000007>
13:08:06.127406 ioctl(32, EVIOCGBIT(0, 8), [EV_SYN EV_KEY EV_MSC]) = 8 <0.000006>
13:08:06.127430 ioctl(32, EVIOCGBIT(EV_KEY, 96), [KEY_POWER KEY_SLEEP KEY_WAKEUP]) = 96 <0.000004>
13:08:06.127447 ioctl(32, EVIOCGBIT(EV_ABS, 8), []) = 8 <0.000005>
13:08:06.127463 close(32)              = 0 <0.039483>
13:08:06.173139 stat("/dev/input/event9", {st_mode=S_IFCHR|0660, st_rdev=makedev(0xd, 0x49), ...}) = 0 <0.000008>
13:08:06.173177 openat(AT_FDCWD, "/dev/input/event9", O_RDONLY) = 32 <0.000009>
13:08:06.173200 ioctl(32, EVIOCGBIT(0, 8), [EV_SYN EV_KEY EV_REL EV_ABS ...]) = 8 <0.000005>
13:08:06.173219 ioctl(32, EVIOCGBIT(EV_KEY, 96), [KEY_ESC KEY_ENTER KEY_KPMINUS KEY_KPPLUS ...]) = 96 <0.000005>
13:08:06.173239 ioctl(32, EVIOCGBIT(EV_ABS, 8), [ABS_VOLUME]) = 8 <0.000005>
13:08:06.173256 close(32)              = 0 <0.023215>
...
13:08:07.465881 stat("/dev/input/event0", {st_mode=S_IFCHR|0660, st_rdev=makedev(0xd, 0x40), ...}) = 0 <0.000008>
13:08:07.465904 openat(AT_FDCWD, "/dev/input/event0", O_RDONLY) = 32 <0.000009>
13:08:07.465926 ioctl(32, EVIOCGBIT(0, 8), [EV_SYN EV_KEY]) = 8 <0.000005>
13:08:07.465943 ioctl(32, EVIOCGBIT(EV_KEY, 96), [KEY_POWER]) = 96 <0.000004>
13:08:07.465958 ioctl(32, EVIOCGBIT(EV_ABS, 8), []) = 8 <0.000004>
13:08:07.465974 close(32)              = 0 <0.017162>
13:08:07.483156 getdents64(29, 0x55fb3ef36880 /* 0 entries */, 32768) = 0 <0.000007>
13:08:07.483179 close(29)              = 0 <0.000006>

The snippet starts with opening /dev/input, a folder containing the input devices, which results in file descriptor 29. Using getdents64 the program iterates over the directory entries, each of which is subsequently opened, checked for some flags and closed. In the end, file descriptor 29 is closed. Overall, this operation takes around 1.4 s, so it seems we have found where the pause originates. The main delay is caused by the close operations, which take up to 100 ms (not included in the snippet). This result is surprising, as a simple operation like close should be much faster. The last line of the snippet even gives an example: closing file descriptor 29 only takes 6 µs. Compared to the 17 to 39 ms of the other close calls in the snippet that is a factor of several thousand, and compared to the 100 ms worst case more than a factor of 10,000. Something fishy is going on here.
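To check whether the slow close is specific to the game, the same open and close sequence can be timed in isolation. This is a rough sketch rather than a reconstruction of what SDL does: it skips the ioctl checks, and time_close_calls is a name chosen for this illustration.

```cpp
#include <chrono>
#include <dirent.h>
#include <fcntl.h>
#include <string>
#include <unistd.h>
#include <utility>
#include <vector>

// Open every entry of `dir` whose name starts with `prefix` read-only, close
// it again and record how long the close() call took in microseconds.
std::vector<std::pair<std::string, long long>> time_close_calls(
    const std::string& dir, const std::string& prefix) {
    std::vector<std::pair<std::string, long long>> timings;
    DIR* handle = opendir(dir.c_str());
    if (!handle) {
        return timings;
    }
    while (const dirent* entry = readdir(handle)) {
        const std::string name = entry->d_name;
        if (name.compare(0, prefix.size(), prefix) != 0) {
            continue;
        }
        const std::string path = dir + "/" + name;
        const int fd = open(path.c_str(), O_RDONLY);
        if (fd < 0) {
            continue;  // typically missing permissions
        }
        const auto start = std::chrono::steady_clock::now();
        close(fd);
        const auto stop = std::chrono::steady_clock::now();
        timings.emplace_back(
            path,
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
                .count());
    }
    closedir(handle);
    return timings;
}
```

Calling time_close_calls("/dev/input", "event") as a user who is allowed to read the event devices should show whether close is slow outside the game as well.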

We can understand better what is happening once we know the call stack that leads to the slow close operation. Luckily, strace is loaded with many useful features like -k, which does exactly what we want: it records the stack trace of every call. We only need the stack traces for close calls, which we can filter using -e trace=close. This results in a trace like this:

13:08:07.465974 close(32)              = 0 <0.017162>
 > /usr/lib/libpthread-2.33.so(__close+0x3b) [0x1282b]
 > ~/Games/gog/papers-please/game/lime.ndll(MaybeAddDevice+0x127) [0x4a6d07]
 > ~/Games/gog/papers-please/game/lime.ndll(LINUX_JoystickDetect+0x1db) [0x4a70eb]
 > ~/Games/gog/papers-please/game/lime.ndll(LINUX_JoystickInit+0x91) [0x4a7191]
 > ~/Games/gog/papers-please/game/lime.ndll(SDL_JoystickInit+0x3f) [0x47d8ff]
 > ~/Games/gog/papers-please/game/lime.ndll(SDL_InitSubSystem_REAL+0x1c5) [0x471c15]
 > ~/Games/gog/papers-please/game/lime.ndll(lime::SDLApplication::SDLApplication()+0xa1) [0xea6a1]
 > ~/Games/gog/papers-please/game/lime.ndll(lime::CreateApplication()+0x1b) [0xeb23b]
 > ~/Games/gog/papers-please/game/lime.ndll(lime::lime_application_create()+0x9) [0xb9929]
 > ~/Games/gog/papers-please/game/PapersPlease(lime::_internal::backend::native::NativeApplication_obj::__construct(hx::ObjectPtr)+0x48b) [>
 > ~/Games/gog/papers-please/game/PapersPlease(lime::_internal::backend::native::NativeApplication_obj::__alloc(hx::ImmixAllocator*, hx::ObjectPtr
 > ~/Games/gog/papers-please/game/PapersPlease(lime::app::Application_obj::__construct()+0xf6) [0xdfc206]
 > ~/Games/gog/papers-please/game/PapersPlease(openfl::display::Application_obj::__construct()+0x3b) [0x5d8a9b]
 > ~/Games/gog/papers-please/game/PapersPlease(openfl::display::Application_obj::__alloc(hx::ImmixAllocator*)+0x41) [0x5d8da1]
 > ~/Games/gog/papers-please/game/PapersPlease(ApplicationMain_obj::create(Dynamic)+0x6d) [0xdf3a4d]
 > ~/Games/gog/papers-please/game/PapersPlease(ApplicationMain_obj::main()+0x66) [0x2b8936]
 > ~/Games/gog/papers-please/game/PapersPlease(main+0x51) [0x2be641]
 > /usr/lib/libc-2.33.so(__libc_start_main+0xd5) [0x27b25]
 > ~/Games/gog/papers-please/game/PapersPlease(_start+0x2a) [0x2bf48a]

The PapersPlease binary contains symbols, which helps a lot. The function calling close is MaybeAddDevice, which is called via LINUX_JoystickDetect from functions with an SDL_ prefix. Searching through GitHub, I can find the source code of MaybeAddDevice. The code opens some path, checks if it is a joystick, and closes it again. It does this for every input device on my computer. Somehow, there are 30 input devices in my /dev/input, which might make the issue more extreme on my computer.

At this point, we have an idea of what is happening: PapersPlease uses SDL on Linux, which checks every few seconds if a joystick has been connected. To do this, every input device is checked. Normally, this check should be almost instantaneous, but here it takes between 20 and 100 ms per device. On computers with a lot of input devices this can easily take more than a second, which results in a noticeable stutter, because the operation takes place in the main game loop.

Why is close slow?

Time to bring out the big guns: bcc. This toolkit can be used to instrument processes and calls at different levels. While writing tools with this toolkit is not too hard, it also comes bundled with a few helpful tools. One of these is funcslower, which takes a symbol from a process or a kernel symbol and measures how long each invocation takes. If an invocation takes longer than a threshold, it is logged. With /usr/share/bcc/tools/funcslower /usr/lib/libpthread-2.33.so:close -p <pid> we can filter close calls which take longer than 1 ms.

Tracing function calls slower than 1 ms... Ctrl+C to quit.
TIME       COMM           PID    LAT(ms)             RVAL FUNC
0.000000   PapersPlease   539544   43.49                0 /usr/lib/libpthread-2.33.so:close
0.043507   PapersPlease   539544   32.42                0 /usr/lib/libpthread-2.33.so:close
0.075961   PapersPlease   539544   20.86                0 /usr/lib/libpthread-2.33.so:close
...
1.163494   PapersPlease   539544   21.81                0 /usr/lib/libpthread-2.33.so:close
1.185336   PapersPlease   539544   31.48                0 /usr/lib/libpthread-2.33.so:close
1.216830   PapersPlease   539544   21.78                0 /usr/lib/libpthread-2.33.so:close
3.000077   PapersPlease   539544   23.92                0 /usr/lib/libpthread-2.33.so:close
3.024016   PapersPlease   539544   32.80                0 /usr/lib/libpthread-2.33.so:close
3.056834   PapersPlease   539544   39.98                0 /usr/lib/libpthread-2.33.so:close
...
4.261013   PapersPlease   539544   35.80                0 /usr/lib/libpthread-2.33.so:close
4.296830   PapersPlease   539544   26.65                0 /usr/lib/libpthread-2.33.so:close
4.323498   PapersPlease   539544   26.65                0 /usr/lib/libpthread-2.33.so:close
6.000148   PapersPlease   539544   43.33                0 /usr/lib/libpthread-2.33.so:close
6.043497   PapersPlease   539544   39.99                0 /usr/lib/libpthread-2.33.so:close
6.083495   PapersPlease   539544   26.66                0 /usr/lib/libpthread-2.33.so:close
...
7.265320   PapersPlease   539544   31.49                0 /usr/lib/libpthread-2.33.so:close
7.296819   PapersPlease   539544   26.66                0 /usr/lib/libpthread-2.33.so:close
7.323483   PapersPlease   539544   26.66                0 /usr/lib/libpthread-2.33.so:close

This reproduces the results from our earlier analysis: every few seconds, close calls are slow. We can also see that the stuttering starts every three seconds and on my machine lasts 1.2 to 1.4 s. But where is this latency coming from? In principle, the close function should just invoke the syscall and then return, so it should not introduce latency itself, at least none measurable in milliseconds. The syscall is handled in the kernel in the _sys_close function, which on my machine has the prefix __x64. So we start funcslower again with /usr/share/bcc/tools/funcslower __x64_sys_close -p <pid> -t and we get ... nothing. Either we used the wrong function or the function completes in less than 1 ms. With the -u 1 flag, we can filter for invocations which take longer than 1 µs. And with this setup we do get results:

TIME       COMM           PID    LAT(us)             RVAL FUNC
0.000000   PapersPlease   539544    1.00                0 __x64_sys_close
2.240417   PapersPlease   539544    1.55                0 __x64_sys_close
2.269315   PapersPlease   539544    1.33                0 __x64_sys_close
2.311854   PapersPlease   539544    1.03                0 __x64_sys_close
2.535184   PapersPlease   539544    1.31                0 __x64_sys_close
2.591848   PapersPlease   539544    1.11                0 __x64_sys_close
2.725182   PapersPlease   539544    1.14                0 __x64_sys_close
3.018515   PapersPlease   539544    1.00                0 __x64_sys_close
3.151844   PapersPlease   539544    1.04                0 __x64_sys_close
3.205187   PapersPlease   539544    1.55                0 __x64_sys_close
3.259313   PapersPlease   539544    1.08                0 __x64_sys_close
3.485194   PapersPlease   539544    1.49                0 __x64_sys_close
5.240487   PapersPlease   539544    1.54                0 __x64_sys_close
5.274988   PapersPlease   539544    1.14                0 __x64_sys_close

We can still see a repetition after three seconds, the time between the second and the second to last call. But only 11 of the 30 calls to close are recorded and their latency is very close to 1 µs. Perhaps the missing calls were slightly faster than 1 µs. Either way, __x64_sys_close is far too fast to account for our 50 ms delay.

In theory, if the kernel does not add latency, the measured delay has to come from the libpthread close function. We can use gdb to disassemble a particular function: gdb -batch -ex "file /usr/lib/libpthread-2.33.so" -ex "disas close". The resulting disassembly is shown below. The three main points are the two calls to some pthread functions as well as the syscall instruction, which is effectively a call into the kernel. The remaining instructions are negligible from a latency perspective. Ideally, we would use a profiling tool like perf to detect where the latency originates, but I do not know how to use perf for analysing a single function. perf records samples of the stack every few ticks, but at that moment the program might be doing something else, and the moment where the MaybeAddDevice function is running might not be recorded.

Dump of assembler code for function close:
   ...
   0x0000000000012810 <+32>:	sub    $0x18,%rsp
   0x0000000000012814 <+36>:	mov    %edi,0xc(%rsp)
   0x0000000000012818 <+40>:	call   0x12450 <__pthread_enable_asynccancel>
   0x000000000001281d <+45>:	mov    0xc(%rsp),%edi
   0x0000000000012821 <+49>:	mov    %eax,%r8d
   0x0000000000012824 <+52>:	mov    $0x3,%eax
   0x0000000000012829 <+57>:	syscall
   0x000000000001282b <+59>:	cmp    $0xfffffffffffff000,%rax
   0x0000000000012831 <+65>:	ja     0x12868 <close+120>
   0x0000000000012833 <+67>:	mov    %r8d,%edi
   0x0000000000012836 <+70>:	mov    %eax,0xc(%rsp)
   0x000000000001283a <+74>:	call   0x124d0 <__pthread_disable_asynccancel>
   0x000000000001283f <+79>:	mov    0xc(%rsp),%eax
   0x0000000000012843 <+83>:	add    $0x18,%rsp
   0x0000000000012847 <+87>:	ret
   ...
   0x0000000000012868 <+120>:	mov    0x9739(%rip),%rdx        # 0x1bfa8
   0x000000000001286f <+127>:	neg    %eax
   0x0000000000012871 <+129>:	mov    %eax,%fs:(%rdx)
   0x0000000000012874 <+132>:	mov    $0xffffffff,%eax
   0x0000000000012879 <+137>:	jmp    0x12833 <close+67>

We can build our own toy profiler with the related funclatency tool from bcc, which records a histogram of all latencies, so we do not miss any calls. In addition to specifying the function we want to trace, with its corresponding address, we can specify an offset in the function where the tracing begins. Using this feature, we can start tracing at different points within the function and see how the latencies change. The latency added by an instruction is then the difference between the latency measured before and after that instruction. This technique is not reliable when we are microprofiling on the order of microseconds or shorter, but our latency is long enough to make a measurable difference.

As the main latency should come from the three calls, we can measure at the two points between the calls. For my testing, I used the offsets 57 and 59, so right before and right after the syscall. With this setup I get the following histograms.

#> /usr/share/bcc/tools/funclatency /lib/libpthread-2.33.so:close -o 57 -p <pid>
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 0        |                                        |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 17       |*****************                       |
  33554432 -> 67108863   : 39       |****************************************|
  67108864 -> 134217727  : 2        |**                                      |

avg = 46994220 nsecs, total: 2725664779 nsecs, count: 58
#> /usr/share/bcc/tools/funclatency /lib/libpthread-2.33.so:close -o 59 -p <pid>
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 15       |********                                |
      1024 -> 2047       : 68       |****************************************|
      2048 -> 4095       : 27       |***************                         |
      4096 -> 8191       : 1        |                                        |

avg = 1678 nsecs, total: 186310 nsecs, count: 111

This is as far as I can analyze this problem. I think this proves that the latency originates in the kernel, though not directly in the __x64_sys_close function but in something else that the kernel is doing. Doing some research with perf, I found the synchronize_rcu function within the kernel, which might be the root cause. It is part of the RCU (read-copy-update) system in the kernel, which allows lock-free reads of shared data structures; synchronize_rcu blocks the caller until all readers that were active before the call have finished. This is a rabbit hole I am not going down this time.
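The subtraction behind this conclusion can be written down explicitly. The snippet below only restates the averages from the two histograms. The device count of 30 is taken from my /dev/input, and the assumption that each device costs exactly one slow close per scan is a simplification.

```cpp
#include <cstdint>

// Average latencies reported by funclatency in the two runs above, in ns.
constexpr std::int64_t avg_from_offset_57_ns = 46994220;  // starts before syscall
constexpr std::int64_t avg_from_offset_59_ns = 1678;      // starts after syscall

// Time attributable to the syscall instruction itself.
constexpr std::int64_t syscall_ns = avg_from_offset_57_ns - avg_from_offset_59_ns;

// Assumption: one slow close() per input device and 30 devices per scan.
constexpr std::int64_t device_count = 30;
constexpr std::int64_t scan_ns = syscall_ns * device_count;

static_assert(syscall_ns == 46992542, "about 47 ms are spent inside the syscall");
static_assert(scan_ns == 1409776260, "about 1.41 s per scan of /dev/input");
```

The projected 1.41 s per scan matches the 1.4 s pause seen in the strace log, which suggests that practically all of the stall is spent inside the close syscall.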

With all this information, it is still not obvious how to fix the problem. I am not a kernel developer, and the latency could still come from some other process accessing the same data. So a general fix is unlikely. But the whole point of enumerating the input devices every few seconds is to detect newly connected joysticks, and Papers Please is not the kind of game where a joystick, or more likely a game controller, is useful. So why do we need this functionality in the first place?

Binary patching

The function we are looking at is called MaybeAddDevice. Based on this name, not adding the device should be acceptable behaviour, so changing the implementation to just return immediately seems like a good choice. We can make this change by editing the binary at the point where the function is located.

MaybeAddDevice is part of lime.ndll, a shared library which probably contains the game engine. With objdump -t we can inspect all the symbols which lime.ndll imports or exports, shown below. The first number is the address of the function inside the library; at runtime a random base offset is added to it. After that the type of the symbol is given: here we have a global (g) and a local (l) symbol. With this information, we could use a hex editor and patch the binary directly, but there is, in my opinion, a more elegant way.

000000000049c030 g     F .text	0000000000000009              SDL_SemWait
...
00000000004a6be0 l     F .text	000000000000032d              MaybeAddDevice

On Linux, there is a mechanism which lets us inject our own code into a process. We can load our own shared library using the LD_PRELOAD environment variable. Every global function from a shared library can be overridden using this technique. You might have noticed that the function we are interested in is not a global function. This complicates our endeavour a bit. Instead of overriding the function directly, we override the library loading function dlopen and thereby inject ourselves into the loading process. We forward all calls to the original dlopen and only act once we see the lime.ndll library.

At this point, the library is loaded into memory, but we do not know where. We can query the location of global symbols using dlsym, and using the information from objdump we can calculate where MaybeAddDevice is located relative to SDL_SemWait. Because both of these are located in the same section, .text, their relative location does not change. So we just write to the calculated address, right? Well, on modern systems there are protections in place so that memory containing executable code cannot be written. This is a very important protection to reduce the attack surface of software, because otherwise an attacker could write their own code into memory and just jump there. But we want to use this power for good! Luckily, Linux also includes the ability to change the protections at runtime using mprotect. Otherwise software using JIT compilation could not be implemented.
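The relative offset follows directly from the two addresses in the objdump output. As a small sketch (the constant and function names are made up for this illustration):

```cpp
#include <cstdint>

// Addresses of the two symbols as listed by objdump -t for lime.ndll.
constexpr std::uint64_t sdl_semwait_address = 0x49c030;
constexpr std::uint64_t maybe_add_device_address = 0x4a6be0;

// Distance between the two functions. It stays the same after loading,
// because both live in the same .text section.
constexpr std::uint64_t maybe_add_device_offset =
    maybe_add_device_address - sdl_semwait_address;

static_assert(maybe_add_device_offset == 0xabb0, "offset in hexadecimal");
static_assert(maybe_add_device_offset == 43952, "offset in decimal");

// Given the runtime address of SDL_SemWait (as returned by dlsym), compute
// the runtime address of MaybeAddDevice.
constexpr std::uint64_t runtime_address(std::uint64_t sdl_semwait_runtime) {
    return sdl_semwait_runtime + maybe_add_device_offset;
}
```

The result, 43952, is the offset that has to be added to whatever address dlsym returns for SDL_SemWait.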

So, we inject ourselves into the library loading process, find the location of the function, change the memory protection to writable and patch the function. The patching itself is quite simple: we just write a ret as the first instruction, which is encoded as 0xc3. The next time this function is called, it returns immediately.

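#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <dlfcn.h>
#include <sys/mman.h>
#include <unistd.h>

using namespace std::string_literals;

// round an address down to the start of the page which contains it
static void* page_round_down(void* ptr, long page_size) {
	const auto address = reinterpret_cast<std::uintptr_t>(ptr);
	return reinterpret_cast<void*>(address & ~static_cast<std::uintptr_t>(page_size - 1));
}

void patch_function(void* ptr, size_t offset) {
	char* func = reinterpret_cast<char*>(ptr);
	func += offset;
	const long page_size = sysconf(_SC_PAGESIZE);
	// make the page which contains the function writable
	int result = mprotect(page_round_down(func, page_size), 1, PROT_READ | PROT_WRITE | PROT_EXEC);
	if(result) {
		fprintf(stderr, "result %d %d\n", result, errno);
		return;
	}
	func[0] = static_cast<char>(0xc3); // the actual patching: ret
	// restore the original protection
	mprotect(page_round_down(func, page_size), 1, PROT_READ | PROT_EXEC);
}

using dlopen_t = void* (*)(const char*, int);

extern "C"
void* dlopen(const char* filename, int flags) {
	static dlopen_t real_dlopen = nullptr;
	if(!real_dlopen)
		real_dlopen = reinterpret_cast<dlopen_t>(dlsym(RTLD_NEXT, "dlopen"));
	void* handle = real_dlopen(filename, flags);
	// only act on a successfully loaded lime.ndll
	if(!handle || !filename || filename != "././lime.ndll"s) {
		return handle;
	}
	const char* symbol = "SDL_SemWait";
	void* func = dlsym(handle, symbol);
	if(!func) {
		fprintf(stderr, "could not find symbol: %s\n", symbol);
		return handle;
	}
	patch_function(func, 43952);
	return handle;
}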

We can compile this code to a shared library and inject it using LD_PRELOAD. A full implementation can be found here. I have modified the code to allow setting the function which we look up and the offset using environment variables. This way, even if the binary changes, the code does not have to be recompiled.

The environment variables can be set in the start.sh file like this:

export LD_PRELOAD=$(realpath papers_please_fix.so)
export PP_SYMBOL=SDL_SemWait PP_OFFSET=43952
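Reading these two variables inside the preloaded library could look roughly like the sketch below. Only the variable names PP_SYMBOL and PP_OFFSET come from the snippet above. The struct, the function name and the error handling are assumptions made for this illustration and not necessarily what the full implementation does.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

struct PatchTarget {
    std::string symbol;    // global symbol to look up with dlsym
    unsigned long offset;  // distance from that symbol to the function to patch
};

// Read PP_SYMBOL and PP_OFFSET from the environment. Returns false when a
// variable is missing or empty, or when the offset is not a valid number.
bool read_patch_target(PatchTarget& target) {
    const char* symbol = std::getenv("PP_SYMBOL");
    const char* offset_text = std::getenv("PP_OFFSET");
    if (!symbol || !offset_text || *symbol == '\0' || *offset_text == '\0') {
        return false;
    }
    errno = 0;
    char* end = nullptr;
    // base 0 accepts both decimal (43952) and hexadecimal (0xabb0) input
    const unsigned long offset = std::strtoul(offset_text, &end, 0);
    if (errno != 0 || *end != '\0') {
        return false;
    }
    target.symbol = symbol;
    target.offset = offset;
    return true;
}
```

If this returns false, the safest reaction is probably to skip the patch and let the game run unmodified.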

With all of this in place, Papers Please works as intended, at least if you do not play it with a joystick. Unfortunately, I still do not know if the slow close call is a bug and, if so, where it comes from. Nevertheless, I can finally play Papers Please, and I hope that you will be able to as well, or that you at least learned something along the way.
