Fork this mess

Wed Nov 2 '22

Recently a computational pipeline was hanging indefinitely on large inputs. The pipeline was written in Nextflow, but the process that was hanging was a Python process that used Python’s multiprocessing.Pool to speed up some analysis.[1]

Running the same code outside of Nextflow didn’t produce the same indefinite hang, which was extremely curious. My initial thoughts were that Nextflow was doing some weird process management, that the way it handled stdout/stderr from the processes was causing a deadlock, or that resource contention was the issue, but all of these were fairly easy to rule out.

Infinity pool

Here is a small program that I will use to reproduce the issue, since I am not permitted to share the actual code. The following program uses a Pool of three processes, each of which catches some 💤’s:

import multiprocessing
import time

def sleepy_boi(i):
    time.sleep(900)
    print(f'OK {i}')

if __name__ == '__main__':
    with multiprocessing.Pool(3) as executor:
        executor.map(sleepy_boi, [1,2,3])

If you run this program, you will see the three child processes sawing logs, but also some threads:

$ pstree -Aacutspn 452457
  `-python3,452457 sleepy.py
      |-python3,452458 sleepy.py
      |-python3,452459 sleepy.py
      |-python3,452460 sleepy.py
      |-{python3},452461
      |-{python3},452462
      `-{python3},452463

It’s not totally clear what these threads are doing or why they were created, but we can see their current state with strace:

$ strace -fp 452457
  strace: Process 452457 attached with 4 threads
  [pid 452462] futex(0x5567c19bba80, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid 452457] futex(0x5567c1987f30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid 452461] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
  [pid 452463] read(5,

While it’s not immediately clear what’s happening here (there are some locks and a blocking read), Python probably knows what it’s doing, and I don’t. At some point later, the sleepy bois finish their winkers and the program exits. Now let’s see Paul Allen’s program…

A clue, finally

Look at that subtle syscall. The tasteful read.

Oh my god. It even has a futex!

We now run the deadlocking program; inspecting the running process and threads, things look remarkably similar:

$ pstree -Aacutspn 456088
  `-python3,456088 forky.py
     |-python3,456089 forky.py
     |-python3,456091 forky.py
     |-{python3},456092
     |-{python3},456093
     |-{python3},456094
     `-python3,456096 forky.py

$ strace -fp 456088
  strace: Process 456088 attached with 4 threads
  [pid 456093] futex(0x5642cf689270, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
  [pid 456092] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
  [pid 456094] read(5,  <unfinished ...>
  [pid 456088] futex(0x5642cf655f30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

On closer comparison, however, there is a difference: the PIDs are no longer sequential, and the processes appear in a different order than before. This in and of itself is not necessarily an issue, since plenty of other processes are being spawned on the same system, but it is curious.

More to the point, there are still three processes running, but one of them was started much later! Could one of the original processes have been killed? To test, let’s kill one of the earlier processes:

$ kill 456089

$ pstree -Aacutspn 456088
  `-python3,456088 forky.py
     |-python3,456091 forky.py
     |-{python3},456092
     |-{python3},456093
     |-{python3},456094
     |-python3,456096 forky.py
     `-python3,479607 forky.py

Sure enough, the process we killed is gone, and a new worker has appeared in its place with a higher PID. At this point, we have enough to introduce a minimal test case.

oopsies.py

import multiprocessing
import os

def murder_bot(i):
    if i == 2:
        # Sorry kid, nothing personal
        os.system('kill %d' % os.getpid())
    print(f'OK {i}')

if __name__ == '__main__':
    with multiprocessing.Pool(3) as executor:
        executor.map(murder_bot, [1,2,3])

If you run this, it will hang forever:

$ /usr/bin/python3 oopsies.py
  OK 1
  OK 3
  █

We don’t want our pipeline to hang forever if one of the child processes is terminated. One way to handle this is to literally handle it: install a SIGTERM handler that raises an exception. This works fairly nicely:

import multiprocessing
import os
import signal

def handle_term(signum, frame):
    # Turn the polite kill into an exception the Pool machinery can report
    raise Exception('oh no')

def init_process(*initargs):
    # Runs once in each worker process as it starts up
    signal.signal(signal.SIGTERM, handle_term)

def murder_bot(i):
    if i == 2:
        os.system('kill %d' % os.getpid())
    print(f'OK {i}')

if __name__ == '__main__':
    with multiprocessing.Pool(3, initializer=init_process) as executor:
        executor.map(murder_bot, [1,2,3])

The unhandled exception in the child process is then bubbled up to the main process, which re-raises it.

$ python forky.py
 OK 1
 OK 3
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/home/brandon/forky.py", line 15, in murder_bot
    os.system('kill %d' % os.getpid())
  File "/home/brandon/forky.py", line 8, in handle_term
    raise Exception('oh no')
Exception: oh no

The OOM slayer

This could be the end of the blog post, but unfortunately the pipeline runs in the real world, where nobody sends nice SIGTERM messages to softly ask your process to exit. In this cruel world, you answer to the OOM killer[2], and it’s ready to fuck your shit up with a SIGKILL.
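The cruel world is easy to simulate: here’s the SIGTERM-handler version from above with kill -9 standing in for the OOM killer (my tweak, not something from the actual pipeline). The handler never gets a chance to run, so you should be right back to hanging forever:

import multiprocessing
import os
import signal

def handle_term(signum, frame):
    raise Exception('oh no')

def init_process(*initargs):
    signal.signal(signal.SIGTERM, handle_term)

def murder_bot(i):
    if i == 2:
        # SIGKILL cannot be caught, so handle_term never fires
        os.system('kill -9 %d' % os.getpid())
    print(f'OK {i}')

if __name__ == '__main__':
    with multiprocessing.Pool(3, initializer=init_process) as executor:
        executor.map(murder_bot, [1,2,3])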

Unfortunately—for reasons unknown—you can’t do this:

import signal

def heh_nice_try(signum, frame):
    return '*teleports behind you*'

signal.signal(signal.SIGKILL, heh_nice_try)
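If you try it anyway, Python throws it straight back at you; on Linux the failure looks something like this (traceback abridged):

>>> signal.signal(signal.SIGKILL, heh_nice_try)
Traceback (most recent call last):
  ...
OSError: [Errno 22] Invalid argument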

What’s next

It might be nice to look into alternatives to multiprocessing.Pool that don’t have this issue, and to dig into the internals of how and why this is happening.
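For what it’s worth, concurrent.futures.ProcessPoolExecutor looks like a promising place to start: when a worker process dies abruptly, it raises BrokenProcessPool instead of hanging. Here’s a quick sketch of the same murder_bot test (I haven’t tried swapping it into the real pipeline):

import os
from concurrent.futures import ProcessPoolExecutor

def murder_bot(i):
    if i == 2:
        os.system('kill %d' % os.getpid())
    print(f'OK {i}')

if __name__ == '__main__':
    with ProcessPoolExecutor(3) as executor:
        # Consuming the results raises
        # concurrent.futures.process.BrokenProcessPool when a worker
        # is terminated, instead of blocking forever.
        list(executor.map(murder_bot, [1, 2, 3]))

That still doesn’t tell you which worker died or why, but at least the pipeline fails loudly instead of hanging until someone notices.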