# Multiprocessing and subprocess support
Sciagraph supports automatically profiling subprocesses and multiprocessing. Here’s an overview:
| Method | Supported? |
|---|---|
| `multiprocessing.get_context("spawn")` | Yes |
| `subprocess` (`Popen`, `run`, `check_*`) | Yes |
| `multiprocessing.get_context("forkserver")` | Yes (but `"spawn"` is more robust) |
| `multiprocessing.get_context("fork")` | Yes (but you should avoid it!) |
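To make the table concrete, here’s a minimal sketch (ordinary standard-library usage, not Sciagraph-specific APIs) of the two mechanisms it covers: launching a child with `subprocess.run()`, and launching one via a `"spawn"` multiprocessing context. The `echo` command assumes a Unix-like system:

```python
import multiprocessing
import subprocess

def work():
    print("hello from a multiprocessing child")

if __name__ == "__main__":
    # A plain subprocess (Popen, run, and check_* all use the same machinery):
    subprocess.run(["echo", "hello from a subprocess"], check=True)

    # A multiprocessing child using the "spawn" context:
    ctx = multiprocessing.get_context("spawn")
    p = ctx.Process(target=work)
    p.start()
    p.join()
```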
## `multiprocessing` contexts, and the Very Bad default on Linux
Python’s `multiprocessing` library supports different ways of creating subprocesses, called “contexts”. The three types are:

- `"fork"`: This is the default on Linux. It makes an exact copy of the current process’ in-memory state using the `fork()` system call… with the caveat that all threads are gone in the subprocess.
- `"forkserver"`: A helper Python subprocess is started, and new worker processes are then `fork()`ed from it. Since that helper is in a much more quiescent state (no threads!), it is less likely to break when `fork()`ed.
- `"spawn"`: This is the default on macOS and Windows. The subprocesses are completely new Python processes, just like any other subprocess you might launch from another executable.
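If you’re not sure which start method your code currently defaults to, you can check at runtime with the standard-library API:

```python
import multiprocessing

if __name__ == "__main__":
    # Reports the current start method, e.g. "fork" on Linux (before 3.14),
    # "spawn" on macOS and Windows.
    print(multiprocessing.get_start_method())
```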
The default `"fork"` context will, sometimes, make your program freeze, depending on which libraries you use, the phase of the moon, and how lucky you are that week. Some libraries don’t support it at all, notably Polars. You should avoid it if at all possible; in fact, starting in Python 3.14 the default will change. Instead, you should use `"spawn"`, or, if you must and you’re on Linux, `"forkserver"`.
## Using a better context on Linux
On macOS, you don’t need to do anything; the default is already `"spawn"`.
For Linux, you can set the default context globally:
```python
import multiprocessing

multiprocessing.set_start_method('spawn')
```
or you can get a new context for a localized usage:
```python
import multiprocessing as mp

ctx = mp.get_context('spawn')
with ctx.Pool(4) as pool:
    ...  # use the pool here
```
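As a slightly fuller sketch (the `square` function and the `pool.map` usage are just illustrative), localized usage of the `"spawn"` context might look like this:

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ctx.Pool(4) as pool:
        # map() pickles each argument, runs square() in a worker process,
        # and collects the results back in the parent.
        print(pool.map(square, range(10)))
```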
For more details see the relevant Python documentation.
## Practical differences
There are some details to keep in mind when using `"spawn"` or `"forkserver"`.
### Don’t run multiprocessing code at module level
Make sure you don’t run multiprocessing code at module level, because the subprocess will be a new Python interpreter and will therefore re-import the module!
For single-module scripts, don’t do this:
```python
import multiprocessing as mp

def printer():
    print("hello")

# WRONG
ctx = mp.get_context('spawn')
p = ctx.Process(target=printer)
p.start()
p.join()
```
Do this instead:
```python
import multiprocessing as mp

# RIGHT
if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=print, args=("hello world",))
    p.start()
    p.join()
```
For library modules, just make sure the code that runs subprocesses and the like lives inside a function, not at module level.
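For example, a hypothetical library module might look like this (the module path and function names are made up for illustration):

```python
# mylibrary/parallel.py (hypothetical)
import multiprocessing as mp

def _double(item):
    return item * 2

def process_items(items):
    """Safe to import; subprocesses are only created when this is called."""
    ctx = mp.get_context("spawn")
    with ctx.Pool(4) as pool:
        return pool.map(_double, items)
```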
### Passing data to subprocesses
Passing large arrays to subprocesses involves serializing them, which can be expensive and increase memory usage.
If you want to share data between the parent process and child process, a better idea might be to e.g. store the data in a file in the parent process, and then load the data in the subprocess. Then all you need to pass to the subprocess is the path to the file, rather than the data itself.
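For instance, here’s a hedged sketch of that pattern, assuming the data is a NumPy array (the file name and functions are illustrative):

```python
import multiprocessing as mp
import numpy as np

def child(path):
    # Load the array in the subprocess; mmap_mode avoids reading it all at once.
    data = np.load(path, mmap_mode="r")
    print(data.sum())

if __name__ == "__main__":
    data = np.arange(10_000_000)
    np.save("data.npy", data)  # store the data in a file in the parent

    ctx = mp.get_context("spawn")
    # Pass only the path, not the array itself, so nothing large gets pickled.
    p = ctx.Process(target=child, args=("data.npy",))
    p.start()
    p.join()
```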
For more details, see:
- Python’s multiprocessing performance problem (+ some solutions)
- Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5
- The mmap() copy-on-write trick: reducing memory usage of array copies
## More details on why `"fork"` is bad

More in-depth explanations of why `fork()` is bad:
- Why your multiprocessing Pool is stuck (it’s full of sharks!)
- A bug I filed against CPython to make `"spawn"` the default on Linux, with lots of fun references, like how sometimes PyTorch breaks with `"fork"`.