# Multiprocessing and subprocess support
Sciagraph supports automatically profiling subprocesses and multiprocessing. Here’s an overview:
| Method | Supported? |
|---|---|
| `multiprocessing.get_context("spawn")` | Yes |
| `subprocess` (`Popen`, `run`, `check_*`) | Yes |
| `multiprocessing.get_context("forkserver")` | Yes (but `"spawn"` is more robust) |
| `multiprocessing.get_context("fork")` | Yes (but you should avoid it!) |
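To make the table concrete, here’s a minimal sketch (ordinary standard-library usage, not Sciagraph-specific APIs) of the two mechanisms it covers: launching a child with `subprocess.run()`, and launching one via a `"spawn"` multiprocessing context. The `echo` command assumes a Unix-like system:

```python
import multiprocessing
import subprocess

def work():
    print("hello from a multiprocessing child")

if __name__ == "__main__":
    # A plain subprocess (Popen, run, and check_* all use the same machinery):
    subprocess.run(["echo", "hello from a subprocess"], check=True)

    # A multiprocessing child using the "spawn" context:
    ctx = multiprocessing.get_context("spawn")
    p = ctx.Process(target=work)
    p.start()
    p.join()
```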
## `multiprocessing` contexts, and the Very Bad default on Linux
Python’s `multiprocessing` library supports different ways of creating subprocesses, called “contexts”. The three types are:

- `"fork"`: This is the default on Linux. It makes an exact copy of the current process’ in-memory state using the `fork()` system call… with the caveat that all threads are gone in the subprocess.
- `"forkserver"`: A helper Python subprocess is started, and new worker processes are then `fork()`ed from it. Since that helper is in a much more quiescent state (no threads!), it is less likely to break when `fork()`ed.
- `"spawn"`: This is the default on macOS and Windows. The subprocesses are completely new Python processes, just like any other subprocess you might launch from another executable.
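If you’re not sure which start method your code currently defaults to, you can check at runtime with the standard-library API:

```python
import multiprocessing

if __name__ == "__main__":
    # Reports the current start method, e.g. "fork" on Linux (before 3.14),
    # "spawn" on macOS and Windows.
    print(multiprocessing.get_start_method())
```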
The default `"fork"` context will, sometimes, make your program freeze, depending on which libraries you use, the phase of the moon, and how lucky you are that week. Some libraries don’t support it at all, notably Polars. You should avoid it if at all possible; in fact, starting in Python 3.14 the default will change. Instead, you should use `"spawn"`, or, if you must and you’re on Linux, `"forkserver"`.
## Using a better context on Linux
On macOS, you don’t need to do anything; the default is already `"spawn"`.
For Linux, you can set the default context globally:
```python
import multiprocessing

multiprocessing.set_start_method('spawn')
```
or you can get a new context for a localized usage:
```python
import multiprocessing as mp

ctx = mp.get_context('spawn')
with ctx.Pool(4) as pool:
    ...  # use the pool here
```
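As a slightly fuller sketch (the `square` function and the `pool.map` usage are just illustrative), localized usage of the `"spawn"` context might look like this:

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ctx.Pool(4) as pool:
        # map() pickles each argument, runs square() in a worker process,
        # and collects the results back in the parent.
        print(pool.map(square, range(10)))
```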
For more details see the relevant Python documentation.
## Practical differences
There are some details to keep in mind when using `"spawn"` or `"forkserver"`.
### Don’t run multiprocessing code at module level
Make sure you don’t run multiprocessing code at module level, because the subprocess will be a new Python interpreter and will therefore re-import the module!
For single-module scripts, don’t do this:
```python
import multiprocessing as mp

def printer():
    print("hello")

# WRONG
ctx = mp.get_context('spawn')
p = ctx.Process(target=printer)
p.start()
p.join()
```
Do this instead:
```python
import multiprocessing as mp

# RIGHT
if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=print, args=("hello world",))
    p.start()
    p.join()
```
For library modules, just make sure the code that runs subprocesses and the like lives inside a function, not at module level.
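For example, a hypothetical library module might look like this (the module path and function names are made up for illustration):

```python
# mylibrary/parallel.py (hypothetical)
import multiprocessing as mp

def _double(item):
    return item * 2

def process_items(items):
    """Safe to import; subprocesses are only created when this is called."""
    ctx = mp.get_context("spawn")
    with ctx.Pool(4) as pool:
        return pool.map(_double, items)
```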
### Passing data to subprocesses
Passing large arrays to subprocesses involves serializing them, which can be expensive and increase memory usage.
If you want to share data between the parent process and child process, a better idea might be to e.g. store the data in a file in the parent process, and then load the data in the subprocess. Then all you need to pass to the subprocess is the path to the file, rather than the data itself.
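For instance, here’s a hedged sketch of that pattern, assuming the data is a NumPy array (the file name and functions are illustrative):

```python
import multiprocessing as mp
import numpy as np

def child(path):
    # Load the array in the subprocess; mmap_mode avoids reading it all at once.
    data = np.load(path, mmap_mode="r")
    print(data.sum())

if __name__ == "__main__":
    data = np.arange(10_000_000)
    np.save("data.npy", data)  # store the data in a file in the parent

    ctx = mp.get_context("spawn")
    # Pass only the path, not the array itself, so nothing large gets pickled.
    p = ctx.Process(target=child, args=("data.npy",))
    p.start()
    p.join()
```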
For more details, see:
- Python’s multiprocessing performance problem (+ some solutions)
- Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5
- The mmap() copy-on-write trick: reducing memory usage of array copies
## More details on why `"fork"` is bad

More in-depth explanations of why `fork()` is bad:
- Why your multiprocessing Pool is stuck (it’s full of sharks!)
- A bug I filed against CPython to make `"spawn"` the default on Linux, with lots of fun references, like how sometimes PyTorch breaks with `"fork"`.