I gotta admit that my smell test tells me that this is a step backwards; at least naively (I haven't looked through the code thoroughly yet), this just kind of feels like we're going back to pre-protected-memory operating systems like AmigaOS; there are reasons that we have the process boundary and the overhead associated with it.
If there are zero copies being shared across threadprocs, what's to stop Threadproc B from screwing up Threadproc A's memory? And if the answer to this is "nothing", then what does this buy over a vanilla pthread?
I'm not trying to come off as negative, I'm actually curious about the answer to this.
After building this, I don't think this is necessarily a desirable trade-off, and decades of OS development certainly suggest process-isolated memory is desirable. I see this more as an experiment to see how bending those boundaries works in modern environments, rather than a practical way forward.
Still, I actually do think there could be an advantage to this if you know you can trust the executables and they don't share any memory: if you're only grabbing memory with malloc and the like, there's an argument to be made that the extra process overhead isn't needed.
If you're ok with threads keeping their own memory and not sharing, then pthreads already do that competently without any additional library. The problem with threads is that there's a shared address space, so thread B can screw up thread A's memory; juggling concurrency is hard, and you need mutexes. Processes give isolation, but at the cost of some overhead, and IPC generally requires copying.
I'm just not sure what this actually provides over vanilla pthreads. If I'm in charge of ensuring that the threadprocs don't screw with each other then I'm not sure this buys me anything.
Without locking, if multiple of these things read or write to the same place, the CPU will not appreciate it... you might read or write partial values or garbage, etc.?
Still a fun little project, but I don't see any use case personally. (Perhaps the author has a good one; it's not impossible.)
I fail to see the point - if you control the code and need performance so much that an occasional copy bites, you can just as well link it all into a single address space without these hoop-jumps. It won't function as separate processes anyway if it's modified to rely on passing pointers around.
And if you don't, good luck chasing memory corruption issues.
Besides, what's wrong with shared memory?
I generally think that it's bad to share memory for anything with concurrency, simply because it can make it very hard to reason about the code. Mutexes are hard to get right for anything that's not completely trivial, and I find that it's almost always better to figure out a way to do work without directly sharing memory if possible (or do some kind of borrow/ownership thing like Rust to make it unambiguous who actually owns it). Mutexes can also make it difficult to performance test in my experience, since there can be weird choke points that don't show up in local testing and only ever show up in production.
Part of the reason I love Erlang so much is specifically that it really doesn't easily let you share memory. Everything is segmented and everything needs to be message-passed, so you aren't mucking around with mutexes and it's never ambiguous where memory lives or who owns it. Erlang isn't the fastest language, but since I'm not really dealing with locks, the performance is generally much more deterministic for me.
> The Opal project is exploring a new operating system structure, tuned to the needs of complex applications, such as CAD/CAM, where a number of cooperating programs manipulate a large shared persistent database of objects. In Opal, all code and data exists within a single, huge, shared address space. The single address space enhances sharing and cooperation, because addresses have a unique (for all time) interpretation. Thus, pointer-based data structures can be directly communicated and shared between programs at any time, and can be stored directly on secondary storage without the need for translation. This structure is simplified by the availability of a large address space, such as those provided by the DEC Alpha, MIPS, HP/PA-RISC, IBM RS6000, and future Intel processors.
> Protection in Opal is independent of the single address space; each Opal thread executes within a protection domain that defines which virtual pages it has the right to access. The rights to access a page can be easily transmitted from one process to another. The result is a much more flexible protection structure, permitting different (and dynamically changing) protection options depending on the trust relationship between cooperating parties. We believe that this organization can improve both the structure and performance of complex, cooperating applications.
> An Opal prototype has been built for the DEC Alpha platform on top of the Mach operating system.
So we already have threads that do exactly what you're trying to do? Isn't it somewhat easier and less risky to just compile several programs into one binary? If you have no control over the programs you're trying to "fuse" (no source), then you probably don't want to fuse them, because it's very unsafe.
Maybe I don't understand something. I think it can work if you want processes with different lib versions or even different languages, but it sounds somewhat risky to pass data around just like that (possible data corruption).
The interesting thing is that it's loading existing binaries and executing them in a different way than usual. I think it's pretty clever (even if unsafe, as you mentioned).
This is a splendid example of a "hack" and I wish HN had more of these!
This is exactly right, unrelated binaries can coexist, or different versions of the same binary, etc.
> it sounds somewhat risky to pass data just like that
This is also right! I started building an application framework that could leverage this and provide some protections on memory use: https://github.com/jer-irl/tproc-actors, but the model is inherently tricky, especially with elaborate data structures, where ABI compatibility can be so brittle.
Why not dlopen with something that calls plugin_main() (etc.) in its own thread?
For this project, one of my goals was to impose the fewest dependencies possible on the loaded executables, and give the illusion that they're running in a fully independent process, with their own stdin/out/err and global runtime resources.
"./my_prog abc" -> "launcher s.sock ./my_prog abc"
There's a rich design space if you impose "compile as a .so with a well-known entry point," and that's certainly what I'd explore for production apps that need this sort of model. Python has a wrapper in the standard library [2]; not sure about other languages.
1. https://www.man7.org/linux/man-pages/man3/shm_open.3.html
2. https://docs.python.org/3/library/multiprocessing.shared_mem...
It's very cool, but it would only be useful in some marginal cases - specifically, if you don't want to modify the programs significantly and the reliability reduction is worth either the limited performance upside of avoiding mm switches or the ability to do somewhat easier shared memory.
Generally this problem would be better solved in either of these ways:

1. Recompile the modules as shared libraries (or statically link them together) and run them with a custom host program. This has less memory waste and faster startup.

2. Have processes that share memory via explicit shared memory mechanisms. This is more reliable.
Explicit shared memory regions are definitely the standard for this sort of problem if you want isolated address spaces. One area I want to explore further is allocators that are aware of explicit shared memory regions, and perhaps ensuring that the regions get mmap'd to the same virtual address in all participants.
Here is the abstract: This paper introduces Inter-Process Remote Execution (IPRE), whose primary function is enabling gated persistence for per-request isolation architectures with microsecond-latency access to persistent services. IPRE eliminates scheduler dependency for descheduled processes by allowing a virtual machine to directly and safely call and execute functions in a remote virtual machine's address space. Unlike prior approaches requiring hardware modifications (dIPC) or kernel changes (XPC), IPRE works with standard virtualization primitives, making it immediately deployable on commodity systems. We present two implementations: libriscv (12-14ns overhead, emulated execution) and TinyKVM (2-4us overhead, native execution). Both eliminate data serialization through address-space merging. Under realistic scheduler contention from schbench workloads (50-100% CPU utilization), IPRE maintains stable tail latency (p99<5us), while a state-of-the-art lock-free IPC framework shows 1,463× p99 degradation (4.1us to 6ms) when all CPU cores are saturated. IPRE thus enables architectural patterns (per-request isolation, fine-grained microservices) that incur millisecond-scale tail latency in busy multi-tenant systems using traditional IPC.
Bottom line: If you're doing synchronous calls to a remote party, IPRE wouldn't require any scheduler mediation. The same applies to your repo. Passing allocator-less structures to the remote is probably a landmine waiting to happen. If you structure both parties to use custom allocators, at least for the remote calls, you can track and even steal allocations (using a shared memory area). With IPRE there is extra risk of stale pointers, because the remote part is removed from the caller's memory after it completes. The paper will explain all the details, but for example, since we control the VMM, we can close the remote session if anything bad happens. (This paper is not out yet, but it should be very soon.)
The best part about this kind of architecture, which you immediately mention, is the ability to completely avoid serialization. Passing a complex struct by reference and being able to use the data as-is is a big benefit. It breaks down when you try to do this with something like Deno, unfortunately. But you could do Deno <-> C++, for example.
For libriscv the implementation is simpler: Just loan remote-looking pages temporarily so that read/write/execute works, and then let exception handling deal with abnormal disconnection. With libriscv it's also possible for the host to take over the guest's global heap allocator, which makes it possible to free something that was remotely allocated. You can divide the address space into the number of possible callers, plus one or more remotes; then if you give the remote a std::string larger than SSO, the address will reveal the source, and the source tracks its own allocations, so we know if something didn't go right. Note that this is only a personal interest for me: even though (for example) libriscv is used in large codebases, the remote RPC feature is not used at all, and hasn't been attempted. It's a Cool Idea that kinda works out, but not ready for something high stakes.
This gives me an impression that the paper has already been published and is available publicly for us to read.
There is a blog-style writeup here: https://fwsgonzo.medium.com/an-update-on-tinykvm-7a38518e57e...
Not as rigorous as the paper, but the gist is there.
This is going back to the DOS age.
https://docs.python.org/3/library/concurrent.interpreters.ht...
If you want a higher-level interface, there is InterpreterPoolExecutor:
https://docs.python.org/3/library/concurrent.futures.html#co...
There are already means of passing around bulk data with zero copy characteristics in python, but there's a lot of bureaucracy around it. A true solution must work with the GIL (or remove it altogether), no?