I gotta admit that my smell test tells me that this is a step backwards; at least naively (I haven't looked through the code thoroughly yet), this just kind of feels like we're going back to pre-protected-memory operating systems like AmigaOS; there are reasons that we have the process boundary and the overhead associated with it.
If there are zero copies being shared across threadprocs, what's to stop Threadproc B from screwing up Threadproc A's memory? And if the answer to this is "nothing", then what does this buy over a vanilla pthread?
I'm not trying to come off as negative, I'm actually curious about the answer to this.
After building this, I don't think this is necessarily a desirable trade-off, and decades of OS development certainly suggest process-isolated memory is desirable. I see this more as an experiment to see how bending those boundaries works in modern environments, rather than a practical way forward.
Still, I actually do think there could be an advantage to this if you know you can trust the executables and they don't share any memory: if you're only grabbing memory with malloc and the like, there's an argument to be made that the extra process overhead isn't needed.
If you're ok with threads keeping their own memory and not sharing, then pthreads already do that competently without any additional library. The problem with threads is that there's a shared address space, so thread B can screw up thread A's memory; juggling concurrency is hard, and you need mutexes. Processes give isolation, but at the cost of some overhead, and IPC generally requires copying.
I'm just not sure what this actually provides over vanilla pthreads. If I'm in charge of ensuring that the threadprocs don't screw with each other then I'm not sure this buys me anything.
Without locking, if multiple of these things read or write to the same place, the CPU will not appreciate it... you might read or write partial values or garbage, etc.?
Still a fun little project, but I don't see any use case personally. (Perhaps the author has a good one; it's not impossible.)
I fail to see the point - if you control the code and need performance so much that an occasional copy bites, you can just as well link it all into a single address space without these hoop-jumps. It won't function as separate processes anyway if it's modified to rely on passing pointers around.
And if you don't, good luck chasing memory corruption issues.
Besides, what's wrong with shared memory?
I generally think that it's bad to share memory for anything with concurrency, simply because it can make it very hard to reason about the code. Mutexes are hard to get right for anything that's not completely trivial, and I find that it's almost always better to figure out a way to do work without directly sharing memory if possible (or do some kind of borrow/ownership thing like Rust to make it unambiguous who actually owns it). Mutexes can also make it difficult to performance test in my experience, since there can be weird choke points that don't show up in local testing and only ever show up in production.
Part of the reason I love Erlang so much is specifically that it really doesn't easily let you share memory. Everything is segmented and everything needs to be message-passed, so you aren't mucking around with mutexes and it's never ambiguous where memory lives or who owns it. Erlang isn't the fastest language, but since I'm not really dealing with locks, the performance is generally much more deterministic for me.
> The Opal project is exploring a new operating system structure, tuned to the needs of complex applications, such as CAD/CAM, where a number of cooperating programs manipulate a large shared persistent database of objects. In Opal, all code and data exists within a single, huge, shared address space. The single address space enhances sharing and cooperation, because addresses have a unique (for all time) interpretation. Thus, pointer-based data structures can be directly communicated and shared between programs at any time, and can be stored directly on secondary storage without the need for translation. This structure is simplified by the availability of a large address space, such as those provided by the DEC Alpha, MIPS, HP/PA-RISC, IBM RS6000, and future Intel processors.
> Protection in Opal is independent of the single address space; each Opal thread executes within a protection domain that defines which virtual pages it has the right to access. The rights to access a page can be easily transmitted from one process to another. The result is a much more flexible protection structure, permitting different (and dynamically changing) protection options depending on the trust relationship between cooperating parties. We believe that this organization can improve both the structure and performance of complex, cooperating applications.
> An Opal prototype has been built for the DEC Alpha platform on top of the Mach operating system.
So we already have threads that do exactly what you're trying to do? Isn't it somewhat easier and less risky to just compile several programs into one binary? If you have no control over the programs you're trying to "fuse" (no source), then you probably don't want to fuse them, because it's very unsafe.
Maybe I don't understand something. I think it can work if you want processes with different lib versions or even different languages, but it sounds somewhat risky to pass data around just like that (possible data corruption).
The interesting thing is that it's loading existing binaries and executing them in a different way than usual. I think it's pretty clever (even if unsafe, as you mentioned).
This is a splendid example of a "hack" and I wish HN had more of these!
This is exactly right, unrelated binaries can coexist, or different versions of the same binary, etc.
> it sounds somewhat risky to pass data just like that
This is also right! I started building an application framework that could leverage this and provide some protections on memory use: https://github.com/jer-irl/tproc-actors, but the model is inherently tricky, especially with elaborate data structures, where ABI compatibility can be so brittle.
Why not dlopen with something that calls plugin_main() (etc.) in its own thread?
For this project, one of my goals was to impose the fewest dependencies possible on the loaded executables, and give the illusion that they're running in a fully independent process, with their own stdin/out/err and global runtime resources.
"./my_prog abc" -> "launcher s.sock ./my_prog abc"
There's a rich design space if you impose "compile as a .so with a well-known entry point," and that's certainly what I'd explore for production apps that need this sort of model. Python has a wrapper in the standard library [2]; not sure about other languages.
1. https://www.man7.org/linux/man-pages/man3/shm_open.3.html
2. https://docs.python.org/3/library/multiprocessing.shared_mem...
It's very cool, but it would only be useful in some marginal cases - specifically, if you don't want to modify the programs significantly and the reliability reduction is worth either the limited performance upside of avoiding mm switches or the ability to do somewhat easier shared memory.
Generally this problem would be better solved in either of these ways:

1. Recompile the modules as shared libraries (or statically link them together) and run them with a custom host program. This has less memory waste and faster startup.

2. Have processes that share memory via explicit shared memory mechanisms. This is more reliable.
Explicit shared memory regions are definitely the standard for this sort of problem if you want isolated address spaces. One area I want to explore further is allocators that are aware of explicit shared memory regions, and perhaps ensuring that the regions get mmap'd to the same virtual address in all participants.
Here is the abstract: This paper introduces Inter-Process Remote Execution (IPRE), whose primary function is enabling gated persistence for per-request isolation architectures with microsecond-latency access to persistent services. IPRE eliminates scheduler dependency for descheduled processes by allowing a virtual machine to directly and safely call and execute functions in a remote virtual machine's address space. Unlike prior approaches requiring hardware modifications (dIPC) or kernel changes (XPC), IPRE works with standard virtualization primitives, making it immediately deployable on commodity systems. We present two implementations: libriscv (12-14ns overhead, emulated execution) and TinyKVM (2-4us overhead, native execution). Both eliminate data serialization through address-space merging. Under realistic scheduler contention from schbench workloads (50-100% CPU utilization), IPRE maintains stable tail latency (p99<5us), while a state-of-the-art lock-free IPC framework shows 1,463× p99 degradation (4.1us to 6ms) when all CPU cores are saturated. IPRE thus enables architectural patterns (per-request isolation, fine-grained microservices) that incur millisecond-scale tail latency in busy multi-tenant systems using traditional IPC.
Bottom line: If you're doing synchronous calls to a remote party, IPRE wouldn't require any scheduler mediation. The same applies to your repo. Passing allocator-less structures to the remote is probably a landmine waiting to happen. If you structure both parties to use custom allocators, at least for the remote calls, you can track and even steal allocations (using a shared memory area). With IPRE there is extra risk of stale pointers, because the remote part is removed from the caller's memory after it completes. The paper will explain all the details, but for example, since we control the VMM, we can close the remote session if anything bad happens. (This paper is not out yet, but it should be very soon.)
The best part about this kind of architecture, which you immediately mention, is the ability to completely avoid serialization. Passing a complex struct by reference and being able to use the data as-is is a big benefit. It breaks down when you try to do this with something like Deno, unfortunately. But you could do Deno <-> C++, for example.
For libriscv the implementation is simpler: Just loan remote-looking pages temporarily so that read/write/execute works, and then let exception handling deal with abnormal disconnection. With libriscv it's also possible for the host to take over the guest's global heap allocator, which makes it possible to free something that was remotely allocated. You can divide the address space into the number of possible callers, plus one or more remotes; then if you give the remote a std::string larger than SSO, the address will reveal the source, and the source tracks its own allocations, so we know if something didn't go right. Note that this is only a personal interest for me: even though (for example) libriscv is used in large codebases, the remote RPC feature is not used at all, and hasn't been attempted. It's a Cool Idea that kinda works out, but not ready for something high stakes.
This gives me an impression that the paper has already been published and is available publicly for us to read.
There is a blog-style writeup here: https://fwsgonzo.medium.com/an-update-on-tinykvm-7a38518e57e...
Not as rigorous as the paper, but the gist is there.
This is going back to the DOS age.
https://docs.python.org/3/library/concurrent.interpreters.ht...
If you want a higher-level interface, there is InterpreterPoolExecutor:
https://docs.python.org/3/library/concurrent.futures.html#co...
There are already means of passing around bulk data with zero copy characteristics in python, but there's a lot of bureaucracy around it. A true solution must work with the GIL (or remove it altogether), no?