TL;DR: Most C++ and Rust thread-pool libraries leave significant performance on the table, often running 10× slower than OpenMP on classic fork-join workloads and micro-benchmarks. So I’ve drafted a minimal ~300-line library called Fork Union that lands within 20% of OpenMP. It uses no advanced NUMA tricks and has no dependencies beyond the C++ and Rust standard libraries.
OpenMP has been the industry workhorse for coarse-grained parallelism in C and C++ for decades. I lean on it heavily in projects like USearch, yet I avoid it in larger systems because: