//TODO: For explicitly threaded version: Make a lazy thread spawner, spawning up to 8 threads which will each work on a byte and atomically write back to a central array of `uint64_t [num_of_bytes / num_of_threads]`.
//TODO: But how to sync the fwrites of this buffer? An atomic counter of which threads have completed their operations, and a condvar that allows the controlling thread to check if the counter is full before `fwrite_all()`ing the array, resetting the counter, and carrying on
//TODO: In explicitly threaded version, we can use an atomic (atomic.h) 64-bit integer `store()` to output those 8 bytes in `out`, which will be re-cast as `uint64_t out[restrict 1]`