It’s not just less memory though - it might also introduce spurious data dependencies, e.g. to store a bit you now need to also read the old value of the byte that it’s in.
Could definitely be worse for latency in particular cases, but if we imagine a write heavy workload it still might win. Writing a byte/word basically has to do the same thing: read, modify write of cache lines, it just doesn’t confuse the dependency tracking quite as much. So rather than stalling on a read, I think that would end up stalling on store buffers. Writing to bits usually means less memory, and thus less memory to read in that read-modify-write part, so it might still be faster.
It might also introduce spurious data dependencies
Those need to be in the in smallest cache or a register anyway. If they are in registers, a modern, instruction reordering CPU will deal with that fine.
to store a bit you now need to also read the old value of the byte that it’s in.
Many architectures read the cache line on write-miss.
The only cases I can see, where byte sized bools seems better, are either using so few that all fit in one chache line anyways (in which case the performance will be great either way) or if you are repeatedly accessing a bitvector from multiple threads, in which case you should make sure that’s actually what you want to be doing.
A lot of times using less memory is actually better for performance because the main bottleneck is memory bandwidth or latency.
It’s not just less memory though - it might also introduce spurious data dependencies, e.g. to store a bit you now need to also read the old value of the byte that it’s in.
Could definitely be worse for latency in particular cases, but if we imagine a write heavy workload it still might win. Writing a byte/word basically has to do the same thing: read, modify write of cache lines, it just doesn’t confuse the dependency tracking quite as much. So rather than stalling on a read, I think that would end up stalling on store buffers. Writing to bits usually means less memory, and thus less memory to read in that read-modify-write part, so it might still be faster.
Those need to be in the in smallest cache or a register anyway. If they are in registers, a modern, instruction reordering CPU will deal with that fine.
Many architectures read the cache line on write-miss.
The only cases I can see, where byte sized bools seems better, are either using so few that all fit in one chache line anyways (in which case the performance will be great either way) or if you are repeatedly accessing a bitvector from multiple threads, in which case you should make sure that’s actually what you want to be doing.
Yep, and anding with a bit ask is incredibly fast to process, so it’s not a big issue for performance.