|The following sections focus on pointer-based data structures that tend to be irregular. We are not considering regular data structures such as arrays, matrices, etc where generally the size and access patterns are predictable.|
In this paper, the authors propose two techniques for laying out data structures: clustering and coloring. Clustering tries to put data structures that are usually accessed together closer together in memory. For some data structures such as trees, there is a natural ordering of how the data is accessed so it makes sense to organize them in that manner i.e. by tree height for depth first traversal. Coloring tries to move data structures that are usually accessed together to different memory addresses to prevent conflict misses in an associative cache i.e. so that the data doesn't lie in the same cache set and evict one another. The authors implemented two tools to semi-automatically perform these transformations ccmorph and ccalloc. ccmorph takes a tree data structure and transforms it so that it supports both clustering and coloring. ccmorph results in 28 - 138% speedups over prefetching. ccalloc replaces the normal malloc and tries to allocate new memory that are allocated contemporaneously so that they tend to lie closer together. ccalloc results in 20 - 194% speedups over prefetching.
This work was done before multicores were prevalent so some of these layout decisions could incur the penalty of false sharing if the data is being accessed by more than one processor.
In this paper, the authors take two common techniques for cache-conscious layout – structure splitting and field reordering – and create tools to automatically perform them. These tools use profiling and static analyses.
Structure splitting takes a large object and tries to split it so that the commonly accessed portion fits into one cache line. The remaining portion is relegated to a new object which the old object references through a pointer. Structure splitting is input-dependent; different input sets will use different fields in the object more regularly. Therefore the authors create some heuristics on how to split. Structure splitting reduced cache miss rates 10-27% and improve performance 6-18% in the Java programs that they tested on.
Field reordering tries to reorder the fields based on temporal accesses - fields that are frequently accessed together are moved into the same cache line. Like structure splitting, field reordering is also input-dependent and the authors relied on profiling data to come up with appropriate heuristics to create a field affinity graph of the program that tracks access frequencies and clusters them together. They created a tool called bbcache that recommends C structure field reorderings (it does not perform the transformation since it is harder to verify if the transformations are safe in C). They reported a 2 - 3% speedup in Microsoft SQL Server 7, a highly-tuned application.
Useful tools used in this paper: BIT Instrumenting Java Bytecode and Vortex Optimizing Compiler because if Java bytecode isn't compiled to native code, the programmer has very little guarantee over the layout of data in memory.
Also see the extensive list of work on memory management and caches on this page
The authors modified the generational copying garbage collector for a prototypical language called Cecil. They implemented a low overhead monitor (3-5% overhead) to monitor accesses to objects. They did not monitor accesses to fields in an object but treated the object as a whole (based on their study of programs that show that most objects are small i.e. < 32 bytes) and that most objects are accessed in their entirety instead of just a single field. They implemented a version of the Cheney copying algorithm so that it records the accesses and builds an object affinity graph. During the copying phase of the garbage collector, the algorithm copies objects that have high affinity and places them close together in memory to improve cache locality. On the programs that they tested, they found that their technique reduces cache miss rates by 21 - 42% and improves performance by 14 - 37%.
- Transformations for cache-conscious layout is possible and has been attempted before. Depending on the complexity of the language and the types of cache layout transformations, heuristics have to be used. And semi-automatic approaches (not full ones) have to be done.
- Using the garbage collector to move data structures to be more cache conscious has been attempted before (< 1998) and is still being done. Some of the earlier work operated on larger granularity e.g. page size instead of cache line size.
- No one seems to have done it using .NET attributes yet.