Current tools (Visual Studio 2010 Profiling Tools, VTune Performance Analyzer, Intel Parallel Studio, OS X Shark) focus on giving the following information:
- Hotspot analysis (Sampling data)
- Detailed call graph analysis (Instrumentation)
- Thread analysis (thread idle time, resource contention, over/under-threading, etc.)
- Hardware counters, e.g. cache misses (VTune & Shark)
Except for hardware counters, these analyses provide only high-level profiling information. While they serve as a good starting point, they can only guide you so far. For instance, it is now very common to conflate scalability with performance. Just because your algorithm scales to n processors doesn't mean that you are achieving full performance.
For a detailed example, watch this video. In it, the programmer successfully scales the algorithm to 4 processors and improves performance by up to 4 times. However, is he done? No, because by scaling he has now run into issues such as false sharing and more cache misses. By optimizing these areas, he brought the total speedup to 33x!
For another example, read my blog entry about matrix multiplication and how naive scaling does not address cache misses, which dominate the cost!
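To make the cost of cache misses concrete, here is a minimal sketch of a toy direct-mapped cache that counts misses for two traversal orders of the same matrix. It is not the example from the blog entry; the function name `count_misses` and the cache geometry (64 lines of 8 elements) are assumptions chosen only for illustration, and a real cache hierarchy behaves in more complicated ways.

```python
def count_misses(addresses, num_lines=64, line_size=8):
    """Count misses in a toy direct-mapped cache.

    addresses are element indices; each cache line holds line_size
    consecutive elements. The geometry is hypothetical.
    """
    tags = [None] * num_lines  # which memory block each cache line holds
    misses = 0
    for addr in addresses:
        block = addr // line_size
        slot = block % num_lines
        if tags[slot] != block:
            tags[slot] = block  # evict whatever was there and load the block
            misses += 1
    return misses


n = 64  # n x n matrix stored in row-major order

# Walk the matrix row by row (consecutive addresses).
row_order = [i * n + j for i in range(n) for j in range(n)]
# Walk the same matrix column by column (stride of n elements).
col_order = [i * n + j for j in range(n) for i in range(n)]

row_misses = count_misses(row_order)
col_misses = count_misses(col_order)
print(row_misses, col_misses)  # prints "512 4096"
```

Both loops touch exactly the same 4096 elements, yet in this model the column-wise walk misses on every access. Adding processors to either loop would not change that ratio, which is the sense in which scaling alone does not address the problem.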
Cache misses, false sharing, etc. are what I call missed opportunities for performance tuning. Without detailed profiling information, it is extremely hard to figure out what to optimize. And even if we are able to collect all of this hardware information, the programmer might not know what to do with it; we overwhelm the programmer with so much information!
How do we profile for the information that we actually need to improve performance? I suggest that we use a scriptable profiler: one that allows the programmer to easily create their own performance metrics. This idea is not new; scriptable debuggers (PyDBG, Immunity Debugger) let their users define the special breakpoints and conditions that they are interested in for reverse engineering. If you think about it, performance profiling and reverse engineering/debugging have a lot in common: in both, we are trying to diagnose a problem by collecting clues.
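As a rough illustration of what "scriptable" could mean, here is a minimal sketch built on Python's standard `sys.setprofile` hook. The `ScriptableProfiler` class and its `on`/`run` methods are hypothetical names invented for this sketch, not an existing tool; a real scriptable profiler would expose hardware events rather than just function calls.

```python
import sys
from collections import Counter


class ScriptableProfiler:
    """Toy profiler (hypothetical): user scripts are callbacks per event kind."""

    def __init__(self):
        self.scripts = []  # list of (event_kind, callback) pairs

    def on(self, event_kind, callback):
        """Register a user script for an event kind such as 'call' or 'return'."""
        self.scripts.append((event_kind, callback))

    def _dispatch(self, frame, event, arg):
        # sys.setprofile reports 'call', 'return', 'c_call', ... events.
        for kind, callback in self.scripts:
            if kind == event:
                callback(frame.f_code.co_name)

    def run(self, func, *args):
        """Run func under the profiler and return its result."""
        sys.setprofile(self._dispatch)
        try:
            return func(*args)
        finally:
            sys.setprofile(None)


def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


# The user-defined metric: how often is each Python function called?
calls = Counter()
profiler = ScriptableProfiler()
profiler.on("call", lambda name: calls.update([name]))
result = profiler.run(fib, 5)
print(result, calls["fib"])  # prints "5 15"
```

The point of the design is that the metric lives in the user's script (here a one-line lambda), while the profiler only delivers events.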
By allowing programmers to create their own scripts (and also by providing a few known scripts for the programmer) we should be able to help the programmer target those missed opportunities. You can think of this as FindBugs for performance tuning!
Fortunately, we don't have to create such a tool from scratch. DTrace (available for Solaris, FreeBSD and OS X) is a tool that we can leverage to create such scripts.
A tool such as this would allow us to address the second part of our research question ...
Scriptable profilers will allow us to easily form hypotheses about performance. So far, I have identified the following categories of hypotheses:
- Ratios - Measure the ratio of one event to another, for instance cache out vs. cache in. A high ratio of cache flushing indicates that we are bringing in more data than we need. Transforming our data structures might be a possible fix.
- Threshold - After a certain number of events, raise a flag that we might be doing something wrong. For instance, a program that makes too many syscalls.
- Patterns - Patterns are very complex hypotheses. They essentially mean: look for a particular behavior within a particular time-frame. For instance, if you have resource contention between threads, then you will see ping-pong-like behavior within a small time-frame. The notion of a time-frame is important because even if a pattern manifests itself in an application, it is not going to be a problem as long as it doesn't happen frequently enough.
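The three categories can be sketched as small checks over a recorded event stream. This is only an illustration under stated assumptions: the event tuple layout `(timestamp_ms, thread_id, event_name)`, the event names, and the functions `ratio`, `exceeds_threshold` and `has_ping_pong` are all hypothetical and do not come from any existing profiler.

```python
from collections import Counter

# Hypothetical event stream: (timestamp_ms, thread_id, event_name).
events = [
    (0, 1, "cache_in"), (1, 1, "cache_in"), (2, 1, "cache_out"),
    (3, 1, "syscall"),
    (4, 1, "lock_acquire"), (5, 2, "lock_acquire"),
    (6, 1, "lock_acquire"), (7, 2, "lock_acquire"),
    (8, 2, "syscall"),
    (9, 2, "cache_in"), (10, 2, "cache_in"),
    (11, 2, "cache_out"), (12, 2, "cache_out"),
]


def ratio(events, numerator, denominator):
    """Ratio hypothesis: occurrences of one event relative to another."""
    counts = Counter(name for _, _, name in events)
    if counts[denominator] == 0:
        return None  # ratio is undefined without any denominator events
    return counts[numerator] / counts[denominator]


def exceeds_threshold(events, name, limit):
    """Threshold hypothesis: did the event occur more than limit times?"""
    return sum(1 for _, _, n in events if n == name) > limit


def has_ping_pong(events, name, window_ms, min_switches):
    """Pattern hypothesis: does the event alternate between threads at
    least min_switches times within any window of window_ms?"""
    relevant = sorted((t, tid) for t, tid, n in events if n == name)
    # Timestamps at which the event moved from one thread to another.
    switches = [t2 for (_, a), (t2, b) in zip(relevant, relevant[1:]) if a != b]
    start = 0
    for end in range(len(switches)):
        # Shrink the window until it spans at most window_ms.
        while switches[end] - switches[start] > window_ms:
            start += 1
        if end - start + 1 >= min_switches:
            return True
    return False


print(ratio(events, "cache_out", "cache_in"))        # prints 0.75
print(exceeds_threshold(events, "syscall", 1))       # prints True
print(has_ping_pong(events, "lock_acquire", 5, 3))   # prints True
```

Note that the pattern check takes the time-frame as an explicit parameter: the same three thread switches are flagged with a 5 ms window but not with a 1 ms window.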
With such concrete data, it should be easier to suggest refactorings that can be implemented. It will still be hard to come up with such refactorings, but at least we will have some profile information to guide us.
- Check out Intel's Performance Tuning Utility (Windows) and see what kinds of suggestions it gives
- Check out Acumem Performance Tools (demonstrated in the video) to see what it can do
- Check out DTrace and see how much can be done with it
- Check out FindBugs and see if anyone has tried to use it as a guide for refactoring