As a gamedev, I almost never need the total time spent in a function, rather I need to visualize the total time spent in a function for that frame. And then I scan the output for long frames and examine those hotspots one frame at a time. Would be nice to be able to use that workflow in this, but visualizing it would be much different.
Nice, I like the colored output tables. Started tinkering with a small profiling lib as well a while ago.
https://github.com/gurki/glimmer
It focuses on creating flamegraphs to view in e.g. https://www.speedscope.app/. I wanted to use std::stacktrace, but stack traces are very costly to evaluate, even just lazily at exit. Eventually I just tracked thread and call layer manually.
If I understand correctly, you're tracking your call stack manually as well using some graph structure on linear ids? Mind elaborating a bit on its functionality and performance? Also proper platform-independent function names were a pita. Any comments on how you addressed that?
Speed scope is awesome.
I've been thinking about using speedscope as a reference to make a native viewer like that.
Sampling profilers (like perf) are just so much easier to use than source markup ones. Just feel like the tooling around perf is bad and that speedscope is part of the solution.
General rundown of the logic can be found in this comment on reddit: https://www.reddit.com/r/cpp/comments/1jy6ver/comment/mmze20...
About linear IDs: a call graph in the general case is a tree of nodes, where each node has a single parent and an arbitrary number of children. Each node accumulates time spent in the "lower" branches. A neat property of the call graph relative to a generic tree is that every node can be associated with a callsite. For example, if some function f() calls itself recursively 3 times, there will be multiple nodes corresponding to it, but in terms of callsites there is still only one. So let's take a simple call graph as an example:
Let's say f() has callsite id '0', g() has callsite id '1', h() has callsite id '2'. The call graph will then consist of N=5 nodes with M=3 different callsites:

Node id: { 0 1 2 3 4 }
Callsite id: { 0 0 0 1 2 }

We can then encode all "prev" nodes as a single N-vector, and all "next" nodes as an MxN matrix with some kind of sentinel value (like -1) in places with no connection. For this example this results in the following:

Node id: { 0 1 2 3 4 }
Prev. id: { x 0 1 2 2 }

Next id matrix: [ 1 2 3 x x ]
                [ x x 4 x x ]

Every thread has a thread-local call graph object that keeps track of this traversal; it holds 'current_node_id'. Traversing backwards on the graph is a single array lookup:

current_node_id = prev_node_ids[current_node_id];

Traversing forwards to an existing call graph node is a lookup & branch:

next_node_id = next_node_ids[callsite_id, current_node_id];
if (next_node_id == x) create_node(); // rarely taken, usually well-predicted
else current_node_id = next_node_id;

New nodes can be created pretty cheaply too, but that's too verbose for a comment. The key to tracking the callsites and assigning them IDs is the thread_local variables generated by the macro: https://github.com/DmitriBogdanov/UTL/blob/master/include/UT...
When the callsite marker initializes (which only happens once), it gets a new ID. The timer then gets this 'callsite_id' and passes it to the forwards-traversal. The way we get function names is by simply remembering the __FILE__, __func__, __LINE__ pointers in another array of the call graph; they get saved during callsite marker initialization too. As far as performance goes, everything we do is cheap & simple operations; at this point the main overhead is just from taking the timestamps.
How does this compare to Microprofile?
https://github.com/jonasmr/microprofile
Btw, I recently worked with a library that had their own profiler which generated a Chrome trace file, so you could load it up in the Chrome dev tools to explore the call graph and timings in a fancy UI.
It seems like such a good idea and I wish more profiling frameworks tried to do that instead of building their own UI.
Haven't worked with it, but based on an initial look it's quite a different thing, closer to a frame-based profiler like Tracy (https://github.com/wolfpld/tracy).
As far as differences go:
Microprofile:
- frame-based
- needs a build system
- memory usage starts at 2 MB per thread
- runs 2 threads of its own
- provides system-specific info
- good support for GPU workloads
- provides live view
- seems like a good choice for gamedev / rendering

utl::profiler:
- no specific pipeline
- single include
- memory usage starts at approx. nothing and would likely stay in kilobytes
- doesn't run any additional threads
- fully portable, nothing platform-specific whatsoever, just standard C++
- doesn't provide system-specific info, just pure timings
- seems like a good choice for small projects or embedded (since the only thing it needs is a C++ compiler)

This looks great! I've been needing something like this for a while, for a project which is quite compute-heavy and uses lots of threads and recursion. I've been using valgrind to profile small test examples, but that's definitely the nuclear option since it slows down the execution so much. I'm going to try this out right away.
Also discussed on /r/cpp: https://www.reddit.com/r/cpp/comments/1jy6ver/utlprofiler_si...
Do you also have some tools or scripts to help annotate code?
One inconvenience with this library's approach is having to modify the code to add/remove instrumentation, compared to something like GNU gprof which has compiler support and doesn't require modifying the code.
I've thought about this but have yet to come up with a simple approach; perhaps something like a Python script hooked to GCC-XML can do the trick, will look into that in the future.
Great work! The colored, structured output is clean! Folks may also be interested in nanobench, which is also a single header C++ lib. It focuses on benchmarking blocks of code, though.