Tracy requires compiler support for C++11, Thread Local Storage and a way to workaround static initialization order fiasco. There are no other requirements. The following platforms are confirmed to be working (this is not a complete list):
- Windows (x86, x64)
- Linux (x86, x64, ARM, ARM64)
- Android (ARM, x86)
- FreeBSD (x64)
- Cygwin (x64)
- WSL (x64)
- OSX (x64)
Source code and more information: https://bitbucket.org/wolfpld/tracy
A quick FAQ:
Q: I already use VTune/perf/Very Sleepy/callgrind/MSVC profiler.
A: These are statistical profilers, which can be used to find hot spots in the code. This is very useful, but it won't show you the underlying reason for semi-random frame stutter that may occur every couple of seconds.
Q: You can use Telemetry for that.
A: Telemetry license costs about 8000 $ per year. Tracy is open source software. Telemetry doesn't have Lua bindings.
Q: You can use the free Brofiler. Crytek does use it, so it has to be good.
A: After a cursory look at the Brofiler code I can tell that the timer resolution there is at 300 ns. Tracy can achieve 5 ns timer resolution. Brofiler event logging infrastructure seems to be overengineered. Brofiler can't track lock contention, nor does it have Lua bindings.
Q: So tracy is supposedly faster?
A: My measurements show that logging a single zone with tracy takes only 15 ns. In theory, if the program was doing nothing else, tracy should be able to log 66 million zones per second.
Q: Bullshit, RAD is advertising that they are able only to log about a million zones, over the network nevertheless: "Capture over a million timing zones per second in real-time!"
A: Tracy can perform network transfer of 15 million zones per second. Should the client and server be on separate machines, this number will be even higher, but you will need more than a gigabit link to achieve the maximum throughput. https://www.youtube.com/watch?v=DSMIHShKGAc
Q: Can I connect to my application at any time and start profiling at this moment?
A: No, all events are registered from the beginning of program's execution and are waiting in a queue.
Q: Am I seeing correctly that the profiler allocates one gigabyte of memory per second?
A: Only in extreme cases. Normal usage has much lower memory pressure.
Q: Why do you do magic with the static initialization order? Everyone says that's a bad practice.
A: It allows tracking construction of static objects and memory allocations performed before main() is entered.
Q: There's no support for consoles.
A: Welp. But there's mobile support.
Q: I do need console support.
A: The code is open. Write your own, then send a patch.
Following is the annotated assembly code (generated from C++ sources) that's responsible for logging start of the zone:
call qword ptr [__imp_GetCurrentThreadId]
mov qword ptr [rsp+0F0h],r14 // save thread id for later use
mov rax,qword ptr gs:[58h] // TLS
mov r15,qword ptr [rax] // queue address
mov rdi,qword ptr [r12+r15] // data address
mov rbp,qword ptr [rdi+20h] // buffer counter
and ebx,7Fh // 128 item buffer
jne Application::InnerLoop+66h --+
mov rdx,rbp |
mov rcx,rdi |
call enqueue_begin_alloc | // reclaim/alloc next buffer
shl rbx,5 <---------------------+ // buffer items are 32 bytes
add rbx,qword ptr [rdi+40h]
mov byte ptr [rbx],4 // queue item type
mov dword ptr [rbx+19h],ecx // cpu id
or rax,rdx // 64 bit timestamp
mov qword ptr [rbx+1],rax
mov qword ptr [rbx+9],r14 // thread id
lea rax,[__tracy_source_location] // static struct address
mov qword ptr [rbx+11h],rax
lea rax,[rbp+1] // increment buffer counter
mov qword ptr [rdi+20h],rax