Doug Binks - 27 Feb 2014 - edited 27 Oct 2014
The video showcases three players zooming around at high speed in an environment being streamed from one player's PC to the other two, with the recording being done on one of the clients. This devblog post is about the steps I took to get this working smoothly.
Performance tuning is a large and varied topic, so I'm going to concentrate on one aspect which came up during development of the streaming system - framerate hitch removal. A framerate hitch occurs when one frame takes longer than most of the rest of the frames. To the player it can feel like a sudden start/stop. If frequent, the game takes on the feeling of being played on sandpaper.
When I introduced networked octree streaming I also introduced a hitching problem. The addition of streamed octree data hadn't changed the rendering of the game (though we will confirm this in analysis), so any extra hitching was likely down to extra CPU processing.
One approach to profiling is to use event or timer based profilers. These interrupt the system at regular intervals (timer based) or when a condition is met (event based) and log which processes, functions, and threads are currently active. This works with any program, and with debug information available you can statistically analyze the amount of time spent in functions. Such classical performance profiling may not reveal the reasons for hitching, since a hitch may occur so infrequently that the culprit doesn't show up in the statistics. Increasing the sampling resolution doesn't help either, since this begins to significantly affect the performance being measured. To analyse hitching in games we need something which can record what occurs within a single frame accurately enough to resolve intervals of the order of microseconds, yet have low overhead.
The standard solution to getting accurate timings within a frame is to insert lightweight profiling code into your application. I'm currently using Intel's ITT API, along with the free Intel GPA Platform Analyzer, though for more detail you can also combine this with Intel's VTune Amplifier XE.
Many developers use their own internal API based around ideas similar to the "Real-Time Hierarchical Profiling" article by Greg Hjelstrom, Byon Garrabrant in Game Programming Gems 3, which is a great option in many cases as you can get a real time view and implement things like a circular buffer with a hitch detection trigger to dump out data on frame hitches, or just output information on screen about them.
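Such a profiler needn't be large. As a rough illustration (the names and structure here are hypothetical - this is not the Gems 3 code nor any shipping implementation), a scoped timer writing into a fixed size ring buffer, with an end-of-frame check that dumps the buffer when a hitch threshold is exceeded, might look like:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

using Clock = std::chrono::steady_clock;

// One begin or end marker; nesting of scopes gives the hierarchy.
struct ProfileEvent {
    const char* name;       // static string label
    Clock::time_point time;
    bool begin;
};

// Events go into a fixed size ring buffer; a frame that exceeds the hitch
// threshold dumps the buffer instead of letting it silently scroll away.
class RingProfiler {
public:
    explicit RingProfiler(size_t capacity) : events_(capacity) {}

    void record(const char* name, bool begin) {
        events_[head_++ % events_.size()] = { name, Clock::now(), begin };
    }

    // Call once per frame; returns true (and dumps) if the frame hitched.
    bool endFrame(double hitchThresholdMs) {
        auto now = Clock::now();
        double frameMs =
            std::chrono::duration<double, std::milli>(now - frameStart_).count();
        frameStart_ = now;
        if (frameMs > hitchThresholdMs) {
            dump();
            return true;
        }
        return false;
    }

private:
    void dump() const {
        size_t count = head_ < events_.size() ? head_ : events_.size();
        for (size_t i = 0; i < count; ++i) {
            // Oldest-first ordering of a wrapped buffer omitted for brevity.
            const ProfileEvent& e = events_[i];
            std::printf("%s %s\n", e.begin ? "begin" : "end  ", e.name);
        }
    }

    std::vector<ProfileEvent> events_;
    size_t head_ = 0;
    Clock::time_point frameStart_ = Clock::now();
};

// RAII scope marker so instrumentation is a single line per function.
struct ProfileScope {
    ProfileScope(RingProfiler& p, const char* name) : p_(p), name_(name) {
        p_.record(name_, true);
    }
    ~ProfileScope() { p_.record(name_, false); }
    RingProfiler& p_;
    const char* name_;
};
```

A real version would also store thread IDs and keep per-thread buffers to avoid contention, but the shape is the same: cheap intrusive markers plus a trigger that preserves the data around the hitch.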
One danger of writing your own profiler on general purpose systems (as opposed to consoles) is that you may not catch external events.
Figure 1: The main thread has been preempted by the OS to run another thread. Solid green shows the CPU working on a thread, light green shows that thread is awaiting a synchronization event, and grey signals that the thread has been preempted. (click to enlarge)
In Figure 1 above you can see a single frame of Avoyd displayed in GPA Platform Analyzer (PA). The game is in a windowed mode with vertical sync enabled and is locked to 60Hz or ~16ms, which the game easily achieves most of the time. The occasional hitch is taking the frame over 16ms, and in this frame we have ~33ms (two vertical sync times).
The grey block outlined in red on the ‘Main Thread' row is where the OS has preempted the main thread. It looks like one of our tasking threads was suspended, and when that thread was woken by receiving the task available event, the main thread was preempted. Without the ability to detect context switches, we'd have incorrectly assumed the main thread was spending too long in a function and so causing the hitch. We now know that to resolve this issue we instead need to consider why the context switch occurred and whether we can help resolve it - there's more data displayed in PA which I've not shown that could help here, along with tools such as Windows Event Tracing (see Bruce Dawson's blog for more excellent information on this). It turns out this occurs rarely, so I'll leave that discussion for another day other than to highlight the importance of being aware of such possibilities.
Using the ITT API and Intel GPA Platform Analyzer is well covered by the documentation, though I'll note that for OpenGL applications there's no on-screen indication that you've triggered a trace capture. Recently the capture triggers appear not to work, so just use manual keyboard triggering via CTRL-SHIFT-T and hope for the best. If there's a call for more information on these details let me know in the comments and I'll write a short tutorial.
Scrolling around the timeline in PA shows that the vast majority of frames are 16ms long; indeed both the GPU and CPU work are well under budget, with vertical sync being our bottleneck (for those wondering, yes I do include an option to turn vsync off). When looking for hitching, having vertical sync on makes the cases where we miss the 16ms deadline show up very clearly, and I was able to see a pattern, with a representative frame shown in figure 2.
The image from PA displays information on the GPU queue, and threads in our application as well as others. We're running in windowed mode here, which I'd not advise generally for performance tuning. Indeed, you can see the dwm.exe process is using both some CPU and GPU time near the beginning of our frame, though this isn't causing the frame hitch.
Before I detail the problem, I need to explain that the three threads marked AvoydUniverse_Profile.exe are from Avoyd, and are part of the tasking system. Avoyd has both frame synchronous and asynchronous tasks. On the right hand side, the solid green activity labelled GenVertsOctreeTask_ExecuteRange is an asynchronous set of tasks which can take more than one frame to complete; these don't sync with the main thread and so don't normally cause stalls, which makes them irrelevant to this investigation of hitching. On the left hand side, however, we can see some tasks on these threads along with similar activity in the main thread - a set of synchronous tasks copying vertex data calculated from prior work into vertex buffers. You can see a lot of inactive light green between each task, signalling poor utilization. This matters here, as overall we're delaying the main thread by about 5ms, and it looks like better utilization could reduce this to about 3ms. We'll call this the vertex copy problem, and come back to it later as it's not the worst issue visible here.
In the screenshot I've hovered the mouse over the main problem: a heinous sudden appearance of an extra 6.7ms of work on the main thread under the task label CubeWorld_UpdateModifications. Scanning around the rest of the trace I found this function taking significant time in all the hitched frames. Below it you can just make out the start of the label CubeWorld_ID_OCTREE_CHUNK_DATA_INTERNAL_CLIENT, which identifies the task that processes the octree streaming data sent from the server. Not only is this lengthening the main thread execution for that frame, but meanwhile we're leaving other threads idle, as my manual task scheduling doesn't allow reading from an octree whilst it's being written to (something I could resolve, perhaps more on that later!).
So we've found two issues: the parallel copying of data into our vertex buffers, and the processing of incoming octree streaming data.
There are a number of high level ways I could resolve the octree chunk processing problem. All my task threads are idle during this timeframe, so I could parallelize the operation. There are multiple chunks being processed in most of the problem frames, so I could also spread these out over a few frames to reduce the cost per frame. However I first wanted to see if there was a way to improve single threaded performance of the code.
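The spread-over-frames option can be sketched as a queue of pending chunk work drained under a per-frame time budget. This is a hypothetical illustration rather than Avoyd's code - the names and the use of std::function are my own simplifications:

```cpp
#include <chrono>
#include <deque>
#include <functional>

// Pending chunk work is queued, and each frame processes chunks only until
// a time budget is spent, spreading a burst of incoming streamed data over
// several frames instead of one long hitch.
class ChunkQueue {
public:
    void push(std::function<void()> work) { pending_.push_back(std::move(work)); }

    // Returns the number of chunks processed this frame.
    size_t processWithBudget(std::chrono::microseconds budget) {
        auto start = std::chrono::steady_clock::now();
        size_t processed = 0;
        // Check the clock before each chunk; a single oversized chunk can
        // still overshoot, so chunk granularity matters too.
        while (!pending_.empty() &&
               std::chrono::steady_clock::now() - start < budget) {
            pending_.front()();
            pending_.pop_front();
            ++processed;
        }
        return processed;
    }

    size_t pendingCount() const { return pending_.size(); }

private:
    std::deque<std::function<void()>> pending_;
};
```

The trade-off is latency: streamed data shows up a few frames later than it otherwise would, which is usually acceptable for world streaming.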
Indeed, the code I'd written for octree chunk handling used a naive method of iterating through the incoming octree and doing a leaf by leaf insertion into the client octree. This was fairly fast, but an obviously faster alternative is to insert the incoming octree data wholesale, in a similar way to splicing a node into a linked list.
In order to reduce fragmentation and cache misses, Avoyd's octrees are split up into fixed sized blocks of memory. This means I can't simply change a node to point to the incoming octree data, I also have to copy the data into the client's octree. Additionally I need to modify the incoming block and index values (which point to the location of a node in terms of which memory block and which index in that block it is) to represent the location in the new octree. This turned out to be very fast, giving me roughly a hand waving 10x performance increase, easily sufficient for me to move on to other issues rather than gold plate the solution with multithreaded processing.
Prior to octree streaming my simple buffer copying mechanism worked well across platforms and needed no advanced features such as permanent buffer mapping (GL_ARB_buffer_storage). I simply mapped the buffers and then did a multithreaded memcpy of the data. Sadly I was doing multiple maps, unmapping each buffer before mapping the next, which synchronised the tasks with the main thread. A better solution would be to use a single buffer, but to get moving I simply mapped all the buffers at the start of the frame, set off the memcpy tasks, then went on to other single threaded work in the main thread before unmapping once they were complete. This isn't ideal, but it worked well and I'll be modifying buffer management to resolve this later.
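That reordering - map everything, fire off the copy tasks, do other work, then wait and unmap - can be sketched as below. The mapped pointers here are stand-ins (in the real code they'd come from glMapBuffer or glMapBufferRange), and std::async stands in for the tasking system:

```cpp
#include <cstring>
#include <future>
#include <vector>

// One outstanding copy into an already-mapped buffer.
struct PendingCopy {
    void*       dst;    // mapped buffer pointer
    const void* src;    // vertex data produced by earlier tasks
    size_t      bytes;
};

void copyAllBuffers(std::vector<PendingCopy>& copies) {
    // Launch one copy task per buffer; none of them touch the GL context,
    // so they can all run concurrently once the buffers are mapped.
    std::vector<std::future<void>> tasks;
    tasks.reserve(copies.size());
    for (PendingCopy& c : copies)
        tasks.push_back(std::async(std::launch::async,
            [&c] { std::memcpy(c.dst, c.src, c.bytes); }));

    // ... other single threaded main thread work would go here ...

    for (std::future<void>& t : tasks) t.wait();
    // Only after this point does the main thread unmap the buffers.
}
```

The key point is that no unmap happens between copies, so the tasks never force the main thread to wait mid-frame.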
The results were great, a steady smooth sub 16ms frame time across the board with good CPU utilization when running tasks across threads, as you can see from the solid green in the left half of the screenshot from PA in figure 3. There are two frames shown, and the second frame has no work on the tasking system as there were no vertex data or octree changes to process, and there is currently no other work in the game such as AI or physics to process.
Some quick afterthoughts:
Optimizing for a smooth framerate is subtly different from working to get high mean performance, and appropriate tools are required.
Context switching might confuse simple timer based analysis.
If you have octrees, a well designed octree insert is a useful utility function.
You can map multiple buffers and copy to them asynchronously.
And some ideas on the tools:
Intel GPA Platform Analyzer is decent, but it's fiddly to use. It could do with the following improvements (I'll contact the team and see if they can help):
Pause / unpause tracing via button on ‘Analyze Application' app.
Notification of trace start/stop via monitor or app (i.e. not just through DirectX HUD which doesn't show for non DX apps).
Ensure triggers work.
Ensure long traces work - I can only get 5 second traces to work repeatedly; anything longer is missing data.
Circular buffer continuously recording with ability to dump output on hitch trigger.
Add OpenGL call tracing.
Add ability to profile OS X and Linux apps (the tools work on these systems for analysing Android, but not OS X and Linux despite the ITT API working there for VTune).
I should check out RAD Game Tools Telemetry.
I'm considering writing an open source task profiling system using intrusive timers like ITT, with visualization either via Google Chrome's perf monitoring API or through a lightweight UI built with Mikko Mononen's NanoVG (NanoVG on Github). Since I think most performance conscious developers already have tools available, I've not put this as a high priority, but if people are genuinely interested let me know in the comments.
For Avoyd, I'll be doing more performance work in future. For now I need to work on functionality, as I find optimizing an easy rat-hole to get too absorbed in!
The following tools can provide performance analysis showing instrumented functions along with context switches and synchronization:
If you're interested in analyzing data from your own performance monitoring API, you can use the trace-viewer in Google Chrome.
For general profiling, try:
[Edit 27 Oct 2014: Intel GPA is no longer completely free; updated Intel GPA link and added mentions of Microprofile and Remotery]