What is profiling?
Profiling is the process of mapping what an application actually does while it runs. There are different ways to do this. Normally I use an application called 'gprof', which works in combination with the GCC '-pg' option. Basically, during compilation the compiler leaves traces behind, which are written to a file during execution of your application. Afterwards you run gprof over that file, which tells you which function got called how many times, and how much of the CPU time was spent in it.
Gprof is nice. It works, is efficient, and is commonly available. The main issue for me with it on applications like OpenTTD: it is CLI. You have to read through massive amounts of text to find what you are looking for. Not really user friendly.
So, I have been looking into the MSVC profiling tool. About it I can only say: BAH! The tool itself is very nice and everything, but a small profile of OpenTTD (a few game-days) consumes 8GB of disk storage. That is close to insane. Even worse: it randomly adds 500msec to functions because of I/O time, despite the fact that you are using the values that claim to exclude I/O time. In short: MSVC profiling is unusable, as the information is not consistent.
This brings me to what profiling is good for. If I see that function A takes 50% of the CPU time, I can fiddle with it over and over until I see it is down to 40%. That means I increased the speed of OpenTTD by 10%, and of that function by 20%! That is the kind of information you want to know. MSVC in this case randomly adds 500msec to random functions, making it impossible for me to see whether there was any gain, or just another f***ed up profile.
A few days ago, michi_cc told me (most likely by accident) about callgrind. I knew about callgrind, I just never took any interest in it. The concept of callgrind is different from the above methods. The above methods run on your real machine, profiling how long your CPU really takes. This gives a somewhat random error rate, because processor caches can be invalidated by other applications, I/O can get stuck for whatever reason (HDD spin-down), ... In short: they are not meant to give identical results every run. Callgrind does it differently. It emulates your application, tracking for every line in your code how many instructions it executes. Because of this, every run under the same conditions gives identical results. Now we are talking!
As a nice plus, KCacheGrind gives me a GUI to walk through the whole call tree. I can also see the so-called cache misses, branch prediction misses, and much more. Most of those terms will mean nothing to you (and if they do: good for you), but for performance they can be very important. As a cherry on top, callgrind can tell me per line in the source how much time it took to execute. Not per function. Per source line.
As I was sick today and wanted to do something boring, my quest began!
Initial results
After fiddling with the parameters a bit and making a simple savegame (a completely empty map), I went off. From the first profile (it takes about 30 seconds to profile an empty map, FYI) it became clear that one function consumed relatively much CPU. SmatZ had an idea to try: swap the two halves of one condition. The result was insane:
Code: Select all
if (_settings_game.construction.freeform_edges && (x == 0 || y == 0)) return INVALID_TILE; // before
if ((x == 0 || y == 0) && _settings_game.construction.freeform_edges) return INVALID_TILE; // after
Expert talk: at first glance, you would think that the value of freeform_edges matters. But it does not. Fetching the freeform value takes at least 5 instructions (can't remember the real value). (x == 0 || y == 0) takes 4 instructions (assuming the values are in registers, or at least nearby). So we now have 4 cases. Let's first look at freeform_edges being true. In the old order, it always takes 9 instructions. Always. In the new order, it takes either 4 or 9, depending on whether the first part holds true. So, a clear gain. The other case: false. In the old order, it always took 5 instructions. In the new order, it takes 4 instructions + (chance of hitting the edge) * 5 instructions. The chance of hitting is so low on normal sized maps that it is still faster. Surprising, no? The only exception is small maps. On 64x64, the new method is slower with freeform off. But the chance that you have freeform set to false and play on a small map ... The likelihood of a condition holding true introduces so many fun (and I guess unexpected?) results.
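A minimal sketch of why the order matters. The names (`on_map_edge`, `freeform_edges`) and the counters are mine, not OpenTTD code: `&&` short-circuits, so whichever operand comes first decides what the common case (an interior tile) has to pay for.

```cpp
#include <utility>

// Hypothetical stand-ins for the real checks; the counters let us observe
// how often each operand of && is actually evaluated.
static int cheap_evals = 0;
static int costly_evals = 0;

static bool on_map_edge(int x, int y) { ++cheap_evals; return x == 0 || y == 0; }
static bool freeform_edges() { ++costly_evals; return false; } // pretend: the costly fetch

// Walk 1000 interior tiles with the costly check first (the old order),
// then with the cheap check first (the new order); return how many times
// the costly fetch ran in each case.
std::pair<int, int> CostlyEvalCounts()
{
	costly_evals = cheap_evals = 0;
	for (int x = 1; x <= 1000; x++) {
		if (freeform_edges() && on_map_edge(x, 5)) { /* return INVALID_TILE */ }
	}
	int old_order = costly_evals;

	costly_evals = cheap_evals = 0;
	for (int x = 1; x <= 1000; x++) {
		if (on_map_edge(x, 5) && freeform_edges()) { /* return INVALID_TILE */ }
	}
	int new_order = costly_evals;

	return {old_order, new_order};
}
```

On this all-interior walk the old order pays for the costly fetch on every tile, while the new order never reaches it at all.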
The beauty of having a visual representation of your application is that you can walk the so-called "hot paths", and see what is really using CPU and what is not. Improving a random function is good, but if it is rarely called, you won't really notice it. For example, if cloning orders were very slow (I have no idea, I haven't looked!), you could optimize it all you want, but in the end .. the gain is very low. After all, how often do you clone orders? Once a game day? Not really relevant if you look at the bigger picture.
Going on from the modification above, I found a function that had a for loop, and inside that loop it checked whether the value you loop over lies outside the map. This is inefficient. It is easier to check once before the loop, and simply not loop over the values outside the map. A few minutes later, the function was 3 times faster. Now that is impressive.
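A sketch of that transformation, with made-up names (`MAP_SIZE` and the `SumAround*` functions are not the real OpenTTD code): instead of testing every coordinate inside the loop, clamp the loop bounds to the map once, up front.

```cpp
#include <algorithm>

const int MAP_SIZE = 64; // hypothetical map dimension

// Inside-the-loop check: every single iteration pays for the bounds test.
int SumAroundChecked(int cx, int cy, int radius)
{
	int sum = 0;
	for (int x = cx - radius; x <= cx + radius; x++) {
		for (int y = cy - radius; y <= cy + radius; y++) {
			if (x < 0 || y < 0 || x >= MAP_SIZE || y >= MAP_SIZE) continue;
			sum += x + y; // stand-in for the real per-tile work
		}
	}
	return sum;
}

// Hoisted: clamp the ranges once, then loop only over valid tiles.
int SumAroundClamped(int cx, int cy, int radius)
{
	int x_min = std::max(cx - radius, 0), x_max = std::min(cx + radius, MAP_SIZE - 1);
	int y_min = std::max(cy - radius, 0), y_max = std::min(cy + radius, MAP_SIZE - 1);
	int sum = 0;
	for (int x = x_min; x <= x_max; x++) {
		for (int y = y_min; y <= y_max; y++) {
			sum += x + y; // no per-iteration bounds test
		}
	}
	return sum;
}
```

Both versions compute the same result; the second simply never visits (or tests) tiles that are off the map.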
When it comes to profiling, you have to be careful how you present values. I just told you I made a function 3 times faster. But what does that mean? In this case I mean it took 3 times fewer instructions to execute the function. Well, that is nice. But the function on its own was only good for 15% of the total CPU for that specific map. It went from 15% to 5%, so the total only dropped by 10 percentage points. And if I had loaded another savegame, the function would (before my modification) only have consumed 0.01% of the total CPU. So .. I saved .. well .. nothing. At least, not for that savegame. What I am trying to say is that improving an application with this approach always only improves it locally, for a certain set of conditions. If the conditions change enough, the improvement becomes unnoticeable. So this doesn't magically mean your huge game that runs slow runs faster today. Sure, it does. A bit. But not noticeably. Not for you, at least.
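This is just Amdahl's law; a tiny helper (hypothetical, not from OpenTTD) makes the arithmetic explicit:

```cpp
// Overall speedup of the whole application when one part that uses
// `fraction` of the total CPU time is made `local_speedup` times faster
// (Amdahl's law): 1 / ((1 - fraction) + fraction / local_speedup).
double OverallSpeedup(double fraction, double local_speedup)
{
	return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

With fraction = 0.15 and a local speedup of 3, the overall speedup is 1 / (0.85 + 0.05) ≈ 1.11: the roughly 10% from above. With fraction = 0.0001, the very same 3x optimization yields effectively nothing.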
There is a nasty problem with profiling. You mostly observe a tree of function calls in relative values. So, function A takes 30% of the CPU, function B takes 50%, and the rest combined takes 20%. When I make function B twice as fast, it doesn't drop to 25%. No, instead function A starts to consume 45% (guesstimated values), and function B 40%. So by fixing one function, another one rises in the ranking. This is of course okay, and to be expected, but it can be annoying. You are constantly optimizing a function, only to see another one popping up. Ugh.
Real world test results
I asked in another thread for savegames from you guys which stress OpenTTD. I already received a few (tnx guys), and I put them to the test. Like I said, the above functions vanished from the top, but surprisingly one other popped up: Train::GetImage.
Over the years we have profiled OpenTTD many times. Some of us (like myself) like to toy around with applications on this level, and we have often optimized OpenTTD by fixing that one function to work much, much faster, or by simply avoiding it being called. Other times we added caches, or made better hashes; the amount of optimizations we have done is not small. But like I said before, if you fix one function, another takes its place. For latest trunk this is GetImage. The function itself is not slow. Let's say it is 100 instructions long. For what it does, that is fine. How long it really takes depends a bit on which NewGRFs you have loaded, but okay, not really important.
The problem in this case is the frequency with which it is called. Every time a train is updated, it is called for all its wagons. And this happens a few times per update. So if 100k trains update and you have trains of length 10, this function is called 10M times. At these numbers, things become hard to optimize. In this case, optimizing the function itself is not really possible. The only real solution would be to lower the number of calls made to this function. Sadly, that is always a tricky job. You can introduce a cache. We have done that a lot in the past, and it is rather efficient. It is what we often refer to as "CPU vs Memory". A basic principle is: you can make any function faster at the expense of memory. In fact, you can go as far as storing every state and result in memory, making the function very, very fast (you predefine the result for all its inputs, meaning the function only does a lookup and no calculation at all). This of course consumes far too much memory for any sane machine to handle. You are always looking for a fair balance between CPU and memory.
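As an illustration of that trade, here is a generic memoization sketch. This is not how GetImage actually works, and as explained next, a cache may not even be valid there; the names are made up:

```cpp
#include <unordered_map>

// Hypothetical expensive computation, standing in for a costly lookup.
int ExpensiveSprite(int key)
{
	int result = 0;
	for (int i = 0; i < 1000; i++) result += (key * i) % 7; // pretend work
	return result;
}

// The "CPU vs Memory" trade: remember every answer computed so far, so
// repeated calls with the same input cost only a hash lookup. Memory
// grows with the number of distinct inputs seen.
int CachedSprite(int key)
{
	static std::unordered_map<int, int> cache;
	auto it = cache.find(key);
	if (it != cache.end()) return it->second;
	int result = ExpensiveSprite(key);
	cache[key] = result;
	return result;
}
```

The extreme end of the trade is precomputing the table for all possible inputs up front; the sweet spot depends on how many distinct inputs actually occur.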
In this case it is most likely not possible to introduce a cache to solve the problem, mainly because NewGRFs might use this feature. I know very little about NewGRFs, so I might be completely wrong. My initial advice was to change the specs and move the place GetImage gets called from to the viewport, where I would expect this function: only call it when you are about to draw something. That would reduce the 10M value to something like 100. Now that is how you improve speed. (hypothetical!) How this problem will be fixed, I don't know yet. That is for people with more NewGRF knowledge than me to solve. When they solve it, and GetImage gets called only rarely, the next function is already waiting to become the top user. But in contrast to the first optimization I told you about, you will notice this one.
In big games atm it takes, like I said, a large chunk of your CPU time to do this GetImage. If it is no longer called so often, your game will run a fair bit faster. Let's say .. 20%. That means you could build 20% more trains before being back at the same performance level. And you will notice that ... a bit.
For those who are interested, the next function in line is VehicleMove. That will be one tough function to optimize. So don't get your hopes up.
Why do you do this?
I mostly do these profiles because it is fun. Within the dev-team we often debate whether a change will make the game more efficient or not. Mostly we go by our own knowledge of how compilers and CPUs work. Sometimes it goes as far as compiling the change and looking at the assembly. But now I have a shiny tool that just tells me, in black and white, which version is better and which is not.
Also, from time to time, it will show us functions like GetImage, which should be relatively easy to change. About a year ago (to name a random well documented case) SmatZ did similar profiling (with gprof, I imagine), which showed GetSlopeZ was called a lot in the train handler. He fixed it up (with some rather complex patches) to not update all the time (thereby reducing the amount of calls to it), speeding up the game by a few factors. (Exact details might differ; I am just trying to paint you a picture.)
We will continue to profile, and look for places where we can get clear benefits. We have some ideas for new algorithms to improve speed, and now we have (next to our trusted TIC/TOC, another story) the tools to see if they work.
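For reference, a TIC/TOC-style timer can be sketched in a few lines with std::chrono. The real OpenTTD macros are a different beast, so treat this purely as the idea: take a timestamp, do the work, report the elapsed time.

```cpp
#include <chrono>

// Minimal TIC/TOC sketch: Tic() records a start point, TocSeconds()
// returns the wall-clock time elapsed since then.
class TicToc
{
	std::chrono::steady_clock::time_point start;

public:
	void Tic() { start = std::chrono::steady_clock::now(); }

	double TocSeconds() const
	{
		return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
	}
};
```

Unlike callgrind's instruction counts, a timer like this measures real elapsed time, so results vary a bit from run to run; it is the quick-and-dirty complement, not a replacement.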
For me, it is just a nice adventure, something different from the normal things I do during a day. It was nice to spend my day profiling, and I hope you enjoyed reading about it.
Why this blog?
As some of you might know, I also work on OpenDUNE. I regularly write blog posts there talking about OpenDUNE, how the game works, and what we experienced. It is kinda nice, and today I was in the mood to do the same for OpenTTD.