What is profiling?
Profiling is the process of mapping what an application actually does while it runs. There are different ways to do this. Normally I use an application called 'gprof', which works in combination with the GCC '-pg' option. Basically, during compilation the compiler leaves traces behind, which are written to a file during execution of your application. Afterwards you run gprof over that file, which tells you which function got called how many times, and how much of the CPU time was spent in it.
Gprof is nice. It works, is efficient, and is commonly available. The main issue for me with it on applications like OpenTTD: it is CLI. You have to read through massive amounts of text to find what you are looking for. Not really user friendly.
So, I have been looking into the MSVC profiling tool. About it I can only say: BAH! The tool itself is very nice and everything, but a small profile of OpenTTD (a few game-days) consumes 8GB of disk storage. That is close to insane. Even worse: it randomly adds 500msec to functions because of I/O time, despite the fact that you are using the values that claim to exclude I/O time. In short: MSVC profiling is unusable, as the information is not consistent.
This brings me to what profiling is good for. If I see that function A takes 50% of the CPU time, I can fiddle with it over and over until I see it is down to 40%. That means I increased the speed of OpenTTD by 10%, and of that function by 20%! That is the kind of information you want to know. MSVC in this case randomly adds 500msec to random functions, making it impossible for me to see whether there was any gain, or just another f***ed up profile.
A few days ago, michi_cc told me (most likely by accident) about callgrind. I knew about callgrind, I just never took any interest in it. The concept of callgrind is different from the above methods. The above methods run on your real machine, profiling how long your CPU really takes. This gives a somewhat random error rate, because processor caches can be invalidated by other applications, I/O can get stuck for whatever reason (HDD spin-down), ... In short: they are not meant to give identical results every run. Callgrind does it differently. It emulates your application, tracking for every line in your code how many instructions it executes. Because of this, every run under the same conditions gives identical results. Now we are talking!
As a nice plus, KCacheGrind gives me a GUI to walk through the whole call tree. I can also see the so-called cache misses, branch prediction misses, and much more. Most of those terms will mean nothing to you (and if they do: good for you), but for performance they can be very important. As a cherry on top, callgrind can tell me per line in the source how much time it took to execute. Not per function. Per source line.
As I was sick today and wanted to do something boring, my quest began!
Initial results
After fiddling with the parameters a bit and making a simple savegame (a completely empty map), I went off. From the first profile (it takes about 30 seconds to profile an empty map, FYI) it became clear that one function consumed relatively much CPU. SmatZ had an idea to try: swap the two halves of one condition. The result was insane:
Code: Select all
if (_settings_game.construction.freeform_edges && (x == 0 || y == 0)) return INVALID_TILE; // before
if ((x == 0 || y == 0) && _settings_game.construction.freeform_edges) return INVALID_TILE; // after
Expert talk: at first glance, you would think that the value of freeform_edges matters. But it does not. Fetching the freeform value takes at least 5 instructions (can't remember the real value). (x == 0 || y == 0) takes 4 instructions (assuming the values are in registers, or at least nearby). So we now have 4 cases. Let's first look at freeform_edges being true. In the old order, it always takes 9 instructions. Always. In the new order, it takes either 4 or 9, depending on whether the first part holds true. So, a clear gain. The other case: false. In the old order, it always took 5 instructions. In the new order, it takes 4 instructions + (chance of hitting the edge) * 5 instructions. The chance of hitting is so low on normal sized maps that it is still faster. Surprising, no? The only exception is small maps. On 64x64, the new method is slower with freeform off. But the chance that you have freeform set to false and play on a small map ... The likelihood of a condition holding true introduces so many fun (and I guess unexpected?) results.
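A minimal sketch of why the order matters. The names (`on_map_edge`, `freeform_edges`) and the counters are mine, not OpenTTD code: `&&` short-circuits, so whichever operand comes first decides what the common case (an interior tile) has to pay for.

```cpp
#include <utility>

// Hypothetical stand-ins for the real checks; the counters let us observe
// how often each operand of && is actually evaluated.
static int cheap_evals = 0;
static int costly_evals = 0;

static bool on_map_edge(int x, int y) { ++cheap_evals; return x == 0 || y == 0; }
static bool freeform_edges() { ++costly_evals; return false; } // pretend: the costly fetch

// Walk 1000 interior tiles with the costly check first (the old order),
// then with the cheap check first (the new order); return how many times
// the costly fetch ran in each case.
std::pair<int, int> CostlyEvalCounts()
{
	costly_evals = cheap_evals = 0;
	for (int x = 1; x <= 1000; x++) {
		if (freeform_edges() && on_map_edge(x, 5)) { /* return INVALID_TILE */ }
	}
	int old_order = costly_evals;

	costly_evals = cheap_evals = 0;
	for (int x = 1; x <= 1000; x++) {
		if (on_map_edge(x, 5) && freeform_edges()) { /* return INVALID_TILE */ }
	}
	int new_order = costly_evals;

	return {old_order, new_order};
}
```

On this all-interior walk the old order pays for the costly fetch on every tile, while the new order never reaches it at all.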
The beauty of having a visual representation of your application is that you can walk the so-called "hot paths", and see what is really using CPU and what is not. Improving a random function is good, but if it is rarely called, you won't really notice it. For example, if cloning orders were very slow (I have no idea, I haven't looked!), you could optimize it all you want, but in the end .. the gain is very low. After all, how often do you clone orders? Once a game day? Not really relevant if you look at the bigger picture.
Going on from the modification above, I found a function that had a for loop, and inside that loop it checked whether the value you loop over lies outside the map. This is inefficient. It is easier to check once before the loop, and simply not loop over the values outside the map. A few minutes later, the function was 3 times faster. Now that is impressive.
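A sketch of that transformation, with made-up names (`MAP_SIZE` and the `SumAround*` functions are not the real OpenTTD code): instead of testing every coordinate inside the loop, clamp the loop bounds to the map once, up front.

```cpp
#include <algorithm>

const int MAP_SIZE = 64; // hypothetical map dimension

// Inside-the-loop check: every single iteration pays for the bounds test.
int SumAroundChecked(int cx, int cy, int radius)
{
	int sum = 0;
	for (int x = cx - radius; x <= cx + radius; x++) {
		for (int y = cy - radius; y <= cy + radius; y++) {
			if (x < 0 || y < 0 || x >= MAP_SIZE || y >= MAP_SIZE) continue;
			sum += x + y; // stand-in for the real per-tile work
		}
	}
	return sum;
}

// Hoisted: clamp the ranges once, then loop only over valid tiles.
int SumAroundClamped(int cx, int cy, int radius)
{
	int x_min = std::max(cx - radius, 0), x_max = std::min(cx + radius, MAP_SIZE - 1);
	int y_min = std::max(cy - radius, 0), y_max = std::min(cy + radius, MAP_SIZE - 1);
	int sum = 0;
	for (int x = x_min; x <= x_max; x++) {
		for (int y = y_min; y <= y_max; y++) {
			sum += x + y; // no per-iteration bounds test
		}
	}
	return sum;
}
```

Both versions compute the same result; the second simply never visits (or tests) tiles that are off the map.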
When it comes to profiling, you have to be careful how you present values. I just told you I made a function 3 times faster. But what does that mean? In this case I mean it took 3 times fewer instructions to execute the function. Well, that is nice. But the function on its own was only good for 15% of the total CPU for that specific map. It went from 15% to 5%, so the total only dropped by 10 percentage points. And if I had loaded another savegame, the function would (before my modification) only have consumed 0.01% of the total CPU. So .. I saved .. well .. nothing. At least, not for that savegame. What I am trying to say is that improving an application with this approach always only improves it locally, for a certain set of conditions. If the conditions change enough, the improvement becomes unnoticeable. So this doesn't magically mean your huge game that runs slow runs faster today. Sure, it does. A bit. But not noticeably. Not for you, at least.
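This is just Amdahl's law; a tiny helper (hypothetical, not from OpenTTD) makes the arithmetic explicit:

```cpp
// Overall speedup of the whole application when one part that uses
// `fraction` of the total CPU time is made `local_speedup` times faster
// (Amdahl's law): 1 / ((1 - fraction) + fraction / local_speedup).
double OverallSpeedup(double fraction, double local_speedup)
{
	return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

With fraction = 0.15 and a local speedup of 3, the overall speedup is 1 / (0.85 + 0.05) ≈ 1.11: the roughly 10% from above. With fraction = 0.0001, the very same 3x optimization yields effectively nothing.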
There is a nasty problem with profiling. You mostly observe a tree of function calls in relative values. So, function A takes 30% of the CPU, function B takes 50%, and the rest combined takes 20%. When I make function B twice as fast, it doesn't drop to 25%. No, instead function A starts to consume 45% (guesstimated values), and function B 40%. So by fixing one function, another one rises in the ranking. This is of course okay, and to be expected, but it can be annoying. You are constantly optimizing a function, only to see another one popping up. Ugh.
Real world test results
I asked in another thread for savegames from you guys which stress OpenTTD. I already received a few (tnx guys), and I put them to the test. Like I said, the above functions vanished from the top, but surprisingly one other popped up: Train::GetImage.
Over the years we have profiled OpenTTD many times. Some of us (like myself) like to toy around with applications on this level, and we have often optimized OpenTTD by fixing that one function to work much, much faster, or by simply avoiding it being called. Other times we added caches, or made better hashes; the amount of optimizations we have done is not small. But like I said before, if you fix one function, another takes its place. For latest trunk this is GetImage. The function itself is not slow. Let's say it is 100 instructions long. For what it does, that is fine. How long it really takes depends a bit on which NewGRFs you have loaded, but okay, not really important.
The problem in this case is the frequency with which it is called. Every time a train is updated, it is called for all its wagons. And this happens a few times per update. So if 100k trains update and you have trains of length 10, this function is called 10M times. At these numbers, things become hard to optimize. In this case, optimizing the function itself is not really possible. The only real solution would be to lower the number of calls made to this function. Sadly, that is always a tricky job. You can introduce a cache. We have done that a lot in the past, and it is rather efficient. It is what we often refer to as "CPU vs Memory". A basic principle is: you can make any function faster at the expense of memory. In fact, you can go as far as storing every state and result in memory, making the function very, very fast (you predefine the result for all its inputs, meaning the function only does a lookup and no calculation at all). This of course consumes far too much memory for any sane machine to handle. You are always looking for a fair balance between CPU and memory.
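As an illustration of that trade, here is a generic memoization sketch. This is not how GetImage actually works, and as explained next, a cache may not even be valid there; the names are made up:

```cpp
#include <unordered_map>

// Hypothetical expensive computation, standing in for a costly lookup.
int ExpensiveSprite(int key)
{
	int result = 0;
	for (int i = 0; i < 1000; i++) result += (key * i) % 7; // pretend work
	return result;
}

// The "CPU vs Memory" trade: remember every answer computed so far, so
// repeated calls with the same input cost only a hash lookup. Memory
// grows with the number of distinct inputs seen.
int CachedSprite(int key)
{
	static std::unordered_map<int, int> cache;
	auto it = cache.find(key);
	if (it != cache.end()) return it->second;
	int result = ExpensiveSprite(key);
	cache[key] = result;
	return result;
}
```

The extreme end of the trade is precomputing the table for all possible inputs up front; the sweet spot depends on how many distinct inputs actually occur.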
In this case it is most likely not possible to introduce a cache to solve the problem, mainly because NewGRFs might use this feature. I know very little about NewGRFs, so I might be completely wrong. My initial advice was to change the specs and move the place GetImage gets called from to the viewport, where I would expect this function: only call it when you are about to draw something. That would reduce the 10M value to something like 100. Now that is how you improve speed. (hypothetical!) How this problem will be fixed, I don't know yet. That is for people with more NewGRF knowledge than me to solve. When they solve it, and GetImage gets called only rarely, the next function is already waiting to become the top user. But in contrast to the first optimization I told you about, you will notice this one.
In big games atm it takes, like I said, a large chunk of your CPU time to do this GetImage. If it is no longer called so often, your game will run a fair bit faster. Let's say .. 20%. That means you could build 20% more trains before being back at the same performance level. And you will notice that ... a bit.
For those who are interested, the next function in line is VehicleMove. That will be one tough function to optimize. So don't get your hopes up.
Why do you do this?
I mostly do these profiles because it is fun. Within the dev-team we often debate whether a change will make the game more efficient or not. Mostly we go by our own knowledge of how compilers and CPUs work. Sometimes it goes as far as compiling the change and looking at the assembly. But now I have a shiny tool that just tells me, in black and white, which version is better and which is not.
Also, from time to time, it will show us functions like GetImage, which should be relatively easy to change. About a year ago (to name a random well documented case) SmatZ did similar profiling (with gprof, I imagine), which showed GetSlopeZ was called a lot in the train handler. He fixed it up (with some rather complex patches) to not update all the time (thereby reducing the amount of calls to it), speeding up the game by a few factors. (Exact details might differ; I am just trying to paint you a picture.)
We will continue to profile, and look for places where we can get clear benefits. We have some ideas for new algorithms to improve speed, and now we have (next to our trusted TIC/TOC, another story) the tools to see if they work.
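For reference, a TIC/TOC-style timer can be sketched in a few lines with std::chrono. The real OpenTTD macros are a different beast, so treat this purely as the idea: take a timestamp, do the work, report the elapsed time.

```cpp
#include <chrono>

// Minimal TIC/TOC sketch: Tic() records a start point, TocSeconds()
// returns the wall-clock time elapsed since then.
class TicToc
{
	std::chrono::steady_clock::time_point start;

public:
	void Tic() { start = std::chrono::steady_clock::now(); }

	double TocSeconds() const
	{
		return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
	}
};
```

Unlike callgrind's instruction counts, a timer like this measures real elapsed time, so results vary a bit from run to run; it is the quick-and-dirty complement, not a replacement.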
For me, it is just a nice adventure, something different from the normal things I do during a day. It was nice to spend my day profiling, and I hope you enjoyed reading about it.
Why this blog?
As some of you might know, I also work on OpenDUNE. I regularly write blog posts there talking about OpenDUNE, how the game works, and what we experienced. It is kinda nice, and today I was in the mood to do the same for OpenTTD.