[PATCH] Several optimizations and performance test tool

Forum for technical discussions regarding development. If you have a general suggestion, problem or comment, please use one of the other forums.

Moderator: OpenTTD Developers

Post Reply
kamil.nowosad
Engineer
Engineer
Posts: 3
Joined: 22 Aug 2010 22:04

[PATCH] Several optimizations and performance test tool

Post by kamil.nowosad »

It is my first post here, so.. hello everyone ! :wink:

I have attached 3 things to this post:

1. Patch with some low-level optimizations, especially for linux/gcc configuration.
2. Patch that introduces a cache with station locations to speed up expensive FindStationsAroundTiles function.
3. A tool that helps in performance testing and that can be employed for automated tests.

I had some problems while attaching patch 3, so i've uploaded it here:
students.mimuw.edu.pl/~kn248436/opos/tester.tgz

Below you will find some a bit more detailed descriptions of them.

I have made some performance tests to show that the optimizations work.

The test cases were three savegames and the command I launched was:
openttd -x -v null:ticks=10000 -s null -m null -g [SavegameName]

and then I measured cpu time using time command.

The results were (in seconds, averaged):

1024x1024 map with huge cities and lots of trains:
none: 113
mod1: 108
mod2: 100
both: 97

2048x2048 empty map with medium number of medium-sized cities:
none: 47
mod1: 46
mod2: 44
both: 43

2048x2048 empty map with lots of large cities:
none: 100
mod1: 94
mod2: 74
both: 73

My system configuration is:
Linux 2.6.33-ARCH #1 SMP PREEMPT Sun Apr 4 10:27:30 CEST 2010 x86_64 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux
4GB RAM

All three savegames are included in the test tool (attachment 3). They are 1024x1024-huge.sav, 2048x2048-notPopulated.sav, 2048x2048-populated.sav respectively.

Here you can see the reports generated with attached test tool:

http://students.mimuw.edu.pl/~kn248436/ ... Report.pdf
http://students.mimuw.edu.pl/~kn248436/ ... Report.txt

Now a bit more detailed description of the patches.

1.
I have introduced some gcc builtin functions: __builtin_ffs, __builtin_popcount, __builtin_clz and used them to improve some functions in core/bitmath_func.hpp
I have also realized (using callgrind) that some of the functions called really often but being very short (like GetTileType) were not inlined. So I introduced a macro, OTTD_ALWAYS_INLINE that expands under gcc to
__attribute__((always_inline)) so as to raise the probability of inlining for some functions. I have also moved some functions to headers from .cpps.

Although new functions became inlined after my modifications the size of the binary has decreased on my system. Before modifications it has 5766276 bytes and after them - 5759997.

2.
I have realized that one of the most expensive functions in code being often called is FindStationsAroundTiles. It did a bit weird thing: it scanned the whole area of - for example - 10x10 tiles (or more), one by one, to get all stations there.

I have introduced some kind of a cache for stations positions so as to improve it.

At first I divide the whole map on parts sized by default (can be changed, but this seems to be the best choice for my computer) 2^6 x 2^6 tiles. For each part I store a list of TileAreas related to stations located there.
So when I need to scan an area for stations I take all map parts intersecting with this area, take all stations stored in list related to the part and check if they are located in requested area.

Despite this solution is not very sophisticated, it works very well, especially on maps without many stations (the structures with staiton areas are empty so the process is trivial), however on very crowded maps it also gives a significant performance gain without significant memory overhead. What's more, high setting of station catchment radius does not significantly affect the game speed.

The code that updates (when adding/removing/deleting station) the cache may look a bit primitive and like it could be done much faster - but the updating happens so rarely that I consider it good enough. Only lookup time seems for me to be performance-critical.

3.
I have created a tool that may be helpful when running performance tests and can be good for automated regression tests for example. See README inside for more details. I hope you will find it useful :).
Attachments
ll.patch
patch 1
(15.34 KiB) Downloaded 218 times
stationcache.patch
patch 2
(20.77 KiB) Downloaded 229 times
SmatZ
OpenTTD Developer
OpenTTD Developer
Posts: 351
Joined: 03 Oct 2006 18:26
Location: Prague, Czech Republic
Contact:

Re: [PATCH] Several optimizations and performance test tool

Post by SmatZ »

Thank you for your effort! Those numbers are really impressive. That reminds me of two patches I have/had:

1) #define FORCEINLINE inline __attribute__((always_inline))
(similiar to your first patch, but it would need review of all header files, and applying FORCEINLINE where appropriate)
that patch is really outdated nowadays...

2) cache stations near given industry/town - similiar to your second patch, it would be interesting to compare those two patches. However, I lost that patch somewhere...
Image
Eddi
Tycoon
Tycoon
Posts: 8272
Joined: 17 Jan 2007 00:14

Re: [PATCH] Several optimizations and performance test tool

Post by Eddi »

i think you're on the right track here. congratulations :)
Michi_cc
OpenTTD Developer
OpenTTD Developer
Posts: 619
Joined: 14 Jun 2004 23:27
Location: Berlin, Germany
Contact:

Re: [PATCH] Several optimizations and performance test tool

Post by Michi_cc »

Part 1 reminds me of FS#2266 (which was closed as "won't implement"). For that I determined that __builtin_ffs (if it is using the bitscan instructions) is only faster for newer Intel CPUs. AMD and older CPUs don't profit. Is __builtin_popcount using a generic implementation or is it on x86 just an intrinsic for the POPCNT instruction? I couldn't find clear docs on this. If yes, it will fail on older CPUs.

-- Michael Lutz
kamil.nowosad
Engineer
Engineer
Posts: 3
Joined: 22 Aug 2010 22:04

Re: [PATCH] Several optimizations and performance test tool

Post by kamil.nowosad »

Michi_cc wrote:Part 1 reminds me of FS#2266 (which was closed as "won't implement"). For that I determined that __builtin_ffs (if it is using the bitscan instructions) is only faster for newer Intel CPUs. AMD and older CPUs don't profit. Is __builtin_popcount using a generic implementation or is it on x86 just an intrinsic for the POPCNT instruction? I couldn't find clear docs on this. If yes, it will fail on older CPUs.

-- Michael Lutz
Unfortunately I do not have any strong arguments (documentation) that gcc builtins have generic implementations. When I have a bit more time I can search for that (or look into assembly code). But I can say that I would be really, really surprised if something went wrong here... The only doc I have used was: http://gcc.gnu.org/onlinedocs/gcc-4.4.4 ... r-Builtins

And about generic implementations and failing on older cpus...: you already use builtin bswap32. Has anybody complained about it?

You write that __builtin_ffs is only faster for newer Intel CPUs. Ok.. but to be sure - is it faster only on that cpus than the implementation in core/bitmath_func?

Now let's take __builtin_popcount... And let's assume that there exist the generic implementations of the builtins. I have used it in one of the functions in core/bitmath_func because I'm sure that the function there was not the fastest possible implementation of popcount (just simple googling brings good examples of some magical implementations of it using max ~5 instructions - the function present in openttd was much longer). But instead of searching for these magical solutions and checking which is really fast (and correct), I have decided to rely on gcc developers - rather small probability of buggy implementation and large probability of getting one of the fastest implementation of this function possible (for specific architecture).
kamil.nowosad
Engineer
Engineer
Posts: 3
Joined: 22 Aug 2010 22:04

Re: [PATCH] Several optimizations and performance test tool

Post by kamil.nowosad »

SmatZ wrote:Thank you for your effort! Those numbers are really impressive. That reminds me of two patches I have/had:

2) cache stations near given industry/town - similiar to your second patch, it would be interesting to compare those two patches. However, I lost that patch somewhere...
How did you cache that? Did you cache nearby stations for every building in town? I have thought about keeping lists of nearby stations for all buildings and industries, but I dropped it and coded the present solution, because it might cost a bit more memory.
Michi_cc
OpenTTD Developer
OpenTTD Developer
Posts: 619
Joined: 14 Jun 2004 23:27
Location: Berlin, Germany
Contact:

Re: [PATCH] Several optimizations and performance test tool

Post by Michi_cc »

kamil.nowosad wrote:And about generic implementations and failing on older cpus...: you already use builtin bswap32. Has anybody complained about it?
No, and as this instruction is available from the 486 onwards, I don't expect that to happen in the future either.
kamil.nowosad wrote:You write that __builtin_ffs is only faster for newer Intel CPUs. Ok.. but to be sure - is it faster only on that cpus than the implementation in core/bitmath_func?
When I compared it back then on an AMD system, no difference was measurable. This was using MSVC and not GCC though, so GCC might compile the current function worse and the builtin better. This is a general problem with these kinds of micro-optimization: The results are highly dependent on compiler version and CPU used. While something might give a huge boost with a specific compiler and CPU, a different compiler version and CPU might have the opposite result.

Note that all this only applies to patch 1, algorithmic improvements like patch 2 are much more useful because they help regardless of compiler and CPU.

-- Michael Lutz
SmatZ
OpenTTD Developer
OpenTTD Developer
Posts: 351
Joined: 03 Oct 2006 18:26
Location: Prague, Czech Republic
Contact:

Re: [PATCH] Several optimizations and performance test tool

Post by SmatZ »

kamil.nowosad wrote: Now let's take __builtin_popcount... And let's assume that there exist the generic implementations of the builtins.
I did a simple test. With -msse4, popcnt instruction is generated. Without it, even at -O3 "call __popcountdi2" is generated.
In the case of __builtin_bswap32, for -march=i386, bytes are swapped using rotations. For other architectures, bswap is generated.
kamil.nowosad wrote:How did you cache that? Did you cache nearby stations for every building in town?
Cache for every industry and town. eg. Industry::stations_near and Town::stations_near. One disadvantage was it had to be recomputed when a house was built/destroyed.
Now I think I remember - the performance difference was very small in my testcases, and not worth the added code. However, so there is little improvement with your patch for my testcase... I hope I find that patch and test it with your testcases :)
I usually test in games with many vehicles, so GetClosestStation() is not a bottleneck.
Image
SmatZ
OpenTTD Developer
OpenTTD Developer
Posts: 351
Joined: 03 Oct 2006 18:26
Location: Prague, Czech Republic
Contact:

Re: [PATCH] Several optimizations and performance test tool

Post by SmatZ »

Your patch #2, StationCache, looks fine in theory, but it fails in real-life games, where performance is important.
For example, http://www.openttdcoop.org/files/public ... _Final.sav :

Code: Select all

time taskset -c 3 bin/openttd -g ~/.openttd/save/openttdcoop/PublicServerGame_136_Final.sav -s null -m null -v null:ticks=10000
Median "user" time for 3 runs:
trunk: 3m36.150s
sc: 3m37.620s
There your patch makes the performance actually worse.

Don't forget to configure with --disable-assert --disable-ai (else the AI will cause random differences in measured times when it starts).
Image
Post Reply

Return to “OpenTTD Development”

Who is online

Users browsing this forum: Ahrefs [Bot] and 41 guests