[PATCH] Several optimizations and performance test tool
Moderator: OpenTTD Developers
-
- Engineer
- Posts: 3
- Joined: 22 Aug 2010 22:04
[PATCH] Several optimizations and performance test tool
It is my first post here, so.. hello everyone !
I have attached 3 things to this post:
1. Patch with some low-level optimizations, especially for linux/gcc configuration.
2. Patch that introduces a cache with station locations to speed up expensive FindStationsAroundTiles function.
3. A tool that helps in performance testing and that can be employed for automated tests.
I had some problems while attaching patch 3, so i've uploaded it here:
students.mimuw.edu.pl/~kn248436/opos/tester.tgz
Below you will find some a bit more detailed descriptions of them.
I have made some performance tests to show that the optimizations work.
The test cases were three savegames and the command I launched was:
openttd -x -v null:ticks=10000 -s null -m null -g [SavegameName]
and then I measured cpu time using time command.
The results were (in seconds, averaged):
1024x1024 map with huge cities and lots of trains:
none: 113
mod1: 108
mod2: 100
both: 97
2048x2048 empty map with medium number of medium-sized cities:
none: 47
mod1: 46
mod2: 44
both: 43
2048x2048 empty map with lots of large cities:
none: 100
mod1: 94
mod2: 74
both: 73
My system configuration is:
Linux 2.6.33-ARCH #1 SMP PREEMPT Sun Apr 4 10:27:30 CEST 2010 x86_64 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux
4GB RAM
All three savegames are included in the test tool (attachment 3). They are 1024x1024-huge.sav, 2048x2048-notPopulated.sav, 2048x2048-populated.sav respectively.
Here you can see the reports generated with attached test tool:
http://students.mimuw.edu.pl/~kn248436/ ... Report.pdf
http://students.mimuw.edu.pl/~kn248436/ ... Report.txt
Now a bit more detailed description of the patches.
1.
I have introduced some gcc builtin functions: __builtin_ffs, __builtin_popcount, __builtin_clz and used them to improve some functions in core/bitmath_func.hpp
I have also realized (using callgrind) that some of the functions called really often but being very short (like GetTileType) were not inlined. So I introduced a macro, OTTD_ALWAYS_INLINE that expands under gcc to
__attribute__((always_inline)) so as to raise the probability of inlining for some functions. I have also moved some functions to headers from .cpps.
Although new functions became inlined after my modifications the size of the binary has decreased on my system. Before modifications it has 5766276 bytes and after them - 5759997.
2.
I have realized that one of the most expensive functions in code being often called is FindStationsAroundTiles. It did a bit weird thing: it scanned the whole area of - for example - 10x10 tiles (or more), one by one, to get all stations there.
I have introduced some kind of a cache for stations positions so as to improve it.
At first I divide the whole map on parts sized by default (can be changed, but this seems to be the best choice for my computer) 2^6 x 2^6 tiles. For each part I store a list of TileAreas related to stations located there.
So when I need to scan an area for stations I take all map parts intersecting with this area, take all stations stored in list related to the part and check if they are located in requested area.
Despite this solution is not very sophisticated, it works very well, especially on maps without many stations (the structures with staiton areas are empty so the process is trivial), however on very crowded maps it also gives a significant performance gain without significant memory overhead. What's more, high setting of station catchment radius does not significantly affect the game speed.
The code that updates (when adding/removing/deleting station) the cache may look a bit primitive and like it could be done much faster - but the updating happens so rarely that I consider it good enough. Only lookup time seems for me to be performance-critical.
3.
I have created a tool that may be helpful when running performance tests and can be good for automated regression tests for example. See README inside for more details. I hope you will find it useful .
I have attached 3 things to this post:
1. Patch with some low-level optimizations, especially for linux/gcc configuration.
2. Patch that introduces a cache with station locations to speed up expensive FindStationsAroundTiles function.
3. A tool that helps in performance testing and that can be employed for automated tests.
I had some problems while attaching patch 3, so i've uploaded it here:
students.mimuw.edu.pl/~kn248436/opos/tester.tgz
Below you will find some a bit more detailed descriptions of them.
I have made some performance tests to show that the optimizations work.
The test cases were three savegames and the command I launched was:
openttd -x -v null:ticks=10000 -s null -m null -g [SavegameName]
and then I measured cpu time using time command.
The results were (in seconds, averaged):
1024x1024 map with huge cities and lots of trains:
none: 113
mod1: 108
mod2: 100
both: 97
2048x2048 empty map with medium number of medium-sized cities:
none: 47
mod1: 46
mod2: 44
both: 43
2048x2048 empty map with lots of large cities:
none: 100
mod1: 94
mod2: 74
both: 73
My system configuration is:
Linux 2.6.33-ARCH #1 SMP PREEMPT Sun Apr 4 10:27:30 CEST 2010 x86_64 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux
4GB RAM
All three savegames are included in the test tool (attachment 3). They are 1024x1024-huge.sav, 2048x2048-notPopulated.sav, 2048x2048-populated.sav respectively.
Here you can see the reports generated with attached test tool:
http://students.mimuw.edu.pl/~kn248436/ ... Report.pdf
http://students.mimuw.edu.pl/~kn248436/ ... Report.txt
Now a bit more detailed description of the patches.
1.
I have introduced some gcc builtin functions: __builtin_ffs, __builtin_popcount, __builtin_clz and used them to improve some functions in core/bitmath_func.hpp
I have also realized (using callgrind) that some of the functions called really often but being very short (like GetTileType) were not inlined. So I introduced a macro, OTTD_ALWAYS_INLINE that expands under gcc to
__attribute__((always_inline)) so as to raise the probability of inlining for some functions. I have also moved some functions to headers from .cpps.
Although new functions became inlined after my modifications the size of the binary has decreased on my system. Before modifications it has 5766276 bytes and after them - 5759997.
2.
I have realized that one of the most expensive functions in code being often called is FindStationsAroundTiles. It did a bit weird thing: it scanned the whole area of - for example - 10x10 tiles (or more), one by one, to get all stations there.
I have introduced some kind of a cache for stations positions so as to improve it.
At first I divide the whole map on parts sized by default (can be changed, but this seems to be the best choice for my computer) 2^6 x 2^6 tiles. For each part I store a list of TileAreas related to stations located there.
So when I need to scan an area for stations I take all map parts intersecting with this area, take all stations stored in list related to the part and check if they are located in requested area.
Despite this solution is not very sophisticated, it works very well, especially on maps without many stations (the structures with staiton areas are empty so the process is trivial), however on very crowded maps it also gives a significant performance gain without significant memory overhead. What's more, high setting of station catchment radius does not significantly affect the game speed.
The code that updates (when adding/removing/deleting station) the cache may look a bit primitive and like it could be done much faster - but the updating happens so rarely that I consider it good enough. Only lookup time seems for me to be performance-critical.
3.
I have created a tool that may be helpful when running performance tests and can be good for automated regression tests for example. See README inside for more details. I hope you will find it useful .
- Attachments
-
- ll.patch
- patch 1
- (15.34 KiB) Downloaded 218 times
-
- stationcache.patch
- patch 2
- (20.77 KiB) Downloaded 229 times
-
- OpenTTD Developer
- Posts: 351
- Joined: 03 Oct 2006 18:26
- Location: Prague, Czech Republic
- Contact:
Re: [PATCH] Several optimizations and performance test tool
Thank you for your effort! Those numbers are really impressive. That reminds me of two patches I have/had:
1) #define FORCEINLINE inline __attribute__((always_inline))
(similiar to your first patch, but it would need review of all header files, and applying FORCEINLINE where appropriate)
that patch is really outdated nowadays...
2) cache stations near given industry/town - similiar to your second patch, it would be interesting to compare those two patches. However, I lost that patch somewhere...
1) #define FORCEINLINE inline __attribute__((always_inline))
(similiar to your first patch, but it would need review of all header files, and applying FORCEINLINE where appropriate)
that patch is really outdated nowadays...
2) cache stations near given industry/town - similiar to your second patch, it would be interesting to compare those two patches. However, I lost that patch somewhere...
Re: [PATCH] Several optimizations and performance test tool
i think you're on the right track here. congratulations
Re: [PATCH] Several optimizations and performance test tool
Part 1 reminds me of FS#2266 (which was closed as "won't implement"). For that I determined that __builtin_ffs (if it is using the bitscan instructions) is only faster for newer Intel CPUs. AMD and older CPUs don't profit. Is __builtin_popcount using a generic implementation or is it on x86 just an intrinsic for the POPCNT instruction? I couldn't find clear docs on this. If yes, it will fail on older CPUs.
-- Michael Lutz
-- Michael Lutz
-
- Engineer
- Posts: 3
- Joined: 22 Aug 2010 22:04
Re: [PATCH] Several optimizations and performance test tool
Unfortunately I do not have any strong arguments (documentation) that gcc builtins have generic implementations. When I have a bit more time I can search for that (or look into assembly code). But I can say that I would be really, really surprised if something went wrong here... The only doc I have used was: http://gcc.gnu.org/onlinedocs/gcc-4.4.4 ... r-BuiltinsMichi_cc wrote:Part 1 reminds me of FS#2266 (which was closed as "won't implement"). For that I determined that __builtin_ffs (if it is using the bitscan instructions) is only faster for newer Intel CPUs. AMD and older CPUs don't profit. Is __builtin_popcount using a generic implementation or is it on x86 just an intrinsic for the POPCNT instruction? I couldn't find clear docs on this. If yes, it will fail on older CPUs.
-- Michael Lutz
And about generic implementations and failing on older cpus...: you already use builtin bswap32. Has anybody complained about it?
You write that __builtin_ffs is only faster for newer Intel CPUs. Ok.. but to be sure - is it faster only on that cpus than the implementation in core/bitmath_func?
Now let's take __builtin_popcount... And let's assume that there exist the generic implementations of the builtins. I have used it in one of the functions in core/bitmath_func because I'm sure that the function there was not the fastest possible implementation of popcount (just simple googling brings good examples of some magical implementations of it using max ~5 instructions - the function present in openttd was much longer). But instead of searching for these magical solutions and checking which is really fast (and correct), I have decided to rely on gcc developers - rather small probability of buggy implementation and large probability of getting one of the fastest implementation of this function possible (for specific architecture).
-
- Engineer
- Posts: 3
- Joined: 22 Aug 2010 22:04
Re: [PATCH] Several optimizations and performance test tool
How did you cache that? Did you cache nearby stations for every building in town? I have thought about keeping lists of nearby stations for all buildings and industries, but I dropped it and coded the present solution, because it might cost a bit more memory.SmatZ wrote:Thank you for your effort! Those numbers are really impressive. That reminds me of two patches I have/had:
2) cache stations near given industry/town - similiar to your second patch, it would be interesting to compare those two patches. However, I lost that patch somewhere...
Re: [PATCH] Several optimizations and performance test tool
No, and as this instruction is available from the 486 onwards, I don't expect that to happen in the future either.kamil.nowosad wrote:And about generic implementations and failing on older cpus...: you already use builtin bswap32. Has anybody complained about it?
When I compared it back then on an AMD system, no difference was measurable. This was using MSVC and not GCC though, so GCC might compile the current function worse and the builtin better. This is a general problem with these kinds of micro-optimization: The results are highly dependent on compiler version and CPU used. While something might give a huge boost with a specific compiler and CPU, a different compiler version and CPU might have the opposite result.kamil.nowosad wrote:You write that __builtin_ffs is only faster for newer Intel CPUs. Ok.. but to be sure - is it faster only on that cpus than the implementation in core/bitmath_func?
Note that all this only applies to patch 1, algorithmic improvements like patch 2 are much more useful because they help regardless of compiler and CPU.
-- Michael Lutz
-
- OpenTTD Developer
- Posts: 351
- Joined: 03 Oct 2006 18:26
- Location: Prague, Czech Republic
- Contact:
Re: [PATCH] Several optimizations and performance test tool
I did a simple test. With -msse4, popcnt instruction is generated. Without it, even at -O3 "call __popcountdi2" is generated.kamil.nowosad wrote: Now let's take __builtin_popcount... And let's assume that there exist the generic implementations of the builtins.
In the case of __builtin_bswap32, for -march=i386, bytes are swapped using rotations. For other architectures, bswap is generated.
Cache for every industry and town. eg. Industry::stations_near and Town::stations_near. One disadvantage was it had to be recomputed when a house was built/destroyed.kamil.nowosad wrote:How did you cache that? Did you cache nearby stations for every building in town?
Now I think I remember - the performance difference was very small in my testcases, and not worth the added code. However, so there is little improvement with your patch for my testcase... I hope I find that patch and test it with your testcases
I usually test in games with many vehicles, so GetClosestStation() is not a bottleneck.
-
- OpenTTD Developer
- Posts: 351
- Joined: 03 Oct 2006 18:26
- Location: Prague, Czech Republic
- Contact:
Re: [PATCH] Several optimizations and performance test tool
Your patch #2, StationCache, looks fine in theory, but it fails in real-life games, where performance is important.
For example, http://www.openttdcoop.org/files/public ... _Final.sav :
Median "user" time for 3 runs:
trunk: 3m36.150s
sc: 3m37.620s
There your patch makes the performance actually worse.
Don't forget to configure with --disable-assert --disable-ai (else the AI will cause random differences in measured times when it starts).
For example, http://www.openttdcoop.org/files/public ... _Final.sav :
Code: Select all
time taskset -c 3 bin/openttd -g ~/.openttd/save/openttdcoop/PublicServerGame_136_Final.sav -s null -m null -v null:ticks=10000
trunk: 3m36.150s
sc: 3m37.620s
There your patch makes the performance actually worse.
Don't forget to configure with --disable-assert --disable-ai (else the AI will cause random differences in measured times when it starts).
Who is online
Users browsing this forum: Ahrefs [Bot] and 41 guests