OpenTTD optimizations

Forum for technical discussions regarding development. If you have a general suggestion, problem or comment, please use one of the other forums.

Moderator: OpenTTD Developers

SmatZ
OpenTTD Developer
OpenTTD Developer
Posts: 351
Joined: 03 Oct 2006 18:26
Location: Prague, Czech Republic
Contact:

OpenTTD optimizations

Post by SmatZ »

Hello there :-)

after some time looking at OTTD code, I must have one question:

How much is preferred fast code?

eg., after changing vehicle.h *GetVehicle function to:
return (Vehicle*)(_vehicle_pool.blocks[index >> VEHICLES_POOL_BLOCK_SIZE_BITS] + (index & ((1 << VEHICLES_POOL_BLOCK_SIZE_BITS) - 1)) * sizeof(Vehicle));

(eg., inlining GetItemFromPool and changing variables to known constants) speeds up my test game by 4% (runtime of one month shorts from 1:46 to 1:42 - there are many long trains - around 20 000 vehicles in total)

I am sure there are more ways to increase performance - maybe by dividing VehiclePool to 5 different pools, by better handling of pool for faster Adds to pool (duplicating my 1000th train takes like half a second) - this would need a bit deeper digging into code, but just storing the offset of first empty slot could help ; maybe by templating the whole pool to make possible using constants (given by enum or #define, not those const int)

Is there any reason not to do that? Worse reability of code maybe?

Thanks for reply!
User avatar
Darkvater
Tycoon
Tycoon
Posts: 3053
Joined: 24 Feb 2003 18:45
Location: Hong Kong

Post by Darkvater »

I highly doubt your results. Firstly, look at the current code for GetItemFromPool().

Code: Select all

static inline byte *GetItemFromPool(const MemoryPool *pool, uint index)
{
	assert(index < pool->total_items);
	return (pool->blocks[index >> pool->block_size_bits] + (index & ((1 << pool->block_size_bits) - 1)) * pool->item_size);
}
The function is already inlined and the when openttd is built with full optimizations the chance that it is actually getting inlined is about 99.9% (depending on how stupid your compiler is). So as long as you don't show me the assembly output proving the function wasn't inlined...I can only take your results with a grain of salt.
TrueLight: "Did you bother to read any of the replies, or you just pressed 'Reply' and started typing?"
<@[R-Dk]FoRbiDDeN> "HELP, this litte arrow thing keeps following my mouse, and I can't make it go away."
SmatZ
OpenTTD Developer
OpenTTD Developer
Posts: 351
Joined: 03 Oct 2006 18:26
Location: Prague, Czech Republic
Contact:

Post by SmatZ »

Darkvater wrote: The function is already inlined and the when openttd is built with full optimizations the chance that it is actually getting inlined is about 99.9% (depending on how stupid your compiler is). So as long as you don't show me the assembly output proving the function wasn't inlined...I can only take your results with a grain of salt.
Yes, the function gets inlined. But the problem there is (after looking at assembled code) that there are more loads from memory and more computations in standard code:
load pool->block_size_bits, shifting by it twice (x86 architecture allows shifts by immediate value, but here must be used one register to store shift count)
load pool->item_size and multiplying by it (there can be multiple by constant or, better, shifts instead of multiply if the compiler know the size, because it is constant)
There are more free registers, so more things can be loaded only once and stored in register.


The 4% difference is more than I expected, but I doublechecked it.
First run, normal version - 1:46.5, second 1:44.9
Version with constants - 1:42.9 and 1:40.9

I am not used to linux debugging, but I know OllyDBG in Windows and there I could see how values are computed. (but I dont know how to disassemble my linux compiled programs)

edit 1:
I use the same Makefile.Config and openttd.cfg for both versions

edit 2:
Disassembled openttd.exe r6616, some inlined GetVehicle:

Code: Select all

lea edx,[ebx+1] ; ebx=index-1 (prev. iteration) -> edx=index
mov ecx,[0x507c50] ; ecx=pool->block_size_bits
movzx edx,dx ; erm... index is 16bit, isn't it?
mov eax,1 ; 1 to be shifted
shl eax,cl ; shift left by block_size_bits
mov ebx,edx
shr ebx,cl ; shift right by block_size_bits
lea ecx,[eax-1] ; 1<<bsb - 1
and ecx,edx ; (index & (1<<bsb - 1))
imul ecx,[0x507c54] ; * pool->item_size
mov eax,[0x507c68] ; pool->blocks
add ecx,[eax+ebx*4] ; finally, ecx=offset to Vehicle structure in memory..
Considering this happens for every iteration of FOR_ALL_VEHICLES, it takes a lot of time overhead.
Another good idea would be not to use indexes, but to use Vehicle* just in the main loop with taking care of end_of_block on my own - the compiler doesn't optimise this, and the IMUL takes a lot of time too in every iteration. We would jump the pointer by sizeof(Vehicle)... like
if (!end_of_this_block) veh+=1;

Are there any arguments against dividing VehiclePool into pools for road/rail/ship/air/special?
There are often loops looking only for vehicles of special type. So these loops could be made much faster.

The idea of faster searching for free slots is in some structure of pointers to vehicles. There could only one index be moved (the last index) onto the place of removed vehicle. Newly added vehicle would be added to the end of the list. Memory requirements are not heavy, sizeof(pointer)*(1<<block_size_bits) for every block. This would make indexing of Vehicles faster, too, because one would need not to use that (see above) code to compute offset of the Vehicle structure in memory.

This depends, however, how often removing/adding of standard vehicles happens. Maybe it would make code slower, too, and direct indexing of Vehicles array would be faster.

If I am mistaken, correct me please!
Last edited by SmatZ on 15 Oct 2006 14:18, edited 1 time in total.
TrueBrain
OpenTTD Developer
OpenTTD Developer
Posts: 1370
Joined: 31 May 2004 09:21

Post by TrueBrain »

B. N. SmatZ! wrote:(..)

Another good idea would be not to use indexes, but to use Vehicle* just in the main loop with taking care of end_of_block on my own - the compiler doesn't optimise this, and the IMUL takes a lot of time too in every iteration. We would jump the pointer by sizeof(Vehicle)... like
if (!end_of_this_block) veh+=1;

(..)
Lol :) You want to put all vehicles in one continues block? Then you can better revert to a version pre-mempools ;) Hehe :)
SmatZ
OpenTTD Developer
OpenTTD Developer
Posts: 351
Joined: 03 Oct 2006 18:26
Location: Prague, Czech Republic
Contact:

Post by SmatZ »

TrueLight wrote: Lol :) You want to put all vehicles in one continues block? Then you can better revert to a version pre-mempools ;) Hehe :)
oh nono :) Sorry if I wrote it nonundestable. There is no problem with memory blocks, but the problem is its rather slow addressing.

The FOR_ALL_VEHICLES_FROM macro could be changed to:
Vehicle *v,**w;
for (w = _vehicle_pool.blocks[start>>VEHICLE_POOL_SIZE_BITS], w=start&(1<<VPSB-1); w < _vp.no_of_blocks; if ( ++v > (*w)+(1<<VPSB)) v=*(++w); : NULL) if (IsValidVehicle(v))

... it is a rough code, but I hope you understand what I mean :-)
SmatZ
OpenTTD Developer
OpenTTD Developer
Posts: 351
Joined: 03 Oct 2006 18:26
Location: Prague, Czech Republic
Contact:

Post by SmatZ »

I did another test, with changed just one loop: vehicles.c of 0.4.8, proc CallVehicleTicks

Code: Select all

	int weh = 0; Vehicle *konec = _vehicle_pool.blocks[0]+(1<<VEHICLES_POOL_BLOCK_SIZE_BITS)*sizeof(Vehicle);
	Vehicle *v;

	_first_veh_in_depot_list = NULL;	// now we are sure it's initialized at the start of each tick

 	for (v = GetVehicle(0); v != NULL; )
//FOR_ALL_VEHICLES(v)
	{
		if (v->type != 0) {
			_vehicle_tick_procs[v->type - 0x10](v);
		}
 	if( (++v) == konec ) v = ( ++weh >= _vehicle_pool.current_blocks ? NULL : ( konec = _vehicle_pool.blocks[weh] + ( weh == _vehicle_pool.current_blocks - 1 ? 1 + (_vehicle_pool.total_items & ((1<<VEHICLES_POOL_BLOCK_SIZE_BITS)-1)) : (1<<VEHICLES_POOL_BLOCK_SIZE_BITS) ) * sizeof(Vehicle),_vehicle_pool.blocks[weh] ) ) ;
	}
Time 1:39.3, another 1% faster, 5% in total. If FOR_ALL_VEHICLES macro could be changed in general, there could be more imrovements...

If you like the OOP, you could create some struct/class like VehicleFinder...

Is it worth the work?
User avatar
prissi
Chief Executive
Chief Executive
Posts: 648
Joined: 15 Nov 2004 19:46
Location: Berlin, Germany
Contact:

Post by prissi »

The 5% faster will be very processor dependent. Moreover, it will be even compiler dependent. OOPs will not help there, especially with different vehicle types there would be virtual functions which involve a virtual function table, which is not too well on the current mostly used long pipelined processors.

OOP could help in making better readable code, though.

Moreover, for every display step this loop is called once. Therefore, the better place for optimisation is imho the tile display loop itself. Many tiles calculate the sprite they display on the fly, even the ground tiles do so. Either using static tables (especially for trees) or even better caching two sprite numbers per tiles instead recalculation might be a better way to speed up the loop.

Otherwise the next compiler revision might slow down your explicitely written out version, since the compiler finds some other clever way to handle registers or can use SSE registers (ok, not likely in this case, but just an example) or just reorders instruction so they better fit into the pipeline.

If the display loop goes up 10% on most machines it might be worthwile.
SmatZ
OpenTTD Developer
OpenTTD Developer
Posts: 351
Joined: 03 Oct 2006 18:26
Location: Prague, Czech Republic
Contact:

Post by SmatZ »

prissi wrote:The 5% faster will be very processor dependent. Moreover, it will be even compiler dependent.
Why do you think it will be processor dependend? Removing of overhead caused by unneeded calculations of memory addesses and - forcing numeric constants there, where are now variables - reduces number of instruction executed. Yes, it is a bit compiler dependend (but only the first thing I mentioned, the latter cannot be), but if compiler is intelligent enought to optimize current code, the new code will be optimized for sure too.
prissi wrote: OOPs will not help there, especially with different vehicle types there would be virtual functions which involve a virtual function table, which is not too well on the current mostly used long pipelined processors. OOP could help in making better readable code, though.
Yes, I agree. I don't like misusing of OOP, too. It helps reducing developer time, but often makes code more complex for optimizer and slower.
prissi wrote: Moreover, for every display step this loop is called once. Therefore, the better place for optimisation is imho the tile display loop itself. Many tiles calculate the sprite they display on the fly, even the ground tiles do so. Either using static tables (especially for trees) or even better caching two sprite numbers per tiles instead recalculation might be a better way to speed up the loop.
A good idea, I will have a look at it!
prissi wrote: Otherwise the next compiler revision might slow down your explicitely written out version, since the compiler finds some other clever way to handle registers or can use SSE registers (ok, not likely in this case, but just an example) or just reorders instruction so they better fit into the pipeline.

If the display loop goes up 10% on most machines it might be worthwile.
Yes, there should be some limit how much machine-optimized code could be. For example, using 64bit variables makes code slower on 32bit machines.

Maybe I am a bit different, but for me, every % counts - and when it is as easy as implicitly rewriting two lines of code in vehicles.h, why not try it? Even when it comes at cost of skipping calls to pool.h, because they are rewritten for Vehicles given its constants...
SmatZ
OpenTTD Developer
OpenTTD Developer
Posts: 351
Joined: 03 Oct 2006 18:26
Location: Prague, Czech Republic
Contact:

Post by SmatZ »

Ok, I did another little modification, and the code is now ~ 10% faster against original (1:10,3 vs 1:17,4). Now it should work almost as fast as without mempools.

It is really ugly, but it is just for testing if this style is effective (and gives a lot of warnings for non-compliance, too)

Code: Select all

#define FOR_ALL_VEHICLES_FROM(v, start)  Vehicle * konec; int weh; for (v = GetVehicle(start),weh=start >> VEHICLES_POOL_BLOCK_SIZE_BITS,konec=_vehicle_pool.blocks[weh]+(( weh == _vehicle_pool.current_blocks - 1 ) ?( 1 + (_vehicle_pool.total_items & ((1<<VEHICLES_POOL_BLOCK_SIZE_BITS)-1))) : (1<<VEHICLES_POOL_BLOCK_SIZE_BITS) ) * sizeof(Vehicle); v != NULL; ((++v) == konec) ? v = ( ++weh >= _vehicle_pool.current_blocks ? NULL : ( konec = _vehicle_pool.blocks[weh] + (( weh == _vehicle_pool.current_blocks - 1 ) ?( 1 + (_vehicle_pool.total_items & ((1<<VEHICLES_POOL_BLOCK_SIZE_BITS)-1))) : (1<<VEHICLES_POOL_BLOCK_SIZE_BITS) ) * sizeof(Vehicle),_vehicle_pool.blocks[weh] ) ) : v )
As we were talking about mempools, what are its advantages? Are there any architectures that dont support virtual memory and paging, what is OTTD built at?

edit:
This code would need some more changes, because as it is now, there are some debug messages. Some procedures changes Vehicle *v inside the for loop. I will try to find them and correct them.

edit:
The problem was in AfterLoadVehicles. The second FOR_ALL_VEHICLES must use old style loop. Another probably problematic could be OldLoader.

I am just showing how things can be changed - I am not trying to do any major changes to code as I am too new here :-)
CobraA1
Route Supervisor
Route Supervisor
Posts: 480
Joined: 07 Nov 2003 17:52
Location: USA

Post by CobraA1 »

The 5% faster will be very processor dependent. Moreover, it will be even compiler dependent. OOPs will not help there, especially with different vehicle types there would be virtual functions which involve a virtual function table, which is not too well on the current mostly used long pipelined processors.
Most current processors are also overkill for OpenTTD, as it'll run on processors around (and even below) the 100 MHz mark. If you wish to optimize for something, optimize for mobile processors and other lower-end processors. Don't bother optimizing for the 2-4 GHz processors, as they can already handle OpenTTD quite well.
"If a man does not keep pace with his companions, perhaps it is because he hears a different drummer. Let him step to the music he hears, however measured or far away" --Henry David Thoreau
DaleStan
TTDPatch Developer
TTDPatch Developer
Posts: 10285
Joined: 18 Feb 2004 03:06
Contact:

Post by DaleStan »

Talk to the people on #openttd-coop about that one; I think you'll find that they disagree.
To get a good answer, ask a Smart Question. Similarly, if you want a bug fixed, write a Useful Bug Report. No TTDPatch crashlog? Then follow directions.
Projects: NFORenum (download) | PlaneSet (Website) | grfcodec (download) | grfdebug.log parser
User avatar
XeryusTC
Tycoon
Tycoon
Posts: 15415
Joined: 02 May 2005 11:05
Skype: XeryusTC
Location: localhost

Post by XeryusTC »

DaleStan wrote:Talk to the people on #openttd-coop about that one; I think you'll find that they disagree.
Indeed I/we do, the network code of OpenTTD uses too much CPU. It makes games with more than 400 trains quite unplayable, while my PC can handle up to 800-900 trains without a problem. And our server can only handle up to 400 trains too before the game starts eating too much CPU and other processes can't do a thing.
On this note, it would be very usefull for us to optimise pathfinding code and/or network code as those are the things that use most of our CPU time.
Don't panic - My YouTube channel - Follow me on twitter (@XeryusTC) - Play Tribes: Ascend - Tired of Dropbox? Try SpiderOak (use this link and we both get 1GB extra space)
Image
OpenTTD: manual #openttdcoop: blog | wiki | public server | NewGRF pack | DevZone
Image Image Image Image Image Image Image
KUDr
OpenTTD Developer
OpenTTD Developer
Posts: 219
Joined: 11 Jan 2006 21:36
Location: Czech Republic

Post by KUDr »

I don't think that optimizing PF can help any more. Just now it is not PF code what makes your CPU load (unless you use ships heavily).
User avatar
RiTi
Transport Coordinator
Transport Coordinator
Posts: 374
Joined: 23 Jun 2006 10:24

Post by RiTi »

KUDr wrote:I don't think that optimizing PF can help any more. Just now it is not PF code what makes your CPU load (unless you use ships heavily).
Just curious, what is in general the problem that makes heavy CPU load resulting in slow game play?
Keep life simple...
User avatar
Brianetta
Tycoon
Tycoon
Posts: 2567
Joined: 15 Oct 2003 22:00
Location: Jarrow, UK
Contact:

Post by Brianetta »

The game loaded the CPU on my server very heavily even when using NTP. I believe the drawing routines have a lot to answer for, since reducing the dedicated server's screen resolution helps a lot.

YAPF is good enough that it's useable, whereas NPF was not. One of the problems we used to have on the #openttdcoop sandbox server was that players would sometimes join and do nothing but clone trains, attempting to achieve some sort of record number of trains. We've managed to discourage this behaviour now. (-:

The current sandbox uses YAPF (without any ships) and has been seen with hundreds of trains. The CPU use by the openttd process on my server's P4 1800 usually hovers at around 20-40%. It rises noticably when ten people are connected.

So, it's better than it was, but it's not perfect. OpenTTD is a much more complex game than its 1990s appearance suggests.
PGP fingerprint: E66A 9D58 AA10 E967 41A6 474E E41D 10AE 082C F3ED
KUDr
OpenTTD Developer
OpenTTD Developer
Posts: 219
Joined: 11 Jan 2006 21:36
Location: Czech Republic

Post by KUDr »

RiTi:

If you want to see what causes CPU load, here is output from profiler (SP, YAPF, 120 trains, fast forward):

Code: Select all

Function                        %Incl.  %Excl.
-----------------------------------------------
@GameLoop@0                     64.305  0.006
@StateGameLoop@0                62.627  0.075
@Train_Tick@4                   42.918  2.634
TrainLocoHandler                40.284  1.308
@UpdateWindows@0                21.506  0.000
@DrawDirtyBlocks@0              21.418  0.390
@CallWindowEventNP@8            21.110  0.025
@RedrawScreenRect@16            21.091  0.019
@DrawOverlappedWindowForAll@16  21.010  0.000
DrawOverlappedWindow            21.010  0.069
TrainController                 19.495  1.408
GfxMainBlitter                  13.629  1.333
@DoDrawString@16                11.014  0.415
PlayerFinancesWndProc           10.731  0.038
UpdateTrainSpeed                10.134  0.446
GetTrainAcceleration             9.688  9.618
GfxBlitZoomInUncomp              8.946  8.946
@CallVehicleTicks@0              8.889  7.588
DrawPlayerEconomyStats           8.229  0.044
AfterSetTrainPos                 7.600  0.515
Proxy                            7.204  0.000
_deflateEnd@4                    7.192  6.877
WriteZlibLoop                    7.192  0.000
@RunTileLoop@0                   7.129  1.440
@DrawWindowViewport@4            6.620  0.019
ViewportDrawChk                  6.601  0.000
@ViewportDoDraw@20               6.601  0.082
NewsWindowProc                   6.287  0.019
@VehicleFromPos@12               5.463  2.754
@DrawString@16                   5.450  0.013
CheckTrainCollision              5.017  0.132
_WinMainCRTStartup               4.331  0.000
TrainCheckIfLineEnds             4.080  1.987
@GfxFillRect@20                  3.904  1.999
MainWindowWndProc                3.621  0.019
@TileLoopClearHelper@4           3.539  3.539
@DrawSprite@12                   3.489  0.075
GfxBlitTileZoomIn                3.351  3.351
@DrawStringCentered@16           3.338  0.013
ChooseTrainTrack                 3.300  0.019
@GetSlopeZ@8                     3.206  0.453
@DrawWindowWidgets@4             3.068  0.069
@CallLandscapeTick@0             2.678  0.006
@DrawStringRightAligned@16       2.533  0.019
@VehiclePositionChanged@4        2.521  1.697
TileLoop_Clear                   2.477  0.283
GetSlopeZ_Track                  2.433  0.597
ViewportAddLandscape             2.401  0.220
_memset                          2.332  2.332
@OnTick_Station@0                2.169  0.314
StatusBarWndProc                 2.169  0.057
@GetTileSlope@8                  2.081  2.081
FindTrainCollideEnum             2.005  2.005
GetVehicle                       1.930  1.930
I would like to try it again in multiplayer to see the network influence.
But it will need some server with like 500+ trains.

XeryusTC: as you can see, the pathfinder takes 3.3% while whole TrainLocoHandler() takes 40%.
DaleStan
TTDPatch Developer
TTDPatch Developer
Posts: 10285
Joined: 18 Feb 2004 03:06
Contact:

Post by DaleStan »

Well, in that case, why not just optimize GameLoop, since that function takes nearly two-thirds of the processor time?
To get a good answer, ask a Smart Question. Similarly, if you want a bug fixed, write a Useful Bug Report. No TTDPatch crashlog? Then follow directions.
Projects: NFORenum (download) | PlaneSet (Website) | grfcodec (download) | grfdebug.log parser
KUDr
OpenTTD Developer
OpenTTD Developer
Posts: 219
Joined: 11 Jan 2006 21:36
Location: Czech Republic

Post by KUDr »

DaleStan: kidding? GameLoop() calls everything else you see there. All the time is spent there and it is of course in FF mode. This function itself takes only 0.006% CPU.

Probably i should explain how profiling works. But sorry. Explanation of those two columns must be enough for now:

%Incl. = function itself plus all other functions called from it
%Excl. = function itself only
TrueBrain
OpenTTD Developer
OpenTTD Developer
Posts: 1370
Joined: 31 May 2004 09:21

Post by TrueBrain »

KUDr: bad profiling, autosave seems enabled. A lot of time is 'wasted' there, which in fact doesn't contribute in any way to the profiling itself. So please disable it, rerun, and repost ;)

Ps: you also might want to sort on 'Excl', as 'Incl' doesn't really give much information.
DaleStan
TTDPatch Developer
TTDPatch Developer
Posts: 10285
Joined: 18 Feb 2004 03:06
Contact:

Post by DaleStan »

Yes, KUDr. I was being sarcastic.

I'm well trained in the theory of "Lies, damn lies, and statistics". Use of a column of percentages that adds up to somewhere around 600% makes me apply that training in spades, whether appropriate or not.
To get a good answer, ask a Smart Question. Similarly, if you want a bug fixed, write a Useful Bug Report. No TTDPatch crashlog? Then follow directions.
Projects: NFORenum (download) | PlaneSet (Website) | grfcodec (download) | grfdebug.log parser
Post Reply

Return to “OpenTTD Development”

Who is online

Users browsing this forum: Semrush [Bot] and 2 guests