OpenTTD optimizations

SmatZ · Post by **SmatZ** » 15 Oct 2006 13:27

Hello there

after some time looking at OTTD code, I must have one question:

How much is preferred fast code?

eg., after changing vehicle.h *GetVehicle function to:
return (Vehicle*)(_vehicle_pool.blocks[index >> VEHICLES_POOL_BLOCK_SIZE_BITS] + (index & ((1 << VEHICLES_POOL_BLOCK_SIZE_BITS) - 1)) * sizeof(Vehicle));

(eg., inlining GetItemFromPool and changing variables to known constants) speeds up my test game by 4% (runtime of one month shorts from 1:46 to 1:42 - there are many long trains - around 20 000 vehicles in total)

I am sure there are more ways to increase performance - maybe by dividing VehiclePool to 5 different pools, by better handling of pool for faster Adds to pool (duplicating my 1000th train takes like half a second) - this would need a bit deeper digging into code, but just storing the offset of first empty slot could help ; maybe by templating the whole pool to make possible using constants (given by enum or #define, not those const int)

Is there any reason not to do that? Worse reability of code maybe?

Thanks for reply!

Darkvater · Post by **Darkvater** » 15 Oct 2006 13:30

I highly doubt your results. Firstly, look at the current code for GetItemFromPool().

Code: Select all

static inline byte *GetItemFromPool(const MemoryPool *pool, uint index)
{
	assert(index < pool->total_items);
	return (pool->blocks[index >> pool->block_size_bits] + (index & ((1 << pool->block_size_bits) - 1)) * pool->item_size);
}

The function is already inlined and the when openttd is built with full optimizations the chance that it is actually getting inlined is about 99.9% (depending on how stupid your compiler is). So as long as you don't show me the assembly output proving the function wasn't inlined...I can only take your results with a grain of salt.

SmatZ · Post by **SmatZ** » 15 Oct 2006 13:44

Darkvater wrote: The function is already inlined and the when openttd is built with full optimizations the chance that it is actually getting inlined is about 99.9% (depending on how stupid your compiler is). So as long as you don't show me the assembly output proving the function wasn't inlined...I can only take your results with a grain of salt.

Yes, the function gets inlined. But the problem there is (after looking at assembled code) that there are more loads from memory and more computations in standard code:
load pool->block_size_bits, shifting by it twice (x86 architecture allows shifts by immediate value, but here must be used one register to store shift count)
load pool->item_size and multiplying by it (there can be multiple by constant or, better, shifts instead of multiply if the compiler know the size, because it is constant)
There are more free registers, so more things can be loaded only once and stored in register.

The 4% difference is more than I expected, but I doublechecked it.
First run, normal version - 1:46.5, second 1:44.9
Version with constants - 1:42.9 and 1:40.9

I am not used to linux debugging, but I know OllyDBG in Windows and there I could see how values are computed. (but I dont know how to disassemble my linux compiled programs)

edit 1:
I use the same Makefile.Config and openttd.cfg for both versions

edit 2:
Disassembled openttd.exe r6616, some inlined GetVehicle:

Code: Select all

lea edx,[ebx+1] ; ebx=index-1 (prev. iteration) -> edx=index
mov ecx,[0x507c50] ; ecx=pool->block_size_bits
movzx edx,dx ; erm... index is 16bit, isn't it?
mov eax,1 ; 1 to be shifted
shl eax,cl ; shift left by block_size_bits
mov ebx,edx
shr ebx,cl ; shift right by block_size_bits
lea ecx,[eax-1] ; 1<<bsb - 1
and ecx,edx ; (index & (1<<bsb - 1))
imul ecx,[0x507c54] ; * pool->item_size
mov eax,[0x507c68] ; pool->blocks
add ecx,[eax+ebx*4] ; finally, ecx=offset to Vehicle structure in memory..

Considering this happens for every iteration of FOR_ALL_VEHICLES, it takes a lot of time overhead.
Another good idea would be not to use indexes, but to use Vehicle* just in the main loop with taking care of end_of_block on my own - the compiler doesn't optimise this, and the IMUL takes a lot of time too in every iteration. We would jump the pointer by sizeof(Vehicle)... like
if (!end_of_this_block) veh+=1;

Are there any arguments against dividing VehiclePool into pools for road/rail/ship/air/special?
There are often loops looking only for vehicles of special type. So these loops could be made much faster.

The idea of faster searching for free slots is in some structure of pointers to vehicles. There could only one index be moved (the last index) onto the place of removed vehicle. Newly added vehicle would be added to the end of the list. Memory requirements are not heavy, sizeof(pointer)*(1<<block_size_bits) for every block. This would make indexing of Vehicles faster, too, because one would need not to use that (see above) code to compute offset of the Vehicle structure in memory.

This depends, however, how often removing/adding of standard vehicles happens. Maybe it would make code slower, too, and direct indexing of Vehicles array would be faster.

If I am mistaken, correct me please!

Post by **TrueBrain** » 15 Oct 2006 14:15

B. N. SmatZ! wrote:(..)

Another good idea would be not to use indexes, but to use Vehicle* just in the main loop with taking care of end_of_block on my own - the compiler doesn't optimise this, and the IMUL takes a lot of time too in every iteration. We would jump the pointer by sizeof(Vehicle)... like
if (!end_of_this_block) veh+=1;

(..)

Lol

You want to put all vehicles in one continues block? Then you can better revert to a version pre-mempools

Hehe

SmatZ · Post by **SmatZ** » 15 Oct 2006 14:31

TrueLight wrote: Lol You want to put all vehicles in one continues block? Then you can better revert to a version pre-mempools Hehe

oh nono

Sorry if I wrote it nonundestable. There is no problem with memory blocks, but the problem is its rather slow addressing.

The FOR_ALL_VEHICLES_FROM macro could be changed to:
Vehicle *v,**w;
for (w = _vehicle_pool.blocks[start>>VEHICLE_POOL_SIZE_BITS], w=start&(1<<VPSB-1); w < _vp.no_of_blocks; if ( ++v > (*w)+(1<<VPSB)) v=*(++w); : NULL) if (IsValidVehicle(v))

... it is a rough code, but I hope you understand what I mean

SmatZ · Post by **SmatZ** » 15 Oct 2006 19:09

I did another test, with changed just one loop: vehicles.c of 0.4.8, proc CallVehicleTicks

Code: Select all

	int weh = 0; Vehicle *konec = _vehicle_pool.blocks[0]+(1<<VEHICLES_POOL_BLOCK_SIZE_BITS)*sizeof(Vehicle);
	Vehicle *v;

	_first_veh_in_depot_list = NULL;	// now we are sure it's initialized at the start of each tick

 	for (v = GetVehicle(0); v != NULL; )
//FOR_ALL_VEHICLES(v)
	{
		if (v->type != 0) {
			_vehicle_tick_procs[v->type - 0x10](v);
		}
 	if( (++v) == konec ) v = ( ++weh >= _vehicle_pool.current_blocks ? NULL : ( konec = _vehicle_pool.blocks[weh] + ( weh == _vehicle_pool.current_blocks - 1 ? 1 + (_vehicle_pool.total_items & ((1<<VEHICLES_POOL_BLOCK_SIZE_BITS)-1)) : (1<<VEHICLES_POOL_BLOCK_SIZE_BITS) ) * sizeof(Vehicle),_vehicle_pool.blocks[weh] ) ) ;
	}

Time 1:39.3, another 1% faster, 5% in total. If FOR_ALL_VEHICLES macro could be changed in general, there could be more imrovements...

If you like the OOP, you could create some struct/class like VehicleFinder...

Is it worth the work?

prissi · Post by **prissi** » 15 Oct 2006 21:43

The 5% faster will be very processor dependent. Moreover, it will be even compiler dependent. OOPs will not help there, especially with different vehicle types there would be virtual functions which involve a virtual function table, which is not too well on the current mostly used long pipelined processors.

OOP could help in making better readable code, though.

Moreover, for every display step this loop is called once. Therefore, the better place for optimisation is imho the tile display loop itself. Many tiles calculate the sprite they display on the fly, even the ground tiles do so. Either using static tables (especially for trees) or even better caching two sprite numbers per tiles instead recalculation might be a better way to speed up the loop.

Otherwise the next compiler revision might slow down your explicitely written out version, since the compiler finds some other clever way to handle registers or can use SSE registers (ok, not likely in this case, but just an example) or just reorders instruction so they better fit into the pipeline.

If the display loop goes up 10% on most machines it might be worthwile.

SmatZ · Post by **SmatZ** » 15 Oct 2006 22:33

prissi wrote:The 5% faster will be very processor dependent. Moreover, it will be even compiler dependent.

Why do you think it will be processor dependend? Removing of overhead caused by unneeded calculations of memory addesses and - forcing numeric constants there, where are now variables - reduces number of instruction executed. Yes, it is a bit compiler dependend (but only the first thing I mentioned, the latter cannot be), but if compiler is intelligent enought to optimize current code, the new code will be optimized for sure too.

prissi wrote: OOPs will not help there, especially with different vehicle types there would be virtual functions which involve a virtual function table, which is not too well on the current mostly used long pipelined processors. OOP could help in making better readable code, though.

Yes, I agree. I don't like misusing of OOP, too. It helps reducing developer time, but often makes code more complex for optimizer and slower.

prissi wrote: Moreover, for every display step this loop is called once. Therefore, the better place for optimisation is imho the tile display loop itself. Many tiles calculate the sprite they display on the fly, even the ground tiles do so. Either using static tables (especially for trees) or even better caching two sprite numbers per tiles instead recalculation might be a better way to speed up the loop.

A good idea, I will have a look at it!

prissi wrote: Otherwise the next compiler revision might slow down your explicitely written out version, since the compiler finds some other clever way to handle registers or can use SSE registers (ok, not likely in this case, but just an example) or just reorders instruction so they better fit into the pipeline.

If the display loop goes up 10% on most machines it might be worthwile.

Yes, there should be some limit how much machine-optimized code could be. For example, using 64bit variables makes code slower on 32bit machines.

Maybe I am a bit different, but for me, every % counts - and when it is as easy as implicitly rewriting two lines of code in vehicles.h, why not try it? Even when it comes at cost of skipping calls to pool.h, because they are rewritten for Vehicles given its constants...

SmatZ · Post by **SmatZ** » 17 Oct 2006 16:10

Ok, I did another little modification, and the code is now ~ 10% faster against original (1:10,3 vs 1:17,4). Now it should work almost as fast as without mempools.

It is really ugly, but it is just for testing if this style is effective (and gives a lot of warnings for non-compliance, too)

Code: Select all

#define FOR_ALL_VEHICLES_FROM(v, start)  Vehicle * konec; int weh; for (v = GetVehicle(start),weh=start >> VEHICLES_POOL_BLOCK_SIZE_BITS,konec=_vehicle_pool.blocks[weh]+(( weh == _vehicle_pool.current_blocks - 1 ) ?( 1 + (_vehicle_pool.total_items & ((1<<VEHICLES_POOL_BLOCK_SIZE_BITS)-1))) : (1<<VEHICLES_POOL_BLOCK_SIZE_BITS) ) * sizeof(Vehicle); v != NULL; ((++v) == konec) ? v = ( ++weh >= _vehicle_pool.current_blocks ? NULL : ( konec = _vehicle_pool.blocks[weh] + (( weh == _vehicle_pool.current_blocks - 1 ) ?( 1 + (_vehicle_pool.total_items & ((1<<VEHICLES_POOL_BLOCK_SIZE_BITS)-1))) : (1<<VEHICLES_POOL_BLOCK_SIZE_BITS) ) * sizeof(Vehicle),_vehicle_pool.blocks[weh] ) ) : v )

As we were talking about mempools, what are its advantages? Are there any architectures that dont support virtual memory and paging, what is OTTD built at?

edit:
This code would need some more changes, because as it is now, there are some debug messages. Some procedures changes Vehicle *v inside the for loop. I will try to find them and correct them.

edit:
The problem was in AfterLoadVehicles. The second FOR_ALL_VEHICLES must use old style loop. Another probably problematic could be OldLoader.

I am just showing how things can be changed - I am not trying to do any major changes to code as I am too new here

CobraA1 · Post by **CobraA1** » 17 Oct 2006 23:22

The 5% faster will be very processor dependent. Moreover, it will be even compiler dependent. OOPs will not help there, especially with different vehicle types there would be virtual functions which involve a virtual function table, which is not too well on the current mostly used long pipelined processors.

Most current processors are also overkill for OpenTTD, as it'll run on processors around (and even below) the 100 MHz mark. If you wish to optimize for something, optimize for mobile processors and other lower-end processors. Don't bother optimizing for the 2-4 GHz processors, as they can already handle OpenTTD quite well.

DaleStan · Post by **DaleStan** » 18 Oct 2006 03:00

Talk to the people on #openttd-coop about that one; I think you'll find that they disagree.

XeryusTC · Post by **XeryusTC** » 18 Oct 2006 05:59

DaleStan wrote:Talk to the people on #openttd-coop about that one; I think you'll find that they disagree.

Indeed I/we do, the network code of OpenTTD uses too much CPU. It makes games with more than 400 trains quite unplayable, while my PC can handle up to 800-900 trains without a problem. And our server can only handle up to 400 trains too before the game starts eating too much CPU and other processes can't do a thing.
On this note, it would be very usefull for us to optimise pathfinding code and/or network code as those are the things that use most of our CPU time.

KUDr · Post by **KUDr** » 18 Oct 2006 07:23

I don't think that optimizing PF can help any more. Just now it is not PF code what makes your CPU load (unless you use ships heavily).

RiTi · Post by **RiTi** » 18 Oct 2006 08:31

KUDr wrote:I don't think that optimizing PF can help any more. Just now it is not PF code what makes your CPU load (unless you use ships heavily).

Just curious, what is in general the problem that makes heavy CPU load resulting in slow game play?

Brianetta · Post by **Brianetta** » 18 Oct 2006 08:37

The game loaded the CPU on my server very heavily even when using NTP. I believe the drawing routines have a lot to answer for, since reducing the dedicated server's screen resolution helps a lot.

YAPF is good enough that it's useable, whereas NPF was not. One of the problems we used to have on the #openttdcoop sandbox server was that players would sometimes join and do nothing but clone trains, attempting to achieve some sort of record number of trains. We've managed to discourage this behaviour now. (-:

The current sandbox uses YAPF (without any ships) and has been seen with hundreds of trains. The CPU use by the openttd process on my server's P4 1800 usually hovers at around 20-40%. It rises noticably when ten people are connected.

So, it's better than it was, but it's not perfect. OpenTTD is a much more complex game than its 1990s appearance suggests.

KUDr · Post by **KUDr** » 18 Oct 2006 16:40

RiTi:

If you want to see what causes CPU load, here is output from profiler (SP, YAPF, 120 trains, fast forward):

Code: Select all

Function                        %Incl.  %Excl.
-----------------------------------------------
@GameLoop@0                     64.305  0.006
@StateGameLoop@0                62.627  0.075
@Train_Tick@4                   42.918  2.634
TrainLocoHandler                40.284  1.308
@UpdateWindows@0                21.506  0.000
@DrawDirtyBlocks@0              21.418  0.390
@CallWindowEventNP@8            21.110  0.025
@RedrawScreenRect@16            21.091  0.019
@DrawOverlappedWindowForAll@16  21.010  0.000
DrawOverlappedWindow            21.010  0.069
TrainController                 19.495  1.408
GfxMainBlitter                  13.629  1.333
@DoDrawString@16                11.014  0.415
PlayerFinancesWndProc           10.731  0.038
UpdateTrainSpeed                10.134  0.446
GetTrainAcceleration             9.688  9.618
GfxBlitZoomInUncomp              8.946  8.946
@CallVehicleTicks@0              8.889  7.588
DrawPlayerEconomyStats           8.229  0.044
AfterSetTrainPos                 7.600  0.515
Proxy                            7.204  0.000
_deflateEnd@4                    7.192  6.877
WriteZlibLoop                    7.192  0.000
@RunTileLoop@0                   7.129  1.440
@DrawWindowViewport@4            6.620  0.019
ViewportDrawChk                  6.601  0.000
@ViewportDoDraw@20               6.601  0.082
NewsWindowProc                   6.287  0.019
@VehicleFromPos@12               5.463  2.754
@DrawString@16                   5.450  0.013
CheckTrainCollision              5.017  0.132
_WinMainCRTStartup               4.331  0.000
TrainCheckIfLineEnds             4.080  1.987
@GfxFillRect@20                  3.904  1.999
MainWindowWndProc                3.621  0.019
@TileLoopClearHelper@4           3.539  3.539
@DrawSprite@12                   3.489  0.075
GfxBlitTileZoomIn                3.351  3.351
@DrawStringCentered@16           3.338  0.013
ChooseTrainTrack                 3.300  0.019
@GetSlopeZ@8                     3.206  0.453
@DrawWindowWidgets@4             3.068  0.069
@CallLandscapeTick@0             2.678  0.006
@DrawStringRightAligned@16       2.533  0.019
@VehiclePositionChanged@4        2.521  1.697
TileLoop_Clear                   2.477  0.283
GetSlopeZ_Track                  2.433  0.597
ViewportAddLandscape             2.401  0.220
_memset                          2.332  2.332
@OnTick_Station@0                2.169  0.314
StatusBarWndProc                 2.169  0.057
@GetTileSlope@8                  2.081  2.081
FindTrainCollideEnum             2.005  2.005
GetVehicle                       1.930  1.930

I would like to try it again in multiplayer to see the network influence.
But it will need some server with like 500+ trains.

XeryusTC: as you can see, the pathfinder takes 3.3% while whole TrainLocoHandler() takes 40%.

DaleStan · Post by **DaleStan** » 18 Oct 2006 17:31

Well, in that case, why not just optimize GameLoop, since that function takes nearly two-thirds of the processor time?

KUDr · Post by **KUDr** » 18 Oct 2006 17:47

DaleStan: kidding? GameLoop() calls everything else you see there. All the time is spent there and it is of course in FF mode. This function itself takes only 0.006% CPU.

Probably i should explain how profiling works. But sorry. Explanation of those two columns must be enough for now:

%Incl. = function itself plus all other functions called from it
%Excl. = function itself only

Post by **TrueBrain** » 18 Oct 2006 17:52

KUDr: bad profiling, autosave seems enabled. A lot of time is 'wasted' there, which in fact doesn't contribute in any way to the profiling itself. So please disable it, rerun, and repost

Ps: you also might want to sort on 'Excl', as 'Incl' doesn't really give much information.

DaleStan · Post by **DaleStan** » 18 Oct 2006 18:07

Yes, KUDr. I was being sarcastic.

I'm well trained in the theory of "Lies, damn lies, and statistics". Use of a column of percentages that adds up to somewhere around 600% makes me apply that training in spades, whether appropriate or not.

Transport Tycoon Forums

OpenTTD optimizations

OpenTTD optimizations

Who is online