Darkvater wrote:
The function is already inlined and the when openttd is built with full optimizations the chance that it is actually getting inlined is about 99.9% (depending on how stupid your compiler is). So as long as you don't show me the assembly output proving the function wasn't inlined...I can only take your results with a grain of salt.
Yes, the function gets inlined. But the problem there is (after looking at assembled code) that there are more loads from memory and more computations in standard code:
load pool->block_size_bits, shifting by it twice (x86 architecture allows shifts by immediate value, but here must be used one register to store shift count)
load pool->item_size and multiplying by it (there can be multiple by constant or, better, shifts instead of multiply if the compiler know the size, because it is constant)
There are more free registers, so more things can be loaded only once and stored in register.
The 4% difference is more than I expected, but I doublechecked it.
First run, normal version - 1:46.5, second 1:44.9
Version with constants - 1:42.9 and 1:40.9
I am not used to linux debugging, but I know OllyDBG in Windows and there I could see how values are computed. (but I dont know how to disassemble my linux compiled programs)
edit 1:
I use the same Makefile.Config and openttd.cfg for both versions
edit 2:
Disassembled openttd.exe r6616, some inlined GetVehicle:
Code: Select all
lea edx,[ebx+1] ; ebx=index-1 (prev. iteration) -> edx=index
mov ecx,[0x507c50] ; ecx=pool->block_size_bits
movzx edx,dx ; erm... index is 16bit, isn't it?
mov eax,1 ; 1 to be shifted
shl eax,cl ; shift left by block_size_bits
mov ebx,edx
shr ebx,cl ; shift right by block_size_bits
lea ecx,[eax-1] ; 1<<bsb - 1
and ecx,edx ; (index & (1<<bsb - 1))
imul ecx,[0x507c54] ; * pool->item_size
mov eax,[0x507c68] ; pool->blocks
add ecx,[eax+ebx*4] ; finally, ecx=offset to Vehicle structure in memory..
Considering this happens for every iteration of FOR_ALL_VEHICLES, it takes a lot of time overhead.
Another good idea would be not to use indexes, but to use Vehicle* just in the main loop with taking care of end_of_block on my own - the compiler doesn't optimise this, and the IMUL takes a lot of time too in every iteration. We would jump the pointer by sizeof(Vehicle)... like
if (!end_of_this_block) veh+=1;
Are there any arguments against dividing VehiclePool into pools for road/rail/ship/air/special?
There are often loops looking only for vehicles of special type. So these loops could be made much faster.
The idea of faster searching for free slots is in some structure of pointers to vehicles. There could only one index be moved (the last index) onto the place of removed vehicle. Newly added vehicle would be added to the end of the list. Memory requirements are not heavy, sizeof(pointer)*(1<<block_size_bits) for every block. This would make indexing of Vehicles faster, too, because one would need not to use that (see above) code to compute offset of the Vehicle structure in memory.
This depends, however, how often removing/adding of standard vehicles happens. Maybe it would make code slower, too, and direct indexing of Vehicles array would be faster.
If I am mistaken, correct me please!