Yet another 32bpp anti-aliased blitter for OTTD 1.2.1

LeXa2 · Post by **LeXa2** » 01 Aug 2012 19:19

Last update: 2012/08/15 12:19 GMT+04
Last version: Git 043131c
Changelog for g043131c

The original description of the patch and "The Gory Details (TM)" can be found here.

What's this patch about in short? Take a look (pics are clickable):

Known 32bpp-anim-aa blitter versions:

2012/08/15 - Git 043131c
2012/08/03 #1 - Git 1ce8aac8
2012/08/03 #0 - Git 8fbe2f58
2012/08/01 - Git 6a63343

Blitter configuration:

Versions from 6a63343 and up to 043131c.
You can configure the blitter through the openttd.cfg file.
To enable the blitter, put a line containing
```
blitter = "32bpp-anim-aa"
```
into the [misc] section of the openttd.cfg file. Make sure you've got only one line starting with "blitter = " in this section.

You can change three parameters to tune the blitter to your needs:

Anti-aliasing level for all sprite pixels except for palette-animated ones.
This is the main knob to tweak. A higher level means higher quality but worse performance. It can be set by adding/changing the following line in the [misc] section of the openttd.cfg file:
```
blitter-32bpp-aa-level = 4
```
Instead of 4 you can put any number chosen from 1, 2, 4, 8, 16 or 32. Putting any other number has the same effect as putting the nearest lower number from the above sequence.
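A minimal sketch of the rounding rule described above, for readers who want to check what a given value ends up as. The function name is made up for illustration and this is not the patch's actual code. The post does not say what happens for values below 1, so falling back to 1 here is an assumption.

```python
# Levels accepted by blitter-32bpp-aa-level, as listed in the post.
ALLOWED_AA_LEVELS = (1, 2, 4, 8, 16, 32)


def normalize_aa_level(requested: int) -> int:
    """Return the nearest allowed level that is not higher than `requested`."""
    candidates = [level for level in ALLOWED_AA_LEVELS if level <= requested]
    # Assumption: values below 1 are not covered by the post, so use 1.
    return candidates[-1] if candidates else 1
```

For example, `normalize_aa_level(5)` gives 4 and `normalize_aa_level(100)` gives 32.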

Anti-aliasing level for palette-animated pixels.
This is the secondary knob to tweak. A higher level means higher quality but worse performance. It can be set by adding/changing the following line in the [misc] section of the openttd.cfg file:
```
blitter-32bpp-aa-anim-slots = 16
```
Instead of 16 you can put any number between 1 and (blitter-32bpp-aa-level * blitter-32bpp-aa-level). Putting any number higher than the squared value of blitter-32bpp-aa-level has the same effect as putting the squared value itself. The performance loss caused by this setting can be even higher than the loss from setting blitter-32bpp-aa-level too high. It depends on the screen/window resolution you play at, on the number of palette-animated pixels visible at the given moment (a lot of palette-animated water coupled with a high value for blitter-32bpp-aa-anim-slots => extremely low performance), and on whether you have a multi-core CPU and enable multithreaded palette animation using the setting described next.
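The upper bound above can be sketched as follows. The function name is hypothetical, and treating values below 1 as 1 is an assumption, since the post only describes the upper limit.

```python
def clamp_anim_slots(requested: int, aa_level: int) -> int:
    """Limit the anim slot count to the range 1 .. aa_level squared."""
    max_slots = aa_level * aa_level  # e.g. 4x AA allows at most 16 slots
    # Assumption: anything below 1 is raised to 1.
    return max(1, min(requested, max_slots))
```

So with `blitter-32bpp-aa-level = 2`, asking for 16 slots behaves like asking for 4.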

Number of threads to use for updating palette-animated pixels.
It can be set by adding/changing the following line in the [misc] section of the openttd.cfg file:
```
blitter-32bpp-aa-anim-threads = 8
```
Setting it to "-1" or "1" disables threaded palette animation. Setting it to "0" instructs the blitter to try to determine the number of cores your CPU has and to use the number of threads that suits it best (two for a dual-core CPU, 8x CPU cores for a multicore CPU, no threads for a single-core CPU). Setting it to any other positive integer up to 127 instructs the blitter to use that number of threads. If the blitter was instructed to auto-detect the best number of threads and failed to do so for some reason (it may fail on some platforms in rare cases), it falls back to the "safe" default of 2 threads. If threading is not available on the target platform, or if thread creation fails for some reason (for example because you requested more threads than your platform can handle), the blitter tries to fall back to non-threaded mode.
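The rules above can be read as a small decision function. This is only a sketch of that reading, not the patch's code: the function name is invented, "8x CPU cores" is interpreted as eight times the detected core count, and the handling of negative values other than -1 and the 127 cap on auto-detected counts are assumptions. A return value of 0 stands for non-threaded mode.

```python
from typing import Optional

MAX_THREADS = 127  # upper bound for the setting, as stated in the post


def resolve_anim_threads(setting: int, detected_cores: Optional[int]) -> int:
    """Map blitter-32bpp-aa-anim-threads to a worker thread count.

    `detected_cores` is None when core detection failed.
    Returns 0 when palette animation should run non-threaded.
    """
    if setting in (-1, 1):
        return 0  # threading explicitly disabled
    if setting < 0:
        return 0  # assumption: other negative values also disable threading
    if setting == 0:
        if detected_cores is None:
            return 2  # detection failed: the "safe" default from the post
        if detected_cores <= 1:
            return 0  # single-core CPU: no threads
        if detected_cores == 2:
            return 2  # dual-core CPU: two threads
        # Assumption: "8x CPU cores" means 8 * cores, capped at 127.
        return min(8 * detected_cores, MAX_THREADS)
    return min(setting, MAX_THREADS)
```

A caller could pass `os.cpu_count()` as `detected_cores`, since that function also returns None when the count cannot be determined.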

Known bugs:

Performance isn't stunning even with moderate AA levels like 2x or 4x when used with a bare 8bpp GFX baseset. The easiest way to "fix" it is to install some GRFs that supply unmasked 32bpp sprites for tiles. A good idea would be to use the "Ben Robbins Fields Ground with lines" and "Ben Robbins Ground with lines" GRFs or the older "32bpp megapack" compiled into a NewGRF. Using zBase would also do, but you should expect visual glitches (as of zBase r123) due to a small but nasty bug in this quickly emerging baseset.
The blitter could be incompatible with some 8bpp GRFs that use so-called recolour sprites for palette ranges other than those used by the original DOS/WIN or OpenGFX basesets. I haven't seen any GRF hit this bug while doing "in-house" play testing of the blitter, but that doesn't mean such an "incompatible" GRF does not exist in the entire universe.
The rendering produced by the blitter isn't as good as what a real SSAA approach would produce, due to some tricks used to eliminate glitches that are guaranteed to happen otherwise. This can be mitigated by hand-crafting sprites with yet another "special trick", as was done for the radio tower and the oil refinery flame torch tower in the NewGRF attached to the second post in this forum thread.

Fixed bugs (most important ones; for more details take a look at the changelog supplied with the source code patch):

In g043131c:
- Threaded and even non-threaded palette animation caused excess lag in mouse cursor updates. Palette-animation threading as implemented in the older releases of the patch was wrong and extremely inefficient. But that wasn't the worst thing related to palette animation in the 32bpp-anim-aa blitter. The main flaw was that palette animation wasn't optimized to be fast enough even for the "no antialiasing" case. Both problems have been addressed, and with version g043131c it should be possible to play with 4 ANIM SLOTS without any performance-related problems on any dual-core CPU from the last 5 years (excluding netbook-targeted CPUs like Intel Atom and AMD Cxx/Exxx/Zxx). With a fast multicore CPU you can use a higher number of anim slots (16 anim slots have been tested to work sufficiently well on an 8 "core" AMD FX 8120 CPU) to gain an additional increase in render quality.
- Do not use GCC-specific array allocation on the stack, which prevented the blitter from being compiled by MSVC. Note that I do not use MSVC and do not test the blitter with it, so it is unknown whether the blitter compiles with MSVC with these fixes in place. Reports are welcome.
- A lot of other fixes here and there to make this blitter release "the best blitter release ever (TM)".
In g1ce8aac8:
- Threading-related deadlocks with pthreads (affected platforms are: linux, freebsd).
- Division by zero (and eventual OTTD crash) when applying "transparency" effect to the palette animation buffer with the source pixel being fully transparent.
  Affected: unknown number of GRFs, was spotted when trying to play with zBase as a baseset without any other active GRFs and trying to enable "transparency" for trees.
  Note: the current fix is actually a workaround for another possible bug in Encode() that is still waiting to be pinpointed and - if it really is a bug - fixed.
- Fixed a typo in one comparison to make it work as initially intended. The typo caused the comparison to always evaluate to "true", possibly making the blitter a bit slower than it is now with the typo fixed. Don't expect a "magic hyperboost" though, the difference should be really minor.
In g8fbe2f58:
- Patch updated to be compatible with both vanilla 1.2.1 sources and current trunk (as of r24450)

Future plans:

Add a mode that would allow faster AA performance when blitting sprites in BM_NORMAL and BM_TRANSPARENT mode at the expense of memory usage. It would greatly help when the user has a "bare" 8bpp baseset without any additional 32bpp GRFs supplying "basement" sprites (I would bet that most currently played installations of OpenTTD fall into this category).
Profile and optimize the blitter even more. A lot has already been done on this front. I've spent about three weeks finding bottlenecks and then optimizing (sometimes by means of a total rewrite) the palette animation and threading code, but there's still more to be done. ATM I'm working on extracting a subset of the OTTD codebase that would allow me to create an "isolated environment" where I could reliably benchmark Draw() and Encode() performance. I believe there's huge room for optimization there.
Try to implement a multithreaded Draw() routine and check if it helps to gain some more speed for "8xAA + 64 anim AA slots" and higher cases. I suspect that the benefits would be smaller than the time spent on ITC and context switches, but who knows? No test - no gain.
When and if a multithreaded Draw() implementation is tried: give compiler- and platform-specific lockless thread ITC a chance, i.e. try to use atomic increments/decrements + spinlocks for synchronizing threads instead of relying on the OS and the threading lib to do locks and signaling. I have already tried it on Win32/64 and on linux/pthreads when I was reimplementing threading for palette animation, and it proved to be beneficial over using ITC through OS/pthreads services for cases when the synchronizing period is shorter than the thread execution slice (~10ms on Win32/64, varies greatly on linux depending on the system load and the kernel process scheduler). The downside is CPU hogging due to the spinlocks and the possibility of a deadlock if thread affinity is changed in an unexpected way by a third party (taskset, Windows task manager, etc.).
I'm thinking about hacking in yet another anti-aliased blitter implementation which would use real SSAA instead of the SB AA approach. Don't know if I would ever really try to implement it though.

Afterword:
Testing is needed.
Suggestions are welcome.
Bug reports are expected to be filed as replies to this thread. PMs would serve as well for this purpose.
Thanks for spending your time reading this and trying this blitter.

LeXa2 · Post by **LeXa2** » 01 Aug 2012 19:20

Original contents of the thread starting post:

Hello 2all,

When I was starting this project I thought that when (and if) the RELEASE moment came I would start a new thread with the phrase "What bothers me most in current OpenTTD is the way it handles zooming out." At that time I wasn't aware of the existence of the patch available here: http://www.tt-forums.net/viewtopic.php?f=33&t=35311 .
Well, time passes by but some things don't get better than they were in the good old nineties and the not so old but still good 00s.

By the time the forum search finally brought me to the topic above, my implementation of the same "sprite-based SSAA" technique was already mostly done, so for me it was more of a geek interest to take a look at the sources of the patch TheBlasphemer posted there. The approaches we took in implementing basically the same thing turned out to be rather different: his blitter is pretty simple (and, as a side result, the code is pretty short and easily readable) and does most of the work at draw time, while my implementation does as much work as it can at sprite loading (encoding) time. Another thing that my implementation does differently is that it supports palette animation and is even capable of doing SSAA for palette-animated pixels. That comes at the price of reduced rendering speed and much more complicated code. Lastly, my implementation is capable of correctly handling 32bpp and "semi-transparent" sprites, while the initial implementation of TheBlasphemer's blitter simply throws out any transparency it encounters on the way; the later implementation tries to deal with non-opaque sub-samples in a more gentle way but, as far as I can tell based on my knowledge of the topic, it should suffer from the "visible seams at the edges of the ground sprites" problem. In any case, a lot of time has passed since 2007: one would have next to no success trying to use the later implementation (the internal encoding format Blitter_32bppOptimized uses for sprite storage has been changed in an incompatible way) and would suffer various transparency-related bugs with the initial implementation.

Enough about the past, let's get back to the present. What have I got to share with OpenTTD fans? As the topic SUBJ says, it's yet another blitter for OpenTTD that uses a sprite-based anti-aliasing technique for eye-candy purposes. What's sprite-based anti-aliasing? Well, it's a technique that tries to approximate the result of rendering sprites at Nx resolution and then linearly interpolating the image down to the original resolution. It does so by moving the "linear interpolate down" step from draw time to sprite load time and performing it on individual sprites instead of the entire image. De facto it is roughly the same as providing a separate version of the sprite for each existing zoom level (there are 6 zoom levels in OTTD now) with downsampling done by a graphics designer in a way other than "nearest neighbour". Essentially almost everything that the proposed blitter implementation does could be done by manipulating sprites in the baseset and other used GRFs, but that would require a lot of hand-work. The SB AA blitter does this work for you in realtime, but you pay for it with reduced rendering speed and an enlarged memory footprint.
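To make the difference between "nearest neighbour" and "linear interpolate down" concrete, here is a toy sketch on a tiny image of fully opaque RGBA pixels. It is not taken from the patch. The function names are invented, the image size is assumed to be a multiple of the factor, and which source pixel the engine's own resize picks is not stated in this post, so taking the top-left pixel of each block is an assumption.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]  # (R, G, B, A), each 0..255
Image = List[List[Pixel]]


def downsample_nearest(image: Image, factor: int) -> Image:
    """Keep one source pixel per block (assumed: the top-left one)."""
    return [row[::factor] for row in image[::factor]]


def downsample_box(image: Image, factor: int) -> Image:
    """Average every factor x factor block into one target pixel."""
    result = []
    for y in range(0, len(image), factor):
        out_row = []
        for x in range(0, len(image[0]), factor):
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            # Plain per-channel average; fine for fully opaque pixels.
            out_row.append(tuple(sum(p[c] for p in block) // len(block)
                                 for c in range(4)))
        result.append(out_row)
    return result


BLACK = (0, 0, 0, 255)
WHITE = (255, 255, 255, 255)
checker = [[BLACK, WHITE],
           [WHITE, BLACK]]
```

Nearest neighbour turns the 2x2 checker into a single black pixel and loses the white half entirely, while the box average yields mid grey, which is what the eye would see from a distance.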

Let's dive deep into the gory details and implementation difficulties.

Why is it not possible to simply implement something fancier than a "nearest neighbour like" algo in the engine, given that it in any case resizes sprites down internally on load when a GRF doesn't supply a version of the sprite for each existing zoom level? The answer is "it is possible in general but would only work for unmasked 32bpp sprites" (link points to a thread containing another patch that implements exactly that).

The main problem with 8bpp sprites is that they have to be converted into an RGBA representation before performing linear-interpolated downsampling, and that's not generally possible due to some legacy concerns. First of all, there are some palette indexes that are used for palette animation. There's no correct way to downsample a 2x2 matrix of such pixels into one target pixel while retaining its "palette-animated" nature. Secondly, any GRF can introduce a thing known as a "recolour sprite". It is a thing that allows changing the colour of some pixels at Draw() time. It is the feature used to "paint" vehicles in the company colour. It is used to draw the same city buildings or bridges in differing colours. It is used to draw the catchment area (while choosing a place for a station) with cyan rectangles while the original sprite is greyish white. Most recolour sprites in the game are defined by the baseset, but nothing stops a clever GRF designer from introducing yet another recolouring map in the NewGRF he/she creates. As colour remapping happens at draw time, there's nothing that can be done about it when downsampling at sprite load time. And as the engine sets almost no limits on the colour indexes that can be remapped, we have to assume that any colour index is a remapped one to be on the safe side.
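A toy illustration of why the recolour step gets in the way. The palette indexes, colours and the recolour map below are all made up for the example and have nothing to do with the real OpenTTD palette. The point is only that once a 2x2 block has been blended to RGB at load time, the palette indexes are gone and a draw-time remap can no longer be applied.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical palette: two shades of blue and two shades of red.
PALETTE: Dict[int, RGB] = {
    10: (0, 0, 200), 11: (0, 0, 100),
    20: (200, 0, 0), 21: (100, 0, 0),
}

# Hypothetical recolour map applied at draw time: blue -> red.
RECOLOUR: Dict[int, int] = {10: 20, 11: 21}


def average(colours: List[RGB]) -> RGB:
    """Per-channel average of a list of RGB colours."""
    return tuple(sum(c[i] for c in colours) // len(colours) for i in range(3))


block = [10, 11, 10, 11]  # one 2x2 block of 8bpp palette indexes

# Blending at load time bakes in the un-remapped colours.
blended_at_load = average([PALETTE[i] for i in block])

# Blending at draw time can remap each index first, then blend.
blended_at_draw = average([PALETTE[RECOLOUR.get(i, i)] for i in block])
```

Here `blended_at_load` comes out blue and `blended_at_draw` comes out red, and nothing in the blue RGB value tells the blitter which remap it should have gone through.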

To a certain extent it is possible to step onto thin ice and assume that it's enough to correctly handle only the recolour sprites introduced by the baseset, but the truth is that this assumption won't help: some recolour maps in the baseset cover the entire palette range. It turned out, however, that properly handling only a part of the recolour sprites from the baseset can still result in pretty close to correct behaviour. The recolour ranges recognized by the current blitter implementation are: the palette ranges used for "company colouring" and "buildings colouring" (leaving behind proper support for "bare land" and "church" recolouring), "catchment area colouring" and "secondary company colours". The same ranges were used in the "downsampling done properly" patch, and it is able to produce not so bad results for some sprites. Yet some other sprites don't benefit from this, and in the end they are even more "eye hurting" because of the contrast they make with the sprites that were handled well. The sprites that can't be handled in this way which you would encounter most often are any water sprites with a palette-animation capable blitter (32bpp-anim).

Thus the summary is: to be able to apply linear-interpolated downsampling to all pixels of the sprite one would have to do it at draw time.

The second challenge on the way to implementing sprite-based anti-aliasing is to find a method to deal with the seams at the tile sprite "edges" that are introduced as a natural consequence of performing linear-interpolated downsampling. To help you get an idea of what this buzz is about, here is an illustrative picture (click on thumb to get full-sized 147KB PNG):

When the engine downsizes a sprite using a nearest-neighbour-like approach, each target pixel is based on one and only one source pixel. It means that when the source sprite consists of only fully transparent and/or entirely opaque pixels, the same holds for the downsized version of the sprite. No semi-transparent pixels are introduced as a product of resizing the sprite down. Tile sprite shapes were designed so that downsized sprites produced by the original resize algo opaquely cover the entire target area when laid out correctly (as demonstrated by the left column on the pic above). With linear interpolation things are different: some pixels in the downsized sprite are semi-transparent because they are blended from a number of opaque and a number of transparent pixels. The more the sprite gets shrunk, the more severe the problem is. Click on the pic above, zoom in to "full size" (browsers tend to downscale the pic so its width fits into the client area of the window) and take a close look at the bottom part of the central column. You will easily notice that the checkerboard-like background pattern "shines through" the pixels comprising the downscaled sprite.

How could this problem be dealt with? Real answer is: there's no entirely correct way to do it automatically with current tile draw infrastructure.
What we want is to guarantee that for sprites representing ground tiles (I will call sprites like this "basement sprites" from here on) the downsized image has exactly the same "form of the fully opaque and fully transparent parts", with the exact shape of the parts being specific to the target zoom level and tile slope. For non-basement sprites this goal can be skipped. The trouble is that there's no easy way at engine level to distinguish basement sprites from non-basement ones (check this thread for discussion). Thus we have to treat all sprites as "basement".

Another problem on the way to "entirely correct" behaviour is that we would have to hardcode in the engine a set of "masks" representing the shape of the transparent/opaque parts for each tile slope direction and zoom level, then perform checks at draw time against this mask and alter the target pixels we produce based on it. It would complicate and slow down the blitter greatly.

But there's a "clever hack" to achieve virtually the same effect at downsample time without the need for "shape masks": perform downsampling using the original algo as a first pass and temporarily store the result, then proceed with linear-interpolated downsampling, and then produce the resulting image using the colour values of the second pass and the alpha channel values of the first pass. It's roughly the same as if we created a layered image in the photo editing software of our choice, placed a sprite image downsampled using the nearest neighbour approach as the "bottom layer" and then placed the same image downsized with linear interpolation on the top layer. Take a look at the third column on the picture above - it's done just like this. You can see that this technique produces a "seamless ground sprite layout" and at the same time retains more details of the original picture, like linear downsampling does.
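The two-pass idea above can be sketched in a few lines. This is an illustration only, not the code from the patch. The function names are invented, the image size is assumed to be a multiple of the factor, the nearest-neighbour pass is assumed to take the top-left pixel of each block, and weighting colours by alpha during blending is my assumption about how one would keep transparent pixels from darkening the result.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]  # (R, G, B, A), each 0..255
Image = List[List[Pixel]]


def blend_block(block: List[Pixel]) -> Pixel:
    """Linear pass: alpha-weighted colour average plus averaged alpha."""
    total_alpha = sum(p[3] for p in block)
    if total_alpha == 0:
        return (0, 0, 0, 0)
    r, g, b = (sum(p[c] * p[3] for p in block) // total_alpha
               for c in range(3))
    return (r, g, b, total_alpha // len(block))


def downsample_seamless(image: Image, factor: int) -> Image:
    """Colour from the linear pass, alpha from the nearest-neighbour pass."""
    result = []
    for y in range(0, len(image), factor):
        out_row = []
        for x in range(0, len(image[0]), factor):
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            r, g, b, _ = blend_block(block)   # second pass: colour only
            nearest_alpha = image[y][x][3]    # first pass: alpha only
            out_row.append((r, g, b, nearest_alpha))
        result.append(out_row)
    return result


GRASS = (0, 200, 0, 255)
CLEAR = (0, 0, 0, 0)
edge = [[GRASS, CLEAR],
        [GRASS, CLEAR]]
```

On the `edge` block the plain linear pass gives a half-transparent pixel (alpha 127), which is exactly the seam problem, while `downsample_seamless` keeps the pixel fully opaque because the nearest-neighbour sample was opaque.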

If you're curious whether there are any real visible in-game problems with the "straight linear-interpolated approach without the clever trick" above - here is a screenshot illustrating the problem:

It was captured with a special build of the engine that fills every area it redraws with magenta prior to drawing sprites. The "seams between tiles" are easily visible here. Without the "development magenta floodfill" you would get "remains" of whatever colour was at that place before as soon as you try to scroll the viewport, try to zoom in or out, etc. It might not seem "that visible" at first glance, but as soon as you run into a green grid slipping between water tiles or a blue grid shining through the green land tiles you will be convinced that this is a real problem you have to deal with when implementing sprite-based anti-aliasing.

Let's move on to the next problem, the performance.
To characterize it in short: it sucks.

As a significant part of the downsampling process has to be done in the blitter for the reasons detailed above, you might get the same (if not worse) performance you'd get with the direct per-pixel super-sampling approach (i.e. render the viewport into an Nx sized offscreen buffer using the original_zoom-N zoom level, then downscale this buffer to screen resolution and blit it into the front buffer). Thus one might ask if there's any point in implementing an SB AA blitter. Actually there is: the more 32bpp sprites we get, the more performance the SB AA blitter gains. With the entire baseset made of 32bpp sprites and a minimal number of "masked" sprites among them, the SB AA blitter would perform at almost the same speed as the original 32bpp-anim blitter (but would still use a lot more memory for palette-animation tasks - anti-aliasing always comes with a price). With the implementation attached to the first post of this thread you can easily "feel it on your own skin" by trying to scroll around at 32x zoom out using a bare 8bpp baseset and then doing the same with the "Ben Robbins Ground with lines" and "Ben Robbins Fields Ground with lines" 32bpp NewGRFs installed/activated (or with the zBase baseset). You'd feel the difference especially if you configure the blitter to use a higher level of AA for non-animated sprites (set "blitter-32bpp-aa-level" to 8, 16 or 32 in the "[misc]" section of openttd.cfg; don't forget to increase "sprite_cache_size_px" to at least 128 while you're there).

And coupled with the linear-interpolated downsampling used for resizing down 32bpp sprites at load time, it could cut AA costs to a negligible amount: you'd be able to set the AA level as low as 2xSSAA + 4 anim AA slots and still get pretty decent results.

Let's illustrate this theory with some screenshots. First of all, what are the gains for SB AA if the used baseset is 32bpp and mostly consists of unmasked sprites?
Take a look and judge:

This screenshot was captured with a special development version of the blitter that highlights (in red) the pixels blended at Draw() time - i.e. not at sprite load time. The blitter was configured to use 4x AA for non-animated sprites, which translates into 16 sub-samples max per target pixel. When the blitter had to blend together all 16 possible sub-samples at Draw() time, you get pure red in place of that pixel. If the blitter wasn't forced to do any blending (i.e. the required processing was done at sprite load/encoding time), the pixel is drawn as is without the red overlay. Draw-time blending for sub-sample counts between 2 and 15 results in overlaying red in direct proportion to the number of sub-samples (dst = AlphaBlend[alpha = 255 * subsamples / 16, bright_red, dst]). The top part of the picture above was captured with the "base 8bpp baseset" (i.e. no 32bpp GRFs active). You can see that for most of the picture the blitter was performing downsampling at draw time, dropping the performance down to an unplayable level. The bottom part of the picture was captured during the same game session (i.e. without exiting the game and starting it again), but there I had activated a 32bpp NewGRF based on the well-known "32bpp megapack" that was available for use with the OpenTTD "extra zoom patch". As can be seen, most pixels weren't blended by the blitter at draw time, making the performance sufficiently high for general gameplay while keeping the output quality at a very high level.
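The overlay formula quoted above can be tried out with a small sketch. The names are invented for the example, `BRIGHT_RED` is assumed to be pure (255, 0, 0), and the integer rounding is a guess, so the exact values may differ from what the development build produced.

```python
from typing import Tuple

RGB = Tuple[int, int, int]

BRIGHT_RED: RGB = (255, 0, 0)  # assumed value of "bright_red"


def alpha_blend(alpha: int, src: RGB, dst: RGB) -> RGB:
    """Blend src over dst with the given 0..255 alpha."""
    return tuple((s * alpha + d * (255 - alpha)) // 255
                 for s, d in zip(src, dst))


def debug_overlay(dst: RGB, subsamples: int, max_subsamples: int = 16) -> RGB:
    """Tint a pixel red in proportion to the draw-time blending work."""
    if subsamples <= 1:
        return dst  # nothing was blended at draw time: draw the pixel as is
    alpha = 255 * subsamples // max_subsamples
    return alpha_blend(alpha, BRIGHT_RED, dst)
```

With 16 of 16 sub-samples the pixel becomes pure red, with 8 of 16 a mid green such as (0, 128, 0) turns into roughly half red, half green.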

OK, but what is the general picture then? What do you gain in quality by using higher AA levels? And would 2x AA be enough if a 32bpp baseset is used and "improved" resizing is in place? Here are the answers, they "speak" for themselves (pics are clickable as usual in this post):

And here are thumbs/links to the screenshots used to compose the above collage if you want to take a look at fullsized lossless-compressed originals:

What could be told by analyzing these? They prove the theories written above.

Let me summarize it here:

8x AA does not offer significantly better visuals compared to 4x AA (as expected: 64 subsamples instead of 16 is a cool thing, but even 16 is enough for most uses).
Downsampling sprites using linear interpolation at load time (to the possible extent) isn't effective with 8bpp sprites.
Downsampling sprites using linear interpolation at load time is extremely effective for unmasked 32bpp sprites, and even the original 32bpp-anim "non anti-aliased" blitter produces wonderful results when coupled with it.
Using higher AA levels for 32bpp sprites coupled with load-time linear-interpolated downsampling is a useless waste of performance (even though my implementation does its best to reduce draw-time performance costs for that case). Sticking with 2xAA + 2 or 4 anim AA slots, or 4xAA + 4, 8 or 16 anim AA slots (if you have a decent multicore CPU and enable multithreading for palette animation) is enough. You could even skip the anti-aliased blitter entirely in this case and still get pretty enough rendering with the original 32bpp-optimized or 32bpp-anim blitters. That could be useful if you have a slow single-core CPU or are short on RAM.

The last thing I want to write about in this post is the consequences of the "hack" the blitter has to use to overcome the "visible seams between tiles" problem. If you've been patient enough to read up to this point (take a candy, drink a beer, do anything to make yourself feel like the wonderful person you really are) and have been following the text closely, you may wonder whether the "clever hack" has any negative impact on the quality of the rendered picture. I have bad news to share: there are some bad consequences and they are visible.

Here, take a look:

The left part of the picture was composed from screenshots of the game using the unmodified 8bpp baseset. Take a close look at how the radio tower and the oil refinery flame torch look at the 32x zoom out level. Pretty pixelated and aliased, aren't they?

On the right side of the picture you can see the same places/objects rendered with a special NewGRF activated that contains a small "trick" to effectively turn the "fix seams hack" off for these particular sprites. I will attach this NewGRF to the second post of the thread so you can decode it with grfcodec and take a look at the internals. The trick is extremely simple: the 8bpp base sprite was converted into a masked 32bpp sprite, with the mask filled with colour index 0 except for pixels that should be palette-animated and/or colour-remapped. The base sprite has its alpha channel modified so that all pixels with alpha == 0 are made "almost entirely" transparent (alpha changed to 1) and all pixels that were opaque (alpha == 255) are made "almost" opaque (alpha changed to 254). My blitter implementation is aware of this trick and treats any pixel with alpha <= 1 or >= 254 as transparent or opaque respectively, but does not apply the "clever hack" to pixels with alpha other than 0 or 255. Blitters that are unaware of this "trick" would suffer a performance loss (a major one if most of the screen is filled with sprites utilizing this trick) and next to no visual glitches. Sprite designers could make a special version of their sprite packs utilizing this transparency trick in case a blitter derived from my implementation (or any other using the same technique) lands in OpenTTD trunk. It could easily be scripted in GIMP: scale the alpha channel of the entire sprite into the 1..254 range and then scale it back to the 0..255 range using the selection mask appropriate for that case, if any. The mask should exclude the "top part" of the sprite and operate only on the "bottom" side of the sprite (where the line dividing "top" from "bottom" passes through the west and east corners of the ground tile).
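The alpha conventions of this trick can be written down as a short sketch. The function names are made up and the exact rounding of the 0..255 to 1..254 rescale is an assumption. Only the thresholds (<= 1, >= 254, exactly 0 or 255) come from the description above.

```python
def encode_trick_alpha(alpha: int) -> int:
    """Rescale alpha from 0..255 into 1..254 (0 -> 1, 255 -> 254)."""
    return 1 + alpha * 253 // 255


def classify_alpha(alpha: int) -> str:
    """How a trick-aware blitter treats a pixel with this alpha."""
    if alpha <= 1:
        return "transparent"
    if alpha >= 254:
        return "opaque"
    return "blend"


def seam_fix_applies(alpha: int) -> bool:
    """The seam hack is only applied to alpha of exactly 0 or 255."""
    return alpha in (0, 255)
```

A pixel stored with alpha 254 is therefore still drawn as opaque, but it is left alone by the seam fix, which is what lets the radio tower keep its anti-aliased edges.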

That's more than enough text IMO so here are "closing words":
Testing is needed. Suggestions are welcome. Bug reports are welcome as replies to this thread, and PMs would serve this purpose as well.
Thanks for spending your time reading this and trying this blitter.

Michi_cc · Post by **Michi_cc** » 02 Aug 2012 00:31

Big patch to read through, but the pics really do look nice

One thing I noticed (when reading from the back, people usually get sloppy there): why duplicate all those make colour functions just to change a / to a >>? Just change the function in the base class and add a comment like "Use >> instead of / to make stupid gcc happy".

Would the thread stuff be something the normal 32bpp blitter could profit from as well?

-- Michael Lutz

LeXa2 · Post by **LeXa2** » 02 Aug 2012 01:43

Michi_cc wrote:Big patch to read through, but the pics really do look nice

One thing I noticed (when reading from the back, people usually get sloppy there): why duplicate all those make colour functions just to change a / to a >>? Just change the function in the base class and add a comment like "Use >> instead of / to make stupid gcc happy".

Would the thread stuff be something the normal 32bpp blitter could profit from as well?

Well, the now-duplicated functions are one of the possible points where the normal blitter could get some benefit, in case profiling proves that the bloody mess produced by the gcc auto-vectorizer is really faster than the ordinary code produced with "-msse2 -O2". Another related patch that I'm going to release after a bit of refactoring and some additional testing would also affect all 32bpp blitters (it's the "Use linear-interpolated downsampling when resizing sprites if possible" patch that was used when taking some of the screenshots above). Actually you never know if something you do in one place could give you some insight into possible improvements in other places.

One of the things that comes to mind quickly is the possibility of porting the multithreading support for palette animation implemented for the anti-aliased buffer to its non anti-aliased counterpart - it could give a nice speedup when playing on a multicore system at high resolution (like the 1920x1200 I test blitters with), which could be especially useful for getting a faster "fast-forward".

P.S. Duplicating the functions was done to keep most of the changes in one place, so fewer conflicts are possible when people apply other patches on top of this one. In my local git development branch for this blitter the changes are made to the functions in the base class, and copying them into the blitter's own class was part of the "pre-release clean up" I did in a separate "blitter-release" branch.

Vaulter · Post by **Vaulter** » 02 Aug 2012 15:51

I had tried to update this cool patch to the trunk, but it deadlocks inside
Works only with blitter-32bpp-aa-anim-threads = -1

Are there any chances to fix this?

Thanks.

Lord Aro · Post by **Lord Aro** » 02 Aug 2012 19:16

Indeed, if you want this patch to get anywhere near trunk, you'll need to base against trunk, not a stable build as they are not as up-to-date

(http://wiki.openttd.org/FAQ_OpenTTD_versions)

LeXa2 · Post by **LeXa2** » 02 Aug 2012 19:56

Vaulter wrote:I had tried to update this cool patch to the trunk, but it deadlocks inside
Works only with blitter-32bpp-aa-anim-threads = -1

Are there any chances to fix this?

Things would be much better if OpenTTD's platform-agnostic threading classes supported semaphores. For now I have to use tons of mutexes to simulate what one semaphore would do, and this simulation isn't perfect. Would you please post your platform and kernel+libc+libstdc++ versions (in case that's Linux)? It's also important to know which video driver you use and whether that video driver is itself running in threaded mode.

Threading issues are unlikely to be related to the fact that the patch I posted is against 1.2.1. According to the git logs, the last changes to /src/threads and /src/blitters in current trunk were made before 1.2 was branched (well, except one minor change in r24111 that does not affect my blitter in any way), so it's something else, most probably a simple lack of testing on my side on platforms other than Win7 64.

P.S. I'll check today how things behave with trunk on my workstation and report back.

LeXa2 · Post by **LeXa2** » 02 Aug 2012 20:13

Lord Aro wrote:Indeed, if you want this patch to get anywhere near trunk, you'll need to base against trunk, not a stable build as they are not as up-to-date

(http://wiki.openttd.org/FAQ_OpenTTD_versions)

I have already answered this question in detail on these forums, so here is a quick summary:

1. I have no direct intent to get this patch into trunk. It is a pretty big and complicated piece of code that could be hard to maintain without deep knowledge of the algorithm used and its "clever tricks/hacks". In its current form it is possibly incompatible with some GRFs out there (and it is not even totally compatible with the existing basesets: some colour remaps are simply ignored with it). This incompatibility could be fixed with a one-line change, but that would lead to turtle-like performance with any 8bpp baseset out there, making this blitter useless, and why would someone want to have a useless piece of code in trunk? This issue is something that I'm going to address in future releases (check the "future plans" section in the second post of this thread), but it would also come at the price of an even higher memory footprint. So making this patch directly compatible with trunk so it could get in isn't a high priority for me. What I care about more is having this patch/blitter apply cleanly and work without problems with the latest officially released version of OTTD, the one that people are most likely to use in the real world.

2. I keep an eye on the changes that keep landing in trunk, and as far as I can tell there haven't been any show-stopper changes since the branching of 1.2 that should affect this blitter. It applies cleanly to trunk (except for some trailing whitespace errors if you apply it using "git apply" instead of "patch -p1 <", which is harmless) and should in general work the same as with 1.2.1 and a number of earlier versions.

LeXa2 · Post by **LeXa2** » 02 Aug 2012 23:52

LeXa2 wrote: ... (well, except one minor change in r24111 that does not affect my blitter in any way) ...

Well, apparently it really does. I've updated the patch so it is now compatible with both trunk and 1.2.1, and will shortly upload the source for the new version g8fbe2f58 as an attachment to the first post. There's no need to update the binaries for now, as the only change is that I copied some more functions from the base class into the blitter class, so it now depends less on the protected and public interfaces of its ancestors. In any case these changes do not interfere with anim threading in any major way, so the deadlock problem is unlikely to be caused by this. Time to fire up my secondary workstation with a Mint 10 installation and check how the patch behaves under a normal POSIX-compatible OS with pthreads support in place.

Upd. So I tracked it down to a problem related to the differences in signal delivery semantics between the Win32/64 threads implementation and pthreads. Looks like I need to read some man pages on pthreads to get an idea of what's really wrong and how to properly fix it. Grrrr, I really want semaphore there, it'd simplify things greatly.

Upd 2. Fixed. Pthreads semantics require the signaler to lock the mutex (enter the "critical section") before sending a signal, while the WinXX "mutex" implementation, as done in OpenTTD, uses events as the signalling channel. I thought that OTTD's SendSignal implementation took care of locking the mutex before sending a signal and unlocking it immediately afterwards, but I was wrong. I will upload a new version of the patch as soon as I fix yet another bug that happens with the zBase baseset when "transparency" is activated for trees. BTW the patch works wonderfully well WRT performance and visuals when used with zBase under Linux. I'm pretty impressed.
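The pattern behind this fix can be sketched with standard C++11 primitives. This is an illustration only, not OpenTTD's threading API: the `WorkerGate` class and the `RunGateDemo` helper are hypothetical names. The point is that the signaler changes the shared flag and notifies while holding the mutex, and the waiter re-checks the flag, so a signal sent before the waiter starts waiting is not lost.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper: a "go" gate between one signaler and one waiter.
class WorkerGate {
public:
    // Waiter side: block until Signal() has been called, then consume it.
    void Wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate protects against spurious wakeups and against a
        // signal that was sent before this thread started waiting.
        cond_.wait(lock, [this] { return ready_; });
        ready_ = false;
    }

    // Signaler side: enter the critical section, update the shared state
    // and notify before leaving it.
    void Signal() {
        std::lock_guard<std::mutex> lock(mutex_);
        ready_ = true;
        cond_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    bool ready_ = false;
};

// Starts a worker that waits on the gate, releases it, and returns the
// value the worker wrote after being released.
int RunGateDemo() {
    WorkerGate gate;
    int result = 0;
    std::thread worker([&] {
        gate.Wait();
        result = 42;
    });
    gate.Signal();
    worker.join();
    return result;
}
```

An event-style channel, as described for the WinXX backend, remembers that it was set; a bare condition signal does not, which is why the flag guarded by the mutex matters on the pthreads side.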

Michi_cc · Post by **Michi_cc** » 03 Aug 2012 10:48

LeXa2 wrote:Grrrr, I really want semaphore there, it'd simplify things greatly.

The other parts of the OTTD thread abstraction didn't fall from the sky, you know

You could simply add a semaphore there (don't worry about the OS/2 and Morphos stuff, they are probably suffering from severe bitrot anyway).

-- Michael Lutz

LeXa2 · Post by **LeXa2** » 03 Aug 2012 11:00

Michi_cc wrote:...(don't worry about the OS/2 and Morphos stuff, they are probably suffering from severe bitrot anyway)...

OS compatibility was the main reason I decided not to extend the OTTD threading class with semaphore support. Linux and Windows are the platforms I have "on hand" to develop on and test with; exotics like OS/2 or Morphos are not something I want to spend time dealing with. If it's OK to implement semaphore support only for the pthreads and win32/64 backends, and that work would have a chance of being accepted into trunk, I'd consider doing that.
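For reference, a counting semaphore is a small object. The sketch below builds one from a mutex and a condition variable using standard C++11, purely to show the semantics being asked for. The `Semaphore` class and its method names are hypothetical and are not part of OpenTTD; a real pthreads or win32/64 backend would more likely wrap the native primitives (`sem_wait`/`sem_post` on POSIX, `CreateSemaphore`/`ReleaseSemaphore` on Windows).

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical counting semaphore built from a mutex and a condition variable.
class Semaphore {
public:
    explicit Semaphore(unsigned initial = 0) : count_(initial) {}

    // Release one permit and wake one waiter, if there is any.
    void Post() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++count_;
        cond_.notify_one();
    }

    // Block until a permit is available, then take it.
    void Wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

    // Take a permit only if one is available right now; never blocks.
    bool TryWait() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (count_ == 0) return false;
        --count_;
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    unsigned count_;
};
```

Because the permit count is remembered, a `Post()` that happens before the matching `Wait()` is not lost, which is the property that is awkward to simulate with mutexes alone.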

Vaulter · Post by **Vaulter** » 03 Aug 2012 13:12

It's better to think not about OS/2 or Morphos, but about iOS, Android and such

LeXa2 · Post by **LeXa2** » 03 Aug 2012 13:53

Vaulter wrote:It's better to think not about OS/2 or Morphos, but about iOS, Android and such

Sure, but these are not the case for current OTTD trunk. What about your problems with the deadlock, have they been fixed for you by the latest patch version?

Vaulter · Post by **Vaulter** » 03 Aug 2012 18:20

LeXa2 wrote:Sure, but these are not the case for current OTTD trunk. What about your problems with the deadlock, have they been fixed for you by the latest patch version?

Yes! Thanx! Included it in the patchpack, but I cannot compile it with MSVC using the default options:

Code: Select all

6>..\src\blitter\32bpp_anim_aa.cpp(61): warning C4267: '=' : conversion from 'size_t' to 'uint', possible loss of data
6>..\src\blitter\32bpp_anim_aa.cpp(1303): error C2057: expected constant expression
6>..\src\blitter\32bpp_anim_aa.cpp(1303): error C2466: cannot allocate an array of constant size 0
6>..\src\blitter\32bpp_anim_aa.cpp(1303): error C2133: 'tmp_pix' : unknown size

But with GCC it compiles and runs fine, because GCC is not in pedantic mode by default

LeXa2 · Post by **LeXa2** » 04 Aug 2012 04:27

Vaulter wrote:
LeXa2 wrote:Sure, but these are not the case for current OTTD trunk :-). What's about your problems with deadlock, had them been fixed for you by latest patch version?
Yes! Thanx! Included to patchpack, but I cannot compile it by msvc with default options:
Code: Select all
6>..\src\blitter\32bpp_anim_aa.cpp(61): warning C4267: '=' : conversion from 'size_t' to 'uint', possible loss of data
6>..\src\blitter\32bpp_anim_aa.cpp(1303): error C2057: expected constant expression
6>..\src\blitter\32bpp_anim_aa.cpp(1303): error C2466: cannot allocate an array of constant size 0
6>..\src\blitter\32bpp_anim_aa.cpp(1303): error C2133: 'tmp_pix' : unknown size
But for GCC it compiles and runs fine. Because last one is not in pedantic mode by default ;)

Good to hear, thanks for reporting. I'll fix both. The first one is harmless: structure sizes there are guaranteed to be much less than ((1 << (sizeof(int) * 8)) - 1), but doing an explicit type conversion there or changing the target variable to size_t would make the warning go away. The second one is simply an error of automation: initially I used preprocessor macros to set AA_LEVEL and other settings, and while converting them into ini-configurable vars one of the things done was a simple "Find/replace all". GCC has no problems with allocating an array of dynamic size on the stack, while MSVC seems to be pedantic about that. No problem, I'll convert it to malloc/free or new/delete to be on the safe side.

Michi_cc · Post by **Michi_cc** » 04 Aug 2012 13:32

LeXa2 wrote:GCC have no problems with allocating array of dynamic size on stack while MSVC seems to be pedantic about that.

Allocating on the stack is no problem with MSVC either, but you have to use alloca() instead of the non-standard GNU syntax.

-- Michael Lutz

LeXa2 · Post by **LeXa2** » 04 Aug 2012 14:14

Michi_cc wrote:
LeXa2 wrote:GCC have no problems with allocating array of dynamic size on stack while MSVC seems to be pedantic about that.
Allocating on the stack is no problem with MSVC either, but you have to use alloca() instead of the non-standard GNU syntax.

Good to know, thanks. In any case I have already rewritten it to malloc the required buffers in the blitter constructor and free them in the destructor. The performance difference should be next to none for this particular case (time spent on computations and control logic is expected to be several orders of magnitude higher than time spent on memory accesses).
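A minimal sketch of that approach follows, assuming a scratch buffer whose size is only known at run time. The `ScratchOwner` class is a hypothetical stand-in for the blitter, and `tmp_pix` is borrowed from the MSVC error message only as a member name; the actual layout in the patch may well differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for the blitter: it owns a scratch buffer whose size
// is known only at run time (for example, derived from an ini setting).
class ScratchOwner {
public:
    explicit ScratchOwner(std::size_t pixel_count)
        : count_(pixel_count),
          // Request at least one element so a zero-sized request cannot
          // legitimately come back as a null pointer.
          tmp_pix_(static_cast<std::uint32_t *>(
              std::malloc((pixel_count > 0 ? pixel_count : 1) * sizeof(std::uint32_t)))) {
        if (tmp_pix_ == nullptr) throw std::bad_alloc();
    }

    ~ScratchOwner() { std::free(tmp_pix_); }

    // The buffer has a single owner; copying would lead to a double free.
    ScratchOwner(const ScratchOwner &) = delete;
    ScratchOwner &operator=(const ScratchOwner &) = delete;

    std::uint32_t *Data() { return tmp_pix_; }
    std::size_t Count() const { return count_; }

private:
    std::size_t count_;      // size_t avoids the C4267 narrowing warning
    std::uint32_t *tmp_pix_; // allocated once, reused by every call
};
```

Compared with a stack array sized per call, this allocates once per blitter instance, so it behaves the same way under GCC and MSVC and does not depend on how much stack a worker thread has.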

Vaulter · Post by **Vaulter** » 04 Aug 2012 21:24

LeXa2 wrote:Good to know, thanks. In any case I had already rewritten it into malloc'ing required buffers in blitter constructor and freeing them in destructor. Performance difference should be next to none for this particular case (time spent on computations and control logic expected to be by several orders higher than time spent on memory accesses).

Sounds good!
But where can we get it?

LeXa2 · Post by **LeXa2** » 06 Aug 2012 07:35

Vaulter wrote:
LeXa2 wrote:Good to know, thanks. In any case I had already rewritten it into malloc'ing required buffers in blitter constructor and freeing them in destructor. Performance difference should be next to none for this particular case (time spent on computations and control logic expected to be by several orders higher than time spent on memory accesses).
Sounds good!
But where we can get it ??

Wait a bit. I'm in the process of rewriting palette-animation threading (it was mostly done over the past weekend) and trying to catch a bug related to palette animation in general that causes mouse cursor updates to be jerky in some cases, even in non-threaded palette-animation mode with anti-aliasing turned off for palette-animated pixels. There's definitely something I did wrong there, but I haven't figured out what exactly it is.

P.S. I'm thinking about creating a mirror of my local git repo I use for development on my github page but I don't know yet how to organize it so I won't be forced to import the entire OTTD source there. This task isn't high on my priority list though.

LeXa2 · Post by **LeXa2** » 14 Aug 2012 21:39

After a bunch of delays related to addressing issue "why does the rendering produced by 32bpp-anim-aa blitter at 32x zoom out level with 2x AA + 1 anim slot differ from rendering produced by original 32bpp-anim blitter when using linear sprites downsampling patch and zBase baseset" I'm finally almost ready to release next version of the 32bpp-anim-aa blitter.

This release is mostly a bug fix + speed optimization of the palette animation handling. Due to the limits set by the board admin I can no longer edit the second message of this thread, so I'll soon reorganize things a bit, moving the "gory details" out of the way from the first post. From here on, changelogs will be posted as separate messages at the tail of the thread, with a link to them from the first post.
Below follows the changelog for the upcoming release which is 1-to-1 capture of the "git log" of my local development repo.
I expect to upload the patch release itself in the next few hours. Binary builds for Win32/64 would follow a bit later (I'm building/testing them manually so it would take some time).

Changes in 32bpp-anim-aa blitter version 043131c:

Code: Select all

Patch release branch:
==========================================================================
commit 9bfe5e3
CommitDate: Tue Aug 14 06:06:28 2012 +0400

    Port b0a61f from misc-changes branch so isRemappedColour() would be
    available in separate patch targeting OTTD 1.2.x and 1.3.x trunk.

commit b8793a8
CommitDate: Tue Aug 14 06:06:27 2012 +0400

    Copy some more procs from the base class so a patch would be compatible
    with both 1.2.1 and trunk (as of today).

Main development branch:
===========================================================================
commit 043131c
CommitDate: Tue Aug 14 06:05:03 2012 +0400

    Fix arithmetic error in FORCE_XXX_HACK implementation that could
    potentially cause access violation for sprite width == 1 case.

    This bug also had a slight impact on the rendered picture,
    potentially forcing pixels that shouldn't have been made transparent
    into being fully transparent and forcing pixels that shouldn't have
    been made opaque into being opaque.

commit 4405abd
CommitDate: Sun Aug 12 00:29:53 2012 +0400

    Use isRemappedColour() implementation from the base class instead of
    introducing our own.

    Blitter::isRemappedColour() was introduced in misc-changes local
    branch which is now the "rebase base" for 32bpp-anim-aa blitter local
    development branch.

commit 4d78bc9
CommitDate: Sat Aug 11 23:08:05 2012 +0400

    Reimplement palette-animation in a much faster way basing on results
    of the extensive benchmarking session.

    Speed is still not as good as one might wish but at least now the
    game is playable on a middle-range 8 "core" AMD FX CPU with
    aa_anim_slots == 16. And palette animation no longer sucks badly for
    the rigid case of aa_anim_slots == 1.

    Some figures for 1920x1200 output resolution and AMD FX 8120 CPU:

    Original 32bpp_anim blitter performs one PaletteAnimate() call in
    ~4ms for the case of the entire screen filled with palette-animated
    pixels. With aa_anim_slots == 1 and anim_threads == 1 our older
    implementation sucked badly at around ~125ms per call. New
    implementation gives way better ~7ms for same case (yep, it's 17+
    times faster thanks to optimizations). With 2 threads it gives
    ~3.8ms per call - this is faster than the original blitter handling
    the same job. With threads amount between 4 and 64 it is possible to
    get per-call times at ~2.3-2.5ms range which is almost twice as fast
    compared to original and is in fact more than enough for smooth
    gameplay.

    With aa_anim_slots == 4 + single threaded mode + entire screen is
    filled with palette-animated pixels + any zoom level other than
    "fully zoomed in" (so blitter is forced into converting 8bpp->32bpp
    and blending in 9.2MPixels per PaletteAnimate() call) average PA()
    call completes in 31ms. It is the "required minimum" performance
    level for the game to be "playable" - that's the ~32FPS level which
    is about the same interval the game uses for internal "world ticks".
    Using 2 threads doubles the performance to a ~16ms (62FPS) per call.
    Using 8 threads cuts that down to ~10ms (100FPS) and using 64
    threads (yes, it is profitable in some cases to use 4x, 8x or even
    16x "CPU cores" amount of threads with the synchronization scheme
    used) ends up with ~7ms (140FPS). In other words the game is
    perfectly playable with aa_anim_slots == 4 and anim_threads == 2 on
    any modern dual-core CPU (excluding Intel Atom and AMD Cxx/Exxx/Zxx)
    and is at "playable performance level" in single-threaded mode on a
    fast single-core CPU.

    Same case but with aa_anim_slots == 16 and 64 threads: PA() call
    typically takes ~22ms (45FPS) giving amazing ~1.6 GigaPixels/s
    processing performance. By increasing threads to 127 and with a bit
    of luck (OS scheduler should be in a good mood to interoperate well
    with the blitter) one could top at ~18ms per call for game
    running in "fast-forward" mode - and that is fantastic 2GPix/s
    performance!

    Next target to optimize is an overcomplicated Draw() call which most
    probably would become even more complicated (and, maybe, split
    into several parts) after optimizing.

commit 7b56787
CommitDate: Sat Aug 11 23:08:04 2012 +0400

    Remove a comment about a potential thread-safety issue with the anim
    buffer that turned out to be wrong, and remove some related
    excessive code.

    Background: Inspecting existing video drivers showed that despite
    PaletteAnimate() being called by the thread other than one that
    calls Draw() for some drivers (at least win32_v does it when running
    in threaded mode) it is ensured that there's no concurrent access to
    the blitter and video buffer by threads thus it is not possible to
    have a race condition between PaletteAnimate() and Draw() as long as
    things are kept like this. Cache coherency and memory updates are
    also not a thing to worry about here because a PA() call is
    separated by lock in video driver from calls to D(). It implies full
    memory barrier forced by threading lib.

commit 15ada4d
CommitDate: Sat Aug 11 23:08:03 2012 +0400

    Slightly rework the way ini settings are handled in constructor to
    be more correct and improve auto-detected threads count to be
    more appropriate (based on the recent benchmarks I've done).

commit 4f6cde4
CommitDate: Sat Aug 11 23:08:03 2012 +0400

    Fix a dumb error made everywhere in comparisons detecting if a pixel
    is from the palette-animated range.

commit f61be1c
CommitDate: Sat Aug 11 23:08:02 2012 +0400

    Fix line spacing and add some handy comments here and there.

commit 54f5d5a
CommitDate: Sat Aug 11 23:08:01 2012 +0400

    Rework palette animation threading in a proper way.

    Previous implementation was:
    a) Merely wrong and bugged;
    b) Used much more resources than was really required;
    c) Most of the times could result in worse performance than the
    single-threaded codepath.

    New implementation is way better but still could result in a
    performance being less than for single-threaded case due to possible
    problems with OS threads scheduling granularity. Worst case would be
    three threads each completing its job in a timeslot that's slightly
    longer than OS threads scheduler time-slot. Bad luck strikes as with
    current speed of the doPaletteAnimate() implementation average
    per-thread job completion time would be at around 50-100ms (yeah,
    currently speed is THAT BAD; I'm working hard on improving it) for
    the case of 2 AA SLOTS + entire area filled with palette-animated
    pixels. It means that multithreading would only be profitable for
    two threads case (giving ~2x speed boost as expected) and with
    larger number of threads most probably performance would drop to
    more than 100ms per entire PaletteAnimate() call on Win32/64 with 
    its thread scheduler default time-slice of ~10ms. Sadly, benchmarking
    proves this. Reworking doPaletteAnimate() so it would give better
    speed is a key for multithreading to be worth using for 3 and more
    working threads case. Using spinlocks instead of OS scheduler driven
    services to synchronize threads is another possibility here but it
    has side-effects (portability, excessive CPU hogging) I don't want
    to deal with currently.

commit c70225a
CommitDate: Sat Aug 11 23:08:01 2012 +0400

    Improve sprite encoding so it would not classify fully transparent
    pixels as semi-opaque for some rare cases.

    This design deficiency was found during playtesting blitter with
    zBase baseset + 4xAA + 8 anim slots. One of the trees have a 4x4
    block of pixels with 15 among them being fully transparent and one
    having alpha == 7. Pre-commit algo treated this case as a "class 2"
    pixel while in reality after scaling alpha to include 15
    transparent "samples" in the resulting pixel it ended up being fully
    transparent - as expected due to ((int)(7/16) == 0). The innocent
    Draw() code wasn't expecting that "class 2" pixels could ever have
    zeroed alpha and hit a DIV-BY-ZERO bug due to this. It was fixed in
    c32b706, but the fact that the situation happened in wild life with
    the existing Encode() implementation served as a signal to do
    investigation which ended up with this commit.

commit 0142d2a
CommitDate: Sat Aug 11 23:08:00 2012 +0400

    Fix memory leak introduced in eb17715 that is extremely unlikely to
    happen in real world use.

commit 85860f4
CommitDate: Sat Aug 11 23:07:59 2012 +0400

    Fix some more cases of wrongly used mutexes and fix "fallback to
    non-threaded mode" code that could cause a crash in the unlikely
    situation of threads creation failure.

commit 71b0450
CommitDate: Sat Aug 11 23:07:59 2012 +0400

    Refactor method of allocating temporary storage for anim threads not
    to use GCC-specific implicit invocation of alloca().

    Without these changes MSVC wasn't able to compile the code. Type
    change from int to size_t is there to silence MSVC warning about
    possible overflow which shouldn't ever happen for this particular
    case.
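Two pieces of arithmetic in the changelog above can be checked with a short sketch: the per-call pixel counts and throughput quoted for commit 4d78bc9, and the truncating alpha average from commit c70225a. The helper names are hypothetical, and the "one blend per anim slot per screen pixel" model is an assumption inferred from the quoted 9.2 MPixels figure for 1920x1200 with 4 slots, not something taken from the patch source.

```cpp
#include <cstdint>

// Calls per second (the "FPS" figure) that a given per-call time allows.
constexpr double CallsPerSecond(double ms_per_call) {
    return 1000.0 / ms_per_call;
}

// Pixels blended per call, assuming one blend per anim slot per screen pixel.
constexpr std::uint64_t PixelsPerCall(std::uint64_t width, std::uint64_t height,
                                      std::uint64_t anim_slots) {
    return width * height * anim_slots;
}

// Throughput in gigapixels per second for a given per-call time.
constexpr double GigaPixelsPerSecond(std::uint64_t pixels_per_call, double ms_per_call) {
    return static_cast<double>(pixels_per_call) * CallsPerSecond(ms_per_call) / 1e9;
}

// Integer average of an AA block: summed alpha divided by the sample count.
// Truncating division is what turns a lone alpha of 7 in a 4x4 block into 0.
constexpr unsigned AverageAlpha(unsigned alpha_sum, unsigned samples) {
    return alpha_sum / samples;
}
```

With these helpers, 1920x1200 at 4 slots gives 9,216,000 pixels per call, and 16 slots at 22 ms per call comes out at roughly 1.68 gigapixels per second, in line with the figures in the commit message. `AverageAlpha(7, 16)` is 0, which is the fully transparent "class 2" pixel that tripped the division in Draw().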
