Wouldn't the most flexible solution be to have two separate graphics engines?

I appreciate this would (I'm assuming) take some major decoupling of the game engine and the graphics engine, to the point that the game / screen area state is sent directly to the graphics engine, which then decides exactly how to render it, i.e. a client / server model. So the game includes a graphics server with a defined interface, and then a graphics client to actually display it.
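To make the split a bit more concrete, here is a rough sketch of what that interface could look like. Everything in it (TileState, GraphicsClient, GraphicsServer, TextClient and their fields) is made up for illustration and is not taken from the existing code; the stand-in client just returns strings so the idea can be tried without a screen.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class TileState:
    """Hypothetical snapshot of one visible tile, as sent by the game."""
    x: int
    y: int
    object_id: str        # e.g. "level_crossing"; the names are invented
    events: tuple = ()    # e.g. ("train_present",)


class GraphicsClient(ABC):
    """Anything that can turn screen-area state into a picture."""

    @abstractmethod
    def render(self, tiles):
        ...


class TextClient(GraphicsClient):
    """Stand-in client: 'renders' to a list of strings instead of a screen."""

    def render(self, tiles):
        return [f"{t.object_id}@({t.x},{t.y})" for t in tiles]


class GraphicsServer:
    """Lives in the game engine; knows nothing about how drawing is done."""

    def __init__(self, client):
        self.client = client

    def push_state(self, tiles):
        # The game only hands over state; the client decides how to draw it.
        return self.client.render(tiles)
```

The vanilla 8-bit renderer and the new engine would then simply be two different GraphicsClient implementations behind the same server.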
The original / vanilla client would be the existing system (decoupled of course), supporting 8-bit colour and the original/existing graphics format. That way people on lower end machines are still ok.
The new engine would use a new graphics format for its input, and be generally lovely and extensible and nice. As such, the suggestion to use XML to tag the files is a good one; that way extra animations and other features may be easily catered for. The new graphics format could use sprites as big (in resolution) as you want (though I'd tend to think that 4x the current resolution would be plenty for most people).
Each object would be a tar file containing a 'master' XML file describing what all the graphics in the archive are for, their size, etc. The images could be further tar'ed within the object tar in order to keep them reasonably organised. The details could be handled by the XML file. The size of the object's tar file would be quite reasonable, since every graphic would be compressed in the PNG format, and the graphics would still be large compared to the uncompressed XML file. Therefore little efficiency is lost by not compressing the XML.
The whole graphics set (collection of object tars) could be wrapped up in another .tar file. That solves the clutter problem.
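As a rough illustration of the object tar idea, the sketch below packs a master XML file and a placeholder image into an in-memory tar using Python's standard tarfile module, then reads the master file back. The XML element and attribute names, the file names and the helper functions are all assumptions for the sake of the example; no such schema has been agreed.

```python
import io
import tarfile
import xml.etree.ElementTree as ET

# Hypothetical master file; element and attribute names are assumptions.
MASTER_XML = b"""<object name="level_crossing" scale="4">
  <master file="master.png" width="256" height="128"/>
  <animation event="train_present" fps="2">
    <frame file="lights_0.png"/>
    <frame file="lights_1.png"/>
  </animation>
</object>"""


def build_object_tar(files):
    """Pack {name: bytes} into an uncompressed in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_master(tar_bytes):
    """Return the parsed master XML root from an object tar."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        member = tar.extractfile("master.xml")
        return ET.fromstring(member.read())


def animation_frames(root):
    """Map each animation event to its ordered list of frame files."""
    return {
        anim.get("event"): [f.get("file") for f in anim.findall("frame")]
        for anim in root.findall("animation")
    }
```

The whole graphics set would just be another tar built the same way, with each object tar as one member.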
Now onto the more technical suggestions:
The actual sprites:
32-bit (RGBA) PNG images seem to be the way to go, as PNGs are both feature-rich and compressed. Resolutions that are multiples of the original graphics set are sensible (the suggested default of 4x seems good). As has been said by many people, they can be scaled up or down as necessary for the game and anti-aliased as desired by the player.
The images needed per object:
1. A master image, which is a static picture of the object. A normal greyscale could be used for the parts that have to be rendered in game with the player colour (see point 3).
2. Animation overlays - a set of images for each animation sequence that get sequentially overlaid on the main image - details such as animation speed / order etc can be specified by the XML file (or sub-file).
3. Player (or random for town buildings etc) colour overlays - a PNG bitmask for every image in (1) and (2), specifying which bits should be coloured to the player colour. It may be possible to push this mask into each PNG image; I'm not up to speed on the full PNG spec and whether you can, in effect, define multiple layers. I think it's important for the individual sprites to follow the PNG spec, but as long as they do, it should be fine either way.
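Points 1 and 3 together could work something like the sketch below. It is only one possible reading: the image is represented as plain rows of (r, g, b, a) tuples rather than a decoded PNG, the function name is invented, and using the grey level as a brightness factor is my assumption about how the tinting would be done.

```python
def apply_player_colour(pixels, mask, colour):
    """Tint the masked pixels of an RGBA image with the player colour.

    pixels: list of rows of (r, g, b, a) tuples (the greyscale master image)
    mask:   list of rows of booleans, True where the player colour applies
    colour: (r, g, b) player colour
    """
    out = []
    for pixel_row, mask_row in zip(pixels, mask):
        row = []
        for (r, g, b, a), masked in zip(pixel_row, mask_row):
            if masked:
                # Assumption: the grey level acts as brightness, so the
                # artist's shading survives the recolouring.
                level = r / 255
                row.append(tuple(round(c * level) for c in colour) + (a,))
            else:
                row.append((r, g, b, a))
        out.append(row)
    return out
```

The same function would be run over each animation overlay with its own mask, so the client never needs pre-coloured copies of a sprite per player.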
Now when the graphics client gets game state information from the server, it can decide for itself how it should be rendered. The client should have the 'intelligence' to know (for example) that if a train is at a level crossing then the level crossing lights etc need to be animated. I.e. the client should contain all the rules necessary for drawing the game based only on the current game state. That way more animations can also be added to an object, and the appropriate rules for using them can be embedded in either the client or the XML descriptor of the object. I'd tend to say the client should have the rules hardcoded, for a decent execution rate. Then it's up to the client to support as many animation 'events' as possible, and up to the graphic artists to include whatever subset of those animations they want. If an animation is missing then it simply isn't displayed; for example, if the train engine has no spinning wheels animation then the client just moves the static sprite around the screen.
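A minimal sketch of the hardcoded-rules idea with the "missing animation is simply skipped" fallback might look like this. The event names, animation names and the overlays_for function are all invented for the example; the real set of events would be whatever the client ends up supporting.

```python
# Hypothetical hardcoded rules: game-state event -> animation name.
RULES = {
    "train_present": "crossing_lights",
    "moving": "spinning_wheels",
}


def overlays_for(events, available, tick):
    """Pick the overlay frame to draw for each active event.

    events:    iterable of event names taken from the game state
    available: {animation name: [frame, ...]} supplied by the object's artist
    tick:      current animation tick, used to cycle through the frames
    """
    chosen = []
    for event in events:
        frames = available.get(RULES.get(event))
        if not frames:
            continue  # the object has no such animation: draw nothing extra
        chosen.append(frames[tick % len(frames)])
    return chosen
```

An object that only ships crossing lights would still render fine for a "moving" event; the client just gets an empty overlay list and draws the static master image.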
This is very much a brain dump of about half an hour, so I suspect it will be full of flaws. Please criticize gently.

Edit: This is of course a very long term thing and I appreciate that.