Hi Andrew,
Just some comments about the "decompiled" code. It usually reads much better when rewritten as proper standard C, using user-defined structures, high-level functions (RtlUshortByteSwap), etc.
I've included a rewrite of the "stage1" function below as an example.
I also name a structure "MYSTRUCT1" when I don't know what it does, or a field "UCHAR u0C[0x10];" when I know its offset (0xC) but not its purpose, until I figure it out and rename it later. It's quite a useful methodology when working on a large project; that's what makes the difference between a human and Hex-Rays :-)
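To make the placeholder naming concrete before the Stage1 rewrite that follows, here is a minimal sketch of what such a structure might look like early in an analysis. The layout is purely hypothetical (every field other than u0C is invented for illustration), and uint8_t/uint32_t stand in for UCHAR/ULONG so it compiles outside the Windows headers. The static assertions check that the placeholder names still match the real offsets as fields get renamed.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical structure: nothing is understood yet, so each field is
 * named after its offset until its purpose is figured out. */
typedef struct _MYSTRUCT1 {
    uint32_t u00;        /* offset 0x00, purpose unknown */
    uint32_t u04;        /* offset 0x04, purpose unknown */
    uint32_t u08;        /* offset 0x08, purpose unknown */
    uint8_t  u0C[0x10];  /* offset 0x0C, 0x10 bytes, purpose unknown */
    uint32_t u1C;        /* offset 0x1C, purpose unknown */
} MYSTRUCT1;

/* Fail the build if a later edit shifts a field away from the offset
 * its placeholder name claims. */
static_assert(offsetof(MYSTRUCT1, u0C) == 0x0C, "u0C must sit at offset 0x0C");
static_assert(offsetof(MYSTRUCT1, u1C) == 0x1C, "u1C must sit at offset 0x1C");
```

Once a field's role is understood, only its name changes; the assertion on its offset can stay as a regression check.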
typedef struct _DECRYPTDATA {
	UCHAR DecryptionRoutine[0x1f]; // Contains Stage1 code
	USHORT EncodedCode[0x900]; // Contains the "encoded" code to be executed
} DECRYPTDATA, *PDECRYPTDATA;
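VOID
DecodeData(
	PDECRYPTDATA Input
)
{
	ULONG i;
	
	// Input points to the current address, i.e. the equivalent of &Stage1.
	
	// EncodedCode is a USHORT array: divide by the element size to get the element count.
	for (i = 0; i < sizeof(Input->EncodedCode) / sizeof(Input->EncodedCode[0]); i++)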
	{
		Input->EncodedCode[i] = RtlUshortByteSwap(Input->EncodedCode[i]);
	}
	
	//
	// The encoded code (Input->EncodedCode) immediately follows the decryption routine
	// in memory and is decoded in place by the loop above.
	//
	_asm _emit 0xEE  
	_asm _emit 0xD9
	_asm _emit 0x74
	_asm _emit 0xD9
	// [...]
}
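
For readers without the Windows headers at hand, the same decoding step can be sketched in portable C. This is only an illustration under stated assumptions: swap16 and decode_bytes are hypothetical helper names, swap16 stands in for RtlUshortByteSwap, and the byte-oriented variant reproduces the in-memory effect of swapping each 16-bit word regardless of host endianness.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for RtlUshortByteSwap: exchange the two bytes
 * of a 16-bit value. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Hypothetical helper: swap every adjacent byte pair of a buffer in place.
 * A trailing odd byte, if any, is left untouched. */
static void decode_bytes(uint8_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        uint8_t tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

Applied to the four bytes emitted above (0xEE, 0xD9, 0x74, 0xD9), decode_bytes yields 0xD9, 0xEE, 0xD9, 0x74. Since the swap is its own inverse, running it a second time restores the encoded form.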