Introduction
============
This project aims to give a simple overview of how good various x64 hooking
engines (on Windows) are. I'll write various functions that are hard to
patch and then see how each hooking engine does.

I'll test:

* [EasyHook](https://easyhook.github.io/)
* [PolyHook](https://github.com/stevemk14ebr/PolyHook)
* [MinHook](https://www.codeproject.com/Articles/44326/MinHook-The-Minimalistic-x-x-API-Hooking-Libra)
* [Mhook](http://codefromthe70s.org/mhook24.aspx)

(I'd like to test Detours, but I'm not willing to pay for it. So that isn't
tested :( )

There are multiple things that make hooking difficult. Maybe you want to patch
while the application is running -- in that case you might get race conditions,
as the application is executing your half-finished hook. Maybe the software has
some self-protection features (or other software on the system provides that,
e.g. Trusteer Rapport).

Evaluating how the hooking engines stack up against that is not the goal here.
Neither are non-functional criteria, like how fast an engine is or how much
memory it needs for each hook. This is just about the challenges posed by the
function to be hooked itself.

Namely:

* Are jumps relocated?
* What about RIP-relative addressing?
* If there's a loop at the beginning / if it's a tail recursive function, does
  the hooking engine handle it?
* How good is the disassembler, how many instructions does it know?
* Can it hook already hooked functions?

First I'll give a short walkthrough of the architecture, then quickly go
over the test cases. After that come the results and an evaluation for each
engine.

I think I found a flaw in all of them; I'll publish a small POC which should at
least detect the existence of problematic code.

**A word of caution**: my results are worse than expected, so do assume I have
made a mistake in using the libraries. I went into this expecting that at least
some engines would try to detect e.g. the loops back into the first few
bytes. But none did? That's gotta be wrong.

**Another word of caution**: parts of this are rushed and/or ugly. Please
double check parts that seem suspicious. And I'd love to get patches, even for
the most trivial things -- spelling mistakes? Yes please.


Architecture
============
This project is made up of two parts: a .DLL with the test cases and an .exe
that hooks those, tests whether they still work and prints the results.

(I could have done it all in the .exe, but this makes it trivial to (at some
point) force the function to be hooked and the target function to be more
than 2GB apart. Just set fixed image bases in the project settings and you're
done.)

My main concern was automatically identifying whether the hook worked. I
consider a hook to work if: a) the original function can still execute
successfully *and* b) the hook was called.

Criterion a) is really similar to a unit test: verify that a function
returns what is expected. So for a) the .exe just runs unit tests after all the
hooks have been applied. Each failing function is reported (or the program
crashes and I can look at the call stack), so I can correlate that with the
hooking engine I'm currently testing and see where it fails. I've used
Catch2 for the unit tests, because I wanted to try it anyway.

From the get-go it was clear that I wanted to test multiple hooking engines.
And they all needed to do the same steps in the same order -- so I implemented
a basic AbstractHookingEngine with a boolean for every test case and made a
child class for each engine. The child classes have to override `hook_all`
and `unhook_all`. In between the calls to those, the unit tests run.

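A minimal C++17 sketch of that layout, under assumptions: only
`AbstractHookingEngine`, `hook_all` and `unhook_all` are names from this
project. The flag names, `NullEngine` and `evaluate` are hypothetical and only
illustrate the "hook, run tests, unhook" order.

```cpp
#include <functional>
#include <string>
#include <utility>

// Base class: one boolean per test case plus the two functions every engine
// wrapper has to override. The flag names are illustrative, not the project's.
class AbstractHookingEngine {
public:
    explicit AbstractHookingEngine(std::string name) : name_(std::move(name)) {}
    virtual ~AbstractHookingEngine() = default;

    virtual bool hook_all() = 0;    // install every hook, set the flags below
    virtual bool unhook_all() = 0;  // remove every hook again

    const std::string& name() const { return name_; }

    bool hooked_small = false, hooked_branch = false, hooked_rip_relative = false,
         hooked_avx = false, hooked_rdrand = false, hooked_loop = false,
         hooked_tail_rec = false;

private:
    std::string name_;
};

// Hypothetical stand-in for a real engine wrapper: it hooks nothing and only
// counts how often it was called.
class NullEngine : public AbstractHookingEngine {
public:
    NullEngine() : AbstractHookingEngine("null") {}
    bool hook_all() override { ++calls; return true; }
    bool unhook_all() override { ++calls; return true; }
    int calls = 0;
};

// Same steps in the same order for every engine: hook, run the unit tests
// (criterion a), unhook. `run_unit_tests` stands in for the Catch2 session.
bool evaluate(AbstractHookingEngine& engine,
              const std::function<bool()>& run_unit_tests) {
    if (!engine.hook_all()) return false;
    const bool tests_passed = run_unit_tests();
    const bool unhooked = engine.unhook_all();
    return tests_passed && unhooked;
}
```

A real child class would call into the hooking library inside `hook_all` and
record per test case whether the library reported success.
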

Test case: Small
================
This is just a very small function; it is smaller than the hook code will be --
so how does the library react?

    _small:
        xor eax, eax
        ret


Test case: Branch
=================
Instead of the FASM code I'll show the disassembled version, so you can see the
instruction lengths & offsets.

    0026 | 48 83 E0 01 | and rax,1
    002A | 74 17       | je test_cases.0043 --+
    002C | 48 31 C0    | xor rax,rax          |
    002F | 90          | nop                  |
    0030 | 90          | nop                  |
    ...                                       |
    0041 | 90          | nop                  |
    0042 | 90          | nop                  |
    0043 | C3          | ret <----------------+

This function has a branch in the first 5 bytes. Hooking it detour-style isn't
possible without fixing that branch in the trampoline. The NOP sled is just
there so the hooking engine can't cheat and put the whole function into the
trampoline. Instead the jump in the trampoline needs to be modified so it jumps
back to the original destination.

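What "fixing that branch" can look like, as a hedged C++17 sketch: the short
`je` (`74 17`) only reaches +-127 bytes, so a copy in a far-away trampoline has
to be re-encoded as the 6-byte near form (`0F 84 rel32`) with a recomputed
displacement. `relocate_short_jcc` is a hypothetical helper, not taken from any
of the tested engines, and the addresses in the usage note are made up.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <limits>
#include <optional>

// Re-encode a short conditional jump (opcode 0x70..0x7F, rel8) that is moved
// from `old_addr` to `new_addr` as the near form (0F 80..8F, rel32) pointing
// at the same absolute target. Returns nothing if the opcode is not a short
// Jcc or the target is out of rel32 range from the new location.
std::optional<std::array<std::uint8_t, 6>> relocate_short_jcc(
    std::uint8_t opcode, std::int8_t rel8,
    std::uint64_t old_addr, std::uint64_t new_addr) {
    if (opcode < 0x70 || opcode > 0x7F) return std::nullopt;

    // rel8 counts from the end of the original 2-byte instruction.
    const std::uint64_t target =
        old_addr + 2 + static_cast<std::uint64_t>(static_cast<std::int64_t>(rel8));
    // rel32 counts from the end of the 6-byte replacement.
    const std::int64_t rel = static_cast<std::int64_t>(target - (new_addr + 6));
    if (rel < std::numeric_limits<std::int32_t>::min() ||
        rel > std::numeric_limits<std::int32_t>::max())
        return std::nullopt;

    const std::int32_t rel32 = static_cast<std::int32_t>(rel);
    std::array<std::uint8_t, 6> out{0x0F, static_cast<std::uint8_t>(opcode + 0x10),
                                    0, 0, 0, 0};
    // Assumes a little-endian host, which matches the x86 encoding.
    std::memcpy(out.data() + 2, &rel32, sizeof rel32);
    return out;
}
```

For the `je` above, `relocate_short_jcc(0x74, 0x17, base + 0x2A, trampoline + 4)`
yields a jump that still lands on `base + 0x43`, as long as the trampoline is
within 2GB of the original function.
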

Test case: RIP relative
=======================
One of the new things in AMD64 is RIP-relative addressing. I guess the reason
to include it was to make it easier to generate PIC -- all references to data
can now be made relative, instead of absolute. So it doesn't matter anymore
where the program is loaded into memory and there's less need for the
relocation table.

A quick and dirty[1] test for this is re-implementing the well-known C rand
function.

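For orientation, this is roughly what the test case computes, written as a
C++ sketch. The constants and the initial seed are the ones from the assembly
in this section; the `Lcg` type itself is hypothetical and only there so the
unit test side has something to compare against.

```cpp
#include <cstdint>

// Linear congruential generator with the same constants, shift and mask as
// the _rip_relative test case; the state starts at 1 like `seed dd 1`.
struct Lcg {
    std::uint32_t seed = 1;

    int next() {
        seed = seed * 214013u + 2531011u;  // wraps modulo 2^32, like MUL + ADD on EAX
        return static_cast<int>((seed >> 16) & 0x7FFF);
    }
};
```

With the seed at 1 the first two results are 41 and 18467, so a hooked
`_rip_relative` that returns anything else has a broken trampoline.
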

    public _rip_relative
    _rip_relative:
        mov rax, qword[seed]
        mov ecx, 214013
        mul ecx
        add eax, 2531011
        mov [seed], eax

        shr eax, 16
        and eax, 0x7FFF
        ret

    seed dd 1

The very first instruction uses RIP-relative addressing, thus it needs to be
fixed in the trampoline.

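The fix itself is a small piece of arithmetic. A hedged C++17 sketch, with
`relocate_rip_disp` being a hypothetical helper: the operand address is "next
instruction + disp32", so moving the instruction means subtracting the
distance it was moved by, and giving up if the result no longer fits in 32 bits.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// `disp` is the disp32 of a RIP-relative operand. RIP-relative addresses are
// computed from the address of the *next* instruction, so both locations are
// passed as "address of the instruction + its length".
std::optional<std::int32_t> relocate_rip_disp(std::int32_t disp,
                                              std::uint64_t old_next,
                                              std::uint64_t new_next) {
    // Absolute address the original instruction refers to.
    const std::uint64_t target =
        old_next + static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
    // Displacement that reaches the same address from the new location.
    const std::int64_t rel = static_cast<std::int64_t>(target - new_next);
    if (rel < std::numeric_limits<std::int32_t>::min() ||
        rel > std::numeric_limits<std::int32_t>::max())
        return std::nullopt;  // trampoline is too far away from the data
    return static_cast<std::int32_t>(rel);
}
```

The failure case is the interesting one: if the trampoline ends up more than
2GB away from `seed`, the instruction cannot simply be copied with a patched
displacement and would have to be rewritten.
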

Test case: AVX & RDRAND
=======================

The AMD64 instruction set is extended with every CPU generation. Because the
hooking engines need to know the instruction lengths and their side effects to
properly apply their hooks, they need to keep up.

The actual code in the test case is boring and doesn't matter. I'm sure there
are disagreements on whether I've picked good candidates for "exotic" or new
instructions, but those were the first that came to mind.

(It's also doubtful whether you'll ever encounter functions where the first
instructions are of this category, because most probably there's some setup
needed before, e.g. checking that addresses are aligned, initializing loop
counters, yadda, yadda.)


Test case: loop and TailRec
===========================

My hypothesis before starting this evaluation was that those two cases would
make most hooking engines fail. Back in the good ol' days of x86, detour hooking
didn't require any special thought because the common prologue was about as big
as the hook itself -- `PUSH EBP; MOV EBP, ESP` (3 bytes, or 5 with the
hotpatching `MOV EDI, EDI` in front) versus 5 bytes for `JMP +-2GB`[2]. That
isn't so easy for AMD64: a) the hook sometimes needs to be *way* bigger,
b) due to changes in the calling convention and the general architecture
of AMD64 there just isn't a common prologue, used for almost all functions,
anymore.

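To put a number on "way bigger", here is a C++ sketch that picks a hook size
from the distance between the function and the hook. The 14-byte form assumed
here is `JMP [RIP+0]` followed by the 8-byte absolute address, which is one
common encoding; the tested engines may well use a different one, so treat
both the constant and the helper names as assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

constexpr std::size_t kRelJmpSize = 5;   // E9 rel32
constexpr std::size_t kAbsJmpSize = 14;  // FF 25 00 00 00 00 + 8-byte address (assumed)

// True if a 5-byte relative jump placed at `from` can reach `to`.
bool fits_rel32(std::uint64_t from, std::uint64_t to) {
    // rel32 is measured from the end of the 5-byte jump.
    const std::int64_t delta = static_cast<std::int64_t>(to - (from + kRelJmpSize));
    return delta >= std::numeric_limits<std::int32_t>::min() &&
           delta <= std::numeric_limits<std::int32_t>::max();
}

// Number of bytes that have to be overwritten at the start of the function.
std::size_t hook_size(std::uint64_t from, std::uint64_t to) {
    return fits_rel32(from, to) ? kRelJmpSize : kAbsJmpSize;
}
```

The bigger the hook, the more original instructions end up in the trampoline,
and the more likely it is that one of them is a branch target.
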
Those by themselves aren't a problem, since the hooking engines can fix up all
the instructions they would overwrite. However, I hypothesized that only a few
would check whether the function contains a loop that jumps back into the
instructions that have been overwritten. Consider this:

    public _loop
    _loop:
        mov rax, rcx
    @loop_loop:
        mul rcx
        nop
        nop
        nop
        loop @loop_loop ; lol
        ret

There are only 3 bytes that can be safely overwritten. Right after that is the
destination of the backwards jump. This is a very simple (and kinda pointless)
function, so detecting that the loop might lead to problems shouldn't be
hard. But consider what happens with MHook (and all the others):

_loop original:

    008C | 48 89 C8       | mov rax,rcx
    008F | 48 F7 E1       | mul rcx
    0092 | 90             | nop
    0093 | 90             | nop
    0094 | 90             | nop
    0095 | E2 F8          | loop test_cases.008F
    0097 | C3             | ret

_loop hooked:

    008C | E9 0F 69 23 00 | jmp <MHook_Hooks::hookLoop>
    0091 | E1 90          | loope test_cases.0023
    0093 | 90             | nop
    0094 | 90             | nop
    0095 | E2 F8          | loop test_cases.008F
    0097 | C3             | ret

trampoline:

    00007FFF7CD200C0 | 48 89 C8       | mov rax,rcx
    00007FFF7CD200C3 | 48 F7 E1       | mul rcx
    00007FFF7CD200C6 | E9 C7 96 DC FF | jmp test_cases.0092

then executes:

    0092 | 90             | nop
    0093 | 90             | nop
    0094 | 90             | nop
    0095 | E2 F8          | loop test_cases.008F

But that jumps back into the middle of the hook's jump and thus executes:

    008F | 23 00          | and eax,dword ptr ds:[rax]
    0091 | E1 90          | loope test_cases.0023

Which isn't right and will crash horribly.

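The arithmetic behind that listing, as a small C++ sketch (both helper names
are hypothetical): a rel8 branch target is the address after the 2-byte
instruction plus the sign-extended displacement, and the hook is hit whenever
that target lies strictly inside the overwritten bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Target of a 2-byte rel8 branch (LOOP, LOOPE, short Jcc, ...) at `insn_addr`.
std::uint64_t rel8_target(std::uint64_t insn_addr, std::int8_t disp) {
    return insn_addr + 2 + static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
}

// True if `target` points into the overwritten bytes but not at their start.
bool lands_inside_hook(std::uint64_t target, std::uint64_t func_start,
                       std::size_t hook_len) {
    return target > func_start && target < func_start + hook_len;
}
```

For `E2 F8` at 0095 this gives 0095 + 2 - 8 = 008F, which is the fourth byte of
the 5-byte jump written at 008C.
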

(Preliminary) Results
=====================

    +----------+-----+------+------------+---+------+----+-------+
    |      Name|Small|Branch|RIP Relative|AVX|RDRAND|Loop|TailRec|
    +----------+-----+------+------------+---+------+----+-------+
    |  PolyHook|  X  |  X   |     X      | X |      |    |       |
    |   MinHook|  X  |  X   |     X      |   |      |    |   X   |
    |     MHook|     |      |     X      |   |      |    |       |
    +----------+-----+------+------------+---+------+----+-------+

As expected nothing could correctly hook the loop. In fact I had to comment out
those parts because even Catch2 couldn't recover from the crashes generated by
the botched hooks. Some hooking engines are a bit lacking in their support for
newer instruction sets, but a simple update of the disassembler library should
fix that.

I was pleasantly surprised by MinHook, both by the general API and because it
managed to build a trampoline that worked perfectly even for the tail
recursion case. I'd recommend it, even though it seems there's no chance that
the disassembler will ever be updated.


Detecting tail recursive functions / loops into overwritten code
================================================================

Back in 2015 I wanted to write my own hooking engine which would be able to
hook ALL THE FUNCTIONS! And I did actually start to write it and then
abandoned it before I got to the interesting part. However, since then I've had
the basic idea down:

1) Find out how long the function is
2) Analyze it, by checking whether some jump could jump into the overwritten
   instructions
3) Somehow fix that

Fixing that code probably means putting the whole function in the trampoline,
since by definition there is no space in the original to put the
additional/longer instructions.

However, I think that hooking engines should at least fail fast if they can't
hook that function and give the user the ability to handle that error at that
stage instead of waiting for unpredictable crashes. I'll post example code
[here](https://git.free-hack.com/wacked/x64hook) and outline the general
technique below.

(My x64hook hooking engine doesn't work. There are literally two interesting
functions in it, and I give pseudocode for them below.)


Estimate the length of a function
---------------------------------

Note: This is an estimation of the function length. There are various ways to
go about it; one way would be to search for the pro- and epilogue, which would
fail for all functions that -- for whatever reason -- don't have them. I'm sure
this way also isn't perfect, but maybe it could be used as another source of
information[5].

Over the years I've seen various attempts at estimating the function length.
One of the top hits in my Google history is a question on Stack Overflow[3]
which uses the same technique that I've seen in various malware strains --
checking byte for byte until the RET opcode is found. Which won't work if
either:

1) The `RET imm16` opcode is used, which is often the case for `__stdcall` funcs.
2) There are multiple returns
3) The function doesn't actually return with the RET instruction. For example
   if a function A at its end calls another function B, with A and B sharing the
   same parameters and either A or B not modifying the stack pointer, it is
   perfectly possible to just jump to function B. Execution will continue in B,
   which ends with a normal RET.
4) The value 0xC3 appears for some other reason in the function.

Problem 4) can be easily solved by using a length disassembler engine and only
checking the actual opcode byte of each instruction. 1) and 3) aren't that hard
either, you'll just need to check for some additional opcodes. What about 2)?

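To make problems 1) and 4) concrete, here is the byte-for-byte scan as a C++
sketch. `naive_length` is a hypothetical reconstruction of the technique
described above, not code from the linked question.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive estimate: everything up to and including the first 0xC3 byte.
// If no 0xC3 is found, the scan just runs to the end of the buffer.
std::size_t naive_length(const std::vector<std::uint8_t>& code) {
    for (std::size_t i = 0; i < code.size(); ++i)
        if (code[i] == 0xC3) return i + 1;
    return code.size();
}
```

For `B8 C3 00 00 00 C3` (`mov eax, 0xC3; ret`) the scan stops inside the
immediate and reports 2 instead of 6. For `31 C0 C2 08 00`
(`xor eax, eax; ret 8`) it never finds a 0xC3 and, on real memory, would keep
scanning into whatever follows the function.
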
The key insight I had was why a function might have multiple returns -- because
it needs to do additional work in some cases. Which means that there has to be
branching, to sometimes skip some instructions or get to them.

If there is a branch backwards it's a loop. But a branch forwards means that
the function extends at least up to there[4]. Or in pseudocode:


    offsetOfInstr = 0   // offset of the current instruction from the function start
    furthestJump = 0    // furthest forward branch target seen so far
    funcLen = 0
    while(can disassemble next instruction)
    {
        op = getOpcode(instruction);
        if(is_jump(op))
        {
            // target of the jump, relative to the function start
            off = get_jump_offset(instruction);
            if(off > furthestJump)
                furthestJump = off;
        }

        // the function extends at least to the end of this instruction
        funcLen = offsetOfInstr + length(instruction);
        if(is_end_of_function(op, furthestJump, offsetOfInstr))
        {
            break;
        }
        offsetOfInstr = funcLen;
    }

    bool is_end_of_function(opc, furthestJump, instrOffset)
    {
        // a return only ends the function if no branch targets code behind it
        if(opc == RET && furthestJump <= instrOffset)
            return true;
        else if(opc == UD_Ijmp)
        {
            // heuristic: an unconditional jump is treated as a tail call
            if(destination is IMM || destination is register)
                return true;
        }

        return false;
    }

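A runnable C++ variant of the same idea, under loud assumptions: it works on
a hypothetical, already decoded instruction list (`Insn`) instead of calling a
real disassembler, it uses the 512 byte cap from [4] to decide whether a
branch leaves the function, and, unlike the pseudocode, it lets an
unconditional jump end the function only if no earlier branch points past it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Kind { Other, Ret, Jcc, Jmp, IndirectJmp };

// Hypothetical pre-decoded instruction; a real engine would fill this from
// its disassembler.
struct Insn {
    Kind kind;
    std::size_t length;   // encoded length in bytes
    std::int64_t target;  // branch target relative to function start (Jcc/Jmp only)
};

constexpr std::int64_t kMaxFunctionLength = 512;  // assumption taken from [4]

std::size_t estimate_length(const std::vector<Insn>& stream) {
    std::size_t offset = 0;     // offset of the current instruction
    std::int64_t furthest = 0;  // furthest in-function branch target so far

    for (const Insn& in : stream) {
        const std::size_t end = offset + in.length;
        const bool direct = in.kind == Kind::Jcc || in.kind == Kind::Jmp;
        const bool internal =
            direct && in.target >= 0 && in.target <= kMaxFunctionLength;
        if (internal && in.target > furthest) furthest = in.target;

        // Nothing seen so far branches to code behind this instruction.
        const bool nothing_beyond = furthest <= static_cast<std::int64_t>(offset);
        const bool leaves = in.kind == Kind::Ret || in.kind == Kind::Jmp ||
                            in.kind == Kind::IndirectJmp;
        if (leaves && nothing_beyond) return end;

        offset = end;
    }
    return offset;  // ran out of instructions without finding an end
}
```

A function with an early `ret` that is skipped by a forward `je` is measured
up to its second `ret`, and a function ending in `jmp other_function` ends at
that jump.
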

Detecting loops to the start of a function
------------------------------------------

    firstJumpOffset = MAX_INT
    foreach(instruction in function)
        if(instruction is a jump)
            jumpOffset = getOffset(instruction) // jump target, relative to function start

            /* jumps to exactly the start of a function are fine, since that is
               where our overwritten code starts. Thus it doesn't jump into the
               middle of an instruction. Targets before the function start
               don't touch the overwritten bytes either */
            if(jumpOffset <= 0)
                continue

            if(jumpOffset < firstJumpOffset)
                firstJumpOffset = jumpOffset;

    // true means: some jump lands inside the bytes the hook overwrites
    return firstJumpOffset < lengthNeededForHook

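The same check as a runnable C++ sketch, assuming the branch targets have
already been collected (relative to the function start) by whatever
disassembler is in use. `jumps_into_hook` is a hypothetical name. For `_loop`
the only target is offset 3 and the hook needs 5 bytes, so the function is
flagged; for the Branch test case the only target is offset 29, so it is not.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// True if any branch lands strictly inside the first `hook_len` bytes, i.e.
// in the middle of the code the hook is going to overwrite. Targets at offset
// 0 or before the function start are harmless.
bool jumps_into_hook(const std::vector<std::int64_t>& branch_targets,
                     std::size_t hook_len) {
    const std::int64_t limit = static_cast<std::int64_t>(hook_len);
    return std::any_of(branch_targets.begin(), branch_targets.end(),
                       [limit](std::int64_t t) { return t > 0 && t < limit; });
}
```
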

------------

[1] This is one of the things that could easily be improved, but hasn't been
because I just couldn't motivate myself. Putting the data right after the func
meant that a section containing code needed to be writable. Which is bad. Also
I load the seed DWORD as a QWORD -- which only works because the upper half is
then thrown away by the multiplication. It's shitty code is what I'm saying.

In retrospect I should have used a jump table like a switch-case could be
compiled into. That would be read-only data. Oh well.

[2] And Microsoft decided at some point to make it even easier for their code
with the advent of hotpatching.

[3] https://stackoverflow.com/questions/8705215/get-the-size-length-of-a-c-function

[4] With some caveats, e.g. one could assume that no function is longer than
512 bytes. And obviously keeping in mind point 3.

[5] Another heuristic would be to check for the next sled of filler
instructions, such as INT3 or NOP. Some compilers align functions on 16-byte
boundaries and fill the gaps with those.