Let's look at the algorithm in the question:
[(a + b) + abs(b - a)]/2
This has an addition stage and a subtraction stage, whose results are then fed to a second-stage addition. The divide by 2 is trivial in hardware: it can be done by dropping the LSB. However, the two-stage full-adder/subtractor is pretty slow and gate-intensive, especially if you are cascading multiple comparisons like you are.
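To make the cost concrete, here is a rough C model of that datapath for 8-bit unsigned inputs. The function names and the 8-bit/16-bit widths are my own assumptions for illustration, not something from the question. Note that the intermediate sum needs an extra bit, and that taking the absolute value already requires knowing which input is larger, which is exactly the comparison we wanted in the first place.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic max from the question: [(a + b) + |b - a|] / 2.
   Widths are assumptions for this sketch: 8-bit inputs, 16-bit
   intermediates, because the total can reach 2 * 255 = 510. */
uint8_t max_arith(uint8_t a, uint8_t b)
{
    uint16_t sum = (uint16_t)a + b;               /* first stage: adder */
    uint16_t diff = (a > b) ? (uint16_t)(a - b)   /* first stage: subtractor */
                            : (uint16_t)(b - a);  /* plus sign fix-up for |b - a| */
    uint16_t total = sum + diff;                  /* second stage: adder */
    assert((total & 1u) == 0);                    /* total is 2 * max, so always even */
    return (uint8_t)(total >> 1);                 /* divide by 2: drop the LSB */
}

/* The comparator-plus-mux alternative, for reference. */
uint8_t max_mux(uint8_t a, uint8_t b)
{
    return (a > b) ? a : b;
}
```

Both functions return the same value for every input pair; the difference is only in how much arithmetic sits on the path.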
Building off of Wouter van Ooijen's answer, the generalized structure is a digital comparator feeding the select signal of a mux:

[Schematic: a digital comparator whose output drives the select input of a mux]
The above schematic is for:
(A > B) ? A : B
but notice that it can easily be reconfigured for any comparison between the two inputs by changing the logic that connects the comparator outputs to the mux select.
So if we know how to formulate the three outputs from the comparator (typically A > B, A = B and A < B), we can implement any comparison in hardware. Comparator logic is well described here. To optimize the hardware, we would just remove the logic driving the unused comparator outputs.
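As a behavioral sketch of that idea, the C model below treats the comparator as a block with three one-hot outputs and builds different selections just by rewiring which outputs reach the mux select. The struct, the function names and the 8-bit width are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The three outputs of a magnitude comparator (names are assumptions). */
struct cmp_out {
    bool gt; /* A > B */
    bool eq; /* A = B */
    bool lt; /* A < B */
};

struct cmp_out compare(uint8_t a, uint8_t b)
{
    struct cmp_out c = { a > b, a == b, a < b };
    assert(c.gt + c.eq + c.lt == 1); /* exactly one output is active */
    return c;
}

/* 2:1 mux: sel = 1 passes a, sel = 0 passes b. */
uint8_t mux2(bool sel, uint8_t a, uint8_t b)
{
    return sel ? a : b;
}

/* (A > B) ? A : B  -> select wired straight to the gt output */
uint8_t max_u8(uint8_t a, uint8_t b)
{
    return mux2(compare(a, b).gt, a, b);
}

/* (A < B) ? A : B  -> same blocks, select wired to lt instead */
uint8_t min_u8(uint8_t a, uint8_t b)
{
    return mux2(compare(a, b).lt, a, b);
}

/* (A >= B) ? A : B -> select is the OR of gt and eq */
uint8_t max_ge_u8(uint8_t a, uint8_t b)
{
    struct cmp_out c = compare(a, b);
    return mux2(c.gt || c.eq, a, b);
}
```

In `max_u8` only `gt` is ever used, so in hardware the logic producing `eq` and `lt` could simply be dropped.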
But in the end, if it's going to hardware, it has to go through synthesis, so you shouldn't obsess over which gate-level scheme is optimal. Instead, write your code and algorithms so that you are at least not forcing the synthesizer to produce an inefficient result. "With some clever tricking the checking of the bit pairs can be combined with the muxer for the same bit pair," and the easiest way to perform this optimization is with synthesis.
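For the curious, here is one way the quoted bit-pair trick might look, modeled in C as a loop from MSB to LSB where each iteration plays the role of one bit slice. This is only my reading of the remark, with hypothetical names and a `width` parameter I introduced; in practice you would write the plain comparison and let the synthesizer find a structure like this (or a better one) on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit-sliced unsigned max: each slice inspects one bit pair and also
   muxes that same pair onto the output. Bits above `width` are ignored. */
uint32_t max_bitslice(uint32_t a, uint32_t b, unsigned width)
{
    assert(width >= 1 && width <= 32);

    bool decided = false; /* a more significant pair has already differed */
    bool pick_b = false;  /* meaningful once decided: b is the larger input */
    uint32_t out = 0;

    for (unsigned i = width; i-- > 0; ) {
        bool abit = (a >> i) & 1u;
        bool bbit = (b >> i) & 1u;

        /* The first differing pair settles the comparison:
           whichever input has the 1 there is the larger one. */
        if (!decided && abit != bbit) {
            decided = true;
            pick_b = bbit;
        }

        /* Per-bit mux. While undecided the two bits are equal,
           so passing a's bit is correct either way. */
        bool obit = pick_b ? bbit : abit;
        out |= (uint32_t)obit << i;
    }
    return out;
}
```

The `decided` and `pick_b` signals ripple from slice to slice, much like a carry chain, which is the part a synthesis tool is free to restructure for speed.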