This answer has a code example, which implements the algorithm described in the paper Practical In-Place Merging by Bing-Chao Huang and Michael A. Langston. I have to admit that I do not understand the details, but the given complexity of the merge step is O(n).
From a practical perspective, there is evidence that pure in-place implementations are not performing better in real world scenarios. For example, the C++ standard defines std::inplace_merge, which is as the name implies an in-place merge operation.
Assuming that C++ libraries are typically very well optimized, it is interesting to see how it is implemented:
The implementation delegates to __inplace_merge, which dodges the problem by trying to allocate a temporary buffer:
typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
_TmpBuf __buf(__first, __len1 + __len2);
if (__buf.begin() == 0)
std::__merge_without_buffer
(__first, __middle, __last, __len1, __len2, __comp);
else
std::__merge_adaptive
(__first, __middle, __last, __len1, __len2, __buf.begin(),
_DistanceType(__buf.size()), __comp);
Otherwise, it falls back to an implementation (__merge_without_buffer), which requires no extra memory, but no longer runs in O(n) time.
Looks similar. It delegates to a function, which also tries to allocate a buffer. Depending on whether it got enough elements, it will choose the implementation. The constant-memory fallback function is called __buffered_inplace_merge.
Maybe even the fallback is still O(n) time, but the point is that they do not use the implementation if temporary memory is available.
Note that the C++ standard explicitly gives implementations the freedom to choose this approach by lowering the required complexity from O(n) to O(N log N):
Complexity: Exactly N-1 comparisons if enough additional memory is available. If the memory is insufficient, O(N log N) comparisons.
Of course, this cannot be taken as a proof that constant space in-place merges in O(n) time should never be used. On the other hand, if it would be faster, the optimized C++ libraries would probably switch to that type of implementation.