I did not know how expensive copying a shared_ptr is, since every copy involves an atomic increment and, later, an atomic decrement, and I ended up with much higher CPU usage than I expected. I never expected atomic increment and decrement to cost so much.
In my test, an atomic increment and decrement on an int32 took either about 2 times or about 40 times as long as the non-atomic increment and decrement. I measured this on a 3GHz Core i7 with Windows 8.1. The first figure is what I got with no contention, the second when contention was highly likely. I keep in mind that atomic operations are, in the end, hardware-based locks. A lock is a lock: bad for performance when contention occurs.
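A rough sketch of how such a comparison could be set up is shown here. It is not my original test code; the helper names `time_plain_ns` and `time_atomic_ns` are made up for illustration, and the numbers it produces will depend heavily on the compiler, optimization flags, and hardware.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: elapsed nanoseconds for `iterations` non-atomic
// increment/decrement pairs. `volatile` stops the compiler from
// removing the loop entirely.
double time_plain_ns(std::int32_t iterations) {
    volatile std::int32_t counter = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::int32_t i = 0; i < iterations; ++i) {
        counter = counter + 1;
        counter = counter - 1;
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count();
}

// Hypothetical helper: elapsed nanoseconds while `thread_count` threads each
// run `iterations` atomic increment/decrement pairs on one shared counter.
// thread_count == 1 approximates the uncontended case; more threads add
// contention. The elapsed time also includes thread start-up and join.
double time_atomic_ns(std::int32_t iterations, unsigned thread_count,
                      std::int32_t& final_value) {
    std::atomic<std::int32_t> counter{0};
    auto work = [&counter, iterations] {
        for (std::int32_t i = 0; i < iterations; ++i) {
            counter.fetch_add(1);  // default (sequentially consistent) ordering
            counter.fetch_sub(1);
        }
    };
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < thread_count; ++t) {
        threads.emplace_back(work);
    }
    for (auto& th : threads) {
        th.join();
    }
    auto stop = std::chrono::steady_clock::now();
    final_value = counter.load();  // every +1 is paired with a -1, so this is 0
    return std::chrono::duration<double, std::nano>(stop - start).count();
}
```

Calling `time_plain_ns(n)`, `time_atomic_ns(n, 1, v)` and `time_atomic_ns(n, 4, v)` with a large `n` gives three timings to compare. Use a large iteration count so that thread start-up does not dominate the atomic measurement.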
Since this experience, I always pass by reference (const shared_ptr&) rather than by value (shared_ptr).
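To make the difference visible, here is a minimal sketch. The function names `count_by_value` and `count_by_ref` are hypothetical, chosen only for this illustration. Each function returns the reference count it observes, which shows whether a copy (and therefore an atomic increment and decrement) took place.

```cpp
#include <memory>

// Hypothetical: the parameter is a copy of the caller's shared_ptr, so the
// reference count is atomically incremented on entry and decremented on exit.
long count_by_value(std::shared_ptr<int> p) {
    return p.use_count();
}

// Hypothetical: the parameter refers to the caller's shared_ptr, so no copy
// is made and the reference count is not touched.
long count_by_ref(const std::shared_ptr<int>& p) {
    return p.use_count();
}
```

With a single owner, for example `auto sp = std::make_shared<int>(42);`, `count_by_value(sp)` sees a count of 2 while `count_by_ref(sp)` sees 1. Passing by value is still the right choice when the callee needs to keep its own ownership of the object.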