Tight loops, like when playing with images, since an image may cosist of millions of pixels. Sitting down and figuring out how to make best use of the limited number of processor registers can make a difference. Here's a real life sample:
http://danbystrom.se/2008/12/22/optimizing-away-ii/
Then often processors have some esoteric instructions which are too specialized for a compiler to bother with, but on occasion an assembler programmer can make good use of them. Take the XLAT instruction for example. Really great if you need to do table look-ups in a loop and the table is limited to 256 bytes!
Updated: Oh, just come to think of what's most crucial when we speak of loops in general: the compiler has often no clue on how many iterations that will be the common case! Only the programmer know that a loop will be iterated MANY times and that it therefore will be beneficial to prepare for the loop with some extra work, or if it will be iterated so few times that the set-up actually will take longer than the iterations expected.