In my own work, I've realized that every time I've wanted something like variable-length automatic arrays or alloca(), I didn't really care that the memory was physically located on the cpu stack, just that it came from some stack allocator that didn't incur slow trips to the general heap. So I have a per-thread object that owns some memory from which it can push/pop variable sized buffers. On some platforms I allow this to grow via mmu. Other platforms have a fixed size (usually accompanied by a fixed size cpu stack as well because no mmu). One platform I work with (a handheld game console) has precious little cpu stack anyway because it resides in scarce, fast memory.
I'm not saying that pushing variable-sized buffers onto the cpu stack is never needed. Honestly I was surprised back when I discovered this wasn't standard, as it certainly seems like the concept fits into the language well enough. For me though, the requirements "variable size" and "must be physically located on the cpu stack" have never come up together. It's been about speed, so I made my own sort of "parallel stack for data buffers".