Faster way to zero memory than with memset?

Question

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?

I assume that memset uses mov; however, when zeroing memory most compilers use xor as it's faster, correct?

edit1: Wrong, as GregS pointed out that only works with registers. What was I thinking?

Also, I asked a person who knew more about assembler than me to look at the stdlib, and he told me that on x86 memset is not taking full advantage of the 32-bit-wide registers. However, at that time I was very tired, so I'm not quite sure I understood it correctly.

edit2: I revisited this issue and did a little testing. Here is what I tested:

    #include <stdio.h>
    #include <malloc.h>
    #include <string.h>
    #include <sys/time.h>

    #define TIME(body) do {                                                      \
        struct timeval t1, t2; double elapsed;                                   \
        gettimeofday(&t1, NULL);                                                 \
        body                                                                     \
        gettimeofday(&t2, NULL);                                                 \
        elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
        printf("%s\n --- %f ---\n", #body, elapsed); } while(0)

    #define SIZE 0x1000000

    void zero_1(void* buff, size_t size)
    {
        size_t i;
        char* foo = buff;
        for (i = 0; i < size; i++)
            foo[i] = 0;
    }

    /* I foolishly assume size_t has register width */
    void zero_sizet(void* buff, size_t size)
    {
        size_t i;
        char* bar;
        size_t* foo = buff;
        for (i = 0; i < size / sizeof(size_t); i++)
            foo[i] = 0;

        // fixes bug pointed out by tristopia
        bar = (char*)buff + size - size % sizeof(size_t);
        for (i = 0; i < size % sizeof(size_t); i++)
            bar[i] = 0;
    }

    int main()
    {
        char* buffer = malloc(SIZE);
        TIME(
            memset(buffer, 0, SIZE);
        );
        TIME(
            zero_1(buffer, SIZE);
        );
        TIME(
            zero_sizet(buffer, SIZE);
        );
        return 0;
    }

Results:

zero_1 is the slowest (except for -O3). zero_sizet is the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow for -O3). One thing of interest is that at -O3 zero_1 was as fast as zero_sizet; however, the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). Also, I tried optimizing zero_sizet further, but the compiler always outdid me; no surprise here.

For now memset wins; the previous results were distorted by CPU cache. (All tests were run on Linux.) Further testing needed. I'll try assembler next.

edit3: fixed bug in test code; test results are not affected.

edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zero. It will be hard to beat this.

User · Accepted Answer

x86 is a rather broad range of devices.

For a totally generic x86 target, an assembly block with "rep stosd" could blast out zeros to memory 32 bits at a time. (rep movsd copies memory; stosd is the store-only form you want for a fill.) Try to make sure the bulk of this work is DWORD aligned.

For chips with MMX, an assembly loop with movq could hit 64 bits at a time.

You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or __m64. The target must be 8-byte aligned for the best performance.

For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use stosb until aligned, and then complete your clear with a loop of movaps.

Win32 has "ZeroMemory()", but I forget if that's a macro to memset or an actual "good" implementation.
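The "byte stores until aligned, then movaps" idea can be sketched in C with SSE intrinsics instead of hand-written assembly. This is a minimal sketch, not a tuned routine: zero_sse is a hypothetical name, and the code assumes a compiler that defines __SSE__ on SSE-capable targets (GCC and Clang do on x86-64); elsewhere it simply defers to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: _mm_setzero_ps, _mm_store_ps */
#endif

/* Hypothetical helper: zero n bytes at dst with 16-byte aligned stores. */
void zero_sse(void *dst, size_t n)
{
    assert(dst != NULL || n == 0);
#if defined(__SSE__)
    unsigned char *p = dst;

    /* Head: single-byte stores until p is 16-byte aligned (or n runs out). */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = 0;
        n--;
    }

    /* Body: one aligned 16-byte store (movaps) per iteration. */
    const __m128 zero = _mm_setzero_ps();
    for (; n >= 16; n -= 16, p += 16)
        _mm_store_ps((float *)p, zero);

    /* Tail: the remaining 0..15 bytes. */
    while (n > 0) {
        *p++ = 0;
        n--;
    }
#else
    /* No SSE known at compile time: fall back to the library routine. */
    if (n > 0)
        memset(dst, 0, n);
#endif
}
```

Whether this beats the library memset depends on the CPU and the C runtime; as the question's edit4 notes, some runtimes already contain an SSE path for zeroing.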

User · Answer

The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations, it is a simple while loop that copies the specified value one byte at a time over the given number of bytes. If you want a faster memset (or memcpy, memmove, etc.), it is almost always possible to code one up yourself.

The simplest customization would be to do single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture) and then start copying a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.

Depending on your particular CPU, you might also have some streaming SIMD instructions that can help you out. These will typically work better on aligned addresses, so the above technique for using aligned addresses can be useful here as well.

For zeroing out large sections of memory, you may also see a speed boost by splitting the range into sections and processing each section in parallel (where the number of sections is the same as your number of cores/hardware threads).

Most importantly, there's no way to tell if any of this will help unless you try it. At a minimum, take a look at what your compiler emits for each case. See what other compilers emit for their standard memset as well (their implementation might be more efficient than your compiler's).
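The head/body/tail scheme described above might look like the following. This is a sketch under stated assumptions: zero_words is a hypothetical name, uintptr_t is assumed to be register width, and the destination is assumed to be memory that may legally be written through a uintptr_t pointer (for example, a block obtained from malloc).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte stores to reach alignment, then word stores. */
void zero_words(void *dst, size_t n)
{
    assert(dst != NULL || n == 0);
    unsigned char *p = dst;

    /* Head: single bytes until p is aligned for uintptr_t. */
    while (n > 0 && ((uintptr_t)p % sizeof(uintptr_t)) != 0) {
        *p++ = 0;
        n--;
    }

    /* Body: one full word per store.
     * Assumption: the memory may be accessed as uintptr_t. */
    uintptr_t *w = (uintptr_t *)(void *)p;
    for (; n >= sizeof *w; n -= sizeof *w)
        *w++ = 0;

    /* Tail: leftover bytes if the range does not end on a word boundary. */
    p = (unsigned char *)w;
    while (n > 0) {
        *p++ = 0;
        n--;
    }
}
```

An optimizing compiler may recognize these loops and replace them with a call to memset anyway, which is one more reason to inspect the generated assembly before drawing conclusions.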

User · Answer

memset is generally designed to be very, very fast general-purpose setting/zeroing code. It handles all cases with different sizes and alignments, which affect the kinds of instructions you can use to do your work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of whatever its native properties are. It might also have internal special cases to handle the case of zeroing (versus setting some other value).

That said, if you have very specific, very performance-critical memory zeroing to do, it's certainly possible that you could beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming.

User · Answer

Nowadays your compiler should do all the work for you. At least from what I know, gcc is very efficient in optimizing calls to memset away (better check the assembler, though).

Then also, avoid memset if you don't have to: use calloc for heap memory, and use proper initialization (... = { 0 }) for stack memory.

And for really large chunks use mmap if you have it. This just gets zero-initialized memory from the system "for free".
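The three alternatives named above can be sketched as follows. The helper names are hypothetical, and the mmap part assumes a POSIX-like system that supports anonymous mappings via MAP_ANONYMOUS (Linux, the BSDs, macOS); it is not standard C.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON   /* older spelling on some systems */
#endif

/* Heap: calloc returns zero-filled memory, so no memset is needed. */
int *make_counters(size_t count)
{
    return calloc(count, sizeof(int));
}

/* Stack: the initializer zeroes every element of the array. */
int sum_zeroed_stack_array(void)
{
    int local[64] = {0};
    int sum = 0;
    for (size_t i = 0; i < sizeof local / sizeof local[0]; i++)
        sum += local[i];
    return sum;
}

/* Large blocks: anonymous mappings are handed out zero-filled by the OS. */
void *alloc_zeroed_pages(size_t n)
{
    void *p = mmap(NULL, n, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release a block obtained from alloc_zeroed_pages. */
void free_zeroed_pages(void *p, size_t n)
{
    if (p != NULL)
        munmap(p, n);
}
```

The mapping is only "free" until it is used: the kernel typically zeroes each page when it is first touched, so the cost moves to the first access rather than disappearing.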

User · Answer

memset could be inlined by the compiler as a series of efficient opcodes, unrolled for a few cycles. For very large memory blocks, like a 4000x2000 64-bit framebuffer, you can try optimizing it across several threads (which you prepare for that sole task), each setting its own part.

Note that there is also bzero(), but it is more obscure, and less likely to be as optimized as memset, and the compiler will surely notice you pass 0.

What the compiler usually assumes is that you memset large blocks, so for smaller blocks it would likely be more efficient to just do *(uint64_t*)p = 0, if you init a large number of small objects.

Generally, all x86 CPUs are different (unless you compile for some standardized platform), and something you optimize for Pentium 2 will behave differently on Core Duo or i486. So if you are really into it and want to squeeze the last few bits of toothpaste, it makes sense to ship several versions of your exe compiled and optimized for different popular CPU models. From personal experience, Clang -march=native boosted my game's FPS from 60 to 65, compared to no -march.

User · Answer

There is one fatal flaw in this otherwise great and helpful test: as memset is the first instruction, there seems to be some "memory overhead" or so which makes it extremely slow. Moving the timing of memset to second place and something else to first place, or simply timing memset twice, makes memset the fastest with all compile switches.

User · Answer

That's an interesting question. I made this implementation that is just slightly faster (but hardly measurable) when 32-bit release compiling on VC++ 2012. It probably can be improved on a lot. Adding this in your own class in a multithreaded environment would probably give you even more performance gains, since there are some reported bottleneck problems with memset() in multithreaded scenarios.

    // MemsetSpeedTest.cpp : Defines the entry point for the console application.
    //

    #include "stdafx.h"
    #include <iostream>
    #include "Windows.h"
    #include <time.h>
    #include <string.h>   // memset

    #pragma comment(lib, "Winmm.lib")
    using namespace std;

    /** a signed 64-bit integer value type */
    #define _INT64 __int64

    /** a signed 32-bit integer value type */
    #define _INT32 __int32

    /** a signed 16-bit integer value type */
    #define _INT16 __int16

    /** a signed 8-bit integer value type */
    #define _INT8 __int8

    /** an unsigned 64-bit integer value type */
    #define _UINT64 unsigned _INT64

    /** an unsigned 32-bit integer value type */
    #define _UINT32 unsigned _INT32

    /** an unsigned 16-bit integer value type */
    #define _UINT16 unsigned _INT16

    /** an unsigned 8-bit integer value type */
    #define _UINT8 unsigned _INT8

    /** maximum allowed value in an unsigned 64-bit integer value type */
    #define _UINT64_MAX 18446744073709551615ULL

    #ifdef _WIN32

    /** Use to init the clock */
    #define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);

    /** Use to start the performance timer */
    #define TIMER_START QueryPerformanceCounter(&t1);

    /** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
    #define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;

    #else

    /** Use to init the clock */
    #define TIMER_INIT clock_t start;double diff;

    /** Use to start the performance timer */
    #define TIMER_START start=clock();

    /** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
    #define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;

    #endif

    void *MemSet(void *dest, _UINT8 c, size_t count)
    {
        size_t blockIdx;
        size_t blocks = count >> 3;
        size_t bytesLeft = count - (blocks << 3);

        // Replicate the fill byte into all eight bytes of a 64-bit word.
        _UINT64 cUll =
            c
            | (((_UINT64)c) << 8)
            | (((_UINT64)c) << 16)
            | (((_UINT64)c) << 24)
            | (((_UINT64)c) << 32)
            | (((_UINT64)c) << 40)
            | (((_UINT64)c) << 48)
            | (((_UINT64)c) << 56);

        // 8 bytes at a time
        _UINT64 *destPtr8 = (_UINT64*)dest;
        for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = cUll;

        if (!bytesLeft) return dest;

        // then at most one 4-byte store
        blocks = bytesLeft >> 2;
        bytesLeft = bytesLeft - (blocks << 2);

        _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
        for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = (_UINT32)cUll;

        if (!bytesLeft) return dest;

        // then at most one 2-byte store
        blocks = bytesLeft >> 1;
        bytesLeft = bytesLeft - (blocks << 1);

        _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
        for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = (_UINT16)cUll;

        if (!bytesLeft) return dest;

        // then the last byte, if any
        _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
        for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = (_UINT8)cUll;

        return dest;
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
        TIMER_INIT

        const size_t n = 10000000;
        // As posted; choose a small repeat count when running this yourself.
        const _UINT64 m = _UINT64_MAX;
        const _UINT64 o = 1;
        // static: a 10 MB array does not fit in the default 1 MB stack
        static char test[n];
        {
            cout << "memset()" << endl;
            TIMER_START;

            for (int i = 0; i < m; i++)
                for (int j = 0; j < o; j++)
                    memset((void*)test, 0, n);

            TIMER_STOP;
        }
        {
            cout << "MemSet() took:" << endl;
            TIMER_START;

            for (int i = 0; i < m; i++)
                for (int j = 0; j < o; j++)
                    MemSet((void*)test, 0, n);

            TIMER_STOP;
        }

        cout << "Done" << endl;
        int wait;
        cin >> wait;
        return 0;
    }

Output is as follows when release compiling for 32-bit systems:

    memset() took:
    5.569000
    MemSet() took:
    5.544000
    Done

Output is as follows when release compiling for 64-bit systems:

    memset() took:
    2.781000
    MemSet() took:
    2.765000
    Done

Here you can find the source code of Berkeley's memset(), which I think is the most common implementation.

User · Answer

If I remember correctly (from a couple of years ago), one of the senior developers was talking about a fast way to bzero() on PowerPC (specs said we needed to zero almost all the memory on power up). It might not translate well (if at all) to x86, but it could be worth exploring.

The idea was to load a data cache line, clear that data cache line, and then write the cleared data cache line back to memory.

For what it is worth, I hope it helps.

User · Answer

Unless you have specific needs or know that your compiler/stdlib is sucky, stick with memset. It's general-purpose, and should have decent performance in general. Also, compilers might have an easier time optimizing/inlining memset() because they can have intrinsic support for it.

For instance, Visual C++ will often generate inline versions of memcpy/memset that are as small as a call to the library function, thus avoiding push/call/ret overhead. And there are further possible optimizations when the size parameter can be evaluated at compile time.

That said, if you have specific needs (where size will always be tiny or huge), you can gain speed boosts by dropping down to assembly level. For instance, using write-through operations for zeroing huge chunks of memory without polluting your L2 cache.

But it all depends, and for normal stuff, please stick to memset/memcpy.
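On x86, one reading of "zeroing without polluting the cache" is non-temporal (streaming) stores, which are also reachable from C through intrinsics rather than assembly. The sketch below is an interpretation, not necessarily what the answer had in mind: zero_streaming is a hypothetical name, and the code assumes a compiler that defines __SSE2__ (otherwise it defers to memset).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_setzero_si128 */
#endif

/* Hypothetical helper: zero a large block with non-temporal stores. */
void zero_streaming(void *dst, size_t n)
{
    assert(dst != NULL || n == 0);
#if defined(__SSE2__)
    unsigned char *p = dst;

    /* Head: streaming stores need a 16-byte aligned address. */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = 0;
        n--;
    }

    /* Body: movntdq writes 16 bytes with a hint to bypass the caches. */
    const __m128i zero = _mm_setzero_si128();
    for (; n >= 16; n -= 16, p += 16)
        _mm_stream_si128((__m128i *)p, zero);

    /* Order the streaming stores before any later stores. */
    _mm_sfence();

    /* Tail: the remaining 0..15 bytes with ordinary stores. */
    while (n > 0) {
        *p++ = 0;
        n--;
    }
#else
    if (n > 0)
        memset(dst, 0, n);
#endif
}
```

This only tends to pay off for blocks much larger than the cache that will not be read again soon; for small or immediately reused buffers, ordinary stores (or plain memset) are usually the better choice.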
