[c] How to allocate aligned memory only using the standard library?

I just finished a test as part of a job interview, and one question stumped me, even using Google for reference. I'd like to see what the StackOverflow crew can do with it:

The memset_16aligned function requires a 16-byte aligned pointer passed to it, or it will crash.

a) How would you allocate 1024 bytes of memory, and align it to a 16 byte boundary?
b) Free the memory after the memset_16aligned has executed.

{    
   void *mem;
   void *ptr;

   // answer a) here

   memset_16aligned(ptr, 0, 1024);

   // answer b) here    
}

This question is related to c memory-management

The answer is


I'm surprised noone's voted up Shao's answer that, as I understand it, it is impossible to do what's asked in standard C99, since converting a pointer to an integral type formally is undefined behavior. (Apart from the standard allowing conversion of uintptr_t <-> void*, but the standard does not seem to allow doing any manipulations of the uintptr_t value and then converting it back.)


On the 16 vs 15 byte-count padding front, the actual number you need to add to get an alignment of N is max(0,N-M) where M is the natural alignment of the memory allocator (and both are powers of 2).

Since the minimal memory alignment of any allocator is 1 byte, 15=max(0,16-1) is a conservative answer. However, if you know your memory allocator is going to give you 32-bit int aligned addresses (which is fairly common), you could have used 12 as a pad.

This isn't important for this example but it might be important on an embedded system with 12K of RAM where every single int saved counts.

The best way to implement it if you're actually going to try to save every byte possible is as a macro so you can feed it your native memory alignment. Again, this is probably only useful for embedded systems where you need to save every byte.

In the example below, on most systems, the value 1 is just fine for MEMORY_ALLOCATOR_NATIVE_ALIGNMENT, however for our theoretical embedded system with 32-bit aligned allocations, the following could save a tiny bit of precious memory:

#define MEMORY_ALLOCATOR_NATIVE_ALIGNMENT    4
#define ALIGN_PAD2(N,M) (((N)>(M)) ? ((N)-(M)) : 0)
#define ALIGN_PAD(N) ALIGN_PAD2((N), MEMORY_ALLOCATOR_NATIVE_ALIGNMENT)

long add;   
mem = (void*)malloc(1024 +15);
add = (long)mem;
add = add - (add % 16);//align to 16 byte boundary
ptr = (whatever*)(add);

For the solution i used a concept of padding which aligns the memory and do not waste the memory of a single byte .

If there are constraints that, you cannot waste a single byte. All pointers allocated with malloc are 16 bytes aligned.

C11 is supported, so you can just call aligned_alloc (16, size).

void *mem = malloc(1024+16);
void *ptr = ((char *)mem+16) & ~ 0x0F;
memset_16aligned(ptr, 0, 1024);
free(mem);

long add;   
mem = (void*)malloc(1024 +15);
add = (long)mem;
add = add - (add % 16);//align to 16 byte boundary
ptr = (whatever*)(add);

You can also add some 16 bytes and then push the original ptr to 16bit aligned by adding the (16-mod) as below the pointer :

main(){
void *mem1 = malloc(1024+16);
void *mem = ((char*)mem1)+1; // force misalign ( my computer always aligns)
printf ( " ptr = %p \n ", mem );
void *ptr = ((long)mem+16) & ~ 0x0F;
printf ( " aligned ptr = %p \n ", ptr );

printf (" ptr after adding diff mod %p (same as above ) ", (long)mem1 + (16 -((long)mem1%16)) );


free(mem1);
}

Perhaps they would have been satisfied with a knowledge of memalign? And as Jonathan Leffler points out, there are two newer preferable functions to know about.

Oops, florin beat me to it. However, if you read the man page I linked to, you'll most likely understand the example supplied by an earlier poster.


You could also try posix_memalign() (on POSIX platforms, of course).


size =1024;
alignment = 16;
aligned_size = size +(alignment -(size %  alignment));
mem = malloc(aligned_size);
memset_16aligned(mem, 0, 1024);
free(mem);

Hope this one is the simplest implementation, let me know your comments.


Here's an alternate approach to the 'round up' part. Not the most brilliantly coded solution but it gets the job done, and this type of syntax is a bit easier to remember (plus would work for alignment values that aren't a power of 2). The uintptr_t cast was necessary to appease the compiler; pointer arithmetic isn't very fond of division or multiplication.

void *mem = malloc(1024 + 15);
void *ptr = (void*) ((uintptr_t) mem + 15) / 16 * 16;
memset_16aligned(ptr, 0, 1024);
free(mem);

Here's an alternate approach to the 'round up' part. Not the most brilliantly coded solution but it gets the job done, and this type of syntax is a bit easier to remember (plus would work for alignment values that aren't a power of 2). The uintptr_t cast was necessary to appease the compiler; pointer arithmetic isn't very fond of division or multiplication.

void *mem = malloc(1024 + 15);
void *ptr = (void*) ((uintptr_t) mem + 15) / 16 * 16;
memset_16aligned(ptr, 0, 1024);
free(mem);

You could also try posix_memalign() (on POSIX platforms, of course).


We do this sort of thing all the time for Accelerate.framework, a heavily vectorized OS X / iOS library, where we have to pay attention to alignment all the time. There are quite a few options, one or two of which I didn't see mentioned above.

The fastest method for a small array like this is just stick it on the stack. With GCC / clang:

 void my_func( void )
 {
     uint8_t array[1024] __attribute__ ((aligned(16)));
     ...
 }

No free() required. This is typically two instructions: subtract 1024 from the stack pointer, then AND the stack pointer with -alignment. Presumably the requester needed the data on the heap because its lifespan of the array exceeded the stack or recursion is at work or stack space is at a serious premium.

On OS X / iOS all calls to malloc/calloc/etc. are always 16 byte aligned. If you needed 32 byte aligned for AVX, for example, then you can use posix_memalign:

void *buf = NULL;
int err = posix_memalign( &buf, 32 /*alignment*/, 1024 /*size*/);
if( err )
   RunInCirclesWaivingArmsWildly();
...
free(buf);

Some folks have mentioned the C++ interface that works similarly.

It should not be forgotten that pages are aligned to large powers of two, so page-aligned buffers are also 16 byte aligned. Thus, mmap() and valloc() and other similar interfaces are also options. mmap() has the advantage that the buffer can be allocated preinitialized with something non-zero in it, if you want. Since these have page aligned size, you will not get the minimum allocation from these, and it will likely be subject to a VM fault the first time you touch it.

Cheesy: Turn on guard malloc or similar. Buffers that are n*16 bytes in size such as this one will be n*16 bytes aligned, because VM is used to catch overruns and its boundaries are at page boundaries.

Some Accelerate.framework functions take in a user supplied temp buffer to use as scratch space. Here we have to assume that the buffer passed to us is wildly misaligned and the user is actively trying to make our life hard out of spite. (Our test cases stick a guard page right before and after the temp buffer to underline the spite.) Here, we return the minimum size we need to guarantee a 16-byte aligned segment somewhere in it, and then manually align the buffer afterward. This size is desired_size + alignment - 1. So, In this case that is 1024 + 16 - 1 = 1039 bytes. Then align as so:

#include <stdint.h>
void My_func( uint8_t *tempBuf, ... )
{
    uint8_t *alignedBuf = (uint8_t*) 
                          (((uintptr_t) tempBuf + ((uintptr_t)alignment-1)) 
                                        & -((uintptr_t) alignment));
    ...
}

Adding alignment-1 will move the pointer past the first aligned address and then ANDing with -alignment (e.g. 0xfff...ff0 for alignment=16) brings it back to the aligned address.

As described by other posts, on other operating systems without 16-byte alignment guarantees, you can call malloc with the larger size, set aside the pointer for free() later, then align as described immediately above and use the aligned pointer, much as described for our temp buffer case.

As for aligned_memset, this is rather silly. You only have to loop in up to 15 bytes to reach an aligned address, and then proceed with aligned stores after that with some possible cleanup code at the end. You can even do the cleanup bits in vector code, either as unaligned stores that overlap the aligned region (providing the length is at least the length of a vector) or using something like movmaskdqu. Someone is just being lazy. However, it is probably a reasonable interview question if the interviewer wants to know whether you are comfortable with stdint.h, bitwise operators and memory fundamentals, so the contrived example can be forgiven.


The first thing that popped into my head when reading this question was to define an aligned struct, instantiate it, and then point to it.

Is there a fundamental reason I'm missing since no one else suggested this?

As a sidenote, since I used an array of char (assuming the system's char is 8 bits (i.e. 1 byte)), I don't see the need for the __attribute__((packed)) necessarily (correct me if I'm wrong), but I put it in anyway.

This works on two systems I tried it on, but it's possible that there is a compiler optimization that I'm unaware of giving me false positives vis-a-vis the efficacy of the code. I used gcc 4.9.2 on OSX and gcc 5.2.1 on Ubuntu.

#include <stdio.h>
#include <stdlib.h>

int main ()
{

   void *mem;

   void *ptr;

   // answer a) here
   struct __attribute__((packed)) s_CozyMem {
       char acSpace[16];
   };

   mem = malloc(sizeof(struct s_CozyMem));
   ptr = mem;

   // memset_16aligned(ptr, 0, 1024);

   // Check if it's aligned
   if(((unsigned long)ptr & 15) == 0) printf("Aligned to 16 bytes.\n");
   else printf("Rubbish.\n");

   // answer b) here
   free(mem);

   return 1;
}

You can also add some 16 bytes and then push the original ptr to 16bit aligned by adding the (16-mod) as below the pointer :

main(){
void *mem1 = malloc(1024+16);
void *mem = ((char*)mem1)+1; // force misalign ( my computer always aligns)
printf ( " ptr = %p \n ", mem );
void *ptr = ((long)mem+16) & ~ 0x0F;
printf ( " aligned ptr = %p \n ", ptr );

printf (" ptr after adding diff mod %p (same as above ) ", (long)mem1 + (16 -((long)mem1%16)) );


free(mem1);
}

MacOS X specific:

  1. All pointers allocated with malloc are 16 bytes aligned.
  2. C11 is supported, so you can just call aligned_malloc (16, size).

  3. MacOS X picks code that is optimised for individual processors at boot time for memset, memcpy and memmove and that code uses tricks that you've never heard of to make it fast. 99% chance that memset runs faster than any hand-written memset16 which makes the whole question pointless.

If you want a 100% portable solution, before C11 there is none. Because there is no portable way to test alignment of a pointer. If it doesn't have to be 100% portable, you can use

char* p = malloc (size + 15);
p += (- (unsigned int) p) % 16;

This assumes that the alignment of a pointer is stored in the lowest bits when converting a pointer to unsigned int. Converting to unsigned int loses information and is implementation defined, but that doesn't matter because we don't convert the result back to a pointer.

The horrible part is of course that the original pointer must be saved somewhere to call free () with it. So all in all I would really doubt the wisdom of this design.


usage of memalign, Aligned-Memory-Blocks might be a good solution for the problem.


Three slightly different answers depending how you look at the question:

1) Good enough for the exact question asked is Jonathan Leffler's solution, except that to round up to 16-aligned, you only need 15 extra bytes, not 16.

A:

/* allocate a buffer with room to add 0-15 bytes to ensure 16-alignment */
void *mem = malloc(1024+15);
ASSERT(mem); // some kind of error-handling code
/* round up to multiple of 16: add 15 and then round down by masking */
void *ptr = ((char*)mem+15) & ~ (size_t)0x0F;

B:

free(mem);

2) For a more generic memory allocation function, the caller doesn't want to have to keep track of two pointers (one to use and one to free). So you store a pointer to the 'real' buffer below the aligned buffer.

A:

void *mem = malloc(1024+15+sizeof(void*));
if (!mem) return mem;
void *ptr = ((char*)mem+sizeof(void*)+15) & ~ (size_t)0x0F;
((void**)ptr)[-1] = mem;
return ptr;

B:

if (ptr) free(((void**)ptr)[-1]);

Note that unlike (1), where only 15 bytes were added to mem, this code could actually reduce the alignment if your implementation happens to guarantee 32-byte alignment from malloc (unlikely, but in theory a C implementation could have a 32-byte aligned type). That doesn't matter if all you do is call memset_16aligned, but if you use the memory for a struct then it could matter.

I'm not sure off-hand what a good fix is for this (other than to warn the user that the buffer returned is not necessarily suitable for arbitrary structs) since there's no way to determine programatically what the implementation-specific alignment guarantee is. I guess at startup you could allocate two or more 1-byte buffers, and assume that the worst alignment you see is the guaranteed alignment. If you're wrong, you waste memory. Anyone with a better idea, please say so...

[Added: The 'standard' trick is to create a union of 'likely to be maximally aligned types' to determine the requisite alignment. The maximally aligned types are likely to be (in C99) 'long long', 'long double', 'void *', or 'void (*)(void)'; if you include <stdint.h>, you could presumably use 'intmax_t' in place of long long (and, on Power 6 (AIX) machines, intmax_t would give you a 128-bit integer type). The alignment requirements for that union can be determined by embedding it into a struct with a single char followed by the union:

struct alignment
{
    char     c;
    union
    {
        intmax_t      imax;
        long double   ldbl;
        void         *vptr;
        void        (*fptr)(void);
    }        u;
} align_data;
size_t align = (char *)&align_data.u.imax - &align_data.c;

You would then use the larger of the requested alignment (in the example, 16) and the align value calculated above.

On (64-bit) Solaris 10, it appears that the basic alignment for the result from malloc() is a multiple of 32 bytes.
]

In practice, aligned allocators often take a parameter for the alignment rather than it being hardwired. So the user will pass in the size of the struct they care about (or the least power of 2 greater than or equal to that) and all will be well.

3) Use what your platform provides: posix_memalign for POSIX, _aligned_malloc on Windows.

4) If you use C11, then the cleanest - portable and concise - option is to use the standard library function aligned_alloc that was introduced in this version of the language specification.


Perhaps they would have been satisfied with a knowledge of memalign? And as Jonathan Leffler points out, there are two newer preferable functions to know about.

Oops, florin beat me to it. However, if you read the man page I linked to, you'll most likely understand the example supplied by an earlier poster.


Three slightly different answers depending how you look at the question:

1) Good enough for the exact question asked is Jonathan Leffler's solution, except that to round up to 16-aligned, you only need 15 extra bytes, not 16.

A:

/* allocate a buffer with room to add 0-15 bytes to ensure 16-alignment */
void *mem = malloc(1024+15);
ASSERT(mem); // some kind of error-handling code
/* round up to multiple of 16: add 15 and then round down by masking */
void *ptr = ((char*)mem+15) & ~ (size_t)0x0F;

B:

free(mem);

2) For a more generic memory allocation function, the caller doesn't want to have to keep track of two pointers (one to use and one to free). So you store a pointer to the 'real' buffer below the aligned buffer.

A:

void *mem = malloc(1024+15+sizeof(void*));
if (!mem) return mem;
void *ptr = ((char*)mem+sizeof(void*)+15) & ~ (size_t)0x0F;
((void**)ptr)[-1] = mem;
return ptr;

B:

if (ptr) free(((void**)ptr)[-1]);

Note that unlike (1), where only 15 bytes were added to mem, this code could actually reduce the alignment if your implementation happens to guarantee 32-byte alignment from malloc (unlikely, but in theory a C implementation could have a 32-byte aligned type). That doesn't matter if all you do is call memset_16aligned, but if you use the memory for a struct then it could matter.

I'm not sure off-hand what a good fix is for this (other than to warn the user that the buffer returned is not necessarily suitable for arbitrary structs) since there's no way to determine programatically what the implementation-specific alignment guarantee is. I guess at startup you could allocate two or more 1-byte buffers, and assume that the worst alignment you see is the guaranteed alignment. If you're wrong, you waste memory. Anyone with a better idea, please say so...

[Added: The 'standard' trick is to create a union of 'likely to be maximally aligned types' to determine the requisite alignment. The maximally aligned types are likely to be (in C99) 'long long', 'long double', 'void *', or 'void (*)(void)'; if you include <stdint.h>, you could presumably use 'intmax_t' in place of long long (and, on Power 6 (AIX) machines, intmax_t would give you a 128-bit integer type). The alignment requirements for that union can be determined by embedding it into a struct with a single char followed by the union:

struct alignment
{
    char     c;
    union
    {
        intmax_t      imax;
        long double   ldbl;
        void         *vptr;
        void        (*fptr)(void);
    }        u;
} align_data;
size_t align = (char *)&align_data.u.imax - &align_data.c;

You would then use the larger of the requested alignment (in the example, 16) and the align value calculated above.

On (64-bit) Solaris 10, it appears that the basic alignment for the result from malloc() is a multiple of 32 bytes.
]

In practice, aligned allocators often take a parameter for the alignment rather than it being hardwired. So the user will pass in the size of the struct they care about (or the least power of 2 greater than or equal to that) and all will be well.

3) Use what your platform provides: posix_memalign for POSIX, _aligned_malloc on Windows.

4) If you use C11, then the cleanest - portable and concise - option is to use the standard library function aligned_alloc that was introduced in this version of the language specification.


I'm surprised noone's voted up Shao's answer that, as I understand it, it is impossible to do what's asked in standard C99, since converting a pointer to an integral type formally is undefined behavior. (Apart from the standard allowing conversion of uintptr_t <-> void*, but the standard does not seem to allow doing any manipulations of the uintptr_t value and then converting it back.)


Perhaps they would have been satisfied with a knowledge of memalign? And as Jonathan Leffler points out, there are two newer preferable functions to know about.

Oops, florin beat me to it. However, if you read the man page I linked to, you'll most likely understand the example supplied by an earlier poster.


Here's an alternate approach to the 'round up' part. Not the most brilliantly coded solution but it gets the job done, and this type of syntax is a bit easier to remember (plus would work for alignment values that aren't a power of 2). The uintptr_t cast was necessary to appease the compiler; pointer arithmetic isn't very fond of division or multiplication.

void *mem = malloc(1024 + 15);
void *ptr = (void*) ((uintptr_t) mem + 15) / 16 * 16;
memset_16aligned(ptr, 0, 1024);
free(mem);

Unfortunately, in C99 it seems pretty tough to guarantee alignment of any sort in a way which would be portable across any C implementation conforming to C99. Why? Because a pointer is not guaranteed to be the "byte address" one might imagine with a flat memory model. Neither is the representation of uintptr_t so guaranteed, which itself is an optional type anyway.

We might know of some implementations which use a representation for void * (and by definition, also char *) which is a simple byte address, but by C99 it is opaque to us, the programmers. An implementation might represent a pointer by a set {segment, offset} where offset could have who-knows-what alignment "in reality." Why, a pointer could even be some form of hash table lookup value, or even a linked-list lookup value. It could encode bounds information.

In a recent C1X draft for a C Standard, we see the _Alignas keyword. That might help a bit.

The only guarantee C99 gives us is that the memory allocation functions will return a pointer suitable for assignment to a pointer pointing at any object type. Since we cannot specify the alignment of objects, we cannot implement our own allocation functions with responsibility for alignment in a well-defined, portable manner.

It would be good to be wrong about this claim.


On the 16 vs 15 byte-count padding front, the actual number you need to add to get an alignment of N is max(0,N-M) where M is the natural alignment of the memory allocator (and both are powers of 2).

Since the minimal memory alignment of any allocator is 1 byte, 15=max(0,16-1) is a conservative answer. However, if you know your memory allocator is going to give you 32-bit int aligned addresses (which is fairly common), you could have used 12 as a pad.

This isn't important for this example but it might be important on an embedded system with 12K of RAM where every single int saved counts.

The best way to implement it if you're actually going to try to save every byte possible is as a macro so you can feed it your native memory alignment. Again, this is probably only useful for embedded systems where you need to save every byte.

In the example below, on most systems, the value 1 is just fine for MEMORY_ALLOCATOR_NATIVE_ALIGNMENT, however for our theoretical embedded system with 32-bit aligned allocations, the following could save a tiny bit of precious memory:

#define MEMORY_ALLOCATOR_NATIVE_ALIGNMENT    4
#define ALIGN_PAD2(N,M) (((N)>(M)) ? ((N)-(M)) : 0)
#define ALIGN_PAD(N) ALIGN_PAD2((N), MEMORY_ALLOCATOR_NATIVE_ALIGNMENT)

We do this sort of thing all the time for Accelerate.framework, a heavily vectorized OS X / iOS library, where we have to pay attention to alignment all the time. There are quite a few options, one or two of which I didn't see mentioned above.

The fastest method for a small array like this is just stick it on the stack. With GCC / clang:

 void my_func( void )
 {
     uint8_t array[1024] __attribute__ ((aligned(16)));
     ...
 }

No free() required. This is typically two instructions: subtract 1024 from the stack pointer, then AND the stack pointer with -alignment. Presumably the requester needed the data on the heap because its lifespan of the array exceeded the stack or recursion is at work or stack space is at a serious premium.

On OS X / iOS all calls to malloc/calloc/etc. are always 16 byte aligned. If you needed 32 byte aligned for AVX, for example, then you can use posix_memalign:

void *buf = NULL;
int err = posix_memalign( &buf, 32 /*alignment*/, 1024 /*size*/);
if( err )
   RunInCirclesWaivingArmsWildly();
...
free(buf);

Some folks have mentioned the C++ interface that works similarly.

It should not be forgotten that pages are aligned to large powers of two, so page-aligned buffers are also 16 byte aligned. Thus, mmap() and valloc() and other similar interfaces are also options. mmap() has the advantage that the buffer can be allocated preinitialized with something non-zero in it, if you want. Since these have page aligned size, you will not get the minimum allocation from these, and it will likely be subject to a VM fault the first time you touch it.

Cheesy: Turn on guard malloc or similar. Buffers that are n*16 bytes in size such as this one will be n*16 bytes aligned, because VM is used to catch overruns and its boundaries are at page boundaries.

Some Accelerate.framework functions take in a user supplied temp buffer to use as scratch space. Here we have to assume that the buffer passed to us is wildly misaligned and the user is actively trying to make our life hard out of spite. (Our test cases stick a guard page right before and after the temp buffer to underline the spite.) Here, we return the minimum size we need to guarantee a 16-byte aligned segment somewhere in it, and then manually align the buffer afterward. This size is desired_size + alignment - 1. So, In this case that is 1024 + 16 - 1 = 1039 bytes. Then align as so:

#include <stdint.h>
void My_func( uint8_t *tempBuf, ... )
{
    uint8_t *alignedBuf = (uint8_t*) 
                          (((uintptr_t) tempBuf + ((uintptr_t)alignment-1)) 
                                        & -((uintptr_t) alignment));
    ...
}

Adding alignment-1 will move the pointer past the first aligned address and then ANDing with -alignment (e.g. 0xfff...ff0 for alignment=16) brings it back to the aligned address.

As described by other posts, on other operating systems without 16-byte alignment guarantees, you can call malloc with the larger size, set aside the pointer for free() later, then align as described immediately above and use the aligned pointer, much as described for our temp buffer case.

As for aligned_memset, this is rather silly. You only have to loop in up to 15 bytes to reach an aligned address, and then proceed with aligned stores after that with some possible cleanup code at the end. You can even do the cleanup bits in vector code, either as unaligned stores that overlap the aligned region (providing the length is at least the length of a vector) or using something like movmaskdqu. Someone is just being lazy. However, it is probably a reasonable interview question if the interviewer wants to know whether you are comfortable with stdint.h, bitwise operators and memory fundamentals, so the contrived example can be forgiven.


size =1024;
alignment = 16;
aligned_size = size +(alignment -(size %  alignment));
mem = malloc(aligned_size);
memset_16aligned(mem, 0, 1024);
free(mem);

Hope this one is the simplest implementation, let me know your comments.


Three slightly different answers depending how you look at the question:

1) Good enough for the exact question asked is Jonathan Leffler's solution, except that to round up to 16-aligned, you only need 15 extra bytes, not 16.

A:

/* allocate a buffer with room to add 0-15 bytes to ensure 16-alignment */
void *mem = malloc(1024+15);
ASSERT(mem); // some kind of error-handling code
/* round up to multiple of 16: add 15 and then round down by masking */
void *ptr = ((char*)mem+15) & ~ (size_t)0x0F;

B:

free(mem);

2) For a more generic memory allocation function, the caller doesn't want to have to keep track of two pointers (one to use and one to free). So you store a pointer to the 'real' buffer below the aligned buffer.

A:

void *mem = malloc(1024+15+sizeof(void*));
if (!mem) return mem;
void *ptr = ((char*)mem+sizeof(void*)+15) & ~ (size_t)0x0F;
((void**)ptr)[-1] = mem;
return ptr;

B:

if (ptr) free(((void**)ptr)[-1]);

Note that unlike (1), where only 15 bytes were added to mem, this code could actually reduce the alignment if your implementation happens to guarantee 32-byte alignment from malloc (unlikely, but in theory a C implementation could have a 32-byte aligned type). That doesn't matter if all you do is call memset_16aligned, but if you use the memory for a struct then it could matter.

I'm not sure off-hand what a good fix is for this (other than to warn the user that the buffer returned is not necessarily suitable for arbitrary structs) since there's no way to determine programatically what the implementation-specific alignment guarantee is. I guess at startup you could allocate two or more 1-byte buffers, and assume that the worst alignment you see is the guaranteed alignment. If you're wrong, you waste memory. Anyone with a better idea, please say so...

[Added: The 'standard' trick is to create a union of 'likely to be maximally aligned types' to determine the requisite alignment. The maximally aligned types are likely to be (in C99) 'long long', 'long double', 'void *', or 'void (*)(void)'; if you include <stdint.h>, you could presumably use 'intmax_t' in place of long long (and, on Power 6 (AIX) machines, intmax_t would give you a 128-bit integer type). The alignment requirements for that union can be determined by embedding it into a struct with a single char followed by the union:

struct alignment
{
    char     c;
    union
    {
        intmax_t      imax;
        long double   ldbl;
        void         *vptr;
        void        (*fptr)(void);
    }        u;
} align_data;
size_t align = (char *)&align_data.u.imax - &align_data.c;

You would then use the larger of the requested alignment (in the example, 16) and the align value calculated above.

On (64-bit) Solaris 10, it appears that the basic alignment for the result from malloc() is a multiple of 32 bytes.
]

In practice, aligned allocators often take a parameter for the alignment rather than it being hardwired. So the user will pass in the size of the struct they care about (or the least power of 2 greater than or equal to that) and all will be well.

3) Use what your platform provides: posix_memalign for POSIX, _aligned_malloc on Windows.

4) If you use C11, then the cleanest - portable and concise - option is to use the standard library function aligned_alloc that was introduced in this version of the language specification.


Unfortunately, in C99 it seems pretty tough to guarantee alignment of any sort in a way which would be portable across any C implementation conforming to C99. Why? Because a pointer is not guaranteed to be the "byte address" one might imagine with a flat memory model. Neither is the representation of uintptr_t so guaranteed, which itself is an optional type anyway.

We might know of some implementations which use a representation for void * (and by definition, also char *) which is a simple byte address, but by C99 it is opaque to us, the programmers. An implementation might represent a pointer by a set {segment, offset} where offset could have who-knows-what alignment "in reality." Why, a pointer could even be some form of hash table lookup value, or even a linked-list lookup value. It could encode bounds information.

In a recent C1X draft for a C Standard, we see the _Alignas keyword. That might help a bit.

The only guarantee C99 gives us is that the memory allocation functions will return a pointer suitable for assignment to a pointer pointing at any object type. Since we cannot specify the alignment of objects, we cannot implement our own allocation functions with responsibility for alignment in a well-defined, portable manner.

It would be good to be wrong about this claim.


If there are constraints that, you cannot waste a single byte, then this solution works: Note: There is a case where this may be executed infinitely :D

   void *mem;  
   void *ptr;
try:
   mem =  malloc(1024);  
   if (mem % 16 != 0) {  
       free(mem);  
       goto try;
   }  
   ptr = mem;  
   memset_16aligned(ptr, 0, 1024);

Three slightly different answers depending how you look at the question:

1) Good enough for the exact question asked is Jonathan Leffler's solution, except that to round up to 16-aligned, you only need 15 extra bytes, not 16.

A:

/* allocate a buffer with room to add 0-15 bytes to ensure 16-alignment */
void *mem = malloc(1024+15);
ASSERT(mem); // some kind of error-handling code
/* round up to multiple of 16: add 15 and then round down by masking */
void *ptr = ((char*)mem+15) & ~ (size_t)0x0F;

B:

free(mem);

2) For a more generic memory allocation function, the caller doesn't want to have to keep track of two pointers (one to use and one to free). So you store a pointer to the 'real' buffer below the aligned buffer.

A:

void *mem = malloc(1024+15+sizeof(void*));
if (!mem) return mem;
void *ptr = ((char*)mem+sizeof(void*)+15) & ~ (size_t)0x0F;
((void**)ptr)[-1] = mem;
return ptr;

B:

if (ptr) free(((void**)ptr)[-1]);

Note that unlike (1), where only 15 bytes were added to mem, this code could actually reduce the alignment if your implementation happens to guarantee 32-byte alignment from malloc (unlikely, but in theory a C implementation could have a 32-byte aligned type). That doesn't matter if all you do is call memset_16aligned, but if you use the memory for a struct then it could matter.

I'm not sure off-hand what a good fix is for this (other than to warn the user that the buffer returned is not necessarily suitable for arbitrary structs) since there's no way to determine programatically what the implementation-specific alignment guarantee is. I guess at startup you could allocate two or more 1-byte buffers, and assume that the worst alignment you see is the guaranteed alignment. If you're wrong, you waste memory. Anyone with a better idea, please say so...

[Added: The 'standard' trick is to create a union of 'likely to be maximally aligned types' to determine the requisite alignment. The maximally aligned types are likely to be (in C99) 'long long', 'long double', 'void *', or 'void (*)(void)'; if you include <stdint.h>, you could presumably use 'intmax_t' in place of long long (and, on Power 6 (AIX) machines, intmax_t would give you a 128-bit integer type). The alignment requirements for that union can be determined by embedding it into a struct with a single char followed by the union:

struct alignment
{
    char     c;
    union
    {
        intmax_t      imax;
        long double   ldbl;
        void         *vptr;
        void        (*fptr)(void);
    }        u;
} align_data;
size_t align = (char *)&align_data.u.imax - &align_data.c;

You would then use the larger of the requested alignment (in the example, 16) and the align value calculated above.

On (64-bit) Solaris 10, it appears that the basic alignment for the result from malloc() is a multiple of 32 bytes.
]

In practice, aligned allocators often take a parameter for the alignment rather than it being hardwired. So the user will pass in the size of the struct they care about (or the least power of 2 greater than or equal to that) and all will be well.

3) Use what your platform provides: posix_memalign for POSIX, _aligned_malloc on Windows.

4) If you use C11, then the cleanest - portable and concise - option is to use the standard library function aligned_alloc that was introduced in this version of the language specification.


Perhaps they would have been satisfied with a knowledge of memalign? And as Jonathan Leffler points out, there are two newer preferable functions to know about.

Oops, florin beat me to it. However, if you read the man page I linked to, you'll most likely understand the example supplied by an earlier poster.


The first thing that popped into my head when reading this question was to define an aligned struct, instantiate it, and then point to it.

Is there a fundamental reason I'm missing since no one else suggested this?

As a sidenote, since I used an array of char (assuming the system's char is 8 bits (i.e. 1 byte)), I don't see the need for the __attribute__((packed)) necessarily (correct me if I'm wrong), but I put it in anyway.

This works on two systems I tried it on, but it's possible that there is a compiler optimization that I'm unaware of giving me false positives vis-a-vis the efficacy of the code. I used gcc 4.9.2 on OSX and gcc 5.2.1 on Ubuntu.

#include <stdio.h>
#include <stdlib.h>

int main ()
{

   void *mem;

   void *ptr;

   // answer a) here
   struct __attribute__((packed)) s_CozyMem {
       char acSpace[16];
   };

   mem = malloc(sizeof(struct s_CozyMem));
   ptr = mem;

   // memset_16aligned(ptr, 0, 1024);

   // Check if it's aligned
   if(((unsigned long)ptr & 15) == 0) printf("Aligned to 16 bytes.\n");
   else printf("Rubbish.\n");

   // answer b) here
   free(mem);

   return 1;
}

If there are constraints that, you cannot waste a single byte, then this solution works: Note: There is a case where this may be executed infinitely :D

   void *mem;  
   void *ptr;
try:
   mem =  malloc(1024);  
   if (mem % 16 != 0) {  
       free(mem);  
       goto try;
   }  
   ptr = mem;  
   memset_16aligned(ptr, 0, 1024);

usage of memalign, Aligned-Memory-Blocks might be a good solution for the problem.


Here's an alternate approach to the 'round up' part. Not the most brilliantly coded solution but it gets the job done, and this type of syntax is a bit easier to remember (plus would work for alignment values that aren't a power of 2). The uintptr_t cast was necessary to appease the compiler; pointer arithmetic isn't very fond of division or multiplication.

void *mem = malloc(1024 + 15);
void *ptr = (void*) ((uintptr_t) mem + 15) / 16 * 16;
memset_16aligned(ptr, 0, 1024);
free(mem);

For the solution i used a concept of padding which aligns the memory and do not waste the memory of a single byte .

If there are constraints that, you cannot waste a single byte. All pointers allocated with malloc are 16 bytes aligned.

C11 is supported, so you can just call aligned_alloc (16, size).

void *mem = malloc(1024+16);
void *ptr = ((char *)mem+16) & ~ 0x0F;
memset_16aligned(ptr, 0, 1024);
free(mem);

MacOS X specific:

  1. All pointers allocated with malloc are 16 bytes aligned.
  2. C11 is supported, so you can just call aligned_malloc (16, size).

  3. MacOS X picks code that is optimised for individual processors at boot time for memset, memcpy and memmove and that code uses tricks that you've never heard of to make it fast. 99% chance that memset runs faster than any hand-written memset16 which makes the whole question pointless.

If you want a 100% portable solution, before C11 there is none. Because there is no portable way to test alignment of a pointer. If it doesn't have to be 100% portable, you can use

char* p = malloc (size + 15);
p += (- (unsigned int) p) % 16;

This assumes that the alignment of a pointer is stored in the lowest bits when converting a pointer to unsigned int. Converting to unsigned int loses information and is implementation defined, but that doesn't matter because we don't convert the result back to a pointer.

The horrible part is of course that the original pointer must be saved somewhere to call free () with it. So all in all I would really doubt the wisdom of this design.