Here's another version for us Framework users abandoned by Microsoft. It is 4 times as fast as Array.Clear
and faster than Panos Theof's solution and Eric J's and Petar Petrov's parallel one - up to two times as fast for large arrays.
First I want to present you the function's ancestor, because that makes it easier to understand the code. Performance-wise this is pretty much on par with Panos Theof's code, and for some things that may already be enough:
public static void Fill<T> (T[] array, int count, T value, int threshold = 32)
{
if (threshold <= 0)
throw new ArgumentException("threshold");
int current_size = 0, keep_looping_up_to = Math.Min(count, threshold);
while (current_size < keep_looping_up_to)
array[current_size++] = value;
for (int at_least_half = (count + 1) >> 1; current_size < at_least_half; current_size <<= 1)
Array.Copy(array, 0, array, current_size, current_size);
Array.Copy(array, 0, array, current_size, count - current_size);
}
As you can see, this is based on repeated doubling of the already-initialised part. This is simple and efficient, but it runs afoul of modern memory architectures. Hence was born a version that uses doubling only to create a cache-friendly seed block, which is then blasted iteratively over the target area:
const int ARRAY_COPY_THRESHOLD = 32; // 16 ... 64 work equally well for all tested constellations
const int L1_CACHE_SIZE = 1 << 15;
public static void Fill<T> (T[] array, int count, T value, int element_size)
{
int current_size = 0, keep_looping_up_to = Math.Min(count, ARRAY_COPY_THRESHOLD);
while (current_size < keep_looping_up_to)
array[current_size++] = value;
int block_size = L1_CACHE_SIZE / element_size / 2;
int keep_doubling_up_to = Math.Min(block_size, count >> 1);
for ( ; current_size < keep_doubling_up_to; current_size <<= 1)
Array.Copy(array, 0, array, current_size, current_size);
for (int enough = count - block_size; current_size < enough; current_size += block_size)
Array.Copy(array, 0, array, current_size, block_size);
Array.Copy(array, 0, array, current_size, count - current_size);
}
Note: the earlier code needed (count + 1) >> 1
as limit for the doubling loop to ensure that the final copy operation has enough fodder to cover all that is left. This would not be the case for odd counts if count >> 1
were to be used instead. For the current version this is of no significance since the linear copy loop will pick up any slack.
The size of an array cell must be passed as a parameter because - the mind boggles - generics aren't allowed to use sizeof
unless they use a constraint (unmanaged
) that may or may not become available in the future. Wrong estimates are not a big deal but performance is best if the value is accurate, for the following reasons:
Underestimating the element size can lead to block sizes greater than half the L1 cache, hence increasing the likelihood of copy source data getting evicted from L1 and having to be re-fetched from slower cache levels.
Overestimating the element size results in under-utilisation of the CPU's L1 cache, meaning the linear block copy loop gets executed more often than it would be with optimal utilisation. Thus, more of the fixed loop/call overhead is incurred than strictly necessary.
Here's a benchmark pitting my code against Array.Clear
and the other three solutions mentioned previously. The timings are for filling integer arrays (Int32[]
) of the given sizes. In order to reduce variation caused by cache vagaries etc. each test was executed twice, back to back, and the timings were taken for the second execution.
array size Array.Clear Eric J. Panos Theof Petar Petrov Darth Gizka
-------------------------------------------------------------------------------
1000: 0,7 µs 0,2 µs 0,2 µs 6,8 µs 0,2 µs
10000: 8,0 µs 1,4 µs 1,2 µs 7,8 µs 0,9 µs
100000: 72,4 µs 12,4 µs 8,2 µs 33,6 µs 7,5 µs
1000000: 652,9 µs 135,8 µs 101,6 µs 197,7 µs 71,6 µs
10000000: 7182,6 µs 4174,9 µs 5193,3 µs 3691,5 µs 1658,1 µs
100000000: 67142,3 µs 44853,3 µs 51372,5 µs 35195,5 µs 16585,1 µs
Should the performance of this code not be sufficient then a promising avenue would be parallelising the linear copy loop (with all threads using the same source block), or our good old friend P/Invoke.
Note: clearing and filling of blocks is normally done by runtime routines that branch to highly specialised code using MMX/SSE instructions and whatnot, so in any decent environment one would simply call the respective moral equivalent of std::memset
and be assured of professional performance levels. IOW, by rights the library function Array.Clear
should leave all our hand-rolled versions in the dust. The fact that it is the other way around shows how far out of whack things really are. Same goes for having to roll one's own Fill<>
in the first place, because it is still only in Core and Standard but not in the Framework. .NET has been around for almost twenty years now and we still have to P/Invoke left and right for the most basic stuff or roll our own...