[algorithm] How can building a heap be O(n) time complexity?

Can someone help explain how building a heap can be O(n) complexity?

Inserting an item into a heap is O(log n), and the insert is repeated n/2 times (the remainder are leaves, and can't violate the heap property). So, this means the complexity should be O(n log n), I would think.

In other words, for each item we "heapify", it has the potential to have to filter down once for each level for the heap so far (which is log n levels).

What am I missing?

This question is related to: algorithm, heap, complexity-theory, construction

The answers:


It would be O(n log n) if you built the heap by repeatedly inserting elements. However, you can create a new heap more efficiently by inserting the elements in arbitrary order and then applying an algorithm to "heapify" them into the proper order (depending on the type of heap of course).

See http://en.wikipedia.org/wiki/Binary_heap, "Building a heap" for an example. In this case you essentially work up from the bottom level of the tree, swapping parent and child nodes until the heap conditions are satisfied.
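As a concrete illustration, here is a minimal sketch of that bottom-up construction in Python (hypothetical helper names; assumes a max-heap stored in a plain list):

    def sift_down(a, i, n):
        # Move a[i] down until the subtree rooted at i is a valid max-heap.
        while True:
            largest, left, right = i, 2 * i + 1, 2 * i + 2
            if left < n and a[left] > a[largest]:
                largest = left
            if right < n and a[right] > a[largest]:
                largest = right
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]
            i = largest

    def build_heap(a):
        # Work up from the last non-leaf node; the leaves need no work.
        n = len(a)
        for i in range(n // 2 - 1, -1, -1):
            sift_down(a, i, n)

Python's standard library takes the same approach: heapq.heapify is documented to transform a list into a (min-)heap, in place, in linear time.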


Let's suppose you have N elements in a heap. Then its height would be log(N).

Now you want to insert another element; the complexity would be log(N), since in the worst case we have to compare all the way UP to the root.

Now you have N+1 elements & height = log(N+1).

Using induction, it can be proved that the total cost of all the insertions is Σ log(i), for i = 1 to n.

Now using

log a + log b = log(ab)

this simplifies to: Σ log(i) = log(n!)

which is actually O(N log N)

But

we are doing something wrong here: in most cases we do not bubble all the way to the top. Hence, while executing, we often find that we are not going even halfway up the tree. Therefore, this bound can be tightened to another, tighter bound using the mathematics given in the other answers.

This realization came to me after detailed thought and experimentation with heaps.
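For contrast, here is a sketch of the repeated-insertion construction analysed above (hypothetical sift_up helper, max-heap): the ith insert may climb up to log(i) levels, which is exactly where the sum log(1) + ... + log(n) = log(n!) comes from.

    def sift_up(a, i):
        # Move a[i] up toward the root until the max-heap property holds.
        while i > 0:
            parent = (i - 1) // 2
            if a[i] <= a[parent]:
                break
            a[i], a[parent] = a[parent], a[i]
            i = parent

    def build_heap_by_insertion(items):
        # n successive inserts: O(log 1 + log 2 + ... + log n) = O(n log n).
        heap = []
        for x in items:
            heap.append(x)
            sift_up(heap, len(heap) - 1)
        return heap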


Basically, work is done only on non-leaf nodes while building a heap, and the work done is the amount of swapping down needed to satisfy the heap condition. In other words (in the worst case), the amount is proportional to the height of the node. All in all, the complexity of the problem is proportional to the sum of the heights of all the non-leaf nodes, which is (2^(h+1) - 1) - h - 1 = n - h - 1 = O(n).
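That identity is easy to verify numerically; a small sketch, assuming a perfect binary tree of height h (2^d nodes at depth d, each of height h - d):

    def sum_of_heights(h):
        # Sum of node heights in a perfect binary tree of height h.
        return sum(2 ** d * (h - d) for d in range(h + 1))

    for h in range(1, 12):
        n = 2 ** (h + 1) - 1  # node count of a perfect tree of height h
        assert sum_of_heights(h) == n - h - 1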


As we know, the height of a heap is h = log(n), where n is the total number of elements.
   When we perform the heapify operation, the elements at the last level (h) won't move even a single step.
   The number of elements at the second-to-last level (h-1) is 2^(h-1), and they can move at most 1 level (during heapify).
   Similarly, at the ith level there are 2^i elements, which can move at most h-i positions.

Therefore the total number of moves is S = 2^h * 0 + 2^(h-1) * 1 + 2^(h-2) * 2 + ... + 2^0 * h

                                               S = 2^h {1/2 + 2/2^2 + 3/2^3 + ... + h/2^h} ------------------------------------------------- (1)

This is an AGP (arithmetico-geometric) series; to solve it, divide both sides by 2:

                                               S/2 = 2^h {1/2^2 + 2/2^3 + ... + h/2^(h+1)} ------------------------------------------------- (2)

Subtracting equation (2) from (1) gives:

                                               S/2 = 2^h {1/2 + 1/2^2 + 1/2^3 + ... + 1/2^h - h/2^(h+1)}
                                               S = 2^(h+1) {1/2 + 1/2^2 + 1/2^3 + ... + 1/2^h - h/2^(h+1)}

Now 1/2 + 1/2^2 + 1/2^3 + ... + 1/2^h is a decreasing GP whose sum is less than 1 (as h tends to infinity, the sum tends to 1). Taking 1 as an upper bound on that sum (and dropping the negative term) gives:

S <= 2^(h+1) = 2 * 2^h

As h = log(n), 2^h = n.

Therefore S <= 2n, and
T(n) = O(n)


In the case of building the heap, we start from the level just above the leaves, at depth (log n) - 1, where log n is the height of a tree of n elements. An element present at height h can sift down at most h levels.

    So the total number of traversals would be:
    T(n) = sigma((2^(logn - h)) * h), where h varies from 1 to logn
    T(n) = n((1/2) + (2/4) + (3/8) + ... + (logn/(2^logn)))
    T(n) = n * (sigma(x/(2^x))), where x varies from 1 to logn
    The function in the brackets approaches 2 at infinity.
    Hence T(n) ~ O(n)
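The limit of 2 for the bracketed series is the standard arithmetico-geometric sum; in LaTeX form:

    \sum_{x=1}^{\infty} \frac{x}{2^x} = \frac{1/2}{(1 - 1/2)^2} = 2,
    \quad \text{using} \quad \sum_{x=1}^{\infty} x t^x = \frac{t}{(1-t)^2} \ \text{for} \ |t| < 1.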

Proof of O(n)

The proof isn't fancy and is quite straightforward; I only proved the case for a full binary tree, but the result can be generalized to a complete binary tree.


Intuitively:

"The complexity should be O(nLog n)... for each item we "heapify", it has the potential to have to filter down once for each level for the heap so far (which is log n levels)."

Not quite. Your logic does not produce a tight bound -- it overestimates the complexity of each heapify. If built from the bottom up, insertion (heapify) can be much less than O(log(n)). The process is as follows:

( Step 1 ) The first n/2 elements go on the bottom row of the heap. h=0, so heapify is not needed.

( Step 2 ) The next n/2^2 elements go on the row 1 up from the bottom. h=1, heapify filters 1 level down.

( Step i ) The next n/2^i elements go in row i up from the bottom. h=i, heapify filters i levels down.

( Step log(n) ) The last n/2^(log2(n)) = 1 element goes in row log(n) up from the bottom. h=log(n), heapify filters log(n) levels down.

NOTICE that after step 1, half of the elements (n/2) are already in the heap, and we didn't even need to call heapify once. Also, notice that only a single element, the root, actually incurs the full log(n) complexity.
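This can be made concrete by instrumenting the bottom-up build to count how many levels each node actually filters down (a sketch with a hypothetical counter; the total stays below 2n regardless of input):

    import random

    def build_heap_count_moves(a):
        # Bottom-up max-heap build; returns the total number of
        # one-level sift-down moves performed across all nodes.
        n, moves = len(a), 0
        for start in range(n // 2 - 1, -1, -1):
            i = start
            while True:
                largest, left, right = i, 2 * i + 1, 2 * i + 2
                if left < n and a[left] > a[largest]:
                    largest = left
                if right < n and a[right] > a[largest]:
                    largest = right
                if largest == i:
                    break
                a[i], a[largest] = a[largest], a[i]
                i = largest
                moves += 1
        return moves

    n = 1 << 16
    print(build_heap_count_moves(random.sample(range(n), n)), "< 2n =", 2 * n)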


Theoretically:

The total number of steps N needed to build a heap of size n can be written out mathematically.

At height i, we've shown (above) that there will be n/2^(i+1) elements that need to call heapify, and we know heapify at height i is O(i). This gives:

    N = Σ (i = 0 to log n) [ (n / 2^(i+1)) * O(i) ]
      = O( n * Σ (i = 0 to log n) i / 2^i )
      = O( n * Σ (i = 0 to ∞) i / 2^i )

The solution to the last summation can be found by taking the derivative of both sides of the well-known geometric series equation:

    Σ (i = 0 to ∞) x^i = 1/(1 - x), for |x| < 1
    Differentiating and multiplying both sides by x:
    Σ (i = 0 to ∞) i * x^i = x/(1 - x)^2

Finally, plugging x = 1/2 into the above equation yields 2. Substituting this into the first equation gives:

    N = O( n * 2 ) = O(n)

Thus, the total number of steps is of size O(n)


Successive insertions can be described by:

T = O(log(1) + log(2) + .. + log(n)) = O(log(n!))

By Stirling's approximation, n! =~ O(n^(n + O(1))), therefore T =~ O(n log(n)).
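Written out, Stirling's approximation makes that bound explicit (in LaTeX form):

    \log(n!) = n \log n - n \log e + O(\log n) = \Theta(n \log n)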

Hope this helps; the optimal way, O(n), is to use the build-heap algorithm for a given set (the ordering doesn't matter).


While building a heap, let's say you're taking a bottom-up approach.

  1. You take each element and compare it with its children to check if the pair conforms to the heap rules. The leaves, therefore, get included in the heap for free, because they have no children.
  2. Moving upwards, the worst-case scenario for the nodes right above the leaves would be 1 comparison (at most they would be compared with just one generation of children).
  3. Moving further up, their immediate parents can at most be compared with two generations of children.
  4. Continuing in the same direction, you'll have log(n) comparisons for the root in the worst-case scenario, and log(n)-1 for its immediate children, log(n)-2 for their immediate children, and so on.
  5. So summing it all up, you arrive at something like log(n) + {log(n)-1}*2 + {log(n)-2}*4 + ... + 1*2^((log n)-1), which is nothing but O(n).

"The linear time bound of build Heap, can be shown by computing the sum of the heights of all the nodes in the heap, which is the maximum number of dashed lines. For the perfect binary tree of height h containing N = 2^(h+1) – 1 nodes, the sum of the heights of the nodes is N – H – 1. Thus it is O(N)."


We get the runtime for the heap build by figuring out the maximum number of moves each node can take. So we need to know how many nodes are in each row and how far from there each node can go.

Starting from the root node, each next row has double the nodes of the previous row, so by answering how often we can double the number of nodes until we have none left, we get the height of the tree. Or, in mathematical terms, the height of the tree is log2(n), n being the length of the array.

To calculate the nodes in one row, we start from the back: we know that n/2 nodes are at the bottom, so by dividing by 2 we get the previous row, and so on.

Based on this we get the following formula for the siftdown approach:

(0 * n/2) + (1 * n/4) + (2 * n/8) + ... + (log2(n) * 1)

The term in the last parenthesis is the height of the tree multiplied by the one node at the root; the term in the first parenthesis is all the nodes in the bottom row multiplied by the distance they can travel, 0. Factoring out the n, the series in between sums to a constant (less than 2).

Bringing the n back in, we have at most 2 * n; the 2 can be discarded because it is a constant, and tada, we have the worst-case runtime of the siftdown approach: n.


I really like the explanation by Jeremy West... another approach which is really easy to understand is given here: http://courses.washington.edu/css343/zander/NotesProbs/heapcomplexity

Since buildHeap depends on heapify, and heapify uses the siftdown approach, the total cost depends on the sum of the heights of all nodes. That sum is S = sigma(i = 0 to h) of (2^i * (h - i)), where h = log n is the height of the tree. Solving for S, we get S = 2^(h+1) - 1 - (h + 1). Since n = 2^(h+1) - 1, we have S = n - h - 1 = n - log n - 1. Hence S = O(n), and the complexity of buildHeap is O(n).


@bcorso has already demonstrated the proof of the complexity analysis. But for the sake of those still learning complexity analysis, I have this to add:

The basis of your original mistake is a misinterpretation of the statement "insertion into a heap takes O(log n) time". Insertion into a heap is indeed O(log n), but you have to recognise that n there is the size of the heap during the insertion.

In the context of inserting n objects into a heap, the complexity of the ith insertion is O(log n_i), where n_i is the size of the heap at the time of insertion i. Only the last insertion has a complexity of O(log n).
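A quick numeric check (a sketch) that those per-insertion costs really do add up to Θ(n log n), i.e. that the smaller intermediate heaps do not help asymptotically:

    import math

    n = 1_000_000
    # Worst-case comparisons summed over all n insertions: log(1) + ... + log(n).
    total = sum(math.log2(i) for i in range(1, n + 1))
    print(total / (n * math.log2(n)))  # ratio tends to 1 as n grows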


Your analysis is correct. However, it is not tight.

It is not really easy to explain briefly why building a heap is a linear operation; you are better off reading a full analysis of it.

A great analysis of the algorithm can be seen here.


The main idea is that in the build_heap algorithm, the actual heapify cost is not O(log n) for all elements.

When heapify is called, the running time depends on how far an element might move down in the tree before the process terminates. In other words, it depends on the height of the element in the heap. In the worst case, the element might go down all the way to the leaf level.

Let us count the work done level by level.

At the bottommost level there are 2^h nodes, but we do not call heapify on any of these, so the work is 0. At the next level up there are 2^(h-1) nodes, and each might move down by 1 level. At the 3rd level from the bottom there are 2^(h-2) nodes, and each might move down by 2 levels.

As you can see not all heapify operations are O(log n), this is why you are getting O(n).


There are already some great answers, but I would like to add a little visual explanation.

[Image: a binary heap of 23 nodes, with its nodes coloured green (height 0), red (height 1), blue (height 2), and purple (height 3)]

Now, take a look at the image. There are:
n/2^1 green nodes with height 0 (here 23/2 = 12)
n/2^2 red nodes with height 1 (here 23/4 = 6)
n/2^3 blue nodes with height 2 (here 23/8 = 3)
n/2^4 purple nodes with height 3 (here 23/16 = 2)
so there are n/2^(h+1) nodes of height h.
To find the time complexity, let's count the amount of work done (or the maximum number of iterations performed) by each node.
Now it can be noticed that each node can perform (at most) a number of iterations equal to the height of the node.

Green  = n/2^1 * 0 (no iterations since no children)  
red    = n/2^2 * 1 (heapify will perform at most one swap for each red node)  
blue   = n/2^3 * 2 (heapify will perform at most two swaps for each blue node)  
purple = n/2^4 * 3 (heapify will perform at most three swaps for each purple node)   

so for any node of height h, the maximum work done is n/2^(h+1) * h

Now total work done is

->(n/2^1 * 0) + (n/2^2 * 1)+ (n/2^3 * 2) + (n/2^4 * 3) +...+ (n/2^(h+1) * h)  
-> n * ( 0 + 1/4 + 2/8 + 3/16 +...+ h/2^(h+1) ) 

now for any value of h, the sequence

-> ( 0 + 1/4 + 2/8 + 3/16 +...+ h/2^(h+1) ) 

will never exceed 1.
Thus the time complexity of building the heap will never exceed O(n).
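A one-line sanity check of that claim (a sketch; the partial sums climb toward 1 but never reach it):

    for h in (1, 2, 5, 10, 20, 50):
        s = sum(k / 2 ** (k + 1) for k in range(h + 1))
        assert s < 1
        print(h, s)  # 0.25, 0.5, 0.890625, ... approaching 1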


I think you're making a mistake. Take a look at this: http://golang.org/pkg/container/heap/ Building a heap isn't O(n). However, inserting is O(lg(n)). I'm assuming initialization is O(n) if you set a heap size, b/c the heap needs to allocate space and set up the data structure. If you have n items to put into the heap then yes, each insert is lg(n) and there are n items, so you get n*lg(n) as you stated.

