What is the fastest way to transpose a matrix in C++?

Question

I have a matrix (relatively big) that I need to transpose. For example, assume that my matrix is

    a b c d e f
    g h i j k l
    m n o p q r

I want the result to be as follows:

    a g m
    b h n
    c i o
    d j p
    e k q
    f l r

What is the fastest way to do this?

User · Answer

I think the fastest way should take no more than O(n^2) time, and for a square matrix it needs only O(1) extra space.
The way to do that is to swap elements in pairs. Transposing a matrix means M[i][j] = M[j][i], so store M[i][j] in temp, then set M[i][j] = M[j][i], and as the last step set M[j][i] = temp. Each pair must be swapped only once (visit only the elements with j > i), so this is done in one pass and takes O(n^2).
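The pairwise swap described above could be sketched as follows. This is a minimal illustration that assumes a square n x n matrix stored row-major in a flat std::vector; the function name transpose_square_in_place is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Transpose an n x n matrix stored row-major in a flat vector, in place.
// Only the elements above the diagonal are visited, so each pair
// (i, j) / (j, i) is swapped exactly once.
template <class T>
void transpose_square_in_place(std::vector<T>& m, std::size_t n)
{
    assert(m.size() == n * n);  // this sketch handles square matrices only
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            std::swap(m[i * n + j], m[j * n + i]);
        }
    }
}
```

Starting the inner loop at i + 1 matters: looping over all (i, j) would swap every pair twice and leave the matrix unchanged.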

User · Answer

    template <class T>
    void transpose(const std::vector<std::vector<T>>& a,
                   std::vector<std::vector<T>>& b,
                   int width, int height)
    {
        for (int i = 0; i < width; i++)
        {
            for (int j = 0; j < height; j++)
            {
                b[j][i] = a[i][j];
            }
        }
    }

User · Answer

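If the sizes of the arrays are known beforehand, then we could use a union to help. Like this:

    #include <bits/stdc++.h>
    using namespace std;

    union ua {
        int arr[2][3];
        int brr[3][2];
    };

    int main() {
        union ua uav;
        int karr[2][3] = {{1, 2, 3}, {4, 5, 6}};
        memcpy(uav.arr, karr, sizeof(karr));
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 2; j++)
                cout << uav.brr[i][j] << " ";
            cout << '\n';
        }
        return 0;
    }

Note that this only reinterprets the same six integers as a 3x2 array, so it prints 1 2 / 3 4 / 5 6, which is a reshape rather than a transpose. Reading brr after writing arr is also undefined behavior in standard C++.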

User · Answer

This is going to depend on your application, but in general the fastest way to transpose a matrix would be to invert your coordinates when you do a lookup; then you do not have to actually move any data.
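As a rough sketch of that idea, a read-only view can swap the indices at lookup time. The TransposedView type and its member names are hypothetical, and the sketch assumes the original matrix is stored row-major in a flat std::vector<double>.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical read-only view over a row-major matrix; no data is moved.
struct TransposedView {
    const std::vector<double>& data;  // original matrix, rows x cols, row-major
    std::size_t rows;                 // rows of the original matrix
    std::size_t cols;                 // columns of the original matrix

    // Element (i, j) of the transpose is element (j, i) of the original.
    double at(std::size_t i, std::size_t j) const
    {
        assert(i < cols && j < rows);
        return data[j * cols + i];
    }
};
```

The trade-off is that walking a row of the view walks a column of the underlying storage, so this tends to suit occasional lookups better than tight loops over the whole matrix.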

User · Answer

My answer transposes a 3x3 matrix:

    #include <iostream>
    using namespace std;

    int main() {
        int a[3][3];
        cout << "You must give us a 3x3 array and then we will give you its transpose" << endl;
        // read the matrix element by element
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                cout << "Enter a[" << i << "][" << j << "]: ";
                cin >> a[i][j];
            }
        }
        cout << "Matrix you entered is:" << endl;
        for (int e = 0; e < 3; e++) {
            for (int f = 0; f < 3; f++)
                cout << a[e][f] << "\t";
            cout << endl;
        }
        // print the transpose by swapping the indices
        cout << "\nTranspose of the matrix you entered is:" << endl;
        for (int c = 0; c < 3; c++) {
            for (int d = 0; d < 3; d++)
                cout << a[d][c] << "\t";
            cout << endl;
        }
        return 0;
    }

User · Answer

Transposing without any overhead (class not complete):

    #include <utility>  // std::swap

    class Matrix {
        double* data;  // suppose this will point to data (row-major)
        double _get1(int i, int j) { return data[i * N + j]; }  // used to access normally
        double _get2(int i, int j) { return data[j * M + i]; }  // used when transposed

    public:
        int M, N;  // dimensions (rows, columns)
        double (Matrix::*get_p)(int, int);  // pointer to the member function used to access elements

        Matrix(int M_, int N_) : M(M_), N(N_) {
            // allocate data
            get_p = &Matrix::_get1;  // initialised with normal access
        }

        double get(int i, int j) {
            // there should be a way to directly use get_p to call, but I think even this
            // doesn't incur overhead because it is inline and the compiler should be
            // intelligent enough to remove the extra call
            return (this->*get_p)(i, j);
        }

        void transpose() {  // twice transpose gives the original
            if (get_p == &Matrix::_get1) get_p = &Matrix::_get2;
            else get_p = &Matrix::_get1;
            std::swap(M, N);
        }
    };

It can be used like this:

    Matrix M(100, 200);
    double x = M.get(17, 45);
    M.transpose();
    x = M.get(17, 45);  // = original M(45, 17)

Of course I didn't bother with the memory management here, which is crucial but a different topic.

User · Answer

Modern linear algebra libraries include optimized versions of the most common operations. Many of them include dynamic CPU dispatch, which chooses the best implementation for the hardware at program execution time (without compromising on portability).

This is commonly a better alternative to manually optimizing your functions via vector-extension intrinsic functions. The latter will tie your implementation to a particular hardware vendor and model: if you decide to swap to a different vendor (e.g. Power, ARM) or to newer vector extensions (e.g. AVX512), you will need to re-implement it to get the most out of them.

MKL transposition, for example, includes the BLAS extensions function imatcopy. You can find it in other implementations such as OpenBLAS as well:

    #include <mkl.h>

    // In-place transpose of an n x m row-major matrix.
    void transpose(float* a, int n, int m) {
        const char row_major = 'R';
        const char transpose = 'T';
        const float alpha = 1.0f;
        // lda = m (row length of the source), ldb = n (row length of the result)
        mkl_simatcopy(row_major, transpose, n, m, alpha, a, m, n);
    }

For a C++ project, you can make use of the Armadillo C++ library:

    #include <armadillo>

    void transpose(arma::mat& matrix) {
        arma::inplace_trans(matrix);
    }

User · Answer

Some details about transposing 4x4 square float (I will discuss 32-bit integer later) matrices with x86 hardware. It's helpful to start here in order to transpose larger square matrices such as 8x8 or 16x16.

_MM_TRANSPOSE4_PS(r0, r1, r2, r3) is implemented differently by different compilers. GCC and ICC (I have not checked Clang) use unpcklps, unpckhps, unpcklpd, unpckhpd whereas MSVC uses only shufps. We can actually combine these two approaches together like this:

    t0 = _mm_unpacklo_ps(r0, r1);
    t1 = _mm_unpackhi_ps(r0, r1);
    t2 = _mm_unpacklo_ps(r2, r3);
    t3 = _mm_unpackhi_ps(r2, r3);

    r0 = _mm_shuffle_ps(t0, t2, 0x44);
    r1 = _mm_shuffle_ps(t0, t2, 0xEE);
    r2 = _mm_shuffle_ps(t1, t3, 0x44);
    r3 = _mm_shuffle_ps(t1, t3, 0xEE);

One interesting observation is that two shuffles can be converted to one shuffle and two blends (SSE4.1) like this:

    t0 = _mm_unpacklo_ps(r0, r1);
    t1 = _mm_unpackhi_ps(r0, r1);
    t2 = _mm_unpacklo_ps(r2, r3);
    t3 = _mm_unpackhi_ps(r2, r3);

    v  = _mm_shuffle_ps(t0, t2, 0x4E);
    r0 = _mm_blend_ps(t0, v, 0xC);
    r1 = _mm_blend_ps(t2, v, 0x3);
    v  = _mm_shuffle_ps(t1, t3, 0x4E);
    r2 = _mm_blend_ps(t1, v, 0xC);
    r3 = _mm_blend_ps(t3, v, 0x3);

This effectively converts 4 shuffles into 2 shuffles and 4 blends. It uses 2 more instructions than the implementations of GCC, ICC, and MSVC. The advantage is that it reduces port pressure, which may have a benefit in some circumstances. Currently all the shuffles and unpacks can go only to one particular port, whereas the blends can go to either of two different ports.

I tried using 8 shuffles like MSVC and converting that into 4 shuffles + 8 blends, but it did not work. I still had to use 4 unpacks.

I used this same technique for an 8x8 float transpose (see towards the end of that answer): https://stackoverflow.com/a/25627536/2542702. In that answer I still had to use 8 unpacks, but I managed to convert the 8 shuffles into 4 shuffles and 8 blends.

For 32-bit integers there is nothing like shufps (except for 128-bit shuffles with AVX512), so it can only be implemented with unpacks, which I don't think can be converted to blends (efficiently).

With AVX512, vshufi32x4 acts effectively like shufps except for 128-bit lanes of 4 integers instead of 32-bit floats, so this same technique might be possible with vshufi32x4 in some cases. With Knights Landing, shuffles are four times slower (throughput) than blends.
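For reference, the unpack-plus-shuffle sequence from this answer can be wrapped into a small function. This is only a sketch: the function name transpose4x4_unpack_shuffle and the flat 16-float layout are assumptions for illustration, it uses unaligned loads and stores for simplicity, and it compiles only for x86 targets with SSE.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Transpose a 4x4 float matrix stored row-major in `in` (16 floats)
// and write the result row-major to `out` (16 floats).
inline void transpose4x4_unpack_shuffle(const float* in, float* out)
{
    // Load the four rows.
    __m128 r0 = _mm_loadu_ps(in + 0);
    __m128 r1 = _mm_loadu_ps(in + 4);
    __m128 r2 = _mm_loadu_ps(in + 8);
    __m128 r3 = _mm_loadu_ps(in + 12);

    // Interleave pairs of rows: t0 = a0 b0 a1 b1, t1 = a2 b2 a3 b3, etc.
    __m128 t0 = _mm_unpacklo_ps(r0, r1);
    __m128 t1 = _mm_unpackhi_ps(r0, r1);
    __m128 t2 = _mm_unpacklo_ps(r2, r3);
    __m128 t3 = _mm_unpackhi_ps(r2, r3);

    // 0x44 picks the low halves of both operands, 0xEE the high halves,
    // which yields the four columns of the input as rows of the output.
    _mm_storeu_ps(out + 0,  _mm_shuffle_ps(t0, t2, 0x44));
    _mm_storeu_ps(out + 4,  _mm_shuffle_ps(t0, t2, 0xEE));
    _mm_storeu_ps(out + 8,  _mm_shuffle_ps(t1, t3, 0x44));
    _mm_storeu_ps(out + 12, _mm_shuffle_ps(t1, t3, 0xEE));
}
```

If the buffers are known to be 16-byte aligned, _mm_load_ps and _mm_store_ps could be used instead of the unaligned variants.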

User · Answer

This is a good question. There are many reasons you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.

First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer, where I found a much faster solution):

    // src is N x M (row-major), dst is M x N
    void transpose(float *src, float *dst, const int N, const int M) {
        #pragma omp parallel for
        for(int n = 0; n<N*M; n++) {
            int i = n/N;
            int j = n%N;
            dst[n] = src[M*j + i];
        }
    }

Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way:

    for(int i=0; i<N; i++) {
        for(int j=0; j<K; j++) {
            float tmp = 0;
            for(int l=0; l<M; l++) {
                tmp += A[M*i+l]*B[K*l+j];
            }
            C[K*i + j] = tmp;
        }
    }

That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first:

    transpose(B);  // B is now K x M, so its rows have length M
    for(int i=0; i<N; i++) {
        for(int j=0; j<K; j++) {
            float tmp = 0;
            for(int l=0; l<M; l++) {
                tmp += A[M*i+l]*B[M*j+l];
            }
            C[K*i + j] = tmp;
        }
    }
    transpose(B);

Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication, loop tiling is even more effective than taking the transpose, but that's much more complicated.

I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case, but I could imagine gathering a column and writing out a row. Maybe it will make the transpose unnecessary.

For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem, so what you do is:

    Smear image horizontally
    transpose output
    Smear output horizontally
    transpose output

Here is a paper by Intel explaining that: http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions

Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use:

    void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
        #pragma omp parallel for
        for(int n=0; n<M*N; n++) {
            int k = vec_size*(n/N/vec_size);
            int i = (n/vec_size)%N;
            int j = n%vec_size;
            B[n] = A[M*i + k + j];
        }
    }

EDIT:

I tried several functions to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).

    inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
        #pragma omp parallel for
        for(int i=0; i<block_size; i++) {
            for(int j=0; j<block_size; j++) {
                B[j*ldb + i] = A[i*lda + j];
            }
        }
    }

    inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
        #pragma omp parallel for
        for(int i=0; i<n; i+=block_size) {
            for(int j=0; j<m; j+=block_size) {
                transpose_scalar_block(&A[i*lda + j], &B[j*ldb + i], lda, ldb, block_size);
            }
        }
    }

The values lda and ldb are the widths of the matrices. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this:

    #define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
    const int n = 3000;
    const int m = 1001;
    int lda = ROUND_UP(m, 16);
    int ldb = ROUND_UP(n, 16);

    float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
    float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);

For 3000x1001 this returns ldb = 3008 and lda = 1008.

Edit:

I found an even faster solution using SSE intrinsics:

    inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
        __m128 row1 = _mm_load_ps(&A[0*lda]);
        __m128 row2 = _mm_load_ps(&A[1*lda]);
        __m128 row3 = _mm_load_ps(&A[2*lda]);
        __m128 row4 = _mm_load_ps(&A[3*lda]);
        _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
        _mm_store_ps(&B[0*ldb], row1);
        _mm_store_ps(&B[1*ldb], row2);
        _mm_store_ps(&B[2*ldb], row3);
        _mm_store_ps(&B[3*ldb], row4);
    }

    inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
        #pragma omp parallel for
        for(int i=0; i<n; i+=block_size) {
            for(int j=0; j<m; j+=block_size) {
                int max_i2 = i+block_size < n ? i + block_size : n;
                int max_j2 = j+block_size < m ? j + block_size : m;
                for(int i2=i; i2<max_i2; i2+=4) {
                    for(int j2=j; j2<max_j2; j2+=4) {
                        transpose4x4_SSE(&A[i2*lda + j2], &B[j2*ldb + i2], lda, ldb);
                    }
                }
            }
        }
    }
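To make the loop-blocking idea easier to try out, here is a scalar sketch that does not require padded leading dimensions. It is not the code from this answer: the name transpose_blocked is made up, it uses std::vector instead of _mm_malloc, it omits OpenMP, and it clamps the tile bounds so that rows and cols need not be multiples of the block size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Out-of-place transpose of a rows x cols row-major matrix, processed one
// block x block tile at a time so that reads and writes stay within a
// small, cache-friendly region.
std::vector<float> transpose_blocked(const std::vector<float>& src,
                                     std::size_t rows, std::size_t cols,
                                     std::size_t block = 16)
{
    assert(src.size() == rows * cols);
    assert(block > 0);
    std::vector<float> dst(rows * cols);
    for (std::size_t i = 0; i < rows; i += block) {
        const std::size_t i_end = std::min(i + block, rows);  // clamp last tile
        for (std::size_t j = 0; j < cols; j += block) {
            const std::size_t j_end = std::min(j + block, cols);
            for (std::size_t i2 = i; i2 < i_end; ++i2) {
                for (std::size_t j2 = j; j2 < j_end; ++j2) {
                    // dst is cols x rows, so its row length is `rows`.
                    dst[j2 * rows + i2] = src[i2 * cols + j2];
                }
            }
        }
    }
    return dst;
}
```

The best block size depends on the cache sizes of the target machine, so the default of 16 taken from the answer is a starting point to benchmark rather than a fixed choice.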

User · Answer

Intel MKL offers in-place and out-of-place transposition/copying of matrices. Here is the link to the documentation. I would recommend trying the out-of-place implementation, as it is faster than the in-place one. Note that the documentation of the latest version of MKL contains some mistakes.

User · Answer

Consider each row as a column, and each column as a row: use j,i instead of i,j.

demo: http://ideone.com/lvsxKZ

    #include <iostream>
    using namespace std;

    int main()
    {
        char A[3][3] =
        {
            { 'a', 'b', 'c' },
            { 'd', 'e', 'f' },
            { 'g', 'h', 'i' }
        };

        cout << "A = " << endl << endl;

        // print matrix A
        for (int i=0; i<3; i++)
        {
            for (int j=0; j<3; j++) cout << A[i][j];
            cout << endl;
        }

        cout << endl << "A transpose = " << endl << endl;

        // print A transpose
        for (int i=0; i<3; i++)
        {
            for (int j=0; j<3; j++) cout << A[j][i];
            cout << endl;
        }

        return 0;
    }
