An efficient way to transpose a file in Bash

Question

I have a huge tab-separated file formatted like this:

    X       column1 column2 column3
    row1    0       1       2
    row2    3       4       5
    row3    6       7       8
    row4    9       10      11

I would like to transpose it in an efficient way using only bash commands (I could write a ten-or-so-line Perl script to do that, but it should be slower to execute than the native bash functions). So the output should look like:

    X       row1    row2    row3    row4
    column1 0       3       6       9
    column2 1       4       7       10
    column3 2       5       8       11

I thought of a solution like this:

    cols=`head -n 1 input | wc -w`
    for (( i=1; i <= $cols; i++))
    do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
    done

But it's slow and doesn't seem the most efficient solution. I've seen a solution for vi in this post, but it's still too slow. Any thoughts, suggestions, or brilliant ideas? :-)

User · Answer

There is a purpose-built utility for this: GNU datamash.

apt install datamash  

datamash transpose < yourfile

Taken from this site, https://www.gnu.org/software/datamash/ and http://www.thelinuxrain.com/articles/transposing-rows-and-columns-3-methods

User · Answer

A Python solution:

    python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

The above is based on the following:

    import sys

    for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
        print(' '.join(c))

This code assumes that every line has the same number of columns (no padding is performed).
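Several answers in this thread share that equal-column-count assumption, so it can be worth checking the input first. The following is a minimal sketch, not part of the original answer: the function name `is_rectangular` is hypothetical, and it splits on runs of whitespace and skips blank lines to mirror the Python code above.

```shell
# Hypothetical helper: succeeds (exit status 0) only if every non-blank
# line has the same number of whitespace-separated fields.
is_rectangular() {
    awk '
        NF == 0 { next }                    # ignore blank lines, like l.strip()
        !seen   { n = NF; seen = 1 }        # remember the first field count
        NF != n { bad = 1; exit }           # mismatch: stop reading early
        END     { exit bad }                # 0 if consistent, 1 otherwise
    ' "$@"
}
```

Usage might look like `is_rectangular input && python -c '...' < input > output`, so the transpose only runs when no padding would be needed.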

User · Answer

The transpose project on SourceForge is a coreutil-like C program for exactly that.

    gcc transpose.c -o transpose
    ./transpose -t input > output   # works with stdin, too

User · Answer

rs

rs comes with BSDs and macOS, but it is available from package managers on other platforms. It is named after the "reshape" function in APL.

Use sequences of spaces and tabs as column separator:

    rs -T

Use tab as column separator:

    rs -c -C -T

Use comma as column separator:

    rs -c, -C, -T

-c changes the input column separator and -C changes the output column separator. -c or -C alone sets the separator to tab. -T transposes rows and columns.

Do not use -t instead of -T, because it uses an automatically selected number of columns that is usually incorrect: the number of columns is chosen so that the output rows fill the width of the display (which is 80 characters by default but can be changed with -w).

One caveat is that when an output column separator is specified using -C, an extra column separator character is added to the end of each row, but you can remove the extra character using something like sed 's/.$//':

    $ seq 4|paste -d, - -|rs -c, -C, -T
    1,3,
    2,4,
    $ seq 4|paste -d, - -|rs -c, -C, -T|sed 's/.$//'
    1,3
    2,4

A second caveat is that this fails with tables where the first line contains one or more empty columns at the end, because the number of columns is determined from the first row:

    $ rs -C, -c, -T<<<$'1,\n3,4'
    1,3,4,

Ruby

    $ ruby -e'puts readlines.map{|x|x.chomp.split(",",-1)}.transpose.map{|x|x*","}'<<<$'1,\n3,4'
    1,3
    ,4

The -1 argument to split keeps empty fields at the end instead of discarding them:

    $ ruby -e'p"a,,".split(",")'
    ["a"]
    $ ruby -e'p"a,,".split(",",-1)'
    ["a", "", ""]

Function form:

    $ tp(){ ruby -e'puts STDIN.read.split("\n").map{|x|x.split(ARGV[0],-1)}.transpose.map{|x|x*ARGV[0]}' -- "${1-$'\t'}";}
    $ seq 4|paste - -|tp|sed -n l
    1\t3$
    2\t4$

jq

    jq -R .|jq -sr 'map(./"\t")|transpose|map(join("\t"))[]'

jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.

Function form:

    tp(){ jq -R .|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]';}

User · Answer

Not very elegant, but this "single-line" command solves the problem quickly:

    cols=4; for ((i=1; i<=$cols; i++)); do \
                awk '{print $'$i'}' input | tr '\n' ' '; echo; \
            done

Here cols is the number of columns; you can replace 4 with $(head -n 1 input | wc -w).

User · Answer

I was looking for a similar bash transpose, but with support for padding. Here is the script I wrote based on fgm's solution, which seems to work, in case it helps:

    #!/bin/bash

    declare -a array=( )                      # we build a 1-D-array
    declare -a ncols=( )                      # we build a 1-D-array containing number of elements of each row

    SEPARATOR="\t"
    PADDING=""
    MAXROWS=0
    index=0
    indexCol=0
    while read -a line; do
        ncols[$indexCol]=${#line[@]}
        ((indexCol++))
        if [ ${#line[@]} -gt ${MAXROWS} ]
        then
            MAXROWS=${#line[@]}
        fi
        for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
            array[$index]=${line[$COUNTER]}
            ((index++))
        done
    done < "$1"

    for (( ROW = 0; ROW < MAXROWS; ROW++ )); do
        COUNTER=$ROW
        for (( indexCol=0; indexCol < ${#ncols[@]}; indexCol++ )); do
            if [ $ROW -ge ${ncols[indexCol]} ]
            then
                printf "%s" "$PADDING"        # quoted so an empty padding is not an error
            else
                printf "%s" "${array[$COUNTER]}"
            fi
            if [ $((indexCol+1)) -lt ${#ncols[@]} ]
            then
                printf "$SEPARATOR"
            fi
            COUNTER=$(( COUNTER + ncols[indexCol] ))
        done
        printf "\n"
    done

User · Answer

An awk solution that stores the whole array in memory:

    awk '$0 !~ /^$/ {
             i++
             n = split($0, arr, FS)
             for (j = 1; j <= n; j++)
                 out[i, j] = arr[j]
             if (maxr < n) maxr = n         # max number of output rows
         }
         END {
             maxc = i                       # max number of output columns
             for (j = 1; j <= maxr; j++) {
                 for (i = 1; i <= maxc; i++)
                     printf("%s%s", out[i, j], (i < maxc ? "\t" : "\n"))   # tab-separated output
             }
         }' infile

But we may "walk" the file as many times as output rows are needed:

    #!/bin/bash
    maxf="$(awk '{ if (mf < NF) mf = NF } END { print mf }' infile)"
    rowcount=$maxf
    for (( i = 1; i <= rowcount; i++ )); do
        awk -v i="$i" -F " " '{ printf("%s\t ", $i) }' infile
        echo
    done

For a low count of output rows, this is faster than the previous code.

User · Answer

Simple 4-line answer, keep it readable:

    col="$(head -1 file.txt | wc -w)"
    for i in $(seq 1 $col); do
        awk '{ print $'$i' }' file.txt | paste -s -d "\t"
    done

User · Answer

Here's a Haskell solution. When compiled with -O2, it runs slightly faster than ghostdog's awk and slightly slower than Stephan's thinly wrapped C Python on my machine for repeated "Hello world" input lines. Unfortunately, GHC's support for passing command-line code is non-existent as far as I can tell, so you will have to write it to a file yourself. It will truncate the rows to the length of the shortest row.

    transpose :: [[a]] -> [[a]]
    transpose = foldr (zipWith (:)) (repeat [])

    main :: IO ()
    main = interact $ unlines . map unwords . transpose . map words . lines

User · Answer

I used fgm's solution (thanks, fgm!), but needed to eliminate the tab characters at the end of each row, so I modified the script thus:

    #!/bin/bash

    declare -a array=( )                      # we build a 1-D-array

    read -a line < "$1"                       # read the headline

    COLS=${#line[@]}                          # save number of columns

    index=0
    while read -a line; do
        for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
            array[$index]=${line[$COUNTER]}
            ((index++))
        done
    done < "$1"

    for (( ROW = 0; ROW < COLS; ROW++ )); do
        for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
            printf "%s" "${array[$COUNTER]}"
            if [ $COUNTER -lt $(( ${#array[@]} - COLS )) ]
            then
                printf "\t"
            fi
        done
        printf "\n"
    done

User · Answer

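Another version, using set and eval:

    #!/bin/bash

    aline="$(head -n 1 file.txt)"
    set -- $aline
    colNum=$#

    #set -x
    while read line; do
        set -- $line
        for i in $(seq $colNum); do
            # ${N} (with braces) is needed so positional parameters past 9 work
            eval col$i="\"\$col$i \${$i}\""
        done
    done < file.txt

    for i in $(seq $colNum); do
        eval echo \${col$i}
    done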

User · Answer

GNU datamash is perfectly suited for this problem: it takes only one line of code and handles potentially arbitrarily large files. datamash reads its input from standard input:

    datamash -W transpose < infile > outfile

User · Answer

Another awk solution, limited by the amount of memory you have:

    awk '{ for (i = 1; i <= NF; i++) RtoC[i] = (i in RtoC ? RtoC[i] FS $i : $i) }
         END { for (i = 1; i in RtoC; i++) print RtoC[i] }' infile

This joins the fields that share the same field number into one string, and in END prints the results, so that the first input row becomes the first column, the second row becomes the second column, and so on. The END loop counts upward from 1 because the order of for (i in RtoC) is not guaranteed.

Will output:

    X row1 row2 row3 row4
    column1 0 3 6 9
    column2 1 4 7 10
    column3 2 5 8 11

User · Answer

I was looking for a solution to transpose any kind of matrix (n x n or m x n) with any kind of data (numbers or text) and came up with the following. Note that the last line appends the transposed result to the input file $ARCHIVO itself.

    Row2Trans=number1
    Col2Trans=number2

    for ((i=1; i <= Row2Trans; i++)); do
        for ((j=1; j <= Col2Trans; j++)); do
            awk -v var1="$i" -v var2="$j" 'BEGIN { FS = "," } ; NR==var1 { print $((var2)) }' $ARCHIVO >> Column_$i
        done
    done

    paste -d',' `ls -mv Column_* | sed 's/,//g'` >> $ARCHIVO

User · Answer

I've used the two scripts below to do similar operations before. The first is in awk, which is a lot faster than the second, which is in "pure" bash. You might be able to adapt them to your own application.

    awk '
    {
        for (i = 1; i <= NF; i++) {
            s[i] = (i in s ? s[i] FS $i : $i)
        }
    }
    END {
        for (i = 1; i in s; i++) {    # numeric order; "for (i in s)" order is unspecified
            print s[i]
        }
    }' file.txt

    declare -a arr

    while IFS= read -r line
    do
        i=0
        for word in $line
        do
            [[ ${arr[$i]} ]] && arr[$i]="${arr[$i]} $word" || arr[$i]=$word
            ((i++))
        done
    done < file.txt

    for ((i=0; i < ${#arr[@]}; i++))
    do
        echo ${arr[i]}
    done

User · Answer

Another bash variant:

    $ cat file
    XXXX    col1    col2    col3
    row1    0       1       2
    row2    3       4       5
    row3    6       7       8
    row4    9       10      11

Script:

    #!/bin/bash

    I=0
    while read line; do
        i=0
        # the underscore keeps row 1/col 11 distinct from row 11/col 1
        for item in $line; { printf -v "A${I}_$i" '%s' "$item"; ((i++)); }
        ((I++))
    done < file
    indexes=$(seq 0 $((i-1)))

    for i in $indexes; {
        J=0
        while ((J<I)); do
            arr="A${J}_$i"
            printf '%s\t' "${!arr}"
            ((J++))
        done
        echo
    }

Output:

    $ ./test
    XXXX    row1    row2    row3    row4
    col1    0       3       6       9
    col2    1       4       7       10
    col3    2       5       8       11

User · Answer

Some *nix standard util one-liners, no temp files needed. NB: the OP wanted an efficient fix (i.e. faster), and the top answers are usually faster than this answer. These one-liners are for those who like *nix software tools, for whatever reasons. In rare cases (e.g. scarce IO & memory), these snippets can actually be faster than some of the top answers.

Call the input file foo.

If we know foo has four columns:

    for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done

If we don't know how many columns foo has:

    n=$(head -n 1 foo | wc -w)
    for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done

xargs has a size limit and would therefore produce incomplete output with a long file. The size limit is system dependent, e.g.:

    { timeout '.01' xargs --show-limits ; } 2>&1 | grep Max
    Maximum length of command we could actually use: 2088944

tr & echo:

    for f in 1 2 3 4; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo; done

...or if the number of columns is unknown:

    n=$(head -n 1 foo | wc -w)
    for f in $(seq 1 $n); do
        cut -d ' ' -f $f foo | tr '\n' ' ' ; echo
    done

Using set, which, like xargs, has command-line-size-based limitations:

    for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $@ ; done

User · Answer

    awk '
    {
        for (i=1; i<=NF; i++)  {
            a[NR,i] = $i
        }
    }
    NF>p { p = NF }
    END {
        for(j=1; j<=p; j++) {
            str=a[1,j]
            for(i=2; i<=NR; i++){
                str=str" "a[i,j];
            }
            print str
        }
    }' file

Output:

    $ more file
    0 1 2
    3 4 5
    6 7 8
    9 10 11

    $ ./shell.sh
    0 3 6 9
    1 4 7 10
    2 5 8 11

Performance against the Perl solution by Jonathan on a 10000-line file:

    $ head -5 file
    1 0 1 2
    2 3 4 5
    3 6 7 8
    4 9 10 11
    1 0 1 2

    $ wc -l < file
    10000

    $ time perl test.pl file >/dev/null

    real    0m0.480s
    user    0m0.442s
    sys     0m0.026s

    $ time awk -f test.awk file >/dev/null

    real    0m0.382s
    user    0m0.367s
    sys     0m0.011s

    $ time perl test.pl file >/dev/null

    real    0m0.481s
    user    0m0.431s
    sys     0m0.022s

    $ time awk -f test.awk file >/dev/null

    real    0m0.390s
    user    0m0.370s
    sys     0m0.010s

EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).

Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator, which the OP had originally asked for, so it'd handle empty fields, and it coincidentally pretties-up the output a bit for this particular case.

    $ cat tst.awk
    BEGIN { FS=OFS="\t" }
    {
        for (rowNr=1;rowNr<=NF;rowNr++) {
            cell[rowNr,NR] = $rowNr
        }
        maxRows = (NF > maxRows ? NF : maxRows)
        maxCols = NR
    }
    END {
        for (rowNr=1;rowNr<=maxRows;rowNr++) {
            for (colNr=1;colNr<=maxCols;colNr++) {
                printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
            }
        }
    }

    $ awk -f tst.awk file
    X       row1    row2    row3    row4
    column1 0       3       6       9
    column2 1       4       7       10
    column3 2       5       8       11

The above solutions will work in any awk (except old, broken awk of course - there YMMV).

The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:

    $ cat tst.awk
    BEGIN { FS=OFS="\t" }
    { printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
    ENDFILE {
        print ""
        if (ARGIND < NF) {
            ARGV[ARGC] = FILENAME
            ARGC++
        }
    }

    $ awk -f tst.awk file
    X       row1    row2    row3    row4
    column1 0       3       6       9
    column2 1       4       7       10
    column3 2       5       8       11

which uses almost no memory but reads the input file once per field on a line, so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line, and it uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END.
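The closing remark, that the low-memory approach can be done in any awk with tests on FNR==1 and END, could be sketched as follows. This is an illustration rather than Ed Morton's code: the function name `transpose_multipass` and the `pass` counter (standing in for ARGIND) are hypothetical, and it assumes the awk in use re-reads ARGC as it advances through its file arguments, as well as a tab-separated file with the same number of fields on every line.

```shell
# Hypothetical wrapper: transpose a tab-separated file by re-reading it once
# per input column, keeping only one field of one record in memory at a time.
transpose_multipass() {
    awk '
        BEGIN { FS = OFS = "\t" }
        FNR == 1 {
            if (NR > 1) print ""            # finish the previous output row
            pass++                          # which input column this pass emits
            if (pass < NF) {                # more columns left: queue another pass
                ARGV[ARGC] = FILENAME
                ARGC++
            }
        }
        { printf "%s%s", (FNR > 1 ? OFS : ""), $pass }
        END { print "" }                    # finish the last output row
    ' "$1"
}
```

It needs a real file name rather than standard input (for example `transpose_multipass file > transposed`), because the same file is opened again for each pass.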

User · Answer

I normally use this little awk snippet for this requirement:

    awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
          max=(max<NF?NF:max)}
          END {for (i=1; i<=max; i++)
                {for (j=1; j<=NR; j++)
                    printf "%s%s", a[i,j], (j==NR?RS:FS)
                }
          }' file

This loads all the data into a two-dimensional array a[line,column] and then prints it back as a[column,line], which transposes the input.

It needs to keep track of the maximum number of columns in the input file, because that is the number of rows to print back.
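As an illustration of why the maximum column count matters, here is a small sketch of the same two-dimensional-array idea applied to rows of unequal length. It is not the answer's code: the function name `transpose_padded` is hypothetical, and it assumes tab-separated input, padding missing cells with empty strings.

```shell
# Hypothetical helper: transpose tab-separated input, padding short rows
# with empty cells so every output row has one entry per input row.
transpose_padded() {
    awk '
        BEGIN { FS = OFS = "\t" }
        {
            for (c = 1; c <= NF; c++) cell[NR, c] = $c
            if (NF > maxcols) maxcols = NF  # widest input row = number of output rows
        }
        END {
            for (c = 1; c <= maxcols; c++)
                for (r = 1; r <= NR; r++)   # unset cells print as empty strings
                    printf "%s%s", cell[r, c], (r < NR ? OFS : ORS)
        }
    ' "$@"
}
```

It reads standard input or file arguments, for example `transpose_padded < file`.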

User · Answer

Assuming all your rows have the same number of fields, this awk program solves the problem:

    {for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}

In words: as you loop over the rows, for every field f grow a ':'-separated string col[f] containing the elements of that field. After you are done with all the rows, print each one of those strings on a separate line. You can then replace ':' with the separator you want (say, a space) by piping the output through tr ':' ' '.

Example:

    $ echo "1 2 3\n4 5 6"
    1 2 3
    4 5 6

    $ echo "1 2 3\n4 5 6" | awk '{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}' | tr ':' ' '
     1 4
     2 5
     3 6
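A variation on the same string-growing idea, sketched here under the same equal-field-count assumption, joins the fields with awk's OFS directly, which avoids both the leading separator and the extra tr step. The function name `transpose_join` is hypothetical.

```shell
# Hypothetical helper: build one output line per input column by appending
# each record's field to a growing string, separated by OFS (a space here).
transpose_join() {
    awk '
        {
            for (f = 1; f <= NF; f++)
                col[f] = (NR == 1 ? $f : col[f] OFS $f)
            if (NF > n) n = NF              # remember the column count for END
        }
        END { for (f = 1; f <= n; f++) print col[f] }
    ' "$@"
}
```

For example, `printf '1 2 3\n4 5 6\n' | transpose_join` prints the lines `1 4`, `2 5` and `3 6`.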

User · Answer

Here is a Bash one-liner that is based on simply converting each line to a column and paste-ing them together:

    echo '' > tmp1;  \
    cat m.txt | while read l ; \
                do    paste tmp1 <(echo $l | tr -s ' ' \\n) > tmp2; \
                      cp tmp2 tmp1; \
                done; \
    cat tmp1

m.txt:

    0 1 2
    4 5 6
    7 8 9
    10 11 12

1. creates the tmp1 file so it's not empty
2. reads each line and transforms it into a column using tr
3. pastes the new column onto the tmp1 file
4. copies the result back into tmp1

PS: I really wanted to use io-descriptors but couldn't get them to work.
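The steps above could also be wrapped in a function that cleans up after itself. This is a sketch under assumptions that are not in the original answer: the name `transpose_paste` is hypothetical, temporary files come from mktemp, the input is tab-separated with equal-length rows, and the first row seeds the accumulator so no empty leading column appears.

```shell
# Hypothetical helper: turn each input line into a column with tr, then
# paste it onto the columns accumulated so far.
transpose_paste() {
    acc=$(mktemp) col=$(mktemp) tmp=$(mktemp)
    first=1
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line" | tr '\t' '\n' > "$col"   # one row becomes one column
        if [ "$first" -eq 1 ]; then
            cp "$col" "$acc"                            # first row seeds the result
            first=0
        else
            paste "$acc" "$col" > "$tmp" && mv "$tmp" "$acc"
        fi
    done
    cat "$acc"
    rm -f "$acc" "$col" "$tmp"
}
```

It reads standard input, for example `transpose_paste < m.txt`. Like the original, it rewrites the accumulated result once per input line, so it is better suited to short files.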

User · Answer

A one-liner using R:

    cat file | Rscript -e "d <- read.table(file('stdin'), sep=' ', row.names=1, header=T); write.table(t(d), file=stdout(), quote=F, col.names=NA)"

User · Answer

A hackish Perl solution can look like this. It's nice because it doesn't load the whole file into memory: it writes intermediate temp files and then uses the all-wonderful paste.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $counter;
    open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
    while (my $line = <INPUT>) {
        chomp $line;
        my @array = split ("\t", $line);
        open OUTPUT, ">temp$." or die ("unable to open output file!");
        print OUTPUT join ("\n", @array);
        close OUTPUT;
        $counter = $.;
    }
    close INPUT;

    # paste files together
    my $execute = "paste ";
    foreach (1..$counter) {
        $execute .= "temp$_ ";      # temp1, temp2, ... in input-line order
    }
    $execute .= "> $ARGV[1]";
    system $execute;

User · Answer

Have a look at GNU datamash, which can be used like datamash transpose. A future version will also support cross tabulation (pivot tables).

Here is how you would do it with space-separated columns:

    datamash transpose -t ' ' < file > transposed_file

User · Answer

If you have sc installed, you can do:

    psc -r < inputfile | sc -W% - > outputfile

User · Answer

If you only want to grab a single (comma-delimited) line $N out of a file and turn it into a column:

    head -$N file | tail -1 | tr ',' '\n'
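The same extraction can be done in one awk process that stops reading as soon as the requested line has been printed. This is only a sketch; the function name `line_to_column` and its argument order are hypothetical.

```shell
# Hypothetical helper: print the comma-separated fields of line N,
# one field per output line, then stop reading the input.
line_to_column() {
    n=$1
    shift
    awk -F',' -v n="$n" '
        NR == n {
            for (i = 1; i <= NF; i++) print $i
            exit                            # no need to read the rest of the file
        }
    ' "$@"
}
```

Usage might look like `line_to_column 3 file`, or `line_to_column 3 < file` when reading from standard input.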

User · Answer

Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.

    #!/bin/perl -w
    #
    # SO 1729824

    use strict;

    my(%data);          # main storage
    my($maxcol) = 0;
    my($rownum) = 0;
    while (<>)
    {
        my(@row) = split /\s+/;
        my($colnum) = 0;
        foreach my $val (@row)
        {
            $data{$rownum}{$colnum++} = $val;
        }
        $rownum++;
        $maxcol = $colnum if $colnum > $maxcol;
    }

    my $maxrow = $rownum;
    for (my $col = 0; $col < $maxcol; $col++)
    {
        for (my $row = 0; $row < $maxrow; $row++)
        {
            printf "%s%s", ($row == 0) ? "" : "\t",
                    defined $data{$row}{$col} ? $data{$row}{$col} : "";
        }
        print "\n";
    }

With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk: 0.026s vs 0.042s. Neither is likely to be a problem.

Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:

    Osiris JL: time gawk -f tr.awk xxx  > /dev/null

    real    0m0.367s
    user    0m0.279s
    sys     0m0.085s
    Osiris JL: time perl -f transpose.pl xxx > /dev/null

    real    0m0.138s
    user    0m0.128s
    sys     0m0.008s
    Osiris JL: time awk -f tr.awk xxx  > /dev/null

    real    0m1.891s
    user    0m0.924s
    sys     0m0.961s
    Osiris-2 JL:

Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.

User · Answer

Pure BASH, no additional process. A nice exercise:

    declare -a array=( )                      # we build a 1-D-array

    read -a line < "$1"                       # read the headline

    COLS=${#line[@]}                          # save number of columns

    index=0
    while read -a line ; do
        for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
            array[$index]=${line[$COUNTER]}
            ((index++))
        done
    done < "$1"

    for (( ROW = 0; ROW < COLS; ROW++ )); do
        for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
            printf "%s\t" ${array[$COUNTER]}
        done
        printf "\n"
    done

User · Answer

The only improvement I can see to your own example is using awk, which will reduce the number of processes that are run and the amount of data that is piped between them:

    /bin/rm output 2> /dev/null

    cols=`head -n 1 input | wc -w`
    for (( i=1; i <= $cols; i++))
    do
        awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
    done >> output
