[bash] An efficient way to transpose a file in Bash

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it in an efficient way using only bash commands (I could write a ten or so lines Perl script to do that, but it should be slower to execute than the native bash functions). So the output should look like

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

I thought of a solution like this

cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done

But it's slow and doesn't seem the most efficient solution. I've seen a solution for vi in this post, but it's still over-slow. Any thoughts/suggestions/brilliant ideas? :-)

This question is related to bash parsing unix transpose

The answer is


awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file

output

$ more file
0 1 2
3 4 5
6 7 8
9 10 11

$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11

Performance against Perl solution by Jonathan on a 10000 lines file

$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2

$  wc -l < file
10000

$ time perl test.pl file >/dev/null

real    0m0.480s
user    0m0.442s
sys     0m0.026s

$ time awk -f test.awk file >/dev/null

real    0m0.382s
user    0m0.367s
sys     0m0.011s

$ time perl test.pl file >/dev/null

real    0m0.481s
user    0m0.431s
sys     0m0.022s

$ time awk -f test.awk file >/dev/null

real    0m0.390s
user    0m0.370s
sys     0m0.010s

EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).

Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator which the OP had originally asked for so it'd handle empty fields and it coincidentally pretties-up the output a bit for this particular case.

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

The above solutions will work in any awk (except old, broken awk of course - there YMMV).

The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
    print ""
    if (ARGIND < NF) {
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

which uses almost no memory but reads the input file once per number of fields on a line so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line and it uses GNU awk for ENDFILE and ARGIND but any awk can do the same with tests on FNR==1 and END.


Pure BASH, no additional process. A nice exercise:

declare -a array=( )                      # we build a 1-D-array

read -a line < "$1"                       # read the headline

COLS=${#line[@]}                          # save number of columns

index=0
while read -a line ; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s\t" ${array[$COUNTER]}
  done
  printf "\n" 
done

If you have sc installed, you can do:

psc -r < inputfile | sc -W% - > outputfile

I've used below two scripts to do similar operations before. The first is in awk which is a lot faster than the second which is in "pure" bash. You might be able to adapt it to your own application.

awk '
{
    for (i = 1; i <= NF; i++) {
        s[i] = s[i]?s[i] FS $i:$i
    }
}
END {
    for (i in s) {
        print s[i]
    }
}' file.txt
declare -a arr

while IFS= read -r line
do
    i=0
    for word in $line
    do
        [[ ${arr[$i]} ]] && arr[$i]="${arr[$i]} $word" || arr[$i]=$word
        ((i++))
    done
done < file.txt

for ((i=0; i < ${#arr[@]}; i++))
do
    echo ${arr[i]}
done

rs

rs comes with BSDs and macOS, but it is available from package managers on other platforms. It is named after the "reshape" function in APL.

Use sequences of spaces and tabs as column separator:

rs -T

Use tab as column separator:

rs -c -C -T

Use comma as column separator:

rs -c, -C, -T

-c changes the input column separator and -C changes the output column separator. -c or -C alone sets the separator to tab. -T transposes rows and columns.

Do not use -t instead of -T, because it uses an automatically selected number of columns that is usually incorrect, because the number of columns is selected so that the output rows fill the width of the display (which is 80 characters by default but which can be changed with -w).

One caveat is that when an output column separator is specified using -C, an extra column separator character is added to the end of each row, but you can remove the extra character using something like sed 's/.$//':

$ seq 4|paste -d, - -|rs -c, -C, -T
1,3,
2,4,
$ seq 4|paste -d, - -|rs -c, -C, -T|sed 's/.$//'
1,3
2,4

A second caveat is that this fails with tables where the first line contains one or more empty columns at the end, because the number of columns is determined based on the number of columns on the first row:

$ rs -C, -c, -T<<<$'1,\n3,4'
1,3,4,

Ruby

$ ruby -e'puts readlines.map{|x|x.chomp.split(",",-1)}.transpose.map{|x|x*","}'<<<$'1,\n3,4'
1,3
,4

The -1 argument to split doesn't discard empty fields at the end:

$ ruby -e'p"a,,".split(",")'
["a"]
$ ruby -e'p"a,,".split(",",-1)'
["a", "", ""]

Function form:

$ tp(){ ruby -e'puts STDIN.read.split("\n").map{|x|x.split(ARGV[0],-1)}.transpose.map{|x|x*ARGV[0]}' -- "${1-$'\t'}";}
$ seq 4|paste - -|tp|sed -n l
1\t3$
2\t4$

jq

jq -R .|jq -sr 'map(./"\t")|transpose|map(join("\t"))[]'

jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.

Function form:

tp(){ jq -R .|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]';}

Not very elegant, but this "single-line" command solves the problem quickly:

cols=4; for((i=1;i<=$cols;i++)); do \
            awk '{print $'$i'}' input | tr '\n' ' '; echo; \
        done

Here cols is the number of columns, where you can replace 4 by head -n 1 input | wc -w.


the transpose project on sourceforge is a coreutil-like C program for exactly that.

gcc transpose.c -o transpose
./transpose -t input > output #works with stdin, too.

Some *nix standard util one-liners, no temp files needed. NB: the OP wanted an efficient fix, (i.e. faster), and the top answers are usually faster than this answer. These one-liners are for those who like *nix software tools, for whatever reasons. In rare cases, (e.g. scarce IO & memory), these snippets can actually be faster than some of the top answers.

Call the input file foo.

  1. If we know foo has four columns:

    for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done
    
  2. If we don't know how many columns foo has:

    n=$(head -n 1 foo | wc -w)
    for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done
    

    xargs has a size limit and therefore would make incomplete work with a long file. What size limit is system dependent, e.g.:

    { timeout '.01' xargs --show-limits ; } 2>&1 | grep Max
    

    Maximum length of command we could actually use: 2088944

  3. tr & echo:

    for f in 1 2 3 4; do cut -d ' ' -f $f foo | tr '\n\ ' ' ; echo; done
    

    ...or if the # of columns are unknown:

    n=$(head -n 1 foo | wc -w)
    for f in $(seq 1 $n); do 
        cut -d ' ' -f $f foo | tr '\n' ' ' ; echo
    done
    
  4. Using set, which like xargs, has similar command line size based limitations:

    for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $@ ; done
    

Have a look at GNU datamash which can be used like datamash transpose. A future version will also support cross tabulation (pivot tables)

Here is how you would do it with space separated columns:

datamash transpose -t ' ' < file > transposed_file

Assuming all your rows have the same number of fields, this awk program solves the problem:

{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}

In words, as you loop over the rows, for every field f grow a ':'-separated string col[f] containing the elements of that field. After you are done with all the rows, print each one of those strings in a separate line. You can then substitute ':' for the separator you want (say, a space) by piping the output through tr ':' ' '.

Example:

$ echo "1 2 3\n4 5 6"
1 2 3
4 5 6

$ echo "1 2 3\n4 5 6" | awk '{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}' | tr ':' ' '
 1 4
 2 5
 3 6

Another awk solution and limited input with the size of memory you have.

awk '{ for (i=1; i<=NF; i++) RtoC[i]= (RtoC[i]? RtoC[i] FS $i: $i) }
    END{ for (i in RtoC) print RtoC[i] }' infile

This joins each same filed number positon into together and in END prints the result that would be first row in first column, second row in second column, etc. Will output:

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

I was looking for a solution to transpose any kind of matrix (nxn or mxn) with any kind of data (numbers or data) and got the following solution:

Row2Trans=number1
Col2Trans=number2

for ((i=1; $i <= Line2Trans; i++));do
    for ((j=1; $j <=Col2Trans ; j++));do
        awk -v var1="$i" -v var2="$j" 'BEGIN { FS = "," }  ; NR==var1 {print $((var2)) }' $ARCHIVO >> Column_$i
    done
done

paste -d',' `ls -mv Column_* | sed 's/,//g'` >> $ARCHIVO

If you only want to grab a single (comma delimited) line $N out of a file and turn it into a column:

head -$N file | tail -1 | tr ',' '\n'

I normally use this little awk snippet for this requirement:

  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
        max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++) 
                  printf "%s%s", a[i,j], (j==NR?RS:FS)
              }
        }' file

This just loads all the data into a bidimensional array a[line,column] and then prints it back as a[column,line], so that it transposes the given input.

This needs to keep track of the maximum amount of columns the initial file has, so that it is used as the number of rows to print back.


I was just looking for similar bash tranpose but with support for padding. Here is the script I wrote based on fgm's solution, that seem to work. If it can be of help...

#!/bin/bash 
declare -a array=( )                      # we build a 1-D-array
declare -a ncols=( )                      # we build a 1-D-array containing number of elements of each row

SEPARATOR="\t";
PADDING="";
MAXROWS=0;
index=0
indexCol=0
while read -a line; do
    ncols[$indexCol]=${#line[@]};
((indexCol++))
if [ ${#line[@]} -gt ${MAXROWS} ]
    then
         MAXROWS=${#line[@]}
    fi    
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))

    done
done < "$1"

for (( ROW = 0; ROW < MAXROWS; ROW++ )); do
  COUNTER=$ROW;
  for (( indexCol=0; indexCol < ${#ncols[@]}; indexCol++ )); do
if [ $ROW -ge ${ncols[indexCol]} ]
    then
      printf $PADDING
    else
  printf "%s" ${array[$COUNTER]}
fi
if [ $((indexCol+1)) -lt ${#ncols[@]} ]
then
  printf $SEPARATOR
    fi
    COUNTER=$(( COUNTER + ncols[indexCol] ))
  done
  printf "\n" 
done

There is a purpose built utility for this,

GNU datamash utility

apt install datamash  

datamash transpose < yourfile

Taken from this site, https://www.gnu.org/software/datamash/ and http://www.thelinuxrain.com/articles/transposing-rows-and-columns-3-methods


Another bash variant

$ cat file 
XXXX    col1    col2    col3
row1    0       1       2
row2    3       4       5
row3    6       7       8
row4    9       10      11

Script

#!/bin/bash

I=0
while read line; do
    i=0
    for item in $line; { printf -v A$I[$i] $item; ((i++)); }
    ((I++))
done < file
indexes=$(seq 0 $i)

for i in $indexes; {
    J=0
    while ((J<I)); do
        arr="A$J[$i]"
        printf "${!arr}\t"
        ((J++))
    done
    echo
}

Output

$ ./test 
XXXX    row1    row2    row3    row4    
col1    0       3       6       9   
col2    1       4       7       10  
col3    2       5       8       11

A Python solution:

python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

The above is based on the following:

import sys

for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
    print(' '.join(c))

This code does assume that every line has the same number of columns (no padding is performed).


GNU datamash is perfectly suited for this problem with only one line of code and potentially arbitrarily large filesize!

datamash -W transpose infile > outfile

#!/bin/bash

aline="$(head -n 1 file.txt)"
set -- $aline
colNum=$#

#set -x
while read line; do
  set -- $line
  for i in $(seq $colNum); do
    eval col$i="\"\$col$i \$$i\""
  done
done < file.txt

for i in $(seq $colNum); do
  eval echo \${col$i}
done

another version with set eval


A oneliner using R...

  cat file | Rscript -e "d <- read.table(file('stdin'), sep=' ', row.names=1, header=T); write.table(t(d), file=stdout(), quote=F, col.names=NA) "

The only improvement I can see to your own example is using awk which will reduce the number of processes that are run and the amount of data that is piped between them:

/bin/rm output 2> /dev/null

cols=`head -n 1 input | wc -w` 
for (( i=1; i <= $cols; i++))
do
  awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
done >> output

I used fgm's solution (thanks fgm!), but needed to eliminate the tab characters at the end of each row, so modified the script thus:

#!/bin/bash 
declare -a array=( )                      # we build a 1-D-array

read -a line < "$1"                       # read the headline

COLS=${#line[@]}                          # save number of columns

index=0
while read -a line; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s" ${array[$COUNTER]}
    if [ $COUNTER -lt $(( ${#array[@]} - $COLS )) ]
    then
        printf "\t"
    fi
  done
  printf "\n" 
done

Here's a Haskell solution. When compiled with -O2, it runs slightly faster than ghostdog's awk and slightly slower than Stephan's thinly wrapped c python on my machine for repeated "Hello world" input lines. Unfortunately GHC's support for passing command line code is non-existent as far as I can tell, so you will have to write it to a file yourself. It will truncate the rows to the length of the shortest row.

transpose :: [[a]] -> [[a]]
transpose = foldr (zipWith (:)) (repeat [])

main :: IO ()
main = interact $ unlines . map unwords . transpose . map words . lines

A hackish perl solution can be like this. It's nice because it doesn't load all the file in memory, prints intermediate temp files, and then uses the all-wonderful paste

#!/usr/bin/perl
use warnings;
use strict;

my $counter;
open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
while (my $line = <INPUT>) {
    chomp $line;
    my @array = split ("\t",$line);
    open OUTPUT, ">temp$." or die ("unable to open output file!");
    print OUTPUT join ("\n",@array);
    close OUTPUT;
    $counter=$.;
}
close INPUT;

# paste files together
my $execute = "paste ";
foreach (1..$counter) {
    $execute.="temp$counter ";
}
$execute.="> $ARGV[1]";
system $execute;

Here is a Bash one-liner that is based on simply converting each line to a column and paste-ing them together:

echo '' > tmp1;  \
cat m.txt | while read l ; \
            do    paste tmp1 <(echo $l | tr -s ' ' \\n) > tmp2; \
                  cp tmp2 tmp1; \
            done; \
cat tmp1

m.txt:

0 1 2
4 5 6
7 8 9
10 11 12
  1. creates tmp1 file so it's not empty.

  2. reads each line and transforms it into a column using tr

  3. pastes the new column to the tmp1 file

  4. copies result back into tmp1.

PS: I really wanted to use io-descriptors but couldn't get them to work.


Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.

#!/bin/perl -w
#
# SO 1729824

use strict;

my(%data);          # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
    my(@row) = split /\s+/;
    my($colnum) = 0;
    foreach my $val (@row)
    {
        $data{$rownum}{$colnum++} = $val;
    }
    $rownum++;
    $maxcol = $colnum if $colnum > $maxcol;
}

my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
    for (my $row = 0; $row < $maxrow; $row++)
    {
        printf "%s%s", ($row == 0) ? "" : "\t",
                defined $data{$row}{$col} ? $data{$row}{$col} : "";
    }
    print "\n";
}

With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.


Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:

Osiris JL: time gawk -f tr.awk xxx  > /dev/null

real    0m0.367s
user    0m0.279s
sys 0m0.085s
Osiris JL: time perl -f transpose.pl xxx > /dev/null

real    0m0.138s
user    0m0.128s
sys 0m0.008s
Osiris JL: time awk -f tr.awk xxx  > /dev/null

real    0m1.891s
user    0m0.924s
sys 0m0.961s
Osiris-2 JL: 

Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.


Simple 4 line answer, keep it readable.

col="$(head -1 file.txt | wc -w)"
for i in $(seq 1 $col); do
    awk '{ print $'$i' }' file.txt | paste -s -d "\t"
done

An awk solution that store the whole array in memory

    awk '$0!~/^$/{    i++;
                  split($0,arr,FS);
                  for (j in arr) {
                      out[i,j]=arr[j];
                      if (maxr<j){ maxr=j}     # max number of output rows.
                  }
            }
    END {
        maxc=i                 # max number of output columns.
        for     (j=1; j<=maxr; j++) {
            for (i=1; i<=maxc; i++) {
                printf( "%s:", out[i,j])
            }
            printf( "%s\n","" )
        }
    }' infile

But we may "walk" the file as many times as output rows are needed:

#!/bin/bash
maxf="$(awk '{if (mf<NF); mf=NF}; END{print mf}' infile)"
rowcount=maxf
for (( i=1; i<=rowcount; i++ )); do
    awk -v i="$i" -F " " '{printf("%s\t ", $i)}' infile
    echo
done

Which (for a low count of output rows is faster than the previous code).


Examples related to bash

Comparing a variable with a string python not working when redirecting from bash script Zipping a file in bash fails How do I prevent Conda from activating the base environment by default? Get first line of a shell command's output Fixing a systemd service 203/EXEC failure (no such file or directory) /bin/sh: apt-get: not found VSCode Change Default Terminal Run bash command on jenkins pipeline How to check if the docker engine and a docker container are running? How to switch Python versions in Terminal?

Examples related to parsing

Got a NumberFormatException while trying to parse a text file for objects Uncaught SyntaxError: Unexpected end of JSON input at JSON.parse (<anonymous>) Python/Json:Expecting property name enclosed in double quotes Correctly Parsing JSON in Swift 3 How to get response as String using retrofit without using GSON or any other library in android UIButton action in table view cell "Expected BEGIN_OBJECT but was STRING at line 1 column 1" How to convert an XML file to nice pandas dataframe? How to extract multiple JSON objects from one file? How to sum digits of an integer in java?

Examples related to unix

Docker CE on RHEL - Requires: container-selinux >= 2.9 What does `set -x` do? How to find files modified in last x minutes (find -mmin does not work as expected) sudo: npm: command not found How to sort a file in-place How to read a .properties file which contains keys that have a period character using Shell script gpg decryption fails with no secret key error Loop through a comma-separated shell variable Best way to find os name and version in Unix/Linux platform Resource u'tokenizers/punkt/english.pickle' not found

Examples related to transpose

Converting rows into columns and columns into rows using R Postgres - Transpose Rows to Columns Transposing a 2D-array in JavaScript What is the fastest way to transpose a matrix in C++? Excel VBA - Range.Copy transpose paste Transpose list of lists Transposing a 1D NumPy array An efficient way to transpose a file in Bash Transpose/Unzip Function (inverse of zip)?