[regex] Replace all whitespace with a line break/paragraph mark to make a word list

I am trying to make a vocab list for a Greek text we are translating in class. I want to replace every space or tab character with a paragraph mark so that every word appears on its own line. Can anyone give me the sed command, and explain what it is that I'm doing? I’m still trying to figure sed out.

This question is related to regex sed

The answer is


This should do the work:

sed -E -e 's/[ \t]+/\n/g'

[ \t] means a space OR a tab. If you want any kind of whitespace, you could also use \s.

[ \t]+ means one or more spaces OR tabs. The + is only a quantifier in extended regular expressions, which is why the -E flag is needed; without it, write \+ instead (GNU sed).

s/x/y/ means replace the pattern x with y (here \n is a newline).

The g at the end means the replacement is repeated for every match on each line, not just the first.
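
To see those pieces working together, here is a minimal sketch. It assumes GNU sed (where \t and \n are understood and -E enables + as a quantifier); the function name and the sample text are made up for illustration.

```shell
# Assumes GNU sed; split_words is a hypothetical helper name.
split_words() {
    # -E: extended regex, so '+' means "one or more"
    # g:  replace every run of spaces/tabs on the line, not just the first
    sed -E 's/[ \t]+/\n/g'
}

# Sample input with a single space, a tab, and a run of spaces.
printf 'alpha beta\tgamma   delta\n' | split_words
```

Each of the four words should come out on its own line, with the run of three spaces collapsed into a single line break.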


  1. Option 1

    printf '%s\n' $(cat testfile)
    
  2. Option 2

    tr -s ' \t' '\n\n' < testfile
    

You can also do it with xargs:

cat old | xargs -n1 > new

or

xargs -n1 < old > new
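
What happens here: with no command given, xargs runs echo, and -n1 passes it one word at a time. A small sketch with made-up sample text (the function name is hypothetical):

```shell
# one_word_per_line is a hypothetical helper name.
one_word_per_line() {
    # no command given, so xargs runs echo once per input word
    xargs -n1
}

# Spaces, tabs and newlines all count as separators.
printf 'alpha beta\tgamma\ndelta\n' | one_word_per_line
```

One caveat: xargs treats quotes and backslashes in its input specially, so text containing apostrophes may need one of the other approaches.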

All of the examples listed above for sed break on one platform or another. None of them work with the version of sed shipped on Macs.

However, Perl's regex works the same on any machine with Perl installed:

perl -pe 's/\s+/\n/g' file.txt

If you want to save the output:

perl -pe 's/\s+/\n/g' file.txt > newfile.txt

If you want only unique occurrences of words:

perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt
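
Blank lines or leading whitespace in the input leave empty lines in the output, and sort -u will keep one of them. A possible variation that drops them is sketched below; the function name, the grep step, the LC_ALL=C setting (byte-order sorting, for predictable results), and the sample text are all assumptions added for illustration.

```shell
# unique_words is a hypothetical helper; the sample file contents are made up.
unique_words() {
    # split on whitespace, drop empty lines, then sort and de-duplicate
    perl -pe 's/\s+/\n/g' "$1" | grep -v '^$' | LC_ALL=C sort -u
}

sample=$(mktemp)
# Includes a blank line and leading spaces on purpose.
printf 'logos kai theos\n\n  kai logos\n' > "$sample"
unique_words "$sample"
rm -f "$sample"
```

For real Greek text you may prefer to leave the locale alone so that sort uses your system's collation order.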

You could use POSIX [[:blank:]] to match a horizontal white-space character.

sed 's/[[:blank:]]\+/\n/g' file

You can also use [[:space:]] instead of [[:blank:]]; it additionally matches vertical whitespace such as newlines.

Example:

$ echo 'this  is a sentence' | sed 's/[[:blank:]]\+/\n/g'
this
is
a
sentence

The portable way to do this is:

sed -e 's/[[:blank:]][[:blank:]]*/\
/g'

That's an actual newline between the backslash and the /g. Many sed implementations don't know about \n in the replacement, so you need a literal newline. The backslash before the newline keeps sed from treating it as the end of the command (in sed scripts, commands are normally terminated by newlines). The POSIX class [[:blank:]] is used because non-GNU seds do not reliably treat \t inside brackets as a tab.
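
Here is a sketch of the literal-newline form wrapped in a shell function so it can be tried on sample input. It uses the POSIX class [[:blank:]] for "space or tab", since some sed implementations do not interpret \t inside brackets; the function name and sample text are made up.

```shell
# split_words_portable is a hypothetical helper name.
split_words_portable() {
    # The pattern x x* stands in for x+ in basic regular expressions.
    # The first line of the script ends in a backslash, which escapes
    # the literal newline used as the replacement text.
    sed -e 's/[[:blank:]][[:blank:]]*/\
/g'
}

printf 'alpha beta\tgamma\n' | split_words_portable
```

Note that the second line of the sed script must start at the left margin; any indentation there would become part of the replacement.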

With GNU sed you can use \n in the substitution, and \s in the regex:

sed -e 's/\s\s*/\n/g'

GNU sed also supports "extended" regular expressions (that's egrep style, not perl-style) if you give it the -r flag, so then you can use +:

sed -r -e 's/\s+/\n/g'

If this is for Linux only, you can probably go with the GNU command, but if you want this to work on systems with a non-GNU sed (e.g. BSD or Mac OS X), you might want to go with the more portable option.


Using gawk:

gawk '{$1=$1}1' OFS="\n" file
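
How this works: awk splits each line into fields on runs of spaces and tabs, assigning to a field ($1=$1) makes awk rebuild the line with the output field separator OFS between fields, and the trailing 1 is an always-true pattern that prints the result. A more spelled-out sketch of the same idea, which should also run under other awk implementations (the function name and sample text are made up):

```shell
# awk_split_words is a hypothetical helper name.
awk_split_words() {
    # $1 = $1 forces the record to be rebuilt using OFS (a newline here);
    # print then writes the rebuilt record.
    awk -v OFS='\n' '{ $1 = $1; print }'
}

# Leading whitespace is ignored by awk's default field splitting.
printf '  alpha  beta\tgamma\n' | awk_split_words
```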