[regex] Replace all whitespace with a line break/paragraph mark to make a word list

For reasonably modern versions of sed, edit the standard input to yield the standard output with

$ echo 't???? ß?ß??? ?? ??p??' | sed -E -e 's/[[:blank:]]+/\n/g'
t????
ß?ß???
??
??p??

If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with

sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab

What it means:

  • The character class [[:blank:]] matches either a single space character or a single tab character.
    • Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
    • The + quantifier means match one or more of the previous pattern.
    • So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
  • The \n in the replacement is the newline that you want.
  • The /g modifier on the end means perform the substitution as many times as possible rather than just once.
  • The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)

Perl Compatible Regexes

For those familiar with Perl-compatible regexes and a PCRE-capable sed, use \s+ to match runs of at least one whitespace character, as in

sed -E -e 's/\s+/\n/g' old > new

or

sed -e 's/\s\+/\n/g' old > new

These commands read input from the file old and write the result to a file named new in the current directory.

Maximum portability, maximum cruftiness

Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.

$ echo 't???? ß?ß??? ?? ??p??' | sed -e 's/[ \t][ \t]*/\
/g'
t????
ß?ß???
??
??p??

Notes:

  • Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*).
  • Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.
    • The \ and the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
      • Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
    • This error prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.

Note on backslashes and quoting

The commands above all used single quotes ('') rather than double quotes (""). Consider:

$ echo '\\\\' "\\\\"
\\\\ \\

That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.