[regex] RE error: illegal byte sequence on Mac OS X

I'm trying to replace a string in a Makefile on Mac OS X for cross-compiling to iOS. The string has embedded double quotes. The command is:

sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

And the error is:

sed: RE error: illegal byte sequence

I've tried escaping the double quotes, commas, dashes, and colons with no joy. For example:

sed -i "" 's|\"iphoneos-cross\"\,\"llvm-gcc\:\-O3|\"iphoneos-cross\"\,\"clang\:\-Os|g' Configure

I'm having a heck of a time debugging the issue. Does anyone know how to get sed to print the position of the illegal byte sequence? Or does anyone know what the illegal byte sequence is?

This question is related to regex macos bash sed

The answer is


My workaround had been using Perl:

find . -type f -print0 | xargs -0 perl -pi -e 's/was/now/g'

You simply have to pipe an iconv command before the sed command. Ex with file.txt input :

iconv -f ISO-8859-1 -t UTF8-MAC file.txt | sed 's/something/àéèêçùû/g' | .....

-f option is the 'from' codeset and -t option is the 'to' codeset conversion.

Take care of case, web pages usually show lowercase like that < charset=iso-8859-1"/> and iconv uses uppercase. You have list of iconv supported codesets in you system with command iconv -l

UTF8-MAC is modern OS Mac codeset for conversion.


My workaround had been using gnu sed. Worked fine for my purposes.


Add the following lines to your ~/.bash_profile or ~/.zshrc file(s).

export LC_CTYPE=C 
export LANG=C

mklement0's answer is great, but I have some small tweaks.

It seems like a good idea to explicitly specify bash's encoding when using iconv. Also, we should prepend a byte-order mark (even though the unicode standard doesn't recommend it) because there can be legitimate confusions between UTF-8 and ASCII without a byte-order mark. Unfortunately, iconv doesn't prepend a byte-order mark when you explicitly specify an endianness (UTF-16BE or UTF-16LE), so we need to use UTF-16, which uses platform-specific endianness, and then use file --mime-encoding to discover the true endianness iconv used.

(I uppercase all my encodings because when you list all of iconv's supported encodings with iconv -l they are all uppercase.)

# Find out MY_FILE's encoding
# We'll convert back to this at the end
FILE_ENCODING="$( file --brief --mime-encoding MY_FILE )"
# Find out bash's encoding, with which we should encode
# MY_FILE so sed doesn't fail with 
# sed: RE error: illegal byte sequence
BASH_ENCODING="$( locale charmap | tr [:lower:] [:upper:] )"
# Convert to UTF-16 (unknown endianness) so iconv ensures
# we have a byte-order mark
iconv -f "$FILE_ENCODING" -t UTF-16 MY_FILE > MY_FILE.utf16_encoding
# Whether we're using UTF-16BE or UTF-16LE
UTF16_ENCODING="$( file --brief --mime-encoding MY_FILE.utf16_encoding )"
# Now we can use MY_FILE.bash_encoding with sed
iconv -f "$UTF16_ENCODING" -t "$BASH_ENCODING" MY_FILE.utf16_encoding > MY_FILE.bash_encoding
# sed!
sed 's/.*/&/' MY_FILE.bash_encoding > MY_FILE_SEDDED.bash_encoding
# now convert MY_FILE_SEDDED.bash_encoding back to its original encoding
iconv -f "$BASH_ENCODING" -t "$FILE_ENCODING" MY_FILE_SEDDED.bash_encoding > MY_FILE_SEDDED
# Now MY_FILE_SEDDED has been processed by sed, and is in the same encoding as MY_FILE

Does anyone know how to get sed to print the position of the illegal byte sequence? Or does anyone know what the illegal byte sequence is?

$ uname -a
Darwin Adams-iMac 18.7.0 Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64 x86_64

I got part of the way to answering the above just by using tr.

I have a .csv file that is a credit card statement and I am trying to import it into Gnucash. I am based in Switzerland so I have to deal with words like Zürich. Suspecting Gnucash does not like " " in numeric fields, I decide to simply replace all

; ;

with

;;

Here goes:

$ head -3 Auswertungen.csv | tail -1 | sed -e 's/; ;/;;/g'
sed: RE error: illegal byte sequence

I used od to shed some light: Note the 374 halfway down this od -c output

$ head -3 Auswertungen.csv | tail -1 | od -c
0000000    1   6   8   7       9   6   1   9       7   1   2   2   ;   5
0000020    4   6   8       8   7   X   X       X   X   X   X       2   6
0000040    6   0   ;   M   Y       N   A   M   E       I   S   X   ;   1
0000060    4   .   0   2   .   2   0   1   9   ;   9   5   5   2       -
0000100        M   i   t   a   r   b   e   i   t   e   r   r   e   s   t
0000120                Z 374   r   i   c   h                            
0000140    C   H   E   ;   R   e   s   t   a   u   r   a   n   t   s   ,
0000160        B   a   r   s   ;   6   .   2   0   ;   C   H   F   ;    
0000200    ;   C   H   F   ;   6   .   2   0   ;       ;   1   5   .   0
0000220    2   .   2   0   1   9  \n                                    
0000227

Then I thought I might try to persuade tr to substitute 374 for whatever the correct byte code is. So first I tried something simple, which didn't work, but had the side effect of showing me where the troublesome byte was:

$ head -3 Auswertungen.csv | tail -1 | tr . .  ; echo
tr: Illegal byte sequence
1687 9619 7122;5468 87XX XXXX 2660;MY NAME ISX;14.02.2019;9552 - Mitarbeiterrest   Z

You can see tr bails at the 374 character.

Using perl seems to avoid this problem

$ head -3 Auswertungen.csv | tail -1 | perl -pne 's/; ;/;;/g'
1687 9619 7122;5468 87XX XXXX 2660;ADAM NEALIS;14.02.2019;9552 - Mitarbeiterrest   Z?rich       CHE;Restaurants, Bars;6.20;CHF;;CHF;6.20;;15.02.2019

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to macos

Problems with installation of Google App Engine SDK for php in OS X dyld: Library not loaded: /usr/local/opt/openssl/lib/libssl.1.0.0.dylib dyld: Library not loaded: /usr/local/opt/icu4c/lib/libicui18n.62.dylib error running php after installing node with brew on Mac Could not install packages due to an EnvironmentError: [Errno 13] How do I install Java on Mac OSX allowing version switching? Git is not working after macOS Update (xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools) Can't compile C program on a Mac after upgrade to Mojave You don't have write permissions for the /Library/Ruby/Gems/2.3.0 directory. (mac user) How can I install a previous version of Python 3 in macOS using homebrew? Could not install packages due to a "Environment error :[error 13]: permission denied : 'usr/local/bin/f2py'"

Examples related to bash

Comparing a variable with a string python not working when redirecting from bash script Zipping a file in bash fails How do I prevent Conda from activating the base environment by default? Get first line of a shell command's output Fixing a systemd service 203/EXEC failure (no such file or directory) /bin/sh: apt-get: not found VSCode Change Default Terminal Run bash command on jenkins pipeline How to check if the docker engine and a docker container are running? How to switch Python versions in Terminal?

Examples related to sed

Retrieve last 100 lines logs How to replace multiple patterns at once with sed? Insert multiple lines into a file after specified pattern using shell script Linux bash script to extract IP address Ansible playbook shell output remove white space from the end of line in linux bash, extract string before a colon invalid command code ., despite escaping periods, using sed RE error: illegal byte sequence on Mac OS X How to use variables in a command in sed?