I'm trying to replace a string in a Makefile on Mac OS X for cross-compiling to iOS. The string has embedded double quotes. The command is:
sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure
And the error is:
sed: RE error: illegal byte sequence
I've tried escaping the double quotes, commas, dashes, and colons with no joy. For example:
sed -i "" 's|\"iphoneos-cross\"\,\"llvm-gcc\:\-O3|\"iphoneos-cross\"\,\"clang\:\-Os|g' Configure
I'm having a heck of a time debugging the issue. Does anyone know how to get sed
to print the position of the illegal byte sequence? Or does anyone know what the illegal byte sequence is?
You simply have to pipe an iconv command before the sed command. Ex with file.txt input :
iconv -f ISO-8859-1 -t UTF8-MAC file.txt | sed 's/something/àéèêçùû/g' | .....
-f option is the 'from' codeset and -t option is the 'to' codeset conversion.
Take care of case, web pages usually show lowercase like that < charset=iso-8859-1"/> and iconv uses uppercase. You have list of iconv supported codesets in you system with command iconv -l
UTF8-MAC is modern OS Mac codeset for conversion.
Does anyone know how to get sed to print the position of the illegal byte sequence? Or does anyone know what the illegal byte sequence is?
$ uname -a
Darwin Adams-iMac 18.7.0 Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64 x86_64
I got part of the way to answering the above just by using tr.
I have a .csv file that is a credit card statement and I am trying to import it into Gnucash. I am based in Switzerland so I have to deal with words like Zürich. Suspecting Gnucash does not like " " in numeric fields, I decide to simply replace all
; ;
with
;;
Here goes:
$ head -3 Auswertungen.csv | tail -1 | sed -e 's/; ;/;;/g'
sed: RE error: illegal byte sequence
I used od to shed some light: Note the 374 halfway down this od -c output
$ head -3 Auswertungen.csv | tail -1 | od -c
0000000 1 6 8 7 9 6 1 9 7 1 2 2 ; 5
0000020 4 6 8 8 7 X X X X X X 2 6
0000040 6 0 ; M Y N A M E I S X ; 1
0000060 4 . 0 2 . 2 0 1 9 ; 9 5 5 2 -
0000100 M i t a r b e i t e r r e s t
0000120 Z 374 r i c h
0000140 C H E ; R e s t a u r a n t s ,
0000160 B a r s ; 6 . 2 0 ; C H F ;
0000200 ; C H F ; 6 . 2 0 ; ; 1 5 . 0
0000220 2 . 2 0 1 9 \n
0000227
Then I thought I might try to persuade tr to substitute 374 for whatever the correct byte code is. So first I tried something simple, which didn't work, but had the side effect of showing me where the troublesome byte was:
$ head -3 Auswertungen.csv | tail -1 | tr . . ; echo
tr: Illegal byte sequence
1687 9619 7122;5468 87XX XXXX 2660;MY NAME ISX;14.02.2019;9552 - Mitarbeiterrest Z
You can see tr bails at the 374 character.
Using perl seems to avoid this problem
$ head -3 Auswertungen.csv | tail -1 | perl -pne 's/; ;/;;/g'
1687 9619 7122;5468 87XX XXXX 2660;ADAM NEALIS;14.02.2019;9552 - Mitarbeiterrest Z?rich CHE;Restaurants, Bars;6.20;CHF;;CHF;6.20;;15.02.2019
Add the following lines to your ~/.bash_profile
or ~/.zshrc
file(s).
export LC_CTYPE=C
export LANG=C
My workaround had been using gnu sed
. Worked fine for my purposes.
My workaround had been using Perl:
find . -type f -print0 | xargs -0 perl -pi -e 's/was/now/g'
mklement0's answer is great, but I have some small tweaks.
It seems like a good idea to explicitly specify bash
's encoding when using iconv
. Also, we should prepend a byte-order mark (even though the unicode standard doesn't recommend it) because there can be legitimate confusions between UTF-8 and ASCII without a byte-order mark. Unfortunately, iconv
doesn't prepend a byte-order mark when you explicitly specify an endianness (UTF-16BE
or UTF-16LE
), so we need to use UTF-16
, which uses platform-specific endianness, and then use file --mime-encoding
to discover the true endianness iconv
used.
(I uppercase all my encodings because when you list all of iconv
's supported encodings with iconv -l
they are all uppercase.)
# Find out MY_FILE's encoding
# We'll convert back to this at the end
FILE_ENCODING="$( file --brief --mime-encoding MY_FILE )"
# Find out bash's encoding, with which we should encode
# MY_FILE so sed doesn't fail with
# sed: RE error: illegal byte sequence
BASH_ENCODING="$( locale charmap | tr [:lower:] [:upper:] )"
# Convert to UTF-16 (unknown endianness) so iconv ensures
# we have a byte-order mark
iconv -f "$FILE_ENCODING" -t UTF-16 MY_FILE > MY_FILE.utf16_encoding
# Whether we're using UTF-16BE or UTF-16LE
UTF16_ENCODING="$( file --brief --mime-encoding MY_FILE.utf16_encoding )"
# Now we can use MY_FILE.bash_encoding with sed
iconv -f "$UTF16_ENCODING" -t "$BASH_ENCODING" MY_FILE.utf16_encoding > MY_FILE.bash_encoding
# sed!
sed 's/.*/&/' MY_FILE.bash_encoding > MY_FILE_SEDDED.bash_encoding
# now convert MY_FILE_SEDDED.bash_encoding back to its original encoding
iconv -f "$BASH_ENCODING" -t "$FILE_ENCODING" MY_FILE_SEDDED.bash_encoding > MY_FILE_SEDDED
# Now MY_FILE_SEDDED has been processed by sed, and is in the same encoding as MY_FILE
Source: Stackoverflow.com