I have a text file containing unwanted null characters (ASCII NUL, \0
). When I try to view it in vi
I see ^@
symbols, interleaved in normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0
and \x0
, but this did not work.
Remove the null characters? Running strings
on the file cleaned it up, but I'm just wondering if this is the best way?
This question is related to
unix
shell
null
special-characters
I used:
recode UTF-16..UTF-8 <filename>
to get rid of zeroes in file.
If the lines in the file end with \r\n\000 then what works is to delete the \n\000 then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv
to convert it to UTF-8.
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
Here is example how to remove NULL characters using ex
(in-place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursivity, you may use globbing option **/*.txt
(if it is supported by your shell).
Useful for scripting since sed
and its -i
parameter is a non-standard BSD extension.
See also: How to check if the file is a binary file and read all the files which are not?
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16
f=cd.open(filePath,'r','utf-16')
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
this solution edits the file in place, important if the file is still being used. passing -i'ext' creates a backup of the original file with 'ext' suffix added.
Source: Stackoverflow.com