ruby 1.9: invalid byte sequence in UTF-8

Question

I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites. When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors. From what I understood, the net/http library doesn't have any encoding-specific options, and the stuff that comes in is basically not properly tagged. What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far.
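For reference, the error can be reproduced without any network access. This is only a sketch: the sample body is made up, and it assumes the string ends up tagged as UTF-8 while containing a byte that is not valid UTF-8.

```ruby
# Hypothetical page body: tagged UTF-8, but "\xE9" is a lone Latin-1 byte
# and therefore not a valid UTF-8 sequence.
def sample_body
  "<a href=\"http://example.com/\">caf\xE9</a>"
end

# Returns the message of the ArgumentError raised by scan,
# or nil if scan succeeds.
def scan_error(body)
  body.scan(/href="(.*?)"/i)
  nil
rescue ArgumentError => e
  e.message
end

puts sample_body.valid_encoding?
puts scan_error(sample_body).inspect
```

The regex itself is fine; matching any regex against a string whose bytes are invalid for its tagged encoding is what raises.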

User · Answer

    attachment = file.read

    begin
      # Try it as UTF-8 directly
      cleaned = attachment.dup.force_encoding('UTF-8')
      unless cleaned.valid_encoding?
        # Some of it might be old Windows code page
        cleaned = attachment.encode('UTF-8', 'Windows-1252')
      end
      attachment = cleaned
    rescue EncodingError
      # Force it to UTF-8, throwing out invalid bits
      attachment = attachment.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
    end

User · Answer

This seems to work:

    def sanitize_utf8(string)
      return nil if string.nil?
      return string if string.valid_encoding?
      string.chars.select { |c| c.valid_encoding? }.join
    end

User · Answer

I've encountered a string that mixed English, Russian and some other alphabets, which caused an exception. I need only Russian and English, and this currently works for me:

    ec1 = Encoding::Converter.new "UTF-8", "Windows-1251", :invalid => :replace, :undef => :replace, :replace => ""
    ec2 = Encoding::Converter.new "Windows-1251", "UTF-8", :invalid => :replace, :undef => :replace, :replace => ""
    t = ec2.convert ec1.convert t

User · Answer

In Ruby 1.9.3 it is possible to use String#encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work both in 1.8 (iconv) and 1.9 (String#encode):

    require 'iconv' unless String.method_defined?(:encode)
    if String.method_defined?(:encode)
      file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
    else
      ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
      file_contents = ic.iconv(file_contents)
    end

or, if you have really troublesome input, you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:

    require 'iconv' unless String.method_defined?(:encode)
    if String.method_defined?(:encode)
      file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
      file_contents.encode!('UTF-8', 'UTF-16')
    else
      ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
      file_contents = ic.iconv(file_contents)
    end

User · Answer

Try this:

    def to_utf8(str)
      str = str.force_encoding('UTF-8')
      return str if str.valid_encoding?
      str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
    end
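One thing worth knowing about the 'binary' source encoding: when the bytes are treated as binary, every byte at or above 0x80 is undefined in UTF-8, so the replacement drops valid multi-byte characters along with the broken ones. The sketch below uses a made-up sample string to show this, and compares it with String#scrub, which is a different technique that only exists on Ruby 2.1 and later (so not on the 1.9 the question targets).

```ruby
# Made-up input: valid UTF-8 "café", a space, then one invalid byte.
def mixed_input
  "caf\xC3\xA9 \xFF"
end

# The approach from the answer: read the bytes as binary, so every byte
# >= 0x80 counts as undefined in UTF-8 and is replaced with ''.
def strip_via_binary(str)
  str.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

# Alternative (Ruby 2.1+ only): String#scrub replaces just the invalid bytes.
def strip_via_scrub(str)
  str.scrub('')
end

p strip_via_binary(mixed_input)
p strip_via_scrub(mixed_input)
```

Because a helper that first returns early on valid_encoding? only reaches the encode call for strings that are already broken, the loss of non-ASCII characters affects only those strings.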

User · Answer

Neither the accepted answer nor the other answer works for me. I found this post which suggested:

    string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

This fixed the problem for me.

User · Answer

While Nakilon's solution works, at least as far as getting past the error, in my case I had this weird f-ed up character originating from Microsoft Excel converted to CSV that was registering in ruby as a (get this) cyrillic K, which in ruby was a bolded K. To fix this I used 'iso-8859-1', viz. CSV.parse(f, :encoding => "iso-8859-1"), which turned my freaky deaky cyrillic K's into a much more manageable /\xCA/, which I could then remove with string.gsub(/\xCA/, '').

User · Answer

My current solution is to run:

    my_string.unpack("C*").pack("U*")

This will at least get rid of the exceptions, which was my main problem.
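To make the effect of that one-liner concrete: unpack("C*") returns one integer per byte and pack("U*") encodes each integer as a Unicode code point, so the result is the same as reading the input as ISO-8859-1. A small sketch follows; the helper names are made up for illustration.

```ruby
# One integer per byte, then each integer written out as a UTF-8 code point.
def bytes_as_codepoints(str)
  str.unpack('C*').pack('U*')
end

# The same conversion spelled as an explicit ISO-8859-1 -> UTF-8 transcode.
def latin1_to_utf8(str)
  str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

p bytes_as_codepoints("\xE9")    # a lone Latin-1 byte becomes U+00E9
p bytes_as_codepoints("\u00E9")  # already-valid UTF-8 gets double-encoded
```

The result is always valid UTF-8, which is why the exceptions go away, but input that was already correct UTF-8 comes out as mojibake for anything outside ASCII.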

User · Answer

Before you use scan, make sure that the requested page's Content-Type header is text/html, since there can be links to things like images, which are not encoded in UTF-8. The page could also be non-HTML if you picked up an href in something like a <link> element. How to check this varies depending on which HTTP library you are using. Then, make sure the result is only ASCII with String#ascii_only? (not UTF-8, because HTML is only supposed to be using ASCII; entities can be used otherwise). If both of those tests pass, it is safe to use scan.
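The two checks described above could be wrapped up roughly like this. It is a sketch under assumptions: safe_to_scan? is a made-up helper, and it takes the raw Content-Type header value as a string, since how you obtain that header depends on your HTTP library.

```ruby
# Hypothetical helper: content_type is the raw Content-Type header value,
# body is the response body as a String.
def safe_to_scan?(content_type, body)
  # Drop parameters such as "; charset=utf-8" before comparing.
  media_type = content_type.to_s.split(';').first.to_s.strip.downcase
  media_type == 'text/html' && body.ascii_only?
end

body = '<a href="http://example.com/">home</a>'
links = safe_to_scan?('text/html; charset=utf-8', body) ? body.scan(/href="(.*?)"/i).flatten : []
p links
```

Note that ascii_only? rejects every page containing non-ASCII text, valid or not, so this trades coverage for safety.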

User · Answer

If you don't "care" about the data you can just do something like:

    search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"

I just used valid_encoding? to get past it. Mine is a search field, and I was finding the same weirdness over and over, so I used something like this just to have the system not break. Since I don't control the user experience to autovalidate prior to sending this info (like auto feedback to say "dummy up!"), I can just take it in, strip it out and return blank results.

User · Answer

I recommend you to use a HTML parser  Just find the fastest one   Parsing HTML is not as easy as it may seem   Browsers parse invalid UTF-8 sequences  in UTF-8 HTML documents  just putting the     symbol  So once the invalid UTF-8 sequence in the HTML gets parsed the resulting text is a valid string   Even inside attribute values you have to decode HTML entities like amp  Here is a great question that sums up why you can not reliably parse HTML with a regular expression  RegEx match open tags except XHTML self-contained tags
