PHP DOMDocument loadHTML not encoding UTF-8 correctly

Question

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).

$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);

$divs = $dom->getElementsByTagName('div');

foreach ($divs as $div) {
    echo $dom->saveHTML($div);
}

The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:

echo $profile;

it displays correctly. I've tried saveHTML and saveXML, and neither displays correctly. I am using PHP 5.3.

What I see:

[garbled, non-Japanese characters]

What should be shown:

[the original Japanese text]

EDIT: I've simplified the code down to five lines so you can test it yourself.

$profile = "<div lang=ja><p>[Japanese text]</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;

Here is the HTML that is returned:

<div lang="ja"><p>[garbled, non-Japanese characters]</p></div>
<div lang="ja"><p>[Japanese text]</p></div>

User · Answer

This took me a while to figure out but here's my answer.

Before using DomDocument I would use file_get_contents to retrieve urls and then process them with string functions. Perhaps not the best way but quick. After being convinced Dom was just as quick I first tried the following:

$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) == false) { // read the url
    // error message
}
else {
    // process
}

This failed spectacularly in preserving UTF-8 encoding despite the proper meta tags, php settings and all the rest of the remedies offered here and elsewhere. Here's what works:

$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) == false) {
    // error message
}

etc. Now everything's right with the world. Hope this helps.
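One caveat with the snippet above: passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2. A minimal sketch of the same idea using mb_encode_numericentity follows. The helper name utf8ToNumericEntities and the sample string are made up for illustration, and the conversion map assumes you want every non-ASCII code point turned into a numeric entity.

```php
<?php
// Hypothetical helper (not from the answer above): replaces every non-ASCII
// character in a UTF-8 string with a decimal numeric entity, so the bytes
// handed to loadHTML() are pure ASCII and cannot be misread.
function utf8ToNumericEntities(string $html): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

// "caf\xC3\xA9" is "café" written as explicit UTF-8 bytes.
$dom = new DOMDocument();
$dom->loadHTML(utf8ToNumericEntities("<p>caf\xC3\xA9</p>"));

// textContent is always returned as UTF-8 by PHP's DOM extension.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

ASCII markup such as `<p>` passes through untouched, because only code points from 0x80 upward fall inside the map.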

User · Answer

The problem is with saveHTML() and saveXML(): both of them do not work correctly in Unix. They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.

The workaround is very simple:

If you try the default, you will get the error you described:

$str = $dom->saveHTML(); // saves incorrectly

All you have to do is save as follows:

$str = $dom->saveHTML($dom->documentElement); // saves correctly

This line of code will get your UTF-8 characters to be saved correctly. Use the same workaround if you are using saveXML().

Update: As suggested by "Jack M" in the comments section below, and verified by "Pamela" and "Marco Aurélio Deleu", the following variation might work in your case:

$str = utf8_decode($dom->saveHTML($dom->documentElement));

Note: English characters do not cause any problem when you use saveHTML() without parameters, because English characters are saved as single-byte characters in UTF-8. The problem happens when you have multi-byte characters, such as Chinese, Russian, Arabic, Hebrew, etc.

I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.

User · Answer

Works fine for me:

$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
// ...
return utf8_encode($dom->saveHTML());

User · Answer

Make sure the real source file is saved as UTF-8 (you may even want to try the non-recommended BOM chars with UTF-8 to make sure).

Also, in case of HTML, make sure you have declared the correct encoding using meta tags:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If it's a CMS (as you've tagged your question with Joomla), you may need to configure appropriate settings for the encoding.

User · Answer

The only thing that worked for me was the accepted answer of

$profile = '<p>[Japanese text]9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

HOWEVER, this brought about a new issue: <?xml encoding="utf-8" ?> ends up in the output of the document.

The solution for me was then to do

foreach ($dom->childNodes as $xx) {
    if ($xx instanceof \DOMProcessingInstruction) {
        $xx->parentNode->removeChild($xx);
    }
}

Some solutions told me that to remove the XML header, I had to perform

$dom->saveXML($dom->documentElement);

This didn't work for me because, for a partial document (e.g. a doc with two <p> tags), only one of the <p> tags was being returned.
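The two steps above (prepend the declaration, then strip it again) can be combined with a way to get every top-level element back out of a partial document. This is only a sketch: loadUtf8Html and innerBodyHtml are hypothetical helper names, the sample input is made up, and whether the declaration shows up as a DOMProcessingInstruction depends on the libxml2 version PHP is built against.

```php
<?php
// Hypothetical helper: loads a UTF-8 fragment using the XML-declaration
// trick, then removes the declaration node from the document again.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);

    // Copy the live child list into an array first, so removing a node
    // does not disturb the iteration.
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node instanceof DOMProcessingInstruction) {
            $dom->removeChild($node);
        }
    }
    return $dom;
}

// Hypothetical helper: serializes every child of <body>, so a fragment
// with several top-level elements is returned in full.
function innerBodyHtml(DOMDocument $dom): string
{
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

// "caf\xC3\xA9" is "café" written as explicit UTF-8 bytes.
$dom = loadUtf8Html("<p>caf\xC3\xA9</p><p>two</p>");
$out = innerBodyHtml($dom);
echo $out, PHP_EOL;
```

Serializing the children of body rather than documentElement also leaves out the html and body wrappers that loadHTML adds around a fragment.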

User · Answer

The problem is that when you add a parameter to the DOMDocument::saveHTML() function, you lose the encoding. In a few cases, you'll need to avoid the use of the parameter and use old string functions to find what you are looking for. I think the previous answer works for you, but since this workaround didn't work for me, I'm adding this answer to help people who may be in my case.

User · Answer

Can also encode like below (gathered from https://davidwalsh.name/domdocument-utf8-problem):

$profile = '<p>[Japanese text]9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

User · Answer

You could prefix a line enforcing UTF-8 encoding, like this:

$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);

And you can then continue with the code you already have, like:

$doc->saveXML()

User · Answer

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.

If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:

$profile = '<p>[Japanese text]9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

If you cannot know if the string will contain such a declaration already, there's a workaround in SmartDOMDocument which should help you:

$profile = '<p>[Japanese text]9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
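To see why the output turns into unrelated characters, it helps to reproduce the misreading by hand. The sketch below is not DOMDocument code. It only mimics the effect described above by taking the UTF-8 bytes of a single character and converting them as if they were ISO-8859-1. The variable names are made up for illustration.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$original = "\xC3\xA9";

// Treat those two bytes as two ISO-8859-1 characters and re-encode them
// as UTF-8, which is the kind of misreading described above.
$misread = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo $misread, PHP_EOL;                                  // Ã©
echo strlen($original), ' ', strlen($misread), PHP_EOL;  // 2 4
```

Each byte of the original character becomes its own character, which is why multi-byte text such as Japanese grows into a long run of accented Latin letters and symbols.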

User · Answer

This worked for me. In the php.ini file, change the following property:

Before: mbstring.encoding_translation = On
After: mbstring.encoding_translation = Off

User · Answer

I am using PHP 7.3.8 on Manjaro and I was working with Persian content. This solved my problem:

$html = 'hi</b><p>[Persian text]<div>[Persian text]9';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

User · Answer

Use this for a correct result:

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;

This operation

mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');

is a bad way, because special symbols like &lt; and &gt; can be in $profile and they will not be converted twice after mb_convert_encoding. It is a hole for XSS and incorrect HTML.
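A quick way to check that the meta prefix does its job is to read the text back through textContent, which PHP's DOM extension returns as UTF-8. This is a sketch under assumptions: the sample string is made up, and removing the meta element afterwards is an optional extra step that is not part of the answer above.

```php
<?php
// "caf\xC3\xA9" is "café" written as explicit UTF-8 bytes.
$profile = "<p>caf\xC3\xA9</p>";

$dom = new DOMDocument();
$dom->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile
);

// Optional cleanup (an assumption, not from the answer): drop the helper
// meta element so it does not appear in later output.
$meta = $dom->getElementsByTagName('meta')->item(0);
if ($meta !== null) {
    $meta->parentNode->removeChild($meta);
}

$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === "caf\xC3\xA9");
```

If the comparison comes out false on your setup, the declared charset was not picked up, and one of the entity-based approaches from the other answers may be the safer fallback.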

User · Answer

You must feed the DOMDocument a version of your HTML with a header that makes sense, just like HTML5.

$profile = '<?xml version="1.0" encoding="' . $encoding . '"?>' . $html;

It may be a good idea to keep your HTML as valid as you can, so you don't get into issues when you start querying around :-) and stay away from htmlentities! That's an unnecessary back and forth wasting resources. Keep your code insane!
