remove script tag from HTML content

Question

I am using HTML Purifier (http://htmlpurifier.org/). I just want to remove <script> tags only. I don't want to remove inline formatting or anything else. How can I achieve this? One more thing: is there any other way to remove script tags from HTML?

User · Answer

    function remove_script_tags($html) {
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        $script = $dom->getElementsByTagName('script');

        // Collect the nodes first; the node list is live and changes while nodes are removed.
        $remove = [];
        foreach ($script as $item) {
            $remove[] = $item;
        }

        foreach ($remove as $item) {
            $item->parentNode->removeChild($item);
        }

        $html = $dom->saveHTML();
        // Strip the doctype and wrapper tags that saveHTML() adds.
        $html = preg_replace('/<!DOCTYPE.*?<html>.*?<body><p>/ims', '', $html);
        $html = str_replace('</p></body></html>', '', $html);
        return $html;
    }

Dejan's answer was good, but saveHTML() adds unnecessary doctype and body tags; this should get rid of them. See https://3v4l.org/82FNP
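The regex cleanup above depends on the exact wrapper that saveHTML() happens to emit. A possible alternative, shown here only as a sketch, is to serialize the children of <body> one by one, so the doctype and wrapper tags never reach the output. The function name strip_scripts_body_only is made up for this illustration, and the sketch assumes the input is an HTML fragment that libxml places inside <body>.

```php
<?php
// Hypothetical helper: remove <script> elements and return only the markup inside <body>.
function strip_scripts_body_only(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects empty input on PHP 8
    }

    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list into an array before removing anything.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body>; the wrapper elements are never written.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$clean = strip_scripts_body_only('<p>Hello <b>world</b></p><script>alert(1)</script>');
echo $clean, PHP_EOL;
```

Passing a node to saveHTML() needs PHP 5.3.6 or later.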

User · Answer

Try this complete and flexible solution. It works perfectly, and is based in part on some previous answers, but contains additional validation checks and gets rid of the additional implied HTML from the loadHTML(...) function. It is divided into two separate functions (one depends on the other, so don't re-order them) so you can use it with multiple HTML tags that you would like to remove simultaneously (i.e. not just 'script' tags). For example, the removeAllInstancesOfTag(...) function accepts an array of tag names, or optionally just one as a string. So, without further ado, here is the code:

    /* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [BEGIN] */

    /* Usage Example: $scriptless_html = removeAllInstancesOfTag($html, 'script'); */

    if (!function_exists('removeAllInstancesOfTag')) {
        function removeAllInstancesOfTag($html, $tag_nm) {
            if (!empty($html)) {
                $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'); /* For UTF-8 Compatibility. */
                $doc = new DOMDocument();
                $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOWARNING);

                if (!empty($tag_nm)) {
                    if (is_array($tag_nm)) {
                        $tag_nms = $tag_nm;
                        unset($tag_nm);

                        foreach ($tag_nms as $tag_nm) {
                            $rmvbl_itms = $doc->getElementsByTagName(strval($tag_nm));
                            $rmvbl_itms_arr = [];

                            foreach ($rmvbl_itms as $itm) {
                                $rmvbl_itms_arr[] = $itm;
                            }

                            foreach ($rmvbl_itms_arr as $itm) {
                                $itm->parentNode->removeChild($itm);
                            }
                        }
                    } else if (is_string($tag_nm)) {
                        $rmvbl_itms = $doc->getElementsByTagName($tag_nm);
                        $rmvbl_itms_arr = [];

                        foreach ($rmvbl_itms as $itm) {
                            $rmvbl_itms_arr[] = $itm;
                        }

                        foreach ($rmvbl_itms_arr as $itm) {
                            $itm->parentNode->removeChild($itm);
                        }
                    }
                }

                return $doc->saveHTML();
            } else {
                return '';
            }
        }
    }

    /* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [END] */

    /* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [BEGIN] */

    /* Prerequisites: 'removeAllInstancesOfTag(...)' */

    if (!function_exists('removeAllScriptTags')) {
        function removeAllScriptTags($html) {
            return removeAllInstancesOfTag($html, 'script');
        }
    }

    /* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [END] */

And here is a test usage example:

    $html = 'This is a JavaScript retention test.<br><br><span id="chk_frst_scrpt">Congratulations! The first \'script\' tag was successfully removed!</span><br><br><span id="chk_secd_scrpt">Congratulations! The second \'script\' tag was successfully removed!</span><script>document.getElementById("chk_frst_scrpt").innerHTML = "Oops! The first \'script\' tag was NOT removed!";</script><script>document.getElementById("chk_secd_scrpt").innerHTML = "Oops! The second \'script\' tag was NOT removed!";</script>';
    echo removeAllScriptTags($html);

I hope my answer really helps someone. Enjoy!
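One caveat on the UTF-8 step in this answer: passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. A commonly used workaround, sketched below, is to prepend an XML encoding declaration so that libxml reads the input as UTF-8. The function name strip_scripts_utf8 is invented for this example, and the exact serialization of non-ASCII characters (raw or as entities) is not something this sketch relies on.

```php
<?php
// Hypothetical helper: remove <script> elements from UTF-8 input without mb_convert_encoding().
function strip_scripts_utf8(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The prepended declaration is only a hint that makes libxml treat the bytes as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Snapshot the live node list, then remove each <script>.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    // Serialize from <html> down, which leaves the encoding hint out of the result.
    return $dom->saveHTML($dom->documentElement);
}

$out = strip_scripts_utf8('<p>café</p><script>alert(1)</script>');
echo $out, PHP_EOL;
```

This version returns the markup wrapped in <html> and <body>; whether that wrapper is acceptable depends on where the result is used.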

User · Answer

An example modifying ctf0's answer. This should only do the preg_replace once, but also checks for errors and blocks the character codes for a forward slash.

    $str = '<script> var a - 1; <&#47;script>';

    $pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
    $replace = preg_replace($pattern, '', $str);
    return ($replace !== null) ? $replace : $str;

If you are using PHP 7 you can use the null coalesce operator to simplify it even more.

    $pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
    return (preg_replace($pattern, '', $str) ?? $str);
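To see when the null check actually matters, here is a small sketch of one way preg_replace() can return null: a pattern with the u modifier applied to a subject that is not valid UTF-8. The pattern and input are illustrative only.

```php
<?php
$pattern = '/<script.*?\/script>/ius';
$bad = "<p>ok</p>\xFF<script>alert(1)</script>"; // \xFF is not valid UTF-8

$result = preg_replace($pattern, '', $bad);
$error = preg_last_error();

var_dump($result === null);               // the replacement failed, nothing was replaced
var_dump($error === PREG_BAD_UTF8_ERROR); // the error code explains why

// The fallback hands back the ORIGINAL string, script tag included.
$safe = $result ?? $bad;
```

Note that falling back to the original input means the script tag survives; for untrusted input, rejecting the string may be the safer reaction to a null result.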

User · Answer

This is a merge of both ClandestineCoder's & Binh WPO's answers.

The problem with the script tag arrows is that they can have more than one variant, e.g. (< = &lt; = &amp;lt;) & (> = &gt; = &amp;gt;).

So instead of creating a pattern array with a bazillion variants, IMHO a better solution would be:

    return preg_replace('/script.*?\/script/ius', '', $text)
        ? preg_replace('/script.*?\/script/ius', '', $text)
        : $text;

This will remove anything that looks like script.../script regardless of the arrow code variant, and you can test it here: https://regex101.com/r/lK6vS8/1

User · Answer

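    $html = <<<HTML
    ...
    HTML;
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $tags_to_remove = array('script', 'style', 'iframe', 'link');
    foreach ($tags_to_remove as $tag) {
        // Copy the live node list to an array first; removing nodes while
        // iterating the live list skips every other match.
        $element = iterator_to_array($dom->getElementsByTagName($tag));
        foreach ($element as $item) {
            $item->parentNode->removeChild($item);
        }
    }
    $html = $dom->saveHTML();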
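The same multi-tag removal can also be expressed as a single XPath query, which avoids one getElementsByTagName() pass per tag. This is only a sketch: the function name remove_elements and the tag-name whitelist regex are assumptions made for the example.

```php
<?php
// Hypothetical helper: remove every element whose tag name is in $tags.
function remove_elements(string $html, array $tags): string
{
    if (trim($html) === '') {
        return '';
    }

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Keep only plain tag names so nothing unexpected ends up in the XPath expression.
    $paths = [];
    foreach ($tags as $tag) {
        if (is_string($tag) && preg_match('/^[a-z][a-z0-9]*$/i', $tag)) {
            $paths[] = '//' . strtolower($tag);
        }
    }

    if ($paths !== []) {
        $xpath = new DOMXPath($dom);
        $nodes = $xpath->query(implode(' | ', $paths));
        if ($nodes !== false) {
            foreach (iterator_to_array($nodes) as $node) {
                // A node nested inside an already removed element has no parent left.
                if ($node->parentNode !== null) {
                    $node->parentNode->removeChild($node);
                }
            }
        }
    }

    return $dom->saveHTML();
}

$out = remove_elements(
    '<p>keep</p><style>p{}</style><script>x()</script><iframe src="a"></iframe>',
    ['script', 'style', 'iframe', 'link']
);
echo $out;
```

The result still carries the doctype and wrapper tags that saveHTML() adds.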

User · Answer

Shorter:

    $html = preg_replace("/<script.*?\/script>/s", "", $html);

When doing regex, things might go wrong, so it's safer to do it like this:

    $html = preg_replace("/<script.*?\/script>/s", "", $html) ? : $html;

So that when the "accident" happens, we get the original $html instead of an empty string.
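One edge case is worth checking, assuming the fallback is written with the short ternary (?:) as it appears to be here: that operator also fires when the replacement legitimately produces an empty string. If the whole input is a script tag, the original comes back untouched. A minimal sketch contrasting it with the null coalescing operator:

```php
<?php
$pattern = '/<script.*?\/script>/s';
$html = '<script>alert(1)</script>'; // the input consists of nothing but a script tag

// Short ternary: '' is falsy, so the ORIGINAL string (with the script) is returned.
$withShortTernary = preg_replace($pattern, '', $html) ?: $html;

// Null coalescing: falls back only when preg_replace() returned null (a real error).
$withCoalesce = preg_replace($pattern, '', $html) ?? $html;

var_dump($withShortTernary); // string(25) "<script>alert(1)</script>"
var_dump($withCoalesce);     // string(0) ""
```

The same applies to an input that becomes the string "0" after replacement, since that is falsy in PHP as well.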

User · Answer

Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:

    $html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases it's useful for quickly fixing some markup, but as with all quick fixes, forget about security. Use regex only on content/markup you trust.

Remember, anything that a user inputs should be considered not safe.

A better solution here would be to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:

    <?php

    $html = <<<HTML
    ...
    HTML;

    $dom = new DOMDocument();

    $dom->loadHTML($html);

    $script = $dom->getElementsByTagName('script');

    $remove = [];
    foreach ($script as $item) {
        $remove[] = $item;
    }

    foreach ($remove as $item) {
        $item->parentNode->removeChild($item);
    }

    $html = $dom->saveHTML();

I have removed the HTML intentionally because even this can bork.
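To make the "forget about security" warning concrete, here is a small sketch of an input that defeats a single regex pass. It assumes the pattern is the one from this answer; the payload is a textbook example, not taken from the original post.

```php
<?php
$pattern = '#<script(.*?)>(.*?)</script>#is';

// Each "<script></script>" sits in the middle of a split-up outer tag.
$payload = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';

// One pass deletes the two inner pairs, and the leftovers join into a working script tag.
$result = preg_replace($pattern, '', $payload);

echo $result, PHP_EOL; // <script>alert(1)</script>
```

Removing the inner tags is exactly what assembles the outer one, which is why a single replacement pass cannot be trusted on user input.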

User · Answer

Use the str_replace function to replace them with empty space or something:

    $query = '<script>console.log("I should be banned")</script>';

    $badChar = array('<script>', '</script>');
    $query = str_replace($badChar, '', $query);

    echo $query;
    // this echoes console.log("I should be banned")

User · Answer

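A simple way, by manipulating the string:

    $str = stripStr($str, '<script', '</script>');

    function stripStr($str, $ini, $fin)
    {
        // Repeat while an opening marker is still present.
        while (($pos = mb_stripos($str, $ini)) !== false) {
            // Everything after the opening marker.
            $aux = mb_substr($str, $pos + mb_strlen($ini));
            // Keep the text before the opening marker and the text after the closing marker.
            $str = mb_substr($str, 0, $pos) . mb_substr($aux, mb_stripos($aux, $fin) + mb_strlen($fin));
        }

        return $str;
    }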

User · Answer

I had been struggling with this question. I discovered you only really need one function: explode('>', $html). The single common denominator to any tag is < and >. Then after that it's usually quotation marks ( " ). You can extract information so easily once you find the common denominator. This is what I came up with:

    $html = file_get_contents('http://some_page.html');

    $h = explode('>', $html);
    $counter = 0;

    foreach ($h as $k => $v) {
        $v = trim($v); // clean it up a bit

        if (preg_match('/^(<script[.*]*)/ius', $v)) { // my regex here might be questionable
            $counter = $k; // match opening tag and start counter for backtrace
        } elseif (preg_match('/([.*]*<\/script$)/ius', $v)) { // but it gets the job done
            $script_length = $k - $counter;
            $counter = 0;

            for ($i = $script_length; $i >= 0; $i--) {
                $h[$k - $i] = ''; // backtrace and clear everything in between
            }
        }
    }

    $ht = [];
    for ($i = 0; $i < count($h); $i++) { // "<" rather than "<=", which would read past the end of $h
        if ($h[$i] != '') {
            $ht[$i] = $h[$i]; // clean out the blanks so when we implode it works right
        }
    }

    $html = implode('>', $ht); // all scripts stripped

    echo $html;

I see this really only working for script tags, because you will never have nested script tags. Of course, you can easily add more code that does the same check and gathers nested tags.

I call it accordion coding. implode() and explode() are the easiest ways to get your logic flowing if you have a common denominator.

User · Answer

I would use BeautifulSoup if it's available. It makes this sort of thing very easy. Don't try to do it with regexps. That way lies madness.

User · Answer

This is a simplified variant of Dejan Marjanovic's answer:

    function removeTags($html, $tag) {
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        // iterator_to_array() copies the live node list so removal does not disturb the loop.
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
            $item->parentNode->removeChild($item);
        }
        return $dom->saveHTML();
    }

Can be used to remove any kind of tag, including <script>:

    $scriptlessHtml = removeTags($html, 'script');

User · Answer

Use the PHP DOMDocument parser.

    $doc = new DOMDocument();

    // load the HTML string we want to strip
    $doc->loadHTML($html);

    // get all the script tags
    $script_tags = $doc->getElementsByTagName('script');

    $length = $script_tags->length;

    // for each tag, remove it from the DOM
    // (the list is live, so the next remaining tag is always at index 0)
    for ($i = 0; $i < $length; $i++) {
        $script_tags->item(0)->parentNode->removeChild($script_tags->item(0));
    }

    // get the HTML string back
    $no_script_html_string = $doc->saveHTML();

This worked for me using the following HTML document:

    <!doctype html>
    <html>
        <head>
            <meta charset="utf-8">
            <title>
                hey
            </title>
            <script>
                alert("hello");
            </script>
        </head>
        <body>
            hey
        </body>
    </html>

Just bear in mind that the DOMDocument parser requires PHP 5 or greater.
