Removing all script tags from html with JS Regular Expression

Question

I want to strip script tags out of this HTML at Pastebin: http://pastebin.com/mdxygM0a

I tried using the regular expression below:

    html.replace(/<script.*>.*<\/script>/ims, " ");

But it does not remove all of the script tags in the HTML. It only removes in-line scripts. I'm looking for a regex that can remove all of the script tags (in-line and multi-line). It would be highly appreciated if a test is carried out on my sample: http://pastebin.com/mdxygM0a
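To make the symptom concrete, here is a minimal sketch of why only the in-line scripts disappear. It assumes an engine (or flag set) where `.` does not match line breaks, which is how JavaScript behaves without the `s` flag. The sample markup and variable names are made up for illustration and are not taken from the Pastebin sample.

```javascript
// Hypothetical sample: one single-line and one multi-line script element.
const html = [
  '<p>one</p>',
  '<script>alert(1)</script>',
  '<p>two</p>',
  '<script type="text/javascript">',
  '  alert(2);',
  '</script>',
  '<p>three</p>',
].join('\n');

// "." stops at line breaks, so only the single-line script is matched.
const dotPattern = /<script.*>.*<\/script>/gi;
const afterDot = html.replace(dotPattern, '');

// [\s\S] matches any character, newlines included; the lazy *? stops at
// the first closing tag instead of running on to a later one.
const anyCharPattern = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;
const afterAnyChar = html.replace(anyCharPattern, '');

console.log(afterDot.includes('alert(2)'));    // multi-line script survived
console.log(afterAnyChar.includes('<script')); // no script tags left
```

This is only meant to explain the symptom. A pattern like this is still not a sanitizer for untrusted HTML.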

User · Answer

Try this:

    var text = text.replace(/<script[^>]*>(?:(?!<\/script>)[^])*<\/script>/g, "");

User · Answer

    /(?:(?!<\/s\w)<[^<]*)*<\/s\w*/gi

- Removes any sequence in any combination with

User · Answer

Here are a variety of shell scripts you can use to strip out different elements.

    # doctype
    find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/<\!DOCTYPE\s\+html[^>]*>/<\!DOCTYPE html>/gi" {} \;

    # meta charset
    find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/<meta[^>]*content=[\"'][^\"']*utf-8[\"'][^>]*>/<meta charset=\"utf-8\">/gi" {} \;

    # script text/javascript
    find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<script[^>]*\)\(\stype=[\"']text\/javascript[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

    # style text/css
    find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<style[^>]*\)\(\stype=[\"']text\/css[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

    # html xmlns
    find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<html[^>]*\)\(\sxmlns=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

    # html xml:lang
    find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<html[^>]*\)\(\sxml:lang=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

User · Answer

In my case, I needed to parse out the page title and still have all the other goodness of jQuery, minus it firing scripts. Here is my solution, which seems to work:

    $.get('somepage.htm', function (data) {
        // excluded code to extract title for simplicity
        var bodySI = data.indexOf('<body>') + '<body>'.length,
            bodyEI = data.indexOf('</body>'),
            body = data.substr(bodySI, bodyEI - bodySI),
            $body;

        // turn each opening and closing script tag into a comment delimiter
        body = body.replace(/<script[^>]*>/gi, ' <!-- ');
        body = body.replace(/<\/script>/gi, ' --> ');

        //console.log(body);

        $body = $('<div>').html(body);
        console.log($body.html());
    });

This shortcut sidesteps the worry about scripts: instead of trying to remove the script tags and their content, you replace the tags with comment delimiters, so each script body ends up inside an HTML comment and is rendered useless.

Let me know if that still presents a problem, as it will help me too.

User · Answer

Why not use jQuery.parseHTML()? http://api.jquery.com/jquery.parsehtml/

User · Answer

If you want to remove all JavaScript code from some HTML text, then removing <script> tags isn't enough, because JavaScript can still live in "onclick", "onerror", "href" and other attributes.

Try out this npm module, which handles all of this: https://www.npmjs.com/package/strip-js
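A small sketch of that point, assuming a simple script-stripping regex (the pattern and the sample markup are illustrative and not taken from strip-js): the script element goes away, but the inline handler and the `javascript:` URL are left untouched.

```javascript
// Hypothetical input carrying JavaScript in three different places.
const input =
  '<img src="x" onerror="alert(1)">' +
  '<a href="javascript:alert(2)">link</a>' +
  '<script>alert(3)</script>';

// Remove only <script> elements; attributes are not inspected at all.
const withoutScripts = input.replace(
  /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi,
  ''
);

console.log(withoutScripts);
console.log(withoutScripts.includes('onerror='));    // handler still present
console.log(withoutScripts.includes('javascript:')); // URL scheme still present
```

So a script-tag regex on its own should not be treated as protection against untrusted markup.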

User · Answer

You can try

    $("#your_div_id").remove();

or

    $("#your_div_id").html("");

User · Answer

This regex should work too:

    <script(?:(?!\/\/)(?!\/\*)[^'"]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\/\/.*(?:\n)|\/\*(?:(?:.|\s))*?\*\/)*?<\/script>

It even allows "problematic" variable strings like these inside:

    <script type="text/javascript">
    var test1 = "</script>";
    var test2 = '\'</script>';
    var test1 = "\"</script>";
    var test1 = "<script>\"";
    var test2 = '<scr\'ipt>';
    /* </script> */
    // </script>
    /* ' */
    // var foo=" '
    </script>

It seems that jQuery and Prototype fail on these.

Edit July 31 '17: Added a) non-capturing groups for better performance (and no empty groups) and b) support for JavaScript comments.

User · Answer

You can do this without a regular expression. Simply turn your HTML string into an HTML node using document.createElement(), find all scripts with element.getElementsByTagName('script'), and then just remove() them.

Fun fact: SO's demo does not like it when you create an element with a <script> tag. The snippet below will not run there, but it does work in the full working demo at JSBin.com.

    var el = document.createElement('html');
    el.innerHTML = "<p>Valid paragraph.</p><p>Another valid paragraph.</p><script>Dangerous scripting!!!</script><p>Last final paragraph.</p>";

    var scripts = el.getElementsByTagName('script'); // Live collection of the script elements

    // The collection is live and shrinks as elements are removed,
    // so keep removing the first item until none are left.
    while (scripts.length > 0) {
        scripts[0].remove();
    }

    console.log(el.innerHTML);

This is a much cleaner solution than a regex, imho.

User · Answer

jQuery uses a regex to remove script tags in some cases, and I'm pretty sure its devs had a damn good reason to do so. Probably some browser does execute scripts when inserting them using innerHTML.

Here's the regex:

    /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi

And before people start crying "but regexes for HTML are evil": yes, they are, but for script tags they are safe because of the special behaviour: a <script> section may not contain </script> at all unless it should end at this position. So matching it with a regex is easily possible. However, from a quick look, the regex above does not account for trailing whitespace inside the closing tag, so you'd have to test whether </script    > etc. will still work.

User · Answer

Whenever you have to resort to regex-based script tag cleanup, at least allow whitespace in the closing tag, in the form of

    <\/script\s*>

Otherwise things like

    <script>alert(666)</script   >

would remain, since trailing spaces after tag names are valid.
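A quick sketch of the difference, using two illustrative patterns that differ only in the closing tag (the names `strict` and `tolerant` and the sample string are made up for this example):

```javascript
// Closing tag must be exactly "</script>".
const strict = /<script\b[^>]*>[\s\S]*?<\/script>/gi;

// Closing tag may have whitespace before the ">".
const tolerant = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;

const sample = 'a<script>alert(666)</script   >b';

const strictResult = sample.replace(strict, '');
const tolerantResult = sample.replace(tolerant, '');

console.log(strictResult === sample); // the strict pattern leaves the script in place
console.log(tolerantResult);          // only the surrounding text remains
```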

User · Answer

Regexes are beatable, but if you have a string version of HTML that you don't want to inject into a DOM, they may be the best approach. You may want to put it in a loop to handle something like:

    <scr<script>Ha!</script>ipt> alert(document.cookie);</script>

Here's what I did, using the jQuery regex from above:

    var SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
    while (SCRIPT_REGEX.test(text)) {
        text = text.replace(SCRIPT_REGEX, "");
    }
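To see why the loop matters, here is a hedged sketch that runs the nested input through one pass and then through repeated passes. The helper name `stripScriptsRepeatedly` is made up for this example, and the loop compares strings rather than calling `test()`, which is a variation on the code above and not what the answer used.

```javascript
const SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: keep replacing until a pass changes nothing.
function stripScriptsRepeatedly(text) {
  let previous;
  do {
    previous = text;
    text = text.replace(SCRIPT_REGEX, '');
  } while (text !== previous);
  return text;
}

const nested = '<scr<script>Ha!</script>ipt> alert(document.cookie);</script>';

// One pass removes the inner element and splices a new <script> tag together.
const onePass = nested.replace(SCRIPT_REGEX, '');

// Repeated passes remove the spliced element as well.
const cleaned = stripScriptsRepeatedly(nested);

console.log(JSON.stringify(onePass));
console.log(JSON.stringify(cleaned));
```

Comparing the string before and after each pass avoids depending on the `lastIndex` state that `test()` updates on a regex with the `g` flag.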

User · Answer

Attempting to remove HTML markup using a regular expression is problematic. You don't know what's in there as script or attribute values. One way is to insert it as the innerHTML of a div, remove any script elements and return the innerHTML, e.g.

    function stripScripts(s) {
        var div = document.createElement('div');
        div.innerHTML = s;
        var scripts = div.getElementsByTagName('script');
        var i = scripts.length;
        while (i--) {
            scripts[i].parentNode.removeChild(scripts[i]);
        }
        return div.innerHTML;
    }

    alert(
        stripScripts('<span><script type="text/javascript">alert(\'foo\');<\/script><\/span>')
    );

Note that at present, browsers will not execute the script if inserted using the innerHTML property, and likely never will, especially as the element is not added to the document.
