Check if a string is html or not

Question

I have a string and I want to check whether it is HTML or not. I am using a regex for this but not getting the proper result.

I validated my regex and it works fine here.

    var htmlRegex = new RegExp("<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>");
    return htmlRegex.test(testString);

Here's the fiddle, but the regex isn't running in there: http://jsfiddle.net/wFWtc/

On my machine the code runs fine, but I get false instead of true as the result. What am I missing here?

User · Answer

Method #1. Here is a simple function to test whether the string contains HTML data:

    function isHTML(str) {
      var a = document.createElement('div');
      a.innerHTML = str;

      for (var c = a.childNodes, i = c.length; i--; ) {
        if (c[i].nodeType == 1) return true;
      }

      return false;
    }

The idea is to let the browser's DOM parser decide whether the provided string looks like HTML or not. As you can see, it simply checks for an ELEMENT_NODE (nodeType of 1).

I made a couple of tests and it looks like it works:

    isHTML('<a>this is a string</a>') // true
    isHTML('this is a string')        // false
    isHTML('this is a <b>string</b>') // true

This solution will properly detect an HTML string, but it has the side effect that tags such as img and video will start downloading their resources once parsed through innerHTML.

Method #2. Another method uses DOMParser and doesn't have the resource-loading side effect:

    function isHTML(str) {
      var doc = new DOMParser().parseFromString(str, "text/html");
      return Array.from(doc.body.childNodes).some(node => node.nodeType === 1);
    }

Notes:
1. Array.from is an ES2015 method; it can be replaced with [].slice.call(doc.body.childNodes).
2. The arrow function in the some call can be replaced with a regular anonymous function.

User · Answer

Using jQuery, the simplest form would be:

    if ($(testString).length > 0)

If $(testString).length is 1, there is one HTML tag inside testString.

User · Answer

A little bit of validation with:

    /<(?=.*? .*?\/ ?>|br|hr|input|!--|wbr)[a-z]+.*?>|<([a-z]+).*?<\/\1>/i.test(htmlStringHere)

This searches for empty tags (some predefined) and /-terminated XHTML empty tags and validates the string as HTML because of the empty tag, OR it captures the tag name and attempts to find its closing tag somewhere in the string to validate it as HTML.

Explained demo: http://regex101.com/r/cX0eP2

Update:

Complete validation with:

    /<(br|basefont|hr|input|source|frame|param|area|meta|!--|col|link|option|base|img|wbr|!DOCTYPE).*?>|<(a|abbr|acronym|address|applet|article|aside|audio|b|bdi|bdo|big|blockquote|body|button|canvas|caption|center|cite|code|colgroup|command|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frameset|head|header|hgroup|h1|h2|h3|h4|h5|h6|html|i|iframe|ins|kbd|keygen|label|legend|li|map|mark|menu|meter|nav|noframes|noscript|object|ol|optgroup|output|p|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video).*?<\/\2>/i.test(htmlStringHere)

This does proper validation as it contains ALL HTML tags: empty ones first, followed by the rest, which need a closing tag.

Explained demo here: http://regex101.com/r/pE1mT5

User · Answer

All of the answers here are over-inclusive: they just look for < followed by >. There is no perfect way to detect whether a string is HTML, but you can do better.

Below we look for end tags, which is much tighter and more accurate:

    import re
    re_is_html = re.compile(r"(?:</[^<]+>)|(?:<[^<]+/>)")

And here it is in action:

    # Correctly identified as not HTML:
    print(re_is_html.search("Hello, World"))
    print(re_is_html.search("This is less than <, this is greater than >."))
    print(re_is_html.search(" a < 3 && b > 3"))
    print(re_is_html.search("<<Important Text>>"))
    print(re_is_html.search("<a>"))

    # Correctly identified as HTML:
    print(re_is_html.search("<a>Foo</a>"))
    print(re_is_html.search("<input type='submit' value='Ok' />"))
    print(re_is_html.search("<br/>"))

    # We don't handle, but could with more tweaking:
    print(re_is_html.search("<br>"))
    print(re_is_html.search("Foo &amp; bar"))
    print(re_is_html.search("<input type='submit' value='Ok'>"))
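Since the question is about JavaScript, the same end-tag idea can be sketched there as well. This is a direct port of the pattern from this answer, not something the answer itself provides; the name reIsHtml is only illustrative.

```javascript
// Matches either a closing tag (</a>) or a self-closing tag (<br/>).
// Ported from the Python pattern in the answer above; the name is hypothetical.
const reIsHtml = /(?:<\/[^<]+>)|(?:<[^<]+\/>)/;

// Not flagged as HTML: no closing or self-closing tag is present.
console.log(reIsHtml.test("Hello, World"));
console.log(reIsHtml.test(" a < 3 && b > 3"));
console.log(reIsHtml.test("<a>"));

// Flagged as HTML.
console.log(reIsHtml.test("<a>Foo</a>"));
console.log(reIsHtml.test("<br/>"));
```

As in the Python version, an unclosed tag such as "<br>" is not detected by this pattern.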

User · Answer

zzzzBov's answer is good, but it does not account for stray closing tags, for example:

    /<[a-z][\s\S]*>/i.test('foo </b> bar'); // false

A version that also catches closing tags could be this:

    /<[a-z/][\s\S]*>/i.test('foo </b> bar'); // true

User · Answer

    /<\/?[^>]*>/.test(str)

This only detects whether the string contains HTML tags; the string may just as well be XML.

User · Answer

Since the original request does not say the solution has to be a RegExp, just that an attempt to use a RegExp was being made, I will offer this up. It says something is HTML if a single child element can be parsed. Note: this will return false if the body contains only comments, CDATA, or server directives.

    const isHTML = (text) => {
      try {
        const fragment = new DOMParser().parseFromString(text, "text/html");
        return fragment.body.children.length > 0;
      } catch (error) { ; }
      return false;
    }

User · Answer

There is an NPM package, is-html, that attempts to solve this: https://github.com/sindresorhus/is-html

User · Answer

There are fancy solutions that use the browser itself to attempt to parse the text and check whether any DOM nodes were constructed, which will be slow. Or regular expressions, which will be faster but potentially inaccurate. There are also two very distinct questions arising from this problem:

Q1: Does a string contain HTML fragments?

Is the string part of an HTML document, containing HTML element markup or encoded entities? This can be used as an indicator that the string may require bleaching / sanitization or entity decoding:

    /<\/?[a-z][^>]*>|(&(?:[\w\d]+|#\d+|#x[a-f\d]+);)/i

You can see this pattern in use against all of the examples from all existing answers at the time of this writing, plus some rather hideous WYSIWYG- or Word-generated sample text and a variety of character entity references.

Q2: Is the string an HTML document?

The HTML specification is shockingly loose as to what it considers an HTML document. Browsers go to extreme lengths to parse almost any garbage text as HTML. Two approaches: either just consider everything HTML (since, if delivered with a text/html Content-Type, great effort will be expended by the user-agent to try to interpret it as HTML) or look for the prefix marker:

    <!DOCTYPE html>

In terms of "well-formedness", that, and almost nothing else, is "required". The following is a 100% complete, fully valid HTML document containing every HTML element you think is being omitted:

    <!DOCTYPE html>
    <title>Yes, really.</title>
    <p>This is everything you need.

Yup. There are explicit rules on how to form "missing" elements such as <html>, <head>, and <body>. Though I find it rather amusing that SO's syntax highlighting failed to detect that properly without an explicit hint.
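The two questions above can be kept apart in code. The following is a minimal sketch, assuming the fragment pattern reads as reconstructed here; the function names are hypothetical, and tolerating leading whitespace before the doctype is an added assumption rather than something this answer specifies.

```javascript
// Q1: tag-like markup (<p>, </b>) or a character entity reference (&amp;, &#38;, &#x26;).
const HTML_FRAGMENT = /<\/?[a-z][^>]*>|(&(?:[\w\d]+|#\d+|#x[a-f\d]+);)/i;

// Hypothetical helper: true when the text contains markup or entities.
function containsHtmlFragment(text) {
  return HTML_FRAGMENT.test(text);
}

// Q2: hypothetical helper that looks only for the doctype prefix marker.
// Allowing leading whitespace is an assumption made for this sketch.
function startsWithDoctype(text) {
  return /^\s*<!DOCTYPE html>/i.test(text);
}

const minimalDocument =
  "<!DOCTYPE html>\n<title>Yes, really.</title>\n<p>This is everything you need.";

console.log(containsHtmlFragment("Foo &amp; bar"));   // entity reference
console.log(containsHtmlFragment(" a < 3 && b > 3")); // plain comparison text
console.log(startsWithDoctype(minimalDocument));
console.log(startsWithDoctype("<p>just a fragment</p>"));
```

Keeping the two checks separate lets a caller decide whether it cares about "needs sanitizing" or "is a whole document", which are different decisions.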

User · Answer

With jQuery:

    function isHTML(str) {
      return /^<.*?>$/.test(str) && !!$(str)[0];
    }

User · Answer

My solution is:

    const element = document.querySelector('.test_element');

    const setHtml = elem => {
        let getElemContent = elem.innerHTML;

        // Clean up whitespace in the element.
        // If you don't want to remove whitespace, you can skip this line.
        let newHtml = getElemContent.replace(/[\n\t ]+/g, " ");

        // Regex to check for HTML
        let checkHtml = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/.test(getElemContent);

        // Check whether it is HTML or not
        if (checkHtml) {
            console.log('This is an HTML');
            console.log(newHtml.trim());
        }
        else {
            console.log('This is a TEXT');
            console.log(elem.innerText.trim());
        }
    }

    setHtml(element);

User · Answer

If you're creating a regex from a string literal, you need to escape any backslashes:

    var htmlRegex = new RegExp("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>");
    // extra backslash added here ---------------------^ and here -----^

This is not necessary if you use a regex literal, but then you need to escape forward slashes:

    var htmlRegex = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/;
    // forward slash escaped here ------------------------^

Also, your jsfiddle didn't work because you assigned an onload handler inside another onload handler: the default, as set in the Frameworks & Extensions panel on the left, is to wrap the JS in an onload. Change that to a nowrap option and fix the string literal escaping and it "works" (within the constraints everybody has pointed out in comments): http://jsfiddle.net/wFWtc/4/

Note that JavaScript regular expressions do support back-references, so the </\1> part of your expression works in JS once the backslash survives the string literal.
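The effect of the missing escape can be seen in isolation. In this sketch the variable names are illustrative, and only "\b" is left unescaped in the third pattern, so that the example shows what the string parser does to it without depending on how "\1" is handled.

```javascript
// The pattern from the question, written three ways.
const fromLiteral = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/;
const fromEscapedString = new RegExp("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>");
// "\b" inside a string literal becomes a backspace character (U+0008),
// so this regex requires a literal backspace after the tag name.
const fromUnescapedString = new RegExp("<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\\1>");

const sample = "<a>this is a string</a>";

console.log("\b" === "\u0008");                // true: the string parser consumed the escape
console.log(fromLiteral.test(sample));         // true
console.log(fromEscapedString.test(sample));   // true
console.log(fromUnescapedString.test(sample)); // false
```

The third result is the behaviour reported in the question: the code runs without error, but the test returns false.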

User · Answer

Here's a sloppy one-liner that I use from time to time:

    var isHTML = RegExp.prototype.test.bind(/(<([^>]+)>)/i);

It will basically return true for strings containing a < followed by ANYTHING followed by >.

By ANYTHING, I mean basically anything except an empty string.

It's not great, but it's a one-liner.

Usage:

    isHTML('Testing');               // false
    isHTML('<p>Testing</p>');        // true
    isHTML('<img src="hello.jpg">'); // true
    isHTML('My < weird > string');   // true (caution!!!)
    isHTML('<>');                    // false

As you can see, it's far from perfect, but it might do the job for you in some cases.

User · Answer

A better regex to use to check if a string is HTML is:

    /^/

For example:

    /^/.test('');                  // true
    /^/.test('foo bar baz');       // true
    /^/.test('<p>fizz buzz</p>');  // true

In fact, it's so good that it'll return true for every string passed to it, which is because every string is HTML. Seriously, even if it's poorly formatted or invalid, it's still HTML.

If what you're looking for is the presence of HTML elements, rather than simply any text content, you could use something along the lines of:

    /<\/?[a-z][\s\S]*>/i.test()

It won't help you parse the HTML in any way, but it will certainly flag the string as containing HTML elements.
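A short sketch of how the element-presence pattern behaves on a few inputs, including one where it is fooled. The wrapper name hasHtmlElement is hypothetical and the sample strings are chosen for illustration.

```javascript
// Hypothetical wrapper around the element-presence pattern from this answer.
const hasHtmlElement = (text) => /<\/?[a-z][\s\S]*>/i.test(text);

console.log(hasHtmlElement("<p>fizz buzz</p>"));    // true
console.log(hasHtmlElement("foo </b> bar"));        // true: a stray closing tag counts
console.log(hasHtmlElement("My < weird > string")); // false: "<" is followed by a space
console.log(hasHtmlElement("if x<y then z>w"));     // true: a false positive
```

The last case shows the limit of the approach: any "<" directly followed by a letter, with a ">" somewhere later, is flagged.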

User · Answer

Here's a regex-less approach I used for my own project.

If you are trying to detect HTML strings among other non-HTML strings, you can parse the string with an HTML parser, extract the text again, and see whether the result differs from the original. I.e.:

    from bs4 import BeautifulSoup

    def isHTML(string):
        string1 = string[:]
        soup = BeautifulSoup(string, 'html.parser')  # Can use other HTML parser like etree
        string2 = soup.text

        # If parsing stripped anything (tags, entities), the text has changed.
        return string1 != string2

It worked on my sample of 2800 strings.
