Extract part of a regex match

Question

I want a regular expression to extract the title from an HTML page. Currently I have this:

    title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
    if title:
        title = title.replace('<title>', '').replace('</title>', '')

Is there a regular expression to extract just the contents of <title>, so I don't have to remove the tags?

User · Answer

I'd think this should suffice:

    import re
    pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE | re.IGNORECASE)
    pattern.search(text)

... assuming that your text (HTML) is in a variable named "text".

This also assumes that no other HTML tags can legally be embedded inside an HTML TITLE tag, and that there is no way to legally embed any other < character within such a container/block.

However ...

Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra work when various HTML, SGML and XML parsers are already in the standard libraries.)

If you're handling "real world" tag soup HTML (which frequently does not conform to any SGML/XML validator), then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.

Another option is lxml, which is written for properly structured (standards-conformant) HTML. But it has an option to fall back to using BeautifulSoup as a parser: ElementSoup.
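As a rough sketch of the parser route using only the standard library, the snippet below subclasses `html.parser.HTMLParser` and collects the text inside the first <title> element. The names `TitleParser` and `extract_title` are made up for this illustration; this is one possible approach under those assumptions, not necessarily what the answer had in mind.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> is handled too.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # None when no complete <title>...</title> pair was seen.
        return "".join(self._parts) if self._done else None


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(extract_title("<html><head><TITLE>Hello World</TITLE></head></html>"))
```

Returning None for a missing title mirrors what `re.search` does when nothing matches, so the calling code can keep the same `if` check.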

User · Answer

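May I recommend Beautiful Soup? Soup is a very good library for parsing your whole HTML document.

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')
    titleName = soup.title.string  # .name would only give the tag name, 'title'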

User · Answer

Note that starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), it's possible to improve a bit on Krzysztof Krason's solution by capturing the match result directly within the if condition as a variable and reusing it in the condition's body:

    pattern = '<title>(.*)</title>'
    text = '<title>hello</title>'
    if match := re.search(pattern, text, re.IGNORECASE):
        title = match.group(1)
    # hello

User · Answer

The provided pieces of code do not cope with exceptions. May I suggest:

    getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), "groups", lambda: [u""])()[0]

This returns the first captured group if the pattern is found, and an empty string by default if it is not.
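The same "default instead of an exception" behaviour can be spelled out more explicitly. The sketch below wraps it in a small function; the name `first_group` and its `default` parameter are hypothetical, chosen only for this example.

```python
import re


def first_group(pattern, text, default="", flags=0):
    """Return capture group 1 of the first match, or `default` if no match."""
    match = re.search(pattern, text, flags)
    # re.search returns None when nothing matches, so check before .group().
    return match.group(1) if match else default


title = first_group(r"<title>(.*)</title>", "<title>hello</title>", flags=re.IGNORECASE)
missing = first_group(r"<title>(.*)</title>", "<p>no title here</p>", flags=re.IGNORECASE)

print(repr(title))
print(repr(missing))
```

This trades the one-liner for a named helper, which may be easier to read when the lookup is repeated in several places.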

User · Answer

Try using capturing groups:

    title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

User · Answer

Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search will return None if it doesn't find the result, so don't use group() directly):

    title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

    if title_search:
        title = title_search.group(1)

User · Answer

Try:

    title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

User · Answer

    re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

User · Answer

The currently top-voted answer by Krzysztof Krason fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title>, which is valid HTML (see "White space inside XML/HTML tags").

I therefore propose the following improvement:

    import re

    def search_title(html):
        m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
        return m.group(1) if m else None

Test cases:

    print(search_title("<title   >with spaces in tags</title >"))
    print(search_title("<title\n>with newline in tags</title\n>"))
    print(search_title("<title>first of two titles</title><title>second title</title>"))
    print(search_title("<title>with newline\n in title</title\n>"))

Output:

    with spaces in tags
    with newline in tags
    first of two titles
    with newline
     in title

Ultimately, I go along with others recommending an HTML parser - not only, but also to handle non-standard use of HTML tags.
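To make the first failure concrete, here is a small comparison of the greedy group (.*) with the non-greedy (.*?) on input containing two title elements. The variable names are illustrative only.

```python
import re

html = "<title>a</title><title>b</title>"

# Greedy: .* runs to the last </title>, swallowing the tags in between.
greedy = re.search(r"<title>(.*)</title>", html, re.IGNORECASE).group(1)

# Non-greedy: .*? stops at the first </title>.
lazy = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE).group(1)

print(repr(greedy))  # 'a</title><title>b'
print(repr(lazy))    # 'a'
```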
