Regular expression for finding the 'href' value of an <a> link

Question

I need a regex pattern for finding web page links in HTML.

I first use @"(<a.*?>.*?</a>)" to extract links (<a>), but I can't fetch the href from that.

My strings are:

1. <a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
2. <a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
3. <a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
4. <a href="www.example.com/page.php/404" ....></a>

1, 2 and 3 are valid and I need them, but number 4 is not valid for me (? and = are essential).

Thanks everyone, but I don't need to parse <a>. I have a list of links in href="abcdef" format. I need to fetch the href of each link and filter it: the URLs I want must contain ? and =, like page.php?id=5.

Thanks!

User · Answer

Try this regex:

    href\s*=\s*(?:["'](?<1>[^"']*)["']|(?<1>\S+))

You will get more help from the discussions in "Regular expression to extract URL from an HTML link" and "Regex to get the link in href [asp.net]".

Hope it's helpful.
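To show how a pattern of this shape might be applied from C#, here is a minimal sketch that collects every href value in a string. The names `HrefExtractor` and `Extract` are invented for this illustration, and the numbered group is written as a named group `url` (an assumption made for readability; .NET allows the same group name in both alternatives).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Matches href="...", href='...' or an unquoted href=value.
    // Either alternative stores the value in the group named "url".
    private static readonly Regex HrefRx = new Regex(
        @"href\s*=\s*(?:[""'](?<url>[^""']*)[""']|(?<url>\S+))",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns all href values found in the input.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in HrefRx.Matches(html))
        {
            result.Add(m.Groups["url"].Value);
        }
        return result;
    }

    public static void Main()
    {
        string html =
            "<a href=\"www.example.com/page.php?id=1&name=a\">x</a>" +
            "<a href='http://b.example/'>y</a>";

        foreach (string url in Extract(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

The unquoted branch (`\S+`) stops only at whitespace, so for markup like `href=page.php>` it would also capture the closing `>`; that is a limitation of the pattern, not of the sketch.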

User · Answer

Thanks everyone (especially @plalx):

"I find it quite overkill to enforce the validity of the href attribute with such a complex and cryptic pattern, while a simple expression such as

    <a\s+(?:[^>]*?\s+)?href="([^"]*)"

would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use

    <a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)""

My final regex string:

First, use one of these:

st     quot   www   https  ftp gopher telnet file notes ms-help                  w d              -      amp     quot
st     quot  lt a href   gt    gt       lt  a gt  quot
st     quot     A-Za-z  3 9                -   amp        w      A-Za-z0-9 -      www   -   amp        w     A-Za-z0-9 -                  w-            -    amp      w           w       quot
st     quot        https  ftp gopher telnet file notes ms-help                www     www     w d                -     amp     quot
st     quot       https  ftp gopher telnet file notes ms-help                www     www    quot
st     quot    https  ftp gopher telnet file notes ms-help                   www     w d              -      amp     quot
st     quot href   quot  quot      lt url gt  http https             com org net gov          quot  quot    quot
st = @"(<a.*?>.*?</a>)";
st     quot    hrefs       s quot  quot          mailto location  javascript   css   this            s gt  quot  quot     quot
st     quot http       w       w      a-zA-Z0-9                        amp amp             -                                    quot
st     quot http s        w-        w-      w-      amp       quot
st     quot  http https      a-zA-Z0-9                        amp amp             -                                    quot
st     quot   http ftp https        w -        w -        w -         amp amp           w -       amp amp           quot
st     quot http       w       w      a-zA-Z0-9                        amp amp             -                                    quot
st     quot http s         0-9a-zA-Z   -  w   0-9a-zA-Z      0-9           a-zA-Z0-9 -               amp amp            quot
st = @"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";

My choice is:

    @"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"

Second, use this:

    st = @"(.*)\?(.*)=(.*)";

Problem solved. Thanks, everyone!
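The two-step idea in this answer (first capture the href, then keep only values that contain a query string) could be sketched as follows. This is an illustration under assumptions, not the poster's exact code: `QueryLinkFilter` and `Find` are made-up names, step 1 uses the simple double-quoted href pattern quoted above, and step 2 uses a small hypothetical pattern that requires a `?` followed later by `=`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QueryLinkFilter
{
    // Step 1: capture the value of a double-quoted href inside an <a> tag.
    private static readonly Regex HrefRx = new Regex(
        @"<a\s+(?:[^>]*?\s+)?href=""([^""]*)""",
        RegexOptions.IgnoreCase);

    // Step 2 (assumed filter): a literal '?' followed, somewhere later, by '='.
    private static readonly Regex QueryRx = new Regex(@"\?[^=]*=");

    // Hypothetical helper: returns only hrefs that carry a query string.
    public static string[] Find(string html)
    {
        return HrefRx.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .Where(url => QueryRx.IsMatch(url))
            .ToArray();
    }

    public static void Main()
    {
        string html =
            "<a href=\"www.example.com/page.php?id=1&name=a\"></a>" +
            "<a href=\"https://www.example.com/page.php?id=2\"></a>" +
            "<a href=\"www.example.com/page.php/404\"></a>";

        // The third link has no '?' and '=' and is filtered out.
        foreach (string url in Find(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Note that the `?` has to be escaped (`\?`) to match a literal question mark; unescaped, it acts as a quantifier on whatever precedes it.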

User · Answer

    HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
    public IHTMLAnchorElement imageElementHref;
    imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;

Simply try this code.

User · Answer

I'd recommend using an HTML parser over a regex, but still, here's a regex that will create a capturing group over the value of the href attribute of each link. It will match whether double or single quotes are used.

    <a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1

Snippet playground:

    const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
    const textToMatchInput = document.querySelector('[name=textToMatch]');

    document.querySelector('button').addEventListener('click', () => {
      console.log(textToMatchInput.value.match(linkRx));
    });

    <label>
      Text to match:
      <input type="text" name="textToMatch" value='<a href="google.com"'>

      <button>Match</button>
    </label>
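Since the question is about C#, here is a rough sketch of how the same backreference pattern might be used with .NET's `Regex` class. The names `QuotedHref` and `FirstHref` are invented for this example. Group 1 holds the opening quote, `\1` requires the same quote character to close the value, and group 2 holds the href itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedHref
{
    // Group 1: opening quote (" or '). \1: the same quote again. Group 2: the value.
    private static readonly Regex LinkRx = new Regex(
        @"<a\s+(?:[^>]*?\s+)?href=([""'])(.*?)\1");

    // Hypothetical helper: returns the first href value, or null if none is found.
    public static string FirstHref(string html)
    {
        Match m = LinkRx.Match(html);
        return m.Success ? m.Groups[2].Value : null;
    }

    public static void Main()
    {
        // A single-quoted value may contain a double quote, and vice versa.
        Console.WriteLine(FirstHref("<a class=\"x\" href='it\"s.html'>link</a>"));
    }
}
```

Because the closing quote must equal the opening one, a value such as `it"s.html` inside single quotes is captured whole instead of being cut at the embedded double quote.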

User · Answer

Using regex to parse HTML is not recommended. Regular expressions are meant for regularly occurring patterns, and HTML (except XHTML) is not regular in its format. For example, HTML files are valid even if a closing tag is missing, and that could break your code.

Use an HTML parser like HtmlAgilityPack.

You can use this code to retrieve all hrefs in anchor tags using HtmlAgilityPack:

    HtmlDocument doc = new HtmlDocument();
    doc.Load(yourStream);

    var hrefList = doc.DocumentNode.SelectNodes("//a")
                      .Select(p => p.GetAttributeValue("href", "not found"))
                      .ToList();

hrefList contains all the hrefs.

User · Answer

I came up with this one, which supports anchor and image tags, and supports single and double quotes.

    <(?:a|img)\s+(?:[^>]*?\s+)?(?:src|href)=["']([^"']*)["']

So

    <a href="/something.ext">click here</a>

will match:

    Match 1: /something.ext

And

    <a href='/something.ext'>click here</a>

will match:

    Match 1: /something.ext

The same goes for img src attributes.
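A possible C# usage of a pattern covering both tag types is sketched below. The class and method names (`LinkAndImageUrls`, `Extract`) are assumptions for illustration, and the tag and attribute names are written as alternations (`(?:a|img)`, `(?:href|src)`) so that they match whole words.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkAndImageUrls
{
    // Captures the href/src value of <a> and <img> tags, in single or double quotes.
    private static readonly Regex UrlRx = new Regex(
        @"<(?:a|img)\s+(?:[^>]*?\s+)?(?:href|src)=[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every captured URL in document order.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in UrlRx.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }

    public static void Main()
    {
        string html =
            "<a href=\"/something.ext\">click here</a>" +
            "<img alt='logo' src='/logo.png'>";

        foreach (string url in Extract(html))
        {
            Console.WriteLine(url);
        }
    }
}
```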

User · Answer

Try this:

    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            var res = Find(html);
        }

        public static List<LinkItem> Find(string file)
        {
            List<LinkItem> list = new List<LinkItem>();

            // 1.
            // Find all matches in file.
            MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
                RegexOptions.Singleline);

            // 2.
            // Loop over each match.
            foreach (Match m in m1)
            {
                string value = m.Groups[1].Value;
                LinkItem i = new LinkItem();

                // 3.
                // Get href attribute.
                Match m2 = Regex.Match(value, @"href=""(.*?)""",
                    RegexOptions.Singleline);
                if (m2.Success)
                {
                    i.Href = m2.Groups[1].Value;
                }

                // 4.
                // Remove inner tags from text.
                string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                    RegexOptions.Singleline);
                i.Text = t;

                list.Add(i);
            }
            return list;
        }

        public struct LinkItem
        {
            public string Href;
            public string Text;

            public override string ToString()
            {
                return Href + "\n\t" + Text;
            }
        }
    }

Input:

    string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> ";

Result:

    [0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
    [1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}

C# Scraping HTML Links:

"Scraping HTML extracts important page elements. It has many legal uses for webmasters and ASP.NET developers. With the Regex type and WebClient, we implement screen scraping for HTML."

Edited:

Another easy way: you can use a web browser control to get the href from an <a> tag, like this (see my example):

    public Form1()
    {
        InitializeComponent();
        webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
    }

    private void Form1_Load(object sender, EventArgs e)
    {
        webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
    }

    void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        List<string> href = new List<string>();
        foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
        {
            href.Add(el.GetAttribute("href
