Regular expression to match URLs in Java

Question

I use RegexBuddy while working with regular expressions. From its library I copied the regular expression to match URLs. I tested it successfully within RegexBuddy. However, when I copied it as Java String flavor and pasted it into Java code, it does not work. The following class prints false:

    public class RegexFoo {

        public static void main(String[] args) {
            String regex = "\\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]";
            String text = "http://google.com";
            System.out.println(IsMatch(text, regex));
        }

        private static boolean IsMatch(String s, String pattern) {
            try {
                Pattern patt = Pattern.compile(pattern);
                Matcher matcher = patt.matcher(s);
                return matcher.matches();
            } catch (RuntimeException e) {
                return false;
            }
        }
    }

Does anyone know what I am doing wrong?

User · Answer

The problem with all suggested approaches: all RegEx is validating

All RegEx-based code is over-engineered: it will find only valid URLs! For example, it will ignore anything starting with "http://" that has non-ASCII characters inside.

Even more: I have encountered processing times of 1-2 seconds (single-threaded, dedicated machine) with the Java RegEx package (filtering Email addresses from text) for very small and simple sentences, nothing specific; possibly a bug in Java 6 RegEx...

Simplest/Fastest solution would be to use StringTokenizer to split text into tokens, to remove tokens starting with "http://" etc., and to concatenate tokens into text again.

If you want to filter Emails from text (because later on you will do NLP stuff etc.), just remove all tokens containing "@".
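The token-based filtering described above could look roughly like the following. This is a minimal sketch, not a tested production filter: the class name TokenFilter, the method stripLinksAndEmails, and the exact prefix list (including "https://") are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.StringJoiner;
import java.util.StringTokenizer;

final class TokenFilter {
    // Assumed prefix list, based on the prefixes mentioned in this answer plus "https://".
    private static final String[] LINK_PREFIXES = {"http://", "https://", "ftp://", "mailto:"};

    // Drops every whitespace-separated token that starts with a link prefix or contains "@".
    static String stripLinksAndEmails(String text) {
        StringTokenizer tokens = new StringTokenizer(text); // default delimiters are whitespace
        StringJoiner kept = new StringJoiner(" ");
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (!looksLikeLink(token) && !token.contains("@")) {
                kept.add(token);
            }
        }
        return kept.toString();
    }

    private static boolean looksLikeLink(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        for (String prefix : LINK_PREFIXES) {
            if (lower.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // prints: Mail or see today
        System.out.println(stripLinksAndEmails("Mail me@example.com or see http://google.com today"));
    }
}
```

Note that rejoining the tokens with single spaces normalizes the original whitespace, which is usually acceptable before NLP processing but is a trade-off of this approach.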

Here is a simple text on which the Java 6 RegEx fails. Try it in different variants of Java. It takes about 1000 milliseconds per RegEx call in a long-running, single-threaded test application:

pattern = Pattern.compile("[A-Za-z0-9](([_\\.\\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\\.\\-]?[a-zA-Z0-9]+)*)\\.([A-Za-z]{2,})", Pattern.CASE_INSENSITIVE);

"Avalanna is such a sweet little girl! It would b heartbreaking if cancer won. She's so precious! #BeliebersPrayForAvalanna");
"@AndySamuels31 Hahahahahahahahahhaha lol, you don't look like a girl hahahahhaahaha, you are... sexy.";

Do not rely on regular expressions if you only need to filter words with "@", "http://", "ftp://", "mailto:"; it is huge engineering overhead.

If you really want to use RegEx with Java, try Automaton

User · Answer

I'll try a standard "Why are you doing it this way?" answer... Do you know about java.net.URL?

    URL url = new URL(stringURL);

The above will throw a MalformedURLException if it can't parse the URL.
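As a sketch of how that check might be wrapped, the helper below returns a boolean instead of propagating the exception. The class name UrlCheck and method name isParsableUrl are hypothetical. The URL(String) constructor is deprecated in recent JDKs (20 and later) but still available, hence the suppressed warning.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlCheck {
    // Returns true when java.net.URL accepts the string, false when it throws.
    @SuppressWarnings("deprecation") // URL(String) is deprecated on newer JDKs
    static boolean isParsableUrl(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isParsableUrl("http://google.com")); // true
        System.out.println(isParsableUrl("google.com"));        // false: no protocol
    }
}
```

Keep in mind that this only tells you whether the string parses with a protocol known to the JVM; it is a looser test than a strict validator.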

User · Answer

This works too:

    String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

Note:

    String regex = "<\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // matches <http://google.com>

    String regex = "<^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // does not match <http://google.com>

So probably the first one is more useful for general use.

User · Answer

When using regular expressions from RegexBuddy's library, make sure to use the same matching modes in your own code as the regex from the library. If you generate a source code snippet on the Use tab, RegexBuddy will automatically set the correct matching options in the source code snippet. If you copy/paste the regex, you have to do that yourself. In this case, as others pointed out, you missed the case insensitivity option.
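To illustrate the missing option, here is a minimal sketch that compiles the library pattern with and without Pattern.CASE_INSENSITIVE. It assumes the pattern from the question is the uppercase-only one shown in the code; the class name CaseFlagDemo and its helper are hypothetical.

```java
import java.util.regex.Pattern;

final class CaseFlagDemo {
    // Assumed to be the library pattern from the question: character classes list only A-Z.
    static final String LIBRARY_REGEX =
        "\\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]";

    // flags is passed straight to Pattern.compile, e.g. 0 or Pattern.CASE_INSENSITIVE.
    static boolean isMatch(String text, int flags) {
        return Pattern.compile(LIBRARY_REGEX, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "http://google.com";
        System.out.println(isMatch(text, 0));                        // false
        System.out.println(isMatch(text, Pattern.CASE_INSENSITIVE)); // true
    }
}
```

Without the flag, the lowercase "g" of "google" is not in any character class, so the match fails right after "://".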

User · Answer

In line with billjamesdev's answer, here is another approach to validate a URL without using a RegEx. From the Apache Commons Validator lib, look at class UrlValidator. Some example code:

Construct a UrlValidator with valid schemes of "http" and "https".

    String[] schemes = {"http", "https"};
    UrlValidator urlValidator = new UrlValidator(schemes);
    if (urlValidator.isValid("ftp://foo.bar.com/")) {
        System.out.println("url is valid");
    } else {
        System.out.println("url is invalid");
    }

prints "url is invalid"

If instead the default constructor is used:

    UrlValidator urlValidator = new UrlValidator();
    if (urlValidator.isValid("ftp://foo.bar.com/")) {
        System.out.println("url is valid");
    } else {
        System.out.println("url is invalid");
    }

prints out "url is valid"

User · Answer

Here is a proposal for a URL parser regex that recognizes:

    Protocol
    Host
    Port
    Path (document/folder)
    Get parameters

gt    lt protocol gt    alpha       gt      alpha                  lt host gt    gt    alnum     -         gt      lt port gt    digit          lt path gt      gt    alnum     -            gt      lt request gt    gt    alnum        alnum        gt   amp    gt    alnum        alnum

This regex is able to parse a URL such as:

    jdbc:hsqldb:hsql://localhost:91/index

There can be many ways to engineer a URL regex, so the one I propose can be lightly adapted to match more accurate URL grammars. It can be tested on the following page: https://regex101.com/r/Dy7HE0/5

Be aware that languages' native APIs for regex (such as java.util.regex) don't support smart character classes such as [[:alnum:]] and [[:alpha:]]. Use \w and \d instead.
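A rough Java adaptation of the same idea is sketched below. It is not the regex proposed above: the pattern, the class name UrlParts, and the helper part are assumptions written for illustration. It uses named groups with the same names and java.util.regex's own POSIX-style classes \p{Alpha} and \p{Alnum} in place of the bracket forms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlParts {
    // Hypothetical pattern: protocol, host, port, path, and request as named groups.
    static final Pattern URL_PARTS = Pattern.compile(
          "^(?:(?<protocol>\\p{Alpha}+(?::\\p{Alpha}+)*)://)?"   // e.g. jdbc:hsqldb:hsql
        + "(?<host>[\\p{Alnum}.-]+)"                              // e.g. localhost
        + "(?::(?<port>\\d+))?"                                   // e.g. 91
        + "(?<path>/[\\p{Alnum}._/-]*)?"                          // e.g. /index
        + "(?:\\?(?<request>\\p{Alnum}+=\\p{Alnum}+(?:&\\p{Alnum}+=\\p{Alnum}+)*))?$");

    // Returns the named part, or null when the URL does not match or the part is absent.
    static String part(String url, String groupName) {
        Matcher m = URL_PARTS.matcher(url);
        return m.matches() ? m.group(groupName) : null;
    }

    public static void main(String[] args) {
        String url = "jdbc:hsqldb:hsql://localhost:91/index";
        for (String name : new String[] {"protocol", "host", "port", "path", "request"}) {
            System.out.println(name + " = " + part(url, name));
        }
    }
}
```

Named groups require Java 7 or later. The character sets accepted for host, path, and parameters here are deliberately narrow and would need widening for real-world URLs.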

User · Answer

The best way to do it now is:

    android.util.Patterns.WEB_URL.matcher(linkUrl).matches();

EDIT: Code of Patterns from https://github.com/android/platform_frameworks_base/blob/master/core/java/android/util/Patterns.java

The file is Copyright (C) 2007 The Android Open Source Project and licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0). It declares the class android.util.Patterns ("Commonly used regular expression patterns") with the following members:

TOP_LEVEL_DOMAIN_STR and TOP_LEVEL_DOMAIN: regular expression (as a String and as a compiled Pattern) to match all IANA top-level domains. List accurate as of 2011/07/18, taken from http://data.iana.org/TLD/tlds-alpha-by-domain.txt. The pattern is auto-generated by frameworks/ex/common/tools/make-iana-tld-pattern.py. Deprecated: due to the recent proliferation of gTLDs, this API is expected to become out-of-date very quickly.

TOP_LEVEL_DOMAIN_STR_FOR_WEB_URL: regular expression to match all IANA top-level domains for WEB_URL, from the same list. Deprecated, see TOP_LEVEL_DOMAIN_STR.

GOOD_IRI_CHAR: good characters for Internationalized Resource Identifiers (IRI). This comprises the most commonly used Unicode characters allowed in IRI as detailed in RFC 3987. Specifically, those two-byte Unicode characters are not included. Its value is "a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF".

IP_ADDRESS: Pattern for dotted-quad IPv4 addresses.

IRI, GOOD_GTLD_CHAR, GTLD, HOST_NAME (private): building blocks for host names. RFC 1035 Section 2.3.4 limits the labels to a maximum of 63 octets.

DOMAIN_NAME: Pattern that matches either HOST_NAME or IP_ADDRESS.

WEB_URL: regular expression pattern to match most part of RFC 3987 Internationalized URLs, aka IRIs. Commonly used Unicode characters are added. It accepts the schemes http, https, Http, Https, rtsp and Rtsp, an optional user and password part, DOMAIN_NAME, an optional port number of 1 to 5 digits, optional query params, and finally a word boundary or end of input. This is to stop foo.sure from matching as foo.su.

EMAIL_ADDRESS: Pattern for e-mail addresses.

PHONE: this pattern is intended for searching for things that look like they might be phone numbers in arbitrary text, not for validating whether something is in fact a phone number. It will miss many things that are legitimate phone numbers. The pattern matches the following: optionally, a + sign followed immediately by one or more digits (spaces, dots, or dashes may follow); optionally, sets of digits in parentheses, separated by spaces, dots, or dashes; a string starting and ending with a digit, containing digits, spaces, dots, and/or dashes.

concatGroups(Matcher matcher): convenience method to take all of the non-null matching groups in a regex Matcher and return them as a concatenated string.

digitsAndPlusOnly(Matcher matcher): convenience method to return only the digits and plus signs in the matching string.

The class has a private constructor: do not create this static utility class.

User · Answer

http  https ftp file        W w  3     a-zA-Z0-9     a-zA-Z

Check here: https://www.freeformatter.com/java-regex-tester.html#ad-output

It sorts out these entries correctly:

google com www google com wwwgooglecom ft  Www google com  ft https   www google com https    https   www  https   google com

User · Answer

Try the following regex string instead. Your test was probably done in a case-sensitive manner. I have added the lowercase alphas as well as a proper string beginning placeholder.

    String regex = "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

This works too:

    String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

Note:

    String regex = "<\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // matches <http://google.com>

    String regex = "<^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // does not match <http://google.com>
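The difference between the ^ and \b variants can be checked with a small sketch like the one below. It assumes the pattern body is (https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]; the class name AnchorDemo and its two helpers are hypothetical.

```java
import java.util.regex.Pattern;

final class AnchorDemo {
    // Assumed pattern body shared by all variants (lowercase letters included).
    static final String BODY =
        "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

    // matches(): the whole input must be consumed by the regex.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    // find(): the regex may match anywhere inside the input.
    static boolean foundInside(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("^" + BODY, "http://google.com"));               // true
        System.out.println(wholeMatch("\\b" + BODY, "http://google.com"));             // true
        System.out.println(wholeMatch("<\\b" + BODY + ">", "<http://google.com>"));    // true
        System.out.println(wholeMatch("<^" + BODY + ">", "<http://google.com>"));      // false
        System.out.println(foundInside("\\b" + BODY, "see http://google.com now"));    // true
        System.out.println(foundInside("^" + BODY, "see http://google.com now"));      // false
    }
}
```

Because ^ only matches at the start of the input (without MULTILINE), the \b variant is the one that still works when the URL is embedded in surrounding text.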
