Does anyone know of a regular expression I could use to find URLs within a string? I've found a lot of regular expressions on Google for determining if an entire string is a URL but I need to be able to search an entire string for URLs. For example, I would like to be able to find www.google.com
and http://yahoo.com
in the following string:
Hello www.google.com World http://yahoo.com
I am not looking for specific URLs in the string. I am looking for ALL of the URLs in the string which is why I need a regular expression.
It is just simple.
Use this pattern: \b((ftp|https?)://)?([\w-\.]+\.(com|net|org|gov|mil|int|edu|info|me)|(\d+\.\d+\.\d+\.\d+))(:\d+)?(\/[\w-\/]*(\?\w*(=\w+)*[&\w-=]*)*(#[\w-]+)*)?
It matches any link contains:
Allowed Protocols: http, https and ftp
Allowed Domains: *.com, *.net, *.org, *.gov, *.mil, *.int, *.edu, *.info and *.me OR IP
Allowed Ports: true
Allowed Parameters: true
Allowed Hashes: true
None of the solutions provided here solved the problems/use-cases I had.
What I have provided here, is the best I have found/made so far. I will update it when I find new edge-cases that it doesn't handle.
\b
#Word cannot begin with special characters
(?<![@.,%&#-])
#Protocols are optional, but take them with us if they are present
(?<protocol>\w{2,10}:\/\/)?
#Domains have to be of a length of 1 chars or greater
((?:\w|\&\#\d{1,5};)[.-]?)+
#The domain ending has to be between 2 to 15 characters
(\.([a-z]{2,15})
#If no domain ending we want a port, only if a protocol is specified
|(?(protocol)(?:\:\d{1,6})|(?!)))
\b
#Word cannot end with @ (made to catch emails)
(?![@])
#We accept any number of slugs, given we have a char after the slash
(\/)?
#If we have endings like ?=fds include the ending
(?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
#The last char cannot be one of these symbols .,?!,- exclude these
(?<![.,?!-])
Matching a URL in a text should not be so complex
(?:(?:(?:ftp|http)[s]*:\/\/|www\.)[^\.]+\.[^ \n]+)
I use this Regex:
/((\w+:\/\/\S+)|(\w+[\.:]\w+\S+))[^\s,\.]/ig
It works fine for many URLs, like: http://google.com, https://dev-site.io:8080/home?val=1&count=100, www.regexr.com, localhost:8080/path, ...
Using the regex provided by @JustinLevene did not have the proper escape sequences on the back-slashes. Updated to now be correct, and added in condition to match the FTP protocol as well: Will match to all urls with or without protocols, and with out without "www."
Code: ^((http|ftp|https):\/\/)?([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?
Example: https://regex101.com/r/uQ9aL4/65
This is a slight improvement on/adjustment to (depending on what you need) Rajeev's answer:
([\w\-_]+(?:(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&:/~\+#]*[A-Z\-\@?^=%&/~\+#]){2,6}?
See here for an example of what it does and does not match.
I got rid of the check for "http" etc as I wanted to catch url's without this. I added slightly to the regex to catch some obfuscated urls (i.e. where user's use [dot] instead of a "."). Finally I replaced "\w" with "A-Z" to and "{2,3}" to reduce false positives like v2.0 and "moo.0dd".
Any improvements on this welcome.
I think this regex pattern handle precisely what you want
/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/
and this is an snippet example to extract Urls:
// The Regular Expression filter
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// The Text you want to filter for urls
$text = "The text you want https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string to filter goes here.";
// Check if there is a url in the text
preg_match_all($reg_exUrl, $text, $url,$matches);
var_dump($matches);
Wrote one up myself:
let regex = /([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#\.]?[\w-]+)*\/?/gm
It works on ALL of the following domains:
https://www.facebook.com
https://app-1.number123.com
http://facebook.com
ftp://facebook.com
http://localhost:3000
localhost:3000/
unitedkingdomurl.co.uk
this.is.a.url.com/its/still=going?wow
shop.facebook.org
app.number123.com
app1.number123.com
app-1.numbEr123.com
app.dashes-dash.com
www.facebook.com
facebook.com
fb.com/hello_123
fb.com/hel-lo
fb.com/hello/goodbye
fb.com/hello/goodbye?okay
fb.com/hello/goodbye?okay=alright
Hello www.google.com World http://yahoo.com
https://www.google.com.tr/admin/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
http://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
ftp://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
drive.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255
shop.facebook.org/derf.html
You can see how it performs here on regex101 and adjust as needed
All of the above answers are not match for Unicode characters in URL, for example: http://google.com?query=d?c+filan+dã+search
For the solution, this one should work:
(ftp:\/\/|www\.|https?:\/\/){1}[a-zA-Z0-9u00a1-\uffff0-]{2,}\.[a-zA-Z0-9u00a1-\uffff0-]{2,}(\S*)
I have utilize c# Uri class and it works, well with IP Address, localhost
public static bool CheckURLIsValid(string url)
{
Uri returnURL;
return (Uri.TryCreate(url, UriKind.Absolute, out returnURL)
&& (returnURL.Scheme == Uri.UriSchemeHttp || returnURL.Scheme == Uri.UriSchemeHttps));
}
I use the logic of finding text between two dots or periods
the regex below works fine with python
(?<=\.)[^}]*(?=\.)
This is the best one.
NSString *urlRegex="(http|ftp|https|www|gopher|telnet|file)(://|.)([\\w_-]+(?:(?:\\.[\\w_-]+)??+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?";
Here a little bit more optimized regexp:
(?:(?:(https?|ftp|file):\/\/|www\.|ftp\.)|([\w\-_]+(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&:\/~\+#]*[A-Z\-\@?^=%&\/~\+#]){2,6}?
Here is test with data: https://regex101.com/r/sFzzpY/6
I found this which covers most sample links, including subdirectory parts.
Regex is:
(?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\(?:[^\s()<>]+\)))?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))?
IMPROVED
Detects Urls like these:
Regex:
/^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/gm
If you have the url pattern, you should be able to search for it in your string. Just make sure that the pattern doesnt have ^
and $
marking beginning and end of the url string. So if P is the pattern for URL, look for matches for P.
If you have to be strict on selecting links, I would go for:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
For more infos, read this:
An Improved Liberal, Accurate Regex Pattern for Matching URLs
Short and simple. I have not tested in javascript code yet but It looks it will work:
((http|ftp|https):\/\/)?(([\w.-]*)\.([\w]*))
(?:vnc|s3|ssh|scp|sftp|ftp|http|https)\:\/\/[\w\.]+(?:\:?\d{0,5})|(?:mailto|)\:[\w\.]+\@[\w\.]+
If you want an explanation of each part, try in regexr[.]com where you will get a great explanation of every character.
This is split by an "|" or "OR" because not all useable URI have "//" so this is where you can create a list of schemes as or conditions that you would be interested in matching.
A probably too simplistic, but working method might be:
[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+
I tested it on Python and as long as the string parsing contains a space before and after and none in the url (which I have never seen before) it should be fine.
Here is an online ide demonstrating it
However here are some benefits of using it:
file:
and localhost
as well as ip addresses#
or -
(see url of this post)I liked Stefan Henze 's solution but it would pick up 34.56. Its too general and I have unparsed html. There are 4 anchors for a url;
www ,
http:\ (and co) ,
. followed by letters and then / ,
or letters . and one of these: https://ftp.isc.org/www/survey/reports/current/bynum.txt .
I used lots of info from this thread. Thank you all.
"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
Above solves just about everything except a string like "eurls:www.google.com,facebook.com,http://test.com/", which it returns as a single string. Tbh idk why I added gopher etc. Proof R code
if(T){
wierdurl<-vector()
wierdurl[1]<-"https://JP??.?.jp/dir1/?? "
wierdurl[2]<-"xn--jp-cd2fp15c.xn--fsq.jp "
wierdurl[3]<-"http://52.221.161.242/2018/11/23/biofourmis-collab"
wierdurl[4]<-"https://12000.org/ "
wierdurl[5]<-" https://vg-1.com/?page_id=1002 "
wierdurl[6]<-"https://3dnews.ru/822878"
wierdurl[7]<-"The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd
The code below catches all urls in text and returns urls in list. "
wierdurl[8]<-"Thelinkofthisquestion:https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Alsotherearesomeurls:www.google.com,facebook.com,http://test.com/method?param=wasd
Thecodebelowcatchesallurlsintextandreturnsurlsinlist. "
wierdurl[9]<-"Thelinkofthisquestion:https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-stringAlsotherearesomeurlsZwww.google.com,facebook.com,http://test.com/method?param=wasdThecodebelowcatchesallurlsintextandreturnsurlsinlist."
wierdurl[10]<-"1facebook.com/1res"
wierdurl[11]<-"1facebook.com/1res/wat.txt"
wierdurl[12]<-"www.e "
wierdurl[13]<-"is this the file.txt i need"
wierdurl[14]<-"xn--jp-cd2fp15c.xn--fsq.jpinspiredby "
wierdurl[15]<-"[xn--jp-cd2fp15c.xn--fsq.jp/inspiredby "
wierdurl[16]<-"xnto--jpto-cd2fp15c.xnto--fsq.jpinspiredby "
wierdurl[17]<-"fsety--fwdvg-gertu56.ffuoiw--ffwsx.3dinspiredby "
wierdurl[18]<-"://3dnews.ru/822878 "
wierdurl[19]<-" http://mywebsite.com/msn.co.uk "
wierdurl[20]<-" 2.0http://www.abe.hip "
wierdurl[21]<-"www.abe.hip"
wierdurl[22]<-"hardware/software/data"
regexstring<-vector()
regexstring[2]<-"(http|ftp|https)://([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
regexstring[3]<-"/(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#\\/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#\\/%=~_|$])/igm"
regexstring[4]<-"[a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]?"
regexstring[5]<-"((http|ftp|https)\\:\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
regexstring[6]<-"((http|ftp|https):\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?"
regexstring[7]<-"(http|ftp|https)(:\\/\\/)([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
regexstring[8]<-"(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#/%=~_|$])"
regexstring[10]<-"((http[s]?|ftp):\\/)?\\/?([^:\\/\\s]+)((\\/\\w+)*\\/)([\\w\\-\\.]+[^#?\\s]+)(.*)?(#[\\w\\-]+)?"
regexstring[12]<-"http[s:/]+[[:alnum:]./]+"
regexstring[9]<-"http[s:/]+[[:alnum:]./]+" #in DLpages 230
regexstring[1]<-"[[:alnum:]-]+?[.][:alnum:]+?(?=[/ :])" #in link_graphs 50
regexstring[13]<-"^(?!mailto:)(?:(?:http|https|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[0-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))|localhost)(?::\\d{2,5})?(?:(/|\\?|#)[^\\s]*)?$"
regexstring[14]<-"(((((http|ftp|https):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]+(?:(?:\\.[\\w_-]+)*))((\\.((org|com|net|edu|gov|mil|int)|(([:alpha:]{2})(?=[, ]))))|([\\/]([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
regexstring[15]<-"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
}
for(i in wierdurl){#c(7,22)
for(c in regexstring[c(15)]) {
print(paste(i,which(regexstring==c)))
print(str_extract_all(i,c))
}
}
This is a simplest one. which work for me fine.
%(http|ftp|https|www)(://|\.)[A-Za-z0-9-_\.]*(\.)[a-z]*%
Guess no regex is perfect for this use. I found a pretty solid one here
/(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])/igm
Some differences / advantages compared to the other ones posted here:
moo.com
without http
or www
See here for examples
I used this
^(https?:\\/\\/([a-zA-z0-9]+)(\\.[a-zA-z0-9]+)(\\.[a-zA-z0-9\\/\\=\\-\\_\\?]+)?)$
text = """The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd, http://test.com/method?param=wasd¶ms2=kjhdkjshd
The code below catches all urls in text and returns urls in list."""
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-&?=%.]+', text)
print(urls)
Output:
[
'https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string',
'www.google.com',
'facebook.com',
'http://test.com/method?param=wasd',
'http://test.com/method?param=wasd¶ms2=kjhdkjshd'
]
I used below regular expression to find url in a string:
/(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/
This is the one I use
(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
Works for me, should work for you too.
Source: Stackoverflow.com