What would be the best regular expression for this scenario?
Given this URL:
http://php.net/manual/en/function.preg-match.php
How should I go about selecting everything between (but not including) http://php.net and .php:
/manual/en/function.preg-match
This is for an Nginx configuration file.
Like this:
if (preg_match('/(?<=net).*(?=\.php)/', $subject, $regs)) {
    $result = $regs[0];
}
Explanation:
"
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
net # Match the characters “net” literally
)
. # Match any single character that is not a line break character
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
\. # Match the character “.” literally
php # Match the characters “php” literally
)
"
Here's a regex solution better than what most have provided so far, if you ask me: http://regex101.com/r/nQ8rH5
/http:\/\/[^\/]+\K.*(?=\.[^.]+$)/i
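For what it's worth, here is a minimal sketch of how that pattern could be used from PHP. The function name pathAfterHost is made up for illustration; the pattern itself is the one above, unchanged. \K drops the scheme and host from the reported match, and the lookahead stops the match before the final extension.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the answer above).
function pathAfterHost(string $url): ?string
{
    // \K resets the start of the reported match, so "http://host" is not returned.
    // (?=\.[^.]+$) requires a final ".ext" after the match without consuming it.
    if (preg_match('/http:\/\/[^\/]+\K.*(?=\.[^.]+$)/i', $url, $m)) {
        return $m[0];
    }
    return null; // no match, e.g. a different scheme or no extension
}

echo pathAfterHost('http://php.net/manual/en/function.preg-match.php'), "\n";
// prints /manual/en/function.preg-match
```

Note that the pattern as written only accepts http://, so an https:// URL would return null here.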
There's no need to use a regular expression to dissect a URL. PHP has built-in functions for this, pathinfo() and parse_url().
Simple:
$url = "http://php.net/manual/en/function.preg-match.php";
preg_match("/http:\/\/php\.net(.+)\.php/", $url, $matches);
echo $matches[1];
$matches[0] is the full match (here, the whole URL), and $matches[1] is the part you want.
See for yourself: http://codepad.viper-7.com/hHmwI2
Just for the fun of it, here are two ways that have not been explored:
substr($s, strpos($s, '/', 8), -4)
Or:
substr($s, strpos($s, '/', 8), -strlen($s) + strrpos($s, '.'))
Both are based on the idea that the schemes http:// and https:// are at most 8 characters long, so it typically suffices to find the first slash from the 9th character (offset 8) onwards. If the extension is always .php, the first snippet will work; otherwise the second one is required.
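As a rough sketch, the two substr() variants could be wrapped like this. The function names are invented for illustration, and the false check is an addition that the one-liners above do not have.

```php
<?php
// Hypothetical helpers; names are assumptions, not part of the answer above.

// Variant 1: assumes the extension is exactly 4 characters long, e.g. ".php".
function sliceFixedExtension(string $s): ?string
{
    $start = strpos($s, '/', 8); // first slash after the scheme
    return $start === false ? null : substr($s, $start, -4);
}

// Variant 2: cuts at the last dot, whatever the extension length is.
function sliceAnyExtension(string $s): ?string
{
    $start = strpos($s, '/', 8);
    $dot = strrpos($s, '.');
    if ($start === false || $dot === false) {
        return null;
    }
    // A negative length drops everything from the last dot to the end.
    return substr($s, $start, -strlen($s) + $dot);
}

echo sliceFixedExtension('http://php.net/manual/en/function.preg-match.php'), "\n";
echo sliceAnyExtension('https://example.com/a/b.html'), "\n";
```

Neither variant knows about query strings or fragments, so they only suit URLs shaped like the one in the question.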
For a pure regular expression solution you can break the string down like this:
~^(?:[^:/?#]+:)?(?://[^/?#]*)?([^?#]*)~
                              ^
The path portion would be inside the first memory group (i.e. index 1), indicated by the ^ in the line underneath the expression. Removing the extension can be done using pathinfo():
$parts = pathinfo($matches[1]);
echo $parts['dirname'] . '/' . $parts['filename'];
You can also tweak the expression to this:
([^?#]*?)(?:\.[^?#]*)?(?:\?|$)
This expression is not optimal though, because it involves some backtracking. In the end I would go for something less custom:
$parts = pathinfo(parse_url($url, PHP_URL_PATH));
echo $parts['dirname'] . '/' . $parts['filename'];
Try this:
preg_match("/net(.*)\.php$/","http://php.net/manual/en/function.preg-match.php", $matches);
echo $matches[1];
// prints /manual/en/function.preg-match
This general URL match allows you to select parts of a URL:
if (preg_match('/\\b(?P<protocol>https?|ftp):\/\/(?P<domain>[-A-Z0-9.]+)(?P<file>\/[-A-Z0-9+&@#\/%=~_|!:,.;]*)?(?P<parameters>\\?[-A-Z0-9+&@#\/%=~_|!:,.;]*)?/i', $subject, $regs)) {
    $result = $regs['file'];
    // or append $regs['parameters'] as well
} else {
    $result = "";
}
re> |(?<=\w)/.+(?=\.\w+$)|
Compile time 0.0011 milliseconds
Memory allocation (code space): 32
Study time 0.0002 milliseconds
Capturing subpattern count = 0
No options
First char = '/'
No need char
Max lookbehind = 1
Subject length lower bound = 2
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0007 milliseconds
 0: /manual/en/function.preg-match

re> |//[^/]*(.*)\.\w+$|
Compile time 0.0010 milliseconds
Memory allocation (code space): 28
Study time 0.0002 milliseconds
Capturing subpattern count = 1
No options
First char = '/'
Need char = '.'
Subject length lower bound = 4
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
 0: //php.net/manual/en/function.preg-match.php
 1: /manual/en/function.preg-match

re> |/[^/]+(.*)\.|
Compile time 0.0008 milliseconds
Memory allocation (code space): 23
Study time 0.0002 milliseconds
Capturing subpattern count = 1
No options
First char = '/'
Need char = '.'
Subject length lower bound = 3
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
 0: /php.net/manual/en/function.preg-match.
 1: /manual/en/function.preg-match

re> |/[^/]+\K.*(?=\.)|
Compile time 0.0009 milliseconds
Memory allocation (code space): 22
Study time 0.0002 milliseconds
Capturing subpattern count = 0
No options
First char = '/'
No need char
Subject length lower bound = 2
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
 0: /manual/en/function.preg-match

re> |\w+\K/.*(?=\.)|
Compile time 0.0009 milliseconds
Memory allocation (code space): 22
Study time 0.0003 milliseconds
Capturing subpattern count = 0
No options
No first char
Need char = '/'
Subject length lower bound = 2
Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0011 milliseconds
 0: /manual/en/function.preg-match
Regular expression for matching everything after "net" and before ".php" (the character class has to allow "/", "." and "-", since the path contains them, and preg_match() needs delimiters around the pattern):
$pattern = "~net([a-zA-Z0-9_/.-]*)\.php~";
The group enclosed by "()" captures the part you are looking for.
Hope it's useful.
A regular expression might not be the most effective tool for this job.
Try using parse_url(), combined with pathinfo():
$url = 'http://php.net/manual/en/function.preg-match.php';
$path = parse_url($url, PHP_URL_PATH);
$pathinfo = pathinfo($path);
echo $pathinfo['dirname'], '/', $pathinfo['filename'];
The above code outputs:
/manual/en/function.preg-match
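As a sketch of why this approach tends to hold up better than a regex, the same two calls can be wrapped so that a query string, a fragment, or a file directly under the root are handled too. The function name stripUrlExtension and the rtrim() step are assumptions added for illustration, not part of the code above.

```php
<?php
// Hypothetical helper; the name and the edge-case handling are assumptions.
function stripUrlExtension(string $url): ?string
{
    // parse_url() returns null when the URL has no path and false when it is malformed.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    $info = pathinfo($path);
    // For "/index.php" dirname is "/" (or "\" on Windows); trim it to avoid "//index".
    $dir = rtrim($info['dirname'], '/\\');
    return $dir . '/' . $info['filename'];
}

echo stripUrlExtension('https://php.net/manual/en/function.preg-match.php?foo=1#bar'), "\n";
echo stripUrlExtension('http://php.net/index.php'), "\n";
```

The query string and fragment never reach pathinfo(), because parse_url() has already separated them from the path.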
http:[\/]{2}.+?[.][^\/]+(.+)[.].+
Let's see what it does:
http:[\/]{2}.+?[.][^\/]+ - matches, without capturing, http://php.net
(.+)[.] - captures everything up to the last dot: /manual/en/function.preg-match
[.].+ - matches the file extension, such as .php
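A minimal sketch of running this pattern from PHP follows. The ~ delimiters and the function name capturePath are choices made here for illustration; the pattern between the delimiters is unchanged.

```php
<?php
// Hypothetical wrapper; the name and the "~" delimiter are assumptions.
function capturePath(string $url): ?string
{
    // Group 1 is greedy, so it backtracks only as far as the last dot
    // that still leaves at least one character for the extension.
    if (preg_match('~http:[\/]{2}.+?[.][^\/]+(.+)[.].+~', $url, $m)) {
        return $m[1];
    }
    return null;
}

echo capturePath('http://php.net/manual/en/function.preg-match.php'), "\n";
// prints /manual/en/function.preg-match
```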
Source: Stackoverflow.com