[regex] Match the path of a URL, minus the filename extension

What would be the best regular expression for this scenario?

Given this URL:

http://php.net/manual/en/function.preg-match.php

How should I go about selecting everything between (but not including) http://php.net and .php:

/manual/en/function.preg-match

This is for an Nginx configuration file.

This question is related to regex nginx

The answer is


Like this:

if (preg_match('/(?<=net).*(?=\.php)/', $subject, $regs)) {
    $result = $regs[0];
}

Explanation:

"
(?<=      # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
   net       # Match the characters “net” literally
)
.         # Match any single character that is not a line break character
   *         # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?=       # Assert that the regex below can be matched, starting at this position (positive lookahead)
   \.        # Match the character “.” literally
   php       # Match the characters “php” literally
)
"

Here's a regex solution better than what most have provided so far, if you ask me: http://regex101.com/r/nQ8rH5

/http:\/\/[^\/]+\K.*(?=\.[^.]+$)/i
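
A minimal sketch of how that pattern might be used from PHP, assuming a PCRE version that supports \K (the variable names are only illustrative). \K discards everything matched so far, so the whole match in $m[0] is just the path without the extension:

```php
<?php
$url = 'http://php.net/manual/en/function.preg-match.php';

// \K drops the scheme and host from the reported match;
// the lookahead keeps the final ".ext" out of it.
$pattern = '/http:\/\/[^\/]+\K.*(?=\.[^.]+$)/i';

$path = preg_match($pattern, $url, $m) === 1 ? $m[0] : null;

echo $path, "\n";
```

Because the scheme is hard-coded as http://, an https:// URL would not match unless the pattern is changed to https?.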

There's no need to use a regular expression to dissect a URL. PHP has built-in functions for this, pathinfo() and parse_url().


Simple:

$url = "http://php.net/manual/en/function.preg-match.php";
preg_match("/http:\/\/php\.net(.+)\.php/", $url, $matches);
echo $matches[1];

$matches[0] is your full URL, $matches[1] is the part you want.

See for yourself: http://codepad.viper-7.com/hHmwI2


Just for the fun of it, here are two ways that have not been explored:

substr($s, strpos($s, '/', 8), -4)

Or:

substr($s, strpos($s, '/', 8), -strlen($s) + strrpos($s, '.'))

Both rely on the fact that the schemes http:// and https:// are at most 8 characters long, so it typically suffices to look for the first slash from the 9th character onwards (offset 8). If the extension is always .php, the first snippet will work; otherwise the second one is required.
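
As a rough sketch, the second variant could be wrapped in a helper like the one below. The function name is made up for illustration, and it assumes an http(s) URL that has a path and a dot in its file name; there is no error handling for other inputs. It passes a positive length ($dot - $start) to substr(), which selects the same characters as the negative length used above:

```php
<?php
// Hypothetical helper; assumes an http(s) URL with a path and a dot in the file name.
function pathWithoutExtension(string $s): string
{
    $start = strpos($s, '/', 8);  // first slash at or after offset 8, i.e. past "http://" or "https://"
    $dot   = strrpos($s, '.');    // last dot, taken as the start of the extension
    return substr($s, $start, $dot - $start);
}

echo pathWithoutExtension('http://php.net/manual/en/function.preg-match.php'), "\n";
echo pathWithoutExtension('https://php.net/manual/en/function.preg-match.html'), "\n";
```

Both calls should print /manual/en/function.preg-match. A query string containing a dot would break the strrpos() assumption.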

For a pure regular expression solution you can break the string down like this:

~^(?:[^:/?#]+:)?(?://[^/?#]*)?([^?#]*)~
                              ^

The path portion would be inside the first memory group (i.e. index 1), indicated by the ^ in the line underneath the expression. Removing the extension can be done using pathinfo():

$parts = pathinfo($matches[1]);
echo $parts['dirname'] . '/' . $parts['filename'];
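
Put together, the two steps might look like this (a sketch only; $url and $result are illustrative names):

```php
<?php
$url = 'http://php.net/manual/en/function.preg-match.php';

// Group 1 is the path: everything after the optional scheme and authority,
// up to a "?" or "#".
if (preg_match('~^(?:[^:/?#]+:)?(?://[^/?#]*)?([^?#]*)~', $url, $matches)) {
    $parts  = pathinfo($matches[1]);
    $result = $parts['dirname'] . '/' . $parts['filename'];
    echo $result, "\n";
}
```

One caveat: for a file directly under the root, such as /index.php, dirname is '/' on Unix-like systems, so the concatenation would produce a double slash.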

You can also tweak the expression to this:

([^?#]*?)(?:\.[^?#]*)?(?:\?|$)

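This expression is not optimal, though, because it involves some backtracking. Also note that the lazy first group stops at the first dot it can, so a file name with several dots, such as function.preg-match.php, is cut after "function". In the end I would go for something less custom: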

$parts = pathinfo(parse_url($url, PHP_URL_PATH));
echo $parts['dirname'] . '/' . $parts['filename'];

Try this:

preg_match("/net(.*)\.php$/","http://php.net/manual/en/function.preg-match.php", $matches);
echo $matches[1];
// prints /manual/en/function.preg-match

This general URL match allows you to select parts of a URL:

if (preg_match('/\\b(?P<protocol>https?|ftp):\/\/(?P<domain>[-A-Z0-9.]+)(?P<file>\/[-A-Z0-9+&@#\/%=~_|!:,.;]*)?(?P<parameters>\\?[-A-Z0-9+&@#\/%=~_|!:,.;]*)?/i', $subject, $regs)) {
    $result = $regs['file'];
    //or you can append the $regs['parameters'] too
} else {
    $result = "";
}

|(?<=\w)/.+(?=\.\w+$)|

  • match from the first literal '/' that is
  • preceded (lookbehind) by a word (\w) character
  • up to the point where a lookahead finds
    • a literal '.' followed by
    • one or more word (\w) characters
    • at the end of the string ($)
  re> |(?<=\w)/.+(?=\.\w+$)|
Compile time 0.0011 milliseconds
Memory allocation (code space): 32
  Study time 0.0002 milliseconds
Capturing subpattern count = 0
No options
First char = '/'
No need char
Max lookbehind = 1
Subject length lower bound = 2
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0007 milliseconds
 0: /manual/en/function.preg-match

|//[^/]*(.*)\.\w+$|

  • find a literal '//' followed by any run of characters other than '/'
  • capture everything after that, up to
  • a literal '.' followed only by word (\w) characters before the end $
  re> |//[^/]*(.*)\.\w+$|
Compile time 0.0010 milliseconds
Memory allocation (code space): 28
  Study time 0.0002 milliseconds
Capturing subpattern count = 1
No options
First char = '/'
Need char = '.'
Subject length lower bound = 4
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
 0: //php.net/manual/en/function.preg-match.php
 1: /manual/en/function.preg-match

|/[^/]+(.*)\.|

  • find a literal '/' followed by one or more characters other than '/'
  • greedily capture everything before the last literal '.'
  re> |/[^/]+(.*)\.|
Compile time 0.0008 milliseconds
Memory allocation (code space): 23
  Study time 0.0002 milliseconds
Capturing subpattern count = 1
No options
First char = '/'
Need char = '.'
Subject length lower bound = 3
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
 0: /php.net/manual/en/function.preg-match.
 1: /manual/en/function.preg-match

|/[^/]+\K.*(?=\.)|

  • find a literal '/' followed by one or more characters other than '/'
  • reset the start of the match with \K
  • greedily match everything up to
  • a lookahead for the last literal '.'
  re> |/[^/]+\K.*(?=\.)|
Compile time 0.0009 milliseconds
Memory allocation (code space): 22
  Study time 0.0002 milliseconds
Capturing subpattern count = 0
No options
First char = '/'
No need char
Subject length lower bound = 2
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
 0: /manual/en/function.preg-match

|\w+\K/.*(?=\.)|

  • find one or more Word(\w) characters before a literal '/'
  • reset select start \K
  • select literal '/' followed by
  • anything before
  • look ahead last literal '.'
  re> |\w+\K/.*(?=\.)|
Compile time 0.0009 milliseconds
Memory allocation (code space): 22
  Study time 0.0003 milliseconds
Capturing subpattern count = 0
No options
No first char
Need char = '/'
Subject length lower bound = 2
Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P 
  Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z 
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0011 milliseconds
 0: /manual/en/function.preg-match

Regular expression for matching everything after "net" and before ".php":

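$pattern = "~net([a-zA-Z0-9_./-]*)\.php~";

In the above regular expression, the group enclosed in "()" captures what you are looking for. The character class has to include '/', '.' and '-' as well, because the example path contains them, and preg_match() needs delimiters around the pattern (here "~").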
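
A short sketch of reading that group with preg_match(). It assumes the character class is widened with '/', '.' and '-' so it can cover the example path, and it uses "~" as the delimiter; both are choices made for this illustration:

```php
<?php
$url = 'http://php.net/manual/en/function.preg-match.php';

// Assumption: the class also allows "/", "." and "-" so it can span the example path.
$pattern = '~net([a-zA-Z0-9_./-]*)\.php~';

// $matches[1] holds the text captured by the parentheses.
$result = preg_match($pattern, $url, $matches) === 1 ? $matches[1] : '';

echo $result, "\n";
```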

Hope it's useful.


A regular expression might not be the most effective tool for this job.

Try using parse_url(), combined with pathinfo():

$url      = 'http://php.net/manual/en/function.preg-match.php';
$path     = parse_url($url, PHP_URL_PATH);
$pathinfo = pathinfo($path);

echo $pathinfo['dirname'], '/', $pathinfo['filename'];

The above code outputs:

/manual/en/function.preg-match

http:[\/]{2}.+?[.][^\/]+(.+)[.].+

Let's see what it does:

http:[\/]{2}.+?[.][^\/]+ - matches, without capturing, the scheme and host: http://php.net

(.+)[.] - captures everything up to the last dot: /manual/en/function.preg-match

[.].+ - matches the file extension, e.g. .php
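
To try this in PHP, a sketch along these lines should do. The "~" delimiter is a choice made here so the slashes in the pattern need no further escaping, and the variable names are illustrative:

```php
<?php
$url = 'http://php.net/manual/en/function.preg-match.php';

// Same expression as above, wrapped in "~" delimiters.
$pattern = '~http:[\/]{2}.+?[.][^\/]+(.+)[.].+~';

// Group 1 is the path up to, but not including, the last dot.
$result = preg_match($pattern, $url, $matches) === 1 ? $matches[1] : '';

echo $result, "\n";
```

The pattern expects at least one dot in the host name and a file extension at the end; URLs without either will not match.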