[regex] Regex for parsing directory and filename

I'm trying to write a regex that will parse out the directory and filename of a fully qualified path using matching groups.



For example, the path /var/log/xyz/10032008.log should yield group 1 "/var/log/xyz" and group 2 "10032008.log".

Seems simple but I can't get the matching groups to work for the life of me.

NOTE: As pointed out by some of the respondents this is probably not a good use of regular expressions. Generally I'd prefer to use the file API of the language I was using. What I'm actually trying to do is a little more complicated than this but would have been much more difficult to explain, so I chose a domain that everyone would be familiar with in order to most succinctly describe the root problem.


The answer is

In languages that support regular expressions with non-capturing groups:

((?:[^/]*/)*)(.*)


I'll explain the gnarly regex by exploding it...


What the parts mean:

(  -- capture group 1 starts
  (?:  -- non-capturing group starts
    [^/]*  -- greedily match as many non-directory separators as possible
    /  -- match a single directory-separator character
  )  -- non-capturing group ends
  *  -- repeat the non-capturing group zero-or-more times
)  -- capture group 1 ends
(.*)  -- capture all remaining characters in group 2
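
The same breakdown can be kept next to the pattern in code. Below is a minimal sketch in Python (my choice of language, not the answer's) that writes the pattern ((?:[^/]*/)*)(.*) in verbose mode so each part carries its own comment. The names SPLIT_PATH and split_path are made up for illustration.

```python
import re

# The pattern ((?:[^/]*/)*)(.*) spread out with re.VERBOSE,
# which ignores whitespace and allows inline comments.
SPLIT_PATH = re.compile(r"""
    (               # capture group 1 starts
      (?:           # non-capturing group starts
        [^/]*       # as many non-separator characters as possible
        /           # a single directory separator
      )*            # repeat the non-capturing group zero or more times
    )               # capture group 1 ends
    (.*)            # capture all remaining characters in group 2
""", re.VERBOSE)


def split_path(path):
    """Return (directory, filename) for a '/'-separated path."""
    match = SPLIT_PATH.match(path)  # always matches: both groups may be empty
    return match.group(1), match.group(2)


print(split_path("/var/log/xyz/10032008.log"))
```

Note that group 1 keeps the trailing slash, and that a bare filename gives an empty string for group 1.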


To test the regular expression, I used the following Perl script...

#!/usr/bin/perl -w

use strict;
use warnings;

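# Split a path into directory (group 1) and filename (group 2) and print both.
sub test {
  my $str = shift;
  my $testname = shift;

  # Group 1: zero or more "name/" segments; group 2: whatever is left.
  $str =~ m#((?:[^/]*/)*)(.*)#;

  print "$str -- $testname\n";
  print "  1: $1\n";
  print "  2: $2\n\n";
}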

test('/var/log/xyz/10032008.log', 'absolute path');
test('var/log/xyz/10032008.log', 'relative path');
test('10032008.log', 'filename-only');
test('/10032008.log', 'file directly under root');

The output of the script...

/var/log/xyz/10032008.log -- absolute path
  1: /var/log/xyz/
  2: 10032008.log

var/log/xyz/10032008.log -- relative path
  1: var/log/xyz/
  2: 10032008.log

10032008.log -- filename-only
  1:
  2: 10032008.log

/10032008.log -- file directly under root
  1: /
  2: 10032008.log

Most languages have path parsing functions that will give you this already. If you have the ability, I'd recommend using what comes to you for free out-of-the-box.
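
As one illustration (assuming Python, which the answer does not name), the standard library already does this split. posixpath is used here so that / is treated as the separator on any platform.

```python
import posixpath

# posixpath.split returns (head, tail): everything before the last "/" and the rest.
directory, filename = posixpath.split("/var/log/xyz/10032008.log")

print(directory)  # /var/log/xyz
print(filename)   # 10032008.log
```

Unlike the regex approaches in this thread, the directory part comes back without a trailing slash.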

Assuming / is the path delimiter...


The first group will be whatever the directory/path info is, the second will be the filename. For example:

  • /foo/bar/baz.log: "/foo/bar/" is the path, "baz.log" is the file
  • foo/bar.log: "foo/" is the path, "bar.log" is the file
  • /foo/bar: "/foo/" is the path, "bar" is the file
  • /foo/bar/: "/foo/bar/" is the path and there is no file.
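
The pattern itself is missing from this answer as shown here. One pattern that reproduces the four examples above is ^(.*/)([^/]*)$; treat it as my assumption, not necessarily what the answer originally contained. A quick sketch in Python to check it against the examples (PATH_AND_FILE and split_examples are hypothetical names):

```python
import re

# Assumed pattern, chosen to be consistent with the examples above.
PATH_AND_FILE = re.compile(r"^(.*/)([^/]*)$")


def split_examples(path):
    """Return (path_part, file_part), or None if the string contains no '/'."""
    m = PATH_AND_FILE.match(path)
    return m.groups() if m else None


for example in ["/foo/bar/baz.log", "foo/bar.log", "/foo/bar", "/foo/bar/"]:
    print(example, "->", split_examples(example))
```

With this pattern a string with no slash at all does not match, so a bare filename would need separate handling.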

What about this?


Deterministic:


Strict:


Try this:


It will leave the trailing slash on the path, though.

What language? And why use a regex for this simple task?

If you must:


gives you the two parts you wanted. You might need to quote the parentheses:


depending on your preferred language syntax.

But I suggest you just use your language's string search function that finds the last "/" character, and split the string on that index.
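
A minimal sketch of that suggestion, assuming Python (split_at_last_slash is a made-up helper name): find the last "/" and slice on that index.

```python
def split_at_last_slash(path):
    """Split on the last '/' using plain string search, no regex."""
    index = path.rfind("/")          # -1 when there is no slash at all
    if index == -1:
        return "", path              # no directory part, only a filename
    return path[:index], path[index + 1:]


print(split_at_last_slash("/var/log/xyz/10032008.log"))
```

For a file directly under the root, such as /10032008.log, the directory part comes back as an empty string, so the caller has to decide how to tell that apart from a bare filename.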


I did a little research by trial and error and found that every character available on the keyboard can appear in a file or directory name on a *nix machine, except '/'.

I used the touch command to create a file for each of the following characters, and each file was created.

(Comma separated values below)
'!', '@', '#', '$', "'", '%', '^', '&', '*', '(', ')', ' ', '"', '\', '-', ',', '[', ']', '{', '}', '`', '~', '>', '<', '=', '+', ';', ':', '|'

It failed only when I tried to create '/' (because that is the root directory) and a filename containing / (because it is the path separator).

Running touch . only changed the modification time of the current directory (.). A name such as file.log is possible, however.

And of course a-z, A-Z, 0-9, - (hyphen) and _ (underscore) work.


So, by the above reasoning, a file or directory name can contain anything except a forward slash (/). Our regex will therefore be built around what cannot appear in a file or directory name.


Step by Step regexp creation process

Pattern Explanation

Step-1: Start with matching root directory

A path starts with / when it is absolute and with a directory name when it is relative. Hence, look for zero or one occurrence of /.



Step-2: Try to find the first directory.

Next, a directory and its child are always separated by /, and a directory name can be anything except /. Let's match /var/ first.



Step-3: Get full directory path for the file

Next, let's match all directories



Here, single_dir is xyz/ because it first matched var/, then the next occurrence of the same pattern, log/, and then the next one, xyz/. A repeated group reports only its last occurrence.
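
The step-3 pattern is missing here, so the following is only a reconstruction from the group names mentioned in the text (root, filepath, single_dir, rest_of_the_path), written for Python's re module. It shows that a repeated capture group keeps only its last repetition. STEP3 and the variable m are names I made up.

```python
import re

# Hypothetical reconstruction of the step-3 pattern; the original was not preserved.
STEP3 = re.compile(
    r"^(?P<root>[/]?)"                          # optional leading slash
    r"(?P<filepath>(?P<single_dir>[^\/]+/)+)"   # one or more "name/" segments
    r"(?P<rest_of_the_path>.*)$"                # whatever follows the last slash
)

m = STEP3.match("/var/log/xyz/10032008.log")
print(m.group("root"))              # /
print(m.group("filepath"))          # var/log/xyz/
print(m.group("single_dir"))        # xyz/  (only the last repetition is kept)
print(m.group("rest_of_the_path"))  # 10032008.log
```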

Step-4: Match filename and clean up

Now, we know that we're never going to use groups such as single_dir, filepath and root, so let's clean that up.

Let's keep them as groups but not capture them.

And rest_of_the_path is just the filename, so rename it. A filename will not contain /, so it's better to use [^/] for it.


This brings us to the final result. Of course, there are several other ways you can do it. I am just mentioning one of the ways here.
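
The final pattern did not survive in this copy of the answer. A sketch consistent with the steps described (two named groups, dir and file, with the directory segments in a non-capturing group) might look like this in Python. FINAL and parse are hypothetical names, and the exact pattern is my assumption.

```python
import re

# Assumed final pattern: optional leading slash, one or more "name/" segments
# captured together as dir, then a slash-free filename captured as file.
FINAL = re.compile(r"^(?P<dir>[/]?(?:[^\/]+/)+)(?P<file>[^/]+)$")


def parse(path):
    """Return (dir, file), or None when the path has no directory segment."""
    m = FINAL.match(path)
    return (m.group("dir"), m.group("file")) if m else None


print(parse("/var/log/xyz/10032008.log"))
print(parse("var/log/xyz/10032008.log"))
```

Because at least one directory segment is required, a bare filename such as 10032008.log does not match this version of the pattern.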


Regex Rules used above are listed here

^ means string starts with
(?P<dir>pattern) means capture group by group name. We have two groups with group name dir and file
(?:pattern) means a non-capturing group: the group is matched but not captured.
? means match zero or one.
+ means match one or more.
[^\/] means match any character except a forward slash (/).

[/]? means if it is absolute path then it can start with / otherwise it won't. So, match zero or one occurrence of /.

[^\/]+/ means one or more characters that aren't a forward slash (/), followed by a forward slash (/). This matches var/ or xyz/, one directory at a time.

A very late answer, but I hope this helps.


This uses a lazy match for /; it is only a small modification of the accepted answer.
