I would like to cleanly filter a dataframe using regex on one of the columns.
For a contrived example:
In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]:
a b
0 1 hi
1 2 foo
2 3 fat
3 4 cat
I want to filter the rows to those that start with f
using a regex. First go:
In [213]: foo.b.str.match('f.*')
Out[213]:
0 []
1 ()
2 ()
3 []
That's not too terribly useful. However this will get me my boolean index:
In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]:
0 False
1 True
2 True
3 False
Name: b
So I could then do my restriction by:
In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]:
a b
1 2 foo
2 3 fat
That makes me artificially put a group into the regex though, and seems like maybe not the clean way to go. Is there a better way to do this?
Write a Boolean function that checks the regex and use apply on the column
foo[foo['b'].apply(regex_function)]
Using str
slice
foo[foo.b.str[0]=='f']
Out[18]:
a b
1 2 foo
2 3 fat
It may be a bit late, but this is now easier to do in Pandas by calling Series.str.match
. The docs explain the difference between match
, fullmatch
and contains
.
Note that in order to use the results for indexing, set the na=False
argument (or True
if you want to include NANs in the results).
Multiple column search with dataframe:
frame[frame.filename.str.match('*.'+MetaData+'.*') & frame.file_path.str.match('C:\test\test.txt')]
Use contains instead:
In [10]: df.b.str.contains('^f')
Out[10]:
0 False
1 True
2 True
3 False
Name: b, dtype: bool
There is already a string handling function Series.str.startswith()
.
You should try foo[foo.b.str.startswith('f')]
.
Result:
a b
1 2 foo
2 3 fat
I think what you expect.
Alternatively you can use contains with regex option. For example:
foo[foo.b.str.contains('oo', regex= True, na=False)]
Result:
a b
1 2 foo
na=False
is to prevent Errors in case there is nan, null etc. values
Thanks for the great answer @user3136169, here is an example of how that might be done also removing NoneType values.
def regex_filter(val):
if val:
mo = re.search(regex,val)
if mo:
return True
else:
return False
else:
return False
df_filtered = df[df['col'].apply(regex_filter)]
Also you can also add regex as an arg:
def regex_filter(val,myregex):
...
df_filtered = df[df['col'].apply(res_regex_filter,regex=myregex)]
Source: Stackoverflow.com