I've got this little script in sh
(Mac OSX 10.6) to look through an array of files. Google has stopped being helpful at this point:
files="*.jpg"
for f in $files
do
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
echo $name
done
So far (obviously, to you shell gurus) $name
merely holds 0, 1 or 2, depending on if grep
found that the filename matched the matter provided. What I'd like is to capture what's inside the parens ([a-z]+)
and store that to a variable.
I'd like to use grep
only, if possible. If not, please no Python or Perl, etc. sed
or something like it – I'm new to shell and would like to attack this from the *nix purist angle.
Also, as a super-cool bonus, I'm curious as to how I can concatenate string in shell? Is the group I captured was the string "somename" stored in $name, and I wanted to add the string ".jpg" to the end of it, could I cat $name '.jpg'
?
Please explain what's going on, if you've got the time.
str="1w 2d 1h"
regex="([0-9])w ([0-9])d ([0-9])h"
if [[ $str =~ $regex ]]
then
week="${BASH_REMATCH[1]}"
day="${BASH_REMATCH[2]}"
hour="${BASH_REMATCH[3]}"
echo $week --- $day ---- $hour
fi
output: 1 --- 2 ---- 1
This isn't really possible with pure grep
, at least not generally.
But if your pattern is suitable, you may be able to use grep
multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut
and sed
are far better at this).
Suppose for the sake of argument that your pattern was a bit simpler: [0-9]+_([a-z]+)_
You could extract this like so:
echo $name | grep -Ei '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'
The first grep
would remove any lines that didn't match your overall patern, the second grep
(which has --only-matching
specified) would display the alpha portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.
(Aside: Personally I'd use grep
+ cut
to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2
. This gets cut
to parse the line into fields by splitting on the delimiter _
, and returns just field 2 (field numbers start at 1)).
Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep
+ sed
etc is a more Unixy way of doing things :-)
I prefer the one line python
or perl
command, both often included in major linux disdribution
echo $'
<a href="http://stackoverflow.com">
</a>
<a href="http://google.com">
</a>
' | python -c $'
import re
import sys
for i in sys.stdin:
g=re.match(r\'.*href="(.*)"\',i);
if g is not None:
print g.group(1)
'
and to handle files:
ls *.txt | python -c $'
import sys
import re
for i in sys.stdin:
i=i.strip()
f=open(i,"r")
for j in f:
g=re.match(r\'.*href="(.*)"\',j);
if g is not None:
print g.group(1)
f.close()
'
A suggestion for you - you can use parameter expansion to remove the part of the name from the last underscore onwards, and similarly at the start:
f=001_abc_0za.jpg
work=${f%_*}
name=${work#*_}
Then name
will have the value abc
.
See Apple developer docs, search forward for 'Parameter Expansion'.
I realize that an answer was already accepted for this, but from a "strictly *nix purist angle" it seems like the right tool for the job is pcregrep
, which doesn't seem to have been mentioned yet. Try changing the lines:
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
to the following:
name=$(echo $f | pcregrep -o1 -Ei '[0-9]+_([a-z]+)_[0-9a-z]*')
to get only the contents of the capturing group 1.
The pcregrep
tool utilizes all of the same syntax you've already used with grep
, but implements the functionality that you need.
The parameter -o
works just like the grep
version if it is bare, but it also accepts a numeric parameter in pcregrep
, which indicates which capturing group you want to show.
With this solution there is a bare minimum of change required in the script. You simply replace one modular utility with another and tweak the parameters.
Interesting Note: You can use multiple -o arguments to return multiple capture groups in the order in which they appear on the line.
if you have bash, you can use extended globbing
shopt -s extglob
shopt -s nullglob
shopt -s nocaseglob
for file in +([0-9])_+([a-z])_+([a-z0-9]).jpg
do
IFS="_"
set -- $file
echo "This is your captured output : $2"
done
or
ls +([0-9])_+([a-z])_+([a-z0-9]).jpg | while read file
do
IFS="_"
set -- $file
echo "This is your captured output : $2"
done
This is a solution that uses gawk. It's something I find I need to use often so I created a function for it
function regex1 { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'1'}']}'; }
to use just do
$ echo 'hello world' | regex1 'hello\s(.*)'
world
Not possible in just grep I believe
for sed:
name=`echo $f | sed -E 's/([0-9]+_([a-z]+)_[0-9a-z]*)|.*/\2/'`
I'll take a stab at the bonus though:
echo "$name.jpg"
Source: Stackoverflow.com