I've got a shell script outputting data like this:
1234567890 *
1234567891 *
I need to remove JUST the last three characters " *". I know I can do it via
(whatever) | sed 's/\(.*\).../\1/'
But I DON'T want to use sed for speed purposes. It will always be the same last 3 characters.
Any quick way of cleaning up the output?
Both awk and sed are plenty fast, but if you think it matters, feel free to use one of the following:
If the characters you want to delete occur only at the end of the string (tr -d removes every space and asterisk, wherever it appears):
echo '1234567890 *' | tr -d ' *'
If they can appear anywhere within the string and you only want to delete the last three characters:
echo '1234567890 *' | rev | cut -c 4- | rev
The man pages of all the commands will explain what's going on.
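As a rough illustration of the difference between the two pipelines, here is a small sketch. The function names strip_chars_everywhere and strip_last3 are made up for this example, and the sample input (with a space in the middle and a three-character " **" tail) is an assumption, not the asker's real data.

```shell
# Hypothetical helper names, used only for this illustration.

# Deletes every space and asterisk, wherever it occurs on the line.
strip_chars_everywhere() { tr -d ' *'; }

# Drops the last three characters of each line, whatever they are.
strip_last3() { rev | cut -c 4- | rev; }

printf '%s\n' '12 34 **' | strip_chars_everywhere   # prints: 1234
printf '%s\n' '12 34 **' | strip_last3              # prints: 12 34
```

The tr version also eats the space in the middle of the line, which is why it is only safe when those characters never occur anywhere but the end.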
I think you should use sed, though.
I can guarantee you that bash alone won't be any faster than sed for this task. Starting external processes from bash is generally a bad idea, but only if you do it a lot.
So, if you were starting a sed process for each line of your input, I'd be concerned. But you're not. You only need to start one sed, which will do all the work for you.
You may, however, find that the following sed command will be a bit faster than your version:
(whatever) | sed 's/...$//'
All this does is remove the last three characters on each line, rather than substituting the whole line with a shorter version of itself. Maybe more modern RE engines can optimise your command, but why take the risk?
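A minimal sketch of the "one sed for the whole stream" idea follows. The wrapper name strip_last3_sed and the sample lines (with a three-character " **" tail) are assumptions for illustration only.

```shell
# Hypothetical wrapper: a single sed process filters every line on stdin.
strip_last3_sed() { sed 's/...$//'; }

# Two lines go in, one sed process handles both.
printf '%s\n' '1234567890 **' '1234567891 **' | strip_last3_sed
# prints:
# 1234567890
# 1234567891
```

Note that a line shorter than three characters does not match the pattern and passes through unchanged.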
To be honest, about the only way I can think of that would be faster would be to hand-craft your own C-based filter program. And the only reason that may be faster than sed is because you can take advantage of the extra knowledge you have on your processing needs (sed has to allow for generalised processing, so may be slower because of that).
Don't forget the optimisation mantra: "Measure, don't guess!"
If you really want to do this one line at a time in bash (and I still maintain that it's a bad idea), you can use:
pax> line=123456789abc
pax> line2=${line%%???}
pax> echo ${line2}
123456789
pax> _
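The same expansion can be wrapped in a small function. This is only a sketch: the name strip_last3_param is invented here, and the expansion is quoted so that characters such as * in the data are not glob-expanded. Since each ? matches exactly one character, % and %% behave identically for this pattern.

```shell
# Hypothetical helper: remove the last three characters of its first argument
# using only parameter expansion (no external process is started).
strip_last3_param() {
    printf '%s\n' "${1%???}"
}

strip_last3_param '123456789abc'    # prints: 123456789
strip_last3_param '1234567890 **'   # prints: 1234567890
strip_last3_param 'ab'              # prints: ab (too short, nothing removed)
```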
You may also want to investigate whether you actually need a speed improvement. If you process the lines as one big chunk, you'll see that sed is plenty fast. Type in the following:
#!/usr/bin/bash
echo This is a pretty chunky line with three bad characters at the end.XXX >qq1
for i in 4 16 64 256 1024 4096 16384 65536 ; do
cat qq1 qq1 >qq2
cat qq2 qq2 >qq1
done
head -n 20000 qq1 >qq2
wc -l qq2
date
time sed 's/...$//' qq2 >qq1
date
head -n 3 qq1
and run it. Here's the output on my (not very fast at all) R40 laptop:
pax> ./chk.sh
20000 qq2
Sat Jul 24 13:09:15 WAST 2010
real 0m0.851s
user 0m0.781s
sys 0m0.050s
Sat Jul 24 13:09:16 WAST 2010
This is a pretty chunky line with three bad characters at the end.
This is a pretty chunky line with three bad characters at the end.
This is a pretty chunky line with three bad characters at the end.
That's 20,000 lines in under a second, pretty good for something that's only done every hour.
Here's an old-fashioned unix trick for removing the last 3 characters from a line that makes no use of sed OR awk...
> echo 987654321 | rev | cut -c 4- | rev
987654
Unlike the earlier example using 'cut', this does not require knowledge of the line length.
What do you mean, you don't want to use sed/awk for speed purposes? sed and awk are faster than the shell's while read loop for processing files.
$ sed 's/[ \t]*\*$//' file
1234567890
1234567891
$ sed 's/..\*$//' file
1234567890
1234567891
With the bash shell:
while read -r a b
do
echo "$a"
done <file
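The loop works because read splits each line on whitespace: the first word goes into the first variable and everything else into the last one. A small sketch, with the function name first_field and the variable names first and rest chosen here purely for illustration:

```shell
# Hypothetical helper: print only the first whitespace-separated field
# of every line on stdin.
first_field() {
    while read -r first rest; do
        # "$rest" holds the remainder of the line (here " *") and is discarded.
        printf '%s\n' "$first"
    done
}

printf '%s\n' '1234567890 *' '1234567891 *' | first_field
# prints:
# 1234567890
# 1234567891
```

This only helps when the part to keep contains no whitespace; otherwise the line is cut at the first space.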
No need for cut or magic. In bash (4.2 or later, which added negative lengths in substring expansion) you can cut a string like so:
ORGSTRING="123456"
CUTSTRING=${ORGSTRING:0:-3}
echo "The original string: $ORGSTRING"
echo "The new, shorter and faster string: $CUTSTRING"
Note: This answer is somewhat intended to be a joke, but it actually does work...
#!/bin/bash
outfile="/tmp/$RANDOM"
cfile="$outfile.c"
echo '#include <stdio.h>
int main(void){int e=1;int c;while((c=getc(stdin))!=EOF){if(c==10)e=1;if(c==32)e=0;if(e)putc(c,stdout);}return 0;}' >> "$cfile"
gcc -o "$outfile" "$cfile"
rm "$cfile"
cat somedata.txt | "$outfile"
rm "$outfile"
You can replace cat somedata.txt with a different command.
You could try
(whatever) | while IFS= read -r line; do printf '%s' "$line" | head --bytes -3; echo; done
head itself should be faster than sed or cut because there's no regex or delimiter matching, but invoking a separate process for every line would probably outweigh that.
You can use awk just to print the first 'field' if there won't be any spaces (or if there will be, change the separator).
I put the fields you had above into a file and did this
awk '{ print $1 }' < test.txt
1234567890
1234567891
I don't know if that's any better.
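For the "change the separator" case, awk's -F option sets the field separator. A hedged sketch, assuming comma- or colon-separated sample data that is not from the question; the function name first_field_awk is invented for this example:

```shell
# Hypothetical helper: print the first field of each line, using the
# separator passed as the function's first argument.
first_field_awk() {
    # "$1" outside the quotes is the shell argument (the separator);
    # $1 inside the single quotes is awk's first field.
    awk -F"$1" '{ print $1 }'
}

printf '%s\n' '12 34,*' '56 78,*' | first_field_awk ','
# prints:
# 12 34
# 56 78
```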
If the script always outputs lines of 10 characters followed by 3 extra (in other words, you just want the first 10 characters), you can use
script | cut -c 1-10
If it outputs an uncertain number of non-space characters, followed by a space and then 2 other extra characters (in other words, you just want the first field), you can use
script | cut -d ' ' -f 1
... as in majhool's comment earlier. Depending on your platform, you may also have colrm, which, again, would work if the lines are a fixed length:
script | colrm 11
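To make the trade-off between the fixed-width and field-based variants concrete, here is a small sketch. The function names keep_first10 and keep_first_field are made up, and the shorter second sample line is an assumption added to show where the fixed-width approach stops working.

```shell
# Hypothetical helpers wrapping the two cut invocations above.
keep_first10()     { cut -c 1-10; }        # fixed width: keep columns 1-10
keep_first_field() { cut -d ' ' -f 1; }    # keep everything before the first space

printf '%s\n' '1234567890 *' | keep_first10       # prints: 1234567890
printf '%s\n' '1234567890 *' | keep_first_field   # prints: 1234567890

# With a shorter line, the fixed-width version no longer strips the tail.
printf '%s\n' '12345 *' | keep_first10            # prints: 12345 *
printf '%s\n' '12345 *' | keep_first_field        # prints: 12345
```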
$ x="can_haz"
$ echo "${x%???}"
can_
Another answer relies on the third-to-last character being a space. This will work with (almost) any character in that position and does it "WITHOUT using sed, or perl, etc.":
while read -r line
do
echo "${line:0:${#line}-3}"
done
If your lines are fixed length, change the echo to:
echo "${line:0:10}"
or
printf "%.10s\n" "$line"
but each of these is definitely much slower than sed.
Source: Stackoverflow.com