YJL: Removing duplicate lines without sorting

I stumbled across this Awk one-liner on some blog I can no longer identify:

awk '!a[$0]++'

If a has no entry for $0 yet, a[$0] evaluates to 0, so ! makes the expression true and the line is printed (a pattern with no action prints the line by default). The post-increment then stores 1 in a[$0]. On any later occurrence of the same line the value is at least 1, so ! makes the expression false and the same content is never printed again.
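To make that concrete, here is a small sketch that runs the one-liner on made-up sample data next to a spelled-out equivalent. The array name seen and the sample lines are only illustrative, not from the original source.

```shell
# Sample input: "b" and "a" each appear twice, non-consecutively.
input=$(printf '%s\n' b a b c a)

# The one-liner: print a line only the first time it is seen.
concise=$(printf '%s\n' "$input" | awk '!a[$0]++')

# The same logic written out: print when the counter is still 0, then bump it.
verbose=$(printf '%s\n' "$input" | awk '{ if (seen[$0] == 0) print; seen[$0]++ }')

printf '%s\n' "$concise"
# prints:
# b
# a
# c
```

Both forms keep the first occurrence of each line and leave the order untouched.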

After some searching, I found awk1line.txt, last dated 2008-04-30 as version 0.27. The command might have been included much earlier than v0.22 of 2003-07-22, and it might come from one of the books the file lists. Anyway, the file has even more variations:

# remove duplicate, consecutive lines (emulates "uniq")
awk 'a !~ $0; {a=$0}'

# remove duplicate, nonconsecutive lines
awk '!a[$0]++'                     # most concise script
awk '!($0 in a){a[$0];print}'      # most efficient script

I timed both on 1,000,000 lines of test data generated with Bash’s $RANDOM: the former took 0.839 seconds and the latter 0.698 seconds. The latter does run faster, if that’s what efficiency means.
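A rough sketch of how such a timing run could be set up follows. The variable names N, INPUT, OUT1 and OUT2 are my own placeholders, N defaults to a smaller 100,000 so the sketch finishes quickly, and the timings will differ from machine to machine.

```shell
#!/usr/bin/env bash
# Number of test lines; set N=1000000 to match the size used in the post.
N=${N:-100000}

# Scratch files (placeholder names, created with mktemp).
INPUT=$(mktemp)
OUT1=$(mktemp)
OUT2=$(mktemp)

# Generate N lines of Bash $RANDOM values (integers from 0 to 32767).
for ((i = 0; i < N; i++)); do
  echo "$RANDOM"
done > "$INPUT"

echo "most concise script:"
time awk '!a[$0]++' "$INPUT" > "$OUT1"

echo "most efficient script:"
time awk '!($0 in a){a[$0];print}' "$INPUT" > "$OUT2"

# Both variants should produce exactly the same output.
cmp -s "$OUT1" "$OUT2" && echo "outputs match"
```

Since $RANDOM only yields 32,768 distinct values, the deduplicated output can never be longer than that, however large N is.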

For the sake of comparison, here is sort + uniq on the same amount of data:

sort INPUT | uniq > OUTPUT

It took 0.755 seconds, which lands between the two Awk scripts, so the efficient Awk script is still the winner. More importantly, the output of sort + uniq is sorted, and you might want to keep each line in its original order.
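The ordering difference is easy to see on a tiny made-up input. This is only an illustration with sample words of my choosing:

```shell
# Sample input with repeats, deliberately not in alphabetical order.
input=$(printf '%s\n' pear apple pear banana apple)

# sort + uniq removes duplicates but reorders the lines.
sorted=$(printf '%s\n' "$input" | sort | uniq)

# The Awk one-liner removes duplicates and keeps first-seen order.
kept=$(printf '%s\n' "$input" | awk '!a[$0]++')

echo "sort | uniq: $(echo $sorted)"   # apple banana pear
echo "awk:         $(echo $kept)"     # pear apple banana
```

Both results contain the same set of lines; only the order differs.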

Of course, I wasn’t going to just use somebody else’s code, so here is a Bash version I wrote just for fun:

while read L; do ((X[L]++)) || echo "$L"; done < INPUT > OUTPUT

It’s basically a port of the Awk code. For 10,000 numbers, the Awk took 0.033 seconds and my Bash took 1.066 seconds. Note that it is not perfect: Bash never sees the surrounding spaces of a line, because read strips leading and trailing whitespace through word splitting.
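For input that is not purely numeric, a more careful variant might look like the sketch below. It is not the loop that was timed above. That loop relies on X being an undeclared (indexed) array whose subscript is evaluated arithmetically, which is fine for $RANDOM numbers but not for arbitrary text. The function name dedupe and the "x" key prefix are my own choices, and associative arrays assume Bash 4 or later.

```shell
#!/usr/bin/env bash
# Print each input line the first time it appears, preserving order.
dedupe() {
  # Associative array (Bash 4+), so any string can serve as a key.
  local -A seen
  local line
  # IFS= keeps surrounding whitespace; -r keeps backslashes literal.
  # The extra test handles a final line that lacks a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    # Prefix the key with "x" so that an empty line is still a valid subscript.
    if [[ -z ${seen["x$line"]+set} ]]; then
      seen["x$line"]=1
      printf '%s\n' "$line"
    fi
  done
}
```

It reads standard input and writes standard output, so it can be used the same way as the loop above: dedupe < INPUT > OUTPUT. It will still be far slower than Awk, since the work is done line by line in the shell.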

The Awk code is so simple, only 8 characters on one line, and it runs really fast. Nothing could beat that.
