Note
This is only a code snippet, not a general-purpose tool. You will need to modify it to suit your requirements.
I have a Python script to download Blogger.com templates, but it is missing a very important feature: duplicate detection. In other words, it stores duplicates.
I wrote a quick awk script; it might be my first one. It was easy to write: it only took me 10 to 20 minutes to read An Awk Primer and write it.
#!/bin/bash
# 2008-11-28T03:17:10+0800
md5sum *.xml | awk \
'
BEGIN {
  prevhash = "";
}
{
  # $1 is the md5 hash, $2 the filename; a file is deleted when its
  # hash matches the previous line, so duplicates must be adjacent
  if (NR > 1 && prevhash == $1) {
    system("rm " $2);
    printf("%s deleted.\n", $2);
  }
  prevhash = $1;
}'
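For reference, md5sum prints one line per file, with the hash in field 1 and the filename in field 2, so awk's default whitespace splitting separates them. Sample output (the hashes and filenames here are made up):

$ md5sum *.xml
0f343b0931126a20f133d67c2b018a3b  template-2008-11-27.xml
0f343b0931126a20f133d67c2b018a3b  template-2008-11-28.xml
410ad31ad559e68676fca5e28423e0b1  template-2008-11-29.xml

With the first two lines adjacent and equal in field 1, the script would remove template-2008-11-28.xml.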
I know the BEGIN section is unnecessary (awk treats an unassigned variable as the empty string), but I don't like referring to a variable without assigning to it first.
The argument of the system() function is possibly the tricky part: it does string concatenation by simply placing the strings next to each other. I seem to remember similar syntax in some other language, but I can't recall which. Python does something similar with adjacent string literals, for example 'abc' 'def', but that doesn't apply to variables.
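To illustrate, awk concatenates strings just by juxtaposition, with no operator, and that works on variables too. A minimal example (the filename is made up):

awk 'BEGIN { f = "old.xml"; print "rm " f }'
# prints: rm old.xml

So "rm " $2 builds the command string, which system() then hands to a shell.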
The following screenshot shows it in action:
This script finds duplicates by md5sum. Newer duplicates are deleted and the oldest is kept, because the shell expands *.xml in sorted name order (the same order ls lists them), which in my case matches file age. If that ordering changes, this script will break.
Any tips for me, an awk newbie?
This will work only if the duplicate files are grouped together.
For example, if the first and the last file have the same md5sum, the last one will not be deleted.
You can fix this by sorting the output of the md5sum command before you feed it into the awk script:
md5sum *.xml | sort | awk \
or
you can build an array with the md5 hashes as the index and the filenames as values:
if ( $1 in HASH )
    system("rm " $2);
HASH[ $1 ] = $2;
so you can ensure each hash occurs only once (a complete sketch of this approach follows below)
br heinz
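A complete version of heinz's array approach, as a minimal sketch (untested here; like the original, it breaks on filenames containing spaces):

md5sum *.xml | awk '
{
  if ($1 in HASH) {
    # hash already seen: this file is a duplicate of HASH[$1]
    system("rm " $2);
    printf("%s deleted.\n", $2);
  } else {
    # first occurrence: remember the filename for this hash
    HASH[$1] = $2;
  }
}'

Because the array remembers every hash seen so far, no sort step is needed and the duplicates no longer have to be adjacent.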