Note

This is only a code snippet and does not cover general cases. Modify it to suit your requirements.

I have a Python script that downloads Blogger.com templates, but it lacks a very important feature: duplicate detection. In other words, it stores duplicates.

I wrote a quick awk script, which might be my first one. It was easy: writing it and reading An Awk Primer took only 10 to 20 minutes.


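#!/bin/bash
# 2008-11-28T03:17:10+0800
# Deletes a file when its MD5 hash equals that of the file listed just before it.
# Assumes file names contain no whitespace or shell metacharacters.

md5sum *.xml | awk \
'
BEGIN {
    prevhash = "";
}

{
    # $1 is the hash, $2 is the file name
    if (NR > 1 && prevhash == $1) {
        system("rm " $2);
        printf("%s deleted.\n", $2);
    }
    prevhash = $1;
}'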

I know the BEGIN section is unnecessary, but I don’t like referring to a variable before assigning to it.

The argument to the system() function is possibly the tricky part. It is string concatenation: awk joins two strings simply by placing them next to each other. I seem to remember similar syntax in another language, but I can’t recall which. You can do something similar in Python with adjacent string literals, for example 'abc' 'def', but that does not apply to variables.
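As a small illustration of that concatenation, here is a minimal sketch that builds the same kind of command string but prints it instead of running it. The input line (a shortened fake hash and the name template.xml) and the variable names s and cmd are made up for the example.

```shell
#!/bin/bash
# In awk, writing two expressions side by side concatenates them.
# This works for string literals and for variables or fields alike.
# The input line below is made-up sample data, not real md5sum output.
cmd=$(echo "d41d8cd9 template.xml" | awk '{ s = "rm " $2; print s }')
echo "$cmd"   # prints: rm template.xml
```

Printing the command first is also a cheap way to check what system() would run before letting it delete anything.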

The following screenshot shows it in action:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaO2rXBYj48FznPQ0p2b_O35QjRuJrW7O4CQzN265_sFVl2IMsOWchh08pQ-BW7SRRXt1nRIhhM6xjEeB3ux2HcMjGpuR4NdxCvBSlBXesxcIYDsproAw9okgFlYo_kuI9rxUMFg68uso/s800/remove-duplication.png

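This script finds duplicates by md5sum. It compares each file’s hash only with the hash of the file listed just before it, so a newer duplicate is deleted and the oldest copy is kept, provided the file names sort chronologically. The order comes from the shell’s expansion of *.xml (the script does not actually call ls), so if that ordering changes, this script will break.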
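If duplicates might not sit next to each other in the listing, one possible variation is to remember every hash in an awk associative array. The following is only a sketch, not the script used above: the function name list_duplicates is hypothetical, it assumes GNU md5sum's output layout (a 32-character hash, two separator characters, then the file name), and it only prints the duplicates rather than deleting them.

```shell
#!/bin/bash
# Hypothetical helper: print every file whose content was already seen
# earlier in the argument list. Nothing is deleted here.
list_duplicates() {
    # seen[$1]++ is 0 (false) the first time a hash appears and
    # non-zero (true) for each later file with the same hash.
    # substr($0, 35) takes the file name after the 32-character hash
    # and the two separator characters, so names with spaces survive.
    # Assumes names without backslashes or newlines, which md5sum escapes.
    md5sum "$@" | awk 'seen[$1]++ { print substr($0, 35) }'
}
```

Running list_duplicates *.xml prints one name per line. The first file in glob order with a given hash is kept off the list, so it plays the role of the "oldest" copy in the same sense as above.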

Any tips for me, an awk newbie?