I have long wanted to search across multiple documents, and this is my second attempt at finding a solution. The first time, I even wanted to search within encrypted documents, but I didn’t find a way to do so.
This time, I first read about a program called Loook that can do this task. It uses Tk for its GUI, searches a directory recursively, and supports simple keyword matching with AND/OR boolean operations. It gives you a list of matching documents. I tried version 0.6.8, and it did the job.
If you are a shell nut, you might want to do it all yourself: unzip the document file and search within the XML file. You can start with this thread. I don’t recommend this way.
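For the curious, here is a rough sketch of that manual route. An .odt file is a zip archive whose body text lives in content.xml, and unzip -p writes that member to standard output. The strip_tags and odt_xml_grep helper names are made up for this illustration, and the tag stripping is deliberately crude (it does not decode XML entities, for instance).

```shell
# strip_tags: replace every XML tag on standard input with a space (crude, illustration only)
strip_tags() {
    sed -e 's/<[^>]*>/ /g'
}

# odt_xml_grep PATTERN FILE: search the raw text of one document without odt2txt
odt_xml_grep() {
    unzip -p "$2" content.xml | strip_tags | grep -E "$1"
}

# Example: odt_xml_grep 'OpenOffice' 0101GS-WhatIsOOo.odt
```

Since content.xml is often stored as one very long line, grep tends to print huge chunks of text for a single match, which is one more reason to prefer a proper converter.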
Now for the best way, if you are a CLI lover with shell tools at hand (on Windows? I am not sure) and are fine with regular expressions. Actually, you don’t have to be; a simple keyword will do.
Using odt2txt is the best way, in my opinion. Converting a document to plain text makes everything easier. You might argue that the layout, typesetting, and formatting are lost after conversion. Seriously, do you really need colors, bold, italics, etc. in your search results? All you want is the matching text, isn’t it?
So, here is how I do it, as an example. Mind you, I haven’t really done this for a real case of my own. Anyway, I downloaded five documents:
% ll
total 5044
-rw-r--r-- 1 livibetter livibetter 266445 Jan 13 11:36 0101GS-WhatIsOOo.odt
-rw-r--r-- 1 livibetter livibetter 1343180 Jan 13 11:36 0105GS-SettingUpOOo.odt
-rw-r--r-- 1 livibetter livibetter 2628687 Jan 13 11:34 DevelopersGuide_OOo3.1.0_06OfficeDevelopment.odt
-rw-r--r-- 1 livibetter livibetter 305524 Jan 13 11:35 Installation_Guide_OOo3.odt
-rw-r--r-- 1 livibetter livibetter 590155 Jan 13 11:36 ooo2prodflyera3en.odt
The basic commands are as follows:
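odt_regexp='\bword\b'  # extended regular expression passed to grep
odt_matches=3          # stop after this many matching lines per document
odt_lines=1            # lines of context around each match
for doc in *.odt; do
  result="$(odt2txt "$doc" | grep -E -m "$odt_matches" -C "$odt_lines" -n --color=always "$odt_regexp")"
  # Print the file name as a white-on-red header, followed by its matches
  [[ $result ]] && echo -e "\e[37;41;1m${doc}\e[0m\n${result}\n"
  : # keeps the loop's exit status at zero when the last document has no match
done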
odt_regexp is the regular expression to use with grep. odt_matches tells grep how many matching lines to print per document. odt_lines tells grep how many context lines to show around each match. Here is a screenshot using the exact commands shown above:
It matches three out of five documents. You can simply use a keyword like word (without \b), but it will also match "password". This also shows why this method is better: Loook doesn’t seem to support regular expressions or to have an option for matching whole words.
If you need a case-insensitive search, add -i to grep.
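A quick way to see the difference between these patterns is to run them on a few made-up lines. The sample text below is invented for the demonstration. Note that \b is a GNU grep extension; -w is a more portable way to ask for whole words.

```shell
# Three invented sample lines: "password", a standalone "word", and a capitalized "Word"
sample='the password is secret
a single word here
Word at the start'

# Plain keyword: also counts the "password" line
loose=$(printf '%s\n' "$sample" | grep -c 'word')
# Whole word only: the "password" line no longer counts
strict=$(printf '%s\n' "$sample" | grep -c -E '\bword\b')
# Whole word, ignoring case: the capitalized "Word" line counts too
strict_i=$(printf '%s\n' "$sample" | grep -c -i -E '\bword\b')

echo "keyword: $loose, whole word: $strict, whole word with -i: $strict_i"
# prints: keyword: 2, whole word: 1, whole word with -i: 2
```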
If you have many files stored in deeply nested directories, you might want to use find:
find . -name '*.odt' | while IFS= read -r doc; do
  # IFS= and -r keep leading spaces and backslashes in file names intact
  result="$(odt2txt "$doc" | grep -E -m "$odt_matches" -C "$odt_lines" -n --color=always "$odt_regexp")"
  [[ $result ]] && echo -e "\e[37;41;1m${doc}\e[0m\n${result}\n"
  :
done
As for the boolean operations in Loook, you can modify the commands to list file names only, and then use combine (from moreutils) to perform the boolean operations on two file lists. This is more powerful than Loook’s boolean operations, but it requires more shell usage. To list file names, you can use the following commands:
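find . -name '*.odt' | while IFS= read -r doc; do
  # grep reads from a pipe, so it prints "(standard input)" on a match; we only test for any output
  result="$(odt2txt "$doc" | grep -E --files-with-matches "$odt_regexp")"
  [[ $result ]] && echo "${doc}"
  :
done
# To save the list to a file, end the loop with: done > file.lst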
And don’t forget --files-without-match (-L) if you really need to play with boolean operations.
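If combine is not installed, the same set operations can be approximated with sort and uniq. This is only a sketch: the list_or, list_and, and list_not helper names are invented here, and each list file is assumed to hold one path per line, as you would get by redirecting the loop output into a file.

```shell
# list_or A B: paths that appear in either list
list_or() {
    sort -u "$1" "$2"
}

# list_and A B: paths that appear in both lists
list_and() {
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -d
}

# list_not A B: paths in A that are absent from B
# (B is fed in twice, so its paths never occur exactly once and uniq -u drops them)
list_not() {
    { sort -u "$1"; sort -u "$2"; sort -u "$2"; } | sort | uniq -u
}

# Example with two hypothetical lists:
#   list_and word.lst install.lst   # documents matching both patterns
#   list_not word.lst install.lst   # documents matching the first but not the second
```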
All of this is meant to give an idea of how to search multiple documents; it should be enough for simple searches, or even more advanced ones. odt2txt can also convert formats other than .odt. If you have thousands of documents, you will want to convert them all to text at once, store the results on disk, and then search the converted files. But I leave that part of the shell scripting to you.
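For what it is worth, a minimal sketch of that convert-once idea might look like the following. The convert_all and search_cache function names, the cache directory layout, and the ODT_CONVERTER override are all assumptions made for this illustration; only odt2txt, find, and grep come from the commands used earlier.

```shell
# convert_all SRC_DIR CACHE_DIR
# Mirror every .odt file under SRC_DIR as a .txt file under CACHE_DIR.
# ODT_CONVERTER is a hypothetical override for this sketch; it defaults to odt2txt.
convert_all() {
    src_dir=$1
    cache_dir=$2
    find "$src_dir" -name '*.odt' | while IFS= read -r doc; do
        rel=${doc#"$src_dir"/}                 # path relative to SRC_DIR
        out="$cache_dir/${rel%.odt}.txt"       # same relative path, .txt extension
        mkdir -p "$(dirname "$out")"
        "${ODT_CONVERTER:-odt2txt}" "$doc" > "$out"
    done
}

# search_cache PATTERN CACHE_DIR: list the cached text files that match PATTERN
search_cache() {
    grep -r -l -E "$1" "$2"
}

# Example with hypothetical directories:
#   convert_all ~/documents ~/.cache/odt-text
#   search_cache '\bword\b' ~/.cache/odt-text
```

The conversion cost is paid once, and every later search is a plain recursive grep over text files. File names containing newlines would confuse the read loop, so treat this as a starting point only.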