YJL: 002 Multi-processing on same content using Process Substitution

I have occasionally needed to process the same content for different purposes. But before we get into that, please read the following command first:

shuf -n 5 /usr/share/dict/words | nl | tee >(wc -l)

     1 anicut
     2 kinetoscopic
     3 corded
     4 overdevelop
     5 quisquilian
5

One time I wanted to see both the content and the total number of lines, so I wrapped the command up as above. Before this, I had never thought of using tee this way. It’s quick and clean; you can’t ask for more than that.

The tee command echoes what it receives from the pipe to standard output and also writes exactly the same content to one or more files. The >() syntax is called Process Substitution, and it is only one half of it:

Process Substitution

Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files. It takes the form of <(list) or >(list). The process list is run with its input or output connected to a FIFO or some file in /dev/fd. The name of this file is passed as an argument to the current command as the result of the expansion. If the >(list) form is used, writing to the file will provide input for list. If the <(list) form is used, the file passed as an argument should be read to obtain the output of list.

When available, process substitution is performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.
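To make the quoted description concrete, here is a minimal sketch, assuming Bash on a system with /dev/fd or named pipes. The variable names and sample words are made up for illustration. It prints the file name a substitution expands to, reads through <(list), and feeds >(list) from tee. Capturing with $(...) keeps the sketch deterministic, because the command substitution only finishes once the substituted process has closed its output.

```shell
#!/usr/bin/env bash
# A process substitution expands to a file name that the command receives
# as an ordinary argument. On Linux this is typically something like /dev/fd/63.
fd_path=$(echo <(true))
echo "expands to: $fd_path"

# <(list): the command reads the output of list through that file name.
sorted=$(cat <(printf '%s\n' pear apple fig | sort))
echo "$sorted"

# >(list): whatever tee writes to that file name becomes the input of list.
# tee's own copy goes to /dev/null, so only the count from wc -l is captured.
count=$(printf '%s\n' pear apple fig | tee >(wc -l) > /dev/null)
echo "lines: $((count))"
```

The exact file name differs between systems, so treat the first line of output as informational only.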

Combining tee and Process Substitution, you can process the same content in several ways with a single line of code. You don’t need to feed the content to a loop and do the processing in Bash script, like:

while read line; do
  : do something to check the line, e.g.
  case "$line" in
    pattern)
      :
      ;;
  esac
done < FILE

There is nothing wrong with this loop; it’s fine, except for one occasional issue with IFS and read: if you retrieve a line this way, leading and trailing whitespace is stripped from the result. If that whitespace isn’t important, the loop works as is; otherwise, use IFS= read -r line.
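The whitespace issue is easy to see in isolation. The following is a small sketch, assuming Bash (it uses a here-string); the sample string and variable names are made up for the demonstration.

```shell
#!/usr/bin/env bash
input='  padded line  '

# Default IFS: read trims leading and trailing whitespace.
read line <<< "$input"
printf '[%s]\n' "$line"      # [padded line]

# Empty IFS keeps the whitespace; -r also keeps backslashes literal.
IFS= read -r raw <<< "$input"
printf '[%s]\n' "$raw"       # [  padded line  ]
```

The IFS= assignment only applies to that single read command, so the rest of the script keeps its normal word splitting.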

1 Dispatching lines

Say you want to put odd-numbered lines into one file and the rest into another. You can:

shuf -n 5 /usr/share/dict/words | nl |
tee >(awk 'NR%2==1' > /tmp/odd) \
    >(awk 'NR%2==0' > /tmp/even)
echo 'ODD'  ; cat /tmp/odd
echo 'EVEN' ; cat /tmp/even

     1 undreggy
     2 nonmythological
     3 bier
     4 resubmerge
     5 interstage
ODD
     1 undreggy
     3 bier
     5 interstage
EVEN
     2 nonmythological
     4 resubmerge

You can use awk to print certain lines with such expressions: when the expression evaluates to true, the line is printed. NR is basically the line number¹. Alternatively, you can use sed -n '1~2p' (the first~step address is a GNU sed extension) to filter lines. Neither tool is the topic of this series; please see the Further reading section for more information.

[1]	It is actually the number of input records (lines, by default) that have been read so far.
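As a quick check of how NR behaves, here is a sketch on a fixed five-line input. The letters helper is hypothetical, and the sed line assumes GNU sed.

```shell
#!/usr/bin/env bash
letters() { printf '%s\n' a b c d e; }      # hypothetical fixed input

odd=$(letters | awk 'NR % 2 == 1')          # records 1, 3, 5
even=$(letters | awk 'NR % 2 == 0')         # records 2, 4
total=$(letters | awk 'END { print NR }')   # NR once all input is read

echo "odd:   ${odd//$'\n'/ }"
echo "even:  ${even//$'\n'/ }"
echo "total: $total"

# GNU sed only: 1~2 selects every second line starting from line 1.
odd_sed=$(letters | sed -n '1~2p')
echo "sed:   ${odd_sed//$'\n'/ }"
```

In the END block NR holds the total record count, which is why it only looks like a line number while the input is still being read.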

2 Filtering

You can filter and save the results based on whatever criteria you need, for example:

shuf -n 5 /usr/share/dict/words | nl |
tee >(grep -E '[ae]' > /tmp/vowel-ae) \
    >(grep -E '[ou]' > /tmp/vowel-ou)
echo 'AE' ; cat /tmp/vowel-ae
echo 'OU' ; cat /tmp/vowel-ou

     1 us
     2 Ligusticum
     3 unicycle
     4 Sinify
     5 oscurrantist
AE
     3 unicycle
     5 oscurrantist
OU
     1 us
     2 Ligusticum
     3 unicycle
     5 oscurrantist
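Unlike dispatching, these filters can overlap, so one line may land in both results. Here is a reproducible sketch using the five words from the sample run, without the nl numbering. It is a variation and not the command above: each branch tags its matches instead of writing to a file, and the words helper and the AE/OU tags are only for illustration.

```shell
#!/usr/bin/env bash
# Same five words as the sample run, as a hypothetical fixed input.
words() { printf '%s\n' us Ligusticum unicycle Sinify oscurrantist; }

# Both branches inherit the pipe to sort, so sort sees the output of both
# filters and only finishes after both have exited.
tagged=$(words |
  tee >(grep -E '[ae]' | sed 's/^/AE /') \
      >(grep -E '[ou]' | sed 's/^/OU /') > /dev/null |
  LC_ALL=C sort)
echo "$tagged"
```

The final sort is there because the two branches run concurrently and their output order is otherwise not guaranteed.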

3 Conclusion

Process Substitution gives you a quick and simple way to provide data to a program that only accepts file inputs. You can simply use <(echo 'string') without creating a temporary file, or trick cat like this:

cat FILE \
    <(echo 'blah') \
    <(grep PATTERN /path/to/some-file) \
    <(echo 'blah boo') \
    > DEST_FILE
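A runnable version of the same idea might look like the sketch below. The temporary file stands in for FILE, the pattern 'mm' stands in for PATTERN, and the result is captured in a variable instead of DEST_FILE; all of these are assumptions made for the demonstration.

```shell
#!/usr/bin/env bash
body=$(mktemp)                          # stand-in for FILE
printf 'alpha\nbeta\ngamma\n' > "$body"

# cat reads its arguments in order, so the pieces are joined sequentially.
combined=$(cat "$body" \
               <(echo 'blah') \
               <(grep 'mm' "$body") \
               <(echo 'blah boo'))
rm -f "$body"
echo "$combined"
```

Because cat consumes one argument at a time, the order of the output follows the order of the arguments, even though the substituted commands start concurrently.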

Process Substitution has many potential uses that can reduce the complexity of your code, often down to a one-liner.

If you need more than just grep, you can always use an awk script or write a Bash function; Process Substitution works with your own Bash functions, too. There is no need to go back to the loop approach.
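A function defined in the current shell is visible inside a process substitution, since the substitution runs in a subshell of it. Below is a minimal sketch; longest_word and count_capitalized are hypothetical helpers, and the word list is reused from the filtering example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; each reads lines on stdin and prints one tagged line.
longest_word() {
  awk 'length($0) > length(best) { best = $0 } END { print "longest " best }'
}
count_capitalized() {
  printf 'capitalized %s\n' "$(grep -c '^[[:upper:]]')"
}

# Both functions receive the full input from tee; sort makes the order stable.
report=$(printf '%s\n' us Ligusticum unicycle Sinify oscurrantist |
  tee >(longest_word) >(count_capitalized) > /dev/null |
  LC_ALL=C sort)
echo "$report"
```

Each helper sees every line, so adding another analysis only means adding another >(function) argument to tee.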

4 Further reading

Wikipedia has a nice article about Process Substitution.
Process Substitution is also in zsh.
Addresses in info sed.
PATTERNS AND ACTIONS in man awk.
