Since I started writing code for web services' authentication methods, I've often needed percent-encoding somewhere. I used to use Perl for this kind of job:
% echo -n $'Encoded string has 中文\n' | perl -p -e 's/([^A-Za-z0-9-._~])/sprintf("%%%02X", ord($1))/seg' ; echo
Encoded%20string%20has%20%E4%B8%AD%E6%96%87%0A
If you want to do it in pure Bash, there are two issues:
- You need a way to make the Bash builtin printf produce the same value as ord($1) does above, and
- Bash usually runs with Unicode support, which means string indexing gives you characters, not bytes, and a Unicode character can be multi-byte (see the illustration after this list).
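A quick illustration of the second point (a sketch; the outputs assume a UTF-8 locale and a multibyte-aware Bash):

% str='中文' ; echo "${#str}"
2
% printf '%d\n' "'中"
20013

Here ${#str} counts 2 characters, not 6 bytes, and 20013 is U+4E2D, the code point of 中, not any of its three UTF-8 bytes (E4 B8 AD).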
The first key is to have an ord() equivalent in Bash. If you search for bash chr ord on the Internet, you will find something like this:
% printf "%%%02X" "'A" ; echo
%41
I took a look at a few pages, but it seems many just copied from one another, with no explanation of the single quote in "'A". I found the answer in info coreutils 'printf invocation':
- If the leading character of a numeric argument is " or ' then its value is the numeric value of the immediately following character. Any remaining characters are silently ignored if the POSIXLY_CORRECT environment variable is set; otherwise, a warning is printed. For example, printf "%d" "'a" outputs 97 on hosts that use the ASCII character set, since a has the numeric value 97 in ASCII.
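With that behavior, minimal ord/chr helpers are easy to sketch (the names are mine, not from the info page; chr assumes a byte value in 0-255):

ord () { printf '%d' "'$1"; }
chr () { printf "\\$(printf '%03o' "$1")"; }

% ord A ; echo
65
% chr 65 ; echo
A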
The second key is to get bytes instead of characters. Amazingly, it's fairly simple: you just switch to a single-byte locale, e.g. LANG=en_US.ISO8859-1.
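You can watch the switch happen in a subshell (a sketch; it assumes the en_US.ISO8859-1 locale is available on your system and that LC_ALL is unset, so LANG takes effect):

% str='中' ; ( LANG=en_US.ISO8859-1 ; echo "${#str}" ; printf '%%%02X' "'${str:0:1}" ; echo )
3
%E4

The 3 is the byte length of 中's UTF-8 encoding (E4 B8 AD), and the quote trick now yields the first byte rather than the code point.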
The complete code is:
pe () {
    local LANG=en_US.ISO8859-1 ch i
    for ((i = 0; i < ${#1}; i++)); do
        ch="${1:i:1}"
        [[ $ch =~ [._~A-Za-z0-9-] ]] && echo -n "$ch" || printf "%%%02X" "'$ch"
    done
}

pe $'Encoded string has 中文\n' ; echo
The performance isn't good: on my computer the encoding rate is about 3.84 kbytes/second, while Perl's is about 2.182 Mbytes/second. But it should be enough for general use, since you usually won't feed it a whole file, just a short string.
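If you want to reproduce the comparison, a rough harness along these lines works (my own sketch; the sample size is arbitrary and the rates will differ per machine):

% sample=$(printf 'Encoded string has 中文 %.0s' {1..128})
% time pe "$sample" >/dev/null
% time perl -p -e 's/([^A-Za-z0-9-._~])/sprintf("%%%02X", ord($1))/seg' <<< "$sample" >/dev/null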
Nice... Or if you are lazy, you could use the uni2ascii command:

apt-get install uni2ascii
Thanks for the info about uni2ascii! But I don't think it's a real alternative, because not all distros have it in their package repositories, so manual compilation could be required. For instance, the Gentoo Portage tree doesn't have it, or it's under another package name and I couldn't find it.
I failed to mention in my post the purpose of doing the encoding in pure Bash: the goal is to reduce the number of dependencies, and Perl is one in this case.
However, it's good to know many ways to do a job. So, could you kindly post an example (commands and outputs) of encoding with uni2ascii? Other people might want a quick glance. (I am reluctant to download and compile uni2ascii myself, since it's of no real use for me to keep on my hard disk.)
Firstly, I must fess up... my tip above was based on the uni2ascii manpage, which isn't ideal :-( That said, the likely useful form of the uni2ascii command is...
uni2ascii -q -s -aJ <infile.txt >outfile.txt
where the -s option specifies that spaces get percent-encoded too (%20). But having now tested it, I find the percent-encoding comes out okay in general, yet still not for the spaces. So in hindsight it isn't perfect; there must be some detail I have missed.
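If the spaces really are the only thing left unencoded, one possible workaround (my own untested suggestion, not from the manpage) is to fix them afterwards with sed:

uni2ascii -q -s -aJ <infile.txt | sed 's/ /%20/g' >outfile.txt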
But my actual use of uni2ascii has mostly been to convert percent-encoded URLs back into UTF-8, because some apps just won't accept percent-encoded URLs. So fortunately, I can confirm that the uni2ascii command converts URLs back into UTF-8 with the greatest of ease...
uni2ascii -qB <infile.txt >outfile.txt
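As an aside, in the spirit of the post's no-dependency goal, decoding can also be sketched in pure Bash (pd is a hypothetical name of mine; this assumes Bash's printf %b understands \xHH escapes, and that every % in the input starts a valid %HH sequence with no literal backslashes around):

pd () { printf '%b' "${1//%/\\x}"; }

% pd 'Encoded%20string%20has%20%E4%B8%AD%E6%96%87%0A'
Encoded string has 中文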
The uni2ascii package also provides an ascii2uni command, which similarly converts back into UTF-8 with ease...
ascii2uni -qaJ <infile.txt >outfile.txt
and both commands work nicely with pipes, so this can work okay too...
cat infile.txt | uni2ascii -qB | tee outfile.txt
And yes, I totally understand about having fewer deps. My own fetish is for commands that pipe nicely.