the most portable language in the world, awk
22 Feb 2014
The other day, while browsing, I found an article called “The awk origins”. I liked it so much that I decided to learn awk (it’s pronounced “auk”). I had already used one-liners, but I considered larger awk programs unfriendly and the syntax overcomplicated. Once I started diving into it and put my fears aside, I found it quite fun and easy to use. What a powerful tool built on minimal principles!
Awk, an event-driven language
The most important thing in awk (and the one that took me the longest to learn) is that it’s an event-driven language built around 5 important areas:
- begin
- body
- search
- action
- end
This means that every awk program (even the smallest one) is organized into begin, body and end sections, any of which may be left out. The begin and end sections are similar: each is executed only once, at the beginning and at the end of the program respectively. The classic “Hello World” can be written in either one:
$ awk 'BEGIN {print "Hello World"}' < /dev/null
Hello World
$ awk 'END {print "Hello World"}' < /dev/null
Hello World
Every section is defined by its name and its actions, which are written between braces ({}). Awk programs are wrapped in single quotes (') so the shell doesn’t interpret any variables or keywords. Awk programs can also be saved in files and executed directly:
$ cat hello.awk
#!/usr/bin/awk -f
BEGIN {print "Hello World"}
$ ./hello.awk
Hello World
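If the shebang path doesn’t match where awk lives on your system, the same file can be handed to the interpreter with -f instead. A minimal sketch, assuming a throwaway file created with mktemp (the file and the variable names are mine, only for illustration):

```shell
# Write a one-rule awk program to a temporary file (hypothetical path).
prog=$(mktemp)
printf '%s\n' 'BEGIN { print "Hello World" }' > "$prog"

# -f reads the program text from the file, so neither a shebang line
# nor the executable bit is needed.
hello=$(awk -f "$prog")
echo "$hello"

rm -f "$prog"
```

Because the program only has a BEGIN section, awk never reads any input here, so no file or redirection is required.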
In between is the body section, the most powerful one. It defines search patterns and their related actions.
$ awk '/.*/ {print $0}' file
The line above is comparable to $(cat file): the search pattern is /.*/ (any sequence of characters, so it matches every line) and the action is {print $0} (print the current line). The body section is executed once per line, so if a file contains 10 lines, the body runs 10 times and prints the whole content. Any number of pattern-action pairs can be declared within an awk program. The next example looks for root and daemon and prints every line where awk finds either string.
$ awk '/root/ {print $0} /daemon/ {print $0}' /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
If no search pattern is defined for an action, the action is executed once per line. If no action is defined for a search pattern, the default action is to print the current line. If no parameter is given to print, it prints the whole line ($0). Therefore the examples above can be rewritten as follows:
$ awk '{print $0}' file
$ awk '{print}' file
$ awk '/root/ || /daemon/ {print $0}' /etc/passwd
$ awk '/root/ || /daemon/ {print}' /etc/passwd
$ awk '/root/ || /daemon/' /etc/passwd
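One small caveat about the rewrite: two separate rules and a single rule joined with || only behave the same when no line matches both patterns. A quick sketch with made-up sample data (the "rootdaemon" line is invented to trigger the difference):

```shell
# Sample input: the second line contains both "root" and "daemon".
sample='root:x:0:0
rootdaemon:x:1:1
daemon:x:2:2
nobody:x:3:3'

# Two independent rules: every rule that matches prints the line.
two_rules=$(printf '%s\n' "$sample" | awk '/root/ {print $0} /daemon/ {print $0}')

# One rule with ||: the line is printed once if either pattern matches.
one_rule=$(printf '%s\n' "$sample" | awk '/root/ || /daemon/')

printf '%s\n' "$two_rules" | awk 'END {print NR}'   # prints 4
printf '%s\n' "$one_rule"  | awk 'END {print NR}'   # prints 3
```

On a typical /etc/passwd the root and daemon entries are different lines, which is why both forms print the same thing there.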
These alternative ways of writing awk programs are, I think, part of the reason why awk seems like a cryptic language and why so many awk programs are tiny. Awk also defines built-in variables. Some of the most important are:
NR = Number of Records read so far (the current line number)
NF = Number of Fields in the current record
RS = Record Separator (\n by default)
FS = Field Separator (whitespace by default)
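To get a feel for these variables, here is a small sketch that changes the separators instead of relying on the defaults. The input strings are made up, and by_colon and by_semicolon are just names I picked:

```shell
# FS set to ":" through the -F option: two records, three fields each.
by_colon=$(printf 'a:b:c\nd:e:f\n' |
  awk -F: '{ print "record " NR " has " NF " fields; last = " $NF }')
echo "$by_colon"
# record 1 has 3 fields; last = c
# record 2 has 3 fields; last = f

# RS set to ";" in BEGIN: records now end at semicolons instead of newlines.
by_semicolon=$(printf 'a b;c d' | awk 'BEGIN { RS = ";" } { print NR, $1 }')
echo "$by_semicolon"
# 1 a
# 2 c
```

Note that $NF is the value of the last field, while NF alone is the field count.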
If you have a file with this content:
1 2 3
4 5 6
Awk will see 2 records with 3 fields each. So, $(cat -n file) can be emulated in awk with:
$ awk '{print NR, $0}' file
Since the search pattern is missing, the action is executed once for every line, and each time it prints NR followed by the whole line. NR increases by 1 on every iteration. That’s a lot happening in a minuscule definition. Let’s review another example, $(wc -l):
$ awk 'END {print NR}' file
$ awk '{i++} END {print i}' file
NR always increments, so in the first program, when the END section is executed, it prints the total number of lines in the file. The second example is easier to analyze: it has no search pattern, so the action (i++) is executed for every line, and at the end of the program i is printed. It’s amazing how easily other Unix core utilities can be implemented in a single line. Let’s now copy the behavior of $(head):
$ awk 'NR <= 10' file
$ awk -v hl=10 'NR <= hl' file
Does it make sense? Awk is not as difficult as it seems 😉. Sed can also be emulated:
$ awk '{gsub(/original/,"replace"); print}' file
$ awk 'function sed(search, replace) { gsub(search,replace); print } {sed("search","replace")}' file
Awk also supports control structures and functions. The first example uses the gsub function to replace every occurrence of “original” with “replace” in a file, just as sed would. In the second one, a function called “sed” is defined and used to do the same kind of replacement. Awk is a Turing-complete language, so even though it could be seen as a toy tool, it’s a powerful one that can be used to build sophisticated programs. If you’ve read this far, you already know its core principles and are ready to take advantage of its power.
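As a slightly bigger taste of control structures and functions working together, here is a sketch of a word counter. The count_words wrapper, the normalize helper and the sample sentence are all my own inventions for illustration, not something from a standard tool:

```shell
count_words() {
  awk '
    # Hypothetical helper: lower-case a word and drop every character
    # that is not a letter or a digit.
    function normalize(w) {
      w = tolower(w)
      gsub(/[^a-z0-9]/, "", w)
      return w
    }
    {
      # Body: runs once per line and walks over every field.
      for (i = 1; i <= NF; i++) {
        w = normalize($i)
        if (w != "")
          seen[w]++
      }
    }
    END {
      # Array traversal order is unspecified, so sorting is left to sort(1).
      for (w in seen)
        print seen[w], w
    }
  ' "$@" | LC_ALL=C sort -k1,1nr -k2
}

printf 'Awk is fun.\nawk is small; AWK is old\n' | count_words
# 3 awk
# 3 is
# 1 fun
# 1 old
# 1 small
```

The whole thing still follows the same begin, body, end layout: the function is just a named action, the loop and the if live in the body, and the report is printed in END.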
I’m leaving some more examples to get you started. Can you guess how they work?
Awk as a Unix Swiss Army knife
cat file ▷ awk '{print}' file
cat -n file ▷ awk '{print NR, $0}' file
cat -n file ▷ awk '{print FNR, $0}' file
head file ▷ awk 'NR <= 10' file
head -15 file ▷ awk -v hl=15 'NR <= hl' file
cut -d: -f1 /etc/passwd ▷ awk -F":" '{print $1}' /etc/passwd
cut -d: -f1 /etc/passwd ▷ awk 'BEGIN {FS=":"} {print $1}' /etc/passwd
wc -l file ▷ awk '{i++} END {print i}' file
wc -l file ▷ awk 'END {print NR}' file
wc -w file ▷ awk '{total = total + NF}; END {print total+0}' file
grep pattern file ▷ awk '/pattern/' file
grep -v pattern file ▷ awk '!/pattern/' file
sed 's/foo/bar/g' file ▷ awk '{gsub(/foo/,"bar"); print $0}' file
tail file ▷ awk -v tl=10 '{a=a b $0;b=RS;if(NR<=tl)next;a=substr(a,index(a,RS)+1)}END{print a}' file
tail -15 file ▷ awk -v tl=15 '{a=a b $0;b=RS;if(NR<=tl)next;a=substr(a,index(a,RS)+1)}END{print a}' file
tac file ▷ awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file
uniq file ▷ awk 'NR == 1 || a != $0; {a=$0}' file
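The tail emulation is the hardest one to read in that list, since it keeps the last lines in one growing string and trims it with substr. The same idea can be written with an array used as a ring buffer, which I find easier to follow. This is a different technique from the one-liner above, sketched under the assumption that the line count is at least 1 (awk_tail is a name I made up):

```shell
awk_tail() {
  # $1 = number of lines to keep (must be >= 1); remaining arguments are
  # files, or stdin if there are none.
  n=$1; shift
  awk -v tl="$n" '
    { buf[NR % tl] = $0 }     # reuse the slot of the line that left the window
    END {
      # First line number still held in the buffer.
      start = (NR > tl) ? NR - tl + 1 : 1
      for (i = start; i <= NR; i++)
        print buf[i % tl]
    }
  ' "$@"
}

printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | awk_tail 3
# 8
# 9
# 10
```

Only tl lines are ever stored, so memory use doesn’t grow with the size of the input.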
Awk one-liners
awk '$2 ~ /pattern/' file  # print lines whose second field matches pattern
awk '$2 !~ /^[0-9]+$/' file  # print lines whose second field is not a whole number
awk '1; {print ""}' file  # double-space the file
awk 'BEGIN {ORS="\n\n"}; 1' file  # double-space the file
awk 'NF {print $0 "\n"}' file  # double-space only the lines with content
awk 'BEGIN {RS="";ORS="\n\n"}/pattern/' file  # print whole paragraphs where pattern is found
awk '{print $NF}' file  # print the last field of every line
awk '{field=$NF} END{print field}' file  # print the last field of the last line
awk 'NF > 4' file  # print lines with more than 4 fields
awk '{sub(/^[ \t]+/, "");print}' file  # delete whitespace at the beginning of each line
awk '{sub(/[ \t]+$/, "");print}' file  # delete whitespace at the end of each line
awk '{gsub(/^[ \t]+|[ \t]+$/, "");print}' file  # delete whitespace at the beginning and end of each line
awk '{$2=""; print}' file  # empty the 2nd field of every line (its separators remain)
awk '/AAA|BBB|CCC/' file  # print lines containing "AAA", "BBB" or "CCC"
awk '/AAA.*BBB.*CCC/' file  # print lines containing "AAA", "BBB" and "CCC" in that order
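The paragraph one-liner is probably the least obvious of the bunch, so here is a small sketch of it running on made-up text. The sample paragraphs and the word "needle" are mine:

```shell
# Three paragraphs separated by blank lines (made-up data).
text='alpha
needle here

beta
nothing

gamma
another needle'

# RS="" switches awk to paragraph mode: each run of non-blank lines is one record.
# ORS="\n\n" puts a blank line back after every paragraph that gets printed.
paras=$(printf '%s\n' "$text" | awk 'BEGIN {RS=""; ORS="\n\n"} /needle/')
printf '%s\n' "$paras"
# alpha
# needle here
#
# gamma
# another needle
```

The pattern is tested against the whole paragraph, so a match on any of its lines prints all of them.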
References