the most portable language in the world, awk

22 Feb 2014

The other day, while browsing, I found an article called “The awk origins”. I liked it so much that I decided to learn awk (it’s pronounced “auk”). I had already used one-liners, but I considered larger awk programs unfriendly and the syntax overcomplicated. Once I started diving into it and let go of my fears, however, I found it quite fun and easy to use. What a powerful tool, built on minimal principles!

Awk, an event-driven language

The most important thing in awk (and the one that took me the longest to learn) is understanding that it’s an event-driven language based on 5 important areas:

This means that every awk program (even the smallest one) has begin, body and end sections. The begin and end sections are similar: each is executed only once, at the beginning and at the end of the program respectively. The classic “Hello World” can be written in either of them:

$ awk 'BEGIN {print "Hello World"}' < /dev/null
Hello World
$ awk 'END   {print "Hello World"}' < /dev/null
Hello World

Every section is defined by its name and its actions (which are written between {}). Awk programs are wrapped in single quotes (') so the shell doesn’t interpret any variables or keywords. Awk programs can also be written to files and, once the file is marked executable, run directly:

$ cat hello.awk
#!/usr/bin/awk -f
BEGIN {print "Hello World"}
$ ./hello.awk
Hello World
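The three sections can also live together in one program. The following is only a minimal sketch: the three input lines piped in through printf and the shell variable name out are made up for illustration.

```shell
# Pipe three made-up lines into awk and capture what it prints.
out=$(printf 'one\ntwo\nthree\n' | awk '
BEGIN { print "start" }        # runs once, before any input is read
      { print "line:", $0 }    # no pattern, so it runs for every input line
END   { print "end" }          # runs once, after the last line
')
printf '%s\n' "$out"
```

It should print "start", then one "line: ..." entry per input line, then "end".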

In between is the body section, the most powerful one: it defines search patterns and their related actions.

$ awk '/.*/ {print $0}' file

The above line is comparable to $(cat file): the search pattern is /.*/ (any sequence of characters, so it matches every line) and the action is {print $0} (print the current line). The body section is executed once per line: if a file contains 10 lines, the body section will be executed 10 times and will print the whole content. Any number of pattern-action pairs can be declared within an awk program. The next example looks for daemon and root and prints every line where awk finds either string.

$ awk '/root/ {print $0} /daemon/ {print $0}' /etc/passwd

If no search pattern is defined for an action, the action is executed once per line. If no action is defined for a search pattern, the default action is to print the current line. If no parameter is given to print, it prints the whole line ($0). Therefore the above examples can be rewritten as follows (one small difference: with two separate rules, a line containing both root and daemon is printed twice, while the || forms print it once):

$ awk '{print $0}' file
$ awk '{print}'    file

$ awk '/root/ || /daemon/ {print $0}' /etc/passwd
$ awk '/root/ || /daemon/ {print}'    /etc/passwd
$ awk '/root/ || /daemon/'            /etc/passwd
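To check that the shortened forms really behave the same, here is a small sketch. It uses a made-up, passwd-like sample instead of the real /etc/passwd so the result does not depend on the local system; the variable names are hypothetical.

```shell
# Hypothetical sample data in /etc/passwd style.
sample='root:x:0:0
daemon:x:1:1
alice:x:1000:1000'

# Same pattern, three ways of spelling the action.
explicit=$(printf '%s\n' "$sample"  | awk '/root/ || /daemon/ {print $0}')
no_arg=$(printf '%s\n' "$sample"    | awk '/root/ || /daemon/ {print}')
no_action=$(printf '%s\n' "$sample" | awk '/root/ || /daemon/')

printf '%s\n' "$no_action"
# root:x:0:0
# daemon:x:1:1
```

All three variables end up holding the same two lines.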

These alternative ways of writing awk programs are (I think) part of the reason why awk seems like a cryptic language and why so many awk programs are tiny. Awk also defines built-in variables; some of the most important are:

NR = Number of Records read so far (the current line number)

NF = Number of Fields in the current record

RS = Record Separator (\n by default)

FS = Field Separator (whitespace by default)
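A quick way to get a feel for these variables is to print them. This is just a sketch on made-up input; the shell variable names out and by_colon are only for illustration.

```shell
# NR is the record number, NF the field count, $NF the last field.
out=$(printf 'a b c\nd e\n' | awk '{ print NR ": " NF " fields, last = " $NF }')
printf '%s\n' "$out"
# 1: 3 fields, last = c
# 2: 2 fields, last = e

# Changing FS in BEGIN makes awk split fields on ":" instead of whitespace.
by_colon=$(printf 'a:b:c\n' | awk 'BEGIN { FS = ":" } { print NF, $2 }')
printf '%s\n' "$by_colon"
# 3 b
```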

If you have a file with the content:

1 2 3
4 5 6

Awk will see 2 records with 3 fields each. So, $(cat -n file) can be emulated in awk with:

$ awk '{print NR, $0}' file

As the search pattern is missing, the action is executed once for every line, and each time it prints NR followed by the whole line. NR increases by 1 on every iteration. That’s a lot happening in a minuscule definition. Let’s review another example, $(wc -l):

$ awk 'END {print NR}'      file
$ awk '{i++} END {print i}' file

NR always increments, so in the first program, when the END section gets executed, it prints the total number of lines in the file. The second example is easier to analyze: it doesn’t have a search pattern, so the action (i++) is executed for every line, and at the end of the program i is printed. It’s amazing how easily other Unix core utilities can be implemented in a single line. Let’s now copy the behavior of $(head):

$ awk 'NR <= 10'          file
$ awk -v hl=10 'NR <= hl' file
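In the second form, -v sets the awk variable hl before the program starts. Both forms still read the whole file, though. A possible variant, sketched here on made-up input, stops as soon as the limit is passed, which is closer to what head does on large files:

```shell
# Print the first hl lines, then exit instead of reading the rest of the input.
out=$(printf 'a\nb\nc\nd\n' | awk -v hl=2 'NR > hl { exit } { print }')
printf '%s\n' "$out"
# a
# b
```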

Does it make sense? Awk is not as difficult as it seems 😉. Sed can also be emulated:

$ awk '{gsub(/original/,"replace"); print}' file
$ awk 'function sed(search, replace) { gsub(search,replace); print } {sed("search","replace")}' file

Awk also has control structures and functions. The first example uses the gsub function to replace every occurrence of “original” with “replace” in a file, just as sed would. In the second one, a function called “sed” is defined and used to perform the same kind of replacement. Awk is a Turing-complete language, so even though it could be seen as a toy tool, it’s a powerful one that can be used to build sophisticated programs. If you’ve read this far, you already know its core principles and are ready to take advantage of its power.
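Control structures look much like C. As a rough sketch (max_field is a hypothetical helper invented for this example, and the input numbers are made up), a for loop and an if inside a user-defined function can find the largest field of each line:

```shell
out=$(printf '3 9 4\n10 2 8\n' | awk '
# max_field: hypothetical helper, returns the largest numeric field of the current record.
# The extra parameters i and m are the usual awk idiom for local variables.
function max_field(   i, m) {
    m = $1 + 0                        # adding 0 forces a numeric value
    for (i = 2; i <= NF; i++)
        if ($i + 0 > m) m = $i + 0
    return m
}
{ print "max of line", NR, "is", max_field() }
')
printf '%s\n' "$out"
# max of line 1 is 9
# max of line 2 is 10
```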

I’m leaving some more examples to get you started. Can you guess how they work?

Awk as a Unix Swiss army knife

cat file                 ▷ awk '{print}' file
cat -n file              ▷ awk '{print NR, $0}' file
cat -n file              ▷ awk '{print FNR, $0}' file
head file                ▷ awk 'NR <= 10' file
head -15 file            ▷ awk -v hl=15 'NR <= hl' file
cut -d: -f1 /etc/passwd  ▷ awk -F":" '{print $1}' /etc/passwd
cut -d: -f1 /etc/passwd  ▷ awk 'BEGIN {FS=":"} {print $1}' /etc/passwd
wc -l file               ▷ awk '{i++} END {print i}' file
wc -l file               ▷ awk 'END {print NR}' file
wc -w file               ▷ awk '{ total = total + NF }; END { print total+0 }' file
grep pattern file        ▷ awk '/pattern/' file
grep -v pattern file     ▷ awk '!/pattern/' file
sed 's/foo/bar/g' file   ▷ awk '{gsub(/foo/,"bar"); print $0}' file
tail file                ▷ awk -v tl=10 '{a=a b $0;b=RS;if(NR<=tl)next;a=substr(a,index(a,RS)+1)}END{print a}' file
tail -15 file            ▷ awk -v tl=15 '{a=a b $0;b=RS;if(NR<=tl)next;a=substr(a,index(a,RS)+1)}END{print a}' file
tac file                 ▷ awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file
uniq file                ▷ awk 'NR == 1 || ($0 "") != a; {a = $0}' file
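The tail entries above build one long string and trim it on every line. Another way to get the same result, sketched here as an alternative rather than as the approach used in the table, keeps only the last tl lines in an array used as a ring buffer. The names buf and start and the sample input are made up.

```shell
out=$(printf 'l1\nl2\nl3\nl4\nl5\n' | awk -v tl=3 '
{ buf[NR % tl] = $0 }                 # each slot is overwritten every tl lines
END {
    start = NR - tl + 1
    if (start < 1) start = 1          # input shorter than tl lines
    for (i = start; i <= NR; i++) print buf[i % tl]
}')
printf '%s\n' "$out"
# l3
# l4
# l5
```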

Awk one-liners

awk '1; {print ""}' file                     #adds double space
awk 'BEGIN { ORS="\n\n" }; 1' file           #adds double space
awk 'NF {print $0 "\n"}' file                #adds double space to lines with content
awk '{print $NF}' file                       #print last field of every line
awk 'NF > 4' file                            #print lines with more than 4 fields
awk '{ sub(/^[ \t]+/, ""); print }' file     #delete whitespace at the beginning of a line
awk '{ sub(/[ \t]+$/, ""); print }' file     #delete whitespace at the end of a line
awk '{ gsub(/^[ \t]+|[ \t]+$/, ""); print }' #delete whitespace at the beginning and end of a line
awk '{$2=""; print}'                         #delete the 2nd field of every line
awk '/AAA|BBB|CCC/'                          #search and print "AAA", "BBB" or "CCC"
awk '/AAA.*BBB.*CCC/'                        #search and print "AAA", "BBB" and "CCC" in that order