How to get started using awk

by Sukrit Dhandhania on February 5, 2012


awk, sed, and grep are three of my favorite tools in the Linux or UNIX command line. They are all pretty powerful. Today we’ll look at how to get cracking with awk to help you ease into using it. Then we’ll look at some useful awk one-liners to make things a bit more fun for you.

AWK is a programming language designed for processing text-based data, either in files or data streams. It was created at Bell Labs in the 1970s. Although it’s quite old, don’t be fooled by its age. It is extremely powerful and efficient at what it does. Let’s get our hands dirty now.

Before we delve into the more complex workings and usage of awk, let’s get you started on its basics. We’ll create and use a dummy file for this exercise. You can use pretty much any text file, such as a log from your system. I will be using a sample output from one of my favorite system monitoring tools – Dstat. Here’s the output:

[Screenshot: sample output of dstat]

This is an ideal output for awk to handle. awk is great with whitespace-separated content, and with a little help it handles comma- or colon-separated content too. You’ll see why soon. So either create some similar data or copy and paste the example above into a dummy file called something like test.txt. Launch a terminal window on your Linux computer. Almost all flavors of Linux ship with awk; in case you have found one that does not have it for some reason, please install it. From the directory where you have stored the test.txt file, type the following:

# awk '{print}' test.txt

The output should contain the entire contents of the text file. What’s the fun in that?
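A quick aside: `print` with no arguments is shorthand for `print $0`, where `$0` holds the entire current line. A minimal sketch using a throwaway two-line file (`demo.txt` is just an illustrative name, not part of the exercise above):

```shell
# Create a throwaway two-line file.
printf 'first line\nsecond line\n' > demo.txt

# A bare print is shorthand for print $0 (the whole current line),
# so these two commands produce identical output.
awk '{print}' demo.txt
awk '{print $0}' demo.txt
```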

Now let’s see how you can pick a column and print just that one. Execute the following command:

# awk '{print $1}' test.txt

Now we are asking awk to print just the first column of the text file. By default awk splits each line on whitespace (any run of spaces or tabs), so it picks out the first column of the contents without any extra options. You should see something like this in the output:

----total-cpu-usage----
usr
5
13
8
0
1
1
1
0
1
1
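
To see that default whitespace splitting in isolation, here is a one-line sketch (the three column names are just placeholders):

```shell
# awk treats any run of spaces or tabs as one field separator,
# so $2 is the second whitespace-separated word.
echo "usr   sys  idl" | awk '{print $2}'
# prints: sys
```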

You can do the same for any column you like. If you want awk to print the third column, change the command to:

# awk '{print $3}' test.txt

You can also have awk print multiple columns. So if you want the first, third, and seventh columns printed, add them to the command separated by commas.

# awk '{print $1, $3, $7}' test.txt

would do the trick for you:

----total-cpu-usage---- -net/total-
usr idl read
5 93 154k
13 87 0
8 92 0
0 99 0
1 97 0
1 98 0
1 99 0
0 99 0
1 99 0
1 100 0

If you have a trickier file, like /etc/passwd, where the data is separated by colons rather than spaces or tabs, awk doesn’t pick that up automatically. In such cases you can feed awk the correct separator with the -F flag. Use a command like this to print the first column of the file:

# awk -F':' '{print $1}' /etc/passwd

This command will give you an output of the usernames of all the users on your system:

apple
mango
banana
watermelon
kiwi
orange

You can do the same with any other type of separator. You can also use awk to parse your log files. For example, if you want to view all the IP addresses and the related URLs that have been accessed on your web server, you can parse the server’s access log with awk to get this information. Use the following command:

# awk '$9 == 200 { print $1, $7 }' access.log

199.63.142.250 /2008/10/my-5-favourite-hangouts/
220.180.94.221 /2009/02/querious-a-mysql-client-for-the-mac/
67.190.114.46 /2009/05/
173.234.43.110 /2009/01/bicycle-rental/
173.234.38.110 /wp-comments-post.php

Parsing like this lets you figure out whether someone is visiting your website unusually often, which may be a sign that they are scraping information. You can also sort this information. Say you want to know how many times each IP address has visited your website:

# awk '$9 == 200 { print $1 }' access.log | sort | uniq -c | sort -nr

46 122.248.161.1
35 122.248.161.2
26 65.202.21.10
24 67.195.111.46
19 144.36.231.111
18 59.183.121.71
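If you’d rather do the counting inside awk itself, an associative array can replace the `sort | uniq -c` step. This is a sketch against a tiny made-up log (`sample_access.log` and its entries are invented for illustration; as above, field $9 is assumed to hold the HTTP status code):

```shell
# Build a tiny stand-in for access.log: two 200 hits from one IP,
# one 200 hit from another, and one 404 that should be ignored.
printf '%s\n' \
  '1.2.3.4 - - [01/Jan/2020:00:00:00 +0000] "GET /a HTTP/1.1" 200 123' \
  '1.2.3.4 - - [01/Jan/2020:00:00:01 +0000] "GET /b HTTP/1.1" 200 456' \
  '5.6.7.8 - - [01/Jan/2020:00:00:02 +0000] "GET /c HTTP/1.1" 200 789' \
  '1.2.3.4 - - [01/Jan/2020:00:00:03 +0000] "GET /d HTTP/1.1" 404 321' \
  > sample_access.log

# Tally 200-status hits per IP in an associative array, then print
# the counts; sort -nr puts the busiest IP first.
awk '$9 == 200 { count[$1]++ } END { for (ip in count) print count[ip], ip }' \
  sample_access.log | sort -nr
```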