Ze Chen

Bash Tricks (I)

Text Processing Tools: grep, cut, sed, and awk


grep

grep stands for "global regular expression print". It prints the lines of its input that match a given pattern. Suppose the file hosts has the following contents (copied from an external source).

# Copyright (c) 1993-2009 Microsoft Corp.
#
# This is a sample HOSTS file.
#
# For example:
#
#      102.54.94.97     rhino.acme.com          # source server
#       38.25.63.10     x.acme.com              # x client host
# localhost name resolution is handled within DNS itself.
# 127.0.0.1       localhost
# ::1             localhost
192.168.168.200 doug.LoweWriter.com             # Doug’s computer
192.168.168.201 server1.LoweWriter.com s1       # Main server
192.168.168.202 debbie.LoweWriter.com           # Debbie’s computer
192.168.168.203 printer1.LoweWriter.com p1      # HP Laser Printer
192.168.168.204 www.google.com                  # Google

Now we are going to find all the lines that contain .com. The following command

cat hosts | grep ".com"

prints

#      102.54.94.97     rhino.acme.com          # source server
#       38.25.63.10     x.acme.com              # x client host
192.168.168.200 doug.LoweWriter.com             # Doug’s computer
192.168.168.201 server1.LoweWriter.com s1       # Main server
192.168.168.202 debbie.LoweWriter.com           # Debbie’s computer
192.168.168.203 printer1.LoweWriter.com p1      # HP Laser Printer
192.168.168.204 www.google.com                  # Google

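This also includes the lines that are commented out. Moreover, the unescaped . in the pattern matches any character, not just a literal dot. To fix both, we require the first character of the line to be something other than # and escape the dot. The -E option selects extended regular expressions (this particular pattern would also work without it).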

cat hosts | grep -E "^[^#].*\.com"

The output now becomes

192.168.168.200 doug.LoweWriter.com             # Doug’s computer
192.168.168.201 server1.LoweWriter.com s1       # Main server
192.168.168.202 debbie.LoweWriter.com           # Debbie’s computer
192.168.168.203 printer1.LoweWriter.com p1      # HP Laser Printer
192.168.168.204 www.google.com                  # Google
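
The same filtering can be sketched without cat, since grep reads files directly. The snippet below is only an illustration: it builds a small stand-in file (the hosts_file variable and the telecomm.example.org entry are made up for the demo), drops comment lines with -v, and uses -F so that .com is matched as a fixed string rather than as a regular expression.

```shell
# Build a small stand-in for the hosts file (contents are illustrative).
hosts_file=$(mktemp)
cat > "$hosts_file" <<'EOF'
# 102.54.94.97 rhino.acme.com # source server
192.168.168.200 doug.LoweWriter.com # Doug's computer
192.168.168.204 www.google.com # Google
192.168.168.205 telecomm.example.org # fax gateway
EOF

# -v inverts the match, dropping lines that start with '#'.
# -F treats ".com" as a fixed string, so the dot is not a wildcard.
matches=$(grep -v '^#' "$hosts_file" | grep -F '.com')
printf '%s\n' "$matches"

# For comparison: as a regex, ".com" also matches "ecom" in "telecomm",
# so this count includes the telecomm.example.org line.
loose_count=$(grep -v '^#' "$hosts_file" | grep -c '.com')
printf 'lines matched by the unescaped pattern: %s\n' "$loose_count"

rm -f "$hosts_file"
```

Note that grep -v '^#' keeps empty lines, whereas the pattern ^[^#] requires at least one character. For this file the difference does not matter, because the second grep discards empty lines anyway.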

cut

Now we want only the IP addresses and the domain names, leaving out the comments at the end of the lines. The following cut command does the job: -d sets the field delimiter (a single space here) and -f 1-2 selects fields 1 through 2.

cat hosts | grep -E "^[^#].*\.com" | cut -d " " -f 1-2

which prints

192.168.168.200 doug.LoweWriter.com
192.168.168.201 server1.LoweWriter.com
192.168.168.202 debbie.LoweWriter.com
192.168.168.203 printer1.LoweWriter.com
192.168.168.204 www.google.com

cut -d " " doesn't work like String.split() in most programming languages.

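Note that with cut -d " ", every single space counts as a delimiter, so a run of consecutive spaces produces empty fields. In the example below, the lines without an alias have an empty third field and therefore end with a trailing space.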

cat hosts | grep -E "^[^#].*\.com" | cut -d " " -f 1-3

prints

192.168.168.200 doug.LoweWriter.com 
192.168.168.201 server1.LoweWriter.com s1
192.168.168.202 debbie.LoweWriter.com 
192.168.168.203 printer1.LoweWriter.com p1
192.168.168.204 www.google.com 
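
One common workaround, sketched here on a single sample line, is to squeeze runs of spaces with tr -s before handing the text to cut. The variable names are only for illustration.

```shell
# One sample line with a run of spaces before the comment.
line='192.168.168.200 doug.LoweWriter.com             # Doug computer'

# Plain cut: field 3 is the empty string between two adjacent spaces.
raw_field=$(printf '%s\n' "$line" | cut -d ' ' -f 3)

# tr -s ' ' squeezes each run of spaces into one, so field 3 becomes "#".
squeezed_field=$(printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f 3)

printf 'raw: [%s]\nsqueezed: [%s]\n' "$raw_field" "$squeezed_field"
```

After squeezing, cut behaves much more like a split on whitespace, although leading spaces and tabs would still need separate handling.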

sed

sed stands for "stream editor". Among other things, it performs substitutions of the form s/pattern/replacement/, which replace the first match on each line. For example

cat hosts | grep -E "^[^#].*\.com" | cut -d " " -f 1-2 | sed 's/192\.168/255.255/'

prints

255.255.168.200 doug.LoweWriter.com
255.255.168.201 server1.LoweWriter.com
255.255.168.202 debbie.LoweWriter.com
255.255.168.203 printer1.LoweWriter.com
255.255.168.204 www.google.com
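
The difference between replacing the first match and replacing every match can be seen in a small sketch. The sample address and the variable names below are chosen purely for the demonstration.

```shell
addr='192.168.168.200 doug.LoweWriter.com'

# Without the g flag, only the first "168" on the line is replaced.
first_only=$(printf '%s\n' "$addr" | sed 's/168/0/')

# With the g flag, every "168" on the line is replaced.
all_matches=$(printf '%s\n' "$addr" | sed 's/168/0/g')

# Anchoring with ^ restricts the substitution to the start of the line.
anchored=$(printf '%s\n' "$addr" | sed 's/^192\.168\./10.0./')

printf '%s\n%s\n%s\n' "$first_only" "$all_matches" "$anchored"
```

Anchoring the pattern is usually the safer choice when only the network prefix should change, because it cannot touch a later part of the line that happens to contain the same digits.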

awk

awk is named after its authors: Aho, Weinberger, and Kernighan. It is among the most powerful text processing tools. For instance, we can swap the two columns with it, using

cat hosts | grep -E "^[^#].*\.com" | cut -d " " -f 1-2 | awk '{ t = $1; $1 = $2; $2 = t; print }'

which prints

doug.LoweWriter.com 192.168.168.200
server1.LoweWriter.com 192.168.168.201
debbie.LoweWriter.com 192.168.168.202
printer1.LoweWriter.com 192.168.168.203
www.google.com 192.168.168.204
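
Since awk can filter lines and split them into fields by itself, the whole pipeline could also be written as a single awk command. The following is a sketch on a shortened stand-in file (hosts_file is a made-up name for the demo), not a claim that it is the preferred form.

```shell
# Build a shortened stand-in for the hosts file (contents are illustrative).
hosts_file=$(mktemp)
cat > "$hosts_file" <<'EOF'
# 102.54.94.97 rhino.acme.com # source server
192.168.168.200 doug.LoweWriter.com             # Doug computer
192.168.168.201 server1.LoweWriter.com s1       # Main server
EOF

# Skip lines starting with '#', keep lines containing a literal ".com",
# and print field 2 before field 1. By default awk splits fields on runs
# of whitespace, so the repeated spaces cause no empty fields.
swapped=$(awk '!/^#/ && /\.com/ { print $2, $1 }' "$hosts_file")
printf '%s\n' "$swapped"

rm -f "$hosts_file"
```

The comma in print $2, $1 inserts the output field separator, which is a single space by default.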

2021/1/29 23:54:33
