Script: Find Frequency of Words in a File

by mike on February 4, 2012

Frequency of Words in a File
In the 1980s, Bell Labs researcher Jon Bentley posed a challenge: write a program that reads a text file, takes an integer n as input, and prints the n most frequently occurring words, from most to least common. Several programs were written to meet the challenge, but Doug McIlroy answered it in a few minutes with a six-step shell pipeline. As you review his script, note how the problem is broken into simple steps to build the solution.

#!/bin/bash
tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2 | sed ${1:-25}q

The script, a one-liner built from pipes, is more powerful than it might first appear. The first command replaces every run of characters that are not letters (or apostrophes) with a single newline, putting each word on its own line.
tr -cs A-Za-z\' '\n'
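To see what this first stage does, here is a small illustration with made-up input; the sentence is invented for demonstration only.

```shell
# -c complements the set (match everything that is NOT a letter or
# apostrophe); -s squeezes each run of matches into a single newline.
printf "It's one test, one small test." | tr -cs "A-Za-z'" '\n'
# Prints each word on its own line: It's, one, test, one, small, test
```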

The second command changes all upper case to lower case.
tr A-Z a-z
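A quick illustration of the case-folding stage, again with invented input:

```shell
# Map every upper-case letter to its lower-case counterpart so that
# "The" and "the" are counted as the same word.
printf 'Apache HTTPD Server\n' | tr A-Z a-z
# prints: apache httpd server
```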

The third command sorts the lines so that identical words become adjacent.
sort

The fourth command collapses each run of identical lines into a single line prefixed with its count.
uniq -c

The fifth command sorts again, this time numerically on the count field in descending order, breaking ties alphabetically on the word.
sort -k1,1nr -k2
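Here is a sketch of that two-key sort on some made-up counted lines:

```shell
# -k1,1nr: sort on field 1 only, numerically, reversed (largest first);
# -k2: break ties by sorting from field 2 (the word) in ascending order.
printf '2 beta\n3 alpha\n2 apple\n' | sort -k1,1nr -k2
# prints:
# 3 alpha
# 2 apple
# 2 beta
```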

Finally, sed quits after printing the first N lines; N is the script's first argument, with a default of 25.
sed ${1:-25}q
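Putting all six stages together on an invented sentence (here the literal 3 plays the role of the script's first argument):

```shell
# Count the words in an inline sentence and keep only the top 3 lines.
printf 'The cat saw the cat and the dog\n' \
  | tr -cs "A-Za-z'" '\n' | tr A-Z a-z \
  | sort | uniq -c | sort -k1,1nr -k2 | sed 3q
# prints (with uniq -c's leading padding):
#   3 the
#   2 cat
#   1 and
```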

Create the wf utility; in this example the script is saved as wf.sh.

#!/bin/bash
tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2 | sed ${1:-25}q

Make the script executable
chmod 755 wf.sh

Example:
Create a text file that lists all of the contents of the /etc/ directory.
ls /etc > etclist

./wf.sh < etclist
38 conf
28 d
9 rc
5 cron
4 gnome
3 bash
3 ca
3 certificates
3 hosts
3 insserv
3 ld
3 magic
3 mime
3 so
3 tools
2 aliases
2 blkid
2 completion
2 console
2 deny
2 discover
2 dpkg
2 group
2 gshadow
2 issue

Here is an example of the script listing the top 10 words found in httpd.conf.
sh wf.sh 10 < /etc/httpd/conf/httpd.conf
202 the
137 to
77 mod
72 of
61 a
60 module
60 so
59 you
58 modules
58 server
