Abstract
Practice shell scripting by analyzing a web server log file.

Details
Your company's web server is starting to see some action. However, when skimming its logs, you notice activity that does not appear to be generated by a human. Many of the requests come from clients with search-engine names, so you wonder how much of your web server traffic results just from those robots, or 'bots (aka spiders, crawlers, etc.).

Your job is to write a single shell script that analyzes the January 2005 web requests found in the Apache web log file /home/brian/cse271/january-access.log. In particular, your script needs to output a report containing

- the absolute counts and relative percentages of regular browser requests vs. robot requests
- the identities of the top 10 most active robots (and their request counts)
- the IP addresses of the top 10 most active clients (robot or otherwise, and their counts)
- the URLs of the top 10 most common referring web pages (and their counts)
- the URLs of the top 10 most requested web pages (and their counts), regardless of who requested them

Your script is expected to call other UNIX utilities, including, but not limited to, sort, grep, sed, wc, cut, and uniq. It must be self-contained: it may be as long as necessary, but it may not run any other custom script in another file. It should take one or more log files as arguments (no hardcoded filenames!).
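As a starting point, here is a minimal sketch (not a full solution) of the argument handling and one of the top-10 reports. It assumes Apache's combined log format, in which the client IP address is the first space-delimited field.

    #!/bin/sh
    # Minimal sketch: accept one or more log files as arguments and print
    # the ten most active client IPs (combined log format assumed).
    if [ "$#" -lt 1 ]; then
        echo "usage: $0 logfile..." >&2
        exit 1
    fi

    echo "Top 10 most active clients:"
    cat "$@" | cut -d' ' -f1 | sort | uniq -c | sort -rn | head -n 10

The same cut | sort | uniq -c | sort -rn | head pattern applies to the other top-10 reports (robots, referrers, requested URLs) once the appropriate field has been extracted.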
Your script will likely need to read through the data file multiple times. Faster (and often more elegant) scripts minimize the number of passes over the data; one way to do that is sketched below.
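For example, a minimal sketch, assuming the usual Apache DD/Mon/YYYY timestamp format, that extracts the January 2005 lines once into a temporary file so that every later pipeline can reuse it:

    # Sketch only: one pass over the raw logs keeps just the January 2005
    # lines; later analyses read the smaller temporary file instead.
    jan=$(mktemp) || exit 1
    trap 'rm -f "$jan"' EXIT

    grep -h 'Jan/2005' "$@" > "$jan"   # -h: no filename prefixes when several files are given

    total=$(wc -l < "$jan")
    echo "Total January requests: $total"
    # ...the top-10 reports would read "$jan" here rather than "$@"...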
Some background information and hints:
- Properly operating robots will at some point request the /robots.txt resource to find out which URLs on the site they are permitted to retrieve; the sketch after these hints uses this fact to identify robots. (See http://www.robotstxt.org/orig.html if you are interested in the format of the robots.txt file and in web crawlers in general.)
- Requests for /favicon.ico are made by modern browsers to present an icon for the page next to the URL of the page (or to decorate the browser tab).
- The non-constant fields of each line in this file are, in order: client IP address, date of request, HTTP method used (usually GET), the URL requested, the HTTP version supported, the web server's response code (three digits), the size of the response in bytes, the URL of the referring page, and the UserAgent string for the client (a self-identifying string).
- Some crawlers operate from multiple IP addresses.
- The log file contains a few days of December 2004 as well as most of January 2005. Make sure you filter out the December log entries and only report those from January.
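Putting a few of these hints together, here is a hedged sketch of one way to separate robot traffic from browser traffic. It assumes the combined log format described above, where the UserAgent is the sixth double-quote-delimited field; the file names january.tmp and bots.tmp are placeholders for whatever intermediate files your script actually creates, and treating every UserAgent that requested /robots.txt as a robot is only one possible heuristic.

    # Sketch only: treat every UserAgent that ever asked for /robots.txt as
    # a robot, then split the January traffic into robot vs. browser hits.
    jan="january.tmp"    # placeholder: assumed to hold only January 2005 lines
    bots="bots.tmp"      # placeholder: list of robot UserAgent strings

    grep '"GET /robots\.txt' "$jan" | cut -d'"' -f6 | sort -u > "$bots"

    robot_hits=$(grep -c -F -f "$bots" "$jan")   # -F: UserAgents as fixed strings
    total_hits=$(wc -l < "$jan")
    browser_hits=$((total_hits - robot_hits))

    echo "robot requests:   $robot_hits"
    echo "browser requests: $browser_hits"

The fixed-string match counts a line as robot traffic if the robot's UserAgent appears anywhere on it, which is accurate enough for a rough split; a stricter version would compare only the UserAgent field.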
Submission Requirements
- As usual, the script must reside in the cse271.131/p6 subdirectory. Name your script p6.sh.
- Your name must be in the comment section (along with appropriate description, etc.).
- Run touch DONE when the program is ready to be collected.