THE HOLY GRAIL: Server Logs Provide a Wealth of Information To Benefit Your Web Business

They are the topic of much discussion and the subject of many recommendations. They're also misunderstood (or not understood at all), misquoted, mishandled, and often just plain forgotten.
Yet, once tamed, server log files can be a Webmaster's best friend and one of his or her best tools.
In a nutshell, server logs are automatically generated files of information about every electronic activity in which a server has engaged. Stored in ASCII text format, a log may look like a random collection of letters and numbers. Hidden within that alphanumeric soup, however, is a wealth of knowledge about such things as how often individual pages are viewed, how many unique and return visitors a site has received, the average time surfers spend at the site, which sections of the site are most popular during a given period, where visitors came from, what browsers and operating systems visitors are using, what hours and days the server is busiest, what keyword searches lead visitors to the site, what errors users are encountering, and other tidbits. Savvy Webmasters learn to use all of this information when designing and promoting their sites.
Every server collects and stores data differently. Fortunately, there are enough similarities between formats that learning to read and take advantage of server logs is fairly easy. It requires a few definitions, a few examples, and a bit of experience pawing around within the innards of the subject machine.

Defining Terms

Contrary to popular belief, "hits" are not a true measure of Website popularity. A hit is recorded in a server's log every time the server is asked to display or execute something for a visitor: One .gif equals one hit, one HTML file equals one hit, one PDF file equals one hit, one interaction with a Perl script or active server page equals one hit. When a server displays an HTML document containing four graphics and a form, the server records five hits (one for the HTML document and one for each of the graphics). If the visitor submits the form, the server records one hit for the act of submission. The oft-used term "unique hits" is misleading, but hit data can provide enterprising Webmasters with useful information about what images users find most appealing and how much of what they have in their storage space is being accessed often enough to make its storage worthwhile.
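As a rough illustration of how hits accumulate per resource, the short sketch below tallies requests by file extension. The sample log lines and the parsing pattern are simplified assumptions for illustration; real access logs vary by server and configuration.

```python
import re
from collections import Counter

# Hypothetical request fields from an access log; a page with four
# graphics plus one form submission yields the six hits below.
log_lines = [
    '"GET /index.html HTTP/1.0"',
    '"GET /banner.gif HTTP/1.0"',
    '"GET /logo.gif HTTP/1.0"',
    '"GET /photo.gif HTTP/1.0"',
    '"GET /nav.gif HTTP/1.0"',
    '"POST /cgi-bin/form.pl HTTP/1.0"',
]

def hits_by_extension(lines):
    """Count hits grouped by the file extension of the requested resource."""
    counts = Counter()
    for line in lines:
        match = re.search(r'"(?:GET|POST|HEAD) (\S+)', line)
        if match:
            path = match.group(1)
            ext = path.rsplit('.', 1)[-1] if '.' in path else '(none)'
            counts[ext] += 1
    return counts

print(hits_by_extension(log_lines))  # e.g. gif: 4, html: 1, pl: 1
```

A tally like this quickly shows which stored images are pulling their weight and which are dead freight.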

Better measures of site popularity are "page views" and "user sessions." A user session is logged each time a surfer enters a site. It's terminated when the surfer exits, is inactive for a specified period of time that varies by server, or closes the Web browser. "Unique" users are identified by IP address, and this can lead to some inflation of statistics, because most dial-up users are issued a new IP address each time they log on to their access provider's service. The other side of that coin is that some surfers - like those from AOL and other large service providers - sometimes aren't counted as visitors at all, because they access the Web through what is known as a proxy server (which provides access to many people through one IP address). Search engine robots that scour the Web in their efforts to index it also are counted among user sessions, though they really shouldn't be considered "visitors" in the normal sense. Still, user session numbers can provide a fairly clear picture of how widespread a site's fame has become and how often "regulars" are logging on. User sessions also indicate how long surfers stayed within the confines of a site, where they came from, and what path they took through the content: where they entered, what pages they visited, where they exited, and sometimes where they went.
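The session-counting rule described above (a new session when an IP address first appears or returns after a period of inactivity) can be sketched as follows. The 30-minute timeout and the sample requests are assumptions for illustration; real servers use their own configured windows.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity window; varies by server

def count_sessions(requests):
    """Count user sessions from (ip, datetime) pairs, assumed time-ordered.
    A new session starts when an IP is first seen or has been idle past
    the timeout. Note the caveats above: dial-up users get fresh IPs and
    proxy users share one, so these counts are approximate."""
    last_seen = {}
    sessions = 0
    for ip, when in requests:
        if ip not in last_seen or when - last_seen[ip] > SESSION_TIMEOUT:
            sessions += 1
        last_seen[ip] = when
    return sessions

requests = [
    ("127.0.0.1", datetime(2000, 10, 10, 13, 55)),
    ("127.0.0.1", datetime(2000, 10, 10, 14, 5)),   # 10 minutes later: same session
    ("10.0.0.7",  datetime(2000, 10, 10, 14, 10)),  # a new visitor
    ("127.0.0.1", datetime(2000, 10, 10, 16, 0)),   # idle > 30 minutes: new session
]
print(count_sessions(requests))  # 3
```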

"Page views" are logged per user session, and each page is counted only the first time it is requested during a session, regardless of how many times that user returns to the page. If the user reloads or refreshes the page during the session, however, a new page view is recorded. Page view statistics are extremely useful for setting advertising rates and determining banner placement.

"Authenticated users" are those who use passwords to gain access to protected areas of a site, and their activities are logged separately from those of the run-of-the-mill public. "Errors," which usually are logged in a separate file, include records of unsuccessful attempts to access protected content, requests for pages that no longer exist or have moved, mistyped URLs, and failed requests for plug-in content and other resources the server could not supply.

Collecting the Offering

Surfers move along the electronic highway as electromagnetic collections of information known as "packets." At a higher level, each page request a browser sends is an HTTP request whose headers describe the user and his or her environment: the browser type and operating system, the URL to which the request is traveling, the specific page target at that URL (if specified), and the "referer" (the term was misspelled by an author of the original HTTP specification, so complain to him). The visitor's IP address arrives with the network connection itself, and the server stamps the date and time on each request it receives. The referer field may be empty if the user typed a URL into a browser's address window to begin the journey; otherwise the field contains information about where the surfer is traveling from: a search engine, another site, even a desktop bookmark. In the case of search engines, the referer data also usually includes the search terms used to find the link, and may include the relative location of the link within the results (first page, 15th page, etc.).

Servers are inveterate collectors. When a request arrives at its destination, the Web server there snatches all of its data and stores it in an access log, one request per line. If the information the request was sent to find doesn't exist on the server or the user doesn't have permission to access it, the server also stores the request data and its own response in an error log, which has proven invaluable to more than one system administrator beset by hackers or spam.

Server logs are generated automatically and constantly, 24/7; depending upon a site's popularity, its server logs can become very large very quickly, so they should be rotated (archived and restarted) periodically to spare the administrator the embarrassment of a completely full disk drive.

Logs are located in a variety of places on servers, based in part on the operating system the machine is running and administrator preference. Often, they're stored in a directory called, appropriately enough, "logs." They're easy to recognize: Most log files bear the file extension ".log."

Once the logs are located, the real work begins. Although data can be gleaned from raw log files, it's generally much easier and less time-consuming to employ an analysis program that will present the material in easy-to-understand graphs and charts. A host of such software is readily available on the Web; search for "server log analysis" using your favorite engine.

Making Sense

Just for the exercise - and because diehard geeks and data purists can never be convinced to do things the easy way - it's worth looking at how log files are organized. As mentioned before, each line in an access log represents the data collected from a single request. Commas or blank spaces delimit the fields in each entry. Unfortunately, there are no rules governing the order in which the fields are presented, and the number of fields is administrator-settable. An access log entry might look like the following, give or take a few fields:

127.0.0.1, frank, 10/Oct/2000:13:55:36 -0700, "GET /apache_pb.gif HTTP/1.0", 200, 2326, 10453, Mozilla/4.03 [en] (Win95; I), http://www.lycos.com/cgi-bin/pursuit?query=adult+chat&cat=dir
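A minimal parser for an entry in this comma-delimited layout might look like the sketch below. The field names and their order are assumptions matching the sample line; as noted above, real servers order and delimit fields differently, so any real parser must be matched to the server's configuration.

```python
FIELD_NAMES = [  # assumed order, mirroring the sample entry above
    "client_ip", "username", "timestamp", "request",
    "status", "bytes_sent", "processing_ms", "user_agent", "referer",
]

def parse_entry(line):
    """Split one comma-delimited access-log entry into named fields.
    The delimiter and field order are assumptions for this sample format."""
    values = [v.strip().strip('"') for v in line.split(", ")]
    return dict(zip(FIELD_NAMES, values))

entry = parse_entry(
    '127.0.0.1, frank, 10/Oct/2000:13:55:36 -0700, '
    '"GET /apache_pb.gif HTTP/1.0", 200, 2326, 10453, '
    'Mozilla/4.03 [en] (Win95; I), '
    'http://www.lycos.com/cgi-bin/pursuit?query=adult+chat&cat=dir'
)
print(entry["status"], entry["request"])
```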

Moving from left to right, the entry above provides the following information:

Client IP address: "127.0.0.1" represents the location of the machine making the request for information. If the server is set up to "resolve" IP addresses, the entry might be a server name, like "bigboy.sbcglobal.net" or "namespace.domainsforall.com."

Client username: "frank" is the identity the user has taken on, and it usually only shows up in the log when the user has been authenticated. In most cases, this field would bear a hyphen (-), meaning no data was recorded.

Date: "[10/Oct/2000:13:55:36 -0700]" is the date and time the server finished processing the request, presented in the format [day/month/year:hour:minute:second GMT-offset]. The date and time can be recorded in various formats, don't always include the Greenwich Mean Time offset, and may occupy two fields instead of one.
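Assuming the single-field layout shown above, such a timestamp can be parsed with Python's standard datetime module:

```python
from datetime import datetime

def parse_log_time(raw):
    """Parse a timestamp like '10/Oct/2000:13:55:36 -0700',
    with or without surrounding brackets."""
    return datetime.strptime(raw.strip("[]"), "%d/%b/%Y:%H:%M:%S %z")

when = parse_log_time("[10/Oct/2000:13:55:36 -0700]")
print(when.year, when.hour, when.utcoffset())
```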

Target file: "GET /apache_pb.gif HTTP/1.0" specifies the resource or page the request sought (/apache_pb.gif), the method it used (GET), and the protocol it employed (HTTP/1.0).

Status code: Sent back to the client by the server after the request is processed, the data in this field can be very valuable. Codes beginning in 2 (like the "200" in the example) indicate a successful response. Codes beginning in 3 indicate a redirection. Codes beginning in 4 indicate an error caused by the client, and codes beginning in 5 indicate an error on the part of the server. Codes beginning in 1, which appear less often, are informational.
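Classifying a status code by its first digit is a simple lookup, sketched below using the standard HTTP code classes:

```python
STATUS_CLASSES = {
    "1": "informational",
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
}

def classify_status(code):
    """Map an HTTP status code to its general class via its first digit."""
    return STATUS_CLASSES.get(str(code)[0], "unknown")

print(classify_status(200), classify_status(404), classify_status(500))
```

Grouping a log's status codes this way gives an at-a-glance health check: a rising share of 4s or 5s is a problem crying for attention.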

Bytes sent: There were 2,326 ("2326") bytes sent to the client in response to its request. A zero or a hyphen in this field means no content was returned, as happens with the HEAD requests some "bots" and spiders issue when checking pages, so the field can hint at robot activity.

Processing time: "10453" is the time it took the server to process the request, in milliseconds.

Browser and platform: In this case, the visitor was using the English-language version of Netscape 4.03 (Mozilla was Netscape's underlying technology) on a computer running the Windows 95 operating system (the "Win95" token).

Referring URL: This field (http://www.lycos.com/cgi-bin/pursuit?query=adult+chat&cat=dir) indicates whence the visitor came. In this case, he or she was referred by the search engine Lycos based on the search terms "adult" and "chat."
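Pulling the search terms out of a referring URL like the one above is a matter of parsing its query string. The parameter names checked below ("query" for the Lycos URL shown, plus "q" and "p" as assumed alternatives) vary by search engine, so real code must know each engine's convention.

```python
from urllib.parse import urlparse, parse_qs

def search_terms(referer):
    """Extract search terms from a referring URL's query string.
    Candidate parameter names are assumptions; engines differ."""
    params = parse_qs(urlparse(referer).query)
    for key in ("query", "q", "p"):
        if key in params:
            return params[key][0].split()  # '+' is decoded to spaces
    return []

terms = search_terms(
    "http://www.lycos.com/cgi-bin/pursuit?query=adult+chat&cat=dir"
)
print(terms)  # ['adult', 'chat']
```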

In addition to representing this mishmash of information in graphical format, log analysis programs also combine it by type in order to show trends and some other useful stuff.

Making it Work

Why should Webmasters and site administrators be concerned about what's in their server logs? Marketing and usability are two big reasons.

Knowing where users come from and how they're searching for information offers an opportunity to provide what surfers seek. Additionally, if surfers hit a site's home page based on a referral from another site or a search engine, but they leave almost immediately, it may indicate that the site isn't doing a good enough job at convincing them to stay. Perhaps a bit more effort should be expended to make the home page more representative of what's inside, more attractive, or otherwise more robust.

Usability issues - like pages not designed for the majority of browsers visiting - often show up first in log files. Repeated error entries, too, indicate a problem that's crying for attention.

And there's always the money issue. Webmasters who know how many and what kinds of surfers visit their sites are in a much better position to set appropriate advertising rates and attract advertisers who'll stay with them.

In short, server logs are essential Webmaster tools. In the real world, businesses that don't answer the phone or ignore customers standing at the sales counter miss valuable financial opportunities. In the virtual world, those that don't pay attention to their log files make the same grievous mistake.