Forums

Access log is messed up

It seems that the date-sorting on the Access Log is totally messed up, and also it seems to miss lots of access requests....

Sorry about that. One of our web servers crashed hard yesterday, and a consequence is that it's replaying all of the hits that it received since early December -- this is a bug on our side, and we're trying to work out a way to make our servers checkpoint where they are in the access logs periodically so that if they are forced to restart they don't lose track of where they are.

Unfortunately there's nothing we can do about it this time, but it should have caught up soon -- let me know which of your web apps is showing the problem (via "Send feedback" if you want to keep it private) if you'd like me to give an ETA, or to email you more up-to-date logs.

We've just pushed a config update to all of our web servers so that they checkpoint their access log positions every 1,000 lines. This means that if one of them has a hard crash (necessitating a power-off/power-on reboot) then the worst case is that it will replay 1,000 access log lines. This is across all sites served by the server, so will typically mean significantly less than 1,000 duplicates for any one website.

Normal controlled reboots (which are much more common) automatically write a checkpoint, so there should be no duplication when they happen.

If you want to de-duplicate your logs, I've got some throw-away code from ages ago which I used for merging Apache logs from multiple sources. It assumes that you pipe the log files in on stdin and it outputs the exact same log lines sorted into timestamp order and de-duplicated. Of course, this will also remove any entries which are legitimitely duplicates (i.e. request at the same timestamp from the same IP and browser for the same resource). Be aware that it has to read all the log lines into memory in order to sort them, so don't pipe GB of logs into it in one go.

I must apologise for the comparatively unreadable Python - I was writing this in a real hurry at the time. I'm sure it can be easily tidied up into a friendlier (and more efficient) script. It's also quite fragile and makes assumptions about the log file format which held for the logs I was using at the time, but might feasibly be broken by subtle changes.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
#!/usr/bin/env python

from datetime import datetime
import sys

a2dt = lambda s: datetime.strptime(s, "%d/%b/%Y:%H:%M:%S")
last_line = None
for ts, line in sorted((a2dt(i.split(None, 5)[3][1:]), i) for i in sys.stdin):
    if line != last_line:
        sys.stdout.write(line)
    last_line = line