
OT: filesystem load testing

Have just uploaded a wee filesystem load testing script to GitHub. Suggestions welcomed!

https://github.com/pythonanywhere/stress-filesystem

I'll have a quick look when I get a chance, but is there a reason why you didn't go with something standard like dbench?

EDIT: Only glanced at it, but don't the child processes after the first one end up with a non-empty children list and hence wait for their siblings? Strikes me that moving the final for loop into an else clause will arrange that only the parent ends up waiting for the children:

children = []
for procs in range(num_processes):
    child = os.fork()
    if child:
        # Parent: remember the child's PID
        children.append(child)
        # ...
    else:
        # Child: do its share of the work, then leave the loop
        do_work(target_dir, reps)
        break
else:
    # Only reached if the loop wasn't broken out of, i.e. in the parent,
    # so only the parent waits for the children here
    for child in children:
        # ...

(I mean, children waiting for their siblings isn't a problem per se right now, but it's the sort of thing that could catch you out in the future as the code changes).

EDIT 2:

A few other initial comments... Somewhat random stream-of-consciousness list as I spot things; apologies for the lack of organisation.

Python's built-in file objects do a lot of buffering, so I'd be inclined to use os.open() and friends for a more realistic test - you just have to remember to use a try...finally with os.close(), because the filehandles are plain integers and so don't support the context manager protocol. Since that calls the OS's raw IO, each read or write really does go to the file straight away (it still goes via the buffer cache, of course, but that's all part of a filesystem test, as opposed to a disk test, which you wouldn't want to do in Python anyway).
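
For instance, a write helper using the raw calls might look something like this (write_some_raw is just an illustrative name, not something from the script):

import os

def write_some_raw(path, data):
    # Raw, unbuffered IO: the descriptor is a plain int, so there's no
    # context manager - use try...finally to guarantee the close.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)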

Small point - why does the parent initialise start at global scope when it's never then used? It can just be dropped, AFAICT.

Currently there's a small chance the children could tread on each other's toes and access the same file. You could argue that makes for a more realistic test, but I'm guessing it's not intentional - any reason not to just use a separate subdirectory for each child process? Clashes are pretty unlikely in practice, admittedly. On a related note, random.random() returns a float - random.randint() might be cleaner for generating the filenames. Minor point, really.
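
Something along these lines, perhaps (child_subdir and random_filename are just hypothetical helpers, not names from the script):

import os
import random

def child_subdir(target_dir):
    # Hypothetical helper: give each child its own subdirectory, keyed
    # on its PID, so the children can't touch each other's files.
    path = os.path.join(target_dir, "child-%d" % os.getpid())
    os.makedirs(path)
    return path

def random_filename(path):
    # randint gives an integer directly, rather than formatting a float
    return os.path.join(path, "file" + str(random.randint(0, 100000000)))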

For more accurate timings, don't record start in the child until you've done the unrelated housekeeping with hashlib. I'd also pre-generate the list of random filenames before starting the timer - it's going to be slightly faster to run through a pre-generated list than to call random each time around the loop. I'd be tempted to do something like:

import os
import random
import time

def filename_list(path, num_files):
    for i in xrange(num_files):
        yield os.path.join(path, "file" + str(random.randint(0, 100000000)))

def do_work(path, reps):
    # ... Do housekeeping
    # Pre-generate the operations and their filenames so that the timed
    # loop does nothing but call the IO functions
    ops = [(random.choice((read_some, write_some)),
            list(filename_list(path, NUM_FILES_TO_WRITE)))
           for i in xrange(reps)]
    start = time.time()
    for func, filelist in ops:
        func(filelist)
    # ... Finish off here

To use this, you'd also need to modify the functions to take the file list rather than generating the names themselves. Normally, of course, lazy evaluation is one of Python's strengths, but I assume you want to minimise the overhead here to get the best model of performance.

On a related note, why the time.sleep()? I'd assumed the idea was to push the filesystem at maximum capacity and get the shortest possible timings.

Also, you seem to be assuming that your fork() loop will generate consecutive PIDs, but that's not guaranteed, I'm afraid, so consecutive runs might not produce results which are directly comparable.
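
Going back to the file-list point, read_some might end up looking roughly like this (the body is purely illustrative - I'm only guessing at what the current implementation does):

import os

def read_some(filelist):
    # Takes a pre-generated list of paths instead of building filenames
    # itself, so the timed loop measures nothing but the IO.
    for path in filelist:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.read(fd, 4096)
        finally:
            os.close(fd)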

Dbench, eh? See, that's exactly what I was hoping for, someone to point me to where others have totally solved the problem already.

Quoting README.md:

We're building a few tools to test various new filesystems + fileservers against. We thought we'd put them out there, partially in case anyone else finds them useful, but mostly in the hope that someone will come along and say "You idiots! You're doing it all wrong! And besides, the perfect tool for this already exists, it has done since the 70s, here's a link".