Forums

OSError: write error in error.log

I see a lot of "OSError: write error" message in my error log. Been there for weeks.

I'm not getting any errors in my app although I am writing data to disk to cache it.

Is this coming from the server level? It appears so.

Should I be concerned? Any insight as to the source of the error?

Thanks, Chris

We had some trouble with networking on AWS that meant that one of our fileservers was slow. That may have affected your web app. It appears to be working now, so if your web app is still having trouble then it may be something else.

The app seems completely fine and I'm not getting an exceptions, etc. This seems to be coming from the server. Really the only impact seems to be that it is cluttering the log but if it is actually coming from my app that would be great to know.

Are they appearing alongside GeneratorExit messages? If so, this help page should explain what's going on.

Thanks, no. Just the OSError: write error message. Here's an excerpt:

2018-11-26 18:58:20,040: OSError: write error
2018-11-26 19:11:08,110: OSError: write error
2018-11-26 20:04:12,727: OSError: write error
2018-11-26 21:13:56,156: OSError: write error
2018-11-26 21:13:56,156: OSError: write error
2018-11-26 21:14:21,486: OSError: write error
2018-11-26 22:03:23,262: OSError: write error
2018-11-26 22:05:18,174: OSError: write error
2018-11-27 03:25:17,531: OSError: write error
2018-11-27 03:47:52,955: OSError: write error
2018-11-27 05:41:26,548: OSError: write error
2018-11-27 11:00:31,114: OSError: write error
2018-11-27 12:10:22,251: OSError: write error
2018-11-27 12:33:31,353: OSError: write error
2018-11-27 12:33:37,890: OSError: write error
2018-11-27 15:53:30,449: OSError: write error
2018-11-27 18:59:08,227: OSError: write error

Interesting. Is it correlated with slow requests at the same time in the access log? Each request handled by your site has one line there, and at the end of the line there will be a response time.

Thanks for the suggestion. I spot checked a few from Dec 02 (errors vs. access log) and some but not all of them correlate to longer requests. I am requesting data from an API that can take up to three seconds to respond, so my requests are occasionally that long.

During the longer requests I cache the results to disk and do a bunch of stuff with zero length files to provide locking functionality. I have some exception handling in there that could be hiding some errors. They all cover pretty obscure cases (lock file didn't exist but another process created between the check and the touch with exist_ok=False). Given my traffic volume, I don't expect there would be very many collisions.

In any event I have added debug messages to those three exception handlers to see if they correlate to the messages in the error log.

If so sorry for the hassle! I will let you know either way.

Chris

OK -- looking forward to hearing the results of your investigations!

It does not appear that the failed touch nor other exceptions are causing the message. As far as I can tell this doesn't originate from my code, but I could certainly be missing something.

Here's what I did:

  • Turned on a lot of log / debug messages.
  • Generated 12 requests / sec of test traffic to the site.
  • I can see the various scenarios getting exercised in the code, including the failed touch call and related exception handling that I was suspicious of.
  • I let the test traffic run for maybe 10 minutes. This generated approximately 7,000 requests (some take longer). On a given day the site only receives 300 - 700 page views so this was a good test.
  • Checked the error log and didn't find any "OSError: write error" messages. I did see one from earlier but none during the test period.

Note also that for normal users I don't do anything unique to their identity (I do basic auth for admin functions only). So generating traffic this way should be a fair simulation of normal traffic.

If it's not coming from your code, it could be coming from the uWSGI middleware -- that's certainly where the OSError: write error messages combined with GeneratorExit messages that I mentioned earlier come from.

Maybe we can see if it's just a different symptom of the same cause? If you were to make a request to the site with a super-short timeout -- perhaps pick the slowest view -- maybe we could see if the same error appears:

requests.get('https://yoursite.com/slow-view', timeout=0.1)

I created a test script with a short timeout and ran it along side my test traffic generator. I caught the timeout exceptions and printed them. I can see a lot of timeouts.

I do see a lot of " OSError: write error". Apparently they are coming from the client timeouts.

In the sever log I see the result of the timeout with errors like the ones below. I understand and expect these.

2018-12-06 05:47:44 Thu Dec  6 05:47:44 2018 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request / (ip 10.0.0.98) !!!
2018-12-06 05:47:44 Thu Dec  6 05:47:44 2018 - uwsgi_response_writev_headers_and_body_do(): Broken pipe [core/writer.c line 306] during GET / (10.0.0.98)

The strange thing I see a lot more 'write error's vs. the 'writing to closed pipe' errors above on my production site but now that I know what I am looking for I can find those that are there.

Thanks for your help tracking this down! I was fixated on this being a disk error. Didn't think about pipes / client disconnects.

Chris

Excellent, glad you were able to get to the bottom of it!