Gavrie's Blog

Whatever I find interesting.

Buffers Will Fill Up... Eventually

(Updated version: Incorporates fixes from reader comments)

A colleague asked me to look into a problem with him, mentioning that “Tlib hangs when we run it”. Tlib is a fairly large project that is written in Python. His initial analysis showed that it hangs at a very early phase, during with it tries to fetch the latest version to run from a git server. Various users complained about the same problem, suggesting that it is not a local issue.

Running the code, and interrupting it with ^C when it hangs, turned up a result similar to the following:

^CTraceback (most recent call last):
  File "./execute_wrong.py", line 8, in <module>
    retval, out, err = execute("git ls-remote")
  File "./execute_wrong.py", line 5, in execute
    retval = p.wait()
  File "/usr/lib/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

Interesting. The code seems to hang while waiting for the git child process to terminate. However, running git ls-remote from the command line works fine, so why does it hang when run from the code?

Investigating the Issue

Let’s look at the code history to see if anything has changed recently.

Nope. A quick git blame shows that this code is more than 5 years old, and has been performing flawlessly every day since then. So what else has changed?

Let’s see what is happening while waiting for the git child process to finish, by running strace on the process:

$ strace -p 5774
Process 5774 attached
write(1, "pull/624/merge\n2fe9da91f5a4b5ba3"..., 4096

OK. The process is trying to write to stdout, but hanging while doing so. It looks like some buffer is getting filled, which blocks the process from writing further.

Let’s look at the size of the data returned by the child process:

$ git ls-remote | wc -c
From gitserver:/git/qa/tlib
   65572

Aha! This number looks suspiciously like “a bit more than 64k”. Which affirms our hypothesis. Let’s try to reduce its size by deleting some old remote branches:

$ git push origin :old_branch1
$ git push origin :old_branch2
...
$ git ls-remote | wc -c
From gitserver:/git/qa/tlib
   65368

Now to run the program again… It worked! The program continues to run successfully.

The Code

Here is what the original code looked like (slightly changed to protect the innocent). Can you spot what is wrong?

The original code (execute_wrong.py) download
1
2
3
4
5
6
7
8
import subprocess

def execute(command):
    p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    retval = p.wait()
    return retval, p.stdout.read(), p.stderr.read()

retval, out, err = execute("git ls-remote")

The problem is that we wait() for the process to terminate, without reading its output. Only after it terminates do we read its output. This code has worked correctly for years, since the output so far happened to be smaller than 64k and fitted completely in the pipe’s buffer. Once it exceeded the buffer’s size due to one remote branch too many, it blocked the process on the pipe, while the parent was waiting for it to terminate. A classic deadlock condition.

Fixing the Problem

Now that we see the problem, fixing it is simple: First read, then wait.

The fixed code (execute_right.py) download
1
2
3
4
5
6
7
8
9
10
import subprocess

def execute(command):
    p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out = p.stdout.read()
    err = p.stderr.read()
    retval = p.wait()
    return retval, out, err

retval, out, err = execute("git ls-remote")

Unfortunately, as Alon Horev and Baruch Even pointed out in theirs comments, this is still not correct. Since we try to read from stdout until it gets an EOF and only then try to read from stderr, the stderr pipe’s buffer can fill up if the child process writes a lot of data to it.

The correct solution involves reading from both pipes in tandem. The communicate() method of the Popen class implements this using the select() system call on POSIX, and with threads on Windows platforms.

Conclusion

Buffers, filesystems, databases will all fill up sometime in the future. Always take this into account when writing code. Any time you generate some data, be sure to set up a process to prune the data and don’t assume that it will be OK – because it will come back to bite you, or at least some future user of your code.

Comments