Gavrie's Blog

Whatever I find interesting.

A Subprocess Bug? Nah.

A few weeks ago, a colleague came to me with an interesting bug: When running a child process with Python’s subprocess module, no exception is thrown when the child process fails. In essence, what happened was the following (typed at the interactive Python prompt):

>>> import subprocess
>>> subprocess.check_call("false")
0

The false command always exits with a nonzero exit code. The expected behavior, as confirmed on another machine, would be as follows:

>>> subprocess.check_call("false")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 542, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'false' returned non-zero exit status 1

So, what happens is that subprocess thinks that the child process exited successfully, even though it did not.

This behavior, of course, wreaks total havoc with the application. Instead of an exception being thrown due to the failed child process, the application goes on and fails at a later point when it tries to do something based on the data received from the child process. This goes against the good design principle of failing early, as embodied by the way Python uses exceptions instead of error codes.

So, what is going on here? You may want to think it through and see if you can find the problem.
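For reference, check_call() is essentially call() plus an exit-status check. A simplified sketch of its logic (paraphrased, not the exact subprocess.py source):

```python
import subprocess

def my_check_call(*popenargs, **kwargs):
    # Roughly what subprocess.check_call() does: run the command and
    # wait for it, then raise if the exit status is nonzero.
    retcode = subprocess.call(*popenargs, **kwargs)
    if retcode:
        raise subprocess.CalledProcessError(retcode, popenargs[0])
    return 0
```

Seen this way, the only path to the behavior above is for call() itself to have returned 0 — that is, subprocess really did see a zero exit status reported for the child. That's a useful hint while you think it through.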

Buffers Will Fill Up… Eventually

(Updated version: Incorporates fixes from reader comments)

A colleague asked me to look into a problem with him, mentioning that “Tlib hangs when we run it”. Tlib is a fairly large project that is written in Python. His initial analysis showed that it hangs at a very early phase, during which it tries to fetch the latest version to run from a git server. Various users complained about the same problem, suggesting that it is not a local issue.

Running the code, and interrupting it with ^C when it hangs, turned up a result similar to the following:

^CTraceback (most recent call last):
  File "./execute_wrong.py", line 8, in <module>
    retval, out, err = execute("git ls-remote")
  File "./execute_wrong.py", line 5, in execute
    retval = p.wait()
  File "/usr/lib/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

Interesting. The code seems to hang while waiting for the git child process to terminate. However, running git ls-remote from the command line works fine, so why does it hang when run from the code?
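The title gives it away: a child's stdout/stderr pipes have a fixed-size kernel buffer. If git writes more output than fits while the parent is blocked in p.wait(), the child blocks in write() and both sides wait on each other forever. The usual remedy is communicate(), which keeps draining the pipes while waiting. A sketch of a fixed helper (the original execute() isn't shown, so its signature is assumed from the traceback):

```python
import shlex
import subprocess

def execute(command):
    p = subprocess.Popen(shlex.split(command),
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    # communicate() reads both pipes until EOF and only then reaps the
    # child, so it can never block on a full pipe buffer while we wait.
    out, err = p.communicate()
    return p.returncode, out, err
```

With this version, even a command producing megabytes of output completes instead of hanging.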

Why I Don’t Like RPC

Lately I’ve been working on modifying an architectural aspect of an existing software project. This project makes heavy use of remote execution of code on several hosts. To accomplish this feat, it uses several different methods for remote execution: SSH for running general shell commands, RPyC for executing arbitrary Python code remotely, as well as a couple of proprietary interfaces.

One issue I encountered with the current design is that it is quite difficult to make a clean separation between code running locally and code running remotely. This turns debugging any problem involving execution of remote code into an incredibly complicated endeavor.

Much clarity could be gained by changing the architecture of this project to be more explicitly distributed. This would involve several agents running on multiple hosts that communicate amongst themselves to get the work done. The agents would have a clean and documented API, making them usable and testable as standalone components, as well as allowing them to act on behalf of a central process. All remote execution would be explicit, using one single method for any kind of execution – be it shell commands, Python code or anything else.

The YAGNI/NIH Conundrum

Disclaimer: This post is about software development, but it is more about the human side of it than about the technical side. This means you may enjoy it even if you’re not a software developer (or a technical person at all).

One of the recurring issues that I encounter as a software developer is having to decide where on the scale between YAGNI and NIH I want to be. You are probably wondering what the hell I am talking about (unless you’re one of those people who are nodding their heads now and smiling), so I’ll explain.

YAGNI stands for “You ain’t gonna need it”, and refers to a software development principle according to which you don’t want to start writing a huge bunch of code up front until you know exactly what you are going to need. Since software requirements change all the time, and one of the common maladies of many code bases is superfluous complexity due to over-engineering and over-abstraction, it makes a lot of sense to implement only what you know you are going to need right now, and leave the rest for later.

NIH stands for “Not Invented Here”, and refers to a reluctance among some software developers to base their work on code written by someone else: “Hey, I can do this myself, so why should I use this existing project?” This leads people to reinvent the wheel time and again, creating a proliferation of half-baked solutions to problems that have been solved fairly well at other times and places. In addition to the waste of resources in creating the solution in the first place, it also creates yet another maintainability headache.

Generating Virtual FC WWPNs for a Storage Lab

At work, I encountered an interesting problem: While testing the Fibre Channel (FC) scalability of a storage product, we needed to create a lot of FC connections between hosts and the storage system. This would in turn require a large number of FC initiators, each with a unique World-Wide Port Name (WWPN).

The easiest and cheapest method to set up a lot of initiators without actually purchasing zillions of FC HBAs would be to use N_Port ID Virtualization, a.k.a. NPIV. This method allows a single FC HBA to present itself to the FC fabric with multiple WWPNs. This, in turn, allows the creation of many connections to the target storage device from a small number of hosts.

The Problem

WWPNs can’t just be pulled out of thin air. They are allocated – in chunks – by a central authority, the IEEE Registration Authority. Just making up random WWPNs could cause trouble for two reasons:

  1. The WWPN must be unique on the fabric, which means it must be generated in a deterministic way so that two hosts won’t be using the same WWPN and thereby confuse the fabric.
  2. The WWPN should not have a chance of clashing with official WWPNs of purchased HBAs.

With physical (as opposed to virtual) HBAs, this is managed by allocating an OUI (Organizationally Unique Identifier) to every vendor, which in turn tacks on its own vendor-specific serial number to come up with a unique WWPN. This is similar to the MAC address allocation of Ethernet, Wi-Fi and Bluetooth devices.

The textbook solution to our problem would have involved the use of an officially allocated OUI to generate legal WWPNs, but that seemed like overkill for a lab project that would never be used on a production SAN.
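One pragmatic alternative is to derive the vendor-specific bits deterministically from the hostname and a per-host port index. The sketch below uses an entirely made-up OUI and field split — treat it as an illustration of the idea, not the scheme we actually deployed:

```python
import hashlib

def generate_wwpn(hostname, index, oui="001122"):
    """Deterministically build a 64-bit NAA type-2 style WWPN.

    The OUI is a placeholder; anything colliding with real HBAs on
    the fabric would defeat the purpose.
    """
    # 16 bits hashed from the hostname keep hosts apart; the 8-bit
    # index keeps virtual ports on the same host apart. Same inputs
    # always produce the same WWPN, so regeneration is reproducible.
    host_bits = hashlib.sha1(hostname.encode()).hexdigest()[:4]
    return "2000%s%s%02x" % (oui, host_bits, index)
```

Because the result is a pure function of (hostname, index), two runs on the same host always agree, and no central bookkeeping is needed in the lab.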

Improving Readability and Flow Control in Python

Recently, a colleague and I refactored a piece of existing code that had new behavior added to it. During the process, we managed to improve the readability of the code using several techniques that I’ll describe below.

Where We Started

The original code was fairly simple: It decides whether certain “dead” components need to be “revived”, and presents the user with a prompt to choose from one of several actions. Depending on the user’s choice, the code then proceeds to take the appropriate action:

The original code (code0.py)
def revive_dead_components():
    choice = choose('"I see dead components..."\n'
       "Do you wish to (r)evive them, (c)ontinue without reviving, or (q)uit?",
       { "r": "revive",
         "c": "continue",
         "q": "quit"},
       default = "quit")

    if choice == "revive":
        revive_components()
    elif choice == "quit":
        raise TestCannotRunException("Dead components exist")
    elif choice == "continue":
        pass

Adding New Behavior

We now wanted to add some new functionality to the above code, namely the ability to let the user select a subset of the components to revive.

To make things more foolproof, if the user chose to revive selected components but then neglected to select any from the list, the code would not proceed blindly but rather send the user back to the menu so that he could try again.

The first version of the new code looked like this:

New behavior, first version (code1.py)
def revive_dead_components():
    while True:
        choice = choose('"I see dead components..."\n'
            "Do you wish to (r)evive them all, (s)elect components to revive, "
            "(c)ontinue without reviving, or (q)uit?",
            { "r": "revive",
              "s": "select",
              "c": "continue",
              "q": "quit"},
            default = "quit")

        if choice == "revive":
            revive_components()
            break
        elif choice == "quit":
            raise TestCannotRunException("Dead components exist")
        elif choice == "continue":
            break
        elif choice == "select":
            selected = show_menu("Which components would you like to revive?")
            if selected:
                revive_components(selected)
                break
            else:
                logger.info("Nothing selected...")

To get the required behavior, we used an infinite while True loop that terminates with an explicit break when the user makes a valid choice, and loops around again otherwise. This ensures that we don’t continue until a valid choice is made.

Can We Do Better?

The problem with the above method is that the flow control is not immediately apparent when looking at the code: It’s not obvious that the infinite loop should actually terminate in all but one case. A future developer could easily break this behavior.

In addition, the if/elif ladder becomes a bit too long to read comfortably.

The second iteration was meant to make the flow control clearer:

Refactoring, first try (code2.py)
def revive_dead_components():
    class InvalidChoiceError(Exception): pass

    def choice_revive():    revive_components()
    def choice_quit():      raise TestCannotRunException("Dead components exist")
    def choice_continue():  pass
    def choice_select():
        selected = show_menu("Which components would you like to revive?")
        if not selected:
            logger.info("Nothing selected...")
            raise InvalidChoiceError

        revive_components(selected)

    while True:
        choice = choose('"I see dead components..."\n'
            "Do you wish to (r)evive them all, (s)elect components to revive, "
            "(c)ontinue without reviving, or (q)uit?",
            { "r": "revive",
              "s": "select",
              "c": "continue",
              "q": "quit"},
            default = "quit")
        try:
            locals().get('choice_%s' % choice)()
            break
        except InvalidChoiceError:
            continue

We use several techniques here to improve the clarity of the code:

  • We used internal functions to encapsulate the possible actions to take. The advantage of using internal functions is that it keeps the external namespace clean, and the naming of each function makes its purpose quite clear.

  • We used a dictionary instead of the if/elif construct. Since Python doesn’t have a switch or case statement, this is a more readable replacement.

  • We decided to use an exception to signify, well, exceptional flow control: If the user hasn’t selected any components, this warrants exceptional behavior. This technique is much debated, but we felt like it was appropriate in this case.

  • The function to be called is determined dynamically at runtime from the user’s selection. The idea was to avoid code duplication by not having to spell out the names of the functions yet again (but see below).

Removing Some Coolness For Readability

My colleague pointed out that the locals().get('choice_%s' % choice)() trick is not quite readable. I agreed, and was happy to accept his improved proposal:

Refactoring, final version (code3.py)
def revive_dead_components():

    # ...

    while True:
        choice_func = choose('"I see dead components..."\n'
            "Do you wish to (r)evive them all, (s)elect components to revive, "
            "(c)ontinue without reviving, or (q)uit?",
            dict(r = choice_revive,
                 s = choice_select,
                 c = choice_continue,
                 q = choice_quit
                 ),
            default = choice_quit)
        try:
            choice_func()
            break
        except InvalidChoiceError:
            continue

This version has several advantages:

  • It doesn’t use “magic” to achieve the selection of the function. Duplication is better than magic in this case, since it makes the code more readable.

  • The dictionary is created using the dict() syntax instead of the {...} syntax, which gets rid of a lot of punctuation and makes the code clearer.

Conclusion

It turns out that even in such a simple piece of code, several programming techniques can be used to make the code clearer to read and maintain.

See you next time!