Tail Recursion, Objective-C, and ARC

A tale of conflicting optimizations.

Published on Saturday, 2014-05-24.

Tail recursion is a way to perform recursion without using a stack frame. Put another way, tail recursion is writing iterative loops using recursive syntax.

For a contrived but simple example, consider a linked list definition, with a method to calculate the length:

@interface ListNode : NSObject
@property (nonatomic, copy) NSString *name;
@property (nonatomic, strong) ListNode *next;
- (NSUInteger)length;
@end

A simple, recursive implementation of -length could be:

// v1
- (NSUInteger)length {
    return 1 + [self.next length];
}

This exploits the fact that calls to nil will return zero¹, which happens to work in this specific case because the base case is zero. But we’ll need to play with the base case shortly, so let’s make it explicit:

// v2
- (NSUInteger)length {
    return [[self class] lengthOfListWithHead:self];
}

+ (NSUInteger)lengthOfListWithHead:(ListNode *)node {
    if (!node)
        return 0; // base case
    else
        return 1 + [self lengthOfListWithHead:node.next];
}

This code is not tail-recursive, and running it will create a stack frame for each list node. This is undesirable because the list may have an arbitrary number of nodes², but the stack we’re using to count them has a relatively small fixed size, so with a sufficiently long list the stack will overflow and the program will crash.

Let’s convert this to an iterative implementation:

// v3
- (NSUInteger)length {
    return [[self class] lengthOfListWithHead:self];
}

+ (NSUInteger)lengthOfListWithHead:(ListNode *)node {
    NSUInteger count = 0; // base case
    while (node) {
        count += 1;
        node = node.next;
    }
    return count;
}

Now calculating the length will use a constant number of stack frames³, avoiding the problem above. Some may argue that the elegance of the recursive approach has been lost, though.

Note that in each iteration, the code changes node — a parameter — as well a new local variable. We can play with this, making a parameter for count as well:

// v4
- (NSUInteger)length {
    return [[self class] lengthOfListWithHead:self count:0 /* base case */];
}

+ (NSUInteger)lengthOfListWithHead:(ListNode *)node count:(NSUInteger)count {
    while (node) {
        count += 1;
        node = node.next;
    }
    return count;
}

Now unwrap the while loop:

// v5
+ (NSUInteger)lengthOfListWithHead:(ListNode *)node count:(NSUInteger)count {
top:
    if (!node) {
        return count;
    } else {
        count += 1;
        node = node.next;
        goto top;
    }
}

Pay attention to the code in else clause. This technique — setting parameters and jumping to the top of a method — is also known as calling a method:

// v6
+ (NSUInteger)lengthOfListWithHead:(ListNode *)node count:(NSUInteger)count {
    if (!node)
        return count;
    else
        return [self lengthOfListWithHead:node.next count:(count + 1)];
}

This is tail recursion. The initial recursive implementation (v2) is not tail-recursive because each pass performs an addition after the recursive call returns. The state of this incomplete addition must be stored somewhere, and that “somewhere” is the stack.

Let this soak in for a moment.

The compiler can see that these two forms are equivalent too. With sufficient optimizations enabled — -O1 for clang — the tail-recursive version is compiled into iterative code, using just one stack frame regardless of length. This is called tail call optimization (TCO).

I happened across a discussion of tail call optimization in ECMAScript / JavaScript today, and decided to sanity check my understanding, so made a little Xcode project and wrote the code above. I was surprised to see that v6 did not get its tail-call optimized:

What is going on? This should work; there is no work to be done after the recursive call, so why isn’t being optimized? Dump the assembly! (Source annotations added by hand.)

Tail`+[ListNode lengthOfListWithHead_v6:count:] at main.m:103:
0x100001ae0:  pushq  %rbp
0x100001ae1:  movq   %rsp, %rbp
0x100001ae4:  pushq  %r15
0x100001ae6:  pushq  %r14
0x100001ae8:  pushq  %r12
0x100001aea:  pushq  %rbx
0x100001aeb:  movq   %rcx, %rbx
0x100001aee:  movq   %rdi, %r14
    ; if (!node)
0x100001af1:  testq  %rdx, %rdx
0x100001af4:  je     0x100001b37               ; +[ListNode lengthOfListWithHead_v6:count:] + 87 at main.m:108
0x100001af6:  movq   0x8b3(%rip), %rsi         ; "next"
0x100001afd:  movq   0x514(%rip), %r12         ; (void *)0x00007fff99e68080: objc_msgSend
0x100001b04:  movq   %rdx, %rdi
    ;     node.next
0x100001b07:  callq  *%r12
0x100001b0a:  movq   %rax, %rdi
0x100001b0d:  callq  0x100001c9a               ; symbol stub for: objc_retainAutoreleasedReturnValue
0x100001b12:  movq   %rax, %r15
    ;     (count + 1)
0x100001b15:  incq   %rbx
0x100001b18:  movq   0x8c1(%rip), %rsi         ; "lengthOfListWithHead_v6:count:"
0x100001b1f:  movq   %r14, %rdi
0x100001b22:  movq   %r15, %rdx
0x100001b25:  movq   %rbx, %rcx
    ;     [self lengthOfListWithHead_v6:node.next count:(count + 1)];
0x100001b28:  callq  *%r12
0x100001b2b:  movq   %rax, %rbx
0x100001b2e:  movq   %r15, %rdi
0x100001b31:  callq  *0x4e9(%rip)              ; (void *)0x00007fff99e6b0d0: objc_release
0x100001b37:  movq   %rbx, %rax
0x100001b3a:  popq   %rbx
0x100001b3b:  popq   %r12
0x100001b3d:  popq   %r14
0x100001b3f:  popq   %r15
0x100001b41:  popq   %rbp
0x100001b42:  ret

The problem is revealed: there is work to be done after the recursive call: automatic reference counting inserted a release call for the value returned from node.next. If we were writing this with manual retain/release, one wouldn’t insert any memory management calls into this at all, because -next returns an autoreleased object. However when this is compiled under ARC, calls to objc_retainAutoreleasedReturnValue and objc_release are inserted to allow for another optimization — having the return value skip the autorelease pool entirely. Unfortunately in this case, it conflicts with tail call optimization.

One way to avoid this is to use the instance variable directly:

// v7
- (NSUInteger)length {
    return [[self class] lengthOfListWithHead:self count:0];
}
+ (NSUInteger)lengthOfListWithHead:(ListNode *)node count:(NSUInteger)count {
    if (!node)
        return count;
    else
        return [self lengthOfListWithHead:node->_next count:(count + 1)];
}

This generates much more compact assembly:

Tail`+[ListNode lengthOfListWithHead_v7:count:] at main.m:115:
0x100001a50:  pushq  %rbp
0x100001a51:  movq   %rsp, %rbp
0x100001a54:  testq  %rdx, %rdx
0x100001a57:  je     0x100001a75               ; +[ListNode lengthOfListWithHead_v7:count:] + 37 at main.m:120
0x100001a59:  movq   0xaf0(%rip), %rax         ; ListNode._next
0x100001a60:  movq   (%rdx,%rax), %rdx
0x100001a64:  incq   %rcx
0x100001a67:  movq   0x9b2(%rip), %rsi         ; "lengthOfListWithHead_v7:count:"
0x100001a6e:  popq   %rbp
    ; The “jump” instead of “call” shows tail call optimization in effect
0x100001a6f:  jmpq   *0x5a3(%rip)              ; (void *)0x00007fff99e68080: objc_msgSend
0x100001a75:  movq   %rcx, %rax
0x100001a78:  popq   %rbp
0x100001a79:  ret

However the behavior is slightly different: node could be an instance of a ListNode subclass that has overridden next to return something different. The compiler, being conservative, won’t replace the message send with an instance variable access for this reason.

So depending on the use case, we might choose to:

Assume next isn’t overridden, access the ivar directly, and get tail call optimization.

Try to handle both situations:

// v8
+ (NSUInteger)lengthOfListWithHead:(ListNode *)node count:(NSUInteger)count {
    if (!node)
        return count;
    else if ([node isMemberOfClass:self])
        return [self lengthOfListWithHead:node->_next count:(count + 1)];
    else
        return [self lengthOfListWithHead:node.next count:(count + 1)];
}

Unfortunately I was unable to get this to work, even trying a variety of ways to check the class there would always be unconditional release calls inserted at the end of the method that thwarted TCO.

Compile this code without ARC, allowing TCO at the cost of autorelease optimization.
But the best choice is:
Admit defeat, and just use explicit iteration.

In an ARC environment, tail call optimization (and thus tail recursion) is too fragile. Don’t rely on it.

See Programming with Objective-C: Working with nil ↩
As written here, where every node is an object in memory, of course there is a limit to those too. Both 32-bit and 64-bit address spaces allow much longer lists than the stack could could accommodate, though. ↩
Naively this will use 2 frames, one for -[ListNode length] and one for +[ListNode lengthOfListWithHead:]. If the tail call in -[ListNode length] can be turned into a jump, there may only be one frame; but it likely can’t be. See Proper Tail Recursion in C. ↩