---
title:       "PITA bugs part 1"
author:      "Z98"
date:        2013-05-06
aliases:     [ "/pita-bugs-part-1", "/node/606" ]
---

<p>In my years working as a programmer, I have run into many, many bugs and introduced plenty of my own. A few stand out as tremendously irritating to debug, because their behavior made little sense and their source was non-obvious. The first came from my work for the HTCondor Project, an open source cluster management system used by research groups around the world, including those crunching data from the Large Hadron Collider, the LIGO gravitational-wave observatory, and the IceCube neutrino observatory. In fact, the HTCondor cluster at UW-Madison often has a large share of its cycles used up by the physics department simply because they have so much data to work through. HTCondor&#39;s current list of supported platforms is Linux, Mac OS X, and Windows. In the past it also supported AIX, HP-UX, and Solaris, but a drop in demand combined with the expense of maintaining systems running those operating systems brought that work to an end. Even so, supporting the many variations of Linux plus OS X and Windows adds a lot of complexity to HTCondor. In fact, the need to support older versions of Linux has been a heavier handicap in using newer C++ features than the need to support older versions of Windows, but that is a story for another day.</p>
<p>This particular bug showed up in HTCondor&#39;s logging code and only manifested on Linux, which was especially irritating because, as a developer on HTCondor, I was mostly focused on Windows and had generally tried to avoid dealing with Linux. Unfortunately, it was while rearchitecting the logging code that I ran into the issue, so there was little choice but to fire up gdb and see what the problem was. It soon became apparent that the crash in the logging code was due to a buffer overflow, but then the question became how the overflow had occurred. While gdb was able to show what had been corrupted, it could not tell me when the corruption had happened. 
Next was valgrind, which pointed me to an entry in a data structure that held the maximum size a log was allowed to grow to. These sizes had originally been stored in an array, but with the new structs a mistaken write would clobber other data in the struct needed to keep track of log output files. Previously, a mistaken write would likely have clobbered only the maximum sizes of other logs, which might or might not have been noticed depending on whether an administrator had reason to pay attention.</p>
<p>Stepping through the code in gdb did not reveal anything out of place, so I called in a coworker. He could not see anything wrong with the code either, and after about half an hour of scratching our heads we called in another guy, who recognized the cause because he had seen something similar very recently. Following his advice, we set breakpoints in two different files that touched the maximum size member and checked the size of that member&#39;s type in each. To my considerable surprise and irritation, the sizes did not match between the two files. The cause was the use of a typedef&#39;d type in C, off_t.</p>
<p>With GCC, off_t can be either 32 or 64 bits wide depending on whether a macro is defined before the inclusion of whatever header file defines off_t. That macro must be defined before the header inclusion in ALL source files, so missing even one results in a differently sized off_t. This is exactly what happened: one file had the macro defined and another did not. Unfortunately, the one that did not was also the one that defined the struct with the off_t member, while the one that did was the file that instantiated and wrote to the off_t member. With this mismatch, one module would write a 64-bit value while another assumed the member was 32 bits wide and accessed the other members of the struct accordingly. 
The end result was the logging code crashing after my rearchitecting.</p>
<p>The solution was to forgo off_t and simply use an explicit 64-bit type, which required every function dealing with positions inside a file to handle 64-bit values as well. This, however, uncovered a different problem. The function used to move around inside a file was lseek, which can take either a 32-bit or a 64-bit offset depending on the OS and toolchain. With GCC, the same macro that turns off_t into a 64-bit type also turns lseek into its 64-bit variant. Having already been bitten once by that macro, we decided to call the 64-bit version explicitly. However, each platform, Linux/GCC, OS X, and Windows/VC++, calls the 64-bit version of lseek by a different name. With Linux/GCC, it is lseek64. On Windows, it is _lseeki64 (and let&#39;s not forget that plain lseek is spelled _lseek there). Finally, OS X, aka BSD, had forgone a 32-bit lseek entirely, so regular lseek is always 64 bits. The result was a separate ifdef for Linux, OS X, and Windows every time lseek was called. These call sites were mercifully few, but they highlight the complexity and tradeoffs involved in supporting multiple operating systems, even when using functions that appear to be part of the C standard library. If you are lucky, it is a simple matter of renaming a function. If you are unlucky, you get to hunt around for unexpected side effects.</p>
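<p>To make the off_t mismatch described earlier concrete, here is a minimal sketch in C. It is not HTCondor&#39;s code: the struct names, the fd field, and the helper functions are all hypothetical, and fixed-width integers stand in for the 32-bit and 64-bit flavors of off_t. On glibc the controlling macro is typically _FILE_OFFSET_BITS, which the story above does not name, so treat that as an assumption too. The stray write is simulated with memcpy so the sketch stays well-defined C while still showing how the wider store spills into the neighboring member.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout as seen by a file compiled WITHOUT the large-file
   macro: int32_t stands in for a 32-bit off_t. */
struct log_info_narrow {
    int32_t max_size;  /* maximum size the log may grow to */
    int32_t fd;        /* neighboring bookkeeping field (illustrative) */
};

/* The same declaration as seen by a file compiled WITH the macro:
   int64_t stands in for a 64-bit off_t. */
struct log_info_wide {
    int64_t max_size;
    int32_t fd;
};

/* Where each translation unit believes the neighboring field lives. */
size_t narrow_fd_offset(void) { return offsetof(struct log_info_narrow, fd); }
size_t wide_fd_offset(void)   { return offsetof(struct log_info_wide, fd); }

/* Simulate the "wide" file storing max_size into an object that was laid
   out by the "narrow" file. Returns 1 if the neighboring field changed. */
int wide_write_clobbers_fd(void)
{
    struct log_info_narrow log = { 0, 7 };

    /* Both 32-bit halves are nonzero and differ from 7, so fd changes
       regardless of byte order. */
    int64_t new_max = INT64_C(0x0000000100000002);

    /* An 8-byte store starting at max_size: the first 4 bytes land in
       max_size, the remaining 4 spill into fd. The struct is 8 bytes on
       common ABIs, so the copy stays inside the object. */
    memcpy(&log, &new_max, sizeof new_max);

    return log.fd != 7;
}
```

<p>On common ABIs the two offset helpers return 4 and 8, and wide_write_clobbers_fd() reports that fd no longer holds the 7 it was initialized with, whichever byte order the machine uses.</p>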