Features


A Universal C++ Data Type

Michael Kelly


Michael Kelly is a free-lance Writer/Programmer who uses the languages C, C++, Pascal, 80x86 Assembler, and English. He can be reached by regular mail at 254 Gold Street, Boston, MA 02127, or by e-mail on TelePath as mkelly.

Overloading In C++

C++ allows the overloading of most standard operators. You can manipulate user-defined data types with the same notation as the built-in types. That lets you plug the user-defined type into a standard algorithm with minimal modification. A sorting algorithm provides an example. To keep the code simple, most books or magazine articles that implement a sorting algorithm use integers as the data type to sort. When you have grasped the mechanics and want to use the technique, you find that you must modify the sorting implementation to accommodate the data you want to sort. If you want to sort more than one data type in a program, you have two choices:

The latter is probably the better choice but is also more difficult. Unlike straight C, C++ should allow you to avoid the choice just outlined.

By using C++ overloading, the algorithm accommodates the C++ data type by globally replacing int with the name of your C++ class. If you use inheritance to enable any type of data to be enclosed in a descendant of this class, then you should not need to modify the sorting algorithm again, regardless of what you want to sort with it.

The real world is not quite so wonderful. I had to discard my first class design. After using a debugger to help resolve some side effects, I achieved a result that comes pretty close to my ideal. Along the way I came across several issues concerning C++ behavior.

Pseudo Virtual Base Classes

For the algorithm to manipulate user-defined types in a manner closely resembling the build-in types, the user-defined types should all be descendants of the same base class. Virtual functions implement the operators needed to manipulate the data type. Usually you achieve this polymorphism by using a pointer to a virtual base class. The pointer accesses the appropriate virtual functions for the descendant class being manipulated.

When I implemented the heapsort algorithm, I needed to be able to exchange two objects as well as compare them. This led to the first major behavioral obstacle that I encountered. In order to swap two objects, you need to use a third object as temporary storage. Since a virtual base class cannot be instantiated, I had to fill in the holes in the virtual base class so it could stand on its own. I accomplished this by including virtual functions for every operation needed by descendants and stubbing out the ones that needed actual data to function.

I refer to the result as a "pseudo virtual base class." Now that I could create an instance of the virtual base class, I could use object references instead of pointers — an important step toward minimizing modifications to the heapsort routine.

Improvements

As I built on this technique, defining descendant classes, I found that accessing the descendant class through a reference to the base class worked well as long as an instance of the descendant class was the same size as the base class. As soon as I added data fields to the descendant class definition, all bets were off. This is hardly the source code transparency that I was seeking. First, I tried to use an anonymous union as the sole data field in the base class, and cram into it every data type I could think of. This approach functioned after a fashion but wasn't very extensible.

After further experimentation, I arrived at a better solution. Once a class has a virtual function, it carries the overhead of a pointer to its virtual method table. This overhead is constant. The class could have one or 50 virtual functions without altering the size of an instance of the class. I abandoned the anonymous union. Instead, I used a void pointer as the sole data field in my class. As I derived new classes, I overloaded virtual functions in the base class to manipulate and return information about the memory pointed to by the void pointer in my class.

Deep Vs. Shallow Copying

As things progressed, the issue of deep copy versus shallow copy cropped up. If the assignment operator is not overloaded, the compiler generates code to copy the values of each data field in the class when one object is assigned to another. This is known as a shallow copy. If memory is allocated to a pointer in one object, and the object is assigned to another object, the same dynamic memory may be deallocated more than once during destructor calls. As all C programmers know, dynamic memory bugs can be the most difficult to find.

Most C++ programmers avoid the problem by implementing a deep copy. They allocate more memory in the object receiving the assignment, and copy the data from the object on the right side of the assignment statement. When I used this technique, the runtime efficiency of my code suffered quite noticeably. I wanted to get the efficiency of the shallow copy code generated by the compiler without leaving allocation bugs lurking in my code. The sole data field in my class is a void pointer, which the default constructor sets to NULL. By using a constant object that was never assigned a value, I could use it to zero out the object used for temporary storage during exchanges. The destructor takes no action if the void pointer is NULL. Thus you avoid multiple deallocations while maintaining runtime efficiency.

The Base Class Thing

Next, I'll describe the pseudo virtual base class. This base class is named Thing (as generic a term as I could think of for an object to contain any type of data). As I mentioned previously, its sole data field is a void pointer. Descendant classes use this pointer, thing, to hold the address of whatever data the descendant class manipulates. This allows maximum flexibility of the shape of data in derived classes but requires the derived class to overload the comparison operators and create constructors for the data type it wishes to represent. The only serious clash occurs when you have an array or collection of Things in which the Things are actually instances of more than one derived class. In such a case, incompatible objects are likely to interact, so the programmer must create a typing mechanism to check object type compatibility and avoid meaningless comparisons.

Type Checking

Turbo Pascal 6 considers objects to be of the same type if they have the same virtual method table. It provides a TypeOf function that returns the address of this table. I could use the same technique in my class, but I don't know if the way Turbo C++ handles its VMT pointer will resemble any other C++ implementation. Consequenty, my type-member function is a virtual function that returns a long. The upper 16 bits represent an enumerated type for the general data type involved (e.g. int, long, double, struct). The lower 16 bits is the size of the memory pointed to by the pointer thing. The idea is that two struct data-type objects of different size would not be erroneously compared. This approach is flawed, but it seems reasonable to me that heterogeneous arrays will probably be iterated over, while only homogeneous arrays will have their elements compared and added. This problem seems to cry out for a perfect hashing function.

Derived Classes

I have derived classes from Thing to handle ints, longs, doubles, and structs. I used the numeric types mainly for developing the technique. The struct type is likely to be of serious value in an application. Also implemented is a ThingList class for creating a linked list of Things that you can access using array-indexing notation. Its main purpose is as a test bed for an iterator member function. The test program calls the ThingList iterate method with a member pointer to the Thing print virtual function to illustrate polymorphism. Now I know why some programmers love SmallTalk. The class hierarchy is already there. Unfortunately, you must use it whether you like it or not. I prefer the incremental approach to OOP that C++ offers.

The Code

C++ source code accompanies this article. Thing.hpp is the class definition for the Thing class. Thing.cpp holds only the const of type Thing named TheNullThing. This is the uninitialized constant used for zeroing out or erasing local Thing instances that I described earlier. The derived-class files generally follow the naming convention th_type.cpp, for the implementation, and th_type.hpp, for the C++ header, in which type is int or struct or whatever. The exception to this is th_list. This is the ThingList class containing the iterator method I spoke of.

To compile the program with Turbo C++, copy the headers to the directory you use for your header files and the .cpp files to your source code directory. Invoke the command-line compiler tcc to do the job:

tcc -P trything thing th_int th_doubl th_long
   th_struc th_list hsort
As with all design decisions, the tradeoffs must be considered. On the negative side, my approach requires thoughtful design of the base class and overloading of the virtual functions for every derived data type. On the positive side, unfamiliar algorithms may be easily adapted to operate on your data. Because you are more likely to be able to fix a class of your own design if it breaks during modification, my approach may help you produce a library of truly reusable code.

Listing 1

Listing 2

Listing 3

Listing 4

Listing 5

Listing 6

Listing 7

Listing 8

Listing 9

Listing 10

Listing 11

Listing 12

Listing 13

Listing 14

Listing 15