How an Accident of Hardware Design Encouraged Open Source
Portable OS and Incompatible Byte Order Lead to Preference for ASCII
I believe this problem of byte-order incompatibility ended up being a major driving force behind the openness we find in the descendants of Unix. Most OSes prior to Unix were written in assembly language and were thus inextricably tied to the instruction set of the CPU they were written for. Unix was among the first major OSes written in a higher-level language, and so it was comparatively easy to compile it for many different kinds of CPUs. Because other OSes typically ran on only a single architecture, they never encountered the byte-order incompatibility problem. But Unix's portability guaranteed that it would run into the problem, big time. When people tried to move data between big-endian and little-endian machines, bytes within words came through scrambled. The scrambled results gave the problem its name: it was known as either the XINU problem or the NUXI problem, depending on the specific architectures of the two machines.
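To make the scrambling concrete, here is a minimal Python sketch that simulates copying 16-bit words between machines of opposite byte order. The function name `transfer_as_words` is hypothetical and exists only for this illustration; it assumes the data is moved word by word, with each machine storing the same word values in its own byte order.

```python
import struct

def transfer_as_words(data: bytes, src: str, dst: str) -> bytes:
    """Simulate moving 16-bit words from a src-order machine to a dst-order one.

    src and dst are struct byte-order prefixes: '<' is little-endian,
    '>' is big-endian. len(data) must be even.
    """
    n = len(data) // 2
    # The source machine interprets its memory as n 16-bit words...
    words = struct.unpack(f"{src}{n}H", data)
    # ...and the destination stores the same word values in its own order.
    return struct.pack(f"{dst}{n}H", *words)

scrambled = transfer_as_words(b"UNIX", "<", ">")
print(scrambled)  # b'NUXI'
```

Each pair of bytes is swapped in place, which is why four-character strings such as "UNIX" came through as "NUXI".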
Programmers tried a variety of approaches to deal with the problem that data formats that depended on byte order didn't transfer well between architectures. A common approach was to build something into the data format to indicate the byte-order. The first two bytes of a TIFF image file, for example, are either "II" if integers in the file are in little-endian byte order or "MM" if integers in the file are in big-endian byte order.
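As an illustration of the byte-order marker approach, the following Python sketch reads the 8-byte TIFF header. The function name `read_tiff_header` is an assumption made for this example. After the "II" or "MM" marker, the TIFF header carries the magic number 42 and the offset of the first image directory, both stored in the byte order the marker announces.

```python
import struct

def read_tiff_header(header: bytes) -> tuple[str, int]:
    """Return (struct byte-order prefix, offset of first IFD) from a TIFF header."""
    mark = header[:2]
    if mark == b"II":        # "Intel": little-endian integers
        order = "<"
    elif mark == b"MM":      # "Motorola": big-endian integers
        order = ">"
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    # The next 2 bytes are the magic number 42, then a 4-byte directory offset,
    # both written in the byte order the marker just told us about.
    magic, ifd_offset = struct.unpack(order + "HI", header[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file: bad magic number")
    return order, ifd_offset

print(read_tiff_header(b"II*\x00\x08\x00\x00\x00"))  # ('<', 8)
print(read_tiff_header(b"MM\x00*\x00\x00\x00\x08"))  # ('>', 8)
```

The cost of this approach is visible in the code: every reader has to carry the byte-order prefix along and apply it to each multibyte field.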
Another way to solve the byte-order problem was to avoid it entirely by only dealing with data in one-byte chunks. People quickly noticed that programs written to fetch their data from files that contained printable ASCII worked fine without any extra programming effort. Representing numeric data as plain ASCII had the disadvantage of taking up a lot more space than a binary representation. Given the memory constraints of the time, this was a much bigger deal back then than it is now. It also had the disadvantage of requiring additional CPU time to convert the data between ASCII and binary. Again, this was a bigger deal back then than it is now, due to the comparatively slow CPU speeds of the time.
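The tradeoff can be sketched in a few lines of Python. The value 1234567 is arbitrary and chosen only for illustration: the ASCII form takes more bytes and needs a conversion step, but it reads back the same everywhere, while the binary form reads back correctly only if the reader assumes the writer's byte order.

```python
import struct

value = 1234567  # arbitrary example value

as_text = str(value).encode("ascii")   # one byte per decimal digit
little = struct.pack("<I", value)      # 4-byte little-endian binary
big = struct.pack(">I", value)         # 4-byte big-endian binary

# The text form costs more space (7 bytes here versus 4)...
print(len(as_text), len(little))

# ...and must be converted before use, but it has no byte order to get wrong.
print(int(as_text) == value)

# The binary form decodes correctly only with the writer's byte order.
print(struct.unpack("<I", little)[0] == value)
print(struct.unpack(">I", little)[0] == value)
```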
But memory and disk capacities were increasing, so the advantages of ASCII quickly came to outweigh its disadvantages. In all but the most storage-intensive applications, file formats that were byte-order dependent tended to wither and die. The experience of cpio vs. tar illustrates this.
In the early '80s, right around when a lot of commercial interest in Unix began, cpio and tar were two common Unix archiving utilities. Because lots of different companies were building their own hardware, each one inventing its own instruction set and addressing scheme, getting Unix to run on their machines required each of them to make the Unix source code compile, link, and then run properly on their hardware, a process known as porting. It was often necessary to copy many files at a time, some ASCII and some binary, back and forth between machines with different architectures. That meant aggregating many files into a single file for transfer among different machines. Both cpio and tar were available to do this job. However, where cpio's file format contained binary integers, the design of tar's file format had avoided binary and instead stored integers as ASCII strings. A cpio backup only worked where the destination machine had the same byte order and wordsize as the source machine. Tar format, unlike cpio format, could be used to move files between machines with different byte order and wordsize. Although you still had to deal with byte-order dependencies inside any binary files within the tar archive after they were on the destination machine, this was vastly better than the situation with cpio, where you couldn't even transfer the files. Unsurprisingly, people gravitated toward tar. As a result, cpio fell into disuse. tar, on the other hand, is still in common use today, a quarter of a century later.
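To show what storing integers as ASCII strings looks like in practice, here is a small Python sketch that builds a one-file archive in memory with the standard `tarfile` module and then decodes two header fields by hand. The helper `parse_octal_field` and the file name `hello.txt` are made up for this example. In the tar header, the file name occupies the first 100 bytes and the file size is held in bytes 124 through 135 as ASCII octal digits.

```python
import io
import tarfile

def parse_octal_field(field: bytes) -> int:
    """Decode a tar header numeric field: ASCII octal digits, NUL/space padded."""
    return int(field.strip(b"\0 ").decode("ascii"), 8)

payload = b"hello, world\n"

# Build a tiny archive in memory (ustar format, so fields are plain octal text).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

header = buf.getvalue()[:512]                 # first 512-byte header block
name = header[0:100].rstrip(b"\0").decode("ascii")
size = parse_octal_field(header[124:136])     # size field: ASCII octal text
print(name, size)  # hello.txt 13
```

Because the size is a string of digit characters, it is read one byte at a time and decodes identically on any machine, whatever its byte order or wordsize.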
Storing Data in Binary Fosters Closed Proprietary Systems; ASCII Fosters Openness
Representing data in 1-byte chunks rather than multibyte chunks turned out to have an interesting side-effect. When storing numeric data in multibyte chunks (binary), you have a stream of groups of bytes that may be of varying size. You might have 1-, 2- or 4-byte integers, 4- or 8-byte floating point numbers, and character strings, all mixed in with each other. Unless you have documentation on the file format, there's no way to tell where one piece of data ends and the next begins.
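A short Python sketch shows the ambiguity. The eight bytes below are invented for this example; nothing in the bytes themselves says which of the three layouts is the intended one, and each decodes without error.

```python
import struct

# Eight bytes from a hypothetical undocumented binary file.
blob = bytes([0x01, 0x00, 0x02, 0x00, 0x00, 0x00, 0x80, 0x3F])

as_shorts = struct.unpack("<4H", blob)   # four 2-byte integers
as_ints = struct.unpack("<2I", blob)     # two 4-byte integers
as_mixed = struct.unpack("<HHf", blob)   # two 2-byte integers and a 4-byte float

print(as_shorts)  # (1, 2, 0, 16256)
print(as_ints)    # (131073, 1065353216)
print(as_mixed)   # (1, 2, 1.0)
```

All three readings are equally valid as far as the data is concerned, and the same bytes could also be read in big-endian order, doubling the number of candidates.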
If you're storing your data in 1-byte chunks, the most natural representation is displayable ASCII text. That means that the data typically contains line terminators (LF on Unix, CR/LF on DOS, CR on the classic Mac OS) and field delimiters (TAB, ",", etc.). It also means that a sequence of characters containing only digits is probably a number, and a sequence containing non-digits is clearly not a number.
In an environment in which it's easy to make your data and configuration files incomprehensible (unless you make a point of providing detailed documentation on your file format), programmers and the companies they work for expect that everything can be kept secret, even from other programmers, unless they can gain some business advantage by documenting the format of some particular file.
On the other hand, in an environment in which data files, and especially configuration files, are commonly ASCII rather than binary, programmers come to expect that they'll be able to figure out a lot about what's in a file, even in the absence of detailed documentation on the file format.
Although Richard Stallman probably would have founded the GNU project anyway, GNU's progress would have been greatly impeded in an environment where all data was stored in binary. Many of the F/OSS projects that are GNU's progeny might never have happened in an environment in which undocumented binary file formats predominated. As just one example of how companies use undocumented binary formats to protect their turf and impede the progress of others, note that although Linux can read from an NTFS filesystem (the filesystem format that Microsoft has moved to), the ability to reliably create or modify files in an NTFS filesystem has been a longstanding problem for Linux. (See Does Linux Support NTFS? and HOWTO: NTFS with read/write support using the ntfs-3g.) Microsoft has managed to lock Linux out simply by using sufficiently complex structures inside NTFS and keeping the documentation secret.
Conclusion
The IBM 360 and DEC's PDP-11 were the predominant big and little computers of the 1970s. If they had both used the same byte-ordering scheme, it would have been possible to transfer multi-byte chunks of data between the two architectures without encountering byte-ordering problems. If the PDP-11's designers had decided to copy the byte ordering used by IBM, the Unix world would not have been pushed toward encoding data in ASCII, and the openness that followed as an unintended consequence in the Unix/GNU/Linux world might never have come about.
-
ASCII on the PDP-8
2007-03-05 22:00:53 Marc.Ramsey
-
ASCII on the PDP-8
2007-03-05 23:14:28 Mark_Rosenthal
There were a few different encoding schemes we used when I worked in DEC's Small Systems Group. It wasn't necessary to explain all of them in the article to make the point that on non-byte-addressable machines it was more cumbersome to write software to manipulate text than on byte-addressable machines.
When we could live with only uppercase letters, digits and a very few punctuation characters, we used a 6-bit encoding scheme called RAD50. But when we needed a fuller set of punctuation characters or lowercase letters, we used an 8-bit encoding which was 7-bit ASCII plus parity.
As for Gosling Emacs, I heard stories back then that Gosling had incorporated other programmers' contributions (including Stallman's) into his version of Emacs, and then sold it to Unipress, which treated it as proprietary and violated the programmers' tradition of sharing code. Since Stallman hadn't included a copyright in the code he contributed, it was in the public domain, so Unipress could claim proprietary rights to it. This is part of what made Stallman decide to copyright his later code and invent a license that would require recipients to behave better than Gosling and Unipress had done.
-
Open Source
2007-02-26 19:41:22 EldonZ
My early remembrances of shared code are from articles in Communications of the ACM, Knuth, Numerical Recipes in C and the sample code distributed by manufacturers of specialized chips such as DSPs. Granted, much of the earliest was in pseudo code of one type or another.
The thought that there might be some advantage to having someone other than the original programmer read the code to catch errors was suggestive of open source, but I don't recall that idea being all that popular with the original programmer at the time.
One of the problems with reading the code of another programmer was that the code, by and large, tended to be unreadable. "Spaghetti" code resulting from liberal use of "goto's" too often was the rule until Dijkstra's "Go to Statement Considered Harmful" letter to Communications of the ACM and the structured programming movement brought some hope of there being any value to the source being open.
The view often was that hardware made money while software cost money. This changed only when the hardware dropped in price due to integrated circuits and the hardware manufacturers allowed one company to control the operating system. So, you might reasonably say that the integrated circuit led to open source software. Nothing has yet done for software development what the integrated circuit did for hardware.
-
endianness
2007-02-23 13:41:36 Taniwha
Gee - there's lots to disagree with here, most of it pedantic - there were both little-endian and big endian machines before the '11, most of the word oriented evil back then probably had more to do with not having a power of 2 bytes in a word, if I remember '11 longs (or was it floats) were actually stored partially bigendian, etc etc
Mostly though I suspect the 'storing stuff as ascii' idea was something that couldn't catch on until disk storage got cheap enough. The other big change Unix brought in - simple unstructured text files - needed us to break away from that record based 'every thing is a card image' mindset.
What I really want to address though is endianness itself - we're kind of stuck with it as a western cultural artifact .... originally Arabic numbers were embedded in Arabic script which is of course written right to left (think about it what do traders do most with numbers? add them - a process that goes from LSB to MSB - in their case along with the flow of their script) - apparently we got Arabic notation from Spanish monks working their way through the libraries the Moors left in Spain ... they picked them up and brought them into western languages but weren't smart enough (I blame them for our current endianess malaise) to turn the digits around when they were included in a right-to-left writing system.
There you have it: Arabic numbers in an Arabic writing system are naturally little endian, in ours they are big endian - we grow up and are taught that they should be this way from the beginning of life - we're stuck with a system that has this inbuilt mixed-endianness backwardness - it's hard to make that mental jump that they should be different (I bet users of right-to-left writing systems find the whole idea of big-endian systems just weird at a whole different level since they haven't grown up with the idea that both things should live together)
(and BTW remember even in our modern little endian computers we still tend to number the bits within bytes in a big endian way ...)
-
Endianness, Arabic numbers, etc.
2007-02-27 15:31:51 Mark_Rosenthal
Taniwha wrote: "there were both little-endian and big endian machines before the '11, most of the word oriented evil back then probably had more to do with not having a power of 2 bytes in a word, if I remember '11 longs (or was it floats) were actually stored partially bigendian, etc etc"
Since endianness only applies to how bytes are numbered within a word, the issue only arises in architectures that are byte-addressable and have a wordsize larger than one byte. So the constraint is not whether the architecture has a power of 2 bytes in a word, but whether the wordsize is a multiple of the number of bits in a byte and whether each byte is assigned its own address.
Regardless of whether other manufacturers produced byte-addressable machines of either byte order, the PDP-11 is of critical importance because it's the architecture that the first ubiquitous portable operating system (Unix) grew up on. The 360 is important for the same reason that Microsoft Word is important today. For good or ill, its manufacturer was the behemoth that drove the industry.
You are correct about the 11's representation of multi-word data. The idea of giving low-order bytes lower addresses than high-order bytes was only maintained within a single word. High-order words within floats and doubles were stored before low-order words. It seems likely that the hardware implementation of these datatypes was designed by a different hardware engineer, and nobody noticed the inconsistency until too late.
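The mixed ordering is easy to see in a small Python sketch. It uses a 32-bit integer rather than a float purely to keep the bytes readable, and the function name `pdp11_long_bytes` is invented for this illustration: the high-order 16-bit word is stored first, but the bytes inside each word are little-endian.

```python
import struct

def pdp11_long_bytes(value: int) -> bytes:
    """Lay out a 32-bit value with the high-order word first and
    little-endian bytes within each 16-bit word."""
    high = (value >> 16) & 0xFFFF
    low = value & 0xFFFF
    return struct.pack("<HH", high, low)

value = 0x0A0B0C0D
print(struct.pack(">I", value).hex())   # 0a0b0c0d  (big-endian)
print(struct.pack("<I", value).hex())   # 0d0c0b0a  (little-endian)
print(pdp11_long_bytes(value).hex())    # 0b0a0d0c  (mixed ordering)
```

The result matches neither pure ordering, so data laid out this way needs its own conversion when moved to either kind of machine.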
Taniwha wrote: "Mostly though I suspect the 'storing stuff as ascii' idea was something that couldn't catch on until disk storage got cheap enough."
Disk capacities have always been way ahead of memory capacities. While disk storage may have had a small effect, the real limiting factors were the things I mentioned in the article. Manufacturing core memory was so expensive that no machine had a lot of memory until the industry switched to semiconductor RAM. And CPU speeds were slow enough that when you considered storing numeric data as ASCII, the first thought that popped into your mind was, "How many instruction cycles will it take to convert the data to a format I can use for computing?"
Taniwha wrote: "The other big change Unix brought in - simple unstructured text files - needed us to break away from that record based 'every thing is a card image' mindset."
The break from the record based 'every thing is a card image' mindset really didn't apply in the DEC world -- at least not in the Small Systems Group, which was responsible for OS-8 on the PDP-8 and RT-11 on the PDP-11. Unlike the mainframe world where the basic assumption was that input came from a card reader and output went to a line printer, our basic assumption in the Small Systems Group was that input came from and output went to an ASR-33 Teletype or similar device. Paper tape and keyboard input were assumed to be variable-length ASCII records delimited by \r\n.
Taniwha wrote: "apparently we got Arabic notation from Spanish monks working their way through the libraries the Moors left in Spain"
I'm not terribly familiar with the history of Spain, so I don't know what influence the Spanish monks had, but http://en.wikipedia.org/wiki/Arabic_numerals reports that our numbering system was originally called Hindu numerals and that a 9th century treatise by a Persian scientist entitled "On the Calculation with Hindu Numerals" was translated into Latin in the 12th century. It also says that Fibonacci, an Italian, promoted the system in Europe after learning it in Algeria. So the system may have entered Europe through multiple avenues, and what was originally called "Hindu numerals" later came to be known as "Arabic numerals."
Taniwha wrote: "they picked them up and brought them into western languages but weren't smart enough (I blame them for our current endianess malaise) to turn the digits around when they were included in a right-to-left writing system."
This is a very interesting observation. Apparently the inconsistencies in modern hardware design are nothing new.
Taniwha wrote: "(and BTW remember even in our modern little endian computers we still tend to number the bits within bytes in a big endian way ...)"
Since neither the 360 nor the 11 instruction sets included any instruction that identified bits by their bit address, the bit-numbering was really a paper exercise. But our modern little-endian Intel architecture computers have instructions like BSF and BSR (bit scan forward/reverse), which return the bit address of the lowest and highest bit, respectively, that's turned on in the source operand. And the bit numbering used is little-endian, not big-endian.
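For readers who want to see that numbering, here is a Python emulation of what the two instructions report. It is a sketch of their results, not the instructions themselves, and the function names are made up for this example. Bit 0 is the least significant bit; BSF reports the lowest set bit and BSR the highest. Raising an error on zero is a choice made for this sketch, since the hardware leaves the destination undefined in that case.

```python
def bit_scan_forward(x: int) -> int:
    """Index of the lowest set bit, counting from 0 at the least significant bit."""
    if x <= 0:
        raise ValueError("source operand must be a positive integer")
    return (x & -x).bit_length() - 1   # x & -x isolates the lowest set bit

def bit_scan_reverse(x: int) -> int:
    """Index of the highest set bit, counting from 0 at the least significant bit."""
    if x <= 0:
        raise ValueError("source operand must be a positive integer")
    return x.bit_length() - 1

x = 0b0101000                 # bits 3 and 5 are set
print(bit_scan_forward(x))    # 3
print(bit_scan_reverse(x))    # 5
```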







The PDP-11 was the first computer I used ASCII on to any great extent. The endian issue was already being dealt with by ARPAnet protocols in 1975, when I was involved in interfacing the first PDP-11s to that network.
If you really want to know how open source software came about, ask Richard Stallman about EMACS and James Gosling...