Onward to America Online


This is the third of four articles published, or to be published, in the journal C Vu. For more details please see the introduction to the first article in the series - 'Moving to the Web'.


In last issue's article I talked about the evolution of the multi-player game program Federation from its inception until we got it running on General Electric's GEnie service.

By the start of 1995 it was obvious that GEnie was on the way down. There were a number of reasons for this. The division of GE that ran it - GE Information Services (GEIS) - had never understood the service. Bill Louden, GEnie's founder, had long since left, and it was drifting about rudderless. GEIS had long been pulling money out of GEnie to bolster its bottom line while refusing to invest in updated equipment, condemning its customers to 2400 bps modems while everyone else had 9600. In addition, like most established services, it refused to acknowledge the importance of the nascent Web.

Our contract with GE was not an exclusive one, and so for some time we had been looking to run Federation on other services. It was about this time that we got a positive response from America Online. At the time AOL was big, but nothing like the size it is now. Its subscriber base was less than a million, and it was still an hourly subscription service.

After discussion with AOL it was decided that we would port the game to a Hewlett Packard Series 9000 RISC machine which AOL would provide and host in their machine room. The interface to AOL would be through a gateway originally designed for remote use. This sounded just fine, and I came back to the UK with all the documentation for Federation's programmer to study.

This gave us two tasks to get Federation running on AOL - porting the code to HP-UX, and interfacing the program to AOL's gateway system. The main thing we were worried about at this stage was the port to HP-UX, which at that time had a reputation for being a somewhat idiosyncratic flavour of Unix. In the event this proved not to be a problem, and the port itself went reasonably smoothly. (Incidentally, over the last couple of releases HP-UX has moved into the Unix mainstream.)

What did cause problems was the gateway interface. There were a number of contributory factors to this:

1. The gateway had only been used internally by the group that wrote it.

2. The documentation for the gateway was inadequate, and in places incorrect.

3. The group handling the gateway appeared to have no line management!

All three of these things caused us problems, both while we were porting and on an ongoing basis. In particular, it meant that a two-week trip to the US by Federation's programmer stretched to over six weeks. By the time the first royalty cheque arrived we literally had no money left in the bank.

Let's look at these problems one at a time...

First, the gateway code had only been used internally by the group that wrote it. I suspect that the gateway project was something that grew out of an internal project, rather than something designed as an external interface from the start. The routines used by the gateway group itself were very solid and very robust; their everyday use had long since ironed out the bugs. There were, however, a number of features that weren't used by the gateway programmers, and these proved to be much more flaky. I suspect they were added as an afterthought when someone realised that the interface could be used externally. In effect we were debugging this code as we went along.

This is the perennial 'library' problem. A company writes a library for internal use (3D graphics libraries are the most obvious culprits) and finds it very useful. Then some bright spark says 'Hey! We can sell this library and make back some of the money it cost us to develop.' Then features are found to be missing so they are added and the whole package goes onto the market as a 'revolutionary, full-featured xxx library'.

Unfortunately, the joins show. If you want to develop a library for external use it needs to be specified as such right from the start, all the features need to be properly tested, and the interface needs to be complete. Take it from me - never buy a third party library except from a vendor who only produces libraries. This is one of the main reasons why PD/Shareware libraries are often so full of promise, and so disappointing when you come to use them.

Second, the documentation was inadequate and/or incorrect. The documentation was obviously meant to be of a proper standard - it came in a loose-leaf folder with a couple of updates. I have no doubt that it was fully documented from the point of view of the gateway group - but their real documentation was the whole body of their work: email, source code comments, group discussion and so on. Unfortunately all we had was the written specification, which didn't incorporate this material.

This is another aspect of third party libraries - the documentation must be comprehensive. (Oh and by the way - programmers should be able to browse through documentation in the bath, so trying to cut costs by providing electronic documentation is a big no-no.) Preferably the documentation should come with a lot of simple examples, each covering one, and only one, aspect of the library. Nothing clever, nothing complex. I seem to remember that Greenleaf's DOS comms library was a sheer delight from this point of view.

Third, the group producing the gateway code didn't seem to have a senior manager to whom they reported. They seemed to be completely autonomous. This meant that there was no one we could approach to iron out problems with the code. In practical terms we had to write workarounds. In the event this wasn't such a problem at the time of porting, because the other people we were dealing with at AOL were highly competent and helped sort things out. It did become a problem later on.

We finally got up and running on AOL in mid-1995. It was magic - we had more people than we had ever had playing the game. It soon became obvious, though, that AOL users were a different breed from those we had previously been used to. For the first time in its history Federation needed a game management staff - what are now called Hosts and Navigators - but that lies outside the scope of this article. It also became obvious that we needed to tweak the game to cope with the increased number of players (up from 10-20 players to 50-60).

It was while we were doing this that the first problems emerged with the gateway to AOL. There was a general problem with the gateway group making changes to their code and not telling us, which meant that our code would mysteriously 'break'. Then there was an ongoing project of theirs to port the gateway to HP-UX. Since they swore there was 'no difference' they didn't give us an opportunity to test their code with ours. One day they simply put it in without telling us. Needless to say, our code stopped working properly.

But the worst problem was changes to the interface specs. At one stage they increased the maximum length of the messages that the gateway sent us, and suddenly our buffers were too small. Buffer overruns are incredibly difficult to spot! Fortunately, we had a very good working relationship with the Games Channel people, who did everything in their power to smooth things over.

In late 1996 AOL announced that it was moving to a monthly subscription, instead of hourly charges. Sadly, they hadn't thought through the implications of this for multi-player games, and failed to keep them charged on an hourly basis. The results were pretty obvious - everyone wanted to play the games for hours on end. This caused both game design and coding design problems for us.

The game design problems were easy to define, but more difficult to correct. Federation was designed to be played in a pay-per-hour environment. There was an underlying assumption that cost would limit players' ability to stay on-line indefinitely. There were also problems with the puzzles that required the use of what had become, with several hundred simultaneous players, very scarce objects indeed! The resolution of these problems is outside the scope of this article.

Sorting out the coding design problems was more pressing. At this stage we were running flat out with about 200 simultaneous players, and hordes more knocking at the door. We approached AOL for a more powerful machine, but one was not forthcoming. Not surprising really, since more players would increase the load on their network without increasing their revenue in any way. We were obviously going to have to solve this at the software level.

At this stage a simplified view of what was happening looked like this:

AOL --- Gateway (AOL)--- Gateway (Fed) --- Fed driver --- Fed Server

Players came to us from AOL via the AOL part of the gateway. This was the gateway code running on multiple machines which multiplexed the users through to our machine. Our part of the gateway, running on our machine, was one process for each of AOL's machines. These processes de-multiplexed the input and passed it on to the appropriate driver for the player. The driver in turn communicated with the server. Remember that there was a driver for each player and that all the links between processes were using TCP/IP (the standard communications protocol for the Internet).

Now, TCP/IP is a very good protocol indeed when you are communicating over services where packets might be delayed or lost. But that robustness comes at a cost in processing, memory and time. With (say) 200 players you would have 400 TCP/IP connections inside the Fed host machine - 200 going into the drivers from the gateway, and 200 from the drivers to the host. None of these connections went out onto the network, so no packets were going to be lost; TCP/IP for the internal communications was clearly overkill, and we couldn't afford the cost. Various options were looked at, and we settled on UNIX pipes as a replacement. The result of just this simple change was staggering - maximum usage shot up to 350 overnight.

Some of you might be wondering why we were using TCP/IP in the first place. There were two reasons - portability and scalability. When Fed was written, TCP/IP was more readily available, and at that time most implementations of pipes went through a disk file; only relatively recently have pipes started to go through memory. Our original ideas on scalability suggested that the way to scale up the game to handle more players was to have the server on one machine and the drivers on another. Using TCP/IP meant this could be done with little or no re-coding.

But...

We still needed to get more players on to our host machine - 350, it was obvious, was still inadequate. Where to go next? Well, we decided to look at the problems caused by having all these different processes running on our host machine. In particular we looked at context switching. Context switching is what the processor does when it moves from one process to another. Before it switches it has to save all the information about the old task (the context) and load in all the information about the new task. This takes time. Lots of it.

To give you a feel for this problem let's scale down our 100MHz processor by 100 million to human time scales - 1 cycle per second. The following table indicates the problem:

Processor cycle time    1 second
Accessing the cache     3 seconds
Accessing memory        20 seconds
Context switch          166 minutes
Disk access             11 days

(Table information adapted from 'Practical Unix Programming' by Robbins & Robbins, Prentice Hall)

As you can see, context switching is expensive! The only way to reduce the context switching was to run fewer processes. Reluctantly, we decided to fold the driver processes into our end of the gateway code. Why reluctantly? Because this would massively hit the portability of the code, by tying a significant part of the Fed code into AOL's gateway code. Nonetheless, we couldn't see any option - we were caught in the oh so familiar portability vs efficiency dilemma. Nothing comes without a price, and the price of portability is less efficient code. On the other hand, we would also halve the number of UNIX pipes we were using.

The results of this exercise were dramatic to say the least - usage shot up to 800+ players with processor resources to spare! We heaved a sigh of relief, and went on to start making game design changes to cope with additional load.

It was at this stage that AOL decided that they didn't want multi-player games any more...

But yet again I'm out of time and space, so the final part (I hope) of this saga will have to wait for the next issue.

Have fun programming.

Alan Lenton
16 August, 1998


[ A note on the concept of multiplexing.

Some of you may not have come across this concept before. Basically multiplexing involves taking a number of separate communication channels or streams and combining them into a single channel. At the other end they are de-multiplexed from the single channel back into the multiple channels. Multiplexing is a widely used technique since it allows scarce communication resources to be used efficiently.

Both hardware and software multiplexers exist. Hardware multiplexers are faster and more efficient, but you can't reprogram them as better multiplexing algorithms are discovered. Software multiplexers are slower, but programmable. You pays your money and takes your choice.

Multiplexing is a vast subject, and I don't have time to go into it here (or the specialist knowledge to do so). If you want to know more then you need to consult your local reference library for suitable texts.]

