Computer and biological analogies

I've noticed that when I try to explain technical points about computers to non-specialists, which includes friends, colleagues and judges, I sometimes use analogies with biology. Here are a few of them explained: computer viruses and biological viruses, reverse engineering and reverse genetics, full disclosure and evolution, and finally open source and science.


Computer viruses and biological viruses.

This is the classical one. You can find many sites discussing this analogy. There are interesting parallels between the two, as long as we don't look too closely. Obviously, the name Fred Cohen chose for the new kind of computer program he invented was not a mere coincidence.

Bacteriophages, which are typical viruses that infect bacteria, can enter a lysogenic cycle: the viral genome is integrated into the host genome, where it stays silent and can be activated later. Once activated, the virus basically hijacks the whole metabolism and energy of the bacterium to replicate and produce more copies of itself. They are parasites. Computer viruses - I am talking here about strict viruses according to Cohen's definition - integrate their code inside a host program, hijack the CPU time and memory normally allocated to the infected program, and replicate.


Biological viruses are not really alive, though that of course depends on your own definition of what life is. It may sound strange, but life is something very difficult to define strictly. Viruses sit very close to the frontier between life and non-life, wherever this frontier may be. They are, more or less, a bit of genetic material (DNA or RNA) carrying a few protein-encoding genes, packed inside a protective protein shell.

Actually, there are "organisms" even simpler than viruses. They are called viroids: extremely short strands of naked RNA which do not code for anything but can still replicate. Viroids have so far been found only in plants, where they generally cause diseases. In animals, there is another type of very simple self-replicating - and lethal - entity called prions. They don't carry any genetic material at all; they are just a single protein which is not folded correctly. When they interact with our own prion protein, a normal protein found in the brain, they convert it to the same misfolded shape.

The basis of the analogy between computer and biological viruses is that their code, or the basic information necessary and sufficient to make them work in their environment, is completely linear and one-dimensional, and is based on a very simple alphabet: four bases for DNA, 256 values for computer code. You could object that computer code is actually based on a binary alphabet, bits equal to 0 or 1, but the minimal unit that a microprocessor fetches at every cycle, the minimal sequence making up an instruction, is composed of eight bits: a byte, which can take 2^8 = 256 values. It's still very simple.
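
To make this concrete, here is a minimal sketch in Python (the sequence is made up) that treats both codes as what they fundamentally are: linear strings over a small alphabet. Two bits are enough to encode one DNA base, so four bases fit into a single byte.

    # Both genomes and executables are linear sequences over a small alphabet:
    # DNA uses 4 symbols, machine code uses 256 (one byte = 8 bits = 2^8 values).
    BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def pack_dna(seq):
        """Pack a DNA string into bytes, four bases (2 bits each) per byte.
        A trailing partial byte is not padded; this is just a sketch."""
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for base in seq[i:i + 4]:
                byte = (byte << 2) | BASE_TO_BITS[base]
            out.append(byte)
        return bytes(out)

    print(len(BASE_TO_BITS))          # 4 symbols in the DNA alphabet
    print(2 ** 8)                     # 256 symbols in the machine code alphabet
    print(pack_dna("GATTACA").hex())  # 7 bases packed into 2 bytes: '8f04'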

Computer viruses are also very similar to retrotransposons, which is not a surprise, because these entities are themselves very similar to retroviruses. What are transposons? They are mobile genetic elements that come not from the outside, but from the inside. They are pieces of DNA inside the genome that stay silent for a while until, suddenly, for some reason, they are excised and copied elsewhere in the genome. Think cut and paste. Retrotransposons are more like copy and paste, with an intermediate RNA copy. Transposons probably play a huge positive role in evolution, because they produce mutations: they are not cut out precisely, or they can land in the middle of a gene and disrupt it.
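
Here is a toy illustration of the two mechanisms in Python (the sequences and insertion sites are invented, and real transposition is of course enzymatic, not string slicing): a transposon behaves like cut-and-paste on the genome string, a retrotransposon like copy-and-paste.

    import random

    random.seed(1)  # reproducible toy example

    def transpose(genome, start, length):
        """Cut-and-paste: excise the element, reinsert it at a random spot."""
        element = genome[start:start + length]
        rest = genome[:start] + genome[start + length:]
        spot = random.randrange(len(rest))
        return rest[:spot] + element + rest[spot:]

    def retrotranspose(genome, start, length):
        """Copy-and-paste: the original stays in place; a copy (made through
        an RNA intermediate, in reality) lands somewhere else."""
        element = genome[start:start + length]
        spot = random.randrange(len(genome))
        return genome[:spot] + element + genome[spot:]

    genome = "AAAAAA[GENE]AAAAAA" + "TTTT" + "CCCCCC"  # TTTT is our mobile element
    print(transpose(genome, genome.index("TTTT"), 4))
    print(retrotranspose(genome, genome.index("TTTT"), 4))

Run it a few times without the fixed seed and you will see the element land anywhere, including inside [GENE], which is exactly how transposons disrupt genes.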

Which gives me the occasion to talk about something often missed by people who draw analogies between the two types of viruses. Computer and biological viruses may both "mutate", but these are two fundamentally different processes. Polymorphic computer viruses do mutate, but through a mechanism designed by their creator. It has nothing to do with randomness: all the possibilities - and you can calculate their theoretical number - are encoded in the mutation engine. Biological viruses do not possess a mutation engine. They mutate at random, because of a gamma ray or an error during replication. It's not predictable.
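
The difference is easy to show in code. Below is a deliberately naive sketch in Python (the payload and the XOR scheme are just placeholders for a real mutation engine): the polymorphic variants form a finite set that can be enumerated in advance, while the "biological" mutation flips a random bit and cannot be predicted.

    import random

    PAYLOAD = b"replicate me"

    def polymorphic_variants(payload):
        """Designed mutation: XOR-encode the payload with every possible
        1-byte key. The whole set of variants (exactly 256 here) is fixed
        in advance by the engine's design."""
        for key in range(256):
            yield bytes(b ^ key for b in payload) + bytes([key])

    def random_mutation(payload):
        """Biological-style mutation: flip one random bit, unpredictably."""
        data = bytearray(payload)
        pos = random.randrange(len(data) * 8)
        data[pos // 8] ^= 1 << (pos % 8)
        return bytes(data)

    print(sum(1 for _ in polymorphic_variants(PAYLOAD)))  # 256, computable a priori
    print(random_mutation(PAYLOAD))  # different, and unplanned, on every run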

I have heard of only one case, from a credible source, of something close to a spontaneous and unpredictable mutation in computer viruses. And curiously, it was positive, judged from an evolutionary perspective. It was actually an interesting mix, a curious monster, though I'm not sure it could survive further than one generation. It was the offspring of a very simple e-mail worm, with no executable infection, destructive abilities or polymorphism, and of a complex PE and kernel infector which was the opposite: heavily polymorphic, and carrying the CIH routine to flash the BIOS on Christmas Day. The outcome of this unique computer sexual encounter - probably an infection of the self-contained worm by the virus, followed by normal transmission of this evil hybrid through the worm's e-mail routine, which didn't do any self-integrity check - combined the characteristics of both: it became a destructive polymorphic PE infector with e-mail sending abilities.

But this is very rare, for a simple reason. Most old school computer viruses are written in assembler. This is not a high-level programming language; it's more like direct instructions for the microprocessor, written one by one. So it's very compact: everything is useful, there is no waste of space. Biological viruses are a little bit like that. Sometimes genes even overlap: the same sequence of DNA can be read as one gene in one direction and as another gene in the opposite direction. The picture is completely different in higher organisms like humans, where more than 95% of the DNA does not code for genes. People sometimes call it "junk DNA", but that's a simplification: there may be many important things in there, especially in the long run, for evolutionary purposes, as a sort of reservoir of genetic material. My point is that a random point mutation in human DNA - and there are several every day in your own body, dear reader - will probably not have any consequence. Same with a computer program coded in, for example, Visual Basic, which includes many things that will never be needed for the program to run, plenty of routines that are never called. On the other hand, a small mutation in a biological virus, and even more so in an assembler program, will likely have a negative consequence. The more compact the code, the more sensitive it is to mutations.
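
The argument fits in a back-of-the-envelope formula. Here is a sketch with made-up proportions: if a fraction f of the sequence actually matters, the chance that k independent random point mutations all land in sequence that doesn't matter is (1-f)^k.

    # Toy model: a point mutation is only 'felt' if it lands in sequence
    # that matters. The fractions below are illustrative, not measurements.
    def p_all_silent(f, k):
        """Chance that k independent, uniformly placed point mutations all
        miss the meaningful fraction f of the sequence."""
        return (1 - f) ** k

    print(p_all_silent(0.05, 10))  # 'human-like' genome, ~5% meaningful:
                                   # ~60% chance that 10 mutations are all silent
    print(p_all_silent(0.98, 1))   # compact viral/assembler code: one single
                                   # mutation almost surely breaks something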

So we have to be careful with words. Just because two things bear the same name does not mean it is for the same reason. Another example: retroviruses. In the computer world, these viruses are so called because they fight back against anti-virus software and may disable it, which is the opposite of what normally happens; hence "retro". In the biological world, retroviruses (just like retrotransposons) use a very unusual step: at one point, they transcribe their RNA into DNA, using an enzyme called reverse transcriptase. This runs opposite to the usual flow of information in biology, which goes from DNA to RNA and then to proteins (the "Central Dogma"); so they too are called "retro". Just to add a detail in passing: because this process and this enzyme do not normally exist in healthy cells, they are the targets of anti-retroviral drugs, as in the case of HIV.

There are some other possible analogies worth noticing. For example, between antivirus scanners, which recognize a short sequence of computer code, and antibodies, which recognize a short stretch of amino acids, themselves encoded by some DNA. Also, the epidemiology of both kinds of viruses is very dependent on the environment, on how information is exchanged, and on how human populations travel around. The availability of plane flights modified the landscape of worldwide diseases, just like the internet completely changed how computer viruses are distributed. Another funny point is the bad consequence of monoculture: if 90% of the planet grows the same variety of corn or uses the same operating system, a single pathogen may have disastrous consequences. Buy a Mac.
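
A first-generation scanner really is not much more sophisticated than this sketch in Python (the only signature included is a prefix of the standard EICAR test string, which is harmless by design): it simply looks for a short, fixed byte sequence, the way an antibody recognizes a short stretch of amino acids.

    # Minimal signature scanner: look for fixed byte patterns in a file,
    # the way an antibody recognizes a short epitope.
    SIGNATURES = {
        "EICAR test file": b"X5O!P%@AP[4\\PZX54(P^)7CC)7}$EICAR",
    }

    def scan(path):
        """Return the names of all signatures found in the file."""
        with open(path, "rb") as f:
            data = f.read()
        return [name for name, sig in SIGNATURES.items() if sig in data]

    if __name__ == "__main__":
        import sys
        for path in sys.argv[1:]:
            matches = scan(path)
            print(path, "->", ", ".join(matches) if matches else "clean")

This is also why polymorphism works as an evasion strategy: change the byte sequence and the fixed signature no longer binds, just like a mutated epitope escapes an antibody.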

Disease and destruction are not part of the definition of viruses, biological or computer ones. Actually, many viruses are almost harmless, and some even have beneficial effects. Any analogy based on lethal effects misses the point. A virus is something that self-replicates, and that's it.
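
If you want to see "something that self-replicates, and that's it" in its purest and most harmless form, here is a classic Python quine: a two-line program whose only effect is to print an exact copy of its own source.

    s = 's = %r\nprint(s %% s)'
    print(s % s)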


Reverse engineering and reverse genetics.

After the classical virus analogy, let's now move into more unexplored territory. As a molecular biologist and an amateur reverse engineer, I can tell you that there are many analogies between the two domains. Once again, their main basis is the similarity between linear genetic code and linear computer binary code.

First, let me explain to biologists (or anyone else) what reverse engineering is. It means that you start with a finished product and you want to understand how it was made, its inner logic, how the programmer made it work. The finished product is a compiled computer program, what is called an executable. It has a special format and contains many things, like data (images, text...), but its real heart is the actual executable code, another name for the list of instructions for the microprocessor to follow. Executables can be self-contained or can call external libraries from time to time, but that does not change the principle, because these libraries are themselves programs.

When you write a piece of software, you type a quite readable and understandable text in some language (Basic, Pascal, C). Then you feed this text, called the source code, to a special program called a compiler, which parses it and transforms it into an executable that the microprocessor can use. When you reverse engineer a program, you do the opposite. Theoretically, you could go all the way from the end product back to the original source code; that is called decompilation. Most of the time you don't have to do that, and it's very often impossible anyway. What you do instead is examine the raw machine language instructions. As we are more intelligent (but slower) than a stupid microprocessor, with some training we can read and understand these instructions. If you do that statically, without executing the program, it's called disassembling. If you do that dynamically, checking the raw instructions one by one while the program is actually running, it's called debugging. The name obviously comes from the fact that it's an excellent method to find and remove programming errors. It is also useful for compatibility and analysis purposes: to find vulnerabilities, to understand the output of a piece of software, or to modify and adapt a program for which the source code has been lost. It is also widely used to remove protections from programs, which is called "cracking".
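
You can get a feel for static disassembly without any special tools: Python ships a disassembler for its own bytecode in the standard library. Conceptually it is the same exercise as disassembling x86 machine code, just on a friendlier instruction set.

    import dis

    def confirm(answer):
        """A tiny function to take apart."""
        if answer == "yes":
            return True
        return False

    # Static analysis: list the raw interpreter instructions
    # without ever running the function.
    dis.dis(confirm)

The output is a list of low-level instructions (load this, compare that, jump there), which is exactly the kind of text a reverse engineer reads all day.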

Imagine, for example, that you want to automate some task, and for this purpose you want to remove a useless confirmation box like "Do you really want to do that? [Yes/No]". You debug the program until you reach an instruction that looks like a call to this confirmation routine. You replace this call with a "do-nothing" instruction, writing it into the original program, then start the program again to see what happens. Bingo: no more confirmation asked. This routine was the right one to modify.
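
The patch itself can be trivial. Here is a sketch in Python (the file name and offset are hypothetical; a debugger gives you the real ones), relying on the fact that on x86 a near call is the opcode 0xE8 followed by a 4-byte offset, and 0x90 is the one-byte do-nothing instruction, NOP:

    # Sketch: overwrite an x86 near call (0xE8 + 4-byte offset, 5 bytes total)
    # with five NOPs (0x90). OFFSET and the file name are made up.
    OFFSET = 0x1A2B          # file offset of the call, found while debugging
    NOP = b"\x90" * 5        # five do-nothing bytes, same total length

    with open("program.exe", "r+b") as f:
        f.seek(OFFSET)
        assert f.read(1) == b"\xE8", "no call opcode here: wrong offset?"
        f.seek(OFFSET)
        f.write(NOP)

Note that the replacement has exactly the same length as the original instruction, so nothing after it shifts; this is the binary equivalent of a clean, precise excision.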

Now let me explain to computer hackers (or anyone else) what reverse genetics is. It's easy: it's exactly the same thing.

There is just one difference. You always have access to the entire code of a computer program. To know the entire genome of an organism, you have to wait for someone to sequence it in full. Sequencing several billion nucleotides (each read an average of 10 times to avoid errors) was a herculean international effort a few years ago, but it's getting faster and faster; Moore's law seems to apply to molecular biology techniques as well - oh, another analogy. More genomes are completed every year. Several model animals, including human and mouse, are already fully sequenced, as well as some plants like rice and Arabidopsis (my favorite weed).

So you can make a few educated guesses: "Maybe this gene/routine is responsible for that. Let's modify it and see what happens". You try to change some precise inner instructions in the DNA to alter a characteristic of the entire organism. In biological terms, you modify the genotype and look for a change in the phenotype: you change a chosen gene and you check what it changes in the final organism (morphology, behavior...).

Let me invent a practical example. You think that the gene XYZ123 is involved in the cold resistance of a plant, but you are not sure, because there are many similar genes, as you know from searching the full sequence of this plant. Remove this gene - just like replacing a call with a do-nothing instruction - by creating a special kind of plant called a knock-out mutant. Now check the resistance of this mutant to cold. In other words, put a batch of them in a fridge, along with control plants. Imagine that your mutants all die quickly, while the normal plants survive. There is definitely a link between the gene you knocked out and the mechanism of cold resistance: if you remove it, the plants become more sensitive. So you may formulate the hypothesis that the opposite is true: if you overexpress this gene, you may get plants that resist cold better. Do that in a transgenic corn, patent it, sell it in Greenland and become rich.
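
The logic of a knock-out experiment maps almost line for line onto code. In this toy Python model (gene names and effects are invented), the genome is a table of gene to gene product, the knock-out deletes one entry, and the fridge test just checks the resulting phenotype:

    # Toy reverse genetics: knock out one gene, compare phenotypes.
    def wild_type_genome():
        return {
            "XYZ123": lambda plant: plant.update(cold_resistant=True),
            "ABC456": lambda plant: plant.update(tall=True),
        }

    def grow(genome):
        """Express every gene product to build the phenotype."""
        plant = {"cold_resistant": False, "tall": False}
        for gene_product in genome.values():
            gene_product(plant)
        return plant

    def knock_out(genome, gene):
        """Delete one gene: the 'do-nothing' replacement."""
        mutant = dict(genome)
        del mutant[gene]
        return mutant

    def fridge_test(plant):
        return "survives" if plant["cold_resistant"] else "dies"

    print(fridge_test(grow(wild_type_genome())))                       # survives
    print(fridge_test(grow(knock_out(wild_type_genome(), "XYZ123"))))  # dies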

That is reverse genetics. From the gene to the phenotype.

Just to be complete, I will explain what forward genetics is, and why it cannot be done in computing. A forward genetics strategy consists in screening a randomly mutated, or naturally variable, population, spotting an interesting behavior or phenotype, and trying to isolate the gene responsible for it. To follow the same example: put thousands of plants in a fridge, come back two days later, and save the ones that are still alive for future analysis. Obviously, this is what was done historically, when no genome sequence was available. Pinpointing which gene was responsible involved a long and messy process, with a lot of crosses and successive sequencing from a known marker toward your unknown gene (called "chromosome walking"). So, forward genetics goes from the phenotype to the gene.

What would forward genetics look like in computing? Simple: modify the software at random, one byte at a time, and, on the example above, wait for your annoying routine to disappear. It's not exactly practical. You would have to check billions of variants (imagine a 4-megabyte program with 256 possible values for each byte) before finding the good one. And many times you would probably never find it, for example if the code is compressed. Compare this to the more or less 30,000 different genes in humans and Arabidopsis (yes, we don't have any more genes than a stupid weed, isn't that humbling?). That's one of the reasons why reverse genetics / engineering is so much more powerful and fast. But you first have to have an idea of what you are looking for: a complete sequence / code, and some previous knowledge of molecular and biochemical mechanisms / of the microprocessor's inner workings.
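
The numbers make the point by themselves. A quick calculation, using the figures from the paragraph above:

    # Single-byte 'forward genetics' on a binary: every byte position
    # times every alternative value for that byte.
    program_size = 4 * 1024 * 1024   # a 4 MB executable
    variants = program_size * 255    # 255 other values per byte position
    print(f"{variants:,}")           # 1,069,547,520 single-byte mutants

    genes = 30_000                   # rough gene count, human or Arabidopsis
    print(f"{variants // genes:,}x more candidates than genes")  # ~35,651x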


Full disclosure and evolution.

I used this analogy in front of a court, to explain why full disclosure is a positive thing for everybody. Well, it didn't seem to work, because I was finally condemned. But I still think it's a good one.

Let's define full disclosure first. I would say it is a diplomatic and fair strategy to put around the same table people with very different goals, methods, philosophies, and financial means - to force them to cooperate, in a way. Basically, a bug hunter and a software vendor. Full disclosure is a plan for everybody to follow to avoid confrontation and nasty legal threats. In the end, it's a triple win: the bug hunter is happy because he helped, the company is happy because it has better software, and the consumers are happy because they are better informed and protected.

How does it work? Imagine that you find a vulnerability in a piece of software. You send an e-mail to the vendor with the technical details. You discuss it; they acknowledge the problem and take some time to update their software. You give them that time. Once all clients have an updated version of the software, you can disclose and publish the full technical details of the vulnerability, sometimes with a demonstration. The company knows that you are going to publish, so there is a bit of pressure on them to correct the flaw. In the end, everybody in the world is informed about the vulnerability, and can learn from it and increase their awareness.

Now, of course, it often does not work like this. Bug hunters want to publish fast and don't want to wait for months. Companies want to avoid bad publicity at all costs, and think they can easily scare the little guy. And they don't always want to correct the bugs: they sometimes think that ignorance, even if it endangers their own clients, is better than a bad reputation. For example, I am just a hobbyist, I have published fewer than twenty software analyses, and I have already been legally threatened twice (once for a steganography tool, once for an anti-virus). In one case it went to court. So in my case, I've been threatened 10% of the time.

So what is the analogy with biology? It's simple. Full disclosure applies a constant selective pressure on software, just like environmental conditions apply an evolutionary pressure on organisms.

Life is a constant battle to survive. You have to adapt fast; if you don't or can't, you end up trapped in an evolutionary dead-end, and your species becomes extinct. It's the same for software. Bug hunters are always out there looking for problems in your software, scrutinizing every single piece of published code. They apply a strong and pitiless pressure and, unless your software is perfect, they will force it to evolve. The worst software from lousy vendors will not resist the pressure and will eventually disappear. The good ones will survive and become fitter, better, and much more secure. And that is good for the people who use them: us, the final consumers. Don't forget that there are plenty of predators out there in this internet era, and computers are becoming central in our lives, so it's more and more important to be well protected. Full disclosure is positive for you.

It may sometimes look harsh on vendors, but it's good for raising the global level of security. I am not the only one to think so. Look at what happened in the cryptography world. When the DES algorithm became really too weak, and the government of the USA decided it was time to choose a new encryption standard, that's exactly the strategy they used to get the best one: an accelerated evolutionary slaughter. Any researcher could submit a new algorithm. Then everybody was asked to find flaws in the other teams' algorithms. After a few rounds of selection, only the strongest ciphers stayed in the race, and finally just one was chosen and became the new standard, called AES.


Open source and science.

Free (as in free speech) and open source software like GNU/Linux and many other excellent programs, often better than their commercial counterparts, enjoy an incredible and deserved success these days. And I enjoy that. People have discovered that free sharing of information and source code is very positive, because new programmers can build on top of what others have already produced; they can collaborate freely, and get insights and fresh ideas from many others across the world. All the information is available to anybody who wants to collaborate on an existing project or start his own. I personally love this philosophy. Share with people. Put knowledge in common.

Maybe people haven't noticed, but that's how academic science has worked for a few centuries. And that's probably why I work in this domain. You get some interesting results, based on previous research by others, and you publish them in a journal, including the whole detailed protocol describing how you obtained them. That would be the source code. You give others everything they need to reproduce your experiments, and to confirm or refute them. Later on, other researchers, in different labs and different countries, will use your own work to add their little piece to the global advancement of science.

All this scientific knowledge is freely available to everybody. That's how it works, and it's efficient. Scientific researchers invented the open source movement centuries before the computer was invented.







