Limit of Software Reuse

Roman Suzi
Published in Star Gazers · Apr 13, 2021 · 7 min read

Research shows there is a limit to code reuse: how much can be reused depends on the problem domain.

A fallen tree resembles a log-log plot of rank vs. frequency. (Photo by R. Suzi)

Software development values reusable code in many different forms and sizes. While the hype around achieving high reusability is over, the goal of producing reusable software has not disappeared. The most common vehicles for code reuse are software libraries and frameworks. It is (hopefully) normal practice in the development of software products and projects to set aside frequently used functions so that development time (including testing and future maintenance) is saved. Good practices like clean coding also advance code reuse by suggesting that code be decomposed into smaller functions. A more exotic example of facilitating code reuse is the Haskell API search engine Hoogle ( https://hoogle.haskell.org/ ), where Haskell programmers can find functions by their types.

Let’s talk only about reusable functions here, even though many other programming artifacts and assets can be reused.

It is also worth mentioning that millions of people have worked on a huge number of functions over the past years, even in the Open Source domain alone. It may feel that everything possible has already been written and that Google can find it. This may be quite close to reality, but your mileage with quality and availability for your platform may vary. And the rest can be copy-pasted from StackOverflow and fixed with a hammer, right?

Let’s imagine, though, that we are not reusing code by copy-pasting, but maintaining it in libraries and software packages. Then there is a very nice mathematical notion applicable to what we actually do: namely, we are compressing our programs by minimizing their Kolmogorov complexity. This is because instead of having some function definition inlined, only its name is used. The more functions we move into a common library, the greater this compression effect.

This is, of course, simplified to some degree: for instance, we are hardly interested in making source code shorter, as many developers try to write code for humans, not only for machines. Also, compilers can inline functions for optimization reasons. However, we can still assume that the resulting machine (or virtual machine) code is what gets shorter when code is reused.
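To make that intuition a bit more concrete, here is a toy Python sketch of my own (not from Veldhuizen’s article) that uses zlib-compressed size as a crude stand-in for Kolmogorov complexity. It compares a “program” where the same logic is copy-pasted at ten call sites with one where the logic is defined once and referenced by name.

```python
import zlib

# Toy illustration (not from the article): use zlib-compressed size as a
# crude stand-in for Kolmogorov complexity.

INLINED_BLOCK = """
result = []
for item in data:
    if item is not None and item > 0:
        result.append(item * item)
"""

# Program A: the same logic copy-pasted (inlined) at ten call sites.
program_inlined = "\n".join(INLINED_BLOCK for _ in range(10))

# Program B: the logic defined once, then referenced by name ten times.
program_reused = (
    "def squares_of_positives(data):\n"
    "    return [x * x for x in data if x is not None and x > 0]\n\n"
    + "\n".join("result = squares_of_positives(data)" for _ in range(10))
)

for name, source in [("inlined", program_inlined), ("reused", program_reused)]:
    raw = len(source.encode())
    packed = len(zlib.compress(source.encode()))
    print(f"{name:8s} raw={raw:4d} bytes  compressed={packed:4d} bytes")
```

The reused version is much shorter in raw form, and the compressor squeezes out the remaining repetition in the inlined one, which is exactly the kind of redundancy that factoring into functions removes by hand.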

Bad news

I was looking for some facts on generic programming when I encountered this article: Software Libraries and Their Reuse: Entropy, Kolmogorov Complexity, and Zipf’s Law by Todd L. Veldhuizen. The abstract lets the truth out:

“A common theme in the software reuse literature is that if we can only get the right environment in place — the right tools, the right generalizations, economic incentives, a “culture of reuse” — then reuse of software will soar, with consequent improvements in productivity and software quality. The analysis developed in this paper paints a different picture: the extent to which software reuse can occur is an intrinsic property of a problem domain, and better tools and culture can have only marginal impact on reuse rates if the domain is inherently resistant to reuse.”

Explanations

It’s impossible to fully explain the three concepts used in the cited article here, so the explanation below is not rigorous.

First of all, let’s compare our programs with literature. There is some concept or function behind the words we use for communication, and the language continuously changes. With fixed-wing aircraft becoming a normal sight in everyday life, we talk about them as airplanes or even just planes. Electronic mail becomes email. We are compressing language so that frequently used words become shorter. And if that is not enough, there are all kinds of abbreviations. To put it simply, we do the same with source code: frequently used functions are given names and referenced instead of being inlined.

The Kolmogorov complexity of some object can be quite informally defined as the length of the shortest program (in some programming language) that outputs the object. It can be proven that the choice of language is not essential: Kolmogorov complexity is defined up to a constant, making a Turing machine and a Java compiler equally suitable for the measurement. The article also proposes that randomly chosen compiled C programs are not compressible any further in the Kolmogorov complexity sense. (It’s similar to how random data can’t be compressed by an archiver.)
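As a quick illustration of that last point (again my own sketch, with zlib as a practical and imperfect proxy for Kolmogorov complexity): random bytes barely compress at all, while highly regular data of the same length shrinks dramatically.

```python
import os
import zlib

# My own illustration of "random data can't be compressed", with zlib as a
# practical (and imperfect) proxy for Kolmogorov complexity.
random_bytes = os.urandom(100_000)       # high-entropy data
patterned_bytes = b"abcdefgh" * 12_500   # very regular data of the same length

for label, blob in [("random", random_bytes), ("patterned", patterned_bytes)]:
    ratio = len(zlib.compress(blob)) / len(blob)
    print(f"{label:9s} compressed to {ratio:.1%} of original size")
```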

There is a statistical law which arises for all natural (and not so natural) languages: Zipf’s law. It originally stated that “given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table” (Wikipedia). The rank in the frequency table can be roughly understood as the place a word has in a list of words sorted by frequency of appearance in the given corpus. The law emerges in many other areas, not just linguistics, but what is remarkable, and what Veldhuizen’s article finds, is that function reuse in software also follows Zipf’s law. Citing Wikipedia again: “Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.” And the same happens to functions. Here, of course, we are talking in a statistical sense. If we plot rank (x) against frequency (y) on a log-log plot, we will get an almost straight line (like the one traced by the fallen tree at the top of this article).
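Checking Zipf’s law on a corpus takes only a few lines; the same counting could just as well be applied to function calls extracted from a codebase. This is my own sketch, and corpus.txt is just a placeholder file name.

```python
from collections import Counter

# My own quick check of Zipf's law on a plain-text corpus; the same counting
# could be applied to function names extracted from a codebase.
# "corpus.txt" is just a placeholder file name.
with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

counts = Counter(words).most_common(20)
top_freq = counts[0][1]

print(f"{'rank':>4} {'word':>15} {'freq':>8} {'Zipf predicts':>14}")
for rank, (word, freq) in enumerate(counts, start=1):
    # An ideal Zipf distribution predicts frequency ≈ top frequency / rank.
    print(f"{rank:>4} {word:>15} {freq:>8} {top_freq / rank:>14.1f}")
```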

Similarly, it’s beyond the scope of this post to explain what entropy (also called Shannon entropy) is in information theory. The entropy of a random variable is the average amount of information needed to communicate that variable’s outcomes. For example, to communicate the result of a fair coin toss, we need 1 bit of information, so the entropy of a coin toss is 1 bit. Entropy is thus related to the level of surprise some random source gives.
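The definition translates directly into a few lines of Python (my own sketch):

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy_bits([0.5, 0.5]))                # fair coin toss: 1.0 bit
print(entropy_bits([0.9, 0.1]))                # biased coin: ~0.47 bits, less surprise
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # fair four-sided die: 2.0 bits
```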

The discussed article says that Zipf’s law emerges “due to the tendency of libraries to evolve so that programmers can write as little code as possible, which in turn implies evolution toward maximum entropy in compiled code”; this is exactly what the empirical study observes. Note the analogy with natural languages here.

There are quite a few definitions, claims and theorems in the article, but the bridge built there between entropy of the resulting code, Zipf’s law, and Kolmogorov complexity of programs boils down to just one parameter called H, which can take values from 0 to 1 inclusive. The parameter H formalizes the informal notion of how far from the uniform distribution the distribution of programs in a certain domain is. H = 1 means the distribution is uniform, so no compression is possible. The situation is more interesting when H is less than 1.

And here comes one more intriguing result: “In a problem domain with entropy parameter H, the expected proportion of code that may be reused from a library is at most 1-H”.

This can be cited when some managers insist on reusing 90% of the code: the goal may be unattainable in principle.
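To get a feel for the bound, here is a toy calculation of mine. Veldhuizen defines H precisely in the paper; below I simply treat it as entropy normalized by its maximum, with a made-up usage distribution, just to see what the 1 - H ceiling looks like.

```python
import math

def normalized_entropy(probabilities):
    """Shannon entropy divided by its maximum, log2(number of outcomes)."""
    h = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return h / math.log2(len(probabilities))

# A made-up "domain" with four kinds of program fragments and skewed usage.
usage = [0.70, 0.15, 0.10, 0.05]
H = normalized_entropy(usage)
print(f"H ≈ {H:.2f}, reuse ceiling 1 - H ≈ {1 - H:.0%}")

# A uniform distribution gives H = 1 and hence a reuse ceiling of 0%.
print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))
```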

Pure speculations

There is, of course, little comfort in knowing that there are theoretical limits to perfecting our beloved software. In a sense, maximizing entropy is the goal of any refactoring. Interested readers with some background in statistics and math can take a look at Todd Veldhuizen’s article: there are a lot of beautiful metaphors and intuitions about programmers’ behaviour, backed by rigorous mathematical reasoning.

The theoretical restriction does not sound that problematic, because we really do not know the value of H for the domain we work in. If I were to guess, the problem may be more evident in embedded programming, where devices are quite limited and the length of the program may matter. The same can probably be true for APU/GPU programming. In a sense, it’s not that big a problem for the rest of us.

What follows is purely my own speculation, and it is closely related to so-called zero-code or low-code programming. The same limit applies there as well! Just imagine that a non-programmer, an advanced user, is to program for a specific domain in such a low-code system, backed by gigabytes of software hidden from sight (those gigabytes are actually irrelevant for measuring entropy, because they are no surprise: they are already in the system). We can easily see that the amount of information in the user input is quite low. I believe that most of the entropy will be in the natural language embedded in such programs. I can also claim that normal, textual source code written by professional programmers is much more entropy-rich.

As a corollary, a (graphical) user interface is actually a very entropy-poor way to get information into the system, unless you are drawing an image with a touch-sensitive stylus. (Here it’s interesting to observe that, for example, OK and Cancel buttons are not like tossing a fair coin: the Cancel button gives more (surprising) information to the system than the OK button.)
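In bits, with made-up click probabilities (my own illustration):

```python
import math

def surprisal_bits(probability):
    """Self-information of a single outcome, in bits."""
    return -math.log2(probability)

# Made-up click probabilities: users press OK far more often than Cancel,
# so a Cancel click carries more information when it does happen.
p_ok, p_cancel = 0.95, 0.05
print(f"OK:     {surprisal_bits(p_ok):.2f} bits")      # ≈ 0.07 bits
print(f"Cancel: {surprisal_bits(p_cancel):.2f} bits")  # ≈ 4.32 bits
```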

This brings me to automatic programming that utilizes artificial intelligence. It is obvious that the theoretical limit covers this case as well. Artificial intelligence is not capable of compressing random data any better than the best compression software: it’s theoretically impossible. AI-assisted coding (like Kite) can still help one create code, because AI can capture a common pattern, which can then be expressed as a function of its own. This, of course, is helpful, but it stays within the same limits. If said AI learned only from the same codebase, then that codebase may simply have good potential to be compressed.

It’s more interesting to see how a more advanced use of AI, autoformalization, fits the picture, especially if it leverages tons of human knowledge. Some really interesting code can appear. It most probably will not just work out of the box, for the same reason: a certain amount of information content still needs to be transferred from the “product owner” to the “AI programmer”. But at least there is hope that AI can elaborate corner cases better.

There is no conclusion to the topic of software reuse, but now we know that there are hard limits along the way.
