Sunday, April 6, 2008

Why The Array Index Should Start From 0

I recently came across an article by Dijkstra explaining something that had always looked obvious to me: why array indexing should start from 0. I was amused to find out that there is some clever thinking behind it.

Most popular languages, like C/C++, Java or Perl, start indexing an array from 0, so the last index is the array length minus 1. While this feels natural to most developers, it is not the rule in every programming language. In Fortran, for example, declaring an array with 'integer a(10)', i.e. an integer array of 10 elements, gives indices that start at 1 and end at 10 (you can override this with a declaration like 'integer a(0:9)', which gives indices from 0 to 9, but that is not the default).
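
For comparison, here is a minimal C sketch (the array name a is just illustrative) of the same 10-element array under C's default convention, where the valid indices run from 0 to 9:

#include <stdio.h>

int main(void) {
    int a[10];                       /* 10 elements, valid indices 0 through 9 */
    for (int i = 0; i < 10; i++)
        a[i] = i + 1;                /* store the values 1..10 */
    printf("%d %d\n", a[0], a[9]);   /* first element a[0], last element a[9]: prints 1 10 */
    return 0;
}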

The discussion over why the array index should start at zero is not a trivial one, and it touches on interesting concepts from computer science. First of all, it is closely tied to language design. In C, for example, the name of an array is essentially a pointer to its first element, a reference to a memory location, so the expression array[n] refers to the memory location n elements away from the starting element. In other words, the index is used as an offset. The first element of the array sits exactly at the memory location the array name refers to (zero elements away), so it is denoted array[0]. Most programming languages have been designed this way, so indexing from 0 is pretty much inherent to the language.
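
To make the offset idea concrete, here is a small C sketch (the array a and its values are just illustrative) showing that a[n] and *(a + n) name the same element, and that the first element is the one zero elements away from the start:

#include <stdio.h>

int main(void) {
    int a[5] = {10, 20, 30, 40, 50};
    /* a[2] is the element two positions past the start of the array, */
    /* exactly what the pointer expression *(a + 2) refers to.        */
    printf("%d %d\n", a[2], *(a + 2));   /* prints: 30 30 */
    /* a[0] is zero elements away from the start, i.e. *a itself. */
    printf("%d\n", a[0] == *a);          /* prints: 1 */
    return 0;
}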

However, Dijkstra explains why we should index from 0 regardless of language design. The question boils down to how to denote a subsequence of natural numbers, say for example 1, 2, 3, ..., 10. There are four notations available:
a. 0 < i < 11
b. 1 <= i < 11
c. 0 < i <= 10
d. 1 <= i <= 10

Dijkstra argues that a proper notation should be able to express the following two cases naturally:
1. The subsequence includes the smallest natural number, 0
2. The subsequence is empty


Requirement 1. rules out a. and c., since for a subsequence starting at 0 they would take the form -1<i, which uses a number not lying in the natural number set (Dijkstra calls this ugly). So we are left with b. and d. Requirement 2. then rules out d.: take a subsequence starting at 0 and shrink it to the empty one, and d. takes the form 0<=i<=-1, which is a little... well, messed up! Convention b. has another plus: subtracting the bounds gives exactly the length of the sequence. Hence we are left with b., the half-open form with an inclusive lower bound and exclusive upper bound. Start the indices at 0 and a sequence of N elements is written simply as 0<=i<N, which is by far the most widely used notation in programming now.
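
To see how convention b. pays off in code, here is a hedged C sketch (the helper sum_range and the names lo, hi, k are my own, just for illustration): with a half-open range lo <= i < hi, the length is simply hi - lo, the empty range is the natural lo == hi, and splitting [0, N) at any point k gives two ranges [0, k) and [k, N) that meet with no gap and no overlap.

#include <stdio.h>

/* Sum the elements a[lo], ..., a[hi-1] of the half-open range [lo, hi). */
int sum_range(const int *a, int lo, int hi) {
    int sum = 0;
    for (int i = lo; i < hi; i++)   /* the range holds exactly hi - lo elements */
        sum += a[i];
    return sum;                     /* an empty range (lo == hi) sums to 0 */
}

int main(void) {
    int a[6] = {1, 2, 3, 4, 5, 6};
    int N = 6, k = 2;
    /* Splitting [0, N) at k: the two halves cover the whole array exactly once. */
    printf("%d\n", sum_range(a, 0, N));                       /* prints: 21 */
    printf("%d\n", sum_range(a, 0, k) + sum_range(a, k, N));  /* prints: 21 */
    return 0;
}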

Now you know. So, remember and take pride in the fact that each time you write something like

for (i = 0; i < N; i++) {
    sum += a[i];
}
you are not just following the rules of language notation. You are also promoting mathematical beauty!