If you are a software developer, or even remotely connected to software, you will have heard of something called a database.

In the simplest terms, a database is a software program that manages persistent data, similar to a spreadsheet or a text file. So what makes databases so special?

Let's begin by conceptually creating our own database. We will take a CSV (Comma Separated Values) file as a starting point and then try to make a database out of it. In the beginning our database would look something like this.

1,JOHN DOE,45,MALE,100000
2,JANE DOE,34,FEMALE,100000
3,TIM DOE,32,MALE,80000

We can keep adding employees to the file. This seems to be a fine way of storing data, but it has a serious flaw: how do we find the data?

You may be tempted to say that if you know the line number, aka the ID, you should be able to jump to that line. However, the actual data in the file is not stored as neatly as above; on disk it is one continuous stream of bytes, more like this.

1,JOHN DOE,45,MALE,100000\n2,JANE DOE,34,FEMALE,100000\n3,TIM DOE,32,MALE,80000\n

This does not look pretty, and since the whole thing is on one line, we do not know where our data is, or what the offset is for, say, record number 4.

Let's try to fix this issue first. The ID is a number, so we can fix it to be a 32-bit integer, and similarly fix the length of every other column. But we have a problem: Name is a variable-length string which could be of any length. What if we fix the string to a maximum length and pad it with empty spaces up to that length? Sure, it would waste some space, but it would make our calculations easier. So let's fix a length for each column.

ID: 32-bit integer
NAME: 255 characters (255*8 bits)
AGE: 32-bit integer
GENDER: 6 characters (6*8 bits)
SALARY: 32-bit integer
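This fixed-length layout can be sketched in Python with the standard `struct` module. The format string and the `pack_record`/`unpack_record` helpers are illustrative names, not part of any real database; they just show how padding a string to a fixed width makes every record the same size.

```python
import struct

# Fixed-length layout from the table above:
# id (4 bytes), name (255 bytes), age (4 bytes), gender (6 bytes), salary (4 bytes)
RECORD_FORMAT = "<i255si6si"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 273 bytes == 2184 bits

def pack_record(emp_id, name, age, gender, salary):
    # struct pads short strings with NUL bytes up to the fixed length
    return struct.pack(RECORD_FORMAT, emp_id, name.encode(), age,
                       gender.encode(), salary)

def unpack_record(raw):
    emp_id, name, age, gender, salary = struct.unpack(RECORD_FORMAT, raw)
    # strip the NUL padding on the way out
    return (emp_id, name.rstrip(b"\0").decode(), age,
            gender.rstrip(b"\0").decode(), salary)
```

Every record packed this way is exactly 273 bytes, regardless of how short the name actually is.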

So now each of our records is 2184 bits (273 bytes) long, and if we know the line number, we can jump straight to the record. Here we are assuming that the user knows the line number, but what if they do not?
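The jump itself is simple arithmetic: record N starts at byte (N-1) times the record size. A minimal sketch, using an in-memory buffer to stand in for the data file:

```python
import io

RECORD_SIZE = 273  # bytes per record (2184 bits), from the layout above

def record_offset(line_number):
    # 1-based line number -> byte offset of that record in the file
    return (line_number - 1) * RECORD_SIZE

def read_record(f, line_number):
    f.seek(record_offset(line_number))
    return f.read(RECORD_SIZE)

# demo with an in-memory "file" holding three dummy records
data = io.BytesIO(b"".join(bytes([i]) * RECORD_SIZE for i in (1, 2, 3)))
assert read_record(data, 2) == bytes([2]) * RECORD_SIZE
```

One `seek` and one `read`, no scanning: that is the whole payoff of fixed-length records.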

We could assume that the ID is the line number and is sequential, but that is not guaranteed, and if there are gaps in the data we would end up picking the wrong record. We need something that knows which ID maps to which line number. We need an Index, and we could store it in a separate file like this.

1 -> 1
2 -> 2
3 -> 3
5 -> 4
But if we do that, we would then need an Index for our Index; we need a better solution. What if we keep it in memory? It would use up a lot of memory, but it would make our life easier. Let's pick a fast in-memory data structure for searching: a sorted array. If we have a sorted array, we can run a binary search, one of the fastest search algorithms. So if we store our Index in an array like the one below, we should be fine.

[[1,1],[2,2],[3,3],[5,4]]
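The sorted-index lookup can be sketched with the standard `bisect` module, which implements exactly this binary search. The `(id, line_number)` pair layout mirrors the array above; `lookup` is an illustrative helper name.

```python
from bisect import bisect_left

# the in-memory index: (id, line_number) pairs, sorted by id
index = [(1, 1), (2, 2), (3, 3), (5, 4)]

def lookup(index, emp_id):
    pos = bisect_left(index, (emp_id,))  # binary search on the id
    if pos < len(index) and index[pos][0] == emp_id:
        return index[pos][1]             # the line number on disk
    return None                          # id not present (note the gap at 4)
```

Binary search touches only about log2(n) entries, so even a million-entry index needs roughly 20 comparisons.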
What happens when the array gets too big to hold in memory? We would need to store the array on disk somehow and keep only the relevant part in memory. But how do we split the array? Since it is sorted, if we split it at the midpoint we still have two sorted arrays. We also need some way of knowing which of the two child arrays contains the data we are looking for. Let's split the array at the midpoint and hold the midpoint in memory, so that if the key we are looking for is bigger than the midpoint we pick the right array, else we pick the left array.

[[1,1],[2,2]]              [[5,4]]
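The split-and-route step above can be sketched as follows; here the midpoint entry `(3, 3)` stays in memory while the two halves would live on disk. `split` and `lookup_split` are illustrative names for this sketch.

```python
def split(index):
    # split a sorted array of (id, line) pairs at the midpoint;
    # the midpoint entry stays in memory, the two halves go to disk
    mid = len(index) // 2
    return index[:mid], index[mid], index[mid + 1:]

left, pivot, right = split([(1, 1), (2, 2), (3, 3), (5, 4)])
# left == [(1, 1), (2, 2)], pivot == (3, 3), right == [(5, 4)]

def lookup_split(key):
    if key == pivot[0]:
        return pivot[1]                   # found at the in-memory midpoint
    child = left if key < pivot[0] else right
    return dict(child).get(key)           # search the chosen child array
```

Each lookup inspects the in-memory pivot first and then only one of the two child arrays, which is the essential trick the tree structure generalises.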

If we keep splitting the array like this, and say each array holds 1000 keys, then by the third level we would have 1000*1000*1000 + 1000*1000 + 1000 keys, which is over 1 billion records, enough to count the population of India. This structure is known as a B-Tree, and if you thought B-Tree stands for Binary Tree, no, it does not.
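The capacity arithmetic is easy to check: each level multiplies the reach by the per-array key count (the fanout).

```python
fanout = 1000  # keys per array

# total keys reachable with three levels of 1000-key arrays
total = fanout + fanout**2 + fanout**3
print(total)  # 1001001000, just over 1 billion
```

This is why B-Trees stay so shallow: three or four levels are enough for billions of keys, and each level costs only one disk read.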

Most database indexes are modelled on the B-Tree or a variant of it. Some databases store the actual data (or a pointer to it) in the leaf nodes instead of just a location; this variant is known as a B+ Tree (B Plus Tree).

If the indexed column is unique, the database will almost always use a B-Tree, which is why database architects and designers ask you to use unique indexes, or unique combination indexes, as these can be an order of magnitude faster than other indexes.

If you have nullable columns or long string columns, some databases may fall back to a less efficient index structure (such as a hash table), so as far as possible keep your searchable columns as non-null, unique numbers.

Now that we have our indexes sorted out, let's try to optimise our search a bit. The slowest part of reading information is reading from disk, and it matters less how much data you read than how many times you read it. On spinning disks, I/O speed is determined by the rotation speed: the higher the rotation speed, the faster the disk (this is less relevant for Solid State Drives, which are much faster than older disks). So if we are going to read something, why read only one row? Let's read much more, say 10 or 100 records, and keep them in a cache. The number of bytes a database reads from disk in one I/O is called the Page Size.
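A toy page cache makes the idea concrete: every miss pulls in a whole page, and later reads inside that page cost no I/O at all. This is a minimal sketch with illustrative names, and it deliberately ignores reads that span a page boundary.

```python
import io

class PageCache:
    # Toy page cache: one disk I/O fetches a whole page; later reads
    # within that page are served from memory.
    def __init__(self, f, page_size=4096):
        self.f = f
        self.page_size = page_size
        self.pages = {}    # page number -> page bytes
        self.io_count = 0  # how many times we actually hit the "disk"

    def read(self, offset, size):
        page_no = offset // self.page_size
        if page_no not in self.pages:            # cache miss: one real I/O
            self.f.seek(page_no * self.page_size)
            self.pages[page_no] = self.f.read(self.page_size)
            self.io_count += 1
        start = offset % self.page_size          # cache hit: no I/O
        return self.pages[page_no][start:start + size]

cache = PageCache(io.BytesIO(bytes(range(256)) * 32))  # 8 KiB fake disk
cache.read(0, 100)
cache.read(200, 100)       # same page, served from cache
assert cache.io_count == 1  # only one disk I/O for both reads
```

Notice that the second read is free: that is exactly why reading records in the order they are stored pays off, which leads to the next idea.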

Now that we have our cache filled up, if someone requests data that is already in the cache, we should return it immediately without hitting the disk. However, the cached data is sequential, so the chance of a cache hit is significantly higher if we store the data in the sequence in which we are going to fetch it. If we define a unique key that also determines the order in which records are stored, our chance of a cache hit goes up significantly. This index is called a Clustered Index.

In most databases, the Primary Key is used as the Clustered Index, which is why your database architect will ask you to pick a combination of columns that is unique to serve as the primary key. That may sound strange to you, but they have a perfectly good reason for it.

We will continue to build this database throughout this series; in the next article we will look at Concurrency and Transactions.


