Huffman coding – part 2

The huffman coding is mainly used to compress files that are not already compressed already ( the reason why I say this for is because if you are trying to compress a already compressed file then the assignment 5 will add on more header details onto the file for decompressing the compressed file.

This is a follow on from the part 1 of the assignment 5 this is the decompression of the compressed file.

So to start with, I need to read in the header file of the compressed file. For example if I was going to compress the file

hello there

then the compressed file will be

104=2|101=3|108=2|111=1|32=1|116=1|114=1|-1=1|;u

Huffman coding – part 1

The huffman coding is mainly used to compress files that are not already compressed already ( the reason why I say this for is because if you are trying to compress a already compressed file then the assignment 5 will add on more header details onto the file for decompressing the compressed file.

The basics of this assignment, I have included the assignment PDF to the attached file, is to read in the file and place the characters that are read in (1 character at a time) into a mapped of characters that have there associated value of how many times that there are present in the file, so that the most present characters in the file will end up with a very short bit patten in the compressed file (this is how the file will be smaller, e.g. less space taken up on the disk per character within the file) lets say that there is a sentence of “hi there”, then e will be the most used value e.g. 2, so a character is allot bitter in a bit pattern (to store all of the available characters e.g. a-z, smiles etc.) than a bit patten of 2 in length (01) if that was the case.

So to get started this post is going to be doing the compression of the file, and the second post will be the decompression of the file.

Here is how I am reading in that mapped of characters to there occurrences in the file, theMap is a vector of queueI structure, so that we know that will always have the same values inserted into a vector array, thus when it comes to decompressing the file we want to build up the huffman encoding with the same input values.

   struct queueI {
	   int letter;
	   int value;
   };
 
void Encoding::readFile(ibstream &inStream)
{
	theMap.clear();
	char *readIn = new char(1);
	int i;
	queueI insert;
	while (!inStream.eof())
	{
		inStream.read(readIn,1);
		if (!inStream.eof())
		{
			for (i = 0; i < theMap.size(); i++)
			{
				if (theMap[i].letter == readIn[0]) 
					break;
			}
			if (i < theMap.size())
				theMap[i].value++;
			else
			{
				insert.letter = readIn[0];
				insert.value = 1;
				theMap.add(insert);
			}
		}
	}
	// add the EOF
	insert.letter = PSEUDOEOF;
	insert.value = 1;
	theMap.add(insert);
	delete readIn;
}

The last value is the end of file, a pseudo value that will denote that we have reached the end of file value, I then will need to build up a queue(as taken from the last assignment 4 of this course) of the values taken in from the build of theMap above, so to start with I just load all of the values into queue with there order value (value equals the number of occurrences that they are present in the file) and this is how I am loading the values into the queue.

void Encoding::loadPQueue()
{
	theQueue.clear();
	queueT *qt;
	for (int i =0 ; i < theMap.size(); i++)
	{
		qt = new queueT();
		qt->letter = theMap[i].letter; 
		qt->value = theMap[i].value;
		qt->left = NULL;
		qt->right = NULL;
		qt->filled = true;
		theQueue.add(qt, theMap[i].value);
	}	
}

and here is how I build up the huffman encoding list, what it does is to take the lowest two values (the lowest two occurrences, or later on the lowest two joined values) and put them together with a new node that points to them both and takes the value of both of the values added together, so lets say that the character a has a value of 2 and the letter c has the value of 3, then the new node will have a value of 5 and points to both the characters a,c on the left/rights nodes accordlying.

void Encoding::buildTree()
{
	queueT *newQ, *first, *second;
	int valueIn;
	while (!theQueue.isEmpty())
	{
		newQ = new queueT();
		first = theQueue.extractMin();
		valueIn =first->value;
		newQ->left = first;
		if (!theQueue.isEmpty())
		{
			second = theQueue.extractMin();
			valueIn += second->value;
			newQ->right = second;
		}
		newQ->value = valueIn;
		newQ->filled = false;
		if (!theQueue.isEmpty())
			theQueue.add(newQ, valueIn);
	}
	theQueue.add(newQ, valueIn);
}

right that the main part done!, since when we decompress the file we want to have the same order of the loading into the queue we need to write out the header details to the new decompressed file,

void Encoding::outHeaderToFile(obstream &outStream)
{
	string outStr;
	for (int i =0; i < theMap.size(); i++)
	{
		outStr = IntegerToString(theMap[i].letter);
		outStr +="=";
		outStr += IntegerToString(theMap[i].value);
		outStr +="|";
		outStream.write(outStr.c_str(),outStr.length());	
	}
	// end of header details
	outStr = ";";
	outStream.write(outStr.c_str(), 1);
}

the ; is the end of the header details, thus after that we will write out the bit patterns of the compressed characters :). the output of the header details is the character code=number of occurrences so for character a with 10 occurrences then it would output 97=10|

So since we now need to write out the bit patterns of the characters to the file, we need to find out how we placed those characters into the huffman tree, so I wrote a recursive function that will output the bit pattern of where the character is in the huffman tree

string Encoding::printQueue(queueT *node, int letter,string val)
{
	string returnVal;
	if (node != NULL)
	{
		if (node->left != NULL)
		{
			returnVal = printQueue(node->left, letter, val+"0");
			if (returnVal != "")
				return returnVal;
		}
		if (node->filled)
			if (node->letter == letter)
				return val;
		if (node->right != NULL)
		{
			returnVal = printQueue(node->right, letter, val+"1");
			if (returnVal != "")
				return returnVal;
		}
	}
	return "";
}

If we are heading down the left node, then add a 0 to the bit pattern, if we have found the character and it is classed as a filled node (e.g. only nodes that have a character attached to them are filled) then return how we found that character, or try the right node with adding a 1 to the bit pattern.

Then to just write out the uncompressed file to the compressed file we just go through the file again and then write out all of the bit patterns with ending with the pseudo EOF character.

void Encoding::outToFile(ibstream &inStream, obstream &outStream)
{
	queueT *head;
	string foundStr;
	char *readIn = new char(2);
	while (!inStream.eof())
	{
		inStream.read(readIn,1);
		if (!inStream.eof())
		{
			head = theQueue.peek();
			foundStr = printQueue(head, readIn[0], "");
			for (int i =0; i < foundStr.length(); i++)
			{
				int ch = ((int)foundStr[i])-48;
				outStream.writebit(ch);
			}
		}
	}
	// add in the pseudo-EOF
	head = theQueue.peek();
	foundStr = printQueue(head,PSEUDOEOF, "");
	for (int i =0; i < foundStr.length(); i++)
	{
		int ch = ((int)foundStr[i])-48;
		outStream.writebit(ch);
	}
}

that is about it, I shall do a post on how I do the uncompress-ion next.