Huffman coding – part 1

The huffman coding is mainly used to compress files that are not already compressed already ( the reason why I say this for is because if you are trying to compress a already compressed file then the assignment 5 will add on more header details onto the file for decompressing the compressed file.

The basics of this assignment, I have included the assignment PDF to the attached file, is to read in the file and place the characters that are read in (1 character at a time) into a mapped of characters that have there associated value of how many times that there are present in the file, so that the most present characters in the file will end up with a very short bit patten in the compressed file (this is how the file will be smaller, e.g. less space taken up on the disk per character within the file) lets say that there is a sentence of “hi there”, then e will be the most used value e.g. 2, so a character is allot bitter in a bit pattern (to store all of the available characters e.g. a-z, smiles etc.) than a bit patten of 2 in length (01) if that was the case.

So to get started this post is going to be doing the compression of the file, and the second post will be the decompression of the file.

Here is how I am reading in that mapped of characters to there occurrences in the file, theMap is a vector of queueI structure, so that we know that will always have the same values inserted into a vector array, thus when it comes to decompressing the file we want to build up the huffman encoding with the same input values.

   struct queueI {
	   int letter;
	   int value;
   };
 
void Encoding::readFile(ibstream &inStream)
{
	theMap.clear();
	char *readIn = new char(1);
	int i;
	queueI insert;
	while (!inStream.eof())
	{
		inStream.read(readIn,1);
		if (!inStream.eof())
		{
			for (i = 0; i < theMap.size(); i++)
			{
				if (theMap[i].letter == readIn[0]) 
					break;
			}
			if (i < theMap.size())
				theMap[i].value++;
			else
			{
				insert.letter = readIn[0];
				insert.value = 1;
				theMap.add(insert);
			}
		}
	}
	// add the EOF
	insert.letter = PSEUDOEOF;
	insert.value = 1;
	theMap.add(insert);
	delete readIn;
}

The last value is the end of file, a pseudo value that will denote that we have reached the end of file value, I then will need to build up a queue(as taken from the last assignment 4 of this course) of the values taken in from the build of theMap above, so to start with I just load all of the values into queue with there order value (value equals the number of occurrences that they are present in the file) and this is how I am loading the values into the queue.

void Encoding::loadPQueue()
{
	theQueue.clear();
	queueT *qt;
	for (int i =0 ; i < theMap.size(); i++)
	{
		qt = new queueT();
		qt->letter = theMap[i].letter; 
		qt->value = theMap[i].value;
		qt->left = NULL;
		qt->right = NULL;
		qt->filled = true;
		theQueue.add(qt, theMap[i].value);
	}	
}

and here is how I build up the huffman encoding list, what it does is to take the lowest two values (the lowest two occurrences, or later on the lowest two joined values) and put them together with a new node that points to them both and takes the value of both of the values added together, so lets say that the character a has a value of 2 and the letter c has the value of 3, then the new node will have a value of 5 and points to both the characters a,c on the left/rights nodes accordlying.

void Encoding::buildTree()
{
	queueT *newQ, *first, *second;
	int valueIn;
	while (!theQueue.isEmpty())
	{
		newQ = new queueT();
		first = theQueue.extractMin();
		valueIn =first->value;
		newQ->left = first;
		if (!theQueue.isEmpty())
		{
			second = theQueue.extractMin();
			valueIn += second->value;
			newQ->right = second;
		}
		newQ->value = valueIn;
		newQ->filled = false;
		if (!theQueue.isEmpty())
			theQueue.add(newQ, valueIn);
	}
	theQueue.add(newQ, valueIn);
}

right that the main part done!, since when we decompress the file we want to have the same order of the loading into the queue we need to write out the header details to the new decompressed file,

void Encoding::outHeaderToFile(obstream &outStream)
{
	string outStr;
	for (int i =0; i < theMap.size(); i++)
	{
		outStr = IntegerToString(theMap[i].letter);
		outStr +="=";
		outStr += IntegerToString(theMap[i].value);
		outStr +="|";
		outStream.write(outStr.c_str(),outStr.length());	
	}
	// end of header details
	outStr = ";";
	outStream.write(outStr.c_str(), 1);
}

the ; is the end of the header details, thus after that we will write out the bit patterns of the compressed characters :). the output of the header details is the character code=number of occurrences so for character a with 10 occurrences then it would output 97=10|

So since we now need to write out the bit patterns of the characters to the file, we need to find out how we placed those characters into the huffman tree, so I wrote a recursive function that will output the bit pattern of where the character is in the huffman tree

string Encoding::printQueue(queueT *node, int letter,string val)
{
	string returnVal;
	if (node != NULL)
	{
		if (node->left != NULL)
		{
			returnVal = printQueue(node->left, letter, val+"0");
			if (returnVal != "")
				return returnVal;
		}
		if (node->filled)
			if (node->letter == letter)
				return val;
		if (node->right != NULL)
		{
			returnVal = printQueue(node->right, letter, val+"1");
			if (returnVal != "")
				return returnVal;
		}
	}
	return "";
}

If we are heading down the left node, then add a 0 to the bit pattern, if we have found the character and it is classed as a filled node (e.g. only nodes that have a character attached to them are filled) then return how we found that character, or try the right node with adding a 1 to the bit pattern.

Then to just write out the uncompressed file to the compressed file we just go through the file again and then write out all of the bit patterns with ending with the pseudo EOF character.

void Encoding::outToFile(ibstream &inStream, obstream &outStream)
{
	queueT *head;
	string foundStr;
	char *readIn = new char(2);
	while (!inStream.eof())
	{
		inStream.read(readIn,1);
		if (!inStream.eof())
		{
			head = theQueue.peek();
			foundStr = printQueue(head, readIn[0], "");
			for (int i =0; i < foundStr.length(); i++)
			{
				int ch = ((int)foundStr[i])-48;
				outStream.writebit(ch);
			}
		}
	}
	// add in the pseudo-EOF
	head = theQueue.peek();
	foundStr = printQueue(head,PSEUDOEOF, "");
	for (int i =0; i < foundStr.length(); i++)
	{
		int ch = ((int)foundStr[i])-48;
		outStream.writebit(ch);
	}
}

that is about it, I shall do a post on how I do the uncompress-ion next.

3 thoughts on “Huffman coding – part 1”

  1. Hello,
    I’m working on a huffman compression program for my grad school application and am a little confused by what people mean when they refer to the Pseudo EOF character. I understand that it is used to denote the end of the compressed file, but how would one go about choosing it? I am currently assigning the NULL character a frequency of 1 and using the huffman code that I give it as my pseudo eof. Am I on the right track here?

    Thank you.

  2. Hi Luke, I would say that you are in the right direction, but since the huffman encoding could give you a value of 0, then may be better to set the pseudo EOF character to something like -2, -1 sort of thing, since there is no character that has that value.

    The pseudo EOF basically means the end of file character since you are reading in bits of the file and not in chunks as such, the actual EOF may have extra information in there that you really do not want, so you want to stop the decoding when you need to and not when you are forced too by the operating systems file structure.

    HTH
    genux

  3. hello, how would the main for this would look like and what standard libraries are included for this

Leave a Reply

Your email address will not be published. Required fields are marked *