Data Compression Project Mini Project Report
Submitted to DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
By Samir Sheriff and Satvik N
In partial fulfilment of the requirements for the award of the degree of
BACHELOR OF ENGINEERING IN COMPUTER SCIENCE AND ENGINEERING
R V College of Engineering (Autonomous Institute, Affiliated to VTU) BANGALORE - 560059
May 2012
DECLARATION

We, Samir Sheriff and Satvik N, bearing USN numbers 1RV09CS093 and 1RV09CS095 respectively, hereby declare that the dissertation entitled "Data Compression Project", completed and written by us, has not previously formed the basis for the award of any degree, diploma or certificate of any other University.
Bangalore
Samir Sheriff USN:1RV09CS093 Satvik N USN:1RV09CS095
R V COLLEGE OF ENGINEERING (Autonomous Institute Affiliated to VTU) DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
This is to certify that the dissertation entitled "Data Compression Project", which is being submitted herewith for the award of B.E., is the result of the work completed by Samir Sheriff and Satvik N under my supervision and guidance.
Signature of Guide (Name of the Guide)
Signature of Head of Department (Dr. N K Srinath)
Signature of Principal (Dr. B.S. Sathyanarayana)
Name of Examiner 1:
Name of Examiner 2:
Signature of Examiner 1:
Signature of Examiner 2:
ACKNOWLEDGEMENT

The euphoria and satisfaction of completing the project would be incomplete without thanking the people responsible for this venture. We acknowledge RVCE (Autonomous under VTU) for providing the opportunity to create a mini-project in the 5th semester. We express our gratitude towards Prof. B.S. Satyanarayana, Principal, R.V.C.E., for the constant encouragement and facilities extended in the completion of this project. We would like to thank Prof. N.K. Srinath, HOD, CSE Dept., for providing excellent lab facilities for the completion of the project. We would personally like to thank our project guides, Chaitra B.H. and Suma B., and also the lab in-charge, for providing timely assistance and guidance. We are indebted to the co-operation given by the lab administrators and lab assistants, who have played a major role in bringing out the mini-project in its present form.

Bangalore

Samir Sheriff, 6th semester, CSE, USN: 1RV09CS093
Satvik N, 6th semester, CSE, USN: 1RV09CS095
ABSTRACT

The project "Data Compression Techniques" is aimed at developing programs that transform a string of characters in some representation (such as ASCII) into a new string (of bits, for example) which contains the same information but whose length is as small as possible. Compression is useful because it helps reduce the consumption of resources such as storage space or transmission capacity. The design of data compression schemes involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (e.g., when using lossy data compression), and the computational resources required to compress and uncompress the data.

Many data processing applications require storage of large volumes of data, and the number of such applications is constantly increasing as the use of computers extends to new disciplines. Compressing data to be stored or transmitted reduces storage and/or communication costs. When the amount of data to be transmitted is reduced, the effect is that of increasing the capacity of the communication channel. Similarly, compressing a file to half of its original size is equivalent to doubling the capacity of the storage medium. It may then become feasible to store the data at a higher, thus faster, level of the storage hierarchy and reduce the load on the input/output channels of the computer system.
Contents

ACKNOWLEDGEMENT
ABSTRACT
CONTENTS
1 INTRODUCTION
  1.1 SCOPE
2 REQUIREMENT SPECIFICATION
3 Compression
  3.1 A Naive Approach
  3.2 The Basic Idea
  3.3 Building the Huffman Tree
  3.4 An Example
    3.4.1 An Example: "go go gophers"
    3.4.2 Example Encoding Table
    3.4.3 Encoded String
4 Decompression
  4.1 Storing the Huffman Tree
  4.2 Creating the Huffman Table
  4.3 Storing Sizes
5 CONCLUSION AND FUTURE WORKS
BIBLIOGRAPHY
APPENDICES
Chapter 1 INTRODUCTION

The project "Data Compression Techniques" is aimed at developing programs that transform a string of characters in some representation (such as ASCII) into a new string (of bits, for example) which contains the same information but whose length is as small as possible. Compression is useful because it helps reduce the consumption of resources such as data space or transmission capacity. The design of data compression schemes involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (e.g., when using lossy data compression), and the computational resources required to compress and uncompress the data.
1.1 SCOPE

Data compression techniques find applications in almost all fields. To list a few:

• Audio data compression: compression reduces the transmission bandwidth and storage requirements of audio data. Audio compression algorithms are implemented in software as audio codecs. Lossy audio compression algorithms, which provide higher compression at the cost of fidelity, are used in numerous audio applications. These algorithms almost all rely on psychoacoustics to eliminate less audible or meaningful sounds, thereby reducing the space required to store or transmit them.
• Video compression uses modern coding techniques to reduce redundancy in video data. Most video compression algorithms and codecs combine spatial image compression and temporal motion compensation. Video compression is a practical implementation of source coding in information theory. In practice most video codecs also use audio compression techniques in parallel to compress the separate, but combined data streams.
• Grammar-based codes: these can compress highly repetitive text extremely well, for instance biological data collections of the same or related species, huge versioned document collections, internet archives, etc. The basic task of grammar-based codes is constructing a context-free grammar deriving a single string. Sequitur and Re-Pair are practical grammar compression algorithms for which public implementations are available.
Dept. of CSE, R V C E, Bangalore. Feb 2012 - May 2013
Chapter 2 REQUIREMENT SPECIFICATION

A Software Requirement Specification (SRS) is an important part of the software development process. This chapter gives an overall description of the Data Compression Project: its specific requirements, the software and hardware requirements, and the functionality of the system.

Software Requirements
• Front End: Qt GUI Application. • Back End: C++ • Operating System: Linux. Hardware Requirements
• Processor: Intel Pentium 4 or higher version • RAM: 512MB or more • Hard disk: 5 GB or less
Chapter 3 Compression

We'll look at how the string "go go gophers" is encoded in ASCII, how we might save bits using a simpler coding scheme, and how Huffman coding is used to compress the data, resulting in still more savings.
3.1 A Naive Approach

With an ASCII encoding (8 bits per character) the 13-character string "go go gophers" requires 104 bits. The table below shows how the coding works.

char   ASCII   binary    3-bit code
 g      103    1100111   000
 o      111    1101111   001
 p      112    1110000   010
 h      104    1101000   011
 e      101    1100101   100
 r      114    1110010   101
 s      115    1110011   110
 ' '     32    0100000   111
The string "go go gophers" would be written (coded numerically) as 103 111 32 103 111 32 103 111 112 104 101 114 115. Although not easily readable by humans, this would be written as the following stream of bits (the spaces would not be written, just the 0's and 1's): 1100111 1101111 0100000 1100111 1101111 0100000 1100111 1101111 1110000 1101000 1100101 1110010 1110011.

Since there are only eight different characters in "go go gophers", it's possible to use only 3 bits to encode the different characters. We might, for example, use the 3-bit encoding shown in the table above, though other 3-bit encodings are possible. Now the string "go go gophers" would be encoded as 0 1 7 0 1 7 0 1 2 3 4 5 6 or, as bits: 000 001 111 000 001 111 000 001 010 011 100 101 110.

By using three bits per character, the string "go go gophers" uses a total of 39 bits instead of 104 bits. More bits can be saved if we use fewer than three bits to encode characters like g, o, and space that occur frequently, and more than three bits to encode characters like e, p, h, r, and s that occur less frequently in "go go gophers".
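The bit counts above can be checked with a short sketch. This is illustrative only; the 3-bit code assignment follows the example table and the function name is ours, not part of the project's code.

```cpp
#include <cassert>
#include <map>
#include <string>
using namespace std;

// Count bits used by a fixed 3-bit code for a string over the
// eight characters of "go go gophers" (g=000, o=001, ..., ' '=111).
int fixedCodeBits(const string &text)
{
    map<char, string> code = {
        {'g', "000"}, {'o', "001"}, {'p', "010"}, {'h', "011"},
        {'e', "100"}, {'r', "101"}, {'s', "110"}, {' ', "111"}};
    int bits = 0;
    for (char c : text)
        bits += code[c].size();   // 3 bits per character
    return bits;
}
```

For the 13-character string this gives 39 bits, against 13 x 8 = 104 bits for ASCII.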
3.2
The Basic Idea
This is the basic idea behind Huffman coding: to use fewer bits for more frequently occurring characters. We'll see how this is done using a tree that stores characters at the leaves, and whose root-to-leaf paths provide the bit sequences used to encode the characters. We'll use Huffman's algorithm to construct a tree that is used for data compression. We'll assume that each character has an associated weight equal to the number of times the character occurs in a file, for example. In the "go go gophers" example, the characters 'g' and 'o' have weight 3, the space has weight 2, and the other characters have weight 1. When compressing a file we'll need to calculate these weights; we'll ignore this step for now and assume that all character weights have been calculated.
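The weight-calculation step that is assumed here can be sketched in a few lines (the function name is illustrative, not the project's):

```cpp
#include <map>
#include <string>
using namespace std;

// Tally the weight (occurrence count) of each character in the input,
// mirroring the frequency pass described above.
map<char, int> computeWeights(const string &text)
{
    map<char, int> freq;
    for (char c : text)
        freq[c]++;
    return freq;
}
```

Applied to "go go gophers", this yields weight 3 for 'g' and 'o', 2 for the space, and 1 for each remaining character.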
3.3 Building the Huffman Tree
Huffman’s algorithm assumes that we’re building a single tree from a group (or forest) of trees. Initially, all the trees have a single node with a character and the character’s weight. Trees are combined by picking two trees, and making a new tree from the two trees. This decreases the number of trees by one at each step since two trees are combined into one tree. The algorithm is as follows:
• Begin with a forest of trees. All trees are one node, with the weight of the tree equal to the weight of the character in the node. Characters that occur most frequently have the highest weights. Characters that occur least frequently have the smallest weights.
• Repeat this step until there is only one tree: choose the two trees with the smallest weights, call these trees T1 and T2. Create a new tree whose root has a weight equal to the sum of the weights T1 + T2, whose left subtree is T1 and whose right subtree is T2.
• The single tree left after the previous step is an optimal encoding tree.
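The steps above can be sketched with std::priority_queue used as a min-heap. The project's own implementation (in the appendix) maintains the heap by hand; the Node struct and names here are illustrative only, and the sketch leaks nodes for brevity.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <vector>
using namespace std;

struct Node
{
    char ch;
    int weight;
    Node *left, *right;
};

struct ByWeight
{
    bool operator()(const Node *a, const Node *b) const
    {
        return a->weight > b->weight;   // smallest weight on top
    }
};

// Repeatedly combine the two lightest trees until one remains.
Node *buildTree(const map<char, int> &freq)
{
    priority_queue<Node *, vector<Node *>, ByWeight> forest;
    for (const auto &p : freq)
        forest.push(new Node{p.first, p.second, nullptr, nullptr});
    while (forest.size() > 1)
    {
        Node *t1 = forest.top(); forest.pop();
        Node *t2 = forest.top(); forest.pop();
        // Internal node: weight is the sum of its two subtrees.
        forest.push(new Node{'\0', t1->weight + t2->weight, t1, t2});
    }
    return forest.top();
}
```

The root of the final tree carries the total character count; for "go go gophers" that is 13.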
3.4 An Example

3.4.1 An Example: "go go gophers"

We'll use the string "go go gophers" as an example. Initially we have a forest of single-node trees: 'g' and 'o' with weight 3, the space with weight 2, and 'p', 'h', 'e', 'r' and 's' with weight 1 each. Each node carries a weight/count that represents the number of times the node's character occurs.
3.4.2 Example Encoding Table
The character encoding induced by the final tree is shown below, where again 0 is used for left edges and 1 for right edges.

char   code
 g     00
 o     01
 ' '   100
 e     101
 s     1100
 h     1101
 p     1110
 r     1111
3.4.3 Encoded String
The string "go go gophers" would be encoded as shown (with spaces used for easier reading; the spaces wouldn't appear in the real encoding): 00 01 100 00 01 100 00 01 1110 1101 101 1111 1100
In total, 37 bits are used to encode "go go gophers", compared with 39 bits for the fixed 3-bit code and 104 bits for ASCII. There are several trees that yield an optimal 37-bit encoding of "go go gophers". The tree that actually results from a programmed implementation of Huffman's algorithm will be the same each time the program is run for the same weights (assuming no randomness is used in creating the tree).
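The 37-bit total can be verified by summing weight x code length over the encoding table (the helper below is hypothetical, not one of the project's classes):

```cpp
#include <cassert>
#include <map>
#include <string>
using namespace std;

// Weighted code length: sum over characters of (weight * code length).
int encodedBits(const map<char, int> &freq, const map<char, string> &code)
{
    int total = 0;
    for (const auto &p : freq)
        total += p.second * (int)code.at(p.first).size();
    return total;
}
```

With the example weights and the codes induced by the tree, the sum is 3*2 + 3*2 + 2*3 + 4*1 + 4*1 + 3*1 + 4*1 + 4*1 = 37.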
Chapter 4 Decompression Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream (reaching a leaf node necessarily terminates the search for that particular byte value). Before this can take place, however, the Huffman tree must be somehow reconstructed.
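The traversal just described can be sketched as follows (the Node struct is illustrative; the project's actual tree class is Charnode, listed in the appendix):

```cpp
#include <cassert>
#include <string>
using namespace std;

struct Node { char ch; Node *left, *right; };

// Decode a bit string by walking the tree: 0 goes left, 1 goes right,
// and reaching a leaf emits that character and restarts at the root.
string decode(const Node *root, const string &bits)
{
    string out;
    const Node *cur = root;
    for (char b : bits)
    {
        cur = (b == '0') ? cur->left : cur->right;
        if (cur->left == nullptr && cur->right == nullptr)
        {
            out += cur->ch;
            cur = root;   // leaf reached: restart for the next symbol
        }
    }
    return out;
}
```

Because Huffman codes are prefix codes, no look-ahead is needed: a leaf unambiguously ends each symbol.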
4.1 Storing the Huffman Tree
• In the simplest case, where character frequencies are fairly predictable, the tree can be preconstructed (and even statistically adjusted on each compression cycle) and thus reused every time, at the expense of at least some measure of compression efficiency.
• Otherwise, the information to reconstruct the tree must be sent a priori.

• A naive approach might be to prepend the frequency count of each character to the compression stream. Unfortunately, the overhead in such a case could amount to several kilobytes, so this method has little practical use.
• Another method is to simply prepend the Huffman tree, bit by bit, to the output stream. For example, assuming that the value 0 represents a parent node and 1 a leaf node, whenever the latter is encountered the tree-building routine simply reads the next 8 bits to determine the character value of that particular leaf. The process continues recursively until the last leaf node is reached; at that point, the Huffman tree will have been faithfully reconstructed. The overhead using such a method ranges from roughly 2 to 320 bytes (assuming an 8-bit alphabet).
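A minimal sketch of this pre-order serialization, writing bits into a string for clarity rather than to a real bit stream (the Node struct and function name are ours):

```cpp
#include <cassert>
#include <string>
using namespace std;

struct Node { char ch; Node *left, *right; };

// Pre-order serialization: 0 marks an internal node; 1 marks a leaf,
// followed by the 8-bit character value, as described above.
void writeTree(const Node *n, string &bits)
{
    if (n->left == nullptr && n->right == nullptr)
    {
        bits += '1';
        for (int j = 7; j >= 0; j--)              // most significant bit first
            bits += ((n->ch >> j) & 1) ? '1' : '0';
        return;
    }
    bits += '0';
    writeTree(n->left, bits);
    writeTree(n->right, bits);
}
```

The decompressor reverses this: on reading 0 it recurses for two children; on reading 1 it consumes 8 bits and creates a leaf.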
Many other techniques are possible as well. In any case, since the compressed data can include unused ”trailing bits” the decompressor must be able to determine when to stop producing output. This can be accomplished by either transmitting the length of the decompressed data along with the compression model or by defining a special code symbol to signify the end of input (the latter method can adversely affect code length optimality, however).
4.2 Creating the Huffman Table
To create a table or map of coded bit values for each character you’ll need to traverse the Huffman tree (e.g., inorder, preorder, etc.) making an entry in the table each time you reach a leaf. For example, if you reach a leaf that stores the character ’C’, following a path left-left-right-right-left, then an entry in the ’C’-th location of the map should be set to 00110. You’ll need to make a decision about how to store the bit patterns in the map. At least two methods are possible for implementing what could be a class/struct BitPattern:
• Use a string. This makes it easy to add a character (using +) to a string during tree traversal and makes it possible to use string as BitPattern. Your program may be slow because appending characters to a string (in creating the bit pattern) and accessing characters in a string (in writing 0’s or 1’s when compressing) is slower than the next approach.
• Alternatively, you can store an integer for the bitwise coding of a character. You need to store the length of the code too, to differentiate between 01001 and 00101. However, using an int restricts root-to-leaf paths to at most 32 edges, since an int holds 32 bits. In a pathological file, a Huffman tree could have a root-to-leaf path of over 100 edges. Because of this problem, you should use strings to store paths rather than ints. A slow correct program is better than a fast incorrect program.
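The string-based approach can be sketched as a recursive traversal that appends '0' going left and '1' going right, recording the path at each leaf (names are illustrative):

```cpp
#include <cassert>
#include <map>
#include <string>
using namespace std;

struct Node { char ch; Node *left, *right; };

// Build the code table with string bit patterns: the root-to-leaf
// path, spelled out in '0'/'1' characters, is the character's code.
void buildTable(const Node *n, const string &path, map<char, string> &table)
{
    if (n->left == nullptr && n->right == nullptr)
    {
        table[n->ch] = path;   // leaf: record the accumulated path
        return;
    }
    buildTable(n->left, path + "0", table);
    buildTable(n->right, path + "1", table);
}
```

Because the path is a string, there is no 32-edge depth limit, at the cost of the slower string appends noted above.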
4.3 Storing Sizes
The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file; all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you cannot be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left, since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this:

int bits;
while (input.readbits(1, bits)) {
    // process bits
}
To avoid this problem, you can write the size of a data structure before writing the data structure to the file.
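The padding problem and the size-prefix remedy can be illustrated with a small sketch (the Payload struct is ours, for illustration; a real implementation would write the count as a fixed-width binary integer in the file header):

```cpp
#include <cassert>
#include <string>
using namespace std;

// A bit string together with its true length, so a reader can
// ignore the zero bits appended to fill the last 8-bit chunk.
struct Payload { size_t nbits; string padded; };

// Pad to a whole number of 8-bit chunks, as a buffered writer would.
Payload pad(const string &bits)
{
    string p = bits;
    while (p.size() % 8 != 0)
        p += '0';
    return {bits.size(), p};
}

// Recover exactly the meaningful bits, discarding the padding.
string unpad(const Payload &pl)
{
    return pl.padded.substr(0, pl.nbits);
}
```

For the 61-bit example above, pad produces 64 stored bits, and the recorded length 61 tells the reader where to stop.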
Chapter 5 CONCLUSION AND FUTURE WORKS

Limitations

1. Huffman code is optimal only if the exact probability distribution of the source symbols is known.

2. Each symbol is encoded with an integer number of bits.

3. Huffman coding is not efficient at adapting to changing source statistics.

4. The length of the code of the least probable symbol could be too large to store in a single word or basic storage unit of a computing system.

Further enhancements

The Huffman coding we have considered is simple binary Huffman coding, but many variations of Huffman coding exist:

1. n-ary Huffman coding: The n-ary Huffman algorithm uses the {0, 1, ..., n-1} alphabet to encode messages and build an n-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary (n equals 2) codes, except that the n least probable symbols are taken together, instead of just the 2 least probable. Note that for n greater than 2, not all sets of source words can properly form an n-ary tree for Huffman coding. In this case, additional 0-probability place holders must be added. If the number of source words is congruent to 1 modulo n-1, then the set of source words will form a proper Huffman tree.

2. Adaptive Huffman coding: A variation called adaptive Huffman coding calculates the probabilities dynamically based on recent actual frequencies in the source string. This is somewhat related to the LZ family of algorithms.

3. Huffman template algorithm: Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only a way to order weights and to add them. The Huffman template algorithm enables one to use any kind of weights (costs, frequencies, etc.).

4. Length-limited Huffman coding: Length-limited Huffman coding is a variant where the goal is still to achieve a minimum weighted path length, but there is an additional restriction that the length of each codeword must be less than a given constant. The package-merge algorithm solves this problem with a simple greedy approach very similar to that used by Huffman's algorithm. Its time complexity is O(nL), where L is the maximum length of a codeword. No algorithm is known to solve this problem in linear or linearithmic time, unlike the presorted and unsorted conventional Huffman problems, respectively.
Bibliography
Appendices

Appendix A: Source Code

Listing 5.1: The definition of the class Charnode; each node of the Huffman tree is an object of this class.

#ifndef Charnode_h
#define Charnode_h

#include <iostream>

#define DEBUG 1
#if DEBUG
#define LOG(s) cout << s << endl;
#else
#define LOG(s)
#endif

using namespace std;

template <class TYPE>
class Charnode
{
    TYPE ch;
    int count;
    Charnode *left;
    Charnode *right;

public:
    Charnode(TYPE ch, int count = 0);
    Charnode(const Charnode *New);
    int GetCount();
    int Value();
    void SetLeft(Charnode *left);
    void SetRight(Charnode *right);
    Charnode *GetLeft(void);
    Charnode *GetRight(void);
    TYPE GetChar(void);
    void show();
    bool operator<(Charnode &obj2);
    void setChar(TYPE ch);
};

template <class TYPE>
Charnode<TYPE>::Charnode(TYPE ch, int count)
{
    LOG("new Charnode " << ch << " requested");
    this->ch = ch;
    this->count = count;
    this->left = this->right = NULL;
}

template <class TYPE>
Charnode<TYPE>::Charnode(const Charnode *New)
{
    LOG("new Charnode " << New->count << " requested");
    this->ch = New->ch;
    this->count = New->count;
    this->left = New->left;
    this->right = New->right;
}

template <class TYPE>
int Charnode<TYPE>::GetCount()
{
    return count;
}

template <class TYPE>
int Charnode<TYPE>::Value()
{
    return count;
}

template <class TYPE>
void Charnode<TYPE>::SetLeft(Charnode *left)
{
    this->left = left;
}

template <class TYPE>
void Charnode<TYPE>::SetRight(Charnode *right)
{
    this->right = right;
}

template <class TYPE>
Charnode<TYPE> *Charnode<TYPE>::GetLeft(void)
{
    return left;
}

template <class TYPE>
Charnode<TYPE> *Charnode<TYPE>::GetRight(void)
{
    return right;
}

template <class TYPE>
TYPE Charnode<TYPE>::GetChar(void)
{
    return ch;
}

template <class TYPE>
void Charnode<TYPE>::show()
{
    cout << ch << " : " << count << endl;
}

template <class TYPE>
bool Charnode<TYPE>::operator<(Charnode &obj2)
{
    return (count < obj2.GetCount());
}

template <class TYPE>
void Charnode<TYPE>::setChar(TYPE ch)
{
    this->ch = ch;
}

#endif
Listing Listi ng 5.2: The definition definition of the class Huffman this class helps in buildi building ng the huffman tree for an input file. #include
<
iostream>
#include ”Charnode . h” #include ” g l o b a l s . h” h” #include ” b i t o p s . h ”
vector >
#include
<
#include
<
#include
<
#ifndef
HuffmanCode h
#define
HuffmanCode h
map> fstream>
usi ng namesp namespace ace std ;
template
<
c l a s s TYPE>
c l a s s Huffman
{ private :
v e c t o r ∗> charactermap ; Charnode ∗ huffmanTreeRo ot ; map t a b l e ;
Dept ept. of CSE, SE, R V C E, Banga angallore. re.
Feb 2012 - May 2013
19
App endix A: Source Code
Data Compression Techniques
map f r e q t a b ;
private : void p r o c e s s f i l e ( const const char char ∗ f i l e n a m e , ma map & charmap );
v e c t o r ∗> convertToVector(map &chama chamap p);
bool compare(Charnode ∗ i , Ch Char arno node de ∗ j ) ;
void MinHeapif MinHeapify y ( vector ∗> & charactermap charactermap , i n t i , co n
void BuildMi BuildMinHea nHeap p ( vec tor ∗> &charactermap ) ;
void buildHuffmanTree ();
void delNode(Charnode ∗ ) ;
public :
Huffma Huffman n ();
const char char ∗ f i l e n a m e ) ; Huffman( const
˜Huffm ˜Huffman an ( ) ;
void createHuffmanTable(Charnode ∗ t r e e , i n t code , i n t h e i g h t )
void d i s p l a y C h a r a c t e r m a p ( ) ;
Dept ept. of CSE, SE, R V C E, Banga angallore. re.
Feb 2012 - May 2013
20
App endix A: Source Code
Data Compression Techniques
void d i s p l a y H u f fm fm a n T a b l e ( ) ;
Charnode ∗ g e t Ro Ro o t ( ) ;
map getHuffmanTab getHuffmanTable le () ;
map getFrequenc getFrequencyMap yMap () ;
i n t g e t C h a r V e c Si Si z e ( ) ;
};
template i n t Huffman :: getCharVecSize ()
{ return c h a r a c te te r m a p . s i z e ( ) ;
}
template
<
c l a s s TYPE>
void Huffman : : p r o c e s s f i l e ( const const char char ∗ f i l e n a m e , ma map & ch
{ i b st s t re r e a m i n f i l e ( f i l e na na m e ) ;
int i n b i t s ; while ( i n f i l e . re ad bi ts (BI (BITS PER WORD, i n b i ts ) != f a l s e )
{ / / c o u t << (TY (TYPE) in b i t s ; charmap [ (TY (TYPE) in b i t s ]++;
} Dept ept. of CSE, SE, R V C E, Banga angallore. re.
Feb 2012 - May 2013
21
App endix A: Source Code
Data Compression Techniques
LOG( ” \ n \ n \nEND\ n ” )
}
template
<
c l a s s TYPE>
v e c t o r ∗> Huffman :: convertToVector(map &cha
{ v e c t o r ∗> charactermap ;
f o r ( typename map : : i t e r a t o r
i i =c =cha ma p . b e g i n ( ) ; i i ! =c =cha m ap
{ / / c o u t << ( ∗ i i ) . f i r s t
<<
”: ”
<<
( ∗ i i ) . s e c o n d << e n d l ;
new Charnode (( ∗ i i ) . f i r s t , ( ∗ i i ) Charnode ∗ ch = new
charactermap . push back ( ch ) ; #i f DEBUG
//ch −>show show ( ) ; i f ( ch −>GetLeft()==NULL && ch−>GetRight()==NULL)
LOG( ” L e af a f No d e i n i t i a l i z e d p r o p er er l y ” ) ; #endif
}
return charactermap ;
}
template
<
c l a s s TYPE>
bool Huffman ::compare(Charnode ∗ i , Ch Char arno node de ∗ j )
{ return ( ∗ i <∗ j ) ;
}
Dept ept. of CSE, SE, R V C E, Banga angallore. re.
Feb 2012 - May 2013
22
App endix A: Source Code
template
<
Data Compression Techniques
c l a s s TYPE>
void Huffman :: MinHeapif MinHeapify y ( vecto r ∗> & charactermap , i n
{ int l e f t = 2∗ i + 1 ; int r ig ht = l e f t + 1 ; i n t s m a l l e s t = − 1;
i f ( l e f t Value() < charactermap [ i ]−> Value ( )
s m a ll l l e st st = l e f t ; else
smallest = i ; i f ( r i g h t Value() < c h a r a ct c t e r m a p [ s m a l l e s t ]−
smallest = right ;
i f ( s m a l l e s t ! = i )
{ Charnode ∗ temp temp = chara cterma p [ i ] ; charactermap [ i ] =charactermap [ sm al le st ] ; c h a r a c te t e r m a p [ s m a l l e s t ] = t em p ;
M i nH nH e ap ap i fy fy ( c h a ra r a c t er er m a p , s m a l l e s t , n ) ;
} }
template
<
c l a s s TYPE>
void Huffman :: BuildMin BuildMinHeap( Heap( vec tor ∗> &charactermap)
{ i n t n = c h a ra ra c te t e r ma ma p . s i z e ( ) ; f o r ( i n t i = n / 2; 2 ; i >=0 ; i −−)
M i nH nH e ap ap i fy fy ( c h a ra r a c t er er m a p , i , n ) ; Dept ept. of CSE, SE, R V C E, Banga angallore. re.
Feb 2012 - May 2013
23
App endix A: Source Code
Data Compression Techniques
}
template
<
c l a s s TYPE>
void Huffman :: buildHuffmanTree ()
{ LOG(
f u nc
);
v e c t o r ∗> c h a r a c t e r m a p = this −>charactermap ;
/ ∗ HUFFMAN ( C ) Ref er CLRS (non −u n ic i c od od e c h a r a c t e r s . )
∗ / i n t n = c h a ra ra c te t e r ma ma p . s i z e ( ) ;
LOG( ” S i z e o f t h e c h a r ma map = ”<
{ LOG( i <<” th th i t e r a t i o n ” ) BuildM BuildMinH inHeap( eap( charactermap ) ; new Charnode(ch ara cter map [ 0 ] ) Charnode ∗ l e f t = new
LOG( l e f t −>GetC GetCoun ountt ( ) ) ; charactermap . era se (charactermap . begin ()+0 ); BuildM BuildMinH inHeap( eap( charactermap ) ; new Charnode(char acterm ap [ 0 ] Charnode ∗ r i g h t = new
charactermap . era se (charactermap . begin ()+0 ); LOG( right −>GetC GetCoun ountt ( ) ) ;
new Charnode( ’ \ 0 ’ , l e f t −>Value() Charnode ∗ z = new
z−>S e t L e f t ( l e f t ) ; Dept ept. of CSE, SE, R V C E, Banga angallore. re.
Feb 2012 - May 2013
24
App endix A: Source Code
Data Compression Techniques
z−>S e t R i g h t ( r i g h t ) ;
LOG( z−>GetCount()) LOG( z−>GetLeft() −> GetC GetCoun ountt ( ) ) ; LOG( z−>GetRight() −> GetC GetCoun ountt ( ) ) ;
c h a r a c te te r m a p . p u s h b a c k ( z ) ;
}
huffmanTreeRoot huffmanTreeRoot = chara ctermap [ 0 ] ;
// I n i t i a l i
}
template
<
c l a s s TYPE>
Huffman ::Huffman()
{}
template
<
c l a s s TYPE>
const char char ∗ f i l e n a m e ) Huffman ::Huffman( const
{ map charmap ; p r o c e s s f i l e ( f i l en en a m e , charmap ) ; charactermap = convertToVector (charm (charmap ap ) ; f r e q t a b = c ha ha rm rm ap ap ;
buildHuffmanTree (); c r e a t e Hu H u f f m a n T a bl b l e ( h uf u f fm f m a nT nT re r e eR eR o ot ot , 0 , 0 ) ;
}
Dept ept. of CSE, SE, R V C E, Banga angallore. re.
Feb 2012 - May 2013
25
App endix A: Source Code
template
<
Data Compression Techniques
c l a s s TYPE>
void Huffman ::delNode(Charnode ∗ node)
{ i f ( no de == NUL NULL) return ;
delNode(node−>G e t Le Le f t ( ) ) ; delNode(node−>G e tR tR i gh gh t ( ) ) ;
delete node ;
}
template
<
c l a s s TYPE>
Huffman ::˜Huffman()
{ delNode( huffmanT huffmanTreeR reeRoot oot ) ; huffmanTreeRoot = NULL;
}
template
<
c l a s s TYPE>
void Huffman :: createHuffma nTable (Charno (Charnode de ∗ t r e e , i n t code , i n t
{ LOG(
f u nc
);
i f ( t r e e== e==NULL)
/ / T hi h i s c o d i t i o n n e ve v e r o c cu cu r s !
return ; i f ( tre e −>Ge tL ef t()==N t()==NULL && && t r e e −>GetRight()==NULL) / / L e a f N o d e :
{ / / c o u t <<” C h a r a c t e r ”<< t r e e −>GetChar() < < ’ \ t ’ << <<”Count = ”<< t / / c o u t <<”C o de de : ” ;
s t r i n g c o d eS e S t r i ng ng = ” ” ; Dept ept. of CSE, SE, R V C E, Banga angallore. re.
Feb 2012 - May 2013
26
App endix A: Source Code
Data Compression Techniques
f o r ( i n t j = h e i g h t − 1; j >=0; j −−)
{ i f ( c o d e & ( 1<< j ) )
{ / / c o u t < < ’1’; c o d e S t r i n g += ” 1 ” ;
} else
{ / / c o u t < < ’0’; c o d e S t r i n g += ” 0 ” ;
} } / / co ut<
t a b l e [ t r e e −>G et et Ch Ch ar ar ( ) ] = c o d e S t r i n g ;
return ;
} c o d e = c o d e <<1; c r e a t e Hu H u f f m a n T a bl b l e ( t r e e −>GetLe ft () , code , hei gh t +1); c r e a t e Hu H u f f m a n T a bl b l e ( t r e e −>GetRight () , code | 1 , heig ht +1); +1);
}
template
<
c l a s s TYPE>
void Huffman :: displa yCharacte rmap ()
{ LOG(
f u nc
);
i n t n = c h a ra ra c te t e r ma ma p . s i z e ( ) ; Dept ept. of CSE, SE, R V C E, Banga angallore. re.
Feb 2012 - May 2013
27
App endix A: Source Code
Data Compression Techniques
LOG(”Size = ”<
charactermap [ i]−> show show ( ) ; cout <
}
template
<
c l a s s TYPE>
Charnode ∗ Huffman : : g e t R oo oo t ( )
{ return huffmanTreeRo ot ;
}
template
<
c l a s s TYPE>
void Huffman :: displayHuffmanTable ()
{ LOG( LOG( ”HU ”HUFFMAN TABLE” TABLE” ) ; f o r ( typename map
>
: : i t e r a t o r i i =t = t a b l e . b e g in in ( ) ; i i != t a b
{ cout
<<
endl
<<
(∗ i i ) . f i r s t
<<
”\ t ”
<<
( ∗ i i ) . second ;
} cout
<<
endl ;
}
template <class TYPE>
map<TYPE, string> Huffman<TYPE>::getHuffmanTable()
{
    return table;
}
template <class TYPE>
map<TYPE, int> Huffman<TYPE>::getFrequencyMap()
{
    return freqtab;
}

#endif
Listing 5.3: The definition of the class CompressionWriting. This class helps in writing the bits to the compressed file.

#ifndef COMP_H
#define COMP_H

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <map>
#include "globals.h"
#include "bitops.h"
#include "Charnode.h"

using namespace std;
template <class TYPE>
class CompressionWriting
{
    map<TYPE, string> huffmanTable;
    Charnode<TYPE> *huffmanTreeRoot;
    string outputFilename;
    string inputFilename;
    map<TYPE, int> freqMap;

private:
    int convertStringToBitPattern(string str);
    int totalNumOfBits(void);

public:
    CompressionWriting() { }
    CompressionWriting(Charnode<TYPE> *root, map<TYPE, string> table,
                       string iname, string oname, map<TYPE, int> freMap);
    void writeCompressedDataToFile();
    void displayOutputFile();
    void writeHuffmanTreeBitPattern(Charnode<TYPE> *tree, obstream &outfile);
};

template <class TYPE>
CompressionWriting<TYPE>::CompressionWriting(Charnode<TYPE> *root,
        map<TYPE, string> table, string iname, string oname, map<TYPE, int> freMap)
{
    huffmanTreeRoot = root;
    huffmanTable = table;
    outputFilename = oname;
    inputFilename = iname;
    freqMap = freMap;
}
template <class TYPE>
void CompressionWriting<TYPE>::writeCompressedDataToFile()
{
    LOG("\nWriting Pattern:\n");
    ibstream infile(inputFilename.c_str());
    obstream outfile(outputFilename.c_str());

    outfile.writebits(BITS_PER_INT, freqMap.size());
    writeHuffmanTreeBitPattern(huffmanTreeRoot, outfile);
    // outfile.writebits(BITS_PER_INT, totalNumOfBits());

    // Writing compressed data
    int inbits;
    infile.rewind();
    while (infile.readbits(BITS_PER_WORD, inbits))
    {
        // cout << (TYPE) inbits << " = " << huffmanTable[(TYPE) inbits];
        int bitPattern = convertStringToBitPattern(huffmanTable[(TYPE) inbits]);
        // cout << " = " << bitPattern << endl;
        outfile.writebits(huffmanTable[(TYPE) inbits].length(), bitPattern);
    }
    outfile.flush_bits();
    infile.close();
    outfile.close();
}
template <class TYPE>
int CompressionWriting<TYPE>::totalNumOfBits()
{
    int count = 0;
    int n = freqMap.size();
    for (typename map<TYPE, int>::iterator ii = freqMap.begin(); ii != freqMap.end(); ii++)
    {
        // Length of each character code * num of times the character appears
        count += huffmanTable[(*ii).first].length() * (*ii).second;
    }
    LOG("Count = " << count << endl);
    return count;
}
template <class TYPE>
int CompressionWriting<TYPE>::convertStringToBitPattern(string str)
{
    int bitPattern = 0;
    int n = str.length();
    for (int i = 0; i < n; i++)
        bitPattern += (1 << (n - i - 1)) * (str[i] - '0');
    return bitPattern;
}
template <class TYPE>
void CompressionWriting<TYPE>::displayOutputFile()
{
    ibstream infile(outputFilename.c_str());
    ofstream outfile("xxx");
    cout << "\nDisplaying Output File: " << endl;
    int inbits;
    while (infile.readbits(1, inbits) != false)
    {
        cout << inbits;
        outfile << inbits;
    }
    outfile.close();
}
template <class TYPE>
void CompressionWriting<TYPE>::writeHuffmanTreeBitPattern(Charnode<TYPE> *node, obstream &outfile)
{
    if (node == NULL)
        return;
    if (node->GetLeft() == NULL && node->GetRight() == NULL)
    {
        outfile.writebits(1, 1);
        outfile.writebits(BITS_PER_WORD, node->GetChar());
    }
    else
    {
        outfile.writebits(1, 0);
        writeHuffmanTreeBitPattern(node->GetLeft(), outfile);
        writeHuffmanTreeBitPattern(node->GetRight(), outfile);
    }
}

#endif
Listing 5.4: The main program of the Huffman compression algorithm.

#include
#include <cstdio>
#include
#include
#include <cstring>
#include