Demonstration of Exact String Matching Algorithms using CUDA
Author List
Raymond Tay (Autodesk, formerly Linden Lab)
Summation
In this chapter, the author presents a demonstration application of three commonly used exact string matching algorithms implemented with NVIDIA CUDA technology: Brute-force, QuickSearch and Horspool. The author applies known CUDA techniques to implement, test and, where applicable, optimize them. The main challenge was mapping CUDA's threading and memory model onto algorithms that are normally designed to execute on a single CPU core. Through this effort, the author hopes to demonstrate the power of CUDA to the budding GPU developer.
Introduction, Problem Statement, and Context
String matching is a very important subject in the wider domain of text processing. String-matching algorithms are basic components used in implementations of practical software existing under most operating systems. String matching consists of finding one or more occurrences of a pattern in a body of text. All the algorithms in this work locate all occurrences of the pattern in the text body, aided by GPU acceleration. The algorithms were tested with patterns both shorter and longer than the alphabet size. The pattern is denoted by x = x[0..m-1], where m is its length; the text is denoted by y = y[0..n-1], where n is its length. The alphabet of the text and pattern is the set of all symbols used to represent strings; it is denoted by ∑ and its size by ∂ (e.g. the alphabet of a binary string is ∑={0,1}, with size ∂=2).
The author is aware of the wide applicability of string matching algorithms, ranging from text editors and the popular Unix tool grep to virus scanning technology and locating DNA sequences. The author believes that the techniques devised here can be leveraged by current mid-range workstations, as they normally come equipped with CUDA/OpenCL enabled graphics cards.
Core Method
The methods applied during development include the following:
1) Find ways to parallelize the sequential code
2) Minimize data transfer between the host and device
3) Coalesce global memory accesses as much as possible
4) Avoid branch divergence within a CUDA warp
For all algorithms, the work revolves around getting each CUDA thread to execute the scanning and locate a match; if it finds a match, the CUDA thread updates a data structure recording the position where the pattern was found. The data structures needed by the CUDA threads are provided by the CUDA kernel.
Algorithms, Implementations, and Evaluations
The sequential form of the brute-force algorithm consists of a function, BF (short for Brute Force), which attempts to match the pattern against the text by scanning the text from left to right. In the sequential code, a single thread conducts the search, and when it finds a match the algorithm prints the position at which it was found to the console. In the CUDA version, N threads conduct the same search. Each of the N threads scans for a match in the text, in parallel, and when it discovers a match it updates a data structure that stores the found indices. The source code for the sequential and parallelized (CUDA) versions is shown below for illustration purposes.
Illustration 1: Sequential Brute Force
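A minimal sketch of the sequential brute-force scan described above, written in C. The function name bf_search, its signature and the output array are illustrative assumptions and not the author's BF listing; in particular, matches are collected in an array here instead of being printed to the console.

```c
#include <stddef.h>
#include <string.h>

/* Brute-force search: slide the pattern x (length m) over the text y
 * (length n) one position at a time, left to right.
 * Stores up to max_out match positions in out[] and returns the total
 * number of matches found (which may exceed max_out). */
size_t bf_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out, size_t max_out)
{
    size_t count = 0;

    if (m == 0 || m > n)
        return 0;

    for (size_t j = 0; j + m <= n; ++j) {
        if (memcmp(x, y + j, m) == 0) {   /* window y[j..j+m-1] equals x */
            if (count < max_out)
                out[count] = j;
            ++count;
        }
    }
    return count;
}
```

Searching for "ab" in "abcabab" with this sketch reports three matches, at positions 0, 3 and 5.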
In the worst case, when occurrences of the pattern follow one another throughout the text, each CUDA thread reads every character of its window and obtains a match; this translates to (N*m) bytes of data being read. Each CUDA thread potentially writes at most n/m times (assuming occurrences of the pattern follow one after another), but in general the text and pattern could be completely random.
Illustration 2: CUDA Brute Force
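The one-thread-per-position idea can be sketched on the host in plain C, with a loop index standing in for the CUDA thread index (in a kernel this would typically be blockIdx.x * blockDim.x + threadIdx.x). The names bf_thread and bf_parallel_emulated, and the choice of a per-position flag array as the result structure, are assumptions made for illustration; the author's kernel may store found indices differently.

```c
#include <stddef.h>
#include <string.h>

/* Work of one logical thread: test the single window that starts at
 * text position tid and record the outcome in found[tid]. */
void bf_thread(size_t tid, const char *x, size_t m,
               const char *y, size_t n, unsigned char *found)
{
    if (m == 0 || tid + m > n) {   /* window would run past the text */
        found[tid] = 0;
        return;
    }
    found[tid] = (unsigned char)(memcmp(x, y + tid, m) == 0);
}

/* Host-side emulation of a launch with n logical threads.
 * found[] must hold n entries. Returns the number of matches. */
size_t bf_parallel_emulated(const char *x, size_t m,
                            const char *y, size_t n,
                            unsigned char *found)
{
    size_t count = 0;

    for (size_t tid = 0; tid < n; ++tid) {
        bf_thread(tid, x, m, y, n, found);
        count += found[tid];
    }
    return count;
}
```

Because each logical thread writes only to its own slot of found[], the iterations are independent of one another, which is the property that lets them be distributed over CUDA threads without synchronizing the writes.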
The sequential QuickSearch is a variant of the popular Boyer-Moore algorithm that does not suffer from sub-optimal performance when matching patterns drawn from small alphabets, such as DNA. In the classic QuickSearch, the inventor of the algorithm dropped the “good suffix shift” (aka “matching shift”) computation in favour of the “bad-character shift” (aka “occurrence shift”) computation. The algorithm precomputes the “bad-character shift” for the pattern and then uses the result to aid its search for the pattern in the text body. In the CUDA version, the classic QuickSearch has been reorganized so that the “bad-character shift” computation is parallelized. For the scanning code, a “skipping distance” data structure is precomputed for use by the CUDA kernel; it is a 1D array containing the skipping distances regardless of a match or mismatch, in which each valid element is a CUDA thread's id. In the CUDA kernel, a thread executes the scanning code only if it can locate its id in this “skipping distance” data structure.
The source code for the sequential and CUDA versions of QuickSearch is presented below:
Illustration 3: Sequential QuickSearch
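A minimal C sketch of the textbook Quick Search scan, using only the bad-character shift as described above. The names qs_build_shift and qs_search, the 256-symbol byte alphabet and the output array are assumptions for illustration and are not taken from the author's listing.

```c
#include <stddef.h>
#include <string.h>

#define QS_ALPHABET 256   /* assumption: one byte per symbol */

/* Bad-character table: shift[c] is how far the window moves when the
 * text character just past the window is c. */
void qs_build_shift(const char *x, size_t m, size_t shift[QS_ALPHABET])
{
    for (size_t c = 0; c < QS_ALPHABET; ++c)
        shift[c] = m + 1;                       /* c does not occur in x */
    for (size_t i = 0; i < m; ++i)
        shift[(unsigned char)x[i]] = m - i;     /* rightmost occurrence wins */
}

/* Stores up to max_out match positions in out[]; returns the match count. */
size_t qs_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out, size_t max_out)
{
    size_t shift[QS_ALPHABET];
    size_t count = 0, j = 0;

    if (m == 0 || m > n)
        return 0;
    qs_build_shift(x, m, shift);

    while (j + m <= n) {
        if (memcmp(x, y + j, m) == 0) {
            if (count < max_out)
                out[count] = j;
            ++count;
        }
        if (j + m >= n)                         /* no character after window */
            break;
        j += shift[(unsigned char)y[j + m]];    /* same shift on match or mismatch */
    }
    return count;
}
```

Note that the shift is taken from the character immediately after the current window, so the next window position does not depend on whether the current window matched.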
Illustration 4: CUDA QuickSearch
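One possible reading of the “skipping distance” scheme, emulated on the host in C: the window start positions that Quick Search would visit are marked first, and each logical thread then scans only if its id is marked. The names qs_mark_candidates and qs_scan_emulated, the flag-array representation and the 256-symbol alphabet are assumptions; the author's data structure and kernel may be organized differently.

```c
#include <stddef.h>
#include <string.h>

/* Phase 1: follow the Quick Search shifts through the text and mark every
 * window start that would be visited. candidate[] must hold n entries.
 * Returns the number of marked positions. */
size_t qs_mark_candidates(const char *x, size_t m, const char *y, size_t n,
                          unsigned char *candidate)
{
    size_t shift[256];   /* assumption: one byte per symbol */
    size_t j = 0, marked = 0;

    memset(candidate, 0, n);
    if (m == 0 || m > n)
        return 0;

    for (size_t c = 0; c < 256; ++c)
        shift[c] = m + 1;
    for (size_t i = 0; i < m; ++i)
        shift[(unsigned char)x[i]] = m - i;

    while (j + m <= n) {
        candidate[j] = 1;
        ++marked;
        if (j + m >= n)
            break;
        j += shift[(unsigned char)y[j + m]];
    }
    return marked;
}

/* Phase 2: each logical thread tid compares its window only if tid was
 * marked in phase 1. found[] must hold n entries. Returns the match count. */
size_t qs_scan_emulated(const char *x, size_t m, const char *y, size_t n,
                        const unsigned char *candidate, unsigned char *found)
{
    size_t count = 0;

    for (size_t tid = 0; tid < n; ++tid) {   /* tid stands in for a thread id */
        found[tid] = 0;
        if (candidate[tid] && tid + m <= n && memcmp(x, y + tid, m) == 0) {
            found[tid] = 1;
            ++count;
        }
    }
    return count;
}
```

In this sketch the marking phase follows the chain of shifts one position after another, while the comparison phase consists of independent iterations that could each be assigned to a CUDA thread.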
In the classic Horspool algorithm, the implementation favours the use of the bad-character shift computation alone, and it is not very efficient when the pattern is shorter than the alphabet, i.e. m < ∂. The “bad-character shift” computation is the same as the one shown in the sequential QuickSearch. In the CUDA version, the approach the author has taken is very similar to that of the CUDA QuickSearch: the “bad-character shift” computation is parallelized, and the “skipping distance” data structure (a 1D array containing the skipping distances regardless of a match or mismatch, in which each valid element is a CUDA thread's id) is precomputed for use by the CUDA kernel. In the CUDA kernel, a thread executes the scanning code only if it can locate its id in the “skipping distance” data structure. The source code for the sequential and CUDA Horspool is shown below:
Illustration 5: Sequential Horspool
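A minimal C sketch of the textbook Horspool scan. The names hp_build_shift and hp_search, the 256-symbol byte alphabet and the output array are assumptions for illustration, not the author's listing. The table here is the standard Horspool one, which is indexed by the last character of the current window and is built from the first m-1 pattern characters.

```c
#include <stddef.h>
#include <string.h>

#define HP_ALPHABET 256   /* assumption: one byte per symbol */

/* Bad-character table indexed by the last character of the window. */
void hp_build_shift(const char *x, size_t m, size_t shift[HP_ALPHABET])
{
    for (size_t c = 0; c < HP_ALPHABET; ++c)
        shift[c] = m;                               /* c not in x[0..m-2] */
    for (size_t i = 0; i + 1 < m; ++i)
        shift[(unsigned char)x[i]] = m - 1 - i;     /* always at least 1 */
}

/* Stores up to max_out match positions in out[]; returns the match count. */
size_t hp_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out, size_t max_out)
{
    size_t shift[HP_ALPHABET];
    size_t count = 0, j = 0;

    if (m == 0 || m > n)
        return 0;
    hp_build_shift(x, m, shift);

    while (j + m <= n) {
        unsigned char last = (unsigned char)y[j + m - 1];

        /* Compare the last character first, then the rest of the window. */
        if ((unsigned char)x[m - 1] == last && memcmp(x, y + j, m - 1) == 0) {
            if (count < max_out)
                out[count] = j;
            ++count;
        }
        j += shift[last];
    }
    return count;
}
```

Unlike the Quick Search sketch, the shift character lies inside the current window, so no read past the end of the text can occur.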
Illustration 6: CUDA Horspool
The author benchmarked the three sequential algorithms and their CUDA equivalents, applying some, but not all, CUDA techniques and technology. Each test was run for 100 iterations and the average was taken. The tests were run on a 32-bit Ubuntu OS with an NVIDIA GTX 480 card, an 8-core Intel i7 CPU and 6GB of system RAM. Two sorts of tests were conducted: (1) the pattern was shorter than the alphabet size, and (2) the pattern was longer than the alphabet size. One observation from the tests is that the speedup factor of the CUDA code over the sequential code ranges from 31 to 106. Another observation is that the CUDA versions of the code do exhibit branch divergence and bank conflicts, and this behavior is highly dependent on the pattern and the text involved. Here is the summary:
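| Algorithm | Type | Optimization | Search runtime (milliseconds) | GPU effective bandwidth (GBps) | Speedup factor |
|---|---|---|---|---|---|
| brute-force | SEQ | -O2 | 24 | N/A | - |
| brute-force | CUDA | None | 0.24 | 11.9 | 100 |
| brute-force | CUDA | Shared memory | 0.24 | 11.9 | |
| brute-force | CUDA | Pagelocked memory | 0.41 | 7.1 | 59 |
| QuickSearch | SEQ | -O2 | 16 | N/A | - |
| QuickSearch | CUDA | None | 0.18 | 15.87 | 88 |
| QuickSearch | CUDA | Shared memory | 0.15 | 19.77 | 106 |
| Horspool | SEQ | -O2 | 16 | N/A | - |
| Horspool | CUDA | None | 0.19 | 15.62 | 84 |
| Horspool | CUDA | Shared memory | 0.16 | 18.55 | 100 |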
Table 1: Test results for pattern shorter than alphabet size
| Algorithm | Type | Optimization | Search runtime (milliseconds) | GPU effective bandwidth (GBps) | Speedup factor |
|---|---|---|---|---|---|
| brute-force | SEQ | -O2 | 21.2 | N/A | - |
| brute-force | CUDA | Shared memory | 0.55 | 5.35 | 38 |
| QuickSearch | SEQ | -O2 | 17.2 | N/A | - |
| QuickSearch | CUDA | Shared memory | 0.47 | 6.29 | 36 |
| Horspool | SEQ | -O2 | 16.8 | N/A | - |
| Horspool | CUDA | Shared memory | 0.53 | 5.56 | 31 |
Table 2: Test results when pattern is longer than the size of the alphabet
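The two derived columns in the tables can be related to the measured run times with a small C sketch. The speedup factor is taken here as the sequential run time divided by the CUDA run time, and effective bandwidth follows the definition in the NVIDIA CUDA Best Practices Guide, (bytes read + bytes written) / 10^9 / seconds. The function names are illustrative, and the chapter does not state the byte counts behind its bandwidth figures, so any byte values passed in are the caller's assumption.

```c
/* Speedup factor: sequential run time over CUDA run time.
 * Both times must be in the same unit (milliseconds here). */
double speedup_factor(double seq_ms, double cuda_ms)
{
    return seq_ms / cuda_ms;
}

/* Effective bandwidth in GB/s for a kernel that read bytes_read and
 * wrote bytes_written bytes in elapsed_ms milliseconds. */
double effective_bandwidth_gbps(double bytes_read, double bytes_written,
                                double elapsed_ms)
{
    double seconds = elapsed_ms / 1e3;
    return (bytes_read + bytes_written) / 1e9 / seconds;
}
```

For instance, a sequential time of 24 ms against a CUDA time of 0.24 ms gives a speedup of about 100, in line with the brute-force entry for the short-pattern test.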
The author believes that performance gains would be better if the implementation used asynchronous concurrent execution, since multiple kernels executing concurrently could improve the run times. The author found that optimizations beyond -O2 for the sequential algorithms did not seem to affect the overall run times. The author's initial experimentation with page-locked/zero-copy memory was not encouraging, as effective bandwidth lagged significantly on the Linux operating system; the author cannot offer an explanation for this at this point in time. The author hoped to implement a multi-GPU solution, but owing to a lack of resources it cannot be pursued in the near future, though the author would get a big kick out of it!
• David Kirk and Wen-mei Hwu, Programming Massively Parallel Processors, first edition, 2010.
• AHO, A.V., 1990, Algorithms for finding patterns in strings, in Handbook of Theoretical Computer Science, Volume A, Algorithms and complexity, J. van Leeuwen ed., Chapter 5, pp 255-300, Elsevier, Amsterdam.
• HORSPOOL R.N., 1980, Practical fast searching in strings, Software - Practice & Experience, 10(6):501-506.
• SUNDAY D.M., 1990, A very fast substring search algorithm, Communications of the ACM, 33(8):132-142.
• Quick Search Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• Horspool Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• Brute-force Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• NVIDIA CUDA Programming Guide 3.0
• NVIDIA CUDA Reference Manual 3.0
• NVIDIA CUDA Best Practices Guide