International Journal of Computational Intelligence and Information Security December 2014 Vol. 5, No. 9 ISSN: 1837-7823
High Performance Network Intrusion Detection Model Using Graph Databases

D.P.Jeyepalan1 and E.Kirubakaran2
1 Research Scholar, School of Computer Science, Engineering and Applications, Bharathidasan University, Tiruchirappalli, Tamilnadu, India.
2 SSTP (Systems), Bharat Heavy Electricals Ltd, Tiruchirappalli, India.

Abstract
Intrusion detection is a dynamic scenario that constantly requires different mechanisms in order to compensate for ever changing intrusions. Though various methods are available for detecting intrusions, they tend to become outdated very quickly. Hence this scenario calls for a flexible system that adapts itself to the intrusions it faces. Further, a new challenge called Big Data is on the rise, due to the increase in the amount of data being generated and the inability of traditional data mining systems to cope with the volume and the velocity of the data. The current study presents systems that can be used to perform effective intrusion detection in the dynamic environment of the current context, and also discusses the future enhancements that can be carried out to make such a system flexible and effective.

Keywords: Intrusion Detection; Graph Database; Hadoop; Significant Terms Aggregation

1. Introduction
Intrusion is the attempted or successful illegal access to a computer system or a network. Intrusions have been attempted (and have sometimes succeeded) ever since the first computers were created, and attempts to evade or stop these attacks have been in place for just as long [1]. The evolution has been mutual: as systems evolve to fight back, the attacks return in more powerful forms to compromise them. This cycle has repeated throughout the evolution of computers [2][3]. The intensity and the number of attacks that an adversary can carry out have increased manifold due to the availability of botnets and similar facilities. Moreover, these facilities are available online and are lent for specific periods, which increases the complexity of creating an Intrusion Detection System (IDS). Until recently, attacks and intrusion attempts were carried out on a one-to-one basis; they are now available as a service to anyone (even naïve users) interested in performing an intrusion, and there is no specified limit to the level of attack, since these services are created and lent to users on an hourly basis [13][14]. In the current generation of high performance computing, the attacks that we witness have increased tremendously in terms of volume, velocity and variety (variations in attack). Even though the data returned from a network is semi-structured, its volume and the speed at which it must be processed (velocity) make intrusion detection a Big Data problem [25]. Hence, by the basic definition of Big Data, intrusion detection in the current scenario cannot be handled by traditional algorithms [26].
This paper discusses conventional IDS, the current intrusion scenario in networks and why conventional methods tend to fail in it, and the mechanisms that can be used for intrusion detection together with their pros and cons. It finally proposes a graph data structure and a graph based intrusion detection model that facilitates real time detection in networks. The remainder of this paper is structured as follows: Section 2 presents the related works dealing with conventional IDS methods and parallel computation based IDS; Section 3 presents the current demands of an IDS, discusses the probable architectures that can meet them along with their pros and cons, and concludes from this analysis that Graph Databases work best for intrusion detection; Section 4 concludes the study.
2. Related Works
2.1. Conventional IDS Methods
Intrusion detection has long been facilitated and improved by the use of mining algorithms and statistical analysis tools. Clustering and classification algorithms have been the major contributors to Intrusion Detection Systems [4][5]. Further, the field of data mining has provided various prediction algorithms that facilitate early detection of intrusions. The applications for intrusion detection were developed as single threaded applications utilizing the capacity of a single CPU [6].

A graph based clustering method was proposed in [20]. A graph representing the entire network is created, and the nodes are clustered according to the transmission history of each node. An added advantage of this method is that it does not bind the algorithm to a fixed number of clusters or to clusters of a defined shape. An outlier detection method is then used to sort out the outliers. The foremost drawback of systems in this category is the large amount of time taken to complete the detection process. Further, these algorithms are not capable of handling dynamic datasets.

In order to reduce time and provide scalability, hybridized algorithms and metaheuristic based algorithms came into existence. A game theoretic intrusion detection system that also identifies the candidate nodes for IDS deployment is proposed in [21]. Candidate identification is carried out by clustering the network and determining the cluster heads, which is performed by the Ant Colony Clustering algorithm. Similar strategies have been discussed in [23][24]. All these methodologies are single threaded. They were designed in earlier stages and hence do not leverage the parallel computation capabilities of current processors.

2.2. Parallel Computation based IDS
The next line of work therefore seeks speedup by exploiting improved hardware architectures. This became possible due to the increase in the processing capabilities of CPUs and the introduction of GPUs [7-9]. The combined use of CPUs and GPUs was widely believed to enhance processing abilities [10]. The use of the parallel processing nature of GPUs to perform packet inspection is proposed in [8]; it leverages the ability of GPUs to process data faster in parallel. A similar but enhanced method is proposed in [9], which performs stream categorization and intrusion detection on a GPU using CUDA. The feasibility of using a GPU for intrusion detection is discussed in [7]. It highlights the limits of GPUs in performing concurrent and asynchronous processes for detecting intrusions, and concludes that several modifications to graphics cards are required for efficient usage. A probability distribution based pattern recognition method is presented in [10]. This method detects network anomalies by analyzing measures such as the probability distribution and the relative and normalized relative entropies. The authors of [10] also compare a serial algorithm on the CPU with its parallel variants in GPU and MapReduce environments. A similar approach that uses an agent based IDS built on Ant Colony Optimization (ACO) is described in [22]. ACO, being an intrinsically parallel algorithm, performs well in parallel environments. Experiments were carried out using multi core CPUs and many core GPUs, and the results show an increase in performance when compared to single threaded applications. But even this scheme has its own limitations, governed by Amdahl's Law, Gustafson's Law and Moore's Law.
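The two speedup laws named above can be made concrete with a short calculation. The following Python sketch uses the usual textbook forms of Amdahl's Law (fixed problem size) and Gustafson's Law (problem size scaled with the processor count); the function names and the 90% parallel fraction in the usage loop are illustrative assumptions and do not come from the cited experiments.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Amdahl's Law: speedup for a fixed problem size."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on Amdahl speedup as the processor count grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)


def gustafson_speedup(parallel_fraction: float, processors: int) -> float:
    """Gustafson's Law: scaled speedup when the workload grows with the processors."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * processors


if __name__ == "__main__":
    # Hypothetical workload in which 90% of the detection work is parallelisable.
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.9, n), 2), round(gustafson_speedup(0.9, n), 2))
```

Under Amdahl's form the speedup saturates at the reciprocal of the serial fraction no matter how many processors are added, whereas Gustafson's form keeps growing only because the problem itself is assumed to grow with the hardware.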
While the maximum speedup that can be achieved for a given problem is limited by its sequential part (as stated by Amdahl's Law [17]), the speedup is also limited by the speed of the storage device in which the required data is stored, which has not kept pace with the growth in processing capability described by Moore's Law [15]. By Gustafson's Law [16], further speedup can then be achieved only by increasing the number of processors, whose feasibility is in question. Further, the limited memory capacity of GPUs (1GB to 12GB) becomes a bottleneck when large datasets are to be used. Even though this problem can be overcome by using secondary storage, the processing speed will then be limited by the data transfer time. Hence, even if patterns of intrusion/anomaly were found, they could not be stored in an in-memory data structure so that the entire dataset could be searched for similar patterns. The fast constant memory of a GPU is limited to 64KB and cannot hold all the patterns, while storing the data in device memory affects the performance considerably.

3. Our Approach
3.1. Current IDS Requirement
In the current network scenario, the IDS should be capable of handling a huge amount of traffic (scalable). The algorithm being implemented should be capable of incorporating dynamic data (high and unpredictable network traffic) and of providing efficient results in the shortest possible time. An online intrusion detection system is recommended. Online detection provides the detection results on the go, unlike traditional systems, which take time to perform the prediction. This is implemented using a combination of statistical, machine learning and data mining techniques.

3.2. Recommended Architectures for IDS: Pros and Cons
The environment in which applications are run also plays a major role in their performance. As discussed earlier, the maximum performance that can be extracted from a single piece of hardware is limited; hence distributed architectures have been proposed to improve the performance of the system. One such architecture is the Hadoop ecosystem.

3.2.1. Hadoop: MapReduce
Hadoop is a reliable, scalable and distributed computing architecture. The Hadoop Distributed File System (HDFS) manages the storage, Hadoop YARN provides a framework for job scheduling and cluster resource management, and MapReduce provides parallel processing of the data stored in HDFS. Though Hadoop can store and process a large amount of data, it is not free of intermediate reads and writes to secondary storage. Writing the intermediate results to disk introduces considerable latency, which makes it less effective in the current scenario. A scale-up solution with a large memory could therefore be preferred for the current application over a cluster of commodity machines. Since the dataset size is of the order of GBs and not TBs, it is better to adopt an in-memory approach, as made possible in Hadoop Version 2 through Spark and its Resilient Distributed Datasets (RDDs). Spark is a fast and general computation engine for Hadoop data. Its advantage is a simple and expressive programming model that supports a wide range of applications, such as ETL, machine learning, stream processing and graph computation.

The major drawback of the Hadoop ecosystem is its orientation towards batch processing. Intrusion detection requires real time predictions, which involve interactive processing. Hence the use of Hadoop for online intrusion detection is not recommended.

3.2.2. Significant Terms Aggregation
Significant terms aggregation is the process of analyzing a dataset to determine interesting or unusual occurrences of a particular data item in an itemset. This method analyzes the patterns that stand out from the background; in other words, it identifies the anomalies that do not fit into the regular pattern of the data under study. These terms are not absolute anomalies. They are more common in the results of certain types of queries, while their presence in the entire dataset is scarce. They are generally termed the "uncommonly common" data items.

This method considers two categories of sets: the foreground set and the background set. The foreground set contains the items of interest, and the background set is the base set used for statistical comparison. The background set contains the entire data, while the foreground set contains specific data about the items of interest. The method works on the basic principle that the most commonly occurring term might not be the actual term of interest. For example, the most frequently occurring term in any document would be 'the', but it is hardly significant.

An example that analyzes the uncommonly common words in a document set is shown in Figures 1 and 2. In Figure 1, the x axis is the % of documents containing the word and the y axis is the % of documents containing the word in a random sample of documents. The words fall along a diagonal connecting the bottom left and the top right points, i.e., (0,0) and (100,100). The words in the top right corner are the frequently occurring words. These words are of least help in the analysis, as they are contained in all the documents, while the words in the bottom left correspond to the rare occurrences. They are the items of interest.

In order to determine the true interestingness of a word, the diagonal is overlaid with the x axis as the % of documents containing the word and the y axis as the % of search results containing the word (Figure 2). The overlaid graph displays the uncommonly common data towards the top left, which is categorized as the area of interest.
Figure 1: Sample graph depicting common words (Diagonal Construction)
[Plot not reproduced. x axis: % of documents containing the word; y axis: % of random document samples containing the word. Words shown: the, a, but, cluster, bigdata, hadoop.]

Figure 2: Sample graph depicting uncommonly common words
[Plot not reproduced. x axis: % of documents containing the word; y axis: % of search results containing the word. Region label: Area of uncommonly common terms. Words shown: hadoop, bigdata, cluster, the, a, but.]
The above description deals with the standard example of text mining. It can be directly mapped to our scenario of intrusion detection by considering every transaction as a record and by plotting the aggregated transaction results on the graph. Since intrusions are uncommon, they do not explicitly reveal themselves when analyzed together with the normal transactions, i.e., the background data. But when overlaid on the transactions containing the intrusion records alone, they can be observed in the top left corner, which is the area of uncommonly common data items.
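The foreground/background comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the scoring used by any particular search engine: the score (absolute change multiplied by relative change of a term's share), the `property=value` token format and the sample transactions are all assumptions made for the example.

```python
from collections import Counter


def term_shares(records):
    """Fraction of records in which each term occurs at least once."""
    counts = Counter(term for record in records for term in set(record))
    return {term: n / len(records) for term, n in counts.items()}


def significant_terms(foreground, background):
    """Rank terms that are over-represented in the foreground set.

    Assumed score: (fg_share - bg_share) * (fg_share / bg_share).
    Terms that are equally or less common in the foreground are dropped.
    """
    fg, bg = term_shares(foreground), term_shares(background)
    scores = {}
    for term, fg_share in fg.items():
        bg_share = bg.get(term, 0.0)
        if bg_share == 0.0 or fg_share <= bg_share:
            continue
        scores[term] = (fg_share - bg_share) * (fg_share / bg_share)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def sample_transactions():
    """Hypothetical transactions; each one is a set of property=value tokens."""
    normal = {"protocol=tcp", "service=http", "flag=SF"}
    suspicious = {"protocol=tcp", "service=private", "flag=REJ"}
    background = [normal] * 8 + [suspicious] * 2   # the entire data
    foreground = [suspicious] * 2                  # the records of interest
    return foreground, background


if __name__ == "__main__":
    fg, bg = sample_transactions()
    for term, score in significant_terms(fg, bg):
        print(term, round(score, 2))
```

In this toy data `protocol=tcp` is the most frequent token overall yet receives no score, which mirrors the observation that the most common term (like 'the' in text) is rarely the term of interest.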
3.2.3. Graph Databases
A graph database uses nodes and edges to store and represent data. The advantage of such a structure is that it provides index free adjacency. Every node holds direct references to its adjacent nodes, so relationships are followed by direct traversal rather than by index based lookups. A further advantage of a graph database is that it provides a partition free solution. Using the data on Hadoop will inevitably lead to partitioning the data between systems. This leads to inaccurate solutions when performing operations such as clustering or classification, which require the availability of the entire data rather than a subset of it.

Graph Databases were found to be a suitable structure for IDS applications because of the very nature of the problem, and because graph databases work well in scale up options while being less suitable for scale out environments. Graph DBs use the entire graph data for processing, and hence allow patterns of intrusion/anomaly to be matched against an entire dataset of millions of nodes typically in the order of milliseconds, whereas other mechanisms may take seconds or even minutes depending on the dataset size. Graph DBs are optimized for such pattern matching, and anomaly detection is one classic example of it.

The current modeling technique uses the NSL-KDD dataset for modeling graphs. The KDD CUP 99 [19] dataset has been one of the most widely used benchmarks for anomaly detection (classification). It contains raw TCP dump data covering nine weeks of observations, obtained from Lincoln Labs. It contains four categories of attacks, namely DOS, R2L, U2R and probing, comprising 24 training attack types. The KDD CUP 99 dataset was analyzed and two major drawbacks were observed [18][27]. The dataset contains a very large number of records with a high redundancy rate. It has been observed that about 78% of the records are duplicated in the training set and about 75% of the records are duplicated in the test set.
This leads to a bias in the learning algorithm towards the most frequently occurring data, while infrequent data tends to get neglected. Hence evaluation results tend to get biased in this process. The NSL-KDD dataset has evolved to overcome these issues and provides a functional dataset for researchers. The shortcomings of the KDD dataset that NSL-KDD overcomes are:
• Redundancy in the training set and the test set is eliminated.
• The count of the selected records from each difficulty level is inversely proportional to the percentage of records in the original data set.
• The number of records in the training and the test sets is reduced.
These justify the use of the NSL-KDD dataset for our study.

3.3. Proposed IDS Architecture
Extensive research has been carried out in the area of intrusion detection on the available network data using standard algorithms, modified forms of those algorithms or hybridized forms of them. But the use of a different data structure has not yet been experimented with. Hence the authors propose a graph database that can be used for processing intrusion data. The authors have modeled the problem of intrusion detection as a Neo4j property graph. A property graph is one that has properties associated with its vertices or edges. The graph is created by considering every transmission as a node and by taking the transaction properties as the properties of the node. Querying is performed naturally by means of the Cypher query language that comes as part of the Neo4j Graph DB.

3.3.1. Graph Model
The initial phase of the detection process deals with grouping the nodes depending on their properties. Similar nodes are clustered such that one cluster contains normal transmission data, while each of the other clusters contains data pertaining to a specific attack. Since we use a graph database, these tasks are accomplished through queries and the results can be directly visualized. The query language being used varies depending on the graph database. Since we use the KDD CUP 99 dataset, the training data is already classified; hence the attack property alone is sufficient for grouping the data. In the case of datasets without labels, clustering can be performed by considering all the properties (numerical and nominal; strings are ignored), measuring their distance with respect to each other and grouping the records whose properties are nearest to each other.
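The two grouping modes described above can be sketched in plain Python. This is an illustrative stand-in and not the authors' Cypher queries: the mixed distance (range-normalised difference for numerical properties, 0/1 mismatch for nominal ones), the threshold based "leader" grouping and the sample values are assumptions; only the property names follow the KDD CUP 99 feature names.

```python
from collections import defaultdict

NUMERIC = ("duration", "src_bytes")        # numerical properties used in this sketch
NOMINAL = ("protocol_type", "service")     # nominal properties used in this sketch


def group_by_label(records, label_key="attack"):
    """Labeled data: the attack property alone decides the cluster."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record[label_key]].append(record)
    return dict(clusters)


def numeric_ranges(records):
    """Value range of each numerical property (1.0 if all values are equal)."""
    return {
        key: (max(r[key] for r in records) - min(r[key] for r in records)) or 1.0
        for key in NUMERIC
    }


def mixed_distance(a, b, ranges):
    """Average per-property difference over numerical and nominal properties."""
    total = sum(abs(a[key] - b[key]) / ranges[key] for key in NUMERIC)
    total += sum(0.0 if a[key] == b[key] else 1.0 for key in NOMINAL)
    return total / (len(NUMERIC) + len(NOMINAL))


def leader_clustering(records, threshold):
    """Unlabeled data: join the first cluster whose leader is near enough."""
    ranges = numeric_ranges(records)
    clusters = []                          # each cluster is a list; item 0 is its leader
    for record in records:
        for cluster in clusters:
            if mixed_distance(record, cluster[0], ranges) <= threshold:
                cluster.append(record)
                break
        else:
            clusters.append([record])      # no cluster was close enough
    return clusters


SAMPLE = [  # hypothetical records
    {"duration": 0, "src_bytes": 200, "protocol_type": "tcp", "service": "http", "attack": "normal"},
    {"duration": 0, "src_bytes": 250, "protocol_type": "tcp", "service": "http", "attack": "normal"},
    {"duration": 2, "src_bytes": 54000, "protocol_type": "tcp", "service": "http", "attack": "back"},
    {"duration": 3, "src_bytes": 54500, "protocol_type": "tcp", "service": "http", "attack": "back"},
]
```

Calling `group_by_label(SAMPLE)` uses the labels directly, while `leader_clustering(SAMPLE, threshold=0.2)` ignores them and groups by distance alone; the threshold is a tuning parameter that the source does not specify.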
Figure 3: Sample Process Graph
[Graph not reproduced. Transmission nodes (Transmission 1 to Transmission 4) are connected by "is of Protocol Type", "is of Service Type" and "is of Attack Type" edges to property nodes. Property nodes shown: Protocol (TCP, UDP); Service (DTH, REJ, RSTO, RSTR, SF, SO, SH); Attack (back, ftp write, imap, perl).]
The advantage of using a graph database is that interactive querying is made possible, in contrast to the static nature of the existing approaches to intrusion detection, especially those based on data mining and machine learning. Hence if new data is to be added to the system, all we need to do is run the query that is used for cluster creation, and the system gets updated with the new data.

Classification deals with binding an element to a category by analyzing the similarities. This is the process that is to be carried out as soon as a transmission arrives. It is based on all the properties present in the node. The properties of the current transmission are compared to the aggregated properties of each cluster. The cluster with the minimum difference in properties is selected as the destination cluster, and its label is added to the current transaction. This label can be used to mark the transaction as legitimate or anomalous. The typical running time for a query in graph databases is of the order of milliseconds. Thus the database updating process is fast and results are obtained in real time.

Another major advantage of this approach is that it facilitates online prediction. As soon as a transaction is recorded, it can be added as a node in the graph, and a query can be run to assign it to the cluster it belongs to. This query provides the result of whether the transaction is normal or anomalous.

Figure 3 shows a simple process graph for the KDD Cup 99 dataset. It can be observed that all the transmissions and transmission properties are depicted as nodes. The edges correspond to the properties pertaining to specific transmissions. Every transmission node is connected to its corresponding property.
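The process graph and the classification step described in this section can be sketched together in memory. The Python below is a simplified stand-in and not the authors' Neo4j model or Cypher queries: the class and function names, the use of the mean as the aggregated cluster property, the sum of absolute differences as the "difference in properties" and the sample values are all assumptions, and only numerical properties are compared.

```python
from collections import defaultdict


class ProcessGraph:
    """In-memory stand-in for a process graph like Figure 3 (hypothetical, not Neo4j)."""

    def __init__(self):
        self.numeric = {}                    # transmission id -> numerical properties
        self.out_edges = defaultdict(list)   # transmission id -> [(edge label, property vertex)]
        self.in_edges = defaultdict(list)    # property vertex -> [transmission id]

    def add_transmission(self, node_id, numeric, nominal):
        """Add a transmission node and link it to one shared vertex per nominal value."""
        self.numeric[node_id] = dict(numeric)
        for prop, value in nominal.items():
            vertex = (prop, value)
            self.out_edges[node_id].append((f"is_of_{prop}_type", vertex))
            self.in_edges[vertex].append(node_id)

    def attack_types(self):
        """All attack vertices currently present in the graph."""
        return sorted(value for prop, value in self.in_edges if prop == "attack")

    def members(self, attack):
        """Transmissions reached by traversing back from an attack vertex."""
        return list(self.in_edges.get(("attack", attack), []))


def cluster_profile(graph, attack):
    """Aggregated (mean) numerical properties of one cluster."""
    members = graph.members(attack)
    keys = graph.numeric[members[0]]
    return {k: sum(graph.numeric[m][k] for m in members) / len(members) for k in keys}


def classify(graph, numeric):
    """Return the label of the cluster with the minimum difference in properties."""
    def difference(attack):
        profile = cluster_profile(graph, attack)
        return sum(abs(numeric[k] - profile[k]) for k in profile)
    return min(graph.attack_types(), key=difference)


def build_sample_graph():
    """Four hypothetical transmissions: two normal, two 'back' attacks."""
    graph = ProcessGraph()
    graph.add_transmission("t1", {"duration": 0, "src_bytes": 200},
                           {"protocol": "tcp", "service": "http", "attack": "normal"})
    graph.add_transmission("t2", {"duration": 0, "src_bytes": 300},
                           {"protocol": "tcp", "service": "http", "attack": "normal"})
    graph.add_transmission("t3", {"duration": 2, "src_bytes": 54000},
                           {"protocol": "tcp", "service": "http", "attack": "back"})
    graph.add_transmission("t4", {"duration": 4, "src_bytes": 54600},
                           {"protocol": "tcp", "service": "http", "attack": "back"})
    return graph
```

A new transmission would be labeled with `classify(build_sample_graph(), {"duration": 1, "src_bytes": 53000})` and then added to the graph with that label. Because the differences are not normalised here, large-valued properties such as `src_bytes` dominate the comparison; a practical version would scale each property first.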
All nominal properties are constructed as vertices with the corresponding property labels, while numerical properties are encoded as attributes of the transaction node. A similar graph is constructed for the entire KDD Cup 99 dataset, containing 10.5 million records with 42 properties including the attack type.

Data mining approaches allow us to visualize the elements only at the end, whereas a graph DB provides the necessary visualizations without any additional effort. Using the graph DB and the associated query language, we can perform interactive analysis and querying, which allows us to further optimize the results. The results can also be filtered immediately without any additional processing, hence providing deeper insights. Performing these operations on the obtained results is much less feasible with a machine learning approach.

A further advantage of this approach is that graph visualization can be used to manually browse through the various clusters. This can provide a much better understanding of the problem at hand, and it can also reveal results that remained hidden in earlier approaches that work directly on the data. It is also possible to mark and analyze the areas of interest identified as containing too many outliers. Several graph visualization tools are available for Big Data, and for graph data in particular, and these can be used for effective identification and for real time analysis of the obtained results. Some of the graph visualization tools available for Big Data are Graphviz, Walrus and Gephi. Since our current modelling uses Neo4j as the graph DB, the inbuilt Neo4j Browser can function as the visualization tool. This is a powerful and customizable data visualization tool that is based on the D3.js library.
If additional assistance in visualization is needed, SVG based graph interaction, KeyLines Neo4j graph visualization or Tom Sawyer Perspectives can be utilized.

4. Conclusion
This paper presents several recent technologies that can be utilized for intrusion detection. They have become a necessity because, in the current scenario, network data has become so huge that it is classified as Big Data. Hence traditional data processing systems fail to work on this data. Due to the presence of the Velocity component along with the Volume, it becomes mandatory for the processing system to be storage efficient and fast. In this paper, we discuss three probable processing approaches: Hadoop, Elasticsearch (significant terms aggregation) and Graph Databases. After analysis, it was found that various shortcomings exist in the architectures of Hadoop and Elasticsearch with respect to our application. Hence we zero in on Graph Databases and elucidate their efficiency in processing graph structures. Graph DB performance could be further improved by creating custom User Defined Functions (Query Functions) that are executed on GPUs rather than on CPUs. The number crunching work could be done on the GPUs, thereby reducing the load on the CPUs to a considerable extent. GPU based algorithms for graph processing could be considered for further optimization of the solution.
5. References
[1] K. Jackson, (1989), "Viewgraphs on Intrusion Detection: User Authentication Profiles at Los Alamos", Los Alamos National Laboratory.
[2] H. S. Javitz, A. Valdes, D. E. Denning and P. G. Neumann, (1986), "Analytical Techniques Development for a Statistical Intrusion Detection System (SIDS) based on Accounting Records", SRI International.
[3] T. F. Lunt, (1988), "Automated Audit Trail Analysis and Intrusion Detection: A Survey", Proceedings of the 11th National Computer Security Conference.
[4] Shieh, Shiuhpyng Winston, and Virgil D. Gligor, (1991), "A pattern-oriented intrusion-detection model and its applications", Research in Security and Privacy, Proceedings. 1991 IEEE Computer Society Symposium on. IEEE.
[5] Lunt, Teresa F., et al., (1989), "Knowledge-based intrusion detection", AI Systems in Government Conference, Proceedings of the Annual. IEEE.
[6] Snapp, Steven, et al., (1991), "A system for distributed intrusion detection", COMPCON Spring 91: 170-176.
[7] Riedmüller, R., et al., (2010), "Constraints on autonomous use of standard GPU components for asynchronous observations and intrusion detection", Security and Communication Networks (IWSCN), 2010 2nd International Workshop on. IEEE.
[8] Huang, Nen-Fu, et al., (2008), "A gpu-based multiple-pattern matching algorithm for network intrusion detection systems", Advanced Information Networking and Applications-Workshops, AINAW 2008. 22nd International Conference on. IEEE.
[9] Khabbazian, M.H., Eslami, H., Totoni, E., Khadem, A., (2010), "High-throughput stream categorization and intrusion detection on GPU", Formal Methods and Models for Codesign (MEMOCODE), 8th IEEE/ACM International Conference on, 81-84.
[10] Quan Qian, Hongyi Che, Rui Zhang, Mingjun Xin, (2010), "The Comparison of the Relative Entropy for Intrusion Detection on CPU and GPU", Computer and Information Science (ICIS), IEEE/ACIS 9th International Conference on, 141-146.
[11] Wu, Chengkun, et al., (2009), "An efficient pre-filtering mechanism for parallel intrusion detection based on many-core GPU", Security Technology. Springer Berlin Heidelberg, 298-305.
[12] Vokorokos, Liberios, Michal Ennert, and Ján Radušovský, (2014), "A Survey of parallel intrusion detection on graphical processors", Central European Journal of Computer Science 4.4: 222-230.
[13] Tiirmaa-Klaar, Heli, et al., (2013), "Botnets: How to Fight the Ever-Growing Threat on a Technical Level", Botnets. Springer London, 41-97.
[14] Silva, Sérgio SC, et al., (2013), "Botnets: A survey", Computer Networks 57.2: 378-403.
[15] Fairchild's Director of R & D, (2007), "'Moore's Law' Predicts the Future of Integrated Circuits", Computer History Museum, Retrieved 2009-03-19.
[16] John L Gustafson, (1988), "Reevaluating Amdahl's Law", Communications of the ACM 31(5), pp. 532-533.
[17] Rodgers, David P., (June 1985), "Improvements in multiprocessor system design", ACM SIGARCH Computer Architecture News archive (New York, NY, USA: ACM) 13 (3): 225-231. doi:10.1145/327070.327215. ISBN 0-8186-0634-7. ISSN 0163-5964.
[18] Tavallaee, Mahbod, et al., (2009), "A detailed analysis of the KDD CUP 99 data set", Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications.
[19] http://iscx.ca/NSL-KDD/ Referred on: 30 October 2014
[20] D. P. Jeyepalan, E. Kirubakaran, (April 2013), "A Novel Graph Based Clustering Approach for Network Intrusion Detection", International Journal of Computational Intelligence and Information Security, Vol. 4 No. 4, ISSN: 1837-7823.
[21] D. P. Jeyepalan, E. Kirubakaran, (2014), "A Co-operative Game Theoretic Approach to Improve the Intrusion Detection System in a Network using Ant Colony Clustering", International Journal of Computer Applications, Volume 87 - Number 14.
[22] D. P. Jeyepalan, E. Kirubakaran, (2014), "Agent Based Parallelized Intrusion Detection System Using Ant Colony Optimization", International Journal of Computer Applications (0975 - 8887), Volume 105 - Number 10.
[23] Marinakis, Yannis, et al., (2011), "A hybrid ACO-GRASP algorithm for clustering analysis", Annals of Operations Research 188.1: 343-358.
[24] Ganapathy, Sannasi, et al., (2013), "Intelligent feature selection and classification techniques for intrusion detection in networks: a survey", EURASIP Journal on Wireless Communications and Networking 2013.1: 1-16.
[25] daCosta, Francis, (2013), "Small Data, Big Data, and Human Interaction", Rethinking the Internet of Things: A Scalable Approach to Connecting Everything: 77-94.
[26] Feifei Li, Suman Nath, (2014), "Scalable data summarization on big data", Distributed and Parallel Databases, 32(3). DOI: 10.1007/s10619-014-7145-y.
[27] J. McHugh, (2000), "Testing intrusion detection systems: a critique of the 1998 and 1999 Darpa intrusion detection system evaluations as performed by Lincoln laboratory", ACM Transactions on Information and System Security, vol. 3, no. 4, pp. 262-294.