Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Computer Graphics and Interaction
Anomaly detection and analysis of web traffic
Bc. Richard Richter
Supervisor: Martin Rehák, Ph.D.
Study Programme: Otevřená informatika, strukturovaný, Navazující
Field of Study: Softwarové inženýrství
May 10, 2011
I would like to express my thanks to my thesis advisor Martin Rehák, Ph.D. for many of the ideas behind this work and for his patient help and support throughout the project, and to Ing. Martin Bílý for his help with providing data from the school proxy server. I would also like to thank my family and friends for their patience.
Declaration

I hereby declare that I have completed this thesis independently and that I have listed all the literature and publications used.
I have no objection to the usage of this work in compliance with §60 of Act No. 121/2000 Sb. (the Copyright Act), and with the rights connected with the Copyright Act, including subsequent amendments to the act.
In Prague 9. 5. 2011
Abstract

This diploma thesis deals with anomaly detection and analysis of web traffic based on GET requests to web servers. The first section addresses the method of obtaining data from the Squid proxy server and sending it in the form of NetFlow packets to a collector. The second section covers the analysis of the obtained data and the implementation of several algorithms, which have been adapted to the current form of URLs and GET headers, since conventions in them have changed considerably over the years.
Introduction

Computer systems and networks are facing a number of different attacks, and every day a considerable number of new threats try to find a loophole in the security of organizations and either get to classified information, or use the affected machine's computing power to take part in a botnet, which will later send out spam or disturb the peace of the global network in other ways. Although there are various tools to prevent unwanted intruders from entering our systems - virus scanners on the client computers, firewalls and other network infrastructure elements - attackers still manage to find loopholes and penetrate seemingly secure computer networks.
Following existing attack patterns is not always enough to capture new types of attacks, since threats exploiting a newly discovered vulnerability would not be identified. IDS systems solve this problem by creating models of normal traffic and user behaviour on the network; attacks are then detected as anomalies compared to normal traffic. For example, if network traffic grows by dozens of percent for no obvious reason, the network is treated as the target of an attack.
The aim of this work is to find these anomalies using information provided by proxies, based on users' GET requests to web sites. The following section introduces the CAMNEP IDS system, for which the results of my work will be used, followed by an analysis of the usable information provided by the Squid proxy server and the sending of the necessary information in the NetFlow format.
The next chapter introduces the structure of the URL and the GET header. It is followed by a list of existing anomaly detection algorithms and an extended description of those I have chosen as suitable for implementation in my work and useful for the analysis of the expected data; the implementation is described in the subsequent chapter. The implementation chapter is followed by a chapter describing the individual experiments and their results.
1.1 CAMNEP Introduction

CAMNEP is a research prototype of a network intrusion detection system. It is based on the collaboration of a community of detection agents, each of which embodies an existing anomaly detection model. The agents use extended trust modelling, a technique established in the multi-agent research field, to improve the quality of classification provided by the individual models.
The agents process unsampled data acquired by dedicated high-performance NetFlow aggregation cards.
1.2 Squid analysis

Squid is a caching proxy server used to optimize a network's web server queries; it is also deployed at the CTU FEE. Since the proxy server must serve all the clients' web service requests and works directly with GET headers, this point in the network is ideal for eavesdropping on requests, processing them and passing them on for further analysis.
I was looking for the most efficient way to intercept the data flowing into the Squid proxy server, and there were several options to choose from. As the Squid source code is publicly available, one possible solution would be the creation of a module that would capture the communication and then forward it to the collector. This module would be implemented directly into Squid. Although this alternative could lead to the most effective processing, complications may arise, mainly in the module's maintainability.
The module would probably have to be upgraded with every new version of the Squid server. It would also not be possible to deploy this solution on previously installed systems without using a modified version of the proxy server.
Another solution under consideration was intercepting the incoming packets before they actually arrive at the proxy server, processing them and passing them on to Squid, where they would continue on their designated path. However, this solution seemed too complicated compared with the solution I have chosen (see below).
After examining existing tools relying on information obtained from the Squid proxy server, I found that the majority of them rely on parsing the access log, and in the end I decided on the same solution. Squid provides very flexible logging configuration options, and the server administrator can choose which information will be kept.
By default the set of logged information is not very wide and is inappropriate for analysing the URL and GET headers, so the logformat directive needs to be extended with other elements. Table 1.2 shows a sufficient configuration, but it is also possible to use a different one. Table 1.3 contains an explanation of each parameter used in the logformat directive.
Table 1.3: Logformat parameter explanation

  >a   Client source IP address
  <A   Server IP address or peer name
  >p   Client source port
  st   Request+reply size including HTTP headers
  Ss   Squid request status (TCP_MISS etc.)
  ts   Seconds since epoch
  tu   Subsecond time (milliseconds)
  tr   Response time (milliseconds)
  ru   Request URL
  rp   Request URL path excluding hostname
  >h   Original request header; optional header name argument in the format header[:[separator]element]
  rm   Request method (GET/POST etc.)
  <h   Reply header; optional header name argument as for >h
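The collecting script can recover these fields by parsing each access-log line. The following is a minimal Python sketch; the field order assumes a hypothetical logformat directive (the actual directive in Table 1.2 may order the fields differently), and the sample line is made up.

```python
def parse_access_line(line):
    """Split one Squid access-log line into a dict of named fields.

    Assumed field order: %ts.%tu %tr %>a %>p %<A %Ss %st %rm %ru
    (an illustrative logformat, not necessarily the one actually deployed).
    """
    fields = line.split()
    ts, _, tu = fields[0].partition(".")
    return {
        "timestamp": int(ts),           # seconds since epoch (ts)
        "subsec": int(tu or 0),         # subsecond time in ms (tu)
        "response_ms": int(fields[1]),  # response time (tr)
        "client_ip": fields[2],         # client source address (>a)
        "client_port": int(fields[3]),  # client source port (>p)
        "server": fields[4],            # server IP or peer name (<A)
        "status": fields[5],            # Squid request status (Ss)
        "size": int(fields[6]),         # request+reply size (st)
        "method": fields[7],            # request method (rm)
        "url": fields[8],               # request URL (ru)
    }

rec = parse_access_line(
    "1305021600.123 87 147.32.80.9 51234 93.184.216.34 TCP_MISS 5120 GET http://example.com/index.html"
)
```

A real parser must also handle header fields, which can contain spaces and therefore need quoting or a distinctive separator in the logformat directive.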
1.3 NetFlow and IPFIX data format

Now that we have decided what data could be useful for analysis, we need to find a way to transfer it to the collector. The best choice is probably NetFlow, a network protocol developed by Cisco Systems for collecting IP traffic information.
First I experimented with NetFlow v5, but this version's data format has a strictly defined structure and does not support the transfer of any text information. There is therefore no way to use it for transferring the URL or any other information from the GET headers. The most flexible structure is IPFIX, which I had already started to implement in the collecting script; however, the CAMNEP system does not support IPFIX yet.
The final decision was to use NetFlow v9, which is very similar to IPFIX, although it has a few limitations. For example, it does not support extension with enterprise-specific element data formats, and because it contains no URL or any other text data fields, we need to make our own data field specification for them.
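A NetFlow v9 template flowset declaring such custom fields could be packed as in the sketch below. The field type numbers for the URL and GET header items are purely illustrative assumptions, since NetFlow v9 standardizes no such elements; the standard field type IDs come from the NetFlow v9 specification (RFC 3954).

```python
import struct

# Standard NetFlow v9 field types (RFC 3954)
IPV4_SRC_ADDR = 8
IPV4_DST_ADDR = 12
# Custom field types for the text data -- these IDs are an assumption;
# NetFlow v9 defines no URL or GET-header elements of its own.
FIELD_URL = 40001  # request URL, fixed length
FIELD_GET = 40002  # GET header, fixed length

def build_template_flowset(template_id, fields):
    """Pack a template flowset; fields is a list of (type, length) pairs."""
    body = struct.pack("!HH", template_id, len(fields))
    for ftype, flen in fields:
        body += struct.pack("!HH", ftype, flen)
    # FlowSet ID 0 marks a template flowset; length includes the 4-byte header
    return struct.pack("!HH", 0, 4 + len(body)) + body

template = build_template_flowset(256, [
    (IPV4_SRC_ADDR, 4),
    (IPV4_DST_ADDR, 4),
    (FIELD_URL, 226),
    (FIELD_GET, 900),
])
```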
Because NetFlow v9 does not support variable record lengths and each item's size must be defined in a template, I had to decide how to transfer text records that do not have a strictly given length. I analysed URLs obtained from the FEE's proxy server and decided to use a length that should cover as many requests as possible: 226 characters. This value was chosen from chart 1.1, where you can see that most of the values lie in a range of around 70 characters. The higher value was chosen to better cover longer requests. Shorter records are filled with blank characters and longer entries are truncated. I also considered adding an additional record containing the real length of the transferred record, but in the end I decided to reject this solution and handle it on the collector side. I took the same steps for the length of GET headers. Because the data source I used for the analysis and learning trimmed all data to a length of 1024 B, it was not possible to determine a threshold that would appropriately cover the maximum number of GET headers. Based on graph 1.2 I chose a GET header length of 900 B as the threshold.
Figure 1.1: URL lengths
For our purposes the following items have been defined:

URL - 226 B
GET - 900 B

Sending the whole GET header is not an ideal solution, and in future work I would like to analyse the most useful parts of the GET header and send just those. After that it will probably be possible to reduce the size of the GET header item in the NetFlow packet.
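The padding and truncation described above can be sketched as follows; this is a minimal illustration, and the function names are mine rather than those of the actual collecting script.

```python
URL_FIELD_LEN = 226  # chosen from the URL length distribution (Figure 1.1)
GET_FIELD_LEN = 900  # chosen from the GET header distribution (Figure 1.2)

def to_fixed_field(text, length):
    """Exporter side: truncate longer values, pad shorter ones with blanks."""
    data = text.encode("ascii", errors="replace")[:length]
    return data.ljust(length, b" ")

def from_fixed_field(data):
    """Collector side: strip the blank padding to recover the value."""
    return data.rstrip(b" ").decode("ascii")

url_field = to_fixed_field("http://example.com/index.html", URL_FIELD_LEN)
```

Note that stripping padding on the collector side assumes the original value never ends in a blank character, which holds for URLs.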
IPv6 is now arriving, and it is necessary to consider it in our solution. When we detect the source and destination addresses, we have to determine whether each is IPv4 or IPv6 (or a hostname, which should be translated to an IP by the script) and choose the appropriate data sending template based on the resulting IP version information. There are four templates, which differ in the combination of IP addresses: the first has both the source and destination addresses in IPv4 format, the second has both in IPv6 format, and the last two each have one address in IPv4 format and the other in IPv6, and vice versa.
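The template selection by IP version can be sketched in Python as follows; the template IDs are hypothetical placeholders for the four templates mentioned above.

```python
import ipaddress
import socket

# Hypothetical template IDs for the four source/destination combinations
TEMPLATE_IDS = {
    (4, 4): 256,  # src IPv4, dst IPv4
    (6, 6): 257,  # src IPv6, dst IPv6
    (4, 6): 258,  # src IPv4, dst IPv6
    (6, 4): 259,  # src IPv6, dst IPv4
}

def ip_version(addr):
    """Return 4 or 6; hostnames are first resolved to an IP address."""
    try:
        return ipaddress.ip_address(addr).version
    except ValueError:
        resolved = socket.getaddrinfo(addr, None)[0][4][0]
        return ipaddress.ip_address(resolved).version

def choose_template(src, dst):
    """Pick the data template matching the IP versions of both endpoints."""
    return TEMPLATE_IDS[(ip_version(src), ip_version(dst))]
```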
Algorithms

This chapter describes algorithms which could be used for anomaly detection in web traffic. I implemented the first three of them (N-Grams, GSAD, DFA). For the others I describe why their implementation is not applicable, or suggest them as a further extension of this work. First it is necessary to describe the input data processed by these algorithms; it is described in the following subsection.
2.1 Input data

2.1.1 URL structure

Over the years there have been significant changes in the use of web sites and the services around them. In the nineties the Internet was full of static content and web sites had only an informative function. Over time, the web space began to fill with dynamic content and different applications and services.
As the sites themselves kept evolving, a change in the URLs used for accessing each resource became inevitable. Previously, most of the attributes that specified the content on the server were passed in the query part using parameters. This solution became inconvenient because the individual URLs grew chaotic and too long and, most importantly, brought many problems in search engine optimization (SEO) and potential caching.
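As an illustration, an old-style URL of this kind can be decomposed with Python's standard library; the URL itself is a made-up example.

```python
from urllib.parse import urlparse, parse_qs

# An old-style URL carrying all content selectors in the query part
url = "http://example.com/view.php?page=article&id=1234&lang=en"
parts = urlparse(url)        # scheme, host, path and query as components
params = parse_qs(parts.query)  # each parameter mapped to its list of values
```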