Summary:
Aalto University School of Science
Degree Programme in Life Science Technologies
ABSTRACT OF MASTER'S THESIS

Author: Inka Saarinen
Title: Adaptive real-time anomaly detection for multi-dimensional streaming data
Date: February 22, 2017
Pages: vi + 85
Major: Bioinformatics
Supervisor: Professor Samuel Kaski
Advisors: Yrjö Häme, D.Sc. (Tech.); Timo Similä, D.Sc. (Tech.)

Data volumes are growing rapidly as data emerges from millions of devices. This creates an increasing need for streaming analytics, that is, processing and analysing data record by record.

In this work a comprehensive literature review on streaming analytics is presented, focusing on detecting anomalous behaviour. Challenges and approaches for streaming analytics are discussed. Different ways of defining and identifying anomalies are shown, and a large number of anomaly detection methods for streaming data are presented. Existing software platforms and solutions for streaming analytics are also presented.

Based on the literature survey I chose one method for further investigation, namely Lightweight on-line detector of anomalies (LODA). LODA is designed to detect anomalies in real time, even from high-dimensional data. In addition, it is an adaptive method and updates its model on-line.

LODA was tested on both synthetic and real data sets. This work shows how to define the parameters used with LODA. I present several improvement ideas for LODA and show that three of them bring important benefits. First, I show a simple addition for handling special cases, which makes it possible to compute an anomaly score for all data points. Second, I show cases where LODA fails due to a lack of data preprocessing. I suggest preprocessing schemes for streaming data and show that using them improves the results significantly, while requiring only a small subset of the data to determine the preprocessing parameters. Third, since LODA only gives anomaly scores, I suggest thresholding techniques to define anomalies. This work shows that the suggested techniques work fairly well compared to the theoretical best performance. This makes it possible to use LODA in real streaming analytics situations.
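
The ideas in the abstract can be illustrated with a minimal sketch of a LODA-style detector. It follows the general description of LODA in the literature (an ensemble of sparse random projections, each with a one-dimensional histogram, scored by the average negative log density) and is not the thesis's implementation. The z-score preprocessing, the add-one smoothing for empty bins, and the quantile threshold are assumptions standing in for the three improvements mentioned above, whose exact form the abstract does not give. All names (`LodaSketch`, `bin_index`, `quantile_threshold`) are hypothetical, and for brevity the model is fitted on a batch rather than updated on-line.

```python
import math
import random


def bin_index(v, lo, hi, n_bins):
    """Return the equal-width bin of v in [lo, hi], or None if v is outside."""
    if v < lo or v > hi:
        return None
    if hi == lo:
        return 0
    return min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)


class LodaSketch:
    """Batch-fitted LODA-style detector (illustrative, not the thesis code)."""

    def __init__(self, n_projections=50, n_bins=20, seed=0):
        self.k, self.n_bins = n_projections, n_bins
        self.rng = random.Random(seed)

    def _project(self, w, row):
        # Standardise each feature, then take the dot product with w.
        return sum(wj * (x - m) / s
                   for wj, x, m, s in zip(w, row, self.mean, self.std))

    def fit(self, rows):
        n, d = len(rows), len(rows[0])
        # Assumed preprocessing: z-score parameters estimated from this sample.
        self.mean = [sum(r[j] for r in rows) / n for j in range(d)]
        self.std = [
            math.sqrt(sum((r[j] - self.mean[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)
        ]
        nnz = max(1, round(math.sqrt(d)))  # about sqrt(d) non-zero weights
        self.models = []
        for _ in range(self.k):
            w = [0.0] * d
            for j in self.rng.sample(range(d), nnz):
                w[j] = self.rng.gauss(0.0, 1.0)
            z = [self._project(w, r) for r in rows]
            lo, hi = min(z), max(z)
            counts = [0] * self.n_bins
            for v in z:
                counts[bin_index(v, lo, hi, self.n_bins)] += 1
            self.models.append((w, lo, hi, counts))
        self.n = n
        return self

    def score(self, row):
        """Average negative log density over all projections (higher = more anomalous)."""
        total = 0.0
        for w, lo, hi, counts in self.models:
            b = bin_index(self._project(w, row), lo, hi, self.n_bins)
            c = counts[b] if b is not None else 0
            width = (hi - lo) / self.n_bins or 1.0
            # Assumed special-case handling: add-one smoothing keeps the
            # estimate positive for empty or out-of-range bins, so every
            # point receives a finite score.
            density = (c + 1) / ((self.n + self.n_bins) * width)
            total -= math.log(density)
        return total / self.k


def quantile_threshold(scores, q=0.99):
    """Assumed thresholding rule: flag scores above the q-quantile of reference scores."""
    ordered = sorted(scores)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]
```

In use, one would fit the model on an initial portion of the stream, compute a threshold from the scores of that portion, and flag later records whose score exceeds it. A true streaming variant would also update the histograms as new records arrive.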
Technology Use: .NET, Java, or Python
Modules:
Algorithm Use: Not Defined
