Time series are an increasingly prevalent form of large-scale data, due in no small part to the upsurge in human behavioural data now being recorded on an unprecedented scale. For such large data sets, algorithms that are efficient in both time and space are required. One way to reduce both processing and storage costs is symbolisation as a pre-processing step, which additionally opens up the use of an array of discrete algorithms.
The use of such a pre-processing step is not new: a number of well-used approaches to symbolisation already exist in the literature and are typically employed. In this work we show, however, that these standard approaches are sub-optimal in (at least) the broad application area of time series comparison, leading to unnecessary data corruption and potential performance loss before any real data mining takes place.
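To make the pre-processing step concrete, the following is a minimal illustrative sketch of one standard style of symbolisation (equal-frequency binning on the empirical quantiles, broadly in the spirit of well-known approaches such as SAX); it is not the algorithm proposed in this work, and the function name and alphabet are illustrative choices.

```python
import bisect
import statistics

def symbolise(series, alphabet="abcd"):
    """Discretise a numeric series into symbols using equal-frequency bins.

    Each value is mapped to a letter according to which empirical
    quantile bin it falls into, so symbols occur roughly equally often.
    """
    n_bins = len(alphabet)
    # Empirical breakpoints: n_bins - 1 cut points from the data itself.
    cuts = statistics.quantiles(series, n=n_bins)
    return "".join(alphabet[bisect.bisect_left(cuts, x)] for x in series)
```

For example, `symbolise([1, 2, 3, 4, 5, 6, 7, 8])` yields `"aabbccdd"`: each quartile of the data receives its own symbol, after which discrete algorithms (string matching, suffix trees, and so on) become applicable.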
Outcomes and Impact
After demonstrating the sub-optimal nature of existing symbolisation techniques, we present a novel algorithm which is shown to be optimal under some broadly applicable assumptions. Subsequently, we show that the specific, but common, application area of outlier detection (when taking a standard definition based on time series comparisons) benefits from special consideration. Building on our prior approach, we provide a symbolisation algorithm that directly optimises the representation for the task of outlier detection.
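To illustrate the kind of comparison-based outlier definition referred to above (and not the optimised algorithm of the paper), a toy sketch might score each symbolised series by its total distance to all the others and flag the farthest one; the Hamming distance and function names here are illustrative assumptions.

```python
def hamming(a, b):
    """Symbol-wise mismatch count between two equal-length symbol strings."""
    return sum(x != y for x, y in zip(a, b))

def most_outlying(symbol_strings):
    """Return the index of the string farthest, by total Hamming distance,
    from all the others -- a toy comparison-based outlier definition."""
    totals = [sum(hamming(s, t) for t in symbol_strings)
              for s in symbol_strings]
    return max(range(len(totals)), key=totals.__getitem__)
```

Under a definition like this, the quality of the symbolisation directly determines which series look outlying, which is why a representation optimised specifically for outlier detection can outperform a generic one.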
The paper describing the approach to symbolising time series for arbitrary comparisons is available in the Proceedings of the 2013 IEEE International Conference on Data Mining (ICDM) Workshops, Dallas, TX, USA. A pre-print is also available.
The paper describing the approach to symbolising time series for outlier detection is available in the Proceedings of the 2015 IEEE International Conference on Big Data, Santa Clara, CA, USA. A pre-print is also available.
Slides used for the presentation of the above work are included below.