Day 125(ML) — A gentle intro to TimeSeries Analysis

Let’s explore another application where data science plays a crucial role: time series analysis. A simple example of time series data is sales figures for consecutive years. Using the sales history, a machine learning model can make predictions for the future.

For instance, consider an online learning platform whose objective is to figure out which courses are most preferred. If we have historical information on how many learners paid for each course, we can forecast the preference in future years. The content of the less chosen courses could then be improved, for example by including more real-world examples and case studies.

CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=647831

The above picture is an example of time series data. The main prerequisite for time series modelling is that the data is available as a sequence over regular intervals. Other examples are the quarterly turnover of an organisation and the daily closing price of a stock. A series should also keep a single granularity: if we are dealing with daily information, it contains only daily observations, not monthly or yearly results.
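The regular-interval prerequisite can be checked before any modelling. The snippet below is a minimal sketch using only the Python standard library; the function name `has_regular_interval` and the sample dates are illustrative assumptions, not part of any library.

```python
from datetime import date


def has_regular_interval(timestamps):
    """Return True if all consecutive timestamps are the same distance apart.

    Needs at least two timestamps; a shorter input returns False.
    """
    gaps = {later - earlier for earlier, later in zip(timestamps, timestamps[1:])}
    return len(gaps) == 1


# Hypothetical daily observations: one complete run and one with a skipped day.
daily = [date(2021, 3, 1), date(2021, 3, 2), date(2021, 3, 3), date(2021, 3, 4)]
with_gap = [date(2021, 3, 1), date(2021, 3, 2), date(2021, 3, 4)]

print(has_regular_interval(daily))     # True
print(has_regular_interval(with_gap))  # False
```

This check suits fixed-length intervals such as days or hours. Calendar months differ in length, so monthly data would need a comparison on (year, month) pairs instead.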

Training and Testing data split in Timeseries: Usually, in ML model building, we randomly split the data into a train set for training and a test set for validation. This method is not applicable to time series: a random split would break the ordering and let the model see observations that come after the ones it is tested on. Since we want to train the model on the sequence, the split into train and test must also be sequential. For instance, if we have 12 months of monthly data, the training data can be the first 10 months while the remaining two months are assigned to the testing set. Preserving the sequence is key when handling time series data.
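The 10-month and 2-month example can be sketched in a few lines of plain Python. The helper name `sequential_split` and the sales figures are made up for illustration; the only assumption is that the series is already sorted by time.

```python
def sequential_split(series, test_size):
    """Split a time-ordered series into train and test sets without shuffling."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    split_at = len(series) - test_size
    # Everything before the cut is history to learn from; the tail is held out.
    return series[:split_at], series[split_at:]


# Hypothetical monthly sales, January through December.
monthly_sales = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]

train, test = sequential_split(monthly_sales, test_size=2)
print(len(train), len(test))  # 10 2
print(test)                   # [104, 118]
```

Because the cut is made by position, every test observation is later in time than every training observation, which is what a forecast in practice looks like.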

Data collection for Timeseries: Before going into the algorithms, the most crucial step is setting up the data pipelines, which are the base for all the models. This raises the question: how do we choose the data for model training? If the gathered data covers too short a duration, the model will not learn the patterns effectively. If it covers too long a period, the model may miss the current trends. So an intermediate range is the ideal choice. Another point to remember is the length of the forecast: we can predict the future only for a limited horizon and not too far ahead, because the future develops gradually from the present.

Missing Data in Timeseries: The input series cannot have gaps in the middle. If any values are missing, we can impute them to restore the continuity.
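One common way to impute gaps is linear interpolation between the nearest known neighbours. This is only one option among several (forward fill and seasonal averages are others), and the post does not prescribe a specific method. The sketch below assumes missing values are marked as `None`; the function name and the prices are hypothetical.

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known values.

    Gaps at the very start or end are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    # Leading and trailing gaps have only one neighbour, so copy it.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interior gaps: walk in equal steps from the left value to the right value.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical daily closing prices with missing days.
daily_close = [10.0, None, 14.0, None, None, 20.0, None]
print(interpolate_missing(daily_close))
# [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 20.0]
```

Interpolation restores continuity, but the filled points are estimates, so long runs of missing data are better handled by collecting the data again where possible.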

In the upcoming articles, we will unravel more concepts related to time series analysis.

