Detecting and dealing with non-constant variance in time series
A time series is heteroskedastic if its variance changes over time. Otherwise, the data set is homoskedastic.
Heteroskedasticity affects the modeling of time series. So, it is important to detect and deal with this condition.
Let's start with a visual example.
Figure 1 below shows the popular airline passengers time series. You can see that the variation is different across the series. The variance is higher in the latter part of the series. This is also where the level of the data is higher.
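To put a rough number on that visual impression, here is a minimal sketch on synthetic data (not the airline series). It assumes a monthly series whose noise is proportional to its level, splits it into yearly blocks, and checks whether the spread of each block grows with its average level. The series, the seed, and the block size are all illustrative choices.

```python
import numpy as np

# Synthetic monthly series (an assumption, not the airline data):
# an upward trend with noise whose spread is proportional to the level.
rng = np.random.default_rng(0)
n_years, period = 12, 12
level = np.linspace(100.0, 500.0, n_years * period)
series = level * (1.0 + 0.1 * rng.standard_normal(level.size))

# One row per year: compare the average level with the spread inside each year.
blocks = series.reshape(n_years, period)
block_mean = blocks.mean(axis=1)
block_std = blocks.std(axis=1, ddof=1)

# A clearly positive correlation means the spread grows with the level.
corr = np.corrcoef(block_mean, block_std)[0, 1]
for m, s in zip(block_mean, block_std):
    print(f"level {m:7.1f}   std {s:6.1f}")
print(f"correlation between level and spread: {corr:.2f}")
```

A spread that rises together with the level is the pattern described for Figure 1.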
Changes in variance are problematic for forecasting. They affect the fitting of an adequate model, thereby affecting forecasting performance.
But visual inspection alone is not practical. How can you detect and deal with heteroskedasticity in a more systematic way?
You can check whether a time series is heteroskedastic using statistical tests. These include the following:
- White test;
- Breusch-Pagan test;
- Goldfeld-Quandt test.
The main input to these tests is the residuals of a regression model (e.g. ordinary least squares). The null hypothesis is that the residuals are distributed with equal variance. If the p-value is smaller than the significance level, we reject this hypothesis. This means that the time series is heteroskedastic. The significance level is often set to a value up to 0.05.
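To make the mechanics concrete, here is a minimal numpy-only sketch of one such test, the studentized (Koenker) form of the Breusch-Pagan test, for a series regressed on a time index. The function name breusch_pagan_pvalue and the synthetic data are illustrative assumptions. It is meant to show the steps, not to replace a library implementation.

```python
import math
import numpy as np

def breusch_pagan_pvalue(y: np.ndarray) -> float:
    """Studentized Breusch-Pagan p-value for y regressed on a time index."""
    n = len(y)
    time = np.arange(1, n + 1, dtype=float)
    X = np.column_stack([np.ones(n), time])  # intercept + time

    # Step 1: OLS of the series on time; keep the residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    # Step 2: auxiliary OLS of the squared residuals on the same regressors.
    u2 = resid ** 2
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ gamma
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)

    # Step 3: LM = n * R^2 is asymptotically chi-square with 1 degree of
    # freedom here (one non-constant regressor). For 1 degree of freedom the
    # upper-tail probability is erfc(sqrt(x / 2)).
    lm = max(n * r2, 0.0)  # guard against tiny negative rounding errors
    return math.erfc(math.sqrt(lm / 2.0))

# Synthetic examples (assumed data): same trend, different noise behaviour.
rng = np.random.default_rng(42)
t = np.arange(1, 145, dtype=float)
constant_noise = 2.0 * t + rng.normal(0.0, 10.0, size=t.size)
growing_noise = 2.0 * t + 0.2 * t * rng.standard_normal(t.size)

p_homo = breusch_pagan_pvalue(constant_noise)
p_hetero = breusch_pagan_pvalue(growing_noise)
print(f"constant noise: p = {p_homo:.3g}")
print(f"growing noise:  p = {p_hetero:.3g}")
```

The series whose noise grows over time should give a p-value far below 0.05. The series with constant noise usually does not, although any test rejects a true null hypothesis now and then.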
The statsmodels Python library has an implementation of the three tests above. Here's a snippet that wraps these in a single class:
import pandas as pd
import statsmodels.stats.api as sms
from statsmodels.formula.api import ols

TEST_NAMES = ['White', 'Breusch-Pagan', 'Goldfeld-Quandt']
FORMULA = 'value ~ time'

class Heteroskedasticity:

    @staticmethod
    def het_tests(series: pd.Series, test: str) -> float:
        """
        Testing for heteroskedasticity
        :param series: Univariate time series as pd.Series
        :param test: String denoting the test. One of 'White', 'Breusch-Pagan', or 'Goldfeld-Quandt'
        :return: p-value as a float.
        A high p-value means we fail to reject the null hypothesis that the data is homoskedastic
        """
        assert test in TEST_NAMES, 'Unknown test'

        # build a data frame with a time index (1, 2, ...) and the observed values
        series = series.reset_index(drop=True).reset_index()
        series.columns = ['time', 'value']
        series['time'] += 1

        # regress the values on time; the tests are run on the residuals
        olsr = ols(FORMULA, series).fit()

        if test == 'White':
            _, p_value, _, _ = sms.het_white(olsr.resid, olsr.model.exog)
        elif test == 'Goldfeld-Quandt':
            _, p_value, _ = sms.het_goldfeldquandt(olsr.resid, olsr.model.exog, alternative='two-sided')
        else:
            _, p_value, _, _ = sms.het_breuschpagan(olsr.resid, olsr.model.exog)

        return p_value

    @classmethod
    def run_all_tests(cls, series: pd.Series):
        # p-value of each test, keyed by test name
        test_results = {k: cls.het_tests(series, k) for k in TEST_NAMES}
        return test_results
The class Heteroskedasticity contains two functions. The function het_tests applies a specific test (White, Breusch-Pagan, or Goldfeld-Quandt). The function run_all_tests applies all three tests in one go. The output of these functions is the p-value of the corresponding test.
Here's how you can apply this code to the time series in Figure 1.
from pmdarima.datasets import load_airpassengers

# https://github.com/vcerqueira/blog/blob/main/src/heteroskedasticity.py
from src.heteroskedasticity import Heteroskedasticity

series = load_airpassengers(True)
test_results = Heteroskedasticity.run_all_tests(series)
# {'Breusch-Pagan': 4.55e-07,
#  'Goldfeld-Quandt': 8.81e-13,
#  'White': 4.34e-07}
The p-values of all tests are close to zero. So, you can reject the null hypothesis. These tests give compelling evidence for the presence of heteroskedasticity.
Here's the distribution of the residuals in the first and second half of the time series:
The distribution of residuals is different in these two parts. The Goldfeld-Quandt test uses this type of split to test for heteroskedasticity. It checks if the variance of the residuals is different in two data subsamples.
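The idea behind that split can be sketched with numpy alone. The hypothetical helper below fits a separate linear trend to each half of a series and returns the ratio of the two residual variances, which is the statistic the Goldfeld-Quandt test builds on. The data is synthetic and the helper name is an assumption. It is a sketch of the idea, not the statsmodels implementation.

```python
import numpy as np

def goldfeld_quandt_statistic(y: np.ndarray) -> float:
    """Ratio of residual variances: second half over first half."""
    n = len(y)
    time = np.arange(1, n + 1, dtype=float)
    X = np.column_stack([np.ones(n), time])  # intercept + time
    split = n // 2

    def residual_variance(X_part: np.ndarray, y_part: np.ndarray) -> float:
        # OLS on one subsample, then residual sum of squares over its
        # degrees of freedom (observations minus estimated coefficients).
        beta, *_ = np.linalg.lstsq(X_part, y_part, rcond=None)
        resid = y_part - X_part @ beta
        dof = len(y_part) - X_part.shape[1]
        return float(resid @ resid) / dof

    var_first = residual_variance(X[:split], y[:split])
    var_second = residual_variance(X[split:], y[split:])
    return var_second / var_first

# Synthetic series (assumed data) whose noise grows with time.
rng = np.random.default_rng(3)
t = np.arange(1, 145, dtype=float)
growing_noise = 2.0 * t + 0.2 * t * rng.standard_normal(t.size)

stat = goldfeld_quandt_statistic(growing_noise)
print(f"variance ratio (second half / first half): {stat:.2f}")
```

A ratio close to 1 is what you would expect under constant variance. Turning the ratio into a p-value means comparing it with an F distribution that has the two subsample degrees of freedom.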
A common remedy to heteroskedasticity in time series is to transform the data. Taking the logarithm of the time series helps to stabilize its variability.
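Why does the logarithm help? When the noise is multiplicative, meaning its size scales with the level, taking logs turns it into additive noise of roughly constant size. Here is a small sketch on synthetic data. The multiplicative form of the noise is an assumption of the example, and the known trend is removed only to isolate the noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 144
level = np.linspace(100.0, 500.0, n)
# Multiplicative noise: the spread scales with the level (assumed form).
series = level * np.exp(0.1 * rng.standard_normal(n))

def half_std_ratio(values: np.ndarray, trend: np.ndarray) -> float:
    """Std of the detrended second half divided by that of the first half."""
    detrended = values - trend
    half = len(values) // 2
    return detrended[half:].std(ddof=1) / detrended[:half].std(ddof=1)

raw_ratio = half_std_ratio(series, level)
log_ratio = half_std_ratio(np.log(series), np.log(level))

print(f"original scale: std ratio = {raw_ratio:.2f}")
print(f"log scale:      std ratio = {log_ratio:.2f}")
```

On the original scale the second half is noticeably noisier than the first. On the log scale the ratio should sit near 1.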
Here's the same time series as before but log-scaled:
This time, the variability looks steady along the series. Let's re-run the tests using the log-scaled time series:
import numpy as np

test_results = Heteroskedasticity.run_all_tests(np.log(series))
# {'Breusch-Pagan': 0.033,
#  'Goldfeld-Quandt': 0.18,
#  'White': 0.10}
The p-values are larger this time. Only one of the tests (Breusch-Pagan) rejects the hypothesis of constant variance. This is assuming a significance level of 0.05.
Reverting the log transformation
Suppose you're making predictions using log-transformed data. In that case, you need to revert the predictions to the original scale. This is done with the inverse of the transformation; in the case of the log, that is the exponential.
So, the steps of the forecasting process are the following:
- Transform the data to stabilize the variance;
- Fit a forecasting model;
- Get the forecasts, and revert them to the original scale.
Here's an example.
import numpy as np
from pmdarima.datasets import load_airpassengers
from pmdarima.arima import auto_arima
from sklearn.model_selection import train_test_split

series = load_airpassengers(True)
# leaving the last 12 points for testing
train, test = train_test_split(series, test_size=12, shuffle=False)
# stabilizing the variance in the training set
log_train = np.log(train)
# building an arima model, m is the seasonal period (monthly)
mod = auto_arima(log_train, seasonal=True, m=12)
# getting the log forecasts
log_forecasts = mod.predict(12)
# reverting the forecasts
forecasts = np.exp(log_forecasts)
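If you want to try the same three steps without installing pmdarima, here is a dependency-light sketch. It swaps the ARIMA model for a straight line fitted in log space with np.polyfit, which is a deliberately crude stand-in, and it uses a synthetic series rather than the airline data. Both are assumptions made to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(1, 145, dtype=float)
# Synthetic series with exponential growth and multiplicative noise.
series = 100.0 * np.exp(0.01 * t + 0.05 * rng.standard_normal(t.size))

# Leave the last 12 points for testing.
horizon = 12
train, test = series[:-horizon], series[-horizon:]
t_train, t_test = t[:-horizon], t[-horizon:]

# 1. Transform the data to stabilize the variance.
log_train = np.log(train)

# 2. Fit a model on the log scale (here: a straight line, not an ARIMA).
slope, intercept = np.polyfit(t_train, log_train, deg=1)

# 3. Forecast on the log scale, then revert with the exponential.
log_forecasts = intercept + slope * t_test
forecasts = np.exp(log_forecasts)

print(np.round(forecasts, 1))
```

Whatever model you use, it only ever sees the log-scaled data. The exponential is applied once, at the very end.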
Takeaways
- Time series are heteroskedastic if the variance is not constant;
- You can test if a time series is heteroskedastic using statistical tests. These include the White, Breusch-Pagan, or Goldfeld-Quandt tests;
- Use the log transformation to stabilize the variance;
- Don't forget to revert the forecasts to the original scale.
Thanks for reading and see you in the next story!