Monday, December 19, 2022

How to Detect Heteroskedasticity in Time Series | by Vitor Cerqueira | Dec, 2022


Photo by Jannes Glas on Unsplash

A time series is heteroskedastic if its variance changes over time. Otherwise, the data set is homoskedastic.

Heteroskedasticity affects the modeling of time series. So, it is important to detect and deal with this condition.

Let’s start with a visual example.

Figure 1 below shows the popular airline passengers time series. You can see that the variation is different across the series. The variance is higher in the latter part of the series, which is also where the level of the data is higher.

Figure 1: Monthly passengers in an airline. The data set is publicly available in the pmdarima Python library. Image by author.
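One rough way to put a number on what the plot shows is a rolling standard deviation. The sketch below uses a synthetic monthly series, not the airline data, in which the noise scales with the level. The series, the 12-observation window, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a series whose variability grows with its level
# (an assumption for illustration; this is not the airline data set).
rng = np.random.default_rng(42)
n = 144
level = np.linspace(100, 500, n)
series = pd.Series(level * (1 + 0.1 * rng.standard_normal(n)))

# Rolling standard deviation over a 12-observation window.
rolling_std = series.rolling(window=12).std()

early_std = rolling_std.iloc[11:72].mean()  # windows ending in the first half
late_std = rolling_std.iloc[72:].mean()     # windows ending in the second half
print(f"early: {early_std:.1f}, late: {late_std:.1f}")
```

If the rolling standard deviation drifts upward with the level, as it does here, the series is a candidate for the formal tests.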

Changes in variance are problematic for forecasting. They affect the fitting of an adequate model, thereby affecting forecasting performance.

But visual inspection alone is not practical. How can you detect and deal with heteroskedasticity in a more systematic way?

You can check whether a time series is heteroskedastic using statistical tests. These include the following:

  • White test;
  • Breusch-Pagan test;
  • Goldfeld-Quandt test.

The main input to these tests is the residuals of a regression model (e.g. ordinary least squares). The null hypothesis is that the residuals are distributed with equal variance. If the p-value is smaller than the significance level, we reject this hypothesis. This means that the time series is heteroskedastic. The significance level is often set to a value up to 0.05.
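To make the mechanics concrete, here is a minimal sketch of the idea behind one of these tests, the studentized (Koenker) form of Breusch-Pagan, written with NumPy only. It assumes a linear trend model of the value on time. The function name breusch_pagan_pvalue and the synthetic data are hypothetical, and the sketch is not meant as a replacement for a library implementation.

```python
import math
import numpy as np

def breusch_pagan_pvalue(y: np.ndarray) -> float:
    """p-value of a studentized Breusch-Pagan test for the model y ~ time."""
    n = len(y)
    time = np.arange(1, n + 1, dtype=float)
    X = np.column_stack([np.ones(n), time])

    # Step 1: fit the regression and keep the residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    # Step 2: regress the squared residuals on the same regressors.
    sq = resid ** 2
    gamma, *_ = np.linalg.lstsq(X, sq, rcond=None)
    fitted = X @ gamma
    r_squared = 1 - np.sum((sq - fitted) ** 2) / np.sum((sq - sq.mean()) ** 2)

    # Step 3: under the null, LM = n * R^2 follows a chi-squared distribution
    # with 1 degree of freedom (one regressor besides the intercept).
    lm = max(n * r_squared, 0.0)
    return math.erfc(math.sqrt(lm / 2))  # survival function of chi2(1)

# Synthetic data (assumption): same trend, different noise behaviour.
rng = np.random.default_rng(0)
t = np.arange(1, 201)
noise = rng.standard_normal(200)
p_hetero = breusch_pagan_pvalue(0.5 * t + noise * 0.05 * t)  # noise grows with time
p_homo = breusch_pagan_pvalue(0.5 * t + noise)               # constant noise
print(p_hetero, p_homo)
```

A small p-value for the first series says that the squared residuals are explained by time, which is exactly what unequal variance looks like to this test.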

The statsmodels Python library has an implementation of the three tests above. Here’s a snippet that wraps these in a single class:

import pandas as pd
import statsmodels.stats.api as sms
from statsmodels.formula.api import ols

TEST_NAMES = ['White', 'Breusch-Pagan', 'Goldfeld-Quandt']
FORMULA = 'value ~ time'


class Heteroskedasticity:

    @staticmethod
    def het_tests(series: pd.Series, test: str) -> float:
        """
        Testing for heteroskedasticity

        :param series: Univariate time series as pd.Series
        :param test: String denoting the test. One of 'White', 'Goldfeld-Quandt', or 'Breusch-Pagan'

        :return: p-value as a float.

        If the p-value is high, we fail to reject the null hypothesis that the data is homoskedastic
        """
        assert test in TEST_NAMES, 'Unknown test'

        # build a data frame with a time index (1, 2, ...) as the explanatory variable
        series = series.reset_index(drop=True).reset_index()
        series.columns = ['time', 'value']
        series['time'] += 1

        # fit value ~ time with OLS; the tests below use its residuals
        olsr = ols(FORMULA, series).fit()

        if test == 'White':
            _, p_value, _, _ = sms.het_white(olsr.resid, olsr.model.exog)
        elif test == 'Goldfeld-Quandt':
            _, p_value, _ = sms.het_goldfeldquandt(olsr.resid, olsr.model.exog, alternative='two-sided')
        else:
            _, p_value, _, _ = sms.het_breuschpagan(olsr.resid, olsr.model.exog)

        return p_value

    @classmethod
    def run_all_tests(cls, series: pd.Series):

        test_results = {k: cls.het_tests(series, k) for k in TEST_NAMES}

        return test_results

The class Heteroskedasticity contains two functions. The function het_tests applies a specific test (White, Breusch-Pagan, or Goldfeld-Quandt). The function run_all_tests applies all three tests in one go. The output of these functions is the p-value of the corresponding test.

Here’s how you can apply this code to the time series in Figure 1.

from pmdarima.datasets import load_airpassengers

# https://github.com/vcerqueira/blog/blob/main/src/heteroskedasticity.py
from src.heteroskedasticity import Heteroskedasticity

series = load_airpassengers(True)

test_results = Heteroskedasticity.run_all_tests(series)

# {'Breusch-Pagan': 4.55e-07,
# 'Goldfeld-Quandt': 8.81e-13,
# 'White': 4.34e-07}

The p-value of all tests is close to zero. So, you can reject the null hypothesis. These tests give compelling evidence for the presence of heteroskedasticity.

Here’s the distribution of the residuals in the first and second half of the time series:

Figure 2: Distribution of the residuals in the first and second half of the time series. Image by author.

The distribution of residuals is different in these two parts. The Goldfeld-Quandt test uses this type of split to test for heteroskedasticity. It checks if the variance of the residuals is different in two data subsamples.
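The split-and-compare idea can be sketched in a few lines of NumPy. The function below fits a linear trend on each half and returns the ratio of the two residual variances, which is the statistic the Goldfeld-Quandt test compares against an F distribution. The function name goldfeld_quandt_stat and the synthetic series are hypothetical, and the p-value step is left out.

```python
import numpy as np

def goldfeld_quandt_stat(y: np.ndarray) -> float:
    """Residual variance of the second half divided by that of the first half."""
    n = len(y)
    half = n // 2
    time = np.arange(1, n + 1, dtype=float)

    def resid_var(t_part: np.ndarray, y_part: np.ndarray) -> float:
        # Fit a straight line on this subsample and measure what is left over.
        slope, intercept = np.polyfit(t_part, y_part, deg=1)
        resid = y_part - (intercept + slope * t_part)
        return float(np.sum(resid ** 2) / (len(y_part) - 2))

    return resid_var(time[half:], y[half:]) / resid_var(time[:half], y[:half])

# Synthetic data (assumption): same trend, different noise behaviour.
rng = np.random.default_rng(1)
t = np.arange(1, 201)
noise = rng.standard_normal(200)
stat_hetero = goldfeld_quandt_stat(0.5 * t + noise * 0.05 * t)  # noise grows with time
stat_homo = goldfeld_quandt_stat(0.5 * t + noise)               # constant noise
print(stat_hetero, stat_homo)
```

A ratio far from 1 points to different variances in the two subsamples, while a ratio near 1 is what constant variance would produce.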

A common remedy for heteroskedasticity in time series is to transform the data. Taking the logarithm of the time series helps to stabilize its variability.
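One way to see why the logarithm helps is to look at a series whose noise is multiplicative: the log turns level times noise into level plus noise. The sketch below uses synthetic data under that assumption. The helper half_spreads is hypothetical and measures the spread of the first differences in each half of the series.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
t = np.arange(n)

# Multiplicative model (an assumption): exponential growth times log-normal noise.
y = np.exp(0.02 * t + 0.1 * rng.standard_normal(n))

def half_spreads(x: np.ndarray) -> tuple[float, float]:
    """Standard deviation of the first differences in each half of the series."""
    diffs = np.diff(x)
    mid = len(diffs) // 2
    return float(diffs[:mid].std(ddof=1)), float(diffs[mid:].std(ddof=1))

raw_first, raw_second = half_spreads(y)
log_first, log_second = half_spreads(np.log(y))

print(f"raw ratio: {raw_second / raw_first:.2f}")  # spread grows with the level
print(f"log ratio: {log_second / log_first:.2f}")  # spread is roughly constant
```

If the noise is additive rather than multiplicative, the log will not have this effect, so it is worth re-running the tests after transforming.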

Here’s the same time series as before but log-scaled:

Figure 3: Like Figure 1, but log-scaled. Here, the time series shows a stable variance over time.

This time, the variability looks stable along the series. Let’s re-run the tests using the log-scaled time series:

import numpy as np

test_results = Heteroskedasticity.run_all_tests(np.log(series))

# {'Breusch-Pagan': 0.033,
# 'Goldfeld-Quandt': 0.18,
# 'White': 0.10}

The p-values are higher this time. Only one of the tests (Breusch-Pagan) rejects the hypothesis of constant variance. This is assuming a significance level of 0.05.
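The decision rule can be written down explicitly. The sketch below applies it to the p-values reported above for the original and the log-scaled series. The helper rejected_tests is a hypothetical name introduced for illustration.

```python
ALPHA = 0.05  # significance level used in the text

# p-values reported above for the original and the log-scaled series.
original = {'Breusch-Pagan': 4.55e-07, 'Goldfeld-Quandt': 8.81e-13, 'White': 4.34e-07}
log_scaled = {'Breusch-Pagan': 0.033, 'Goldfeld-Quandt': 0.18, 'White': 0.10}

def rejected_tests(p_values: dict, alpha: float = ALPHA) -> list:
    """Names of the tests whose p-value falls below the significance level."""
    return [name for name, p in p_values.items() if p < alpha]

print(rejected_tests(original))    # ['Breusch-Pagan', 'Goldfeld-Quandt', 'White']
print(rejected_tests(log_scaled))  # ['Breusch-Pagan']
```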

Reverting the log transformation

Suppose you’re making predictions using log-transformed data. In that case, you need to revert the predictions to the original scale. This is done with the inverse of the transformation. In the case of the log, you should use the exponential.

So, the steps of the forecasting process are the following:

  1. Transform the data to stabilize the variance;
  2. Fit a forecasting model;
  3. Get the forecasts, and revert them to the original scale.

Here’s an example.

import numpy as np
from pmdarima.datasets import load_airpassengers
from pmdarima.arima import auto_arima
from sklearn.model_selection import train_test_split

series = load_airpassengers(True)

# leaving the last 12 points for testing
train, test = train_test_split(series, test_size=12, shuffle=False)
# stabilizing the variance in the train
log_train = np.log(train)

# building an arima model, m is the seasonal period (monthly)
mod = auto_arima(log_train, seasonal=True, m=12)

# getting the log forecasts
log_forecasts = mod.predict(12)

# reverting the forecasts
forecasts = np.exp(log_forecasts)

Figure 4: Reverting the forecasts to the original scale after modeling the transformed series. Image by author.
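The same three steps can be sketched without pmdarima, which makes the round trip between scales easy to inspect. The example below swaps the ARIMA model for a straight line fitted in log space and uses a synthetic series; both are assumptions made for illustration, not the method used for Figure 4. It also compares the reverted forecasts with the held-out points on the original scale.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 120
t = np.arange(n, dtype=float)

# Synthetic series with multiplicative noise (an assumption for illustration).
series = np.exp(4.0 + 0.01 * t + 0.05 * rng.standard_normal(n))

# Hold out the last 12 points for testing.
train, test = series[:-12], series[-12:]
t_train, t_test = t[:-12], t[-12:]

# 1. Transform the data to stabilize the variance.
log_train = np.log(train)

# 2. Fit a forecasting model (here: a straight line in log space).
slope, intercept = np.polyfit(t_train, log_train, deg=1)

# 3. Forecast in log space, then revert to the original scale.
log_forecasts = intercept + slope * t_test
forecasts = np.exp(log_forecasts)

# Compare against the held-out data on the original scale.
mae = np.mean(np.abs(forecasts - test))
print(f"MAE on the original scale: {mae:.2f}")
```

Computing the error after the exponential keeps it in the units of the original data, which is usually what matters to whoever uses the forecasts.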
  • Time series are heteroskedastic if the variance is not constant;
  • You can test if a time series is heteroskedastic using statistical tests. These include the White, Breusch-Pagan, and Goldfeld-Quandt tests;
  • Use the log transformation to stabilize the variance;
  • Don’t forget to revert the forecasts to the original scale.

Thanks for reading and see you in the next story!
