From geospatial data to a pandas dataframe for time series analysis
Time series analysis of geospatial data allows us to study and understand how events and attributes of a place change over time. Its use cases are wide ranging, particularly in social, demographic, environmental and meteorological/climate studies. In environmental sciences, for example, time series analysis helps analyze how the land cover/land use of an area changes over time and what drives those changes. It is also useful in meteorological studies for understanding spatio-temporal changes in weather patterns (I’ll shortly demonstrate one such case study using rainfall data). Social and economic sciences benefit greatly from such analysis in understanding the dynamics of temporal and spatial phenomena such as demographic, economic and political patterns.
Spatial representation of data is quite powerful. However, it can be a challenging task to analyze geospatial data and extract interesting insights, especially for a data scientist/analyst who is not trained in geographic information science. Fortunately, there are tools to simplify this process, and that’s what I’m attempting in this article. I wrote my previous article on some of the basics of geospatial data wrangling; feel free to check that out:
In this article I’ll go through a series of steps: downloading raster data, transferring the data into a pandas dataframe, and setting it up for conventional time series analysis tasks.
Data source
For this case study I’m using the spatial distribution of rainfall in Hokkaido prefecture, Japan, from 01 January to 31 December 2020, which covers all 366 days of the year. I downloaded the data from ClimateSERV, an open-access spatial data platform that is a product of a joint NASA/USAID partnership. Anyone with internet access can easily download the data. I’ve uploaded them to GitHub along with the code if you want to follow along. Here’s a snapshot of some of the raster images in my local directory:
Setup
First, I set up a folder where the raster dataset is stored so I can loop through the files later on.
# specify folder path for raster dataset
tsFolderPath = './data/hokkaido/'
Next, I’m importing a few libraries, most of which will be familiar to data scientists. To work with raster data I’m using the rasterio library.
# import libraries
import os
import rasterio
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Visualize information
Let’s check what the raster images look like in a plot. I’ll first load a random image using rasterio and then plot it using matplotlib functionality.
# load in raster data
rf = rasterio.open('./data/hokkaido/20201101.tif')

# plot the first band with a colorbar
fig, ax = plt.subplots(figsize=(15,5))
im = ax.imshow(rf.read()[0], cmap = 'inferno')
fig.colorbar(im, ax=ax)
plt.axis('off')
plt.title('Daily rainfall Jan-Dec 2020, Hokkaido, Japan');
As you can see, this image is a grid of pixels, and the value of each pixel represents rainfall for that particular location. Brighter pixels have higher rainfall values. In the next section I’m going to extract these values and transfer them into a pandas dataframe.
Extract data from raster files
Now on to the key step: extracting pixel values for each of the 366 raster images. The process is simple: we will loop through each image, read its pixel values and store their mean in a list.
We’ll separately keep track of dates in another list. Where are we getting the date information? If you take a closer look at the file names, you’ll notice they are named after each respective day.
# create empty lists to store data
date = []
rainfall_mm = []

# loop through each raster
for file in os.listdir(tsFolderPath):
    # read the file
    rf = rasterio.open(tsFolderPath + file)
    # convert raster data to an array
    array = rf.read(1)
    # store data in the lists: the date from the file name (minus '.tif')
    # and the mean of the non-negative pixels
    date.append(file[:-4])
    rainfall_mm.append(array[array>=0].mean())
Note that it didn’t take long to loop through 366 rasters because of the low image resolution (i.e. large pixel size). However, it can be computationally intensive for high-resolution datasets.
So we just created two lists: one stores the dates from the file names and the other holds the rainfall data. Here are the first five items of the two lists:
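To make the `array[array>=0].mean()` step a bit more concrete, here is a minimal sketch on a synthetic array. I'm assuming here that negative values mark no-data pixels (that is what the filter suggests, but check your own dataset); the helper name `mean_valid_rainfall` and the -9999 placeholder are just for illustration.

```python
import numpy as np

def mean_valid_rainfall(band, nodata_below=0.0):
    """Mean of the pixels >= nodata_below; NaN if no pixel qualifies."""
    valid = band[band >= nodata_below]
    # Guard against an empty selection, which would otherwise emit a warning
    return float(valid.mean()) if valid.size else float("nan")

# Synthetic 3x3 "raster"; -9999 stands in for no-data pixels (assumed value)
band = np.array([[-9999.0, 2.0, 4.0],
                 [0.0, 6.0, -9999.0],
                 [8.0, -9999.0, 10.0]])

print(mean_valid_rainfall(band))  # mean of 2, 4, 0, 6, 8, 10 -> 5.0
```

The only difference from the one-liner in the loop is the guard for rasters with no valid pixels at all.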
print(date[:5])
print(rainfall_mm[:5])

>> ['20200904', '20200910', '20200723', '20200509', '20200521']
>> [4.4631577, 6.95278, 3.4205956, 1.7203209, 0.45923564]
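The dates above come out in no particular order because `os.listdir` does not guarantee any ordering, and it will also return any non-raster file that happens to sit in the folder. If that matters for your workflow, one option is a small helper like the sketch below. The function name `list_rasters_by_date` is hypothetical, and it assumes the files are named `YYYYMMDD.tif` as in this case study.

```python
import os
import tempfile
from datetime import datetime

def list_rasters_by_date(folder, ext=".tif"):
    """Return (date, path) pairs for files named YYYYMMDD<ext>, sorted by date."""
    pairs = []
    for name in os.listdir(folder):
        stem, suffix = os.path.splitext(name)
        if suffix.lower() != ext:
            continue  # skip anything that is not a raster
        try:
            day = datetime.strptime(stem, "%Y%m%d")
        except ValueError:
            continue  # skip rasters whose name is not a date
        pairs.append((day, os.path.join(folder, name)))
    return sorted(pairs)

# Usage on a throwaway folder with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["20200904.tif", "20200101.tif", "readme.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = list_rasters_by_date(tmp)

print([d.strftime("%Y-%m-%d") for d, _ in found])  # ['2020-01-01', '2020-09-04']
```

In this article the ordering is fixed later by sorting the dataframe, so this is optional.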
Next, on to transferring the lists into a pandas dataframe. We’ll take an extra step from there to turn the dataframe into a time series object.
Convert to a time series dataframe
Transferring lists to a dataframe format is an easy task in pandas:
# convert lists to a dataframe
df = pd.DataFrame(zip(date, rainfall_mm), columns = ['date', 'rainfall_mm'])
df.head()
We now have a pandas dataframe, but notice that the ‘date’ column holds its values as strings; pandas doesn’t know yet that they represent dates. So we need to tweak it a little bit:
# convert the 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])
df.head()
df['date'].info()
Now the ‘date’ column holds datetime values.
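If you prefer not to rely on pandas inferring the date format, you can pass it explicitly. The sketch below uses a few of the date strings printed earlier; the `format` and `errors` arguments are standard `pd.to_datetime` parameters, while the variable names and the invalid example are only illustrative.

```python
import pandas as pd

# A few date strings in the same YYYYMMDD form as the raster file names
dates = pd.Series(["20200904", "20200910", "20200723"])

# Being explicit about the format avoids any ambiguity in parsing
parsed = pd.to_datetime(dates, format="%Y%m%d")
print(parsed)

# With errors="coerce", strings that do not match become NaT instead of raising
bad = pd.to_datetime(pd.Series(["20201345"]), format="%Y%m%d", errors="coerce")
print(bad.isna().all())
```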
It is also a good idea to set the date column as the index. This facilitates slicing and filtering data by different dates and date ranges and makes plotting tasks easy. We’ll first sort the dates into the correct order and then set the column as the index.
df = df.sort_values('date')
df.set_index('date', inplace=True)
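To show what a datetime index buys you, here is a short sketch of slicing and filtering. It runs on a synthetic dataframe that only mimics the shape of the one built above (one row per day of 2020, a `rainfall_mm` column); the values are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one value per day of 2020, indexed by date
idx = pd.date_range("2020-01-01", "2020-12-31", freq="D", name="date")
df = pd.DataFrame({"rainfall_mm": np.arange(len(idx), dtype=float)}, index=idx)

june = df.loc["2020-06"]                     # partial-string indexing: all of June
summer = df.loc["2020-06-01":"2020-08-31"]   # date-range slice, both ends inclusive
wet_days = df[df["rainfall_mm"] > 360]       # ordinary boolean filter on values

print(len(df), len(june), len(summer), len(wet_days))  # 366 30 92 5
```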
Okay, all processing done. You are now ready to use this time series data however you wish. I’ll just plot the data to see how it looks.
# plot
df.plot(figsize=(12,3), grid =True);
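As a hint of what you could do next, the sketch below computes a 7-day rolling mean and monthly totals. Again it uses synthetic data (a constant 1 mm per day) rather than the real rainfall values, and the variable names are my own. Keep in mind that the real column holds an area-averaged value per day, so a monthly sum there is a sum of daily means.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 1 mm of rainfall on every day of 2020
idx = pd.date_range("2020-01-01", "2020-12-31", freq="D", name="date")
df = pd.DataFrame({"rainfall_mm": np.ones(len(idx))}, index=idx)

# 7-day rolling mean smooths out day-to-day noise (first 6 values are NaN)
weekly_mean = df["rainfall_mm"].rolling(window=7).mean()

# Monthly totals, grouped by the month number of the index
monthly_total = df["rainfall_mm"].groupby(df.index.month).sum()

print(monthly_total.loc[2])  # February 2020 has 29 days -> 29.0
```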
Beautiful plot! I wrote a few articles in the past on how to analyze time series data; here’s one:
Extracting interesting and actionable insights from geospatial time series data can be very powerful because it shows data in both spatial and temporal dimensions. However, for data scientists without training in geospatial information this can be a daunting task. In this article I demonstrated with a case study how this difficult task can be done easily with minimal effort. The data and code are available on my GitHub if you want to replicate this exercise or take it to the next level.
Thanks for reading. Feel free to subscribe to get notified of my forthcoming articles on Medium, or simply connect with me via LinkedIn or Twitter. See you next time!