Ideas and Tips for Loading Excel Spreadsheets into Pandas DataFrames | by Wei-Meng Lee | Dec, 2022

December 31, 2022

3

Learn to correctly load your worksheets into Pandas DataFrames

In most information analytics tasks, one of the widespread and common information file codecs that you’ll often encounter is CSV. Nonetheless, individuals from the monetary sector usually take care of one other format — Excel spreadsheets.

Whereas quite a lot of articles on Pandas DataFrame concentrate on loading utilizing CSV recordsdata, on this article I’ll focus my dialogue on Excel. I’ll present you easy methods to load worksheets from Excel into your dataframes, in addition to some suggestions and tips for loading a part of the worksheet onto a dataframe.

For this text, I’ll use two pattern Excel spreadsheets.

Notice: All photographs and screenshots on this article have been created by writer

Associated Readings

Ideas and Tips for Loading Massive CSV Information into Pandas DataFrames — Half 2 (https://towardsdatascience.com/tips-and-tricks-for-loading-large-csv-files-into-pandas-dataframes-part-2-5fc02fc4e3ab)

Ideas and Tips for Loading Massive CSV Information into Pandas DataFrames — Half 1 (https://towardsdatascience.com/tips-and-tricks-for-loading-large-csv-files-into-pandas-dataframes-part-1-fac6e351fe79)

My first pattern Excel spreadsheet (mydata.xls; self-created) accommodates two worksheets:

The information from the 2 worksheets have been obtained from the Flights Delay dataset — https://www.kaggle.com/datasets/usdot/flight-delays. Licensing: CC0: Public Area

Here’s what the airports worksheet seems to be like:

Here’s what the airways worksheet seems to be like:

Observe that the airways worksheet doesn’t comprise a header.

To load the older .xls format Excel file, it is advisable set up the xlrd bundle:

xlrd is a library for studying information and formatting info from Excel recordsdata within the historic .xls format.

pip set up xlrd

Let’s attempt to load the information within the Excel spreadsheet utilizing the read_excel() operate in Pandas:

import pandas as pddf = pd.read_excel('mydata.xls')
df

Right here is the content material of the dataframe:

Discover that solely the primary worksheet is loaded, despite the fact that our Excel spreadsheet has two worksheets.

To load all of the worksheets within the Excel spreadsheet, set the sheet_name parameter to None:

df = pd.read_excel('mydata.xls', 
sheet_name=None)
print(df)

The outcome will not be a Pandas DataFrame anymore; somewhat it’s a dictionary of dataframes:

{'airports':     IATA_CODE                              AIRPORT  
0         ABE  Lehigh Valley Worldwide Airport   
1         ABI             Abilene Regional Airport   
2         ABQ    Albuquerque Worldwide Sunport   
3         ABR            Aberdeen Regional Airport   
4         ABY   Southwest Georgia Regional Airport   
..        ...                                  ...   
317       WRG                     Wrangell Airport   
318       WYS               Westerly State Airport   
319       XNA  Northwest Arkansas Regional Airport   
320       YAK                      Yakutat Airport   
321       YUM           Yuma Worldwide Airport   CITY STATE COUNTRY  LATITUDE  LONGITUDE  
0                         Allentown    PA     USA  40.65236  -75.44040  
1                           Abilene    TX     USA  32.41132  -99.68190  
2                       Albuquerque    NM     USA  35.04022 -106.60919  
3                          Aberdeen    SD     USA  45.44906  -98.42183  
4                            Albany    GA     USA  31.53552  -84.19447  
..                              ...   ...     ...       ...        ...  
317                        Wrangell    AK     USA  56.48433 -132.36982  
318                West Yellowstone    MT     USA  44.68840 -111.11764  
319  Fayetteville/Springdale/Rogers    AR     USA  36.28187  -94.30681  
320                         Yakutat    AK     USA  59.50336 -139.66023  
321                            Yuma    AZ     USA  32.65658 -114.60597  
[322 rows x 7 columns], 'airways':     UA         United Air Strains Inc.
0   AA        American Airways Inc.
1   US               US Airways Inc.
2   F9        Frontier Airways Inc.
3   B6               JetBlue Airways
4   OO         Skywest Airways Inc.
5   AS          Alaska Airways Inc.
6   NK              Spirit Air Strains
7   WN        Southwest Airways Co.
8   DL          Delta Air Strains Inc.
9   EV   Atlantic Southeast Airways
10  HA        Hawaiian Airways Inc.
11  MQ  American Eagle Airways Inc.
12  VX                Virgin America}

The keys within the outcome are the names of the worksheets. The next assertion shows the dataframe loaded for the airports worksheet:

show(df['airports'])

Likewise, the next assertion shows dataframe loaded for the airways worksheet:

show(df['airlines'])

Don’t fear in regards to the headers; I’ll repair it shortly.

If you wish to load particular worksheets, retailer the names of the worksheets you need to load as an inventory and go it to the sheet_name property:

df = pd.read_excel('mydata.xls', 
sheet_name=['airports','airlines'])
print(df)

The above assertion hundreds the airports and airways worksheets and the result’s a dictionary of dataframes.

You may look at the information kind for every of the columns loaded within the airports dataframe utilizing the dtypes attribute:

df['airports'].dtypes

Observe that Pandas will load them in response to the forms of values for every column within the worksheet:

IATA_CODE     object
AIRPORT       object
CITY          object
STATE         object
COUNTRY       object
LATITUDE     float64
LONGITUDE    float64
dtype: object

Particularly, for the airports dataframe, the STATE and COUNTRY columns ought to ideally be represented because the class kind, somewhat than the object kind. That is to cut back the reminiscence footprint of the dataframe. You may carry out the sort conversion throughout loading time utilizing the dtype parameter:

import numpy as npdf = pd.read_excel('mydata.xls', 
sheet_name='airports',
dtype= {
'STATE': 'class',
'COUNTRY':'class'
})
df.dtypes

Within the above assertion, I load the airports worksheet and point out that each the STATE and COUNTRY columns be loaded utilizing the class information kind. You may confirm this after the dataframe has been loaded:

IATA_CODE      object
AIRPORT        object
CITY           object
STATE        class
COUNTRY      class
LATITUDE      float64
LONGITUDE     float64
dtype: object

Moreover loading columns to be of a selected kind, you may also carry out values conversion throughout loading time utilizing the converters parameter. For instance, suppose I need to convert all of the latitude and longitude values to the levels, minutes, and seconds format.

I first outline the operate to do the conversion:

# convert from decimal levels to levels, minutes, seconds
def deg_to_dms(deg):
deg = float(deg)
m, s = divmod(abs(deg)*3600, 60)
d, m = divmod(m, 60)
return int(-d if deg < 0 else d), int(m), s

Then, specify the column(s) you need to convert utilizing the converters parameter:

df = pd.read_excel('mydata.xls', 
sheet_name='airports',
dtype= {
'STATE': 'class',
'COUNTRY':'class'
},
converters={
'LATITUDE': deg_to_dms,
'LONGITUDE': deg_to_dms,
})
df

Right here it the results of the conversion:

Inspecting the information forms of the columns:

df.dtypes

You will note that each the LATITUDE and LONGITUDE columns are of object kind:

IATA_CODE      object
AIRPORT        object
CITY           object
STATE        class
COUNTRY      class
LATITUDE       object
LONGITUDE      object
dtype: object

If you happen to attempt to extract the latitude worth of the primary row:

lat_first_row = df.iloc[0,5]
kind(lat_first_row)

You will note that the worth is of kind tuple:

tuple

Earlier I discussed that the airways worksheet doesn’t have a header. And so if you happen to load it right into a dataframe:

df = pd.read_excel('mydata.xls', 
sheet_name='airways')
df

The result’s that the content material of the primary row can be used because the headers:

Clearly this isn’t acceptable. To repair it, use the header parameter and set it to None:

df = pd.read_excel('mydata.xls', 
sheet_name='airways', 
header=None)
df

The default headers of 0 and 1 will now be used:

You may change the headers by utilizing the names parameter:

df = pd.read_excel('mydata.xls', 
sheet_name='airways', 
header=None,
names=['IATA_CODE','AIRLINE'])
df

The column names are actually set:

If you happen to solely need to load particular columns within the worksheet, you need to use the usecols parameter:

df = pd.read_excel('mydata.xls', 
sheet_name='airports', 
usecols=['IATA_CODE','AIRPORT','CITY','STATE','COUNTRY'])
df

Within the above assertion, I solely need to load the ‘IATA_CODE’, ’AIRPORT’, ’CITY’, ’STATE’, and ’COUNTRY’ columns:

Basically, you might be omitting the LATITUDE and LONGITUDE columns. On this case, it could be simpler to specify what to omit, utilizing a lambda operate:

df = pd.read_excel('mydata.xls', 
sheet_name='airports', 
usecols=lambda column: column not in ['LATITUDE','LONGITUDE'])
df

One other method is to specify the column names to load within the Excel spreadsheet. Hey, keep in mind we’re working with an Excel spreadsheet? The next assertion reveals easy methods to load columns F (LATITUDE) by way of G (LONGITUDE) utilizing the Excel-way:

df = pd.read_excel('mydata.xls', 
sheet_name='airports', 
usecols='F:G')
df

Right here is the outcome:

If you happen to additionally need the IATA_CODE and AIRPORT columns, you are able to do it simply, like this!

df = pd.read_excel('mydata.xls', 
sheet_name='airports', 
usecols='A,B,F:G')
df

The outcome now accommodates 4 columns:

To date the primary Excel spreahsheet is fairly easy (I’d say it’s too sanitized for the actual world). The second dataset that I’ll use is extra sensible and it may be downloaded from https://information.world/davelaff/restaurant-inventory-log.

This can be a restaurant stock workbook. The information is actual however is much from present. Costs listed should not precise costs. LICENSE: Public Area

This Excel spreadsheet accommodates two worksheets:

Stock Worksheets
Subcategory Stock Worth

Here’s what the Stock Worksheets seems to be like:

And right here is the Subcategory Stock Worth worksheet:

To load the newer .xlsx Excel file format, it is advisable set up the openpyxl bundle:

pip set up openpyxl

openpyxl is a Python library to learn/write Excel 2010 xlsx/xlsm recordsdata.

Let’s load the primary worksheet:

df = pd.read_excel('RESTAURANT INVENTORY LOG.xlsx', 
sheet_name='Stock Worksheets')df

Right here’s the output:

Discover the Unnamed:0 column? Additionally, discover the underside few rows containing NaNs?

If you happen to look at the Excel spreadsheet, you’ll discover that the precise information begins with column B, and that there are some empty cells on the backside of the worksheet:

To get Pandas to load the dataframe accurately, it is advisable specify the columns that you simply need to load in addition to the variety of rows to load. Right here, I’ll use the usecols and nrows parameters to try this:

df = pd.read_excel('RESTAURANT INVENTORY LOG.xlsx', 
sheet_name='Stock Worksheets',
usecols="B:Q", 
nrows=797)
df

The outcome now seems to be a lot better:

For the second worksheet, assume that the precise information that I need to load is in the course of the worksheet (I’ve highlighted in pink). That can assist you visualize the rows to load, I’ve indicated the rows that it is advisable skip loading if you wish to load the information highlighted in pink:

To load all of the rows beneath the FOOD part, first create an inventory to retailer the rows to skip:

rows_to_skip = listing(vary(0,5))
rows_to_skip.append(6)

The above will generate an inventory of values — [0,1,2,3,4,6]. Then , use the skiprows, usecols, and nrows parameters to load the information from the precise places within the worksheet:

df = pd.read_excel('RESTAURANT INVENTORY LOG.xlsx', 
sheet_name='Subcategory Stock Worth',
skiprows=rows_to_skip,
usecols="B:AA", 
nrows=8)
df

The next reveals the dataframe containing the rows and columns beneath the FOOD part:

Loading Pandas DataFrames from Excel recordsdata is sort of much like loading from CSV recordsdata. The important thing variations are:

An Excel file might comprise a number of worksheets, and it is advisable point out which worksheet you need to load.
Loading from Excel offers the flexibleness to specify columns through Excel column names. This makes it straightforward so that you can establish the columns to load the information from.
As the information in an Excel worksheet is probably not populated in strictly tabular format, fairly often it is advisable specify the variety of rows to load, in addition to the variety of rows to skip when loading.

Normally, when loading from Excel recordsdata, it is advisable pay additional care to the format of the information within the worksheet. Have enjoyable!

Previous article7 Methods To Ship Giant Video Recordsdata From Android (2022)

Next article35 Greatest New Fonts for 2023

Ideas and Tips for Loading Excel Spreadsheets into Pandas DataFrames | by Wei-Meng Lee | Dec, 2022

Learn to correctly load your worksheets into Pandas DataFrames

Associated Readings

On AI Artwork. Is AI artwork, artwork? Sure. Must you promote AI… | by Eugen Lindwurm | Dec, 2022

It is for Everybody, however No person Makes use of it

How is Linear Algebra Utilized for Machine Studying? | by Destin Gong | Dec, 2022

LEAVE A REPLY Cancel reply

Most Popular

On AI Artwork. Is AI artwork, artwork? Sure. Must you promote AI… | by Eugen Lindwurm | Dec, 2022

Person Profile Service Failed The Signal-In Concern In Home windows

It is for Everybody, however No person Makes use of it

Tips on how to convey your mission imaginative and prescient simply with a mission constitution

Recent Comments

ABOUT US

POPULAR POSTS

On AI Artwork. Is AI artwork, artwork? Sure. Must you promote AI… | by Eugen Lindwurm | Dec, 2022

Person Profile Service Failed The Signal-In Concern In Home windows

It is for Everybody, however No person Makes use of it

POPULAR CATEGORY