Put a little magic in your Spark SQL notebooks using the JupyterLab SQL cell magic editor
jupyterlab-sql-editor is an IPython magic that lets you write composable analytics in Spark SQL. This JupyterLab extension has a rich set of features:
- Auto-completion of
– table names, column names, nested sub-fields
– table joins
– functions
- SQL syntax highlighting and formatting
- Multiple display outputs
– html and plain text
– interactive data grid and json tree
– capture of dataframes and views
jupyterlab-sql-editor extends JupyterLab with SQL formatting and syntax highlighting.
jupyterlab-sql-editor also registers an LSP server which provides autocompletion. Press the <tab> key to trigger autocompletion of table and column names, including nested sub-fields. Use the %%sparksql --refresh all
command to update the local cache file used by the autocompleter. The refresh command enumerates the functions, tables and columns found in the current Spark context.
Autocomplete Spark SQL functions with documentation and usage examples.
There are also a few power user features, like auto-filling all column names in the SELECT statement and auto-suggesting JOIN conditions on matching column names.
The sparksql magic is a convenient way of executing Spark SQL statements. In this example we execute a statement and print the results as text.
%%sparksql --output text
SELECT * from range(1, 1000)
We can do the same thing in Python.
statement = 'select * from range(1, 1000)'
df = spark.sql(statement)
df.show()
As you can see it's not really magic. The sparksql magic takes the body of a cell as the statement, executes it and prints the results. The extension simply encapsulates boilerplate Python code and makes it readily available via command line options.
Jupyterlab-sql-editor supports a rich set of output formats such as an interactive data grid. We use Bloomberg's ipydatagrid, which can post-process (sort, filter, and search) the results returned by the Spark query.
A tree widget displays the schema of the results. This is particularly useful for columns containing complex nested data structures.
Use the JSON output to navigate the contents of complex nested columns. This output uses the standard ipython tree widget which can search any of the nested fields.
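As a rough sketch (the grid and json values for --output are assumptions; only the text and skip outputs appear elsewhere in this article), switching renderings is just a matter of changing the option:
%%sparksql --output grid
SELECT * from range(1, 1000)
%%sparksql --output json
SELECT id, named_struct('nested_id', id, 'doubled', id * 2) AS payload
FROM range(1, 1000)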
So far we have mostly seen how to query data, however you are not limited to queries. You can perform any SQL statement supported by Spark SQL.
%%sparksql
CREATE TABLE prod.db.sample (
    id bigint COMMENT 'unique id',
    data string)
USING iceberg

%%sparksql
ALTER TABLE prod.db.sample
ADD COLUMNS (
    new_column string comment 'new_column docs'
)
Use the line magic %sparksql for simple one-liner statements.
%sparksql SET spark.sql.variable.substitute=false
%sparksql DROP TABLE prod.db.sample
%sparksql CALL prod.system.set_current_snapshot('db.sample', 1)
Common table expressions (CTEs) are a standard way of breaking large SQL statements into more manageable pieces. Jupyterlab-sql-editor not only supports CTEs but also lets you capture or alias SQL statements as views which can then be reused in later cells. This mechanism takes advantage of the Spark dataframe's createOrReplaceTempView
function. Use the --view
option to create a temporary view. To prevent immediate execution of the query, use the --output skip
option. When this option is used only the view is created.
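For example, here is a minimal sketch (assuming --view takes the view name; filtered_range and its CTE are made up for illustration) that captures a statement, CTE included, as a view without running it:
%%sparksql --view filtered_range --output skip
WITH big_ids AS (
    SELECT id FROM range(1, 1000) WHERE id > 500
)
SELECT id, id * 2 AS doubled
FROM big_ids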
Autocompletion also works on temporary views. Use the %%sparksql --refresh local
option to update your autocomplete cache file with any local views you have created.
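Continuing the sketch above, the captured view can then be referenced like any other table in a later cell (filtered_range is the hypothetical view created earlier):
%%sparksql
SELECT count(*) AS row_count
FROM filtered_range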
So far we have used SQL views to create composable SQL analytics. However, you are not limited to SQL; you can switch from SQL to dataframes and back. Parts of your analysis might be better suited to the dataframe API while others might lend themselves better to the SQL dialect. Jupyterlab-sql-editor makes it very easy to switch between dialects. Use the --dataframe
option to convert an SQL statement into a dataframe.
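Here is a minimal sketch, assuming --dataframe takes the name of the Python variable to bind (my_df is made up for illustration):
%%sparksql --dataframe my_df
SELECT id, id % 7 AS bucket
FROM range(1, 1000)
In a later cell my_df behaves like any other Spark dataframe, for example my_df.groupBy('bucket').count().show().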
Given any dataframe, switch back to SQL by calling df.createOrReplaceTempView()
. For example you might be using Spark ML, Spark GraphFrames, or simply reading a datasource like CSV. Many of the libraries in Spark create dataframes.
df = spark.read.csv(path)
df.createOrReplaceTempView('my_csv_table')
Then refer to the view you created in %%sparksql:
%%sparksql
SELECT * FROM my_csv_table
Remember to update your local cache by running %%sparksql --refresh local
and enjoy the autocompleter!
jupyterlab-sql-editor has two main dependencies: jupyterlab-lsp and sql-language-server. Installing the extension is very easy.
pip install jupyterlab-lsp jupyterlab-sql-editor
Install the sql-language-server project to provide autocompletion.
sudo npm install -g sql-language-server
For a complete list of configuration options see the detailed installation guide.
In this article we showed how to leverage %%sparksql
to easily write composable analytics in Spark SQL. We focused on Spark, however jupyterlab-sql-editor also includes a %%trino
magic! More SQL engines might be added in the future. Contributions are welcome! Here's our git repo: CybercentreCanada/jupyterlab-sql-editor.
In a follow-up article we'll cover jupyterlab-sql-editor's support for SQL templating using Jinja and dbt.
Thanks to the contributors of these projects:
krassowski/jupyterlab-lsp
joe-re/sql-language-server
zeroturnaround/sql-formatter
cryeo/sparksql-magic
bloomberg/ipydatagrid