Run your SQL queries distributed and fast, with the lightweight, cross-platform and easy-to-install dask-sql.
if __name__ == "__main__":
    # Create a dask cluster
    from dask.distributed import Client
    client = Client()

    # Load the data with dask
    import dask.datasets
    df = dask.datasets.timeseries()

    # Register the data in a dask-sql context
    from dask_sql import Context
    c = Context()
    c.create_table("timeseries", df)

    # Query the data utilizing your dask cluster with standard SQL
    result = c.sql("SELECT name, SUM(x) FROM timeseries GROUP BY name").compute()
DASK AND DASK-SQL
Query your data with both the dask API and normal SQL syntax in combination, without the need for a database.
USE THE FULL POWER OF YOUR CLUSTER
You write normal SQL - but in the background your queries are distributed over your cluster and leverage its full computation power.
MIX SQL WITH CUSTOM FUNCTIONS
Combine SQL functions with your own Python functions, without any performance drawbacks or rewriting.
As dask-sql utilizes dask, you can use the large variety of computing infrastructures that dask supports: cloud providers, YARN, k8s, batch systems, ...
dask-sql will connect to your dask cluster and translate your SQL queries into dask API calls. A large fraction of the SQL standard is already understood.
You can run dask-sql as a standalone service, e.g. with a docker image, which lets you send SQL queries via a presto-compatible REST interface.
Or integrate dask-sql into your own scripts and mix SQL code with your normal dataframe operations. It works particularly well with the interactive notebook format.
dask-sql helps you to query your data from wherever you want.
$ conda install -c conda-forge dask-sql
# or (needs java pre-installed)
$ pip install -U dask-sql
# or run the SQL server via docker
$ docker run --rm -it -p 8080:8080 nbraun/dask-sql