Lambda III: Data Science with Amazon Lambda
After we finished the basic setup in the last post, we are now ready to add a real data science application and fully use the parallelisation of Lambda. We will write a very simple map framework for this.
This is part three of my series on Amazon Lambda for Data Science. To see how this all comes together, we will develop a real-world data science application over the course of this blog post series:
- Introduction: Why you should use Amazon Lambda
- Part 2: Simple map framework for Amazon Lambda using zappa
- Part 3: Data Science with Amazon Lambda
- Part 4: Use the Power of Lambda
You can find the code of this series on my github account (it will be updated as we go on). The version of this post is stored in the tag part-three.
Now that you have seen how to create a website using zappa and Lambda, we (only) have to add some content. For this, we basically have to cover three core points:
- add functionality to read in the data the user sends with their request. We will make our life very easy here and just allow the data to be sent with the request itself as CSV (you could also think about reading it from S3, supporting different formats, etc.)
- add the data science part. Again, we simplify things and do not add a real data science analysis, but just calculate the mean of every column. We will use pandas for the calculations and the data handling. In case you are not familiar with pandas, I really advise you to do some tutorials and read some examples, because this package is one of the core tools in data science.
- do the calculations in parallel, by executing a Lambda function for each "core" we want to have and doing each calculation independently.
All three steps will be handled in this and the next post. So let's start!
Remember to install pandas into your virtual environment with `pip install pandas` if you have not already done so.
Add our data science application - the sequential way
First, we need some data to play with. You can either generate some on your own, e.g. by using this script:
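A possible generation script is sketched below; the number of time series, the number of samples per series, and the normal distribution parameters are assumptions you can change freely:

```python
# Generate random test data: many time series ("id") with samples ("value").
import numpy as np
import pandas as pd


def make_data(n_ids=100, n_samples=50):
    """Return a DataFrame with one row per sample of each time series."""
    return pd.DataFrame({
        # every time series gets a unique id, repeated for each of its samples
        "id": np.repeat(np.arange(n_ids), n_samples),
        # random values; mean and spread are arbitrary choices
        "value": np.random.normal(loc=50, scale=10, size=n_ids * n_samples),
    })


if __name__ == "__main__":
    make_data().to_csv("data.csv", index=False)
```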
or you download some random data from the internet. The data should have at least two columns:
- an id column (not to be confused with the pandas index column!), where every time series has a unique id
- a value column with the actual values of the time series

In other words, the data sample is a long list of values of different time series, partitioned by their id, like so

```
id  value
0   x1
0   x2
..  ..
0   xN
1   y1
..  ..
```
We first add the utility functions for the handling of the user interaction; we need three new functions for this. The request object we import from the flask package has some very convenient functionality for accessing the information passed by the user. The args attribute is a dictionary-like object that includes the parameters passed in the URL, like query=hello in test.com/search?query=hello, and the stream attribute is the payload that is passed as data (we will see later how this can be done).
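As a sketch, reading both could look like the following; the route, the parameter names, and the placeholder calculation are assumptions, not the final code:

```python
import pandas as pd
from flask import Flask, request

app = Flask(__name__)


@app.route("/", methods=["PUT"])
def main():
    # URL parameters like ?id_column=id end up in request.args
    id_column = request.args.get("id_column", "id")
    value_column = request.args.get("value_column", "value")
    # the CSV payload sent as data arrives in request.stream,
    # which pandas can read directly
    df = pd.read_csv(request.stream)
    # placeholder calculation; the real data science part comes next
    result = df.groupby(id_column)[value_column].mean()
    return result.to_csv()
```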
Before adding the methods to our main route, we first add the real data science code. To mock a real-world application, we import some typical libraries (which are, however, unused in our case) and fake some heavy calculation.
As we already discussed, we want to do things in parallel later. For this, we already prepare things now. I will start by giving you all the code and walk through it afterwards:
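The module could look like the following sketch; the function and column names follow the walkthrough below, while the dictionary keys and the faked heavy calculation are assumptions (my_map is only a plain sequential placeholder at this point):

```python
import time

import numpy as np
import pandas as pd


def feature_calculation(chunk):
    """Calculate the "features" of one chunk (one id and its values):
    here just the mean, with a short sleep faking a heavy calculation."""
    chunk_id, values = chunk
    time.sleep(0.001)  # fake some heavy calculation
    return {"id": chunk_id, "result": np.mean(values)}


def my_map(function, chunks, chunksize):
    """Placeholder: a plain sequential map for now. Chunked, parallel
    Lambda invocation will replace this later."""
    return map(function, chunks)


def calculate_result(df, id_column="id", value_column="value"):
    """Main function of the module: group the data by id, run
    feature_calculation on every chunk and collect the results."""
    grouped = df.groupby(id_column)[value_column]
    chunks = [(chunk_id, values) for chunk_id, values in grouped]
    results = list(my_map(feature_calculation, chunks, chunksize=100))
    return pd.DataFrame(results).set_index("id")
```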
I think this part can already be understood without knowing exactly how the my_map function works. The calculate_result function is the main function of the module. We group the data by the id column (whose name can be given by the user) and extract only the value column (again, given by the user). For each chunk of data (the id together with the time series of values belonging to this id), we run the feature_calculation function once, which returns a dictionary with the id and the calculated feature.
If you feed in time series with 1000 ids, feature_calculation will be called 1000 times. The dictionaries are gathered in a list and transformed into a pandas.DataFrame (the second half of the calculate_result function). We can then return the data to the caller.
The feature_calculation function will later run in parallel in another Lambda function, and the my_map function will handle the invocation of these parallel Lambda functions. For now, we only use the normal Python map here, but we already include a "twist": we will not start one Lambda function for every chunk of data later (that would mean starting 1000 parallel Lambda functions, which would give us a huge overhead because of streaming), but rather group several chunks together (depending on the chunk size parameter) and only start a smaller number of parallel Lambdas.
This chunk size handling can already be implemented, even though we are still using a sequential approach. The idea is: we walk through the grouped data chunks (remember, one chunk means all time series data with the same id), stop every chunksize items, wrap these elements up and send them to the calculation Lambda. There, the list of chunks is unwrapped, feature_calculation is called for each chunk in the list, and the wrapped result is sent back to the calling function. In code, this may look something like this:
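The following is a sketch of the chunked my_map; the "wrapping" is just a Python list slice here, since the serialization and the actual Lambda invocation only come later:

```python
def my_map(function, chunks, chunksize):
    """Walk over the chunks, wrap up `chunksize` of them at a time,
    "send" each package to the calculation (still a plain local map
    for now), and unwrap the results again."""
    chunks = list(chunks)
    results = []
    for start in range(0, len(chunks), chunksize):
        # wrap chunksize chunks together into one package
        package = chunks[start:start + chunksize]
        # later: invoke a Lambda with this package; for now, calculate locally
        package_results = list(map(function, package))
        # unwrap the results into one flat list again
        results.extend(package_results)
    return results
```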
As you can see, I am still using the default map function here to run over the chunks; this will change later.
Now we have everything in place to start our first test! Enable the calculation by using all our written tools in the main route. Note that we are now using PUT instead of GET, as we want to transport some data. We can then start our server locally and test it (the server will run on http://localhost:5000). When you visit the server, you will be greeted by the lovely error message
Method Not Allowed
What is wrong? Well, we have changed the method from GET to PUT, and our normal browser will not help us here. So let's write a small script that passes the test data (stored in data.csv on disk) to our web service and writes the result back to the screen.
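A minimal version of such a script could look like this; that the requests package is used and that the column names are passed as URL parameters are both assumptions:

```python
import time


def load_payload(path="data.csv"):
    """Read the test data from disk as one CSV string."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    import requests  # assumption: pip install requests

    url = "http://localhost:5000/"
    start = time.time()
    # send the CSV data as the PUT payload; column names go into the URL
    answer = requests.put(url, data=load_payload(),
                          params={"id_column": "id", "value_column": "value"})
    print(answer.text)
    print("The calculation took {} seconds".format(time.time() - start))
```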
The output should be something like
```
     result
id
0    44.675536
1    45.564449
2    54.431959
3    46.572613
4    47.840048
```

The calculation took 0.363618850708 seconds
Very good! We have a first working implementation of a data science application. We just have to upload it into the Amazon cloud.
Upload our application into the Amazon cloud
After you have finished implementing your framework, you can now upload it into the cloud with a simple zappa update.
However, you will probably run into an error message saying something like
Unzipped size must be smaller than 262144000 bytes
The problem is already described in the error message: the maximum size of your zip, which packs the virtual environment, your flask code and the zappa handler, is 250 MB; the dependencies of pandas, numpy etc., however, make it larger.
The zappa team has already provided a solution for this: just enable the slim_handler in your zappa_settings.json and run zappa update again. This will create two zip files:
- one which only includes the zappa handler and is loaded directly
- a second one with all the rest (your dependencies, your own code), which is stored in S3 and loaded into your Lambda function during the first invocation
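In the zappa_settings.json this could look like the following; everything except the slim_handler flag is just a placeholder from a typical setup, so keep your existing values:

```json
{
    "dev": {
        "app_function": "my_app.app",
        "aws_region": "eu-central-1",
        "s3_bucket": "my-zappa-bucket",
        "slim_handler": true
    }
}
```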
With that, you are ready to go and by changing the URL in your test script, you can now upload your data and get the calculated features in return.
Two things to note:
- The first invocation may be really slow. Remember the cold and warm container discussion from my last post?
- Also the later invocations may not be as fast as on your local machine. This is mainly because of streaming issues. We will discuss performance in a later post.
Python 2 - Python 3
Well, this is a bit embarrassing: if you are using Python 2, you will probably run into the SSL problem mentioned in the previous post: Python 2 can not handle the https connection correctly, so when you change the URL in your test script, the request will fail. But if you are using Python 3, the script itself will fail because of str/bytes issues. Life is hard :-(
Ok, two possibilities here: you use python 3 to invoke your script, but then you need a slightly updated version of it:
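A Python 3 variant could look like this, using only the standard library; the decisive difference is that the payload and the response are bytes and have to be handled explicitly (URL and file name are again only the local-test defaults):

```python
import time
from urllib import request


def build_request(url, csv_path="data.csv"):
    """Build the PUT request; urllib expects the payload as bytes in Python 3."""
    with open(csv_path, "rb") as f:
        payload = f.read()
    return request.Request(url, data=payload, method="PUT")


if __name__ == "__main__":
    req = build_request("http://localhost:5000/")
    start = time.time()
    with request.urlopen(req) as answer:
        # the response body is bytes as well and has to be decoded to str
        print(answer.read().decode("utf-8"))
    print("The calculation took {} seconds".format(time.time() - start))
```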
I have added this script as test3.py to the github repository.
Or you use Python 3.6 right from the beginning, which is supported by Amazon Lambda.
What to do next?
We will finally add the Lambda invocation in the next post, which is quite easy now that we have everything in place. We will discuss some implications of running Lambda functions and do some performance studies later.