Dockerizing your Python applications is an easy way to ensure your Python workloads run reliably. But is Dockerizing all you need to scale your workflows? Building your Python applications into a Docker container gives you several benefits right off the bat: a consistent runtime environment, portability across machines and clouds, and straightforward dependency management.
However, simple containerized deployments lack a few critical capabilities that would make them resilient and scalable. In particular, they lack a consistent approach to observability - the ability to view, diagnose, and resolve potential issues in your workloads.
In this article, I'll walk through the basic steps of Dockerizing a Python application. Then, I'll show how to make your Dockerized workflows even more robust with workflow traceability, observability, logging, and error handling.
In this walkthrough, I'll show the following:
- How to Dockerize a simple Python script that fetches and transforms data from Amazon S3
- How to convert that script into a Prefect flow with minimal code changes
- How to run the Prefect-enabled flow from a Docker container and monitor it in Prefect Cloud
- How to add retries, logging, and other reliability features
To do this walkthrough, you'll need the following installed locally:
- Python 3
- Docker
- The Prefect Python package (a free Prefect Cloud account is also used in the later sections)
You'll also need AWS credentials that boto3 can use to read from S3.
This walkthrough assumes a basic knowledge of Docker and the concept of Dockerization. If you need more background, read the full overview on the Docker website.
Assume you want to fetch a JSON file stored in an Amazon S3 bucket in the following format (sample data borrowed and modified from JSON Editor Online):
```json
[
  { "name": "Chris", "age": 23, "city": "New York" },
  { "name": "Emily", "age": 19, "city": "Atlanta" },
  { "name": "Joe", "age": 32, "city": "New York" },
  { "name": "Kevin", "age": 19, "city": "Atlanta" },
  { "name": "Michelle", "age": 27, "city": "Los Angeles" },
  { "name": "Robert", "age": 45, "city": "Manhattan" },
  { "name": "Sarah", "age": 31, "city": "New York" }
]
```
Assume you don’t care about the individual people - what you want is a rollup that counts the number of people in each city. You also want to normalize this data - e.g., you know that “Manhattan” is a borough of New York, and you want to count it as “New York” instead. In other words, you want something like:
```json
[
  { "city": "New York", "people_count": 4 },
  { "city": "Atlanta", "people_count": 2 },
  { "city": "Los Angeles", "people_count": 1 }
]
```
You can pull this off with a simple Python script:
```python
# Based on sample JSON file from https://jsoneditoronline.org/indepth/datasets/json-file-example/

import boto3, json

city_normalizations = {
    'manhattan': 'New York'
}

# Reads the JSON file from an S3 bucket.
# NOTE: Assumes the S3 object is public.
def get_json_file():
    client = boto3.client('s3')
    response = client.get_object(
        Bucket='jaypublic',
        Key='data.json',
    )

    return json.loads(response['Body'].read())

# Transforms a list of people who live in cities into a count of people per city.
def transform_json_file(json_obj):
    city_counts = {}

    for obj in json_obj:
        key = obj['city']
        if key.lower() in city_normalizations:
            key = city_normalizations[key.lower()]
        city_counts[key] = city_counts.get(key, 0) + 1

    return city_counts

def save_transformed_data(save_dict):
    for key in save_dict:
        print("{}:{}".format(key, save_dict[key]))
    # Save to DB - step omitted for ease of tutorial

if __name__ == '__main__':
    json_ret = get_json_file()
    new_dict = transform_json_file(json_ret)
    save_transformed_data(new_dict)
```
I've divided this up into three method calls, as each represents a distinct part of the workflow:
- get_json_file() extracts the raw JSON data from S3.
- transform_json_file() transforms the list of people into a per-city count, normalizing city names along the way.
- save_transformed_data() saves (here, simply prints) the rolled-up results.
To make this work, you’ll need a way to load your dependencies - in this case, the AWS Boto3 library.
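The commands below install dependencies from a requirements.txt file, which isn't shown elsewhere in this walkthrough; a minimal sketch only needs to list boto3 (the Prefect-enabled version of the script later on also needs prefect when it runs outside the Prefect base image):

```text
# requirements.txt (minimal sketch - pin versions as appropriate)
boto3
```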
To test this, run the following commands to create a virtual environment and run the script like an application:

```bash
python -m venv pythonapp
source pythonapp/bin/activate   # On Windows, use: pythonapp/Scripts/Activate.ps1
pip install -r requirements.txt
python s3.py
```
This will yield the following output:
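With the sample data above, the per-city counts printed to the console should look roughly like this:

```text
New York:4
Atlanta:2
Los Angeles:1
```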
That’s the hard part done. Now, time to wrap this up in a Docker container.
To create a Docker container, you write a Dockerfile and then build it into an image. A Dockerfile starts from a base image, which is usually a lightweight version of a Unix or Windows operating system. It then builds on this base image with a set of commands, each creating a new layer in the image.
Layers are a time-saving feature in Docker. If you modify a Dockerfile, Docker doesn't have to rebuild the entire image; it only rebuilds from the first changed layer onward. Similarly, a container runtime doesn't have to re-download the entire image - only the layers that changed. This makes shipping changes to a Docker container fast and lightweight.
The following Dockerfile builds a Docker image that runs your Python application above:
```dockerfile
FROM python:3.12

RUN mkdir /usr/src/app

COPY s3.py /usr/src/app

COPY requirements.txt /usr/src/app

WORKDIR /usr/src/app

RUN pip install -r requirements.txt

CMD ["python", "./s3.py"]
```
Here's what each line of this Dockerfile is doing:
- FROM python:3.12 starts from the official Python 3.12 base image.
- RUN mkdir /usr/src/app creates a directory for the application code.
- The two COPY lines copy the script and its dependency list into that directory.
- WORKDIR /usr/src/app sets the working directory for the commands that follow.
- RUN pip install -r requirements.txt installs the dependencies (boto3) into the image.
- CMD ["python", "./s3.py"] defines the command the container runs when it starts.
You can build your Dockerfile with the following command:
```bash
docker build -t python-app .
```
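If you want to confirm the build succeeded, listing your local images is a quick sanity check:

```bash
# Show the locally built image tagged python-app
docker images python-app
```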
To run this container, you’ll need to specify AWS access key credentials for boto3. Ideally, you’d store these in a secrets manager of some sort, or use an IAM Role if you’re running your container in AWS. For this example, you can supply them at runtime. Use the docker run command to run your container as follows, adding your own AWS access key and secret key:
```bash
docker run -it -e AWS_ACCESS_KEY_ID=<access-key> -e AWS_SECRET_ACCESS_KEY=<secret-key> python-app
```
Here, the -e arguments define environment variables for the running container. The -it flags attach an interactive terminal to the container so its output appears in your console and you can see that it ran successfully.
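If you'd rather keep credentials off the command line (and out of your shell history), one option is Docker's --env-file flag; the file name below is just an example:

```bash
# aws.env (example name - keep this file out of version control):
#   AWS_ACCESS_KEY_ID=<access-key>
#   AWS_SECRET_ACCESS_KEY=<secret-key>

docker run -it --env-file aws.env python-app
```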
This works! However, you're still missing a lot in terms of truly operationalizing this workflow. This approach has:
- No visibility into whether a given run succeeded or failed, short of watching the console
- No centralized logging you can search after the fact
- No retries or error handling if, say, the S3 fetch fails
- No record of past runs you can use to trace and debug issues
This is where Prefect comes in. Prefect is a workflow orchestration platform that enables monitoring background jobs across your infrastructure. With Prefect, you can encode your workflows as flows divided into tasks. The best part is that you can do this by making only minimal modifications to your Python application.
I’ll break the next part up into two steps. First, I’ll demonstrate how to convert the Python application above into a Prefect workflow and run it locally. Next, I’ll show how to ship this and run it on Prefect within a Docker container.
First, copy your Python application to a new file (e.g., s3-flow.py) and make the changes shown below:
```python
# Based on sample JSON file from https://jsoneditoronline.org/indepth/datasets/json-file-example/

import boto3, json
from prefect import task, flow

city_normalizations = {
    'manhattan': 'New York'
}

# Reads the JSON file from an S3 bucket.
# NOTE: Assumes the S3 object is public.
@task
def get_json_file():
    client = boto3.client('s3')
    response = client.get_object(
        Bucket='jaypublic',
        Key='data.json',
    )

    return json.loads(response['Body'].read())

# Transforms a list of people who live in cities into a count of people per city.
@task
def transform_json_file(json_obj):
    city_counts = {}

    for obj in json_obj:
        key = obj['city']
        if key.lower() in city_normalizations:
            key = city_normalizations[key.lower()]
        city_counts[key] = city_counts.get(key, 0) + 1

    return city_counts

@task
def save_transformed_data(save_dict):
    for key in save_dict:
        print("{}:{}".format(key, save_dict[key]))
    # Save to DB - step omitted for ease of tutorial

@flow
def process_cities_data():
    json_ret = get_json_file()
    new_dict = transform_json_file(json_ret)
    save_transformed_data(new_dict)

if __name__ == '__main__':
    process_cities_data()
```
What did I change here and why? Going step by step:
- I imported task and flow from the prefect package.
- I decorated each of the three functions with @task, which tells Prefect to track each one as a discrete, observable unit of work.
- I wrapped the three calls in a new process_cities_data() function decorated with @flow, turning the whole pipeline into a Prefect flow.
- The __main__ block now simply calls the flow.
To run this from the command line, first make sure you have Prefect set up:
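If you haven't already, install the Prefect package and authenticate against Prefect Cloud (the steps differ slightly if you run a self-hosted Prefect server):

```bash
# Install Prefect into the same virtual environment
pip install -U prefect

# Log in to Prefect Cloud and follow the prompts
prefect cloud login
```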
Finally, run your Python application locally:
```bash
python s3-flow.py
```
On the command line, you can see Prefect run your flow along with each task that makes up that flow. Prefect assigns the run a unique generated name and UUID so that you can trace it easily.
You can dive into the flow and its tasks more easily on the Prefect dashboard. Navigate to https://app.prefect.cloud, select Flow Runs, and then select your flow from the list of recently completed flows.
By selecting this, you can see more details on the flow run, including a visualization of each of the tasks of the flow, how they relate to one another, and how long each task in the flow took to complete. (Not surprisingly, the task containing the remote fetch operation took the longest time.)
Now it’s time to bring it all together. You can package this flow as a Docker container and run it on any container-compatible environment.
To package your flow as a container, make the following changes to the Dockerfile:
```dockerfile
FROM prefecthq/prefect:2-python3.12-conda

RUN mkdir /usr/src/app

COPY s3-flow.py /usr/src/app

COPY requirements.txt /usr/src/app

WORKDIR /usr/src/app

RUN pip install -r requirements.txt

CMD ["python", "./s3-flow.py"]
```
Again, this requires minimal changes to be Prefect-ified. Instead of inheriting from the base Python image, I use the Prefect Docker image, which comes with both Python 3.12 and Prefect pre-installed. I also changed the Dockerfile to copy and run the Prefect-enabled version of the script, s3-flow.py.
After making these changes, build your Docker container image as you normally would:
```bash
docker build -t python-app-prefect .
```
Since this workload will run in a container, you’ll need to supply it credentials so that it can run on Prefect. You can do this by generating a Prefect API key. From your Prefect dashboard, select your account icon (lower left corner) and then select API Keys. Then, select the + button to generate a new key.
Give your key a name, select Create, and copy the key value somewhere safe.
Next, you’ll need the Prefect API URL. You can obtain this for your account by using the command prefect config view.
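For example, assuming a Unix-like shell with grep available and that you're logged in to Prefect Cloud, you can filter the output down to just the value you need:

```bash
# Print only the PREFECT_API_URL setting from your current Prefect profile
prefect config view | grep PREFECT_API_URL
```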
Finally, run your container locally using the following command, replacing the stub values with your own secret values:
```bash
docker run -e PREFECT_API_URL=YOUR_PREFECT_API_URL \
  -e PREFECT_API_KEY=YOUR_API_KEY \
  -e AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY \
  -e AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_KEY \
  python-app-prefect
```
That's it! You've successfully Dockerized a Python application and added observability to your workloads with just a few extra lines of code. You can now run this Docker container anywhere you'd run a containerized workload. For more details, check out the Docker page in our documentation.
Now that your Python application has been Dockerized with Prefect, you can add additional logging and reliability features to the code with minimal work.
For example, what do you do if the attempt to fetch the JSON file from S3 fails? Currently, there's no logic in the code to handle this common occurrence. Using Prefect, you can add retry semantics with a simple change to the task decorator:
```python
# Retry up to 2 more times, waiting 5 seconds between attempts
@task(retries=2, retry_delay_seconds=5)
def get_json_file():
    client = boto3.client('s3')
    response = client.get_object(
        Bucket='jaypublic',
        Key='data-nonexistent.json',  # Deliberately missing key to demonstrate retries
    )

    return json.loads(response['Body'].read())
```
If you run this with a non-existent file (like I do in the code snippet above), Prefect will automatically try and fetch this file three times before giving up. You can see this failure clearly in the Prefect console, along with the last error and the full logs Prefect captured from the running Docker container process.
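Along the same lines, you can route task output through Prefect's run logger instead of print, so messages show up in the Prefect UI alongside the run. Here's a minimal sketch of what that might look like for the save step:

```python
from prefect import task, get_run_logger

@task
def save_transformed_data(save_dict):
    # get_run_logger() returns a logger bound to the current task run,
    # so these messages appear in the Prefect UI as well as the console.
    logger = get_run_logger()
    for key in save_dict:
        logger.info("%s:%s", key, save_dict[key])
    # Save to DB - step omitted for ease of tutorial
```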
You can also take advantage of other reliability and workflow management features, such as Prefect's built-in support for the Pydantic validation library, storing results across tasks, and event-driven workflow execution. Give them a try for yourself and see how Prefect simplifies building reliable, high-quality workflows and monitoring them across your architecture.
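As one small, hypothetical example of the Pydantic integration: if you give your flow typed parameters (including Pydantic models), Prefect validates and coerces the arguments before the flow body runs. The model and field names below are made up for illustration:

```python
from prefect import flow
from pydantic import BaseModel

class CityDataSource(BaseModel):
    # Hypothetical parameters describing where the JSON file lives
    bucket: str = "jaypublic"
    key: str = "data.json"

@flow
def process_cities_data(source: CityDataSource):
    # Prefect validates 'source' against the model before running the flow body,
    # so a malformed payload fails fast with a clear error.
    ...

if __name__ == '__main__':
    # A plain dict is coerced into a CityDataSource instance
    process_cities_data(source={"bucket": "jaypublic", "key": "data.json"})
```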
Prefect makes complex workflows simpler, not harder. Try Prefect Cloud for free for yourself, download our open source package, join our Slack community, or talk to one of our engineers to learn more.