Containerising Data Pipeline Components

The last post in this series covered some simple Python code that used Twitter's Tweepy library to fetch tweets matching a query, sentiment-score each tweet, and then load the results into an S3 bucket as .csv files. This post covers containerising that code. To do this we require Docker, and if the end goal is to push the container image up to Docker Hub, a Docker Hub account will also be required. The code for this blog post can be found on GitHub here.

The Anatomy Of Our Simple Dockerfile

Containerising the code is relatively trivial, all that is required is the Python code and a Dockerfile, which looks like this:

FROM python:latest

# uuid ships with Python's standard library, so it does not need installing via pip
RUN pip3 install boto3 pandas nltk tweepy numpy
WORKDIR /app
COPY . /app
RUN chmod a+x *.py
# The exec-form ENTRYPOINT resolves the relative path against WORKDIR;
# this relies on the script beginning with a #!/usr/bin/env python3 shebang
ENTRYPOINT ["./tweets_to_s3_csv.py"]

In essence, this is what the Dockerfile is doing:

1. The image that results from the Dockerfile uses python:latest as its base image

2. The Python packages required by tweets_to_s3_csv.py are installed via pip

3. A working directory from which the code is run is specified (/app)

4. We copy the files from the host directory to /app

5. The necessary permissions are set to make the Python script executable

6. The final step, via the ENTRYPOINT instruction, is to actually run the code
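As an aside, python:latest makes the build a moving target. For repeatable builds, the Dockerfile could pin the base image and package versions; the sketch below is illustrative, the version numbers and the use of a requirements.txt are assumptions rather than what the original repo contains:

```dockerfile
# Pin the base image to a specific, slimmer Python release
FROM python:3.10-slim

WORKDIR /app
# Pinning package versions in requirements.txt makes the image reproducible
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY . .
RUN chmod a+x *.py
ENTRYPOINT ["./tweets_to_s3_csv.py"]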

Creating the actual image is performed by running a docker build command, thus:

docker build . -t <image tag>

e.g.:

docker build . -t tweets_to_s3_csv:1.0

This will result in a Docker image that resides in the local image registry; docker login followed by docker push can then be used to push it to Docker Hub.
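The login, tag and push sequence might look like the following; `<dockerhub_username>` is a placeholder for your own Docker Hub account name:

```shell
# Authenticate against Docker Hub (prompts for credentials)
docker login

# Docker Hub expects images to be tagged with your account namespace
docker tag tweets_to_s3_csv:1.0 <dockerhub_username>/tweets_to_s3_csv:1.0

# Push the re-tagged image to the registry
docker push <dockerhub_username>/tweets_to_s3_csv:1.0
```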

Running The Container

Assuming that the image created has been tagged with tweets_to_s3_csv:1.0, a container for the image can be spun up as follows; substitute the placeholders between the angle brackets with actual values:

docker run -e BEARER_TOKEN=<bearer_token> \
           -e TWITTER_QUERY='<twitter_query>' \
           -e MAX_TABLE_SIZE=10 \
           -e ENDPOINT_URL=<endpoint_URL> \
           -e BUCKET=<bucket_name> \
           -e AWS_ACCESS_KEY_ID=<aws_access_key_id> \
           -e AWS_SECRET_ACCESS_KEY=<aws_secret_access_key> \
           tweets_to_s3_csv:1.0 
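Inside the container, the script can pick these values up via os.environ. The sketch below shows how that might look; the variable names match the docker run command above, but the helper function itself is illustrative, not the actual code from the repo (note that boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment by itself):

```python
import os

def load_config() -> dict:
    """Read the pipeline's settings from environment variables."""
    return {
        # Mandatory settings: a missing variable raises KeyError at startup
        "bearer_token":  os.environ["BEARER_TOKEN"],
        "twitter_query": os.environ["TWITTER_QUERY"],
        "endpoint_url":  os.environ["ENDPOINT_URL"],
        "bucket":        os.environ["BUCKET"],
        # Optional setting with a default of 10 tweets per file
        "max_table_size": int(os.environ.get("MAX_TABLE_SIZE", "10")),
    }
```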

As per the previous post in this series, here are some notes on setting these environment variable values that readers of this post might find useful:

  • A Twitter bearer token is obtained by signing up for a Twitter developer account
  • At the time of writing I have used an on-premises storage device (Pure Storage FlashBlade) for testing; the code should work with any S3-compatible storage device or software-defined storage platform, and I will in due course get around to testing it with AWS S3
  • Refer to Twitter's developer guide for instructions on how to specify queries; because one of the aims of this work is to showcase S3 object virtualisation for SQL Server 2022, '(SQL Server 2022)' is an incredibly simple query that readers of this post might wish to try
  • 10 is a good default value for the maximum number of tweets per file, as specified via the MAX_TABLE_SIZE environment variable
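Since each batch of up to MAX_TABLE_SIZE tweets ends up in its own .csv object, the object names need to be unique; the uuid package in the Dockerfile's pip install hints at how the script does this. Here is a minimal standard-library sketch of the idea; the key format and the make_object_key name are assumptions for illustration, not necessarily what the actual script uses:

```python
import uuid

def make_object_key(prefix: str = "tweets") -> str:
    """Generate a unique S3 object key for one batch of tweets."""
    # uuid4 gives a random 128-bit identifier, so collisions are vanishingly unlikely
    return f"{prefix}_{uuid.uuid4().hex}.csv"
```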

Next Time

The next post in this series will cover running this simple container from within an Argo workflow; more specifically, it will cover what Argo Workflows is, how to install and configure it, and how to build an actual workflow.

