Scripts are collections of commands that run in sequence when executed. A script can drive interactions between characters in a video game or bring a web server to a desired state; the actors interact with the environment as they perform according to the script. You will often find scripts in large, complex systems, where they are necessary for scaling. Software projects accrue scripts out of necessity: nobody wants to memorize the many incantations for building, testing, and releasing code. Fortunately, the “write once, run everywhere” philosophy is not unique to Java. In this tutorial, we will create a data-processing application in Python that can be run reproducibly using Pipenv and Docker.
Python has been a dominant language in the scientific community, with projects like SciPy and Anaconda providing reproducible environments for data processing and analysis. Python is designed to be general purpose, and the Zen of Python makes it a suitable choice for new and experienced programmers alike. Notebook computing, popularized by IPython and later Jupyter, is a paradigm shift in the way we interact with computers; a notebook is also a script for reproducing a particular experiment or procedure.
The POSIX shell language and CMD batch scripts together run on most computers today, and shell likely runs on even more virtualized systems due to the prevalence of Infrastructure as a Service (IaaS). The success of Amazon Web Services fits the nature of computation today: distributed and heterogeneous. Many large sites offload massive amounts of traffic, computation, and data to servers owned by different companies across the globe. Despite the complexity of modern software, it has never been easier to create robust, reproducible applications using both Python and shell.
**Figure:** A typical computer can support many applications by layering software to handle orthogonal tasks. An application runs in an environment.
## An Adding Machine
While a data-processing application can be involved, we can study the construction of a simple one. An adding machine takes two numbers as options and prints their sum to standard out. The Adder is a Python application that implements an adding machine. The project is flat, and every file is single-purposed. Large projects grow from a similar base, with defined processes and a nested folder structure.
```
adder/
├── README.md
├── Dockerfile
├── Pipfile
├── Pipfile.lock
├── adder.py
└── test_adder.py
```
Before starting a new project, make sure that `pip` is up to date on your system. If it is not, install the upgraded version in the user executable folder:

```shell
pip install --user --upgrade pip
```
We will be using Pipenv to manage the Python environment and dependencies. Pipenv integrates `virtualenv` to create a human-centered development workflow, and it turns out to be an excellent tool for managing autonomous workflows too. It increases the portability of a Python application by isolating dependency management and execution in userland, i.e., applications do not need root privileges.
```shell
$ pip install --user pipenv
```
If there are issues with the `--user` option, check that the `PATH` variable is set correctly.
Create a new project for the Adder.
```shell
$ mkdir adder
$ cd adder
$ pipenv install
```
We will be observing the adding machine through the standard terminal input and output (stdin and stdout, file descriptors 0 and 1). Click is a library for creating command-line interfaces for Python applications. The idea of the command is central to the Click API. It provides a simple way to create interfaces and can take input from arguments, options, and environment variables.
To add a library to the project:

```shell
$ pipenv install click
```
Python and Click can be used to write the Adder implementation.
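The original listing is not reproduced here; the following is a minimal sketch of what `adder.py` might look like, consistent with the annotations below (the option, variable, and function names are assumptions):

```python
#!/usr/bin/env python
"""adder.py: an adding machine with a Click command-line interface."""
import click


@click.command()
@click.option("--port-A", "port_a", type=int, envvar="ADDER_PORT_A",
              help="First addend; falls back to the environment.")
@click.option("--port-B", "port_b", type=int, envvar="ADDER_PORT_B",
              help="Second addend; falls back to the environment.")
def add(port_a, port_b):
    """Add the two inputs and print the sum to stdout."""
    click.echo(port_a + port_b)


if __name__ == "__main__":
    add()
```

The `envvar` parameter lets the same command accept input either as options or from environment variables, which is convenient later when the application runs inside Docker.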
- The shebang line runs the file using Python from the user environment. The script becomes directly executable once the executable bit is set via `chmod +x`.
- Import the Click library to create the Command-Line Interface (CLI).
- The core functionality of an adding machine.
- The `@` notation applies a decorator function, which returns a wrapped version of the function below it.
- Reading configuration from the environment is a convention worth adopting when running applications through Docker.
- The application entry point.
- Printing to stdout is one way to pass data between applications. Files and sockets are also widely used.
- The script runs standalone when the `if __name__ == "__main__":` entry point is defined.
- `click` will read values from the environment when the `envvar` parameter is supplied to an option.
We can now run the application in the wild.
```shell
$ chmod +x adder.py
$ pipenv run ./adder.py --port-A 3 --port-B 4
7
$ ADDER_PORT_A=3 ADDER_PORT_B=4 pipenv run ./adder.py
7
```
Great, everything looks correct at first glance. Because we’re writing software that’s executed more often than it’s read, let’s verify the behavior with a test.
There’s an extensive toolbox to choose from when testing Python software. Here, we want a low-boilerplate framework, so we will use `pytest` to write the tests. We can keep these dependencies separate from the production dependencies by adding the `--dev` option to the install command:

```shell
$ pipenv install --dev pytest
```
Again, here is a breakdown of the anatomy of the code.
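The original test listing is not reproduced here; a sketch of `test_adder.py` consistent with the breakdown below might look like the following. In the project, the command is imported relatively (e.g. `from .adder import add`); a stand-in command is defined inline here so the sketch is self-contained.

```python
"""test_adder.py sketch; names are assumptions. The real file imports the
command from adder.py rather than defining a stand-in inline."""
import click
import pytest
from click.testing import CliRunner


@click.command()
@click.option("--port-A", "port_a", type=int, envvar="ADDER_PORT_A")
@click.option("--port-B", "port_b", type=int, envvar="ADDER_PORT_B")
def add(port_a, port_b):
    # Stand-in for the command normally imported from adder.py.
    click.echo(port_a + port_b)


@pytest.fixture
def runner():
    # A fixture shared across tests; CliRunner invokes Click commands in-process.
    return CliRunner()


def test_add(runner):
    result = runner.invoke(add, ["--port-A", "3", "--port-B", "4"])
    assert result.exit_code == 0
    assert result.output == "7\n"  # note the trailing newline from click.echo
```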
- The `pytest` package forms the basis of the tests.
- `unittest` is an alternative that is included in the standard library.
- The `click` package includes useful testing harnesses for invoking wrapped functions.
- The relative import syntax is used here. Because `__init__.py` is missing, we need to give the interpreter a hint to treat the current folder as a module by using the `-m` switch.
- Fixtures are testing objects that are shared across tests. For example, a static resource can be read from a file and passed as a fixture between testing routines.
- Tests are prefixed with `test_`. Pytest will detect these at runtime.
- The `runner` argument has the same type as the return value of the `runner` fixture.
- Note the newline. One possible improvement is to ignore whitespace or write directly to stdout.
Run the tests using pytest. Remember that the current directory should be treated as a module by using `python -m <command>`.

```shell
pipenv run python -m pytest
```
Pytest generates the test results and prints them to the console. Add the `--junitxml` option to log the results to a file.
Tests verify the correct environment configuration and are indispensable for enabling a reproducible workflow.
Testing is a dark art of its own. If you had to reverse engineer the Adder black box, how would you generate the minimal set of statements needed to validate its hypothesized behavior?
## Running in Docker
Now that the Python adding machine can be run and tested from the shell, we can package the entire environment in an operating-system container. Docker creates application environments that share the host kernel but are isolated from host resources like the file system and the process table. Environment variables are a standard way to set container configuration.
To start your container fleet, drop a `Dockerfile` into the project directory.
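The original Dockerfile is not shown; a minimal sketch that matches the commands used later might look like this (the base image tag and default command are assumptions):

```dockerfile
FROM python:3

# Install pipenv; dependency management stays in userland.
RUN pip install pipenv

WORKDIR /app

# Install locked dependencies first to take advantage of layer caching.
COPY Pipfile Pipfile.lock ./
RUN pipenv sync

# Copy the application and mark it executable.
COPY adder.py ./
RUN chmod +x adder.py

# Default to running the adder; ports are read from the environment.
CMD ["pipenv", "run", "./adder.py"]
```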
These steps should look familiar. The `Dockerfile` is a source of end-to-end system documentation.
The container is managed locally with two commands. To create the Docker image, run `build` in the current directory:

```shell
docker build -t adder:latest .
```
This will generate an image and tag it as `adder:latest`.
The shell is the control interface of the adding machine. The Docker CLI has the `-e` option for setting environment variables.
```shell
$ docker run -it adder:latest \
    pipenv run ./adder.py --port-A 2 --port-B -1
1
$ docker run -e ADDER_PORT_A=-2 -e ADDER_PORT_B=3 -it adder:latest
1
```
With this, the application is fully portable. The repository can be distributed as source or as an image. Pipenv and Docker are potent tools that can improve your workflow and make results easier to reproduce.
```shell
pipenv install                   # Create a Pipfile and virtual environment
pipenv install click             # Install Click as a library
pipenv install --dev pytest      # Install pytest as a development library
pipenv lock                      # Create a Pipfile.lock
pipenv sync                      # Install dependencies from Pipfile.lock
pipenv run python -m pytest      # Run tests like `test_*`
pipenv run python application.py # Run the app in the virtual environment

docker build -t <image-name> .   # Build a Docker image in the current directory
docker run -it <image-name> <shell command> # Run an image interactively with a pseudo-TTY
```