02/02/2019

CB4: Dependency Management

This article is part of the CodeBoosting series, where we teach scientists how to make their code better and more shareable.


The code is now in good shape. You use functions to convey the intention of the code to the reader. Each part of the code has a single responsibility, making it simple to add or change functionality. You also wrote a small suite of tests that helps you debug your code in a structured way and gives you confidence in the whole codebase.

Now your code needs to run on a different machine: maybe that of a colleague who wants to validate some other hypothesis, or you have just brought a new collaborator into your project, or your analysis or simulation has grown too big and you need a more powerful machine to complete it in a reasonable amount of time.

This is already a scientific success, congrats!

But now you move your code to another machine, you start it, and it doesn’t work! At least a few hours wasted, likely several emails back and forth, a loss of hard-earned confidence in your coding abilities, and a new collaboration that starts off on the wrong foot.

First of all, don’t worry: writing portable code is one of the most difficult things to do, but it is often necessary.

In this article we will show how to deal with managing dependencies and, more importantly, a few strategies to test and ensure that the code you are writing is actually portable.

We will assume the use of an interpreted language, in particular Python or R, and suggest best practices for each. Of course, the same concepts can be applied to any language.

Let’s get started!

Managing Dependencies

Libraries are essential to developing software quickly and efficiently, especially scientific software. Indeed, a lot of effort has been devoted to creating solid, complete, tested, efficient, open-source libraries, and you should definitely take advantage of them. However, those dependencies must be installed on every machine that runs your code.

Manual Approach

The first, trivial, approach is manual. Dependencies are often declared at the top of each file; hence, before sharing your code, you could just open each file, note which external dependencies it requires, and simply install all of them on the new machine.
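
For example, in a Python file the external dependencies show up as import statements at the top, so collecting them manually means scanning lines like these:

import numpy as np
import pandas as pd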

While this approach might work for a project with few, very stable external dependencies, it is doomed to fail in the long term. The interface of your dependencies may change with time: some functions may be removed, others may be added, and some may change to take different arguments. New versions of the libraries are released over time, while you rarely update the libraries that you use every day.

If you now install the latest version of all your dependencies, it is likely that your code won’t work flawlessly.

Hence it is not enough to annotate only the names of the libraries: you also need their versions and, to be on the safe side, the names and versions of the indirect dependencies (the dependencies of your dependencies, the libraries used by your libraries).

Moreover, even the version may not be enough. A library could be published twice under the same version number, even with breaking changes. Or you may need the very latest updates of a library, updates that have not yet been released under a version number.

To account for those cases as well, it is wise to also store the hash of your direct and indirect dependencies. (The hash is an alphanumeric string that uniquely identifies a library at a given point in time. If even a single bit in a single file of the library changes, the hash changes completely.)
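
For example, on Linux you can compute such a hash yourself with the sha256sum utility (the file name and the output below are abbreviated and purely illustrative):

sha256sum numpy-1.16.1.tar.gz
# prints something like: 31d3...d288  numpy-1.16.1.tar.gz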

Clearly this work is extremely tedious and error-prone, and it should not be done by a human: a computer does a much faster and better job.

Indeed, software has been written precisely to deal with dependency management.

Dependency managers

Now that we understand why it is not wise to track the dependencies of a project manually, let’s explore what dependency managers do.

Dependency managers do exactly what we would expect from their name. They track for us the names of our dependencies, where to download them from, and which version to download; some of them also track the hash of each dependency to install.

Dependency managers also take care of downloading all the necessary dependencies and storing them inside your project.

If each dependency is stored inside the project that uses it, it cannot be shared between different projects and it will take up space on your disk. Excessive use of disk space is usually not a problem nowadays, especially for software libraries, which tend to be limited in size; however, it is something to keep in mind and be aware of.

The user interface of a dependency manager is quite simple. It allows you to create a project, and it will usually create a standard directory structure for you to populate with your files.

When you need to add a new dependency, you ask the dependency manager to install it, and it will perform the installation and all the necessary bookkeeping for you.

Finally, dependency managers store all this information in two different lock files. In the human-friendly lock file you declare your direct dependencies and their versions (here you should not be too strict about the versions of your dependencies: you should declare the minimum version required to run your software).

The computer-friendly lock file is generated automatically and should never be modified manually. It usually stores the name, version, and hash of every installed dependency, both direct and indirect.

Usually both files, the human-friendly and the computer-friendly one, should be shared. The dependency manager is able to construct a computer-friendly lock file starting from the human-friendly one, but it may generate a slightly different one, which in turn means that slightly different dependencies are installed and used. Usually this is not a problem; however, this mismatch might produce bugs and inconsistencies that are very hard to spot, understand, and fix.
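
As a sketch, loosely based on Python's requirements format (the versions and hashes below are made up for illustration), the two files might look like this:

# human friendly: direct dependencies with minimum versions
numpy>=1.15
pandas>=0.23

# computer friendly: every dependency pinned, with its hash
numpy==1.16.1 --hash=sha256:31d3...
pandas==0.24.1 --hash=sha256:8a1c...
pytz==2018.9 --hash=sha256:f59d...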

Now that we understand how dependency managers work we will explore two common ones:

Pip + Virtualenv

pip and virtualenv are the de facto standard tools for managing dependencies in Python.

pip is concerned only with installing and updating Python dependencies in an environment, which by default is the global environment. By using only the global environment it is impossible to isolate the dependencies of different projects: all the installed dependencies end up in the global environment, so it becomes impossible to tell which dependency is needed by which project.

To overcome this problem, virtualenv allows you to create several virtual environments, hence the name. Each environment is isolated from the others, and in each of them it is possible to use pip to install the required dependencies.

Once both pip and virtualenv are installed, creating a new virtual environment is as simple as typing:

virtualenv DIRECTORY

where DIRECTORY is the directory in which we want to create our new environment; it can either be empty or already contain your project.

Once the environment is created it can be activated with the command:

source /path/to/DIRECTORY/bin/activate

At this point you are working inside the virtual environment, and you should notice that your command-line prompt has changed slightly. None of your previously installed dependencies are usable anymore, as if they had never been installed: they live in the global environment, not in the DIRECTORY one.
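
For example, a minimal session might look like this (myproject is a hypothetical directory name; the prefix in parentheses shows the active environment):

virtualenv myproject
source myproject/bin/activate
# the prompt now looks like: (myproject) $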

At this point you can install all the necessary dependencies using pip. For example, to install numpy it is sufficient to run:

pip install numpy

After you have installed all your dependencies, you can freeze them and store their list in a simple text file with:

pip freeze > requirements.txt

This command writes all your dependencies, along with their versions, to the requirements.txt file. This file can then be read by pip itself to install all the necessary dependencies.
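
For example, if numpy were the only package installed in the environment, requirements.txt would contain a single pinned line similar to this (the exact version depends on when you installed it):

numpy==1.16.1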

Hence, when you move your software to a different machine, you can simply execute:

pip install -r requirements.txt

to install all the dependencies on the new machine. Ideally, you should execute this command inside a new virtual environment.
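
Putting it together, the setup on the new machine might look like this (venv is just a hypothetical name for the new environment):

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt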

Finally, the last important piece: to exit from a virtual environment it is sufficient to execute:

deactivate

This should be enough to start working with pip & virtualenv. For a more exhaustive description, visit the official guide. Note that we covered here only Linux and Mac, but Windows works similarly: to activate the virtual environment it is sufficient to execute the script at \path\to\DIRECTORY\Scripts\activate.

Packrat

Packrat is the R equivalent of pip + virtualenv.

Before writing any line of code, you should initialize the repository by executing, in an R shell:

packrat::init("/path/to/your/directory")

At this point you are in packrat mode inside the project directory. To install new dependencies it is sufficient to execute:

install.packages("packageName")  

exactly as you would without packrat. Moreover, it is also possible to install packages using the standard RStudio interface.

To save the status of your dependencies, it is then necessary to execute:

packrat::snapshot() 

This will create the lock files we discussed above.

It is possible to query the status of the packrat environment with:

packrat::status()

and to restore the status of the repository with:

packrat::restore()

Restoring the environment is useful when you move the code to another machine: in that case, packrat will install all the dependencies for you.
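
Putting it all together, a typical workflow might look like this (the path and the package name are illustrative):

# on the original machine, in an R shell
packrat::init("~/myproject")
install.packages("ggplot2")
packrat::snapshot()

# on the new machine, after copying the project
setwd("~/myproject")
packrat::restore()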

Testing the dependency management

Before sharing your code, it can be very useful to test that the management of the dependencies works as expected.

The simplest thing to do is to copy your project into a different directory and execute it there; ideally, you would run your tests or even your whole analysis. If this process doesn’t work as expected, you should:

  1. Delete the new folder with the copy of the project
  2. Fix the error in the original folder
  3. Re-create a copy of the fixed project
  4. Test again

And iterate this process until the project works immediately after being copied.
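
For a Python project, a single iteration of this test might look like the following (directory names are hypothetical, and we assume your tests run with pytest):

cp -r myproject /tmp/myproject-copy
cd /tmp/myproject-copy
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
python -m pytest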

If you have the possibility, it is even better to run the same procedure in a completely new environment: you could use a remote server, or cheaper options like Docker or virtual machines.
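
As a minimal sketch, a Dockerfile like the following rebuilds a Python project from scratch in a clean environment (assuming a requirements.txt file and tests run with pytest):

FROM python:3.7
WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt
RUN python -m pytest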

Ideally, this test procedure should happen continuously, after every change to the code base. Indeed, there are Continuous Integration (CI) services, such as TravisCI and CircleCI, that copy your source directory to a different machine, install the dependencies, and execute all the tests.
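
For example, a minimal TravisCI configuration (a .travis.yml file in the root of the repository, again assuming a Python project tested with pytest) could look like this:

language: python
python:
  - "3.7"
install:
  - pip install -r requirements.txt
script:
  - python -m pytest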

Usually CI is overkill for scientific software, but it is common in industry. Moreover, the CI platforms that we mentioned above have free plans for open-source projects.

Concluding

In this article we have shown how to manage dependencies.

We started by understanding what dependency management is and how difficult it is to do manually. Then we explored two different package managers: pip + virtualenv for Python and packrat for R.

In the last part we showed how to test whether our dependency management works correctly.

If you are interested in these topics, don’t forget to subscribe and to comment with any questions or suggestions.
