Probably Random

A blog by Paul Siegel

Working with Python Project Directories

March 05, 2020

Consider the following directory structure, typical of a Python-based data science project:

root
├── data
│   ├── test.csv
│   ├── train.csv
│   └── validation.csv
├── notebooks
│   └── exploration.ipynb
├── readme.md
├── requirements.txt
└── src
    ├── __init__.py
    ├── features
    │   ├── __init__.py
    │   └── extract_features.py
    ├── models
    │   ├── __init__.py
    │   └── train_model.py
    └── validation
        ├── __init__.py
        └── validate_model.py

The intent of this codebase is relatively clear: the various modules in src will fit a model to the data in the data directory and validate it. It’s likely that the code in the notebooks folder will also work with the same dataset.

Problem: how can we reliably refer to the data from the code?

Here are some of the standard approaches and their weaknesses:

Better idea: use git

Recently I happened upon an alternative approach which uses git. In short: express all paths to data in your codebase relative to the project’s git root path. This technique assumes that your project is in fact a git repository, but:

  1. It should be, dammit!
  2. You don’t even have to commit anything - just executing git init from your project’s root directory is enough.

Here’s how to implement this technique in a robust, platform-independent way. The first step is to install the Python package gitpython, a Python wrapper around git. Then define a function get_project_root() somewhere in your repository as follows:

# project_root.py
import git
from pathlib import Path

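def get_project_root():
    """Return the root directory of the enclosing git repository as a Path."""
    # search_parent_directories=True walks upward from the current working
    # directory ('.') until it finds a git repository.
    return Path(git.Repo('.', search_parent_directories=True).working_tree_dir)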

The gitpython package works correctly on all of the standard operating systems. Now you can load data as follows:

from project_root import get_project_root
import pandas as pd

root = get_project_root()
train = pd.read_csv(root / 'data' / 'train.csv')
test = pd.read_csv(root / 'data' / 'test.csv')

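This will continue to work even if you move or refactor the project directory, and it will also work on your colleagues’ laptops if they clone your repository. The one requirement is that the code is run from somewhere inside the repository, because the search starts from the current working directory.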

(Of course, the import statement above assumes that project_root.py is importable, i.e. that its directory is on sys.path, when the code above is being executed - absolute / relative imports in Python are a subject for another blog post…)
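To make the parent-directory search less mysterious, here is a rough standard-library approximation of it. This is only a sketch and not how gitpython is actually implemented: the name find_git_root is hypothetical, and it assumes that a directory counts as a repository root as soon as it contains an entry called .git.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Optional[Path] = None) -> Path:
    """Walk upward from `start` (default: the current working directory)
    and return the first directory that contains a `.git` entry.

    Hypothetical helper for illustration; not part of gitpython.
    """
    current = (start or Path.cwd()).resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (current, *current.parents):
        # `.git` is usually a directory, but it can be a plain file
        # (e.g. in worktrees), so only test for existence.
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError(
        f"No .git entry found in {current} or any of its parent directories"
    )
```

Calling find_git_root() from src/models in the layout above would return the path of root, provided git init has been run there.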

This also works in the shell

It’s possible to use this same trick to maintain a reference to your project’s root directory in your shell. To begin, add the following to your git config file (normally ~/.gitconfig):

[alias]
    root = rev-parse --show-toplevel

Now any time you are inside a git repository, the command git root will print the absolute path of the repository’s root directory. So given the directory structure at the beginning of this post, you can navigate directly to the data directory from anywhere in your repository using cd "$(git root)/data" (the quotes keep the command working if the path contains spaces).
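The underlying command, git rev-parse --show-toplevel, can also be called from Python when installing gitpython is not an option. The following is a minimal sketch under that assumption: the function name git_root is hypothetical, and the git executable must be available on the system PATH.

```python
import subprocess
from pathlib import Path
from typing import Optional, Union


def git_root(cwd: Optional[Union[str, Path]] = None) -> Path:
    """Return the top-level directory of the git repository containing `cwd`
    (default: the current working directory).

    Hypothetical helper; raises subprocess.CalledProcessError when `cwd`
    is not inside a git repository.
    """
    result = subprocess.run(
        ["git", "rev-parse", "--show-toplevel"],
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    # git prints the path followed by a newline.
    return Path(result.stdout.strip())
```

This trades the gitpython dependency for a subprocess call per lookup, which is usually fine for scripts and notebooks.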

Downsides

This strategy has been working for me pretty well so far, but there are some legitimate objections:

  1. Parametrization is better since it allows references to data outside of the project.
  2. This approach creates a dependence on git.

The first point is a reasonable concern, and I do still add data paths as command line arguments for more formal projects. But for simple prototypes and demos it is often much easier and less brittle to carry the data around with the code.
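The two approaches can be combined: expose the data location as a command line argument and let it default to a path under the project root. Below is a small sketch of that idea using argparse. The --data-dir flag and the build_parser function are illustrative names, not something taken from a real project.

```python
import argparse
from pathlib import Path


def build_parser(default_data_dir: Path) -> argparse.ArgumentParser:
    """Build a parser whose --data-dir option falls back to `default_data_dir`.

    Hypothetical helper; in a real script the default would typically be
    something like get_project_root() / 'data'.
    """
    parser = argparse.ArgumentParser(description="Train a model.")
    parser.add_argument(
        "--data-dir",
        type=Path,
        default=default_data_dir,
        help="Directory containing train.csv, test.csv and validation.csv",
    )
    return parser


# With no arguments the default is used; an explicit flag overrides it.
args = build_parser(Path("data")).parse_args([])
train_path = args.data_dir / "train.csv"
```

Data outside the project can then be reached with --data-dir, while the common case needs no arguments at all.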

The second concern is more reasonable for those who use different or multiple version control tools, but if you are already thoroughly dependent on git like me then this post will help you get more out of it!