Reproducible examples

Examples and tutorials are a great way to showcase your software. However, for them to be useful, they need to be reproducible. There are a few ways you can ensure this is the case:

  • Providing clear instructions to install the software and its dependencies, as described here.

  • Providing clear instructions to run the example and download any necessary data, e.g. in the README or in the home page of the documentation.

  • Using notebooks and scripts to showcase your software. See below for more details.

  • Using configuration management tools, such as Hydra, for separating the configuration of an experiment from the code. See below for more details.

Using notebooks and scripts

Jupyter notebooks are a great way to showcase your software. They allow you to combine code, text, and images in a single document. However, they can be prone to errors when not running cells in order. Moreover, they are a pain when it comes to version control, as they are not plain text files and lend to many lines of code being changed when only a few lines are actually changed.

Scripts on the other hand are plain text files and are easy to version control. Moreover, they are less prone to errors as everything is run in order, and you are less prone to have missing or incorrectly defined variables.

Both notebooks and scripts have their place. My recommendation is to:

  1. (Optionally) Use notebooks for interactive development.

  2. Convert the notebook into a script / scripts + files with utility functions, and continue development.

  3. Commit the script(s) to your repository.

  4. Create a polished notebook to showcase your software with code, text, and images.

  5. Commit the notebook to your repository.

In this manner, you can avoid the annoyances of notebooks with version control, while still being able to showcase your software in a polished notebook.

Separating configuration from code

Relying on scripts can still lead to pesky little commits and code duplication, e.g. when you want to change a parameter or showcase different results from the same functionality. This is where configuration management tools, such as Hydra, can come in handy.

Hydra allow you to separate the configuration of an experiment from the code, through a separate configuration YAML file. This allows you to change the parameters inside the configuration file(s) without having to change the code. This is particularly useful when you want to demonstrate different scenarios, e.g. when you have have a script that trains a model and a configuration file that specifies the parameters of the model / training. Moreover, having a single configuration file that exposes all the relevant parameters can be much more convenient than having to dig through the code to find them.

The script examples/real_convolve.py shows how to use Hydra, namely adding a decorate to the function:

@hydra.main(version_base=None, config_path="configs", config_name="defaults")
def main(config):
    # `config` is a dictionary with the parameters specified in the YAML file.
    # e.g. `config.seed`

Running python examples/real_convolve.py will run the script with the default parameters from the file configs/defaults.yaml. You can also specify a different configuration file, e.g. python examples/real_convolve.py -cn exp1 will run the script with the file configs/exp1.yaml. Instead of creating a new configuration file, you can also specify the parameters directly through the command line, e.g. python examples/real_convolve.py filter_len=15.

One handy thing about Hydra is that it creates a folder outputs with the timestamp of the run, and saves the configuration file used in that run. You can also save the results of the run in that folder:

@hydra.main(version_base=None, config_path="configs", config_name="defaults")
def main(config):
    # ...
    # Save the results
    np.save(os.path.join(os.getcwd(), "results.npy"), results)

This makes Hydra a great tool for keeping tracking of experiment runs and their parameters, while limiting changes to the code (and having new code commits). Check out this blog post for more on Hydra.

Setting the seed

Setting the seed is important for reproducibility. You can set the seed in the configuration file and use it the script like so:

@hydra.main(version_base=None, config_path="configs", config_name="defaults")
def main(config):

    # Set the seed for numpy
    np.random.seed(config.seed)
    # application-specific seed setting