My Deep Learning Toolchain
Successful model development can be surprisingly dependent on good engineering practices. Despite this, many model implementations scattered about Github are difficult to follow and hard to recreate locally.
But what should a good model look like? I would propose that the gold standard for a model implemented on Github could be:
- The dependencies may be installed automatically, using a single command.
- I can build the model in a sandbox without polluting my dev environment.
- I can easily retrain (and retune!) the model exactly as the author did during their development.
- I can easily test the inference pass.
- I can easily import the model into my own workflow and use it as part of my own code.
It may be that I've never come across a model that met this standard in practical day-to-day development...! But I claim that achieving these things is surprisingly easy given the modern tools available to us.
Furthermore, many may simply respond "just ship everything in a Docker container". My response to this is that Docker is a great tool for deploying a stable, reproducible environment, but Docker images provide a poor basis for sharing code and enabling collaboration. In this post, we'll focus on designing models that can be installed and imported with a single command, without Docker, on any OS, on any platform, using the vanilla tools our language provides.
So, let's dive in and look at the tools that I use to train models, and how they can be used to meet the gold standard I gave above.
Python 3 for model development
It may not be controversial these days, but it's worth highlighting that Python 3 is an excellent choice for prototyping and releasing deep learning models. The key feature here is the library support, which is unmatched by other languages. I tend to target Python 3.7.5 or later, because I make heavy use of dataclasses (added in 3.7) and the more recent typing features. Ubuntu Focal now includes Python 3.8 by default, so many users will be able to support this out of the box.
Bonus points: type annotations
If you use type annotations (correctly) in your model, you will get serious points from me. It's very common that models use native lists, tensors, and numpy arrays almost interchangeably as function arguments, and so using type hints can make the data flow in your model easier to follow and debug.
from dataclasses import dataclass
from typing import List, Optional

import numpy as np
from PIL import Image


@dataclass
class DataSample:
    img: Image.Image                 # a PIL image (Image.Image, not the Image module)
    bboxes: List[np.ndarray]         # e.g. one coordinate array per box
    scores: Optional[List[float]]    # optional per-box confidence scores

...

sample = DataSample(Image.new("RGB", (64, 64)), [np.array([1, 2, 3, 4])], scores=None)
Use pyenv to manage Python versions
For better or worse, many incompatible versions of Python now exist, and it's easy to get in a tangle with different Python installs, especially when making system-wide changes.
My advice here is to do it properly, and do it once: learn to use pyenv to manage all of your Python installs.
Pyenv is a tool that provides a shim around the python command which redirects it to the correct version of Python for the current context. For a given project, you can use pyenv local 3.7.5 in a directory to tell pyenv to use Python 3.7.5 from now on when calling Python in that directory.
If you don't have the version of Python installed that you need, you can use pyenv to install it. If the universe is feeling good to you today, pyenv install 3.7.5 should be sufficient to fetch and install Python 3.7.5 for you automatically on any platform.
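To make this concrete, a typical session (assuming pyenv is already installed and initialised in your shell) looks something like:

$ pyenv install 3.7.5      # fetch and build Python 3.7.5
$ cd ~/projects/fancynet
$ pyenv local 3.7.5        # writes a .python-version file for this directory
$ python --version
Python 3.7.5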
In practice, things don't always go so smoothly with pyenv, however I really believe it's worth the hassle of spending the time and getting it working - most issues you might encounter are well understood, and documented on stackoverflow.
A few tips for common pyenv issues
- It's not working! - make sure pyenv is initialised in your shell (google "pyenv init"). Typically you need to add a line to e.g. your .bashrc
- pyenv install failed! - you may be missing system dependencies. Be patient, and trawl stackoverflow. It can be made to work!
Use poetry to manage your project and dependencies
Keeping your Python project sandboxed is a crucial aspect of remaining sane when using Python. We typically use virtualenvs, or virtual environments, to achieve this. These are essentially directories containing their own Python install, and a local copy of all the packages your project needs.
A separate but related problem is ensuring you actually have the dependencies your project needs installed into that virtualenv.
Once upon a time, you may have used `virtualenv` and `requirements.txt` or `anaconda` or `pipenv` to do these things.
Well, my advice is to steer clear of all of them and move straight to poetry. It's very much the new kid on the block, but from my perspective it's already miles ahead of the competition. I have managed to deploy a number of complex Python applications in a professional capacity using poetry, and I have found it both plays very well with other tools and is generally well designed.
Poetry uses the modern pyproject.toml way of defining a package, and this essentially replaces the old setup.py and friends. It would be out of scope to explain why pyproject.toml is important and worth learning about, so I recommend some googling here!
Ok, so let's create a new project! ... don't forget to `pyenv local 3.7.5` first, to ensure you are using the correct version of Python (if you forgot, you can run it later, and then go back and edit pyproject.toml to make sure it's set correctly.)
$ poetry new fancynet
Yes, it's as easy as that. Go ahead and move into the directory poetry created for you, and you can now
$ poetry install
$ poetry add torch
$ poetry add numpy
Poetry will install them to the virtualenv and add them to the pyproject.toml. It will also pin the exact versions of the dependencies into a lock file so that other users installing your package can reproduce precisely the same environment as you. This is perhaps the most important step here, and is worth underscoring: this is how we are able to achieve reproducibility.
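For reference, the resulting pyproject.toml will look something along these lines (the exact metadata and version constraints depend on what poetry generates and resolves for you; the build-system section is what later allows pip to install the package directly):

[tool.poetry]
name = "fancynet"
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.7"
torch = "^1.5"
numpy = "^1.18"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"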
The Best Bit About Poetry
Poetry was designed to be good at both developing packages and building/sharing/publishing them. These latter features are sorely missing from other tools (such as pipenv), and they are really the killer feature of poetry.
Let's suppose you pushed your fancynet package to github (don't forget to check in the lock file!). Now, anyone using a recent version of pip can install your package with a single command,
$ pip install git+https://github.com/myusername/fancynet
Pip will fetch the code from github, look into the pyproject.toml, see it uses poetry as its build backend, and then just do everything for you: it installs the build backend, builds the package, and installs the dependencies declared in pyproject.toml. (Strictly, pip resolves from the version constraints in pyproject.toml - the lock file is what lets collaborators running poetry install reproduce your exact environment.) This is totally magic and not enough people know this trick. Go forth and spread this knowledge!
If your package reaches a level of maturity where you'd like to publish it to a public package repository (e.g. pypi), you can use poetry to manage this. To bump the patch version (you can also bump the minor and major versions this way), use
$ poetry version patch
$ poetry build
$ poetry publish
Use click to manage your entrypoint
An oft-overlooked step in building a good python package is organising the files and "entrypoints" (or "executables"). I believe it's worth imagining that someone else is going to use your code at some point, and so I try to create nicely named subdirectories for each part of my model: "datasets", "models", "inference", but also "cli". In this "cli" directory, I usually have several subdirectories: "train", "tune", "test". Each of these contains a single "__main__.py", which contains the (small!) shim code needed to perform those actions.
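To make that concrete, a sketch of the kind of layout I have in mind (the names are purely illustrative; "mynet" matches the examples below):

mynet/
├── pyproject.toml
└── mynet/
    ├── datasets/
    ├── models/
    ├── inference/
    └── cli/
        ├── train/
        │   └── __main__.py
        ├── tune/
        │   └── __main__.py
        └── test/
            └── __main__.py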
Inside "__main__.py" I then use click to manage arguments and the entrypoint. I have found it much easier to use than argparse, and I do think it's worth the effort!
import click
from omegaconf import OmegaConf
from pytorch_lightning import Trainer

from mynet.models import MyNet


@click.command()
@click.option('--dataset-root-dir', help='directory containing the dataset')
@click.option('--config-path', default="config.yaml", help='The config file to use.')
def train(dataset_root_dir: str, config_path: str):
    cfg = OmegaConf.load(config_path)
    cfg.dataset_root_dir = dataset_root_dir  # make the CLI argument visible to the model
    model = MyNet(cfg)
    trainer = Trainer(gpus=1, profiler=True)
    trainer.fit(model)


if __name__ == '__main__':
    train()
Now, if we are developing locally, we can run this using
$ poetry run python -m mynet.cli.train --dataset-root-dir /foo/bar/baz
Better still, we can register the entrypoint as a script in pyproject.toml:

[tool.poetry.scripts]
train = "mynet.cli.train.__main__:train"

after which the same thing becomes

$ poetry run train --dataset-root-dir /foo/bar/baz

and, once the package has been installed into a user's environment (e.g. via pip), simply

$ train --dataset-root-dir /foo/bar/baz
It's possible to add multiple entrypoints to the pyproject.toml, so you can easily create one for training, tuning, testing. This is helpful for users trying to discover how to train or evaluate your model from scratch!
Do note that if a user installs your package into a shared virtualenv, you may wish to choose a more meaningful name than train!
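For instance (the names are purely illustrative), the scripts section might grow into:

[tool.poetry.scripts]
mynet-train = "mynet.cli.train.__main__:train"
mynet-tune = "mynet.cli.tune.__main__:tune"
mynet-test = "mynet.cli.test.__main__:test"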
Use pytorch to develop your net
The key primitives making up modern deep learning are: operations on tensors, a computation graph, and the functionality to automatically compute derivatives. Pytorch is a library designed to excel at providing these primitives, and surprisingly little else.
This decision to focus on implementing this set of features as flexible, composable, free functions, rather than building an opinionated framework for Deep Learning, is (in my opinion) what makes pytorch so successful. Whilst some steps can feel verbose, keeping the operations that pytorch performs explicit helps users maintain a clear mental model of what the library is actually doing for them. This suits my personal preference for "verbose, boring code" being superior to "clever, terse code".
One thing worth noting is that to get the most out of pytorch, you will ultimately need a Cuda gpu. However, pytorch is quite happy to install and run on CPU, and I believe it is worth ensuring that any model you build can be executed on the cpu, even where it would be prohibitively slow to train your net. This brings a few benefits, such as making it possible to run your tests in an arbitrary environment (such as in continuous integration), and also allowing development on a dev. laptop such as a macbook.
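In practice this amounts to little more than avoiding hard-coded .cuda() calls and picking the device at runtime - a minimal sketch (with a stand-in layer in place of a real net):

import torch
from torch import nn

# pick the best available device rather than assuming a gpu is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(16, 4).to(device)         # stand-in for your real net
batch = torch.randn(8, 16, device=device)   # inputs created on the same device
output = model(batch)                       # runs identically on cpu and gpu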
Finally, pytorch has grown quickly and gained many new built-in features. Despite this, it's still common to find nets on Github using custom Cuda code to implement operations (such as nms - which is now natively provided by torchvision). My advice is as follows: as far as physically possible, keep your model in pure Python with no custom C or Cuda code. Making a model that depends on external compiled code meet the imagined gold standard above is a significant undertaking, and worth avoiding at all costs if possible. Usually, this just means using modern pytorch and torchvision libraries, and sometimes it may mean sacrificing a small amount of performance. My suggestion is to consider this tradeoff carefully!
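To pick up the nms example: the operation that once justified a custom Cuda extension is now a one-liner in torchvision (illustrated here with made-up boxes):

import torch
from torchvision.ops import nms

# two heavily overlapping boxes and one separate box, in (x1, y1, x2, y2) format
boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.],
                      [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)  # indices of the boxes to keep
print(keep)  # tensor([0, 2]) - the lower-scoring overlapping box is suppressed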
Use pytorch lightning to structure your model
When my colleague came to me pitching pytorch lightning as part of our model workflow, I was very sceptical. Generally, as engineers we are suspicious of using frameworks, which limit flexibility and impose conventions on our projects that can conflict with our existing code and at worst, fail to stay up-to-date with the development of the surrounding ecosystem of libraries.
Fortunately my colleague convinced me to try lightning, and I now wholeheartedly recommend it for model development and use it in my own projects. Lightning's success is that it provides a sane, modular structure on top of pytorch to get the most common tasks of Deep Learning done: loading the train and val sets, logging training progress and images, putting the model into eval mode for the validation run, saving the best checkpoints by inspecting the validation loss... When we prototype models, it's easy to forget or procrastinate these steps, or implement them incorrectly, and my experience is that lightning has done a very good job of doing what you need without getting in the way.
Moving to lightning is probably the single most important step I have taken towards rapidly prototyping models.
It can be tempting to find an existing model on Github (I have seen engineers fork the Facebook detectron repo as a starting point), but I think this is a mistake: progress is rapidly being made in pytorch and other libraries, and so structuring your code around an old project can be a bad idea. Lightning makes it easy to start afresh each time you build a model, using the most modern workflows and most modern libraries.
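To give a flavour, here is a bare-bones sketch of what the MyNet module imported by the train entrypoint above might look like (the layers, config fields, and loss are made up for illustration, and the exact logging API varies a little between lightning versions):

import torch
import torch.nn.functional as F
from torch import nn
import pytorch_lightning as pl


class MyNet(pl.LightningModule):
    def __init__(self, cfg):
        super().__init__()
        # lightning stores these hyperparameters alongside every checkpoint it writes
        self.save_hyperparameters(cfg)
        self.backbone = nn.Linear(cfg.input_dim, cfg.hidden_dim)  # made-up architecture
        self.head = nn.Linear(cfg.hidden_dim, cfg.num_classes)

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.cross_entropy(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)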
A note on visualization
Understanding how our models work is crucial to debugging them. In this respect, lightning helps us enormously. Without any prompting, a totally vanilla pytorch lightning model will log key stats (and even model hyperparameters) to tensorboard. This can of course be configured to use different log sinks (such as WandB). And of course, one can easily add extra logging, fields, images etc.
With respect to logging to the console: my advice is to try to avoid it in code that you push to github! print()ing directly to standard out makes your model a bad actor for users wishing to import it and use it as part of their own code, and there is a simple and standard alternative: use the Python logger (which writes to stderr, not stdout), and select the level correctly (warn, info, debug etc.). Take a little time reading the docs on this and avoid print, and note that by default the logger typically squashes .info level logs, so you may wish to turn this up.
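Concretely, the pattern inside your library code is just the following (load_dataset is a stand-in for any function in your package); the consumer then decides how chatty they want it to be:

import logging

logger = logging.getLogger(__name__)  # a logger named after your module


def load_dataset(root_dir):
    logger.info("loading dataset from %s", root_dir)  # hidden by default
    ...


# in an application (or your own cli entrypoint), opt in to the chattier levels:
logging.basicConfig(level=logging.INFO)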
A final note: if you are just starting deep learning, my advice would be to train a few models with vanilla pytorch before jumping into lightning, as this will help build intuition!
Use OmegaConf to manage your config
When developing models, it's easy to end up with magic constants all over our code: when setting convolution kernel sizes, when trying out different loss functions, and generally whilst iterating on different architectures. My advice here is to adopt a principled approach to config early, before magic constants start littering your code.
The key benefit comes when combining this advice with the advice above to use pytorch lightning: lightning will save the hyperparameters in your config into each model checkpoint that it (automatically) creates! This means that even when you make large non-trivial changes to your model's code, previously trained checkpoints can still be loaded, evaluated, and tested with the current versions of the code. This backwards compatibility can be very hard to add retrospectively, and turns out to be very easy to add early, and so I believe is worth the effort.
Using this approach, and checking in the config files used during training, can further help users reproduce the results you have obtained.
Of the (small number of) different configuration management tools I have tried, the one I would recommend is "OmegaConf". The key features for me are that it's simple, unobtrusive, can easily be converted to and from native Python objects, and supports multiple kinds of configuration file.
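To give a taste of how it works, suppose we have a config.yaml like this (the fields match the hypothetical MyNet above):

# config.yaml
input_dim: 16
hidden_dim: 128
num_classes: 10
lr: 0.001

Then in Python:

from omegaconf import OmegaConf

cfg = OmegaConf.load("config.yaml")          # a DictConfig with attribute access
print(cfg.hidden_dim)                        # -> 128
overrides = OmegaConf.create({"lr": 0.01})   # e.g. built from command-line options
cfg = OmegaConf.merge(cfg, overrides)        # later values win
plain = OmegaConf.to_container(cfg)          # back to a plain Python dict if needed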
Use Raytune to tune your model
Automated hyperparameter tuning may be one of those things that we all know is a good idea, but that in practice feels like a lot of effort and can easily be put off. I felt this way myself until I realised how easy it was to achieve using modern tools. Many of us (me included in many cases...) continue to use "GSD" (Graduate Student Descent) to find the optimal parameters in our models, which is painfully slow and unsatisfying.
But, I have good news: now that we have used lightning to define our model and OmegaConf to manage our config, we will find that the jump to automated hyperparameter tuning is surprisingly modest!
A fully worked example of tuning a pytorch lightning model is provided in the docs and adapting this to another model and making it work is not too bad.
It took me a little while to grok what raytune is doing, but the key thing to understand is that it spawns several subprocesses, each with an instance of your wrapped model inside it. The wrapper you provide contains a lightweight callback which, when invoked, sends a message back to the parent process to report on current progress.
Raytune has a lot of tunables and can rapidly become confusing, though I found even the most basic config of randomly sampling across a set of hyperparameters immediately sped up my progress in model development! It's worth noting that one can tell raytune how many resources to allocate to each of the child processes (such as 0.25 of a gpu), which means that raytune can run several trials in parallel if you have enough resources on your gpu.
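To give a rough idea of the shape of it, here is a sketch (it reuses the hypothetical MyNet and config fields from above, assumes the model logs a val_loss metric, and the lightning integration in the raytune docs is more polished than this):

from ray import tune
from omegaconf import OmegaConf
from pytorch_lightning import Trainer

from mynet.models import MyNet


def train_trial(config):
    # each trial runs in its own process with one sampled set of hyperparameters
    cfg = OmegaConf.create(config)
    model = MyNet(cfg)
    trainer = Trainer(max_epochs=10, gpus=1)
    trainer.fit(model)
    # report the final validation loss back to the parent raytune process
    tune.report(val_loss=float(trainer.callback_metrics["val_loss"]))


analysis = tune.run(
    train_trial,
    config={
        "input_dim": 16,
        "num_classes": 10,
        "lr": tune.loguniform(1e-4, 1e-1),
        "hidden_dim": tune.choice([64, 128, 256]),
    },
    num_samples=20,                               # how many hyperparameter samples to try
    resources_per_trial={"cpu": 2, "gpu": 0.25},  # several trials can share one gpu
)
print(analysis.get_best_config(metric="val_loss", mode="min"))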
Putting it all together
At this point I was tempted to link to an "example" repo with an example model, but I don't think copying an existing repo is the best way to start. I believe the best way is to know your tools well, and to be able to quickly set up a new environment from scratch each time, always getting better as the tools around us evolve.
So, let's instead talk through the steps I (currently) use to start training a new model, using the advice above.
- Decide on which Python version to use - "pyenv local 3.7.5"
- Create a new poetry project - "poetry new mynet"
- Install my tools - "poetry add torch"... etc.
- Create an entrypoint using click
- Create a pytorch lightning module - By reading the quickstart lightning docs!
- Create a config file
- Build my model...
- Create an entrypoint to tune the model with raytune
- Tune my model...
Well now... that was a long-winded post. Hopefully those of you that made it here got something from it 🙂