How to organize your biological project

How to organize your project?

This paper is a very good primer on how to organize a project. The following is a simplified version that is a good starting point for building an automated script.

1. File structure

projects/<name of project>

./scripts/ – directory for all the python/R/perl scripts

./bash/ – driver scripts that call all other scripts and execute pipelines

./bash/runall.sh – the main driver script

./bash/filter*sh – scripts that I usually use only once, to filter input directories and create soft links

./input/ – input directories

./output/ – output directories

./logs/ – stdout and stderr of runall.sh scripts

readme.txt – description of files and scripts in this project folder

commands.txt – commands that I run in this directory (could be a mess)

datasets (directory for storing all datasets)

CASP9

CASP10

CASP11

Human reference genome

etc…

general_scripts (directory for storing scripts that are used in many projects)

pdb_scripts

templates

etc…

bin (all the small scripts that I always want in PATH; remember to add this directory to your PATH variable)

fastalen

svm_to_txt

txt_to_svm

echo_both

etc…
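Adding the bin directory to PATH could look like the sketch below, typically placed in ~/.bashrc. The location $HOME/bin is an assumption; use wherever your own bin directory lives.

```bash
# Assumption: the bin directory lives at $HOME/bin; adjust to your layout.
bin_dir="$HOME/bin"

# Prepend it to PATH only if it is not already there,
# so that re-reading ~/.bashrc does not add duplicates.
case ":$PATH:" in
    *":$bin_dir:"*) ;;
    *) export PATH="$bin_dir:$PATH" ;;
esac
```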

The idea of file organization is that someone not familiar with the project should be able to understand and use the scripts. This person could be anyone, but most of the time it is yourself, because you tend to forget what you did a few months ago. Also remember Murphy’s law of bioinformatics: everything you do, you will probably have to do over again.
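The project part of the layout in section 1 can be created with a few lines of bash. This is only a sketch: the function name make_project and the example project name are hypothetical, while the directory and file names follow the list in section 1.

```bash
#!/usr/bin/env bash
# Sketch: create the project skeleton described in section 1.
# make_project is a hypothetical helper name, not part of any tool.
set -euo pipefail

make_project() {
    local root="$1"   # e.g. projects/my_project

    # Directories from the layout in section 1.
    mkdir -p "$root"/{scripts,bash,input,output,logs}

    # Create readme.txt and commands.txt; existing content is preserved.
    touch "$root/readme.txt" "$root/commands.txt"

    # Main driver script, created only if it does not exist yet.
    if [ ! -f "$root/bash/runall.sh" ]; then
        printf '#!/usr/bin/env bash\n' > "$root/bash/runall.sh"
        chmod +x "$root/bash/runall.sh"
    fi
}

# Run only when a project path is given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    make_project "$1"
fi
```

Saved under a hypothetical name such as make_project.sh, it would be called as: bash make_project.sh projects/my_project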

2. Driver (runall.sh) scripts
The idea of a runall.sh script is to have a wrapper script that runs everything: cleaning and preparing your data, all intermediate steps, and the final results. This way you can easily rerun an experiment or project with new data.

Useful hints:

1) Include usage text that describes all the parameters

2) Separate the code into blocks, i.e. “stages”. Before each stage, print the date and the stage name:

“================== stage 1 ==================”

3) Always have an “if [ -f "$output_file" ]” check before each stage, so that a stage is skipped when its output file already exists. That way, if you want to rerun some stages, you only need to delete the corresponding output files.

4) Always print “runall.sh is running with parameters: …” at the beginning of the script and “runall.sh is done.” at the end.
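The four hints can be combined into a driver roughly like the sketch below. The two stages, the file names filtered.txt and counts.txt, and the two parameters are hypothetical placeholders for real pipeline steps. Writing each output to a temporary file and renaming it afterwards is an addition to the hints: it keeps a failed stage from leaving a half-written output file that would be skipped on the next run.

```bash
#!/usr/bin/env bash
# Sketch of a runall.sh driver following the four hints.
# Stage contents, file names and parameters are hypothetical placeholders.
set -euo pipefail

# Hint 1: usage text that describes all the parameters.
usage() {
    echo "Usage: runall.sh <input_dir> <output_dir>"
    echo "  <input_dir>   directory with the input files"
    echo "  <output_dir>  directory where results are written"
}

# Hint 2: print the stage banner and the date before each stage.
stage() {
    echo "================== stage $1 =================="
    date
}

runall() {
    if [ "$#" -ne 2 ]; then
        usage
        return 1
    fi
    local input_dir="$1" output_dir="$2"

    # Hint 4: announce the start and the parameters.
    echo "runall.sh is running with parameters: $*"
    mkdir -p "$output_dir"

    stage 1
    local filtered="$output_dir/filtered.txt"
    # Hint 3: skip the stage if its output already exists.
    if [ -f "$filtered" ]; then
        echo "$filtered exists, skipping stage 1"
    else
        # Placeholder step: concatenate all input files.
        cat "$input_dir"/* > "$filtered.tmp"
        mv "$filtered.tmp" "$filtered"
    fi

    stage 2
    local counts="$output_dir/counts.txt"
    if [ -f "$counts" ]; then
        echo "$counts exists, skipping stage 2"
    else
        # Placeholder step: count the lines produced by stage 1.
        wc -l < "$filtered" > "$counts.tmp"
        mv "$counts.tmp" "$counts"
    fi

    # Hint 4: announce the end.
    echo "runall.sh is done."
}

# Run only when arguments are given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    runall "$@"
fi
```

From the project directory, a run that also fills ./logs could look like this (the log file name is an assumption): bash/runall.sh input output > logs/runall.log 2>&1. To rerun stage 2 only, delete output/counts.txt and start the script again.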

3. Templates

You will often want to write small scripts in bash, R or Python. To speed this up, write small template scripts that contain the basic structure: authorship, usage lines, and the “script running” and “script done” statements.
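A bash template along these lines might look like the sketch below. The file name template.sh, the two parameters and the cp placeholder are assumptions for illustration; the header fields in angle brackets are meant to be filled in for each new script.

```bash
#!/usr/bin/env bash
# template.sh: starting point for new bash scripts (hypothetical name).
# Author:  <your name>
# Purpose: <one-line description of what the script does>
set -euo pipefail

script_name="template.sh"

usage() {
    echo "Usage: $script_name <input_file> <output_file>"
}

main() {
    if [ "$#" -ne 2 ]; then
        usage
        return 1
    fi
    local input_file="$1" output_file="$2"

    echo "$script_name is running with parameters: $*"

    # Placeholder: replace this line with the real work of the script.
    cp "$input_file" "$output_file"

    echo "$script_name is done."
}

# Run only when arguments are given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Copying this file and editing the header, usage and placeholder line gives every new script the same start and done messages, which makes the logs of a driver script easier to read.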
