This paper is a very good primer on how to organize a project. The following is a simplified version that makes a good starting point for building automated scripts.
1. File structure
projects/<name of project>
./scripts/ – directory for all the python/R/perl scripts
./bash/ – driver scripts that call all other scripts and execute pipelines
./bash/runall.sh – the main driver script
./bash/filter*sh – scripts that I usually use only once to filter input directories and create soft links
./input/ – input directories
./output/ – output directories
./logs/ – stdout and stderr of runall.sh scripts
readme.txt – description of files and scripts in this project folder
commands.txt – commands that I run in this directory (could be a mess)
datasets (directory for storing all datasets)
    CASP9
    CASP10
    CASP11
    Human reference genome
    etc…
general_scripts (directory for storing scripts that are used in many projects)
    pdb_scripts
    templates
    etc…
bin (all small scripts that I want to always be in PATH; remember to add this directory to your PATH variable)
    fastalen
    svm_to_txt
    txt_to_svm
    echo_both
    etc…
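To put the bin directory on PATH, a line like the following can go in your shell startup file (for example ~/.bashrc). This is a minimal sketch: the location $HOME/bin is an assumption, since the listing above does not say where bin lives.

```shell
# Assumption: the bin directory lives at $HOME/bin; adjust to where you keep it.
BIN_DIR="$HOME/bin"

# Prepend it only if it is not already on PATH, so that re-sourcing the
# startup file does not add the same entry twice.
case ":$PATH:" in
    *":$BIN_DIR:"*) ;;
    *) export PATH="$BIN_DIR:$PATH" ;;
esac
```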
The idea behind this file organization is that someone unfamiliar with the project should be able to understand and use the scripts. That person could be anyone, but most of the time it is you, because you tend to forget what you did a few months ago. Also remember Murphy’s law of bioinformatics: everything you do, you will probably have to do over again.
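The per-project part of the layout above can be created with a few lines of shell. This is only a sketch: the default path projects/demo_project is a made-up placeholder, and the script creates empty starting files rather than anything project-specific.

```shell
#!/usr/bin/env bash
# Sketch: create the per-project skeleton described in the file structure list.
# Assumption: "projects/demo_project" is a placeholder default; pass your own
# project path as the first argument.
set -eu

project_root="${1:-projects/demo_project}"

# Sub-directories taken from the listing: scripts, bash, input, output, logs.
for dir in scripts bash input output logs; do
    mkdir -p "$project_root/$dir"
done

# Create the main driver and the two text files only if they are missing,
# so rerunning this script never overwrites existing notes.
driver="$project_root/bash/runall.sh"
[ -f "$driver" ] || printf '#!/usr/bin/env bash\n' > "$driver"
chmod +x "$driver"
[ -f "$project_root/readme.txt" ]   || : > "$project_root/readme.txt"
[ -f "$project_root/commands.txt" ] || : > "$project_root/commands.txt"

echo "Created project skeleton in $project_root"
```

Because every step checks before it writes, the script is safe to run again on a project that already exists.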
2. Driver (runall.sh) scripts
The idea of a runall.sh script is to have a wrapper script that runs everything: cleaning and prepping your data, all intermediate steps, and producing the final results. This way you can easily rerun an experiment/project with new data.
Useful hints:
1) Print usage text that describes all the parameters
2) Separate the code into blocks, i.e. “stages”. Before each stage, print the date and the stage:
“================== stage 1 ==================”
3) Always have an “if [ -f $output_file ]” check before each stage, and skip the stage when its output file already exists. That way, if you want to rerun some stages, you only need to delete the corresponding output files.
4) Always print “runall.sh is running with parameters: …” at the beginning of the script and “runall.sh is done.” at the end.
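The four hints can be combined into a skeleton like the one below. It is a minimal sketch, not a finished driver: the two parameters, the output file names, and the stage bodies (listing the input files and counting them) are placeholders invented for illustration, and the defaults assume the script is started from the project root.

```shell
#!/usr/bin/env bash
# runall.sh sketch: usage text, dated stage banners, output-file checks,
# and start/done messages.
# Assumptions: the parameters, file names, and stage commands are placeholders.
set -eu

usage() {
    echo "Usage: runall.sh [input_dir] [output_dir]"
    echo "  input_dir   directory with input files (default: input)"
    echo "  output_dir  directory for results      (default: output)"
}

case "${1:-}" in
    -h|--help) usage; exit 0 ;;
esac

input_dir="${1:-input}"
output_dir="${2:-output}"
mkdir -p "$input_dir" "$output_dir"

echo "runall.sh is running with parameters: input_dir=$input_dir output_dir=$output_dir"

# Print the stage banner followed by the current date.
banner() {
    echo "================== stage $1 =================="
    date
}

# Stage 1 (placeholder): record which input files are present.
banner 1
stage1_out="$output_dir/files.txt"
if [ -f "$stage1_out" ]; then
    echo "stage 1: $stage1_out exists, skipping"
else
    ls "$input_dir" > "$stage1_out"
fi

# Stage 2 (placeholder): count the files recorded by stage 1.
banner 2
stage2_out="$output_dir/file_count.txt"
if [ -f "$stage2_out" ]; then
    echo "stage 2: $stage2_out exists, skipping"
else
    wc -l < "$stage1_out" | tr -d ' ' > "$stage2_out"
fi

echo "runall.sh is done."
```

To rerun only stage 2, delete output/file_count.txt and start the script again. To keep stdout and stderr in the logs directory, the script can be started as, for example, `bash/runall.sh > logs/runall.log 2>&1` (the log file name is again just an example).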
3. Templates
You will often want to write small scripts in bash/R/python. To speed this up, write small template scripts that contain the basic structure, authorship, usage lines, and “script running” and “script done” statements.
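One way to keep such a template around is to write it to disk once and copy it whenever a new script is needed. The sketch below does that for a bash template. The path general_scripts/templates/template.sh, the angle-bracket placeholders, and the single input_file argument are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: write a bash template to disk so new scripts can start as a copy.
# Assumptions: the template path and the <...> placeholders are illustrative.
set -eu

template_dir="general_scripts/templates"
template="$template_dir/template.sh"
mkdir -p "$template_dir"

# The quoted 'EOF' keeps $0, $* and $# literal inside the template,
# so they are expanded when the template runs, not when it is written.
cat > "$template" <<'EOF'
#!/usr/bin/env bash
# Description: <what this script does>
# Author:      <your name>
# Usage:       script.sh <input_file>
set -eu

if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <input_file>" >&2
    exit 1
fi

echo "$0 is running with parameters: $*"

# --- script body goes here ---

echo "$0 is done."
EOF
chmod +x "$template"

echo "Wrote $template"
```

A new script then starts as `cp general_scripts/templates/template.sh scripts/my_script.sh` (my_script.sh being a hypothetical name), with the header and body filled in afterwards.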