Lab: Machine learning methods, LOGOs, SignalP
Instructions
Answer all questions and send your answers to John by email.
Please send a PDF file (not .docx, .odt, etc.) named “name_lab_your_name.pdf”.
Video introduction
Grading
The lab reports are practice in writing scientific texts and an important part of the course. To pass, you need to submit a final version of the lab report within 7 days after the lab. You have one chance to submit a preliminary version and receive feedback on it (not mandatory). The preliminary version must be submitted within 72 hours of the lab, and feedback will be provided at the latest 24 hours before the deadline of the assignment. If the final report is not submitted in time, or if it contains an error, you will get an Fx on the lab course. This means that you have to submit an updated lab report within 7 days after the exam, and that the highest grade you can get is an E on the entire course. If you do not do so, you will have to re-register for the course the next time it is given (normally next year) and complete the missing parts then, and you still cannot receive a grade higher than E. In short, these are the rules for the lab reports:
A. Day 0 – Lab
B. 0-3 days after the lab – Preliminary version
C. 3-6 days after the lab – Feedback
D. 7 days after the lab – Final version
The lab report must be complete upon submission, i.e. you have to provide an answer to each question. The assistant has the right to return or not approve the submission if it is incomplete. Linguistic accuracy is a must. Returned reports must be sent back to us within a week after you have received them. The resubmission must be a version in which all comments are visible. All comments must be addressed carefully; otherwise the resubmission does not count. A maximum of one return is allowed. A second return means that you have to re-register for the course the next time it is given (normally next year) and complete the lab then. Complete, on-time reports for all bioinformatics and programming labs are compulsory.
Good Video Introduction to Machine Learning
Assignment Instructions
Download all you need first!
Resources
REMEMBER TO READ THIS TUTORIAL!
SVM Classification of Signal Peptides
Support vector machines (SVMs) are among the most popular machine learning methods for classifying data. SVMlight is an implementation of SVMs and will be used in this practical. Your task is to predict whether a given protein has a signal peptide. Signal peptides are short peptides (3-60 amino acids) that direct the post-translational transport of proteins. In this practical, we will look at proteins with signal peptides at their N-termini (the beginning of the peptide chain). Proteins with this signal peptide are transported out of the cell; in other words, they are marked to become extracellular.
Classification by SVM is generally based on a number of selected features of the data to be classified. In this practical, the data is a set of protein sequences and the selected features are: a) the frequency of each amino acid in the first 25 residues, and b) the frequency of each amino acid in the first 100 residues. Create a folder called svm, download the data (svm.tar) into that folder, and decompress it there (tar xfv svm.tar).
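To make the feature definition concrete, the following is a minimal Python sketch of how amino acid frequencies over the first N residues could be computed. It is not the script that produced the lab files. The function name aa_frequencies, the alphabetical ordering of the amino acids, the use of relative frequencies rather than raw counts, and the example sequence are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code. This ordering is an
# assumption; the feature numbering in the lab files may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequence, n_residues=25):
    """Return the relative frequency of each amino acid in the first n_residues."""
    prefix = sequence[:n_residues].upper()
    if not prefix:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(prefix)
    # Non-standard letters (e.g. X) count towards the length but get no feature.
    return [counts[aa] / len(prefix) for aa in AMINO_ACIDS]


# Made-up sequence for illustration; it is not taken from the lab data.
example = "MKLLSAVALLLLAASPAQAEDKTHTCPPCPAPELLGG"
for aa, freq in zip(AMINO_ACIDS, aa_frequencies(example, n_residues=25)):
    if freq > 0:
        print(f"{aa}: {freq:.2f}")
```

Calling the same function with n_residues=100 would give the second feature set described above.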
Use the textbook and other relevant resources to answer the following questions.
1. What is the general procedure when classifying data with support vector machines?
2. In your own words, define supervised and unsupervised learning and point out the difference(s). Give two example methods for each.
3. What is cross-validation?
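As a starting point for question 3, here is a small Python sketch of how a data set can be partitioned for k-fold cross-validation, one common variant. The function name k_fold_splits and the choice of k=5 are illustrative assumptions; the sketch only produces index partitions and does not train any model.

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Each sample is used for validation exactly once across the k folds.
for fold, (train, validation) in enumerate(k_fold_splits(10, k=5), start=1):
    print(f"fold {fold}: train={train} validation={validation}")
```

In practice the samples are usually shuffled before splitting; that step is left out here to keep the sketch short.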
You should have several files in your /svm/ folder, including the following:
- train25_mini (Training set, Number of sequences: 50, Features: amino acid frequencies for the first 25 residues)
- train25 (Training set, Number of sequences: 1000, Features: amino acid frequencies for the first 25 residues)
- train100 (Training set, Number of sequences: 1000, Features: amino acid frequencies for the first 100 residues)
- test25 (Test set, Number of sequences: 100, Features: amino acid frequencies for the first 25 residues)
- test100 (Test set, Number of sequences: 100, Features: amino acid frequencies for the first 100 residues)
Try to understand the content of these files. They are all in the format required by SVMlight, and therefore you should look at the description of the format of the input files given at http://svmlight.joachims.org/ .
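If it helps when inspecting the files, the sketch below reads one line in the general SVMlight input layout, that is, a target value followed by index:value pairs and an optional comment after a # character. The function name parse_svmlight_line and the example line are assumptions made for illustration, and the sketch does not handle the optional qid field used for ranking.

```python
def parse_svmlight_line(line):
    """Split one SVMlight-format line into (target, {feature_index: value})."""
    # Anything after '#' is treated as a free-text comment and ignored.
    tokens = line.split("#", 1)[0].split()
    target = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return target, features


# Made-up line for illustration; it is not copied from the lab files.
target, features = parse_svmlight_line("+1 2:0.08 5:0.12 # made-up example")
print(target, features)
```

Features that do not appear on a line are not stored in the dictionary; how to interpret them is part of the format description linked above.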
4. What does a line in any of these files correspond to?
5. What is the meaning of a -1 in the first column in the file train25_mini?
6. What is the meaning of the 3:1 in the first line of train25_mini?
You are now ready to play with SVMlight and you can train an SVM by running:
svm_learn trainset svm_model
where svm_learn is a program from the SVMlight collection, trainset is the training data and svm_model is the model file that svm_learn creates from the training data.
You can classify a test data set by running:
svm_classify testset svm_model outfile.out
where svm_classify is a program from the SVMlight collection, testset is the test data set, svm_model is a model obtained by running svm_learn on the training data set, and outfile.out is the output file that svm_classify creates. When running svm_learn and svm_classify, you will have to replace these placeholder names with real file names. Please try to answer the following questions.
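The accuracy asked for in the questions below can also be checked by hand. The following Python sketch assumes that the output file holds one decision value per test example, in the same order as the test file, and that the sign of that value is the predicted class. The function names and the tiny in-memory example are hypothetical; nothing here is taken from the lab data.

```python
def labels_from_lines(lines):
    """Read the true class (+1 or -1) from the first column of SVMlight-format lines."""
    labels = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            labels.append(1 if float(line.split()[0]) > 0 else -1)
    return labels


def predictions_from_lines(lines):
    """Turn one decision value per line into a predicted class by its sign."""
    return [1 if float(line.split()[0]) > 0 else -1 for line in lines if line.strip()]


def accuracy(labels, predictions):
    """Return the fraction of examples whose predicted class matches the true class."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equally long")
    correct = sum(1 for true, pred in zip(labels, predictions) if true == pred)
    return correct / len(labels)


# Made-up test lines and decision values, for illustration only.
test_lines = ["+1 1:0.10 4:0.20", "-1 2:0.30", "-1 3:0.05"]
output_lines = ["0.73", "-1.20", "0.05"]
labels = labels_from_lines(test_lines)
predictions = predictions_from_lines(output_lines)
print(f"accuracy: {accuracy(labels, predictions):.2%}")
```

With real files, the same functions could be given open file handles, for example open("test25") and open("outfile.out"), in place of the lists.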
7. Train an SVM model on train25_mini. Then test the performance of this model on test25. What accuracy did you get?
8. Use the svm_model from question 7 and test it on train25_mini. What is the accuracy? Is this a good way of testing an SVM model?
9. Train an SVM model on a larger training data set, train25, and then test this model on the set test25. What accuracy did you get?
10. Train an SVM model on train100 and test it on test100. What is the accuracy?
11. Do you get a better classification by training and testing with the first 25 residues or with the first 100 residues? How would you explain this result?
12. There are different kernels that can be used when creating an SVM model using svm_learn (see different svm_learn options by running svm_learn --help). The svm_learn flag for selecting a kernel is called -t. Which kernel is used by default? Which kernel gives the highest accuracy when using train25 for building an SVM model and test25 for testing the SVM model?
13. A sequence LOGO is generally created to compare the different positions in a multiple sequence alignment in terms of information content. The taller the stack of letters at a certain position in the LOGO, the more informative or conserved this position is in terms of sequence evolution. You can create sequence LOGOs using the online tool WebLogo. Create two LOGOs, one for the sequences in the attached file signal_sequences.txt and one for the sequences in nonsignal_seqs.txt. Submit these images together with the report. Compare the two LOGOs. Can you observe any differences?
14. There are various methods for predicting signal peptides in protein sequences. These tools have been trained on known signal peptide data and some of them perform really well. The tool SignalP is available online. Use it on the first two human proteins available in the file sequences.txt. Save the result plots of SignalP-NN and submit them together with the report.
15. What do high and low S-scores indicate in the SignalP-NN result?
16. What are the D-scores for the two sequences?
17. The results from SignalP include something called cleavage sites. How do you interpret the term cleavage site in the context of signal peptides?
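As background to the sequence LOGOs in question 13, the height of a position in a LOGO is commonly derived from its information content in bits, R = log2(20) - H for proteins, where H is the Shannon entropy of the residues observed in that column. The Python sketch below computes this quantity for equal-length sequences. It is a simplified illustration under stated assumptions: the function name information_content is hypothetical, gaps are not treated specially, and the small-sample correction that logo tools may apply is omitted, so the values need not match WebLogo exactly.

```python
import math
from collections import Counter


def information_content(sequences, alphabet_size=20):
    """Return the information content in bits for each column of aligned sequences."""
    if not sequences or any(len(s) != len(sequences[0]) for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    max_bits = math.log2(alphabet_size)
    result = []
    for column in zip(*sequences):
        counts = Counter(column)
        total = len(column)
        # Shannon entropy of the residues observed in this column.
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        result.append(max_bits - entropy)
    return result


# Made-up aligned peptides, for illustration only.
aligned = ["MKLLA", "MKVLA", "MRLLS", "MKILA"]
for position, bits in enumerate(information_content(aligned), start=1):
    print(f"position {position}: {bits:.2f} bits")
```

A fully conserved column reaches the maximum of log2(20) bits, while a column with many different residues scores close to zero.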