Lab: Clustering
Instructions
Answer all questions and send your answers to Marco by email.
Please send a PDF file (not .docx, .odt, etc.) named “name_lab_your_name.pdf”.
Grading
The lab reports are training in writing scientific texts and an important part of the course. To pass, you need to provide a final version of the lab report within 7 days after the lab. You have one chance to submit a preliminary version and receive feedback on it (not mandatory). The preliminary version needs to be submitted within 72 hours of the lab, and feedback will be provided at the latest 24 hours before the deadline of the assignment. If the final report is not submitted in time or contains an error, you will get an Fx on the lab course. This means that you have to submit an updated lab report within 7 days after the exam and that the highest grade you can get is an E on the entire course. If you do not submit the updated report in time, you will have to re-register for the course the next time it is given (normally next year) and complete the missing parts that year, and you still cannot receive a grade higher than E. In short, these are the rules for the lab reports:
A. Day 0 – Lab
B. 0-3 days after the lab – Preliminary version
C. 3-6 days after the lab – Feedback
D. 7 days after the lab – Final version
The lab report must be complete upon submission, i.e. you have to provide an answer to each question. The assistant has the right to return / not approve the submission if it is not complete. Linguistic accuracy is a must. A returned report must be sent back to us within a week after you have received it. The resubmitted version must be one where all comments are visible. All comments must be considered carefully, or else the resubmission does not count as a return. A maximum of 1 return is allowed. A second return means that you have to re-register for the course the next time it is given (normally next year) and complete the lab then. Complete reports submitted in time for all bioinformatics and programming labs are compulsory.
Assignment Instructions
It is often the case that, instead of classifying the data in known categories, we are interested in grouping the data points in meaningful sets on the basis of their common properties. Such a task is defined as clustering, and there is a variety of clustering methods available in the literature.
1. Give at least two examples of concrete problems where clustering could be useful.
2. Name two types of clustering methods and describe the concept behind each of them.
3. Give three examples of distance metrics commonly employed in clustering.
4. Assume we have four data points, A, B, C, D, with the distances between them given in the matrix below. Build the dendrogram obtained when applying hierarchical single linkage clustering.
|   | A   | B   | C   | D |
|---|-----|-----|-----|---|
| A |     |     |     |   |
| B | 1.0 |     |     |   |
| C | 1.5 | 2.0 |     |   |
| D | 2.5 | 1.3 | 2.5 |   |
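As a reminder of how single linkage proceeds, here is a minimal sketch in plain Python. It deliberately uses a hypothetical three-point example (P, Q, R with made-up distances), not the matrix from question 4, and the function name `single_linkage` is an illustrative choice rather than part of any library. At each step it merges the two clusters whose closest members are nearest to each other and records the height of the merge.

```python
from itertools import combinations


def single_linkage(labels, dist):
    """Return the merge history of single-linkage agglomerative clustering.

    labels: list of point names.
    dist:   dict mapping frozenset({a, b}) to the distance between a and b.
    """
    clusters = [frozenset([name]) for name in labels]
    merges = []

    def cluster_distance(c1, c2):
        # Single linkage: distance between clusters is the smallest
        # distance between any member of c1 and any member of c2.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        height = cluster_distance(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((sorted(c1), sorted(c2), height))
    return merges


# Hypothetical toy distances (NOT the matrix from question 4).
toy_dist = {
    frozenset(("P", "Q")): 4.0,
    frozenset(("P", "R")): 1.0,
    frozenset(("Q", "R")): 3.0,
}

merges = single_linkage(["P", "Q", "R"], toy_dist)
for left, right, height in merges:
    print(left, "+", right, "at height", height)
```

Each recorded merge corresponds to one junction in the dendrogram, drawn at the recorded height.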
K-means is another commonly used clustering algorithm. We will use k-means to cluster cell cycle expression data from yeast. The data are found in a file (expression_parsed.dat) where each row is the “expression profile” of one gene and each column is the expression level measured at a different time point. You also have a file (ids.txt) with the gene IDs listed in the same order as the expression profiles. A broad functional classification is also provided in the file (see functioncodes.txt for details).
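The row-per-gene layout described above can be read with a few lines of plain Python. This is only a sketch under assumptions that should be checked against the actual files: that expression_parsed.dat holds whitespace-separated numbers, and that the gene ID is the first whitespace-separated token on each line of ids.txt. The helper names (`parse_profiles`, `parse_ids`, `load`) and the demo values are hypothetical.

```python
def parse_profiles(lines):
    """Turn text rows into lists of floats, one list per gene."""
    profiles = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            profiles.append([float(x) for x in fields])
    return profiles


def parse_ids(lines):
    """Take the first token of each non-blank line as the gene ID (assumption)."""
    return [line.split()[0] for line in lines if line.strip()]


def load(path, parser):
    """Apply one of the parsers above to a file on disk."""
    with open(path) as handle:
        return parser(handle)


# Hypothetical in-memory stand-ins for the two files, so the sketch
# runs without the lab data.
demo_profiles = parse_profiles(["0.1 -0.4 1.2\n", "\n", "0.9 0.3 -0.7\n"])
demo_ids = parse_ids(["GENE_A 1\n", "GENE_B 2\n"])

# The two files are only useful together if they have the same number of rows.
assert len(demo_profiles) == len(demo_ids)
print(list(zip(demo_ids, demo_profiles)))
```

With the real files, the calls would be `load("expression_parsed.dat", parse_profiles)` and `load("ids.txt", parse_ids)`, run from inside the clustering directory.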
EXERCISES:
5. Download the file clustering.tar and decompress it (tar xfv clustering.tar). Enter this directory (cd clustering).
6. Open the file clustering_scikit.py. It loads the data from a text file and performs the k-means algorithm. Do you understand what it does? How many clusters does it create by default? You may want to look at the documentation: http://scikit-learn.org/stable/modules/clustering.html
7. Run the program. Modify it as needed so you get the names of the sequences belonging to each cluster.
8. What does k in k-means mean?
9. If you set k=2, how many sequences do you get in the smallest cluster?
10. Change to k=3. How many sequences do you now get in the smallest cluster(s)? You can also try other values for k.
11. For k=3, what is the function of the genes in the two smaller clusters? Look at the functional classification and also use your database search skills. NB: the numbering of the clusters may change between runs.
12. There is a sequence (YDL115C) of unknown function in the dataset. Can you make a prediction of the functional class of this protein (according to the classification in functioncodes.txt)?
13. For comparison, take a look at the result of a neighbour joining clustering (nj_clustering). This tree was created from the same data set; however, before the tree was calculated, the correlation between the expression profiles of each gene pair was computed. What do you think are the advantages/disadvantages of the two clustering methods? When is k-means clustering useful and when is NJ preferable?
14. Modify the program to use a different clustering algorithm. Do you get significantly different results? Which one would you trust most? You will find a list and descriptions of algorithms already available here: http://scikit-learn.org/stable/modules/clustering.html#k-means
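For the exercises that ask which sequences fall in each cluster and how large the smallest cluster is, one possible approach is to pair each gene ID with its cluster label and group by label. The sketch below is plain Python and makes assumptions: that the clustering step yields one integer label per row, in the same order as ids.txt, and that the IDs have already been read into a list. The function name `group_by_cluster` and the demo IDs and labels are hypothetical.

```python
from collections import defaultdict


def group_by_cluster(ids, labels):
    """Map each cluster label to the list of gene IDs assigned to it."""
    if len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")
    clusters = defaultdict(list)
    for gene_id, label in zip(ids, labels):
        clusters[label].append(gene_id)
    return dict(clusters)


# Hypothetical IDs and labels standing in for the real clustering output.
demo_ids = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
demo_labels = [0, 1, 0, 2, 0]

clusters = group_by_cluster(demo_ids, demo_labels)

# Print clusters from smallest to largest: label, size, members.
for label, members in sorted(clusters.items(), key=lambda item: len(item[1])):
    print(label, len(members), members)
```

Because the grouping only needs a list of labels, the same helper can be reused unchanged after swapping k-means for another algorithm, which makes the results easier to compare. Keep in mind that the label numbers themselves are arbitrary and may differ between runs.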