Python: String Operations and I/O
Session 3: String operations and I/O
Content
- String manipulation
- Reading and writing files
- Input/Output
Reading
https://docs.python.org/3/tutorial/
Refresh chapters 3.1.2 and 7.
Some extra useful things on strings:
http://www.diveintopython3.net/strings.html
https://docs.python.org/3/library/stdtypes.html#string-methods
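As a quick refresher on the string methods documented in these references, here is a minimal sketch. The sequences and variable names are made up for illustration; they are not part of the template.

```python
# Illustrative values only; these names are not part of the course template.
dna = "GATTACA"

# str.replace returns a new string; the original is left unchanged.
rna = dna.replace("T", "U")

# str.maketrans builds a character mapping that str.translate applies in one pass.
pairs = str.maketrans("ACGT", "TGCA")
complement = dna.translate(pairs)

# Comparing sets is a compact way to ask "does it contain only these letters?".
only_dna_letters = set(dna) <= set("GATC")

# Case helpers: title() capitalises each word, isupper() tests one character.
title = "just like so".title()
n_upper = sum(1 for ch in "Hello World" if ch.isupper())

print(rna, complement, only_dna_letters, title, n_upper)
```

Strings are immutable in Python, so every method here returns a new object instead of modifying `dna`.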
Grading
The lab reports are practice in writing scientific texts and an important part of the course. To pass, you need to submit a final version of the lab report within 7 days of the lab. You may, but do not have to, submit a preliminary version and receive feedback on it. The preliminary version must be submitted within 72 hours of the lab, and feedback will be provided no later than 24 hours before the deadline of the assignment. If the final report is not submitted in time, or it contains an error, you will get an Fx on the lab course. This means that you have to submit an updated lab report within 7 days after the exam, and that you will receive an E on the entire course. If you do not submit the updated report in that time, you will have to re-register for the course the next time it is given (normally next year) and complete the missing parts that year, and you still cannot receive a grade higher than E.
Exercises
Only the mandatory assignments should be handed in. The file name should be yourname_yoursurname_sess3.py. Ignore accents and special characters. Before submitting, please make sure your code is correct and send it to david.menendez.hurtado@scilifelab.se.
You can find the template and the attachments here: https://gist.github.com/Dapid/f626f7bc8606ca0a40d735fd672efc09
Don’t hesitate to ask us to clarify if you have questions.
Introductory exercises
- Make a function that translates DNA into RNA. All the occurrences of T change to U.
- Write a function that gets the complement of a DNA sequence.
- Given a string of characters, write a function that returns True if it is a valid DNA sequence, and False if it isn’t. A valid DNA sequence contains only the letters G, A, T and C.
- Make a program that reads in a file and writes back into a new file only the odd lines (every other line).
- In the previous exercise, what happens if the input and output files have the same name?
- Write a parser for a Multiple Sequence Alignment in Jones/ALN format. In this format, each line contains one protein sequence.
  PEPTIDE
  PEPT-KE
  Save them in a list, one line per element:
  ['PEPTIDE', 'PEPT-KE']
- What happens if you forget to close a file in read mode? And in write mode?
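The file-handling exercises above can be approached with the `with` statement, which closes the file for you when the block ends. The sketch below is one possible approach, not a reference solution: the function names, the file names and the third sequence are made up for illustration, and "odd lines" is assumed to mean lines 1, 3, 5, ... counting from 1.

```python
import os
import tempfile


def read_alignment(file_name):
    """Return the non-empty lines of a one-sequence-per-line file as a list."""
    with open(file_name) as handle:  # closed automatically when the block ends
        return [line.strip() for line in handle if line.strip()]


def copy_odd_lines(in_name, out_name):
    """Write lines 1, 3, 5, ... of in_name to out_name (assumed meaning of "odd")."""
    with open(in_name) as source, open(out_name, "w") as target:
        for number, line in enumerate(source, start=1):
            if number % 2 == 1:
                target.write(line)


# Usage in a throwaway directory; the third sequence is invented for this example.
workdir = tempfile.mkdtemp()
aln_path = os.path.join(workdir, "example.aln")
odd_path = os.path.join(workdir, "odd.aln")
with open(aln_path, "w") as handle:
    handle.write("PEPTIDE\nPEPT-KE\nPAPTIDE\n")

sequences = read_alignment(aln_path)
copy_odd_lines(aln_path, odd_path)
odd_sequences = read_alignment(odd_path)
print(sequences)
print(odd_sequences)
```

The input and output paths are deliberately different here. Opening a file in "w" mode empties it immediately, which is worth keeping in mind for the question about using the same name twice.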
Mandatory Assignments
- Given a string of characters, write a function is_protein(seq) that returns True if it is a valid protein sequence, else False. We consider a valid protein to be something that contains only valid amino acids: ACDEFGHIKLMNPQRSTVWY in capital letters.
- Define a function longest_line(file_name) that takes a file name as input and returns the length of the longest line in the file.
- Write a parser for a FASTA file: each sequence has a header in the previous line starting with the character “>”. Call this function parse_fasta and return the results as a list. Ignore the headers.*
- Modify the exercises of the introduction to take DNA from a FASTA file and write its complements to another file. For each header, add the word “complement”.
* Note: please ignore the line length limit; every protein will be on a single line. That is, in this exercise, you won’t find:
> header5
PEPT
IDE
Additional assignments
- Make a program that asks you to type something on the keyboard, and prints it In Title Case, Just Like So.
- Write a function that takes a sentence as a string and counts the number of capital and lowercase letters.
- Write a translator from RNA sequence to protein sequence. You may make use of the codon table found further down. Given: AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA, you should obtain: MAMAPRTEINSTRING
- Same as longest_line(file_name), but return the third longest line in the file instead.
- Modify the RNA translator to take an RNA string with possibly multiple proteins and write them back in Jones format.
Appendix: RNA codon table
UUU F CUU L AUU I GUU V
UUC F CUC L AUC I GUC V
UUA L CUA L AUA I GUA V
UUG L CUG L AUG M GUG V
UCU S CCU P ACU T GCU A
UCC S CCC P ACC T GCC A
UCA S CCA P ACA T GCA A
UCG S CCG P ACG T GCG A
UAU Y CAU H AAU N GAU D
UAC Y CAC H AAC N GAC D
UAA Stop CAA Q AAA K GAA E
UAG Stop CAG Q AAG K GAG E
UGU C CGU R AGU S GGU G
UGC C CGC R AGC S GGC G
UGA Stop CGA R AGA R GGA G
UGG W CGG R AGG R GGG G
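
The table above can be turned into a Python dictionary without typing 64 entries by hand. The following is a minimal sketch of one way to do it, not a reference solution: the names `CODON_TABLE_TEXT`, `build_codon_table` and `translate` are made up for illustration, and the translator is assumed to start at the first letter, stop at the first Stop codon and ignore an incomplete trailing codon.

```python
# The codon table from the appendix, pasted as text: codon, amino acid, codon, ...
CODON_TABLE_TEXT = """
UUU F CUU L AUU I GUU V
UUC F CUC L AUC I GUC V
UUA L CUA L AUA I GUA V
UUG L CUG L AUG M GUG V
UCU S CCU P ACU T GCU A
UCC S CCC P ACC T GCC A
UCA S CCA P ACA T GCA A
UCG S CCG P ACG T GCG A
UAU Y CAU H AAU N GAU D
UAC Y CAC H AAC N GAC D
UAA Stop CAA Q AAA K GAA E
UAG Stop CAG Q AAG K GAG E
UGU C CGU R AGU S GGU G
UGC C CGC R AGC S GGC G
UGA Stop CGA R AGA R GGA G
UGG W CGG R AGG R GGG G
"""


def build_codon_table(text):
    """Pair every codon (even positions) with the token that follows it."""
    tokens = text.split()
    return dict(zip(tokens[0::2], tokens[1::2]))


CODONS = build_codon_table(CODON_TABLE_TEXT)


def translate(rna):
    """Translate codon by codon from position 0 until the first Stop codon."""
    protein = []
    for start in range(0, len(rna) - 2, 3):
        amino_acid = CODONS[rna[start:start + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"))
# prints MAMAPRTEINSTRING
```

Handling a string with several proteins, as in the last additional assignment, would need more than this sketch does: it stops at the first Stop codon and never looks for a new start.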