thoraxe.transcript_info package

Submodules

thoraxe.transcript_info.clusters module

clusters: Helper module for working with sets as clusters.

A cluster is a set of elements, and clusters are stored together in a list.

thoraxe.transcript_info.clusters.cluster2str(cluster, delim='/', item2str=<class 'str'>)

Take a cluster (set) and return a string representation.

>>> cluster2str({1, 10, 100})
'1/10/100'
>>> cluster2str({1, 10, 100}, delim=';')
'1;10;100'
>>> names = {1 : 'a', 10 : 'b'}
>>> cluster2str({1, 10, 100}, item2str=lambda x: names.get(x, str(x)))
'a/b/100'
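
The doctests above can be reproduced with a few lines of plain Python. The following is only a minimal sketch, not the package's code: the name cluster2str_sketch is hypothetical, and sorting the items before joining is an assumption made so that the output is reproducible, since sets have no order.

```python
def cluster2str_sketch(cluster, delim="/", item2str=str):
    """Join the items of a cluster (set) into a single string."""
    # Sets are unordered, so sort the items first (assumption) to get
    # the same string every time.
    return delim.join(item2str(item) for item in sorted(cluster))


names = {1: "a", 10: "b"}
print(cluster2str_sketch({1, 10, 100}))
print(cluster2str_sketch({1, 10, 100}, item2str=lambda x: names.get(x, str(x))))
```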
thoraxe.transcript_info.clusters.fill_clusters(clusters, element_i, element_j)

Fill the list of clusters (sets) with two elements from the same cluster.

>>> clusters = []
>>> fill_clusters(clusters, 1, 10)
[{1, 10}]
>>> fill_clusters(clusters, 10, 100)
[{1, 10, 100}]
>>> fill_clusters(clusters, 20, 30)
[{1, 10, 100}, {20, 30}]
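
One way to obtain the behavior shown in these doctests is sketched below. It is an illustration under stated assumptions, not the package's implementation: the name fill_clusters_sketch is hypothetical, and merging two existing clusters when the pair bridges them is an assumption that the doctests do not cover.

```python
def fill_clusters_sketch(clusters, element_i, element_j):
    """Add a pair of co-clustered elements to a list of sets, in place."""
    pair = {element_i, element_j}
    # Clusters that already contain at least one of the two elements.
    hits = [cluster for cluster in clusters if cluster & pair]
    if not hits:
        clusters.append(pair)
        return clusters
    # Grow the first matching cluster with the pair.
    merged = hits[0]
    merged |= pair
    # Assumption: if the pair links several clusters, fold them together.
    for other in hits[1:]:
        merged |= other
    absorbed = {id(cluster) for cluster in hits[1:]}
    clusters[:] = [cluster for cluster in clusters if id(cluster) not in absorbed]
    return clusters
```

The list is modified in place and also returned, which matches the way the doctests reuse the same clusters variable between calls.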
thoraxe.transcript_info.clusters.inform_about_deletions(todelete, message)

Warn about elements that are going to be deleted.

inform_about_deletions({1, 2, 3}, "Identical sequences were found:") will log:

WARNING:root:Identical sequences were found:
WARNING:root:deleting 1
WARNING:root:deleting 2
WARNING:root:deleting 3
thoraxe.transcript_info.clusters.set_to_delete(set_list)

Return the set to delete given a list of sets.

It keeps the first element of each set.

>>> set_to_delete([{1, 2, 3}, {4, 5}]) # it keeps 1 and 4
{2, 3, 5}
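
A possible reading of "keeps the first element" is sketched here. Because Python sets are unordered, this sketch assumes that "first" means the smallest item, which agrees with the doctest (1 and 4 are kept). The name set_to_delete_sketch is hypothetical.

```python
def set_to_delete_sketch(set_list):
    """Return every item except the one kept from each set."""
    to_delete = set()
    for group in set_list:
        # Keep the smallest item (assumption) and mark the rest for deletion.
        to_delete.update(sorted(group)[1:])
    return to_delete


print(set_to_delete_sketch([{1, 2, 3}, {4, 5}]))
```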

thoraxe.transcript_info.exon_clustering module

Exon clustering: Functions to cluster exons using pairwise alignments.

thoraxe.transcript_info.exon_clustering.align(seq_a, seq_b, gap_open_penalty, gap_extend_penalty, substitution_matrix)

Align the sequences using the BioPython pairwise2 aligner.

thoraxe.transcript_info.exon_clustering.coverage(seq, seq_len)

Return coverage of the sequence in the alignment.

>>> coverage("AAAA----", 8)
50.0
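
The doctest is consistent with counting the non-gap characters of the aligned sequence and dividing by seq_len. A minimal sketch under that assumption follows; coverage_sketch is a hypothetical name and the formula is inferred from the example, not taken from the package source.

```python
def coverage_sketch(aligned_seq, seq_len):
    """Percentage of seq_len covered by non-gap characters."""
    residues = sum(1 for char in aligned_seq if char != "-")
    return 100.0 * residues / seq_len


print(coverage_sketch("AAAA----", 8))
```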
thoraxe.transcript_info.exon_clustering.coverage_shortest(seq_query, seq_target, seq_len)

Return coverage of the shortest sequence in the alignment.

>>> coverage_shortest("AAAA----", "AAAAAAAA", 8)
50.0
thoraxe.transcript_info.exon_clustering.exon_clustering(trx_data, minimum_len=4, coverage_cutoff=80.0, percent_identity_cutoff=30.0, gap_open_penalty=-10, gap_extend_penalty=-1, substitution_matrix=None, merge_clusters=False)

Cluster exons based on their sequence identity after local alignment.

It uses a Hobohm I sequence clustering algorithm to perform a fast clustering of the exons. Exons are sorted from the longest to the shortest before clustering starts, but the returned table has the same order as the input. The returned data frame has two extra columns, Cluster and QueryExon, that contain the cluster number and the name of the query sequence used to join the exon to that cluster.

The alignment is performed using Smith-Waterman from BioPython. The keyword arguments gap_open_penalty, gap_extend_penalty and substitution_matrix are passed to pairwise2.

Exons shorter than minimum_len (default: 4) are not clustered. Non-clustered exons have Cluster number 0 and an empty string as QueryExon.

The coverage and percent identity cutoff to decide if an exon sequence belongs to one cluster can be modified with the keyword arguments coverage_cutoff and percent_identity_cutoff (default to 80.0 and 30.0, respectively).

Exons are aligned against others until they reach a percent identity higher than percent_identity_cutoff plus 30, to allow low identity matches to be replaced by better ones.
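
The general shape of a Hobohm I style greedy clustering can be sketched as follows. This is a simplified illustration, not ThorAxe's implementation: the names toy_identity and hobohm1_sketch are hypothetical, difflib's similarity ratio stands in for a real local alignment, coverage is not checked, and the search stops at the first representative that passes the cutoff instead of continuing in search of a better match.

```python
from difflib import SequenceMatcher


def toy_identity(seq_a, seq_b):
    """Stand-in for an alignment-based percent identity (assumption)."""
    return 100.0 * SequenceMatcher(None, seq_a, seq_b).ratio()


def hobohm1_sketch(sequences, identity_cutoff=30.0, minimum_len=4):
    """Return {name: cluster number}; 0 means the sequence was not clustered."""
    clusters = {name: 0 for name in sequences}
    representatives = []  # (cluster number, representative sequence)
    # Longest sequences first, so that they become cluster representatives.
    by_length = sorted(sequences, key=lambda name: len(sequences[name]), reverse=True)
    for name in by_length:
        seq = sequences[name]
        if len(seq) < minimum_len:
            continue  # too short: keeps cluster number 0
        for number, representative in representatives:
            if toy_identity(seq, representative) >= identity_cutoff:
                clusters[name] = number
                break
        else:
            # No representative was similar enough: start a new cluster.
            representatives.append((len(representatives) + 1, seq))
            clusters[name] = len(representatives)
    return clusters


exons = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQ", "c": "WWWWWWWW", "d": "MK"}
print(hobohm1_sketch(exons, identity_cutoff=80.0))
```

Sorting by length first means each cluster is seeded by its longest member, and every other sequence is compared only against those seeds, which is what keeps this family of algorithms fast.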

thoraxe.transcript_info.exon_clustering.percent_identity(query, target)

Return percent identity of the aligned sequences.

>>> percent_identity("AA---", "AAAA-")
50.0
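
One definition that reproduces the doctest counts identical columns and divides by the number of columns in which at least one sequence has a residue. The sketch below assumes that denominator; other choices (such as the length of the longer ungapped sequence) give the same value for this example, so the formula is an assumption and percent_identity_sketch is a hypothetical name.

```python
def percent_identity_sketch(query, target):
    """Percent of identical columns, ignoring columns that are all gaps."""
    matches = 0
    aligned = 0
    for q_char, t_char in zip(query, target):
        if q_char == "-" and t_char == "-":
            continue  # a column with gaps in both sequences is ignored
        aligned += 1
        if q_char == t_char:
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0


print(percent_identity_sketch("AA---", "AAAA-"))
```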

thoraxe.transcript_info.phases module

phases: Calculate exon phases and check exon order and phases.

This is the code for calculating exon start and end phases that we used before downloading that data from ENSEMBL. The code was changed to be used as a checker for the downloaded and parsed data. This module also has functions used to calculate the subexon phases.

This could be useful to help in the understanding of exon start and end phases.

thoraxe.transcript_info.phases.bases_to_complete_next_codon(phase)

Return the number of bases at the exon end that complete the next codon.

>>> bases_to_complete_next_codon(0) # e.g. -----XXX
0
>>> bases_to_complete_next_codon(2) # e.g. XX-----X
2
>>> bases_to_complete_next_codon(1) # e.g. X-----XX
1
thoraxe.transcript_info.phases.bases_to_complete_previous_codon(phase)

Return the number of bases at the exon start that complete the previous codon.

>>> bases_to_complete_previous_codon(0) # e.g. -----XXX
0
>>> bases_to_complete_previous_codon(2) # e.g. XX-----X
1
>>> bases_to_complete_previous_codon(1) # e.g. X-----XX
2
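
The three doctest values follow a simple modular rule: a phase of p means that p bases of the split codon lie in the previous exon, so (3 - p) mod 3 bases are needed to complete it. A minimal sketch, assuming phases are restricted to 0, 1 and 2 (the name is hypothetical and a phase of -1 is rejected):

```python
def bases_to_complete_previous_codon_sketch(phase):
    """Number of bases at the exon start that finish the previous codon."""
    if phase not in (0, 1, 2):
        raise ValueError("this sketch only handles phases 0, 1 and 2")
    return (3 - phase) % 3


for phase in (0, 1, 2):
    print(phase, bases_to_complete_previous_codon_sketch(phase))
```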
thoraxe.transcript_info.phases.calculate_phase(cdna_len, previous_end_phase)

Calculate exon start and end phases using the intron phase.

It calculates phases as explained in the ENSEMBL glossary: http://www.ensembl.org/Multi/Help/Glossary

“In protein-coding exons, the end phase is the place where the intron lands inside the codon : 0 between codons, 1 between the 1st and second base, 2 between the second and 3rd base. Exons therefore have a start phase and an end phase, but introns have just one phase. An exon which is non coding (or non-coding at the end) has an end phase of -1”

Under this definition, the start phase of the current exon is the end phase of the previous one, and the end phase of the current exon is the start phase of the next one.

Let X, y and z denote codon bases that belong to exons X, y and z. Let - denote intron bases.

0 - No interruption.

XXX-----yyyzzz

1 - first codon’s first base is in the previous exon.

X-----XXyyyzzz

2 - first codon’s first two bases are in the previous exon.

XX-----Xyyyzzz

NOTE: This function cannot calculate a phase of -1.

>>> calculate_phase(6, 0) # e.g. -----XXXyyy----zzz
(0, 0)
>>> calculate_phase(6, 2) # e.g. XX-----Xyyyzz----z
(2, 2)
>>> calculate_phase(6, 1) # e.g. X-----XXyyyz----zz
(1, 1)
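
The doctests agree with simple modular arithmetic: the start phase is the end phase of the previous exon, and the end phase is the start phase plus the coding length, modulo 3. The sketch below assumes exactly that and, like the documented function, never yields -1. Both function names are hypothetical.

```python
def calculate_phase_sketch(cdna_len, previous_end_phase):
    """Return (start_phase, end_phase) for a fully coding exon."""
    start_phase = previous_end_phase
    end_phase = (start_phase + cdna_len) % 3
    return start_phase, end_phase


def phases_along_transcript(exon_lengths):
    """Chain the phases of consecutive coding exons of one transcript."""
    phases = []
    end_phase = 0  # assumption: the first coding exon starts at phase 0
    for length in exon_lengths:
        start_phase, end_phase = calculate_phase_sketch(length, end_phase)
        phases.append((start_phase, end_phase))
    return phases


print(phases_along_transcript([4, 5, 3]))
```

For exons of coding length 4, 5 and 3, the first exon ends one base into a codon (end phase 1), the second exon supplies the two missing bases plus one full codon (end phase 0), and the third is a whole codon.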
thoraxe.transcript_info.phases.check_order_and_phases(data_frame)

Check DataFrame order and exon start/end phases.

It takes a data frame with the exon data, including the sequences, such as the output of add_exon_sequences applied to the output of read_exon_file.

The rows of the input data frame must be sorted by ‘Transcript stable ID’ and ‘ExonRank’. The function read_exon_file ensures that order.

It checks that order, and it also checks the exon start and end phase information. It raises an informative error if something is wrong.

thoraxe.transcript_info.phases.end_phase_previous_exon(data_frame, exon_pos, prev_row_index, end_phase_column='EndPhase')

Return the end phase of the previous exon.

It returns 0 if the current exon is the first in the transcript.

thoraxe.transcript_info.transcript_info module

Functions that use pandas to read transcript information.

thoraxe.transcript_info.transcript_info.add_exon_sequences(data_frame, sequence_file)

Add exon sequences to the DataFrame.

This function adds an ‘ExonSequence’ column at the end of ‘data_frame’. For each row, this column has the BioPython SeqRecord for the exon sequence of that row. The sequence description in the sequence_file (in fasta format) should have the ‘ExonID’ as its second element when the description is split by ‘ ‘. The data_frame should have an ‘ExonID’ column and should not have an ‘ExonSequence’ column. ‘ExonID’ is used to match each SeqRecord to its row.
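
The matching rule described above (the exon identifier is the second space-separated field of the FASTA description) can be illustrated without BioPython. This is only a sketch with plain strings instead of SeqRecord objects; the function names and the identifiers in the example are hypothetical.

```python
def exon_id_from_description(description):
    """Return the second space-separated field of a FASTA description."""
    fields = description.split(" ")
    if len(fields) < 2:
        raise ValueError("description has no second field: " + repr(description))
    return fields[1]


def index_fasta_by_exon_id(fasta_text):
    """Map each exon identifier to its sequence (as a plain string)."""
    sequences = {}
    exon_id = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            exon_id = exon_id_from_description(line[1:])
            sequences[exon_id] = ""
        elif exon_id is not None:
            sequences[exon_id] += line
    return sequences


# Made-up identifiers, used only to show the expected header layout.
example = ">GENE1 EXON1 extra\nATGGC\n>GENE1 EXON2 extra\nTAAA\nTAA\n"
print(index_fasta_by_exon_id(example))
```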

thoraxe.transcript_info.transcript_info.add_protein_seq(data_frame, allow_incomplete_cds=True, seq_column='ExonProteinSequence')

Add a column with the protein sequence of the exon.

It takes a data frame with the exon data, including the sequences, such as the output of add_exon_sequences applied to the output of read_exon_file.

thoraxe.transcript_info.transcript_info.delete_badquality_sequences(data_frame)

Delete, in place, protein sequences that contain X’s.

thoraxe.transcript_info.transcript_info.delete_identical_sequences(data_frame, seq_column='ExonProteinSequence')

Delete identical sequences in place, keeping only one of each.

thoraxe.transcript_info.transcript_info.delete_incomplete_sequences(data_frame, seq_column='ExonProteinSequence', trx_column='TranscriptID')

Delete incomplete sequences in place.

Incomplete sequences are the ones whose protein sequence does not end with ‘*’. It also deletes sequences that probably have an incomplete CDS because they start or end with a phase different from 0 or -1.

thoraxe.transcript_info.transcript_info.find_identical_exons(data_frame, exon_id_column='ExonID', seq_column='ExonProteinSequence')

Find exons that have similar coordinates and identical sequences.

thoraxe.transcript_info.transcript_info.find_identical_sequences(data_frame)

Find different transcripts with identical sequences.

Input should have the exon sequences as the output of add_exon_sequences. It returns a list with the identical sequence clusters (sets).
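
Grouping identical sequences into clusters (sets) can be done with a dictionary keyed by sequence. The sketch below shows the idea on a plain dict instead of a DataFrame; the function name, the input layout and the transcript identifiers are hypothetical.

```python
from collections import defaultdict


def identical_sequence_clusters(sequences):
    """Return a list of sets of identifiers that share the same sequence.

    `sequences` maps a transcript identifier to its full sequence.
    """
    by_sequence = defaultdict(set)
    for transcript_id, sequence in sequences.items():
        by_sequence[sequence].add(transcript_id)
    # Only sequences seen more than once form a cluster of duplicates.
    return [ids for ids in by_sequence.values() if len(ids) > 1]


print(identical_sequence_clusters({"t1": "MA*", "t2": "MK*", "t3": "MA*"}))
```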

thoraxe.transcript_info.transcript_info.merge_identical_exons(data_frame, exon_id_column='ExonID', seq_column='ExonProteinSequence')

Unify the ‘ExonID’ of identical exons.

thoraxe.transcript_info.transcript_info.read_exon_file(exon_table_file)

Read the exon_table_file and return a pandas DataFrame.

thoraxe.transcript_info.transcript_info.read_transcript_info(tsl_table_file, exon_table_file, exon_sequence_file, max_tsl_level=3.0, remove_na=True, remove_badquality=True, species_list=None)

Read and integrate the transcript information.

Due to the data structure downloaded from ENSEMBL, we combine 3 types of primary information:

  1. tsl_table_file has the evidence for the gene transcripts. For each transcript, it gives information on its type and its evidence level. We use this information to filter out transcripts that are not interesting for our analysis (e.g. ncRNA, partial transcripts). At the moment, we keep transcripts with the ‘Protein coding’ biotype and a TSL evidence level of 1, 2 or 3 by default (max_tsl_level). If remove_na is True (default), rows with missing TSL data are eliminated.

  2. Each row of the exon_table_file summarizes the information for a transcript exon, as provided by Biomart. The sorting of the exons is important and it is done using the rank information provided by ENSEMBL (not coordinates).

  3. The genomic sequence information from the coding sequence of all the exons in fasta format is in exon_sequence_file.

This function also processes the information within each transcript to get the peptide sequence of each exon, taking the exon phases into account. Codons overlapping an exon boundary are always added to the end of the first exon. This differs from the older implementation, where they were added to the beginning of the next exon.
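
The boundary rule (a codon split by an intron is assigned to the exon where the codon starts) can be illustrated with a small standalone translation. This is a sketch, not the package's code: exon_peptides_sketch is a hypothetical name, the input is assumed to be uppercase coding DNA, and the first exon is assumed to start at phase 0.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def exon_peptides_sketch(exon_cdnas):
    """Translate each exon; a split codon goes to the exon where it starts."""
    cds = "".join(exon_cdnas)
    peptides = []
    exon_start = 0
    for cdna in exon_cdnas:
        exon_end = exon_start + len(cdna)
        # First codon start at or after the exon start (ceiling to a multiple of 3).
        first_codon = -(-exon_start // 3) * 3
        peptide = []
        for pos in range(first_codon, exon_end, 3):
            codon = cds[pos:pos + 3]
            if len(codon) == 3:  # skip a trailing incomplete codon
                peptide.append(CODON_TABLE.get(codon, "X"))
        peptides.append("".join(peptide))
        exon_start = exon_end
    return peptides


# ATG GC|T AAA TAA: the codon GCT is split between the two exons.
print(exon_peptides_sketch(["ATGGC", "TAAATAA"]))
```

In the example, the alanine encoded by the split codon GCT ends up at the end of the first exon's peptide, so the two exons yield "MA" and "K*".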

This protein sequence information is used by this function to delete incomplete and identical sequences and to merge identical exons.

If remove_badquality is True (default), protein sequences with X’s are deleted.

thoraxe.transcript_info.transcript_info.read_tsl_file(tsl_file, max_tsl_level, remove_na=False, species_list=None)

Read a csv file with the Transcript Support Level (TSL) data.

max_tsl_level determines the maximum Transcript Support Level (TSL) level to keep. The options are:

  • 1.0 : All splice junctions of the transcript are supported by at least one non-suspect mRNA.

  • 2.0 : The best supporting mRNA is flagged as suspect or the support is from multiple ESTs (Expressed Sequence Tags).

  • 3.0 : The only support is from a single EST.

  • 4.0 : The best supporting EST is flagged as suspect.

  • 5.0 : No single transcript supports the model structure.

If remove_na is True, rows with missing TSL are not included (default: False).

It returns a pandas DataFrame with MultiIndex: (‘Species’, ‘TranscriptID’).
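
The filtering rule described above can be sketched on plain dictionaries instead of a DataFrame. The function name and the field names (transcript, tsl) are hypothetical, and a missing TSL is represented by None.

```python
def filter_by_tsl_sketch(rows, max_tsl_level, remove_na=False):
    """Keep rows whose TSL is at most max_tsl_level."""
    kept = []
    for row in rows:
        level = row.get("tsl")  # hypothetical field name
        if level is None:
            # Rows with missing TSL are kept unless remove_na is True.
            if not remove_na:
                kept.append(row)
        elif level <= max_tsl_level:
            kept.append(row)
    return kept


rows = [
    {"transcript": "t1", "tsl": 1.0},
    {"transcript": "t2", "tsl": 5.0},
    {"transcript": "t3", "tsl": None},
]
print(filter_by_tsl_sketch(rows, max_tsl_level=3.0, remove_na=True))
```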

thoraxe.transcript_info.transcript_info.store_cluster(data_frame, cluster_list, default_values, column_name, item2str=<class 'str'>, get_item=<function <lambda>>, delim='/')

Store the string representation of clusters in the data_frame.

Parameters:
  • data_frame – DataFrame that’s going to store the cluster column.

  • cluster_list – List of sets.

  • default_values – Name of existing DataFrame column that contains the default (string) values for the new column.

  • column_name – Name of the new column to add with the cluster information.

  • item2str – Function to map from cluster item to its representation in the DataFrame (default: str)

  • get_item – Function to get the cluster item from the DataFrame row and the index of the default_values column (default: default value).

  • delim – Character delimiting cluster items (default: ‘/’).

>>> import pandas as pd
>>> cluster_list = [{2, 3}]
>>> df = pd.DataFrame(data={'id': [1, 2, 3, 4], 'value': [1, 2, 2, 3]})
>>> store_cluster(df, cluster_list, 'id', 'cluster')
>>> df
   id  value cluster
0   1      1       1
1   2      2     2/3
2   3      2     2/3
3   4      3       4

Module contents

transcript_info: Module to read and manage transcript information.

It performs the first exon clustering of the pipeline.