thoraxe package

Subpackages

Submodules

thoraxe.thoraxe module

ThorAxe pipeline and script functions.

thoraxe.thoraxe.add_s_exon_phases_and_coordinates(tbl)

Add s-exon genomic coordinates and phases to the tidy s-exon table.

For each s-exon, it adds the S_exon_Genomic_Sequence and its genomic coordinates, which start at 1 and define the closed interval [S_exon_CodingStart, S_exon_CodingEnd]. For genes in the reverse strand, S_exon_CodingStart is greater than or equal to S_exon_CodingEnd. It also adds the S_exon_StartPhase and S_exon_EndPhase for those intervals.

An s-exon, defined at the protein level, can be an entire sub-exon or a part of it. In cases where an s-exon is identical to a sub-exon, the genomic coordinates and the phases are the same. However, when a sub-exon is split into several s-exons, coordinates and phases need to be calculated:

Example 1: s-exons in the positive strand.

The following sub-exon shares codons with the previous and following sub-exons. In this example, - represents intron nucleotides and lower case characters represent exon nucleotides. The start and end phases are 1 and 2, respectively:

...v----vvwwwxxxyyyzz-----z...

Because we are in the positive strand, the residue V coming from the vvv shared codon is assigned to this sub-exon protein sequence. In the same way, the Z residue coming from the zzz shared codon is assigned to the beginning of the next sub-exon:

VWXY

Let’s say that this sub-exon is composed of two s-exons: VW and XY. Then the corresponding genomic sequences are vvwww and xxxyyyzz and the (start, end) phases are (1, 0) and (0, 2).

Genomic coordinates.

Let’s say that the first v has the genomic coordinate 1:

    v ---- vvwww xxxyyyzz ----- z
               1 11111111 12222 2
    1 2345 67890 12345678 90123 4

In this way, the first s-exon has the genomic coordinates [6, 10] and the second has the genomic coordinates [11, 18]. The first s-exon has 2 amino-acid residues, but 5 instead of 6 bases because the first codon is shared between two s-exons. In particular, the amino-acid residue V needs two bases in this s-exon to complete the codon started in the previous s-exon.

Example 2: s-exons in the negative strand.

    v ---- vvwww xxxyyyzz ----- z
    2 2222 11111 11111
    4 3210 98765 43210987 65432 1

If the same sequence belongs to the negative strand, the sub-exon protein sequence contains Z but not V: WXYZ. Possible s-exons are therefore W and XYZ; the corresponding genomic sequences are vvwww and xxxyyyzz, and the (start, end) phases are (1, 0) and (0, 2).

In this way, the first s-exon has the genomic coordinates [19, 15] and the second has the genomic coordinates [14, 7]. The first s-exon has 1 amino-acid residue, but 5 instead of 3 bases because the first codon is shared between two s-exons. In particular, the amino-acid residue V needs two bases in this s-exon to complete the codon started in the previous s-exon.
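The arithmetic behind both examples can be sketched in a few lines of Python. This is only an illustration generalized from the two examples, not the thoraxe implementation: the helper name `s_exon_intervals` and its arguments are hypothetical, phases are assumed to be 0, 1 or 2, and s-exon boundaries inside a sub-exon are assumed to fall between codons (phase 0).

```python
def s_exon_intervals(coding_start, start_phase, end_phase, residue_counts, strand):
    """Sketch: per-s-exon (start, end, start_phase, end_phase) tuples.

    Hypothetical helper, not part of thoraxe. `strand` is 1 or -1,
    phases are assumed to be 0, 1 or 2, and `residue_counts` holds the
    number of residues of each s-exon of one sub-exon, in order.
    """
    lengths = [3 * count for count in residue_counts]
    if strand == 1:
        # Example 1: the residue of the codon shared at the start is counted
        # here, although only 3 - start_phase of its bases are in the sub-exon.
        lengths[0] -= start_phase
        # The end_phase bases opening the next shared codon carry no residue here.
        lengths[-1] += end_phase
    else:
        # Example 2: the residue of the codon shared at the start is not counted
        # here, but the bases completing that codon belong to the first s-exon.
        lengths[0] += (3 - start_phase) % 3
        # The residue of the codon shared at the end is counted here, although
        # only end_phase of its bases are in the sub-exon.
        lengths[-1] -= (3 - end_phase) % 3

    intervals = []
    position = coding_start
    last = len(lengths) - 1
    for index, length in enumerate(lengths):
        # Closed interval: coordinates grow on the positive strand
        # and shrink on the negative strand.
        end = position + strand * (length - 1)
        intervals.append((
            position,
            end,
            start_phase if index == 0 else 0,
            end_phase if index == last else 0,
        ))
        position = end + strand
    return intervals


positive = s_exon_intervals(6, 1, 2, [2, 2], strand=1)    # s-exons VW and XY
negative = s_exon_intervals(19, 1, 2, [1, 3], strand=-1)  # s-exons W and XYZ
print(positive)
print(negative)
```

With the inputs above, the sketch reproduces the intervals and phases of Example 1 ([6, 10] and [11, 18]) and Example 2 ([19, 15] and [14, 7]).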

thoraxe.thoraxe.create_chimeric_msa(output_folder, subexon_table, gene2speciesname, connected_subexons, clusters=None, cutoff=30.0, min_col_number=4, aligner='ProGraphMSA', padding='XXXXXXXXXX', species_list=None, keep_single_subexons=False)

Return a dict from cluster to cluster data.

For each cluster, there is a tuple with the subexon dataframe, the chimeric sequences and the msa.

This function can take a clusters argument with the list of ‘Cluster’ identifiers to use. If that list is not given, all the positive ‘Cluster’ identifiers from subexon_table are used.

thoraxe.thoraxe.get_s_exon_msas(output_folder)

Return a dict of the s_exon MSAs.

thoraxe.thoraxe.get_s_exons(output_folder, subexon_table, gene2speciesname, connected_subexons, minimum_len=4, coverage_cutoff=80.0, percent_identity_cutoff=30.0, gap_open_penalty=10, gap_extend_penalty=1, aligner='ProGraphMSA', padding='XXXXXXXXXX', movements=True, disintegration=True, species_list=None, keep_single_subexons=False)

Perform almost all the ThorAxe pipeline.

thoraxe.thoraxe.get_subexons(transcript_table, minimum_len=4, coverage_cutoff=80.0, percent_identity_cutoff=30.0, gap_open_penalty=10, gap_extend_penalty=1)

Return a DataFrame with subexon information and clustered exons.

Exons are clustered according to their protein sequence using the blosum50 matrix and the following arguments: minimum_len, coverage_cutoff, percent_identity_cutoff, gap_open_penalty and gap_extend_penalty.

thoraxe.thoraxe.get_transcripts(input_folder, max_tsl_level=3.0, species_list=None)

Return a DataFrame with the transcript information.

thoraxe.thoraxe.main()

Perform Pipeline.

thoraxe.thoraxe.merge_clusters(subexon_table)

Merge ‘Cluster’s that share subexons.
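One common way to merge groups that share members is a union-find pass, sketched below on a plain dict instead of the subexon DataFrame. This is an illustration of the idea only and may differ from what merge_clusters actually does: the helper name `merge_shared`, the input layout (cluster identifier to set of subexon identifiers) and the choice of keeping the smallest identifier for a merged cluster are all assumptions.

```python
def merge_shared(cluster2subexons):
    """Sketch: map each cluster id to the id of its merged cluster.

    Hypothetical helper, not part of thoraxe. Clusters sharing at least
    one subexon, directly or through a chain of other clusters, end up
    with the same id (assumed here to be the smallest id of the group).
    """
    parent = {cluster: cluster for cluster in cluster2subexons}

    def find(cluster):
        # Follow parent links up to the root, shortening the path on the way.
        while parent[cluster] != cluster:
            parent[cluster] = parent[parent[cluster]]
            cluster = parent[cluster]
        return cluster

    first_seen = {}  # subexon id -> first cluster in which it appeared
    for cluster, subexons in cluster2subexons.items():
        for subexon in subexons:
            if subexon in first_seen:
                root_a = find(cluster)
                root_b = find(first_seen[subexon])
                if root_a != root_b:
                    # Keep the smallest id as the representative.
                    parent[max(root_a, root_b)] = min(root_a, root_b)
            else:
                first_seen[subexon] = cluster
    return {cluster: find(cluster) for cluster in cluster2subexons}


merged = merge_shared({1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}, 4: {"c"}})
print(merged)
```

In this toy input, clusters 1 and 2 share subexon b and clusters 2 and 4 share subexon c, so 1, 2 and 4 collapse into one cluster while 3 stays on its own.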

thoraxe.thoraxe.parse_command_line()

Parse command line.

It uses argparse to parse thoraxe’s command-line arguments and returns the argparse parser.
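The pattern of a function that builds and returns an argparse parser can be sketched as follows. The option names below are hypothetical and are not thoraxe's real command-line flags; only their default values are borrowed from the Python signatures documented on this page (max_tsl_level=3.0, minimum_len=4).

```python
import argparse


def build_example_parser():
    """Sketch: build and return a parser.

    Hypothetical function; the options are illustrative assumptions,
    not thoraxe's actual command-line interface.
    """
    parser = argparse.ArgumentParser(
        prog="example",
        description="Illustrative parser with assumed option names.",
    )
    parser.add_argument("--input-folder", default=".",
                        help="Folder with the input data (assumed option).")
    parser.add_argument("--max-tsl-level", type=float, default=3.0,
                        help="Maximum transcript support level (assumed option).")
    parser.add_argument("--minimum-len", type=int, default=4,
                        help="Minimum length used for clustering (assumed option).")
    return parser


# Returning the parser lets the caller decide when and what to parse,
# which also makes the command line easy to test.
parser = build_example_parser()
args = parser.parse_args(["--max-tsl-level", "2"])
print(args.max_tsl_level, args.minimum_len)
```

Returning the parser rather than the parsed arguments keeps parsing separate from construction, so a caller such as a main function can call parse_args itself.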

thoraxe.thoraxe.update_subexon_table(subexon_table, cluster2data)

Update the subexon table by adding the s-exon information.

thoraxe.version module

Module contents

thoraxe: Pipeline to disentangle homology relationships between exons.