thoraxe.subexons package

Submodules

thoraxe.subexons.alignment module

alignment: Module to create the subexon MSA with ProGraphMSA.

class thoraxe.subexons.alignment.ColCluster(patterns, consensus, start, end)

class thoraxe.subexons.alignment.ColPattern(pattern, start, end)

property end: Alias for field number 2

property pattern: Alias for field number 0

property start: Alias for field number 1

class thoraxe.subexons.alignment.ProblematicSubexonBlock(sequence_index, subexon, subexon_blocks, block_type, subexon_block_start, subexon_block_end, gap_block_start, gap_block_end)

property block_type: Alias for field number 3

property gap_block_end: Alias for field number 7

property gap_block_start: Alias for field number 6

property sequence_index: Alias for field number 0

property subexon: Alias for field number 1

property subexon_block_end: Alias for field number 5

property subexon_block_start: Alias for field number 4

property subexon_blocks: Alias for field number 2

thoraxe.subexons.alignment.cluster_subexon_blocks(blocks, margin=2): Cluster sub-exon blocks starting or ending in the same column.

thoraxe.subexons.alignment.column_clusters(colpatterns): Return a ColCluster list from a ColPattern list.

thoraxe.subexons.alignment.column_patterns(msa_matrix): Return a ColPattern list from the msa_matrix.

thoraxe.subexons.alignment.create_chimeric_sequences(subexon_table, subexon_matrix, connected_subexons, padding='XXXXXXXXXX')

Create chimeric sequence for ProGraphMSA.

It returns a Dict from ‘GeneID’ to a tuple with the chimeric sequence and a Dict from ‘SubexonIndex’ to …

thoraxe.subexons.alignment.create_msa_matrix(chimerics, msa)

Convert a msa from chimerics to a matrix.

Each cell has the subexon number (Index) or nan for gaps and padding.

thoraxe.subexons.alignment.create_subexon_matrix(subexon_table): Return a binary matrix showing subexon presence in transcripts.

thoraxe.subexons.alignment.delete_padding(seq, padding)

Replace padding by gaps.

>>> delete_padding("MHGL--XXXX-XXX--XXXKLMHSXXXXXXX-X-X-X", "XXXXXXXXXX")
'MHGL---------------KLMHS-------------'

>>> delete_padding("MHGLXKLMHSX", "XXXXXXXXXX")
'MHGLXKLMHSX'

thoraxe.subexons.alignment.delete_subexons(subexon_df, chimerics, msa, cutoff=30.0, min_col_number=4, keep_single_subexons=False): Return the list of ‘SubexonIDCluster’ to delete from ‘Cluster’.

thoraxe.subexons.alignment.disintegration(msa_numpy, msa_matrix)

Disintegrate a one-residue length s-exons.

It merges their residues to the contiguous s-exons. Return the modified MSA and sub-exon matrix.

thoraxe.subexons.alignment.gene2species(transcript_data): Return the a dict from ‘GeneID’ to ‘Species’.

thoraxe.subexons.alignment.get_consensus(msa, threshold=0.5)

Return the consensus sequence of the MSA.

It returns the amino-acid residue that appears in more than threshold fraction of the sequences without counting sequences that has gaps in that column. If any residue is present in more sequences than the threshold, it introduces an X to indicate the ambiguity. If there are more than 50% gaps in a column, it displays the residue in lowercase.

>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Align import MultipleSeqAlignment
>>> seq_1 = SeqRecord(Seq("SEEEEACCC", IUPAC.protein), id="I")
>>> seq_2 = SeqRecord(Seq("SSEEEGCC-", IUPAC.protein), id="II")
>>> seq_3 = SeqRecord(Seq("SSSEEKC--", IUPAC.protein), id="III")
>>> seq_4 = SeqRecord(Seq("SSSSEH---", IUPAC.protein), id="IIII")
>>> msa = MultipleSeqAlignment([seq_1, seq_2, seq_3, seq_4])
>>> get_consensus(msa)
'SSXEEXCCc'

thoraxe.subexons.alignment.get_gene_ids(msa): Take a biopython MSA and return a list with the gene (sequence) ids.

thoraxe.subexons.alignment.get_subexon_boundaries(col_clusters): Returns a dict from (sequence number, subexon id) to start and end columns.

thoraxe.subexons.alignment.get_submsa(msa, seq_ids): Return the sub-MSA with seq_ids, deleting full gap columns.

thoraxe.subexons.alignment.impute_missing_s_exon(table, column='S_exons', key_columns=None): Replace column values that do not conform to the naming by 0_number.

thoraxe.subexons.alignment.move_block(matrix, row, block_start, block_end, destination_start, destination_end)

Move in-place a block in the (numpy) matrix.

>>> import numpy as np
>>> mat = np.matrix([[0,1,2,3,4], [5,6,7,8,9]])
>>> move_block(mat, 0, 0, 2, 3, 5)
matrix([[3, 4, 2, 0, 1],
        [5, 6, 7, 8, 9]])
>>> move_block(mat, 1, 0, 2, 3, 5)
matrix([[3, 4, 2, 0, 1],
        [8, 9, 7, 5, 6]])

thoraxe.subexons.alignment.move_problematic_block_clusters(msa, msa_matrix, max_res_block=2, min_gap_block=2): Find and move all the problematic block clusters in a chimeric MSA.

thoraxe.subexons.alignment.move_subexon_block(msa_numpy, msa_matrix, subexon_block)

Swap a sub-exon block within its gap block.

The movement depends on the problematic sub-exon block position, e.g.:

``` from: [1.0, 1.0, 1.0, nan, nan, nan, 1.0, 2.0, 2.0, 2.0] # last_block to: [1.0, 1.0, 1.0, 1.0, nan, nan, nan, 2.0, 2.0, 2.0]

from: [1.0, 1.0, 1.0, 2.0, nan, nan, nan, 2.0, 2.0, 2.0] # first_block to: [1.0, 1.0, 1.0, nan, nan, nan, 2.0, 2.0, 2.0, 2.0] ```

thoraxe.subexons.alignment.msa2sequences(msa, gene_ids, padding)

Return str sequences from msa.

It also checks gene_ids and replaces padding by gaps. Full gap columns are deleted.

thoraxe.subexons.alignment.msa_matrices(subexon_df, chimerics, msa): Return msa (residues) and subexon id cluster matrices from msa.

thoraxe.subexons.alignment.problematic_block_clusters(msa_matrix, max_res_block=2, min_gap_block=2): Return a Dict of problematic sub-exon block clusters.

thoraxe.subexons.alignment.problematic_subexon_blocks(subexons, starts, ends, sequence, max_res_block=2, min_gap_block=2)

Return a list of ProblematicSubexonBlocks.

In particular, a sub-exon block is problematic if it has <= max_res_block and it is separeated from the rest of the sub-exon by a gap block => min_gap_block.

>>> import numpy as np
>>> problematic_subexon_blocks([1.0, np.nan, 1.0, 2.0], [0, 3, 9, 10], [3, 9, 10, 13], 0)
[ProblematicSubexonBlock(sequence_index=0, subexon=1.0, subexon_blocks=2, block_type='last_block', subexon_block_start=9, subexon_block_end=10, gap_block_start=3, gap_block_end=9)]

thoraxe.subexons.alignment.read_msa_fasta(msa_file)

Return a BioPython’s alignment object from the fasta msa_file.

Return None if the file is empty.

thoraxe.subexons.alignment.resume_seq(matrix_row)

Return the sub-exon and gap blocks in the msa matrix.

The function return three lists, a list of sub-exon ids and NaNs, i.e. the block id, a list with the column indices where the block starts and a list with the block ends.

>>> resume_seq([1.0, 1.0, 1.0, np.nan, np.nan, np.nan, 1.0, 2.0, 2.0, 2.0])
([1.0, nan, 1.0, 2.0], [0, 3, 6, 7], [3, 6, 7, 10])

thoraxe.subexons.alignment.run_aligner(chimerics, output_path='alignment.fasta', aligner='ProGraphMSA')

Run ProGraphMSA in the chimeric sequences and return the output file.

You can pass arguments using aligner (default: ‘ProGraphMSA’), e.g:

aligner='ProGraphMSA --mldist_gap'

You need ProGraphMSA installed to run this function. You can install it from: https://github.com/acg-team/ProGraphMSA

If you are using Windows 10 and you have installed ProGraphMSA in Ubuntu using the ‘Windows Subsystem for Linux’, you can try with the following options:

aligner='ubuntu.exe -c ProGraphMSA'
aligner='bash.exe -c ProGraphMSA'
aligner='wsl.exe ProGraphMSA'

thoraxe.subexons.alignment.save_s_exons(subexon_df, sequences, gene_ids, colclusters, output_folder)

It saves the information about s-exons.

It takes a list of sequences, like the one returned by msa2sequences. Return subexon_df with the orthologous exonic region (s-exon) information. For each s-exon saves a fasta MSA in the output_folder.

thoraxe.subexons.alignment.score_solution(msa_matrix): Return the number of times a sub-exon cross an s-exon boundary.

thoraxe.subexons.alignment.sort_species(chimerics, gene2sp, species_list=None): Sort chimerics using the gene2species output and the species_list.

thoraxe.subexons.alignment.subexon_connectivity(subexon_table, id_column='SubexonIDCluster')

Return a set of connected subexon tuples.

Each tuple has two elements, ‘SubexonIDCluster’ (directed) pairs. The first subexon appears before of the second in at least one transcript.

thoraxe.subexons.ases module

ases: Function to detect conserved alternative splicing events.

It finds the canonical path in the splice graph to detect conserved ASEs.

class thoraxe.subexons.ases.PathData(Genes, Transcripts)

property Genes: Alias for field number 0

property Transcripts: Alias for field number 1

thoraxe.subexons.ases.conserved_ases(s_exon_table, transcript_table, s_exon_msas, graph_file_name, min_genes=1, min_transcripts=2, column_order=None, delim='/')

Return two DataFrames.

A table of transcripts/paths.
A table of the conserved alternative splicing events detected.

thoraxe.subexons.ases.create_ases_table(s_exon_msas, s_exon_table, events, delim='/'): Create an ASE data frame from the dictionary returned by detect_ases.

thoraxe.subexons.ases.detect_ases(path_table, min_genes=0, min_transcripts=1, delim='/')

Return a dictionary from ASEs to their data.

It takes as imput a pandas DataFrame object with the path table. Each key of the output dictionary is a tuple (canonical_path, alternative_path) and the value is a tuple of:

one PathData tuple for the canonical path

and one for the alternative path;

the transcript weighted conservation for the canonical sub-path,

and for the alternative sub-path,

the type of alternative splicing event, e.g. “insertion”;

the mutually exclusivity of the event (”” for non mutually exclusive);

one list of mutually exclusive s-exons in canonical paths

and one for the alternative paths.

The last two list are paired, i.e. the two first s-exons are mutually exclusive.

thoraxe.subexons.ases.get_transcript_scores(s_exon_table, transcript_table, graph, column_order=None, delim='/'): Return a DataFrame with the needed data to choose the canonical path.

thoraxe.subexons.ases.read_splice_graph(graph_file_name): Read ThorAxe’s splice graph using NetworkX.

thoraxe.subexons.graph module

graph: Functions to store the splice graph with conservation information.

thoraxe.subexons.graph.nodes_and_edges2genes_and_transcripts(data)

Return five dictionaries:

nodes to genes
edges to genes
nodes to transcripts
edges to transcripts
edges to the transcript weighted conservation score

thoraxe.subexons.graph.splice_graph_gml(filename, s_exons_msas, node2genes, edge2genes, node2transcripts, edge2transcripts, edge2trx_cons, s_exon_2_char)

Store the splice graph in GML (Graph Modelling Language) format.

It stores the conservation level information for nodes and edges, where conservation is the fraction of species that has that particular feature.

thoraxe.subexons.phylosofs module

phylosofs: Functions to generate PhyloSofS input.

thoraxe.subexons.phylosofs.get_s_exon2char(s_exons)

Return a dictionary from orthologous regions (s-exon) to a single character.

>>> result = get_s_exon2char(['1_0', '1_1', '2_0'])
>>> sorted(result)
['1_0', '1_1', '2_0']
>>> [result[s_exon] for s_exon in sorted(result)]
['%', '(', ')']

thoraxe.subexons.phylosofs.get_transcript2phylosofs(data, s_exon2char): Return a dictionary from transcript id to its phylosofs representation.

thoraxe.subexons.phylosofs.phylosofs_inputs(s_exon_data, ensembl_folder, output_folder): Create the input files needed for running PhyloSofS in a phylosofs folder.

thoraxe.subexons.phylosofs.prune_tree(input_tree, output_tree, exontable_file, used_genes)

Delete unused proteins from the Ensembl gene tree.

The function reads the Newick tree from input_tree and stores the pruned tree in output_tree and returning the output_tree path. If an error occurs, the function does not save the tree and raises an Exception.

In the tree, each protein is identified by their ensembl translation id. For each gene, Ensembl uses the longest translated transcript sequence.

The exontable_file downloaded from Ensembl with transcript_query is needed for getting the mapping between translation and gene ids of the used_genes.

thoraxe.subexons.plot module

plot: Generate js data for plotting the MSA matrix using plotly.

thoraxe.subexons.plot.create_python_structure(cluster2data): Python structured data to be translated to JavaScript.

thoraxe.subexons.plot.plot_msa_subexons(cluster2data, output_folder): Save html and js data in the output_folder.

thoraxe.subexons.rescue module

Functions for the rescue phase.

thoraxe.subexons.rescue.get_unaligned_regions(seq_i, seq_j, minimum_len=4)

Return the ungapped regions in seq_j that do not align with seq_i.

Returned regions should have at least minimum_len.

>>> seq_i = '------MHKCLVDE------YTEDQGGFRK------'
>>> seq_j = 'MLLHYHHHKC-------LMCYTRDLHG---IH-L-K'
>>> get_unaligned_regions(seq_i, seq_j)
['MLLHYH', 'IHLK']
>>> seq_i = 'XXXXXXXXXXMHKCLVDE------YTEDQGGFRK'
>>> seq_j = '----MLLHYHHHKC-------LMCYTRDLHG---'
>>> get_unaligned_regions(seq_i, seq_j)
['MLLHYH']
>>> seq_i = 'XXXXXXXXXXMHKCLVDE------YTEDQGGFRK'
>>> seq_j = '----MLLHYHHHKC-----XXLMCYTRDLHG---'
>>> get_unaligned_regions(seq_i, seq_j)
['MLLHYH', 'XXLMC']

thoraxe.subexons.rescue.modify_subexon_cluster(subexon_table, subexon2cluster): Modify subexon_table using the mapping in subexon2cluster.

thoraxe.subexons.rescue.subexon_rescue_phase(cluster2data, subexon_table, minimum_len=4, coverage_cutoff=80.0, percent_identity_cutoff=30.0, gap_open_penalty=-10, gap_extend_penalty=-1, substitution_matrix=None)

Execute the subexon rescue phase.

This function looks the subexons to rescues in the subexon_table by looking at the ‘Cluster’ numbers that are lower than 0, i.e. -1 means that the subexon has been cleaned up from ‘Cluster’ 1. Then, each subexon to rescue is aligned against the subexons in other ‘Cluster’s, different from the original one. If it is nicely aligned against another cluster, the ‘Cluster’ number is modified in the subexon_table to match the new proposed ‘Cluster’ and the data relative to the original pairwise alignment are cleaned up. This function returns the list of ‘Cluster’s that have been modified.

thoraxe.subexons.subexons module

Functions to create the subexon table.

class thoraxe.subexons.subexons.Interval(start, end, components)

property components: Alias for field number 2

property end: Alias for field number 1

property start: Alias for field number 0

thoraxe.subexons.subexons.create_subexon_table(transcript_data, merge_non_redundant=True)

Return a subexon table.

If merge_non_redundant is True (default), contiguous subexons that appear together in only one exon are merged/joint. This happens because subexons are defined using coordinates at the genomic level to reduce redundancy. However, some subexons does not have the same protein sequence because of different reading frames/phases, giving non-redundant subexons at the protein level.

thoraxe.subexons.subexons.disjoint_intervals(starts, ends, ids)

Return list of disjoint intervals.

Produce the set of disjoint intervals from a list of start and end positions (same chr) and optionally ids. Empty intervals are reported as well. !!! The intervals need to be sorted by start and end position before calling the function and the coordinates should be inclusive.

>>> disjoint_intervals([1,10,20,40], [9,30,30,50], [0,1,2,3])
[Interval(start=1, end=9, components={0}), Interval(start=10, end=19, components={1}), Interval(start=20, end=30, components={1, 2}), Interval(start=31, end=39, components=set()), Interval(start=40, end=50, components={3})]
>>> disjoint_intervals([1,1], [9,7], [0,1])
[Interval(start=1, end=7, components={0, 1}), Interval(start=8, end=9, components={0})]
>>> disjoint_intervals([1,3], [7,9], [0,1])
[Interval(start=1, end=2, components={0}), Interval(start=3, end=7, components={0, 1}), Interval(start=8, end=9, components={1})]
>>> disjoint_intervals([3,1], [9,7], [0,1])
Traceback (most recent call last):
    ...
AssertionError: starts should be sorted.
>>> disjoint_intervals([3,1], [9,7,10], [0,1])
Traceback (most recent call last):
    ...
AssertionError: starts, ends and ids should have the same length.

thoraxe.subexons.subexons.update_to_merge_list(to_merge, subexon_1, subexon_2)

Add subexon_1 and subexon_2 to the to_merge list.

>>> update_to_merge_list([], 1, 2)
[{1, 2}]
>>> update_to_merge_list([{1, 2}], 2, 3)
[{1, 2, 3}]
>>> update_to_merge_list([{1, 2}], 8, 9)
[{1, 2}, {8, 9}]

thoraxe.subexons.tidy module

tidy: Utils for having a tidy CSV table output (denormalized table).

thoraxe.subexons.tidy.get_tidy_table(table, gene2species)

Save a tidy and denormalized output table.

Takes the final exon table and a dict from gene to species name to save a csv file with the output table in the outputdir.

Module contents

subexons: Module to identify and cut subexons.

This defines functions that take as input the DataFrame output from transcript_info.read_transcript_info.