thoraxe package
Subpackages
- thoraxe.add_transcripts package
- thoraxe.subexons package
- Submodules
- thoraxe.subexons.alignment module
ColCluster
ColPattern
ProblematicSubexonBlock
ProblematicSubexonBlock.block_type
ProblematicSubexonBlock.gap_block_end
ProblematicSubexonBlock.gap_block_start
ProblematicSubexonBlock.sequence_index
ProblematicSubexonBlock.subexon
ProblematicSubexonBlock.subexon_block_end
ProblematicSubexonBlock.subexon_block_start
ProblematicSubexonBlock.subexon_blocks
cluster_subexon_blocks()
column_clusters()
column_patterns()
create_chimeric_sequences()
create_msa_matrix()
create_subexon_matrix()
delete_padding()
delete_subexons()
disintegration()
gene2species()
get_consensus()
get_gene_ids()
get_subexon_boundaries()
get_submsa()
impute_missing_s_exon()
move_block()
move_problematic_block_clusters()
move_subexon_block()
msa2sequences()
msa_matrices()
problematic_block_clusters()
problematic_subexon_blocks()
read_msa_fasta()
resume_seq()
run_aligner()
save_s_exons()
score_solution()
sort_species()
subexon_connectivity()
- thoraxe.subexons.ases module
- thoraxe.subexons.graph module
- thoraxe.subexons.phylosofs module
- thoraxe.subexons.plot module
- thoraxe.subexons.rescue module
- thoraxe.subexons.subexons module
- thoraxe.subexons.tidy module
- Module contents
- thoraxe.transcript_info package
- thoraxe.transcript_query package
- Submodules
- thoraxe.transcript_query.transcript_query module
dictseq2fasta()
filter_ortho()
generic_ensembl_rest_request()
get_biomart_exons_annot()
get_exons_sequences()
get_geneids_from_symbol()
get_genetree()
get_listofexons()
get_listoftranscripts()
get_orthologs()
get_transcripts_orthologs()
is_esemble_id()
lodict2csv()
main()
parse_command_line()
save_ensembl_version()
write_tsl_file()
- Module contents
- thoraxe.utils package
Submodules
thoraxe.thoraxe module
ThorAxe pipeline and script functions.
- thoraxe.thoraxe.add_s_exon_phases_and_coordinates(tbl)
Add s-exon genomic coordinates and phases to the tidy s-exon table.
For each s-exon, it adds the S_exon_Genomic_Sequence and its genomic coordinates, starting at 1 and defining the closed interval [S_exon_CodingStart, S_exon_CodingEnd], where S_exon_CodingStart is greater than or equal to S_exon_CodingEnd for genes on the reverse Strand. It also adds the S_exon_StartPhase and S_exon_EndPhase for those intervals.
An s-exon, defined at the protein level, can be an entire sub-exon or a part of it. When an s-exon is identical to a sub-exon, the genomic coordinates and the phases are the same. However, when a sub-exon is split into several s-exons, coordinates and phases need to be calculated:
Example 1: s-exons in the positive strand.
The following sub-exon shares codons with the previous and the following sub-exon. In this example, the '-' character represents introns and lower case characters represent exon nucleotides. The start and end phases are 1 and 2 respectively:
    ...v----vvwwwxxxyyyzz-----z...
Because we are in the positive strand, the residue V coming from the vvv shared codon is assigned to this sub-exon protein sequence. In the same way, the Z residue coming from the zzz shared codon is assigned to the beginning of the next sub-exon. The protein sequence of this sub-exon is therefore:
    VWXY
Let’s say that this sub-exon is composed of two s-exons: VW and XY. Then the corresponding genomic sequences are vvwww and xxxyyyzz and the (start, end) phases are (1, 0) and (0, 2).
Genomic coordinates.
Let’s say that the first v has the genomic coordinate 1 (tens digit on the first row, units digit on the second):
    v ---- vvwww xxxyyyzz ----- z
             1 11111111 12222 2
    1 2345 67890 12345678 90123 4
In this way, the first s-exon has the genomic coordinates [6, 10] and the second has the genomic coordinates [11, 18]. The first s-exon has 2 amino-acid residues, but 5 instead of 6 bases because the first codon is shared between two s-exons. In particular, the amino-acid residue V needs two bases in this s-exon to complete the codon started in the previous s-exon.
Example 2: s-exons in the negative strand.
    v ---- vvwww xxxyyyzz ----- z
    2 2222 11111 11111
    4 3210 98765 43210987 65432 1
If the same sequence belongs to the negative strand, the sub-exon protein sequence is going to contain Z but not V: WXYZ. Therefore, possible s-exons are W and XYZ and the corresponding genomic sequences will be vvwww and xxxyyyzz and the (start, end) phases will be (1, 0) and (0, 2).
In this way, the first s-exon has the genomic coordinates [19, 15] and the second has the genomic coordinates [14, 7]. The first s-exon has 1 amino-acid residue, but 5 instead of 3 bases because the first codon is shared between two s-exons. In particular, the amino-acid residue V needs two bases in this s-exon to complete the codon started in the previous s-exon.
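The arithmetic behind the two examples can be sketched in a few lines of Python. This is only an illustration that reproduces the numbers given above, not the code used by add_s_exon_phases_and_coordinates: the function name split_subexon, its arguments, and the rule for which s-exon absorbs the bases of a shared codon are assumptions generalised from Example 1 and Example 2. Phases are assumed to be 0, 1 or 2.

```python
def split_subexon(seq, first_coord, strand, start_phase, end_phase, s_exon_lengths):
    """Split a sub-exon into s-exons (hypothetical helper, not part of thoraxe).

    seq            -- sub-exon nucleotides in transcript order
    first_coord    -- genomic coordinate of the first nucleotide of seq
    strand         -- 1 (positive) or -1 (negative)
    s_exon_lengths -- number of amino-acid residues of each s-exon, in order

    Returns one (genomic_sequence, (coding_start, coding_end),
    (start_phase, end_phase)) tuple per s-exon.
    """
    lead = (3 - start_phase) % 3  # bases completing a codon started upstream
    trail = end_phase             # bases starting a codon completed downstream

    # Start from three bases per residue, then correct both ends.
    n_bases = [3 * n for n in s_exon_lengths]
    if lead:
        # Assumption taken from the examples: on the positive strand the
        # residue of the leading shared codon is counted in this sub-exon,
        # on the negative strand it is not.
        n_bases[0] += lead - 3 if strand == 1 else lead
    if trail:
        # Assumption: the opposite holds for the trailing shared codon.
        n_bases[-1] += trail if strand == 1 else trail - 3
    if sum(n_bases) != len(seq):
        raise ValueError("s-exon lengths do not match the sub-exon sequence")

    step = 1 if strand == 1 else -1  # coordinates decrease on the negative strand
    result = []
    offset, coord, phase = 0, first_coord, start_phase
    for n in n_bases:
        end_coord = coord + step * (n - 1)  # closed interval
        next_phase = (phase + n) % 3
        result.append((seq[offset:offset + n], (coord, end_coord), (phase, next_phase)))
        offset += n
        coord = end_coord + step
        phase = next_phase
    return result


# Example 1: positive strand, s-exons VW and XY.
print(split_subexon("vvwwwxxxyyyzz", 6, 1, 1, 2, [2, 2]))
# Example 2: negative strand, s-exons W and XYZ.
print(split_subexon("vvwwwxxxyyyzz", 19, -1, 1, 2, [1, 3]))
```

In both calls the s-exon end phase is obtained as (start phase + number of bases) modulo 3, which is why the first s-exon ends in phase 0 and the second one inherits the end phase 2 of the sub-exon.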
- thoraxe.thoraxe.create_chimeric_msa(output_folder, subexon_table, gene2speciesname, connected_subexons, clusters=None, cutoff=30.0, min_col_number=4, aligner='ProGraphMSA', padding='XXXXXXXXXX', species_list=None, keep_single_subexons=False)
Return a dict from cluster to cluster data.
For each cluster, there is a tuple with the subexon DataFrame, the chimeric sequences and the MSA.
This function can take a clusters argument with the list of ‘Cluster’ identifiers to use. If that list is not given, all the positive ‘Cluster’ identifiers from subexon_table are used.
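The default behaviour of the clusters argument can be pictured with the following sketch. It is not the code of create_chimeric_msa: the helper name select_clusters is hypothetical, the table is modelled as a list of dictionaries instead of a DataFrame, the SubexonID key is only illustrative, and returning the identifiers sorted is an assumption.

```python
def select_clusters(subexon_rows, clusters=None):
    """Return the 'Cluster' identifiers to process (hypothetical helper)."""
    if clusters is not None:
        # An explicit list wins over the table content.
        return list(clusters)
    # Otherwise keep every strictly positive identifier found in the table.
    positive = {row["Cluster"] for row in subexon_rows if row["Cluster"] > 0}
    return sorted(positive)


rows = [
    {"SubexonID": "a", "Cluster": 1},
    {"SubexonID": "b", "Cluster": 2},
    {"SubexonID": "c", "Cluster": -1},
    {"SubexonID": "d", "Cluster": 0},
    {"SubexonID": "e", "Cluster": 2},
]
print(select_clusters(rows))       # every positive identifier in the table
print(select_clusters(rows, [2]))  # only the requested identifier
```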
- thoraxe.thoraxe.get_s_exon_msas(output_folder)
Return a dict of the s_exon MSAs.
- thoraxe.thoraxe.get_s_exons(output_folder, subexon_table, gene2speciesname, connected_subexons, minimum_len=4, coverage_cutoff=80.0, percent_identity_cutoff=30.0, gap_open_penalty=10, gap_extend_penalty=1, aligner='ProGraphMSA', padding='XXXXXXXXXX', movements=True, disintegration=True, species_list=None, keep_single_subexons=False)
Run almost the entire ThorAxe pipeline.
- thoraxe.thoraxe.get_subexons(transcript_table, minimum_len=4, coverage_cutoff=80.0, percent_identity_cutoff=30.0, gap_open_penalty=10, gap_extend_penalty=1)
Return a DataFrame with subexon information and clustered exons.
Exons are clustered according to their protein sequence using the BLOSUM50 matrix and the following arguments: minimum_len, coverage_cutoff, percent_identity_cutoff, gap_open_penalty and gap_extend_penalty.
- thoraxe.thoraxe.get_transcripts(input_folder, max_tsl_level=3.0, species_list=None)
Return a DataFrame with the transcript information.
- thoraxe.thoraxe.main()
Run the ThorAxe pipeline.
- thoraxe.thoraxe.merge_clusters(subexon_table)
Merge ‘Cluster’s that share subexons.
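One common way to merge groups that share members is a union-find (disjoint-set) structure. The sketch below shows that technique on (subexon, cluster) pairs; it is not taken from the merge_clusters source. The function name merge_shared_clusters, the pair-based input and the choice of the smallest identifier as the label of a merged group are all assumptions.

```python
def merge_shared_clusters(pairs):
    """Map each cluster identifier to the label of its merged group.

    pairs -- iterable of (subexon_id, cluster_id); a subexon listed under
             several clusters links those clusters together.
    """
    parent = {}

    def find(cluster):
        # Follow parent links to the root, compressing the path on the way.
        parent.setdefault(cluster, cluster)
        while parent[cluster] != cluster:
            parent[cluster] = parent[parent[cluster]]
            cluster = parent[cluster]
        return cluster

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            low, high = sorted((root_a, root_b))
            parent[high] = low  # the smallest identifier labels the group

    first_cluster = {}  # first cluster seen for each subexon
    for subexon, cluster in pairs:
        find(cluster)  # register the cluster even if it is never merged
        if subexon in first_cluster:
            union(first_cluster[subexon], cluster)
        else:
            first_cluster[subexon] = cluster
    return {cluster: find(cluster) for cluster in parent}


pairs = [("s1", 1), ("s2", 1), ("s2", 2), ("s3", 3), ("s4", 2), ("s4", 4)]
print(merge_shared_clusters(pairs))
```

Here clusters 1 and 2 share s2, and clusters 2 and 4 share s4, so the three of them end up under the same label while cluster 3 stays on its own.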
- thoraxe.thoraxe.parse_command_line()
Parse command line.
It uses argparse to parse thoraxe’s command line arguments and returns the argparse parser.
- thoraxe.thoraxe.update_subexon_table(subexon_table, cluster2data)
Update the subexon table by adding the s-exon information.
thoraxe.version module
Module contents
thoraxe: Pipeline to disentangle homology relationships between exons.