add_transcripts
This script should run after transcript_query and before thoraxe. For example, let’s say you want to add a MAPK8 transcript, described in user_transcript.csv, to the data downloaded from Ensembl for that gene before running thoraxe. You can do something like:
transcript_query MAPK8
add_transcripts user_transcript.csv MAPK8/Ensembl
thoraxe -i MAPK8
add_transcripts adds user-defined transcripts to the data previously downloaded from Ensembl.
usage: add_transcripts [-h] [-v] [--version] input ensembl
Positional Arguments
- input
Input CSV containing the transcript data.
- ensembl
Path to the previously downloaded Ensembl data for a gene, i.e. the path to the Ensembl directory created by transcript_query.
Named Arguments
- -v, --verbose
Print detailed progress.
Default: False
- --version
show program’s version number and exit
It has been developed at LCQB (Laboratory of Computational and Quantitative Biology), UMR 7238 CNRS, Sorbonne Université.
Tip
Although this program is called add_transcripts, you can use it to add genes by adding their transcripts. Use a single input table to add multiple genes and transcripts, or run the script several times to add a different transcript each time.
Warning
You cannot use this program to add single exons unless they contain the complete CDS. Otherwise, thoraxe will delete the exon and its incomplete transcript.
Input preparation
The input table should be a CSV file with the following columns:
Column Name | Description |
---|---|
Species | The binomial species name, in lowercase and with an underscore instead of the space. |
GeneID | The Ensembl gene ID, e.g. ENSG00000107643, rather than the gene name, e.g. MAPK8. |
TranscriptID | A string to identify the transcript. |
Strand | 1 for a gene on the forward strand and -1 for a gene on the reverse strand. |
ExonID | A string to identify the exon. |
ExonRank | Consecutive integers indicating the order of the exons in the transcript. |
ExonRegionStart | The genomic coordinate of the first nucleotide of the exon's NucleotideSequence (the last nucleotide if the gene is on the reverse strand). Note that ExonRegionStart should always be less than ExonRegionEnd. |
ExonRegionEnd | The genomic coordinate of the last nucleotide of the exon's NucleotideSequence (the first nucleotide if the gene is on the reverse strand). |
GenomicCodingStart | If the gene is on the forward strand, the genomic coordinate of the first coding nucleotide on the exon. Otherwise, the genomic coordinate of the last coding nucleotide. Note that GenomicCodingStart should always be less than GenomicCodingEnd. |
GenomicCodingEnd | If the gene is on the forward strand, the genomic coordinate of the last coding nucleotide on the exon. Otherwise, the genomic coordinate of the first coding nucleotide. |
StartPhase | Start phase of the exon. The phase is the position of an exon/intron boundary within a codon. A phase of zero means the boundary falls between codons, one means between the first and second base, and two means between the second and third base. Exons have a start and an end phase, whereas introns have just one phase. A boundary in a non-coding region has a phase of -1. |
EndPhase | End phase of the exon. |
NucleotideSequence | Nucleotide sequence of the exon. It can contain non-coding regions, e.g. UTRs. If the gene is on the reverse strand, the exon sequence should be the reverse complement of the genomic sequence. |
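Two entries in the table above involve small computations: the reverse complement needed for NucleotideSequence on the reverse strand, and the phase arithmetic behind StartPhase and EndPhase. The Python sketch below illustrates both. The names `reverse_complement` and `end_phase` are hypothetical helpers, not part of thoraxe. `end_phase` also assumes that an exon whose StartPhase is -1 but which contains the start of the CDS begins counting at codon position 0, so check its result against your own annotation.

```python
# Hypothetical helpers for preparing table values; not part of thoraxe.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return sequence.translate(_COMPLEMENT)[::-1]


def end_phase(start_phase, coding_length, ends_in_coding=True):
    """Estimate EndPhase from StartPhase and the number of coding nucleotides.

    Assumption: -1 marks a boundary in a non-coding region, and an exon with
    StartPhase -1 that contains the start of the CDS counts from position 0.
    """
    if not ends_in_coding or coding_length == 0:
        # The exon ends in a non-coding region (or has no coding part at all).
        return -1
    offset = 0 if start_phase == -1 else start_phase
    return (offset + coding_length) % 3
```

For example, `reverse_complement("ATGC")` returns `"GCAT"`, and an exon with StartPhase 1 and 5 coding nucleotides gets an EndPhase of 0.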
You can find an example of the required table in test/data/user_transcript.csv of the thoraxe repository at GitHub or in this Google spreadsheet.
Hint
You can copy this Google spreadsheet to modify it with your data and download it as a CSV file.