add_transcripts

This script should run after transcript_query and before thoraxe. For example, let’s see you want to add a MAPK8 transcript described in user_transcript.csv to the data download from Ensembl for that gene to run thoraxe; you can do something like:

transcript_query MAPK8
add_transcripts user_transcript.csv MAPK8/Ensembl
thoraxe -i MAPK8

add_transcripts downloads add user-defined transcripts to the data previously download from Ensembl.

usage: transcript_query [-h] [-v] [--version] input ensembl

Positional Arguments

input

Input CSV containing the transcript data.

ensembl

Path to the previously download Ensembl data for a gene, i.e. the path to the Ensembl directory created by transcript_query

Named Arguments

-v, --verbose

Print detailed progress.

Default: False

--version

show program’s version number and exit

It has been developed at LCQB (Laboratory of Computational and Quantitative Biology), UMR 7238 CNRS, Sorbonne Université.

Tip

While this program is called add_transcripts you can use it to add genes by adding their transcripts. Use a single input table to add multiple genes and transcripts or run this script multiple times to add a different transcript each time.

Warning

You can not use this program to add single exons unless they contain the complete CDS. Otherwise, thoraxe would delete the exon and its incomplete transcript.

Input preparation

The input table should be a CSV file with the following columns:

Column Name

Description

Species

It should be the binomial species name in lowercase and using underscore instead of space.

GeneID

It should be the Ensembl gene ID, e.g. ENSG00000107643, rather than the gene name, e.g. MAPK8.

TranscriptID

A string to identify the transcript.

Strand

The Strand should be 1 for a gene in the forward strand and -1 for one in the reverse strand.

ExonID

A string to identify the exon.

ExonRank

The Exon Rank should be consecutive integer numbers indicating the order of the exons in the transcript.

ExonRegionStart

ExonRegionStart should be the genomic coordinate of the first, last if the gene is in the reverse strand, nucleotide of the NucleotideSequence of the exon. Note that ExonRegionStart should always be less than ExonRegionEnd.

ExonRegionEnd

ExonRegionEnd should be the genomic coordinate of the last, first if the gene is in the reverse strand, nucleotide of the NucleotideSequence of the exon.

GenomicCodingStart

If the gene is in the forward strand, GenomicCodingStart should be the first coding nucleotide’s genomic coordinate. Otherwise, it should be the genomic coordinate of the last coding nucleotide. Note that GenomicCodingStart should always be less than GenomicCodingEnd.

GenomicCodingEnd

If the gene is in the forward strand, GenomicCodingEnd should be the last coding nucleotide’s genomic coordinate on the exon. Otherwise, it should be the genomic coordinate of the first coding nucleotide.

StartPhase

Start phase of the exon. The position of an exon/intron boundary within a codon. A phase of zero means the boundary falls between codons, one means between the first and second base and two means between the second and third base. Exons have a start and end phase, whereas introns have just one phase. A boundary in a non-coding region has a phase of -1.

EndPhase

End phase of the exon.

NucleotideSequence

Nucleotide sequence for the exons. They can contain non-coding regions, e.g. UTRs. If the gene is in the reverse strand, the exon sequence should be the reverse complement of the genomic sequence.

You can find an example of the required table in test/data/user_transcript.csv of the thoraxe repository at GitHub or in this Google spreadsheet.

Hint

You can copy this Google spreadsheet to modify it with your data and download it as a CSV file.