add_transcripts =============== This script should run after `transcript_query` and before `thoraxe`. For example, let's see you want to add a *MAPK8* transcript described in `user_transcript.csv` to the data download from *Ensembl* for that gene to run `thoraxe`; you can do something like: :: transcript_query MAPK8 add_transcripts user_transcript.csv MAPK8/Ensembl thoraxe -i MAPK8 .. argparse:: :ref: thoraxe.add_transcripts.add_transcripts.parse_command_line :prog: transcript_query .. tip:: While this program is called `add_transcripts` you can use it to add genes by adding their transcripts. Use a single input table to add multiple genes and transcripts or run this script multiple times to add a different transcript each time. .. warning:: You can not use this program to add single exons unless they contain the complete *CDS*. Otherwise, `thoraxe` would delete the exon and its incomplete transcript. Input preparation ~~~~~~~~~~~~~~~~~ The input table should be a CSV file with the following columns: ==================== =========================================================== Column Name Description ==================== =========================================================== Species It should be the binomial species name in lowercase and using underscore instead of space. GeneID It should be the Ensembl gene ID, e.g. ENSG00000107643, rather than the gene name, e.g. MAPK8. TranscriptID A string to identify the transcript. Strand The Strand should be 1 for a gene in the forward strand and -1 for one in the reverse strand. ExonID A string to identify the exon. ExonRank The Exon Rank should be consecutive integer numbers indicating the order of the exons in the transcript. ExonRegionStart ExonRegionStart should be the genomic coordinate of the first, last if the gene is in the reverse strand, nucleotide of the NucleotideSequence of the exon. Note that ExonRegionStart should always be less than ExonRegionEnd. ExonRegionEnd ExonRegionEnd should be the genomic coordinate of the last, first if the gene is in the reverse strand, nucleotide of the NucleotideSequence of the exon. GenomicCodingStart If the gene is in the forward strand, GenomicCodingStart should be the first coding nucleotide's genomic coordinate. Otherwise, it should be the genomic coordinate of the last coding nucleotide. Note that GenomicCodingStart should always be less than GenomicCodingEnd. GenomicCodingEnd If the gene is in the forward strand, GenomicCodingEnd should be the last coding nucleotide's genomic coordinate on the exon. Otherwise, it should be the genomic coordinate of the first coding nucleotide. StartPhase Start phase of the exon. The position of an exon/intron boundary within a codon. A phase of zero means the boundary falls between codons, one means between the first and second base and two means between the second and third base. Exons have a start and end phase, whereas introns have just one phase. A boundary in a non-coding region has a phase of -1. EndPhase End phase of the exon. NucleotideSequence Nucleotide sequence for the exons. They can contain non-coding regions, e.g. UTRs. If the gene is in the reverse strand, the exon sequence should be the reverse complement of the genomic sequence. ==================== =========================================================== You can find an example of the required table in `test/data/user_transcript.csv` of the `thoraxe` repository at `GitHub`_ or in this `Google spreadsheet`_. .. hint:: You can copy this `Google spreadsheet`_ to modify it with your data and download it as a CSV file. .. _GitHub: https://github.com/PhyloSofS-Team/thoraxe .. _Google spreadsheet: https://docs.google.com/spreadsheets/d/1EEz1rsDCJdJeCl8jPoikTtskAKpEQiE3V8FGcCG55mk/edit?usp=sharing