
MAGATT pipeline

Marker Assisted Gene Annotation Transfer for Triticeae.
Snakemake pipeline used to transfer a GFF annotation onto a new assembly with a fine-grained, targeted mapping approach.

Install the pipeline

$ git clone git@forgemia.inra.fr:umr-gdec/magatt.git
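
All the Snakemake commands shown below are assumed to be run from the cloned repository (with the Snakefile and config.yaml at its root):

$ cd magatt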

Dependencies

Snakemake

  • Version 5.5.2

Python3

  • Python: 3.5
  • Biopython: 1.68
  • numpy: 1.15
  • pandas: 0.23
  • pysam: 0.15

Genomic Tools

  • Bedtools: 2.27
  • Blat: 36
  • Exonerate (fastavalidcds): 2.4.0
  • GenomeTools: 1.5.9
  • gffread: 0.9.11
  • GMAP: 2018-05-11
  • NCBI-blast (BLAST+): 2.6
  • Samtools: 1.9
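
One possible (optional) shortcut is to install most of these tools through a conda environment; the package names and version pins below are assumptions and may need adjusting to your channels:

$ conda create -n magatt -c conda-forge -c bioconda \
    python=3.5 snakemake=5.5.2 biopython numpy pandas pysam \
    bedtools blat exonerate genometools-genometools gffread \
    gmap blast samtools=1.9
$ conda activate magatt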

Prepare the pipeline

Creating the configuration file

The configuration file config.yaml contains all the input files required by the pipeline, as well as a few additional parameters.

  • Prepare the query genome data

    annotationQuery (GFF3): the GFF annotation to transfer onto the new target genome.
        Example: annotationQuery: "/path/to/IWGSC_RefSeqv1_annotation.gff3"
    featureType (string): the GFF feature type used to anchor the annotation; here we use the gene feature.
        Example: featureType: 'gene'
    queryFasta (FASTA): FASTA file of the query genome reference sequence; it must match the GFF file given in annotationQuery.
        Example: queryFasta: '/path/to/IWGSC_RefSeqv1_annotation.fasta'
    blastdb (BLAST database): BLAST database of all mRNAs (all isoforms) of the query annotation, used to rescue genes for which the targeted transfer failed (a sketch of how to build it follows this list).
        Example: blastdb: 'data/IWGSCv1.1_all_mrna'
    chromosomes (Python list): list of all the chromosomes of the query reference genome, used to split the data per chromosome and speed up the analysis.
        Example: chromosomes: ['1A', '2A', '3A', '4A', '5A', '6A', '7A', '1B', '2B', '3B', '4B', '5B', '6B', '7B', '1D', '2D', '3D', '4D', '5D', '6D', '7D', 'U']
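
One possible way to build the mRNA BLAST database is from the query annotation and genome with gffread and makeblastdb; the commands below are an illustration (not necessarily how the original database was produced), reusing the file names from the example configuration:

$ gffread data/IWGSC_v1.1_20170706.gff3 -g data/161010_Chinese_Spring_v1.0_pseudomolecules.fasta -w IWGSCv1.1_all_mrna.fasta
$ makeblastdb -in IWGSCv1.1_all_mrna.fasta -dbtype nucl -out data/IWGSCv1.1_all_mrna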
  • Prepare the markers/ISBPs input data

The pipeline uses markers/ISBPs as anchors to target the restricted region of the target genome where each gene is expected to be located.

First, these markers (ISBPs in our case) must be mapped with bwa onto your target genome:

$ bwa mem IWGSC_RefSeqV2_bwaindex ISBPs.fasta | samtools view -bS - | samtools sort -o ISBPS_on_refseqv2.bam
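
If the target genome has not been indexed for bwa yet, build the index before running the mapping command above; indexing the sorted BAM is usually useful as well (the index prefix here is only an illustration):

$ bwa index -p IWGSC_RefSeqV2_bwaindex data/CS_pesudo_v2.1.fa
$ samtools index ISBPS_on_refseqv2.bam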

The corresponding parameters in config.yaml are:

    isbpBam (BAM): BAM file of the markers/ISBPs mapped on the target genome.
        Example: isbpBam: 'ISBPS_on_refseqv2.bam'
    isbpBed (BED): initial coordinates of the markers/ISBPs on the query genome.
        Example: isbpBed: 'data/ISBP_refseqv1.bed'
    mapq (int): minimum mapping quality for a marker/ISBP to be kept in the anchoring process (see the quick check after this list).
        Example: mapq: 30
    mismatches (int): maximum number of mismatches allowed for a marker to be kept in the analysis.
        Example: mismatches: 0
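
The pipeline applies the mapq and mismatches filters itself, but as a quick sanity check you can count how many mapped markers reach the chosen mapping quality:

$ samtools view -c -q 30 ISBPS_on_refseqv2.bam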
  • Prepare the target genome data

Here, we only need the FASTA file of the new genome assembly.

    targetFasta (FASTA): FASTA file of the target genome assembly onto which the annotation is transferred.
        Example: targetFasta: 'data/CS_pesudo_v2.1.fa'
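
Having a FASTA index next to the assembly can be convenient (the pipeline may build the indexes it needs on its own; this is only a convenience step):

$ samtools faidx data/CS_pesudo_v2.1.fa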
  • Output parameters/settings

    results (string): directory in which all the Snakemake rules are executed.
        Example: results: 'results'
    finalPrefix (string): prefix for all the final output files (annotation, mRNA/peptide FASTA sequences, etc.).
        Example: finalPrefix: 'IWGSC_refseqv2.0_annotv2.0'
    chromMapID (CSV): mapping file giving the correspondence between the chromosome names in the GFF and the chromosome IDs used in the newly generated gene IDs.
        Example: chromMapID: 'data/chromosomeMappingID.csv'

Example of the chromosomeMappingID.csv file:

$ cat data/chromosomeMappingID.csv
Chr1A   1A
Chr1B   1B
Chr1D   1D
Chr2A   2A
Chr2B   2B
Chr2D   2D
Chr3A   3A
Chr3B   3B
Chr3D   3D
Chr4A   4A
Chr4B   4B
Chr4D   4D
Chr5A   5A
Chr5B   5B
Chr5D   5D
Chr6A   6A
Chr6B   6B
Chr6D   6D
Chr7A   7A
Chr7B   7B
Chr7D   7D

Once all those parameters have been set up, the final configuration file may look like this:

##### QUERY related files/parameters (refseqv1.0)
# GFF annotation to transfer
annotationQuery: 'data/IWGSC_v1.1_20170706.gff3'
# feature type used for anchoring on the target genome
featureType: 'gene'
# FASTA of the query (used to check the sequences after the coordinates are calculated on the target genome)
queryFasta: 'data/161010_Chinese_Spring_v1.0_pseudomolecules.fasta'
# blastdb of all mRNAs, used to rescue genes for which the targeted transfer approach failed
blastdb: 'data/IWGSCv1.1_all_mrna'
# map of all chromosome ids --> NEED TO BE UPDATED in another version WITH ONE ARRAY FOR THE QUERY AND ONE ARRAY FOR THE TARGET GENOME ASSEMBLY
chromosomes: ['1A', '2A', '3A', '4A', '5A', '6A', '7A', '1B', '2B', '3B', '4B', '5B', '6B', '7B', '1D', '2D', '3D', '4D', '5D', '6D', '7D', 'U']
refChrom: ['chr1A', 'chr1B', 'chr1D', 'chr2A', 'chr2B', 'chr2D', 'chr3A', 'chr3B', 'chr3D', 'chr4A', 'chr4B', 'chr4D', 'chr5A', 'chr5B', 'chr5D', 'chr6A', 'chr6B', 'chr6D', 'chr7A', 'chr7B', 'chr7D', 'chrUn']

##### TARGET related files/parameters (refseqv2.1)
targetFasta: 'data/CS_pesudo_v2.1.fa'

##### ISBP/markers related config and parameters
# BAM file of markers/ISBPs mapped on the target genome (REFSEQ v2.1)
isbpBam: 'data/iwgsc_refseqv1_ISBP.bwav2.1.bam'
# BED file of coordinates on the query genome (REFSEQ v1.0)
isbpBed: 'data/ISBP_refseqv1.bed'
# minimum mapping quality of markers on the target genome
mapq: 30
# max mismatches per ISBP/marker
mismatches: 0

##### OUTPUT directory
results: 'resultsDEV'
finalPrefix: 'IWGSC_refseqv2.0_annotv2.0'
# this file contains two columns: the first is the chromosome name as it appears in the genome.fasta of the new reference,
# and the second the chromosome name as it will appear in the new gene Names
chromMapID: 'data/chromosomeMappingID.csv'

Running the pipeline

First, it is recommended to do a dry run of the analysis:

$ snakemake -nrp

This checks all the rules and the parameters in the config.yaml file and prints all the command lines that would be executed.
If there are no errors, you can then execute the pipeline with:

$ snakemake

If you have multiple CPUs available on your computer, you can choose to use them.
For example, if you want to use up to 8 CPUs in parallel, you can run:

$ snakemake -j 8

If you are on a computer cluster with a job scheduler, you can tell the pipeline to use this scheduler instead of running all the processes on the local machine:

$ snakemake -j 32 --cluster sbatch

This allows at most 32 jobs to run concurrently through the SLURM scheduler via sbatch.
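
Depending on your cluster, you may want to pass resource requests to sbatch; the flags below are assumptions and must be adapted to your SLURM setup and to the resources declared in the rules:

$ snakemake -j 32 --cluster "sbatch --cpus-per-task={threads} --mem=16G --time=12:00:00"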

You can generate a diagram of all the jobs and dependencies of your analysis:

$ snakemake --dag | dot -T png > dag.png

This generates a PNG image of the full job graph (dag.png).

If you simply want the global rule graph of the pipeline, you may run:

$ snakemake --rulegraph | dot -T png > rulegraph.png

This generates a PNG image of the rule graph (rulegraph.png).