Synteny pipeline

This Nextflow pipeline produces orthogroups with OrthoFinder and synteny blocks with MCScanX, starting from genomes and their associated annotation files.

Requirements

Two modes are available. The first one uses both the genome (FASTA) and the annotation file (GFF3) to extract proteins and run the rest of the pipeline. NOTE: make sure that the sequences in the FASTA file are folded (wrapped to a fixed line width); this is a requirement of AGAT/BioPerl. The second mode takes proteins directly as input, along with the GFF3 file. This second mode gives more control over the proteins of interest and saves some time.
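If a genome FASTA has each sequence on a single line, it can be folded before running the first mode. The following is a minimal sketch of such a folding step; the function name `fold_fasta` and the default width of 60 characters are assumptions chosen for illustration, not something the pipeline prescribes.

```python
def fold_fasta(lines, width=60):
    """Yield FASTA lines with every sequence wrapped to `width` characters.

    `lines` is any iterable of text lines (for example an open file).
    The width of 60 is an assumed default, not a pipeline requirement.
    """
    chunks = []  # pieces of the sequence currently being read

    def flush():
        # Join the buffered pieces and emit them in fixed-width slices.
        sequence = "".join(chunks)
        chunks.clear()
        for start in range(0, len(sequence), width):
            yield sequence[start:start + width]

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            yield from flush()  # finish the previous record, if any
            yield line          # header lines are kept unchanged
        else:
            chunks.append(line)
    yield from flush()          # emit the last record
```

To rewrite a file, iterate over `fold_fasta(open(path))` and write each yielded line followed by a newline to a new file.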

First mode (Genome + Annotation)

Genome files should be in FASTA format. Annotation files should be in GFF3 format. The main input is a TSV file listing the paths to these files for each species. The format is as follows:

ID    genome                  gff3               chr_conversion
sp1   input/sp1_genome.fa     input/sp1.gff3     input/chr_conversion/sp1.tsv
sp2   input/sp2_genome.fa     input/sp2.gff3     input/chr_conversion/sp2.tsv

The ID of each species should be unique and will be used to name output files. The chr_conversion column lets you provide a two-column file for each species in order to rename chromosomes in the FASTA and GFF3 files. This column is optional and its entries can be placeholders, BUT the header must remain. Please see example_data/synteny_genome_infiles.tsv for a working template.

Second mode (Protein + Annotation)

Protein files should be in FASTA format. Annotation files should be in GFF3 format. The main input is a TSV file listing the paths to these files for each species. The format is as follows:

ID    protein          gff3             chr_conversion
sp1   input/sp1.fa     input/sp1.gff3   input/chr_conversion/sp1.tsv
sp2   input/sp2.fa     input/sp2.gff3   input/chr_conversion/sp2.tsv

The ID of each species should be unique and will be used to name output files. Please see example_data/synteny_protein_infiles.tsv for a working template.
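Before launching the pipeline, it can help to check that the input TSV follows the rules described for both modes: the expected header, four columns per row, and unique species IDs. The sketch below is one way to do that; the function name `check_sample_sheet` is hypothetical, and the strict header comparison and tab delimiter are assumptions about how the pipeline reads the file.

```python
import csv
import io

# Headers as shown in the two input formats (genome mode and protein mode).
GENOME_HEADER = ["ID", "genome", "gff3", "chr_conversion"]
PROTEIN_HEADER = ["ID", "protein", "gff3", "chr_conversion"]


def check_sample_sheet(handle):
    """Return a list of problems found in a tab-separated input sheet."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header not in (GENOME_HEADER, PROTEIN_HEADER):
        return [f"unexpected header: {header}"]

    problems = []
    seen_ids = set()
    # Data rows start on line 2 of the file.
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # skip empty lines
        if len(row) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} columns, got {len(row)}"
            )
            continue
        species_id = row[0]
        if species_id in seen_ids:
            problems.append(f"line {line_number}: duplicate ID {species_id!r}")
        seen_ids.add(species_id)
    return problems
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to the function; an empty list means no problem was detected by these checks. The check does not verify that the listed files exist.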

Software

All tools required to run the pipeline are installed at launch with Conda (make sure it is installed) or Docker, with the exception of MCScanX. MCScanX must be installed manually in the bin/ folder of the pipeline: download https://github.com/wyp1125/MCScanX/archive/refs/heads/master.zip, unzip it, then run make in the extracted directory (ensure javac is installed). In case of difficulty, refer to the installation steps at https://github.com/wyp1125/MCScanX#installation. The program should then be accessible at bin/MCScanX-master/MCScanX.

Running the pipeline

An example dataset is provided to test the pipeline. With the first mode:

nextflow run main.nf -c example_data/example_data.config --convert_chr false --species_genome_files example_data/synteny_genome_infiles.tsv --outdir results_example_data/

With the second mode:

nextflow run main.nf -c example_data/example_data.config --species_protein_files example_data/synteny_protein_infiles.tsv --outdir results_example_data/

Default parameters are described in the nextflow.config file. To change them, you can edit nextflow.config, pass another config file with the -c option, or set individual parameters on the command line.

Output

Output files are gathered in the outdir directory. If all "Publish results" options are set to true in the config file, the following outputs are expected for each species listed in the TSV file described in the Requirements section:

  • checked_gff/ -> GFF3 file after verification with agat_convert_sp_gxf2gxf.pl (if check_gff is true)
  • converted_chr_names/ -> old to new chromosome names (if convert_chr is true)
  • cds/ -> extracted CDS using jcvi.formats.gff load
  • longest_isoform_gff/ -> selection of the longest isoform using agat_sp_keep_longest_isoform.pl (publish is set to false by default)
  • proteins/ -> translated CDS using seqkit translate
  • orthology/ -> OrthoFinder's results
  • synteny/ -> MCScanX's results

More output directories can be created to check the results of each process, by changing the related options in the nextflow.config file.
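After a run, a quick check of which output directories are present can be sketched as follows. The helper name `missing_outputs` is hypothetical, and the sketch assumes that the listed directories sit directly under outdir; the actual layout may differ, and some directories are only published when the corresponding option is enabled.

```python
from pathlib import Path

# Directory names taken from the output list; which ones appear depends
# on the publish options and on check_gff / convert_chr.
EXPECTED_DIRS = [
    "checked_gff",
    "converted_chr_names",
    "cds",
    "longest_isoform_gff",
    "proteins",
    "orthology",
    "synteny",
]


def missing_outputs(outdir, expected=EXPECTED_DIRS):
    """Return the expected directory names that are absent from `outdir`.

    Assumes the directories are direct children of `outdir`.
    """
    root = Path(outdir)
    return [name for name in expected if not (root / name).is_dir()]
```

For example, `missing_outputs("results_example_data/")` lists the directories from the example run that were not created. A missing longest_isoform_gff/ directory is expected with default settings, since publishing is disabled for it by default.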