Synteny pipeline
This Nextflow pipeline produces orthogroups with OrthoFinder and synteny blocks with MCScanX, using genomes and their associated annotation files.
Requirements
Two modes are available. The first one uses both the genome (FASTA) and the annotation file (GFF3) to extract proteins and run the rest of the pipeline (NOTE: make sure that sequences in the FASTA are folded, i.e. wrapped to a fixed line width; this is a requirement for AGAT/BioPerl). The second mode takes proteins directly as input, along with the GFF3 file. This second mode gives more control over the proteins of interest and saves some time.
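Any line-wrapping tool can satisfy the folding requirement for the first mode. As a minimal sketch, the helper below re-wraps FASTA text in Python. The function name fold_fasta and the width of 60 characters are illustrative assumptions, not values mandated by the pipeline.

```python
def fold_fasta(text: str, width: int = 60) -> str:
    """Wrap every sequence in FASTA text to at most `width` characters per line.

    The default width of 60 is an assumption made for this sketch.
    """
    if width < 1:
        raise ValueError("width must be a positive integer")
    out = []   # output lines (headers and wrapped sequence lines)
    seq = []   # sequence pieces of the record currently being read

    def flush():
        # Join the pieces of the current record and emit fixed-width lines.
        joined = "".join(seq)
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        seq.clear()

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            flush()           # finish the previous record, if any
            out.append(line)  # headers are kept unchanged
        else:
            seq.append(line)
    flush()
    return "\n".join(out) + "\n" if out else ""
```

To apply it to a file, read the genome with open(...).read(), pass the text to fold_fasta, and write the result to a new file that is then referenced in the input table. For very large genomes a streaming tool may be preferable, since this sketch holds the whole file in memory.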
First mode (Genome + Annotation)
Genome files should be in FASTA format. Annotation files should be in GFF3 format. A TSV file is the main input, listing the paths to these files for each species. The format is as follows:
ID genome gff3 chr_conversion
sp1 input/sp1_genome.fa input/sp1.gff3 input/chr_conversion/sp1.tsv
sp2 input/sp2_genome.fa input/sp2.gff3 input/chr_conversion/sp2.tsv
The ID of each species should be unique and will be used to name output files. The chr_conversion column lets you provide a two-column file for each species to rename chromosomes in the FASTA and GFF3. This column is optional and its entries can be placeholders, but the header must remain.
Please see example_data/synteny_genome_infiles.tsv for a working template.
Second mode (Protein + Annotation)
Protein files should be in FASTA format. Annotation files should be in GFF3 format. A TSV file is the main input, listing the paths to these files for each species. The format is as follows:
ID protein gff3 chr_conversion
sp1 input/sp1.fa input/sp1.gff3 input/chr_conversion/sp1.tsv
sp2 input/sp2.fa input/sp2.gff3 input/chr_conversion/sp2.tsv
The ID of each species should be unique and will be used to name output files.
Please see example_data/synteny_protein_infiles.tsv for a working template.
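Both input tables share the same layout, so a quick sanity check before launching can catch a wrong header or a duplicated species ID. The sketch below is not part of the pipeline. The function name read_infile_table is hypothetical, and it assumes the table is tab-separated with exactly the four columns shown above.

```python
import csv
import io


def read_infile_table(text: str, mode: str = "genome") -> dict:
    """Parse an input table and return {species ID: row as a dict}.

    Assumptions of this sketch: tab-separated fields, and the header
    'ID <mode> gff3 chr_conversion' where <mode> is 'genome' or 'protein'.
    """
    if mode not in ("genome", "protein"):
        raise ValueError("mode must be 'genome' or 'protein'")
    expected = ["ID", mode, "gff3", "chr_conversion"]
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Drop completely empty lines (e.g. a trailing newline).
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows or [cell.strip() for cell in rows[0]] != expected:
        raise ValueError(f"header must be: {expected}")
    table = {}
    for row in rows[1:]:
        if len(row) != len(expected):
            raise ValueError(f"expected {len(expected)} columns, got {len(row)}: {row}")
        record = dict(zip(expected, (cell.strip() for cell in row)))
        if record["ID"] in table:
            # IDs name the output files, so they must be unique.
            raise ValueError(f"duplicate species ID: {record['ID']}")
        table[record["ID"]] = record
    return table
```

For example, read_infile_table(open("example_data/synteny_protein_infiles.tsv").read(), mode="protein") would return one entry per species, or raise a ValueError describing the first problem found.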
Software
All tools required for the pipeline execution will be installed on launch with Conda (ensure you have it installed) or Docker, with the exception of MCScanX.
MCScanX should be installed in the bin/ folder of the pipeline: download https://github.com/wyp1125/MCScanX/archive/refs/heads/master.zip, unzip it, then run make in the resulting directory (ensure javac is installed). In case of difficulty, refer to the installation steps at https://github.com/wyp1125/MCScanX#installation.
The program should then be accessible at bin/MCScanX-master/MCScanX.
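Since MCScanX is the only tool installed by hand, it can help to confirm that the build succeeded before launching the pipeline. The sketch below checks that an executable exists at the expected location. The function name mcscanx_ready is hypothetical, and the default path assumes the compiled binary is named MCScanX, as produced by the upstream Makefile.

```python
import os


def mcscanx_ready(pipeline_dir: str,
                  rel_path: str = "bin/MCScanX-master/MCScanX") -> bool:
    """Return True if an executable file exists at rel_path under pipeline_dir.

    The default rel_path is an assumption of this sketch; adjust it if the
    archive was unpacked under a different directory name.
    """
    path = os.path.join(pipeline_dir, *rel_path.split("/"))
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

Calling mcscanx_ready(".") from the pipeline root returns False if make has not been run or the archive was unzipped elsewhere.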
Running the pipeline
An example dataset is provided to test the pipeline. With the first mode:
nextflow run main.nf -c example_data/example_data.config --convert_chr false --species_genome_files example_data/synteny_genome_infiles.tsv --outdir results_example_data/
With the second mode:
nextflow run main.nf -c example_data/example_data.config --species_protein_files example_data/synteny_protein_infiles.tsv --outdir results_example_data/
Default parameters are described in the nextflow.config file. Users can either modify nextflow.config, supply a new config file with the -c option, or specify each parameter with its value on the command line.
Output
Output files will be gathered in the outdir directory.
If all Publish results are set to true in the config file, the following outputs are expected for each species listed in the TSV file described in the Requirements section:
- checked_gff/ -> GFF3 file after verification with agat_convert_sp_gxf2gxf.pl (if check_gff is true)
- converted_chr_names/ -> old to new chromosome names (if convert_chr is true)
- cds/ -> extracted CDS using jcvi.formats.gff load
- longest_isoform_gff/ -> selection of the longest isoform using agat_sp_keep_longest_isoform.pl (publish is set to false by default)
- proteins/ -> translated CDS using seqkit translate
- orthology/ -> OrthoFinder's results
- synteny/ -> MCScanX's results
More output directories can be created to check the results of each process, by changing the related options in the nextflow.config file.
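After a run, the output layout listed above can be checked programmatically. The sketch below reports which of the listed directories are absent from the output directory. The function name missing_outputs is hypothetical. The directory names are taken from the list above, and several of them only appear when the corresponding option (for example check_gff or convert_chr) or publish setting is enabled, so an absence is not necessarily an error.

```python
import os

# Directory names copied from the output list in this README.
EXPECTED_DIRS = (
    "checked_gff",
    "converted_chr_names",
    "cds",
    "longest_isoform_gff",
    "proteins",
    "orthology",
    "synteny",
)


def missing_outputs(outdir: str, expected=EXPECTED_DIRS) -> list:
    """Return the expected sub-directories that do not exist under outdir."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(outdir, name))]
```

For the example dataset, missing_outputs("results_example_data/") returns an empty list when every listed directory was published, and otherwise the names to look into.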