Identification of genomic regions for high-resolution taxonomic profiling using long-read sequencing technology

1. Scientific Context

The 16S rRNA gene is the most widely used marker for taxonomic profiling of prokaryotic communities, as it is universally distributed among prokaryotes and contains both conserved and hypervariable regions. In addition, 16S databases contain genes from many more species than genome databases, making them more comprehensive. Studies usually target a short part of the 16S gene, chosen according to the lineage of interest, and sequence it on the Illumina MiSeq sequencer. However, 16S amplicons often fail to assign taxonomy at the genus or species level because they are not specific enough, and the gene is often present in multiple copies, which prevents any quantitative estimation. Alternative markers such as rpoB, gyrB and recA show better results within specific lineages. In parallel, long-amplicon PCR-based approaches targeting the full-length 16S rRNA gene and the whole ribosomal RNA operon show encouraging results.

2. Description

A workflow to automatically identify and assess genomic regions that would allow high-resolution taxonomic profiling using long-read sequencing approaches.

3. Organisation

3.1 Planning

Identification of universal bacterial genes with eggNOG
Identification of the universal genes in RefSeq genomes
Gene pairing and selection of pairs meeting the selection criteria
Assessment of the taxonomic resolution of candidate regions
Primer design for the selected regions
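
The gene pairing step can be illustrated with a minimal sketch. The selection criteria are not specified in this document, so the sketch assumes a single hypothetical criterion: the region spanned by the two genes must fit a length window compatible with long-amplicon sequencing. The `Gene` type, the `candidate_pairs` function, the `min_span` and `max_span` thresholds and the example coordinates are all illustrative and are not taken from the actual workflow. Genome circularity and strand are ignored.

```python
from itertools import combinations
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical record for a universal gene located on one genome."""
    name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def candidate_pairs(genes, min_span=2000, max_span=6000):
    """Return (name_a, name_b, span) for gene pairs whose spanned region fits the window.

    The window [min_span, max_span] is an assumed selection criterion.
    """
    pairs = []
    ordered = sorted(genes, key=lambda g: g.start)
    for a, b in combinations(ordered, 2):
        # Region covered from the start of the first gene to the furthest end.
        span = max(a.end, b.end) - a.start
        if min_span <= span <= max_span:
            pairs.append((a.name, b.name, span))
    return pairs


# Made-up coordinates, for illustration only.
genes = [
    Gene("geneA", 100, 1300),
    Gene("geneB", 1500, 3100),
    Gene("geneC", 9000, 10200),
]
pairs = candidate_pairs(genes)
print(pairs)
```

With these made-up coordinates only the geneA/geneB pair is retained, since the other pairs span more than the assumed maximum length. In practice a pair would probably also need to be found in a large fraction of the genomes, which this sketch does not check.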

3.2 People involved

Implementation: Jean Mainguy.
Tests and code review: Jean Mainguy and Claire Hoede.

3.3 Material resources

None

3.4 Financial resources

SeqOccIn project

4. Timeline

End of December 2019

5. Validation criteria

A test option in the Nextflow workflow allows the data to be down-sampled to verify that the workflow completes successfully.
We want a functional test (synthetic data and expected output) that automatically checks the regions found and their coordinates on a set of genomes.
a) In silico PCR validation will assess the taxonomic coverage and resolution of the designed primers.
b) Sequencing validation will be performed with get-plage.
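
As an illustration of the in silico PCR validation, the following sketch computes the taxonomic coverage of a primer pair on a set of genomes. It is a simplified assumption of how such a check could look, not the method used in the project: primers must match exactly, only the forward strand of the genome is searched, and degenerate bases are not handled. The function names, the `max_len` limit and the toy sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def in_silico_pcr(genome, fwd, rev, max_len=10000):
    """Return the amplicon for an exact-match primer pair, or None if there is none."""
    start = genome.find(fwd)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so its reverse
    # complement is searched downstream of the forward primer.
    rev_site = reverse_complement(rev)
    pos = genome.find(rev_site, start + len(fwd))
    if pos == -1:
        return None
    end = pos + len(rev_site)
    if end - start > max_len:
        return None
    return genome[start:end]


def coverage(genomes, fwd, rev):
    """Fraction of genomes for which the primer pair yields an amplicon."""
    hits = sum(1 for seq in genomes.values() if in_silico_pcr(seq, fwd, rev) is not None)
    return hits / len(genomes)


# Toy sequences, for illustration only.
genomes = {
    "g1": "TTACGTAAAAGGCATT",
    "g2": "TTACGTAAAATTTTTT",
    "g3": "CCCCACGTCCGGCACC",
}
fwd, rev = "ACGT", "TGCC"
amplicons = {name: in_silico_pcr(seq, fwd, rev) for name, seq in genomes.items()}
print(amplicons)
print(coverage(genomes, fwd, rev))
```

Here two of the three toy genomes are amplified, and their amplicons differ. Resolution could then be estimated by checking whether amplicons from different taxa remain distinguishable, for example by counting distinct amplicon sequences per taxon.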

6. Assessment and specification modification

February 14th 2020 Assessment:

The following steps are implemented in a Nextflow workflow:
Identification of universal bacterial genes with eggNOG
Identification of the universal genes in RefSeq genomes
Gene pairing and selection of pairs meeting the selection criteria

The following steps are in separate scripts:
Assessment of the taxonomic resolution of candidate regions
Primer design for the selected regions

We now need to move on to the second phase of the project, so we decided not to add the primer design component to the Nextflow workflow.

Some important issues remain, which we will address as soon as possible, before the end of March.