Identification of genomic regions for high-resolution taxonomic profiling using long-read sequencing technology
1. Scientific Context
The 16S rRNA gene is the most widely used marker for taxonomic profiling, as it is universally distributed among prokaryotes and contains both conserved and hypervariable regions. In addition, 16S databases contain genes from many more species than genome databases, making them more comprehensive. Studies usually target a small part of the 16S gene, chosen according to the lineage of interest, and sequence it with the Illumina MiSeq sequencer. However, the 16S gene often fails to assign taxonomy at the genus or species level because the amplicons are not specific enough, and the gene is often present in multiple copies per genome, which prevents any quantitative estimation. Alternative markers such as rpoB, gyrB and recA show better results within specific lineages. In parallel, long-amplicon PCR-based approaches targeting the full-length 16S rRNA gene and the whole ribosomal RNA operon show encouraging results.
2. Description
A workflow to automatically identify and assess genomic regions that would allow high-resolution taxonomic profiling using long-read sequencing approaches.
3. Organisation
3.1 Planning
- Identification of universal bacterial genes with eggNOG
- Identification of the universal genes in RefSeq genomes
- Gene pairing and selection of the pairs meeting the selection criteria
- Assessment of the taxonomic resolution of candidate regions
- Primer design for the selected regions
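The gene pairing step can be illustrated with a minimal sketch. The selection criteria are not detailed in this document, so the ones used here (both genes on the same contig, combined span within a length range) are assumptions, as are the `Gene` class, the `candidate_regions` function and the `min_len`/`max_len` thresholds.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Gene:
    name: str    # universal gene family label (hypothetical)
    contig: str  # contig or replicon carrying the gene
    start: int   # 1-based inclusive start coordinate
    end: int     # 1-based inclusive end coordinate


def candidate_regions(genes, min_len=2000, max_len=10000):
    """Pair genes located on the same contig and keep the pairs whose
    combined span (first start to last end) has a length in [min_len, max_len].

    The length thresholds are illustrative assumptions, not the project's criteria.
    """
    ordered = sorted(genes, key=lambda g: (g.contig, g.start))
    regions = []
    for a, b in combinations(ordered, 2):
        if a.contig != b.contig:
            continue
        start = min(a.start, b.start)
        end = max(a.end, b.end)
        if min_len <= end - start + 1 <= max_len:
            regions.append((a.name, b.name, start, end))
    return regions


# Toy annotation: only geneA and geneB are close enough to form a region.
genes = [
    Gene("geneA", "chr1", 100, 1300),
    Gene("geneB", "chr1", 1500, 3200),
    Gene("geneC", "chr1", 50000, 51000),
    Gene("geneD", "plasmid1", 200, 2500),
]
regions = candidate_regions(genes)
print(regions)
```

In a real run, the gene coordinates would come from the annotation of the universal genes on each RefSeq genome, and a pair would only be retained if it satisfied the criteria across the genome set.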
3.2 People involved
Implementation: Jean Mainguy.
Tests and code review: Jean Mainguy and Claire Hoede.
3.3 Material resources
None
3.4 Financial resources
SeqOccIn project
4. Timeline
End of December 2019
5. Validation criteria
- A test option in the Nextflow workflow allows the data to be down-sampled in order to verify that the workflow completes successfully.
- We want a functional test (synthetic data and expected output) that automatically checks the regions found and their coordinates on a set of genomes.
- a) In silico PCR validation will assess the taxonomic coverage and resolution of the designed primers.
- b) Sequencing validation will be performed with get-plage.
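As an illustration of what in silico PCR validation measures, the sketch below searches each genome for a forward primer and the reverse complement of a reverse primer, allowing a few substitutions, and reports taxonomic coverage as the fraction of taxa with at least one predicted amplicon. It is a simplified stand-in written for this document, not the tool used in the project: it scans one strand only, ignores IUPAC degenerate bases, and all function names, primers and sequences are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(template, primer, max_mismatch):
    """Start positions where primer matches template with at most
    max_mismatch substitutions (no indels)."""
    k = len(primer)
    return [
        i
        for i in range(len(template) - k + 1)
        if sum(a != b for a, b in zip(template[i:i + k], primer)) <= max_mismatch
    ]


def in_silico_pcr(template, forward, reverse, max_mismatch=1, max_len=5000):
    """Predicted amplicons (primers included) on the given strand."""
    rev_site = revcomp(reverse)
    amplicons = []
    for f in find_sites(template, forward, max_mismatch):
        for r in find_sites(template, rev_site, max_mismatch):
            end = r + len(rev_site)
            # The reverse site must lie downstream of the forward site.
            if r >= f + len(forward) and end - f <= max_len:
                amplicons.append(template[f:end])
    return amplicons


def taxonomic_coverage(genomes, forward, reverse, **kwargs):
    """Fraction of taxa with at least one predicted amplicon, plus their names."""
    amplified = {
        taxon
        for taxon, seq in genomes.items()
        if in_silico_pcr(seq, forward, reverse, **kwargs)
    }
    return len(amplified) / len(genomes), amplified


# Toy data: taxon_2 carries one substitution in the forward primer site,
# taxon_3 has no primer site at all.
genomes = {
    "taxon_1": "TTACGTACGGGTTTAAACTGCAATT",
    "taxon_2": "TTACGAACGGGTTTAAACTGCAATT",
    "taxon_3": "GGGGGGGGGGGGGGGGGGGGGGGGG",
}
forward, reverse = "ACGTAC", "TTGCAG"
coverage, amplified = taxonomic_coverage(genomes, forward, reverse, max_mismatch=1)
print(round(coverage, 2), sorted(amplified))
```

Lowering `max_mismatch` to 0 in this toy example drops taxon_2, which shows how the mismatch tolerance directly affects the coverage estimate.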
6. Assessment and specification modification
February 14th 2020 Assessment:
The following steps are implemented in a Nextflow workflow:
- Identification of universal bacterial genes with eggNOG
- Identification of the universal genes in RefSeq genomes
- Gene pairing and selection of the pairs meeting the selection criteria
The following steps are in separate scripts:
- Assessment of the taxonomic resolution of candidate regions
- Primer design for the selected regions
We now need to move on to the second phase of the project, so we decided not to add the primer design component to the Nextflow workflow.
Some important issues remain, which we will address as soon as possible, before the end of March.
No need to change the resources of the project.
July 24th 2020 Assessment:
New features have been implemented in order to complete this part:
- Identification of the 16S gene and of the 16S-23S region has been added to the pipeline so that they can be compared with the other identified regions.
- The genome selection used to run the pipeline can now be chosen and modified.
- In silico PCR with ecoPCR has been run on genomes and on the expected targeted regions, using an extended RefSeq selection.
- Taxonomic resolution based on a clustering approach.
- Taxonomic resolution computed on amplicon sequences predicted by ecoPCR.
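One possible way to turn a clustering into a taxonomic resolution score is sketched below. The definition used is an assumption made for illustration, not necessarily the one implemented in the pipeline: a species counts as resolved when every cluster containing one of its sequences contains no other species. The function name and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def species_resolution(cluster_of, species_of):
    """Return the fraction of species that are resolved, and the resolved set.

    cluster_of: sequence id -> cluster id (output of any clustering tool)
    species_of: sequence id -> species label
    A species is 'resolved' (assumed definition) if it never shares a
    cluster with another species.
    """
    species_in_cluster = defaultdict(set)
    for seq_id, cluster in cluster_of.items():
        species_in_cluster[cluster].add(species_of[seq_id])

    all_species = {species_of[seq_id] for seq_id in cluster_of}
    ambiguous = set()
    for members in species_in_cluster.values():
        if len(members) > 1:
            ambiguous |= members

    resolved = all_species - ambiguous
    return len(resolved) / len(all_species), resolved


# Toy example: sp_B and sp_C fall into the same cluster, so neither is resolved.
cluster_of = {"s1": "c1", "s2": "c1", "s3": "c2", "s4": "c2", "s5": "c3"}
species_of = {"s1": "sp_A", "s2": "sp_A", "s3": "sp_B", "s4": "sp_C", "s5": "sp_D"}
fraction, resolved = species_resolution(cluster_of, species_of)
print(fraction, sorted(resolved))
```

The same function can be applied to clusters built from whole candidate regions or from amplicon sequences predicted by in silico PCR, which makes the two resolution estimates directly comparable.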
This part should be finished by the 7th of September.
In parallel, bibliographic work has been carried out on long-read whole-genome metagenomics to prepare the second step. This work will continue throughout phase 2 of the project.
No need to change the resources of the project.
September 24th 2020 Assessment:
The universal targets found by the workflow do not seem to perform as well as 16S. A possible strategy is to make the pipeline more flexible so that it can identify potential targets within a range of different taxa. These taxa could be, for example, highly abundant genera found in the gut microbiome.
This part should be finished by the 18th of December 2020.
No need to change the resources of the project.
February 25th 2021 Assessment:
Making the pipeline more flexible is not currently a priority; it will be done when time is available. We plan to apply this new strategy to different families or classes of interest within the Firmicutes phylum as a proof of concept.
This feature should be completed by July 2021.
No need to change the resources of the project.
July 29th 2021 Assessment:
The analysis of the metabarcoding HiFi data took a long time because the read error rate was higher than expected, owing to the polymerase.
Consequently, we are planning to complete this feature in January 2022.
No need to change the resources of the project.
April 20th 2022 Assessment:
We have finished exploiting the 16S-23S region. We now know how to obtain the best results, both for the wet-lab protocol and for the bioinformatics part. For the moment, we are setting aside the idea of designing targets and primers for a clade of interest or a list of species.