Poolseq2PGG_prepData

ec58486d · Ludovic Duvaux · c92c38e8 · ec58486d
Commit ec58486d authored 6 months ago by Ludovic Duvaux
--- a/04_atelier_graphe_pangenome/mappingpoolseq2MC-PGG.md
+++ b/04_atelier_graphe_pangenome/mappingpoolseq2MC-PGG.md
+# Mapping poolseq data to MC-PGG and SV calling
+
+The idea of this work is to check the feasability of mapping poolseq reads to a 
+pangenome graph in order to infer SV frequencies from population samples.
+
+We use the _Zemoseptoria tritici_ toydataset from the [Bioinfomics tutorial](https://pangenome-hackathon-genotoul-bioinfo-11d6d4f47ac33734abfa2a1377.pages.mia.inra.fr/pages/data/). 
+It is made of the chromosome 5 only. **This will necessarily lead to mapping issues, but that is not a problem for testing purposes.**
+
+**So to start, let's shamelessly reuse the tutorial first steps.**
+
+## Preparing the graph
+### Run the analysis on genobioinfo (Genotoul):
+
+```bash
+ssh login@genobioinfo.toulouse.inrae.fr
+# replace login by your user name / If you do not have an account ask the organizers for a training account
+# ask for a node
+srun -c 4 --mem 60G --pty bash
+# mv to your work and create directory
+cd work 
+mkdir pangbomics
+cd pangbomics
+```
+
+### Prepare the _Z. tritici_ data
+Let's download them:
+```bash
+wget https://genotoul-bioinfo.pages.mia.inra.fr/pangenome_hackathon/data/ztritici.tar
+tar -x -f ztritici.tar
+# unzip all genome files
+gzip -d ztritici/*.fa.gz
+# set the DATADIR path for the tutorial
+export DATADIR=$(pwd)/ztritici
+echo $DATADIR
+```
+
+### Data preparation
+For clarity, haplotype sequences will be renamed using PanSN.
+
+NB we call "haplotype" each individual genome assembly. We could have two haplotypes for the same diploïd genome.
+
+Briefly, each sequence ID should be named as follow : <genome_name>#<haplotype_id>#<chr_name or ctg_name>.
+
+```bash
+for file in ztritici/*.fa; do
+    genome=$(basename $file .fa | cut -f1 -d'.')
+    echo $genome
+    sed -i -e "1 s/^.*$/>$genome#1#chr05/" $file
+done
+```
+
+### MC Pangenome Graph Construction
+First, we create a seqfile that will tell Minigraph-Cactus where genome sequences are located.
+```bash
+# Create a directory for MC
+mkdir MC
+
+# Create the seqfile
+for file in $DATADIR/*.fa; do
+    echo -e "$(basename $file .fa | cut -f1 -d'.').1\t${file}" >> MC/zt.seqfile
+done
+```
+
+Let's run Minigraph-Cactus. It will take around 15min:
+
+```bash
+# load cactus env
+module load bioinfo/Cactus/2.7.1
+# info
+# minimal command is: cactus-pangenome jobstore zt.seqfile --outDir zt_mc --outName zt_mc --reference IPO323
+# we will run the command with several options to obtain the graph in different formats with their indexes
+
+cactus-pangenome jobstore MC/zt.seqfile --outDir MC/zt_mc --outName zt_mc --reference IPO323.1 TN09.1 --gfa clip filter full 
+
+#cactus-pangenome jobstore_full zt.seqfile --outDir MC/zt_mc_full --outName zt_mc --reference IPO323.1 --filter 0 --clip 0 --gbz --gfa
+
+# decompressing outputs
+gzip -d MC/zt_mc/*.full.gfa.gz
+
+# Retrieving graphs
+ln MC/zt_mc/zt_mc.full.gfa MC/zt_mc.V1-1.gfa
+```
\ No newline at end of file