Skip to content
Snippets Groups Projects
Commit ec58486d authored by Ludovic Duvaux's avatar Ludovic Duvaux
Browse files

Poolseq2PGG_prepData

parent c92c38e8
No related branches found
No related tags found
No related merge requests found
# Mapping poolseq data to MC-PGG and SV calling
The idea of this work is to check the feasability of mapping poolseq reads to a
pangenome graph in order to infer SV frequencies from population samples.
We use the _Zemoseptoria tritici_ toydataset from the [Bioinfomics tutorial](https://pangenome-hackathon-genotoul-bioinfo-11d6d4f47ac33734abfa2a1377.pages.mia.inra.fr/pages/data/).
It is made of the chromosome 5 only. **This will necessarily lead to mapping issues, but that is not a problem for testing purposes.**
**So to start, let's shamelessly reuse the tutorial first steps.**
## Preparing the graph
### Run the analysis on genobioinfo (Genotoul):
```bash
ssh login@genobioinfo.toulouse.inrae.fr
# replace login by your user name / If you do not have an account ask the organizers for a training account
# ask for a node
srun -c 4 --mem 60G --pty bash
# mv to your work and create directory
cd work
mkdir pangbomics
cd pangbomics
```
### Prepare the _Z. tritici_ data
Let's download them:
```bash
wget https://genotoul-bioinfo.pages.mia.inra.fr/pangenome_hackathon/data/ztritici.tar
tar -x -f ztritici.tar
# unzip all genome files
gzip -d ztritici/*.fa.gz
# set the DATADIR path for the tutorial
export DATADIR=$(pwd)/ztritici
echo $DATADIR
```
### Data preparation
For clarity, haplotype sequences will be renamed using PanSN.
NB we call "haplotype" each individual genome assembly. We could have two haplotypes for the same diploïd genome.
Briefly, each sequence ID should be named as follow : <genome_name>#<haplotype_id>#<chr_name or ctg_name>.
```bash
for file in ztritici/*.fa; do
genome=$(basename $file .fa | cut -f1 -d'.')
echo $genome
sed -i -e "1 s/^.*$/>$genome#1#chr05/" $file
done
```
### MC Pangenome Graph Construction
First, we create a seqfile that will tell Minigraph-Cactus where genome sequences are located.
```bash
# Create a directory for MC
mkdir MC
# Create the seqfile
for file in $DATADIR/*.fa; do
echo -e "$(basename $file .fa | cut -f1 -d'.').1\t${file}" >> MC/zt.seqfile
done
```
Let's run Minigraph-Cactus. It will take around 15min:
```bash
# load cactus env
module load bioinfo/Cactus/2.7.1
# info
# minimal command is: cactus-pangenome jobstore zt.seqfile --outDir zt_mc --outName zt_mc --reference IPO323
# we will run the command with several options to obtain the graph in different formats with their indexes
cactus-pangenome jobstore MC/zt.seqfile --outDir MC/zt_mc --outName zt_mc --reference IPO323.1 TN09.1 --gfa clip filter full
#cactus-pangenome jobstore_full zt.seqfile --outDir MC/zt_mc_full --outName zt_mc --reference IPO323.1 --filter 0 --clip 0 --gbz --gfa
# decompressing outputs
gzip -d MC/zt_mc/*.full.gfa.gz
# Retrieving graphs
ln MC/zt_mc/zt_mc.full.gfa MC/zt_mc.V1-1.gfa
```
\ No newline at end of file
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment