Harvesting scripts and docker issues

There are problems with the harvesting scripts, the instructions and the Docker integration. Here are the ones I ran into when trying to harvest the data.

README instructions incorrect

The README says:

The -app parameter will trigger a harvest of the resources stored in the Git LFS subdirectories data/rare and data/faidare filtered or not (wheatis and brc4env rely on faidare and rare data respectively).

But that's not actually the case. We still have to pass the -data option when using the -app option.

The README also shows example docker commands for indexing RARe data, but they're missing the -data option, which is necessary.

Dockerfile reproducibility

The Dockerfile uses an untagged base image (alpine), so two builds of the same Dockerfile can produce different images. This is an actual problem: with the current alpine base image, the shell scripts don't run correctly. In particular, they use find -ls, and the -ls option isn't supported by the find command available inside the container.
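One possible way to make the scripts independent of find -ls is sketched below. It relies on the fact that GNU find documents -ls as listing files in `ls -dils` format, and that `-exec ... {} +` is part of POSIX find. The function name `list_files` is hypothetical and does not come from the repository; I have not checked how the real scripts consume the listing, so treat this as an assumption-laden sketch rather than a drop-in patch.

```shell
#!/bin/sh
# Hypothetical helper (not from the repository): list regular files under a
# directory in long format without using the non-POSIX `find -ls` action.
list_files() {
    dir="$1"
    # `-exec ... {} +` batches the found paths into as few `ls` calls as possible.
    # `ls -dils` approximates the columns that GNU `find -ls` prints
    # (inode, blocks, permissions, links, owner, group, size, date, name).
    find "$dir" -type f -exec ls -dils {} +
}
```

For example, `list_files /opt/data/faidare/data` would list the data files that appear in the log below. Column widths and date formatting may still differ slightly from `find -ls`, so anything parsing the output by position would need checking.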

Besides, even after removing that option from the shell scripts, rebuilding the image and running it, the script still fails with the following error:

Index data and suggestions...
Using timestamp corresponding to date: Fri Dec 20 10:47:26 UTC 2024
Indexing files from /opt/data/faidare/data into index located on elasticsearch:9200/faidare_search_dev-tmstp1734691646-resource-index with 4 parallel threads...
0% 0:8=0s /opt/data/faidare/data/datadiscovery-1.json.gz elasticsearch                                                    parallel: This job failed:
index_resources /opt/data/faidare/data/INRAE-URGI_Alvis_OMTD_1.json.gz data-INRAE-URGI_Alvis_OMTD_1.json elasticsearch

real	0m0.223s
user	0m0.179s
sys	0m0.106s
A problem occured (code=2) when trying to index data 
 	from /opt/data/faidare on faidare application and on dev environment
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-INRAE-URGI_Alvis_OMTD_1.json-resources.log.gz
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-INRAE-URGI_unfiltered_AoEwM2EyMTllZjU2ZWY4ZTM2YQ.json-resources.log.gz
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-IPK_unfiltered_3.json-resources.log.gz
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-datadiscovery-1.json-resources.log.gz
Error when indexing data, see errors above. Exiting.