Harvesting scripts and Docker issues
There are problems with the harvesting scripts, the instructions and the Docker integration. Here are the ones I ran into when trying to harvest the data.
README instructions incorrect
The README says:
The -app parameter will trigger a harvest of the resources stored in the Git LFS subdirectories data/rare and data/faidare filtered or not (wheatis and brc4env rely on faidare and rare data respectively).
But that's not actually the case: the -data option still has to be passed when using the -app option.
The README also shows example docker commands for indexing RARe data, but they're missing the -data option, which is necessary.
Dockerfile reproducibility
The Dockerfile uses an untagged base image (alpine), which resolves to alpine:latest, so two builds of the same Dockerfile can produce different images.
This is a real problem because with the current alpine base image, the shell scripts don't run correctly. In particular, they use find -ls, and the -ls option isn't supported by the find command available in the Docker image.
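A possible workaround for the -ls problem, as a minimal sketch: assuming the scripts only use find -ls to print a listing of the files they are about to index (I haven't checked every call site), the same information can be obtained with -exec ls -l {} +, which is specified by POSIX and should work with both GNU find and the find shipped in alpine. The list_files function name is hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): list the regular files
# under a directory in long format without using the non-POSIX `find -ls`.
list_files() {
    dir="$1"
    # `-exec ... {} +` batches the file names into as few `ls` calls as possible.
    find "$dir" -type f -exec ls -l {} +
}

# Example call (path taken from the log output in this issue):
#   list_files /opt/data/faidare/data
```

Note that the output format is that of ls -l, not that of find -ls (no inode or block count columns), so anything parsing the listing would need to be adjusted.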
Besides, even after removing that option from the shell scripts, rebuilding the image and running it, the script still fails with the following error:
Index data and suggestions...
Using timestamp corresponding to date: Fri Dec 20 10:47:26 UTC 2024
Indexing files from /opt/data/faidare/data into index located on elasticsearch:9200/faidare_search_dev-tmstp1734691646-resource-index with 4 parallel threads...
0% 0:8=0s /opt/data/faidare/data/datadiscovery-1.json.gz elasticsearch parallel: This job failed:
index_resources /opt/data/faidare/data/INRAE-URGI_Alvis_OMTD_1.json.gz data-INRAE-URGI_Alvis_OMTD_1.json elasticsearch
real 0m0.223s
user 0m0.179s
sys 0m0.106s
A problem occured (code=2) when trying to index data
from /opt/data/faidare on faidare application and on dev environment
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-INRAE-URGI_Alvis_OMTD_1.json-resources.log.gz
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-INRAE-URGI_unfiltered_AoEwM2EyMTllZjU2ZWY4ZTM2YQ.json-resources.log.gz
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-IPK_unfiltered_3.json-resources.log.gz
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-datadiscovery-1.json-resources.log.gz
Error when indexing data, see errors above. Exiting.
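To dig further, the gzipped log files named in the error output can be inspected from inside the container. The following is only a sketch: the show_bulk_logs function name and the default of 20 lines are my own choices, and it assumes the logs are plain gzip-compressed text under /tmp/bulk/faidare-dev, as the paths above suggest.

```shell
#!/bin/sh
# Hypothetical helper: print the first lines of every gzipped bulk log
# found in a directory, so the underlying Elasticsearch error is visible.
show_bulk_logs() {
    log_dir="$1"
    max_lines="${2:-20}"   # number of lines to show per log (assumed default)
    for f in "$log_dir"/*-resources.log.gz; do
        # Skip the unexpanded pattern when no log file matches.
        [ -e "$f" ] || continue
        echo "== $f"
        gzip -dc "$f" | head -n "$max_lines"
    done
}

# Example call (directory taken from the error output above):
#   show_bulk_logs /tmp/bulk/faidare-dev
```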