In many technical domains, generic problem-solving
knowledge is scarce even though a large number
of concrete resolutions exist and are well documented. As a
result, machine learning from resolution traces
faces a number of challenges, not least among them the
complexity of the underlying domain (concepts, relationships,
events, processes, etc.) and the machine-readability of the
documented resolution. We tackle here the acquisition of
expertise in phylogeny, which is a notoriously rich and prolific
field where hundreds, if not thousands, of concrete cases are
reported in the literature, yet tools to assist the phylogenist in
analyzing a new dataset are virtually absent. Thus, we propose
an approach that amounts to ontology-based workflow mining:
Our T-GOWLer system abstracts general patterns from event
sequences previously extracted from texts. It comprises two
modules, a workflow extractor and a pattern miner, both
relying on a specific domain ontology.
Source code is available on our GitHub page: https://github.com/halioui/tgowler
PYTHON 2.7 https://www.python.org/download/releases/2.7
TOMCAT 8.0 https://tomcat.apache.org/download-80.cgi
SESAME OpenRDF 2.7.16 https://sourceforge.net/projects/sesame/files/Sesame%202/2.7.16
GATE 8.1 https://gate.ac.uk/download/
Please refer to the data.world page to access the data, or download them from here. We include here only additional data (such as previously published extracted workflows and the un-annotated texts); see below.
WfExtractor_1.0: Workflow Extractor on GATE
Download WfExtractor 1.0 [139M]
The WfExtractor_1.0 tool annotates a text corpus with the phylogenetic analysis workflows it describes. Some of the features of WfExtractor_1.0 are:
- Extract workflow components (programs, parameters, data and metadata) from texts
- Extract data flows (relations) from texts
- Create WSD (Word Sense Disambiguation) models for both components and relations
- Export a GATE inline XML corpus.
Input data: datastore_2018_2019.zip [248M], datastore_gold.zip [17M], tgowler_resource_ontologies_PHAGE_1.0.rdf.zip [19M]
Output data: annotated_2018_2019.zip [12M], annotated_gold_100.zip [180K], annotated_gold_500.zip [800K], WF_2018_2019.xml [6.6M], WF_gold_100.xml [136K], WF_gold_500.xml [564K]
- Unzip the WfExtractor_1.0.zip file.
- Import all files from $WfExtractor_HOME/plugins to the $GATE_HOME/plugins directory
- Load the PHAGE ontology via Tomcat (see Sesame deployment guide)
- If Java reports an error about the XML entity expansion limit, raise that limit in the $TOMCAT_HOME/bin/catalina.sh file (note that this relaxes the JDK's safeguard against entity expansion attacks):
JAVA_OPTS="$JAVA_OPTS -Djdk.xml.entityExpansionLimit=100000000 -Djdk.xml.FEATURE_SECURE_PROCESSING=false -Xmx6G"
- Configure the Gazetteer_LKB dictionary configuration file $WfExtractor_HOME/application-resources/Dictionary_from_remote_repository/config.ttl by changing the ontology information:
hr:repositoryURL <YOUR_HTTP_REPOSITORY>
For example:
hr:repositoryURL <http://localhost:8080/openrdf-sesame/repositories/phage11>
rep:repositoryID "phage11" ;
rdfs:label "PHAGE_1.1" .
- Open GATE and import the application file WfExtractor1.0.xgapp from $WfExtractor_HOME
- Run the application (see GATE 8.1 Developer Guide).
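The config.ttl edit described above can also be scripted. The following is a minimal sketch in Python 3 (not the Python 2.7 listed in the requirements), assuming the file contains the three properties shown in the example. The function name `set_repository` and the placeholder values in `sample` are hypothetical and not part of WfExtractor.

```python
import re


def set_repository(config_text, repo_url, repo_id, label):
    """Return config_text with the repository URL, ID and label replaced."""
    # Replace whatever IRI currently follows hr:repositoryURL.
    config_text = re.sub(r'(hr:repositoryURL\s*)<[^>]*>',
                         lambda m: m.group(1) + '<' + repo_url + '>',
                         config_text)
    # Replace the quoted repository ID.
    config_text = re.sub(r'(rep:repositoryID\s*)"[^"]*"',
                         lambda m: m.group(1) + '"' + repo_id + '"',
                         config_text)
    # Replace the quoted human-readable label.
    config_text = re.sub(r'(rdfs:label\s*)"[^"]*"',
                         lambda m: m.group(1) + '"' + label + '"',
                         config_text)
    return config_text


# Hypothetical fragment standing in for the relevant lines of config.ttl.
sample = '''hr:repositoryURL <YOUR_HTTP_REPOSITORY>
rep:repositoryID "YOUR_ID" ;
rdfs:label "YOUR_LABEL" .'''

print(set_repository(
    sample,
    "http://localhost:8080/openrdf-sesame/repositories/phage11",
    "phage11",
    "PHAGE_1.1",
))
```

In practice one would read config.ttl from disk, apply the function, and write the result back after keeping a backup copy.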
WfMiner_1.1: Workflow Pattern Miner and Rule Recommender
Download WfMiner 1.1 [3.5M]
WfMiner_1.1 mines abstract closed patterns and generates association rules from XML workflow sequence files and a specific domain ontology.
Input data: WFMiner_WF_2018_2019.xml [13M], WFMiner_WF_gold_100.xml [252K], WFMiner_WF_gold_500.xml [1.1M]
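To give an intuition of what mining abstract patterns over an ontology involves, here is a minimal Python 3 sketch of support counting with generalization. It is not the WfMiner algorithm: it assumes a single-parent is-a hierarchy, treats a pattern as a subsequence with gaps allowed, and does not address closedness or rule generation. The concept hierarchy and the sequences below are hypothetical and are not taken from PHAGE.

```python
def ancestors_or_self(concept, parent_of):
    """Yield a concept followed by its ancestors in a single-parent is-a hierarchy."""
    while concept is not None:
        yield concept
        concept = parent_of.get(concept)


def matches(pattern, sequence, parent_of):
    """True if every pattern concept generalizes some event, in order (gaps allowed)."""
    events = iter(sequence)  # shared iterator, so matching only moves forward
    return all(
        any(concept in ancestors_or_self(event, parent_of) for event in events)
        for concept in pattern
    )


def support(pattern, sequences, parent_of):
    """Fraction of sequences that contain the abstract pattern."""
    return sum(matches(pattern, s, parent_of) for s in sequences) / len(sequences)


# Hypothetical is-a links (child -> parent).
PARENT_OF = {
    "MUSCLE": "AlignmentProgram",
    "ClustalW": "AlignmentProgram",
    "RAxML": "TreeInferenceProgram",
    "MrBayes": "TreeInferenceProgram",
    "AlignmentProgram": "Program",
    "TreeInferenceProgram": "Program",
}

# Hypothetical event sequences, one per extracted workflow.
SEQUENCES = [
    ["MUSCLE", "RAxML"],
    ["ClustalW", "Gblocks", "MrBayes"],
    ["RAxML", "MUSCLE"],
    ["ClustalW"],
]

pattern = ["AlignmentProgram", "TreeInferenceProgram"]
min_supp = 0.1  # same role as the minSupp argument of the miner
print(support(pattern, SEQUENCES, PARENT_OF) >= min_supp)
```

The abstract pattern is frequent here even though no single concrete pair of programs recurs, which is the benefit of generalizing events through the ontology.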
- Launch the bowlUtil_0.5 tool and transform the OWL ontology into a binary one (see the README file in $WFMINER_HOME/bowlUtil/). The bowl transformation speeds up the mining process by loading a lighter version of the ontology. Note: to skip this step, use the bowl version of the ontology from the input data (above), and don't forget to download the Gene Ontology (OWL version).
- Unzip the WfMiner_1.1.zip file and launch the WfMiner miner using the following command in your shell (see the README file in $WFMINER_HOME/):
java -jar [PATH_TO]/OntoPattern16.jar "[minSupp]" "[PATH_TO]/[bowl_file]" "[PATH_TO]/[train_set]" "[namespace]" "[PATH_TO]/[test_set]" "[topkItems]" "[topnRules]" "[minontology_level]"
For example:
java -jar ./OntoPattern16.jar "0.1" "./phylOntology_v51_small_final.bowl" "./WD-Phy-extracted-1/WD-Phy-extracted-1_2783_0.xml" "http://www.co-ode.org/ontologies/ont.owl#" "./WD-Phy-gold-1.xml" "10" "50" "2"
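Because the miner takes eight positional arguments, it is easy to swap two of them. The Python 3 sketch below assembles the command from named parameters, in the order given above. The helper names `build_wfminer_command` and `run_wfminer` are hypothetical and are not shipped with WfMiner.

```python
import subprocess


def build_wfminer_command(jar, min_supp, bowl_file, train_set, namespace,
                          test_set, top_k_items, top_n_rules, min_ontology_level):
    """Return the argument list for the miner, in the documented positional order."""
    return [
        "java", "-jar", jar,
        str(min_supp), bowl_file, train_set, namespace, test_set,
        str(top_k_items), str(top_n_rules), str(min_ontology_level),
    ]


def run_wfminer(command):
    """Run the miner; requires Java and the jar to be present (not called here)."""
    return subprocess.run(command, check=True)


# Values copied from the example command above.
cmd = build_wfminer_command(
    jar="./OntoPattern16.jar",
    min_supp=0.1,
    bowl_file="./phylOntology_v51_small_final.bowl",
    train_set="./WD-Phy-extracted-1/WD-Phy-extracted-1_2783_0.xml",
    namespace="http://www.co-ode.org/ontologies/ont.owl#",
    test_set="./WD-Phy-gold-1.xml",
    top_k_items=10,
    top_n_rules=50,
    min_ontology_level=2,
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues with the `#` in the namespace.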
Other T-GOWLer tools
WfTransformer_1.0: Download WfTransformer 1.0 [2.9K]
This tool transforms the Gate inline XML into sequences of events (encoded in a simple XML tree).
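The kind of transformation involved can be illustrated with a short Python 3 sketch. It is not the WfTransformer code: the inline tag names (`Program`, `Data`), the output element names (`sequence`, `event`) and the sample sentence are all hypothetical, chosen only to show inline annotations being flattened into an ordered event list.

```python
import xml.etree.ElementTree as ET


def inline_to_sequence(inline_xml, event_tags):
    """Collect annotated spans, in document order, into a flat <sequence> tree."""
    root = ET.fromstring(inline_xml)
    sequence = ET.Element("sequence")
    for node in root.iter():  # iter() walks the tree in document order
        if node.tag in event_tags:
            event = ET.SubElement(sequence, "event", {"type": node.tag})
            event.text = "".join(node.itertext()).strip()
    return sequence


# Hypothetical inline-annotated sentence.
sample = (
    "<doc>Sequences were aligned with <Program>MUSCLE</Program> and trees were "
    "inferred from the <Data>alignment</Data> using <Program>RAxML</Program>.</doc>"
)

seq = inline_to_sequence(sample, {"Program", "Data"})
print(ET.tostring(seq, encoding="unicode"))
```

The surrounding text is dropped and only the annotated spans survive, which is what makes the result usable as an event sequence.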
WfSimulator_1.0: Download WfSimulator 1.0 [7.5K]
This tool simulates phylogenetic workflows using instances encoded in the PHAGE ontology. A priori abstract patterns provided by an expert guide the workflow reconstruction. The simulator is based on a Monte Carlo simulation that fixes a number of parameters on each run to generate event sequences.
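As a rough illustration of the idea, the Python 3 sketch below draws one concrete instance per abstract concept of an expert pattern, once per run. It is a simplification under stated assumptions and not the WfSimulator code: the function `simulate`, the pattern and the instance lists are hypothetical, and instances are sampled uniformly.

```python
import random


def simulate(pattern, instances_of, n_runs, seed=None):
    """Generate n_runs event sequences by sampling an instance for each concept."""
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    return [
        [rng.choice(instances_of[concept]) for concept in pattern]
        for _ in range(n_runs)
    ]


# Hypothetical abstract pattern and instance lists (not taken from PHAGE).
PATTERN = ["AlignmentProgram", "TreeInferenceProgram"]
INSTANCES_OF = {
    "AlignmentProgram": ["ClustalW", "MUSCLE"],
    "TreeInferenceProgram": ["MrBayes", "RAxML"],
}

for sequence in simulate(PATTERN, INSTANCES_OF, n_runs=3, seed=42):
    print(sequence)
```

Fixing the seed makes a batch of simulated sequences repeatable, which helps when comparing miner settings on the same synthetic data.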
Other Sample Data
Ontologies: PHAGE-schema_1.0.owl.zip [4.8K] (or use the BioPortal repository for a graphical view)
Annotated texts: Corpus_PMC_2015_1_goldStandard_(gate_datastore).zip [17M]
Corpus_PMC_2013_2015_annotated.zip [1.3M] (corrupted)
Extracted workflows: WFPub-1-PMC_2008-2013.zip [2.9M]
For any technical issues, please e-mail admin: firstname.lastname@example.org
This work has been supported by the NSERC Discovery Grants of Canada of Petko Valtchev and Abdoulayé Banié Diallo.