Groovy recipes
- Groovy is a dynamic scripting language for the JVM.
- Groovy support in GATE is provided by the Groovy plugin, which provides:
  - The Groovy scripting console in the GATE Developer GUI.
  - A Groovy script PR, which lets you use an arbitrary Groovy script in a GATE application pipeline.
  - A collection of extra methods on various GATE core types that can be used from Groovy code.
- See the user guide for details of all the above.
- These pages are intended to be a repository of sample scripts.
- There are just a few for now; add yours whenever you write something useful.
1. Export to CSV
This exports every feature and its value for every annotation in every annotation set, one CSV row per feature, so the data can be loaded into spreadsheets and other analysis tools. An annotation with no features gets a single row with the feature columns left empty.
The scriptParams map needs a key called "outputFile" whose value is the path of the output file. The script appends to this file rather than overwriting it; if you want a fresh file on each run, change the script.
Note also that if any of your feature values contain newlines, these are written out as they are. Your importing program needs to handle this; LibreOffice Calc does, provided the feature value is quoted as below.
new File(scriptParams.outputFile).withWriterAppend{ out ->
  // the named sets plus the default set, keyed by set name
  (doc.getNamedAnnotationSets() + [Default:(doc.getAnnotations())]).each{ setName, set ->
    set.each{ anno ->
      if( anno.getFeatures() ) {
        // one row per feature
        anno.getFeatures().each{ fName, fValue ->
          out.writeLine(/"${doc.getName()}","${setName}",${anno.getId()},"${anno.getType()}",${anno.start()},${anno.end()},"${fName}","${fValue}"/)
        }
      } else {
        // no features: one row with empty feature name and value
        out.writeLine(/"${doc.getName()}","${setName}",${anno.getId()},"${anno.getType()}",${anno.start()},${anno.end()},,/)
      }
    }
  }
}
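One thing the script above does not handle is a feature value that itself contains a double quote, which would break the quoting that importers rely on. The usual CSV convention is to double any embedded quote. The following is a minimal sketch of that convention in plain Groovy; csvField and csvLine are hypothetical helper names, not part of GATE or the Groovy plugin, and unlike the script above the sketch quotes every field, including numeric ones.

```groovy
// Quote one CSV field: wrap it in double quotes and double any embedded quotes.
// csvField and csvLine are hypothetical helpers, not GATE API.
String csvField(Object value) {
    if (value == null) return ''
    return '"' + value.toString().replace('"', '""') + '"'
}

// Join a list of values into one CSV row.
String csvLine(List fields) {
    return fields.collect { csvField(it) }.join(',')
}

// Stand-in values; in the export script these would come from doc, setName and anno.
println csvLine(['doc1.xml', 'Default', 'Person', 'note', 'said "hi"\nthen left'])
```

In the export script, each out.writeLine call could then be given the result of csvLine applied to the same values that are currently interpolated into the slashy string.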
2. Filter by document feature
This one is in the user guide
Factory.newCorpus("fredsDocs").addAll( docs.findAll{ it.features.annotator == "fred" } )
3. Filter if annotation sets exist - e.g. double annotated
Useful for separating out all double annotated docs from a corpus
Factory.newCorpus("doubleDocs").addAll( docs.findAll{ (it.annotationSetNames.contains("annotator1") && it.annotationSetNames.contains("annotator2")) } )
4. Filter if an annotation exists
Useful for e.g. separating out documents with a rare annotation
Factory.newCorpus("filtered").addAll( docs.findAll{ !it.getAnnotations("SETNAME").get("TYPE").isEmpty() } )
5. Choose an app to execute
You can already conditionally execute PRs based on a document feature. By placing two pipelines as PRs in a third conditional pipeline, you can extend this to execute a pipeline based on a document feature. This Groovy script goes one further, and chooses a pipeline to execute based on some other aspect of the document - in this case, the existence of a particular annotation set.
app1 = apps.find{ it.name.equals("app1") }
app2 = apps.find{ it.name.equals("app2") }
Factory.newCorpus("tempCorpus").withResource { tempCorpus ->
  docs.each{
    // pick the pipeline according to whether the annotator1 set exists
    app = (it.annotationSetNames.contains("annotator1")) ? app1 : app2
    // run the chosen pipeline over a corpus holding just this document
    tempCorpus.add(it)
    app.setCorpus(tempCorpus)
    app.execute()
    tempCorpus.clear()
  }
}
println "done"
6. How many annotations?
// Count the Anatomy annotations in the Filtered set of each document, and in total
sum = 0
docs.each{
  def filteredAnnots = it.getAnnotations("Filtered")
  num = filteredAnnots["Anatomy"].size()
  sum += num
  println it.name + " " + num
}
println "total:" + " " + sum
7. How many documents with a particular annotation and feature?
In this case, how many documents contain image tags for png images in the default annotation set?
count = 0
docs.each{ doc ->
  hasFeature = false
  doc.getAnnotations()["img"].each{ anno ->
    if(anno.getFeatures()["src"] ==~ /.*png/) hasFeature = true
  }
  if(hasFeature) count++
}
println count
8. Count annotations across all sets and documents in a corpus
This one prints out each document name, followed by all of the AnnotationSets in that document with a count of each Annotation type, followed by a total for each Annotation type in the document. At the end, totals for AnnotationSets and types are given across the whole corpus.
corpusTotalCounts = [:].withDefault { 0 }
corpusSetCounts = [:]
docs.each{ doc ->
  println doc.getName()
  docCounts = [:].withDefault { 0 }
  // Default AnnotationSet
  println " Default AnnotationSet"
  corpusSetCounts["Default"] = corpusSetCounts["Default"] ?: [:].withDefault { 0 }
  setCounts = [:].withDefault { 0 }
  doc.getAnnotations().each{ anno ->
    (setCounts[(anno.getType())])++
    (docCounts[(anno.getType())])++
    (corpusTotalCounts[(anno.getType())])++
    (corpusSetCounts["Default"][(anno.getType())])++
  }
  println " ${setCounts}"
  // Named AnnotationSets
  doc.getAnnotationSetNames().each{ asName ->
    println " ${asName} AnnotationSet"
    corpusSetCounts[(asName)] = corpusSetCounts[(asName)] ?: [:].withDefault { 0 }
    setCounts = [:].withDefault { 0 }
    doc.getAnnotations(asName).each{ anno ->
      (setCounts[(anno.getType())])++
      (docCounts[(anno.getType())])++
      (corpusTotalCounts[(anno.getType())])++
      (corpusSetCounts[(asName)][(anno.getType())])++
    }
    println " ${setCounts}"
  }
  println " Document totals: ${docCounts}"
}
println ""
println "Corpus totals: ${corpusTotalCounts}"
println "Corpus AnnotationSet totals:"
corpusSetCounts.each{ println " ${it.key} ${it.value}" }
An alternative might be to use the following, then do some flattening and spreading to sum things?
defaultSetName = "Default"
counts = [:]
docs.each{ doc ->
  docName = doc.getName()
  counts[(docName)] = counts[(docName)] ?: [:]
  // default annotation set
  counts[(docName)][(defaultSetName)] = counts[(docName)][(defaultSetName)] ?: [:].withDefault { 0 }
  doc.getAnnotations().each{ anno ->
    (counts[(docName)][(defaultSetName)][(anno.getType())])++
  }
  // named annotation sets
  doc.getAnnotationSetNames().each{ setName ->
    counts[(docName)][(setName)] = counts[(docName)][(setName)] ?: [:].withDefault { 0 }
    doc.getAnnotations(setName).each{ anno ->
      (counts[(docName)][(setName)][(anno.getType())])++
    }
  }
}
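The "flattening and spreading" step could look something like the following. This is only a sketch in plain Groovy over made-up sample data with the same shape as the counts map (document name, then set name, then annotation type, then count); sumCounts is a hypothetical helper, not part of GATE or the Groovy plugin.

```groovy
// Add up a collection of (annotation type -> count) maps into one map.
// sumCounts is a hypothetical helper, not GATE API.
Map sumCounts(Collection maps) {
    def totals = [:]
    maps.each { m ->
        m.each { type, n -> totals[type] = (totals[type] ?: 0) + n }
    }
    return totals
}

// Made-up sample data in the same shape as the counts map:
// document name -> annotation set name -> annotation type -> count.
def counts = [
    doc1: [Default: [Token: 10, Person: 2], Key: [Person: 1]],
    doc2: [Default: [Token: 7], Key: [Person: 3, Location: 1]]
]

// Spread .values() over the documents, then flatten to one list of type maps.
def corpusTotals = sumCounts(counts.values()*.values().flatten())
println "Corpus totals: ${corpusTotals}"

// Per-set totals: spread over the documents again, picking one set from each.
def setNames = counts.values()*.keySet().flatten().unique()
setNames.each { setName ->
    def perDoc = counts.values()*.get(setName).findAll { it != null }
    println "${setName}: ${sumCounts(perDoc)}"
}
```

The explicit null check in findAll matters because an empty map is false under Groovy truth, so a bare findAll would also drop sets that exist but contain no annotations.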
9. Rename annotations
This one is for the Groovy PR, but could easily be adapted to the console using the above ideas. You could also parameterise it if needed - see the user guide for details.
inputAS.findAll{
  it.type == "OldName"
}.each{
  outputAS.add(it.start(), it.end(), "NewName", it.features.toFeatureMap()) // clone the feature map
}.each{
  inputAS.remove(it)
}
10. Copy annotations with new name taken from a feature
This takes all annotations of a given type with a given feature (in this case Mention annotations in the Key set with a class feature), and copies to a new annotation named for the value of that feature (in this case, the value of the class feature).
For example, if you have Mention annotations with class=Person or class=Organisation, you would end up with Person and Organisation annotations.
docs.each{ d ->
  key = d.getAnnotations("Key")
  key.get("Mention").each{ m ->
    // the new annotation type is the value of the class feature
    key.add(m.start(), m.end(), m.getFeatures().get("class"), m.getFeatures().toFeatureMap())
  }
}
11. Remove near-duplicate annotations
This script for the Groovy PR deletes extra annotations with the same type and start and end offsets, leaving one in each place. You could also test features using the GATE API.
List<Annotation> annList = new ArrayList<Annotation>(inputAS.get("MultiWord"));
Collections.sort(annList, new OffsetComparator());
for (int i=0 ; i < annList.size() - 1 ; i++) {
  Annotation annI = annList.get(i);
  for (int j=i+1 ; j < annList.size() ; j++) {
    Annotation annJ = annList.get(j);
    // same start and end offsets: remove the earlier one and keep the later
    if (annJ.getStartNode().getOffset().equals(annI.getStartNode().getOffset())
        && annJ.getEndNode().getOffset().equals(annI.getEndNode().getOffset()) ) {
      inputAS.remove(annI);
      break;
    }
  }
}
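If features should count towards deciding what is a duplicate, one option is to group on a key made of the start offset, end offset and feature map, and delete all but one annotation in each group. The sketch below shows the idea in plain Groovy, using maps as stand-ins for annotations so that it runs outside GATE; duplicateIds is a hypothetical helper, and adapting it to real Annotation objects is left as an assumption rather than tested code.

```groovy
// Return the ids of the stand-in annotations to delete: all but the first
// in each group sharing the same start, end and features.
// duplicateIds is a hypothetical helper, not GATE API.
List duplicateIds(List annots) {
    def groups = annots.groupBy { [it.start, it.end, it.features] }
    return groups.values().collectMany { group -> group.drop(1)*.id }
}

// Stand-ins for annotations: plain maps with the fields the comparison needs.
def annots = [
    [id: 1, start: 0L, end: 5L, features: [kind: 'a']],
    [id: 2, start: 0L, end: 5L, features: [kind: 'a']],  // duplicate of 1
    [id: 3, start: 0L, end: 5L, features: [kind: 'b']],  // same span, different features
    [id: 4, start: 6L, end: 9L, features: [kind: 'a']]
]

println "Would delete: ${duplicateIds(annots)}"
```

Grouping makes a single pass over the annotations, whereas the nested loops above compare every pair.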
12. Calculate TF-IDF score for each document of a corpus
The tfidf.groovy script in plugins/Groovy/resources/scripts creates a TF-IDF frequency map for the whole corpus, and stores this as a feature on the corpus. If you have a large number of documents, you might prefer to create a TF-IDF map for each document, and store each map as a feature on the corresponding document. If so, modify the afterCorpus() method as follows:
void afterCorpus(c) {
  def tfIdfDoc = [:] // Map to store TF-IDF scores for each document
  docMap.each { term, docsWithTerm ->
    def idf = Math.log((double)docNum / docsWithTerm.size())
    docsWithTerm.each { docId ->
      tfIdfDoc[docId] = frequencies[docId]
      tfIdfDoc[docId][term] = frequencies[docId][term] * idf
    }
  }
  // Store document TF-IDF scores as a feature on each document
  def i = 0
  c.each { doc ->
    doc.features.freqTable = tfIdfDoc[i]
    i++
  }
}
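To see what the modified method computes, here is the same weighting sketched in plain Groovy, outside GATE. It assumes only a map of per-document term frequencies (document id to term to count), which stands in for the frequencies variable above; tfIdfPerDocument is a hypothetical name. Unlike the method above, which overwrites the frequency maps in place, this sketch builds new maps and leaves its input untouched.

```groovy
// Compute a TF-IDF map for each document.
// frequencies: document id -> term -> raw count in that document.
// tfIdfPerDocument is a hypothetical helper, not part of tfidf.groovy.
Map tfIdfPerDocument(Map frequencies) {
    int docNum = frequencies.size()
    // term -> number of documents that contain the term
    def docFreq = [:]
    frequencies.each { docId, terms ->
        terms.keySet().each { term -> docFreq[term] = (docFreq[term] ?: 0) + 1 }
    }
    // Same formula as above: tf * log(docNum / number of docs with the term)
    return frequencies.collectEntries { docId, terms ->
        [docId, terms.collectEntries { term, tf ->
            [term, tf * Math.log((double) docNum / docFreq[term])]
        }]
    }
}

// Made-up sample data: two documents, keyed by index.
def scores = tfIdfPerDocument([0: [cat: 2, dog: 1], 1: [cat: 1]])
scores.each { docId, table -> println "${docId}: ${table}" }
```

A term that occurs in every document gets an IDF of log(1) = 0, so its score is zero in each document's map.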
13. Annotating the interesting parts of a document, and the bits around it
docEnd = gate.Utils.end(doc);
// find the extent of the existing Interesting annotations on the document
interestingAnnots = inputAS.get("Interesting");
intStart = gate.Utils.start(interestingAnnots);
intEnd = gate.Utils.end(interestingAnnots);
// add new annotations for 4 different zones
outputAS.add( 0L, docEnd, "WholeDoc", Factory.newFeatureMap());
outputAS.add( 0L, intStart, "BeforeInt", Factory.newFeatureMap());
outputAS.add(intStart, intEnd, "IntZone", Factory.newFeatureMap());
outputAS.add(intEnd, docEnd, "AfterInt", Factory.newFeatureMap());