Groovy recipes

1. Export to CSV

This exports all of the features and values for every annotation in every annotation set, opening up a whole world of integration and analysis possibilities to you.

scriptParams needs a key called "outputFile" whose value is the path of the file to write to. The script appends to this file rather than overwriting it. If you don't like this, change the script.

Note also that if any of your feature values contain newlines, these are written out as they are. Your importing program needs to cope with this. LibreOffice Calc will, provided the feature value is quoted as below.


new File(scriptParams.outputFile).withWriterAppend{ out ->
  (doc.getNamedAnnotationSets() + [Default:(doc.getAnnotations())]).each{ setName, set ->
    set.each{ anno ->
      if( anno.getFeatures() )
        anno.getFeatures().each{ fName, fValue ->
          out.writeLine(/"${doc.getName()}","${setName}",${anno.getId()},"${anno.getType()}",${anno.start()},${anno.end()},"${fName}","${fValue}"/)
        }
      else
        out.writeLine(/"${doc.getName()}","${setName}",${anno.getId()},"${anno.getType()}",${anno.start()},${anno.end()},,/)        
    }
  }
}
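
The script above wraps feature values in double quotes but does not escape any double quotes inside a value. Below is a minimal sketch of one way to handle that, assuming the common CSV convention of doubling embedded quotes. `csvQuote` and `csvLine` are hypothetical helpers, not part of GATE, and unlike the recipe they quote every field, including the numeric ones.

```groovy
// Hypothetical helper (not part of GATE): quote a value for CSV output,
// doubling any embedded double quotes. A null becomes an empty quoted field.
def csvQuote(value) {
  '"' + (value == null ? '' : value.toString().replace('"', '""')) + '"'
}

// Hypothetical helper: build one CSV line from a list of fields, quoting each one
def csvLine(List fields) {
  fields.collect { csvQuote(it) }.join(',')
}

// Sample row with an embedded quote and an embedded newline in the last field
println csvLine(['doc1', 'Default', 12, 'Person', 'note', 'said "hello"\nthen left'])
```

In the export script, the string passed to out.writeLine could be built with a helper like this instead of the slashy-string template.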

2. Filter by document feature

This one is in the user guide. It builds a new corpus from the documents whose "annotator" document feature is "fred".

Factory.newCorpus("fredsDocs").addAll( 
  docs.findAll{ 
    it.features.annotator == "fred"
  } 
)

3. Filter if annotation sets exist - e.g. double annotated

Useful for separating out all double-annotated docs from a corpus.

Factory.newCorpus("doubleDocs").addAll( 
  docs.findAll{ 
    (it.annotationSetNames.contains("annotator1")
    && it.annotationSetNames.contains("annotator2"))
  }
)

4. Filter if an annotation exists

Useful for, e.g., separating out documents with a rare annotation. Replace SETNAME and TYPE with the annotation set name and annotation type you are looking for.

Factory.newCorpus("filtered").addAll( 
  docs.findAll{
    !it.getAnnotations("SETNAME").get("TYPE").isEmpty()
  } 
)

5. Choose an app to execute

You can already conditionally execute PRs based on a document feature. By placing two pipelines as PRs in a third, conditional pipeline, you can extend this to choose a pipeline based on a document feature. This Groovy script goes one step further and chooses a pipeline based on some other aspect of the document: in this case, the existence of a particular annotation set.

app1 = apps.find{it.name.equals("app1")}
app2 = apps.find{it.name.equals("app2")}
Factory.newCorpus("tempCorpus").withResource { tempCorpus ->
  docs.each{
    app = (it.annotationSetNames.contains("annotator1")) ? app1 : app2
    tempCorpus.add(it)
    app.setCorpus(tempCorpus)
    app.execute()
    tempCorpus.clear()
  }
}
println "done"

6. How many annotations?

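// Count the Anatomy annotations in each document's "Filtered" set,
// and the total across the corpus
sum = 0
docs.each{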
  def filteredAnnots = it.getAnnotations("Filtered")
  num = filteredAnnots["Anatomy"].size()
  sum += num
  println it.name + " " + num
}
println "total:" + " " + sum

7. How many documents with a particular annotation and feature?

In this case, how many documents contain image tags for png images in the default annotation set?

count = 0
docs.each{doc ->
  hasFeature = false
  doc.getAnnotations()["img"].each{anno ->
    if(anno.getFeatures()["src"] ==~ /.*png/) hasFeature = true
  }
  if(hasFeature) count++
}
println count
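
The same count can be expressed with Groovy's `count` and `any` collection methods instead of a flag and a counter. The sketch below uses plain lists and maps as stand-ins for documents and their img annotations so that it runs outside GATE. The data and its shape are assumptions made for illustration.

```groovy
// Stand-in data (an assumption for illustration): each "document" carries the
// src feature values of its img annotations.
def sampleDocs = [
  [name: 'docA', srcs: ['logo.png', 'photo.jpg']],
  [name: 'docB', srcs: ['chart.gif']],
  [name: 'docC', srcs: ['a.png', 'b.png']],
  [name: 'docD', srcs: []]
]

// Count the documents in which at least one src value ends in "png"
def pngDocCount = sampleDocs.count { d ->
  d.srcs.any { it ==~ /.*png/ }
}

println pngDocCount
```

A document is counted once however many matching images it has, which is the same behaviour as the hasFeature flag in the recipe.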

8. Count annotations across all sets and documents in a corpus

This one prints out each document name, followed by all of the AnnotationSets in that document with a count of each Annotation type, followed by a total for each Annotation type in the document. At the end, totals for AnnotationSets and types are given across the whole corpus.

corpusTotalCounts = [:].withDefault { 0 }
corpusSetCounts = [:]
docs.each{doc ->
 println doc.getName()
 docCounts = [:].withDefault { 0 }

 // Default AnnotationSet
 println "   Default AnnotationSet"
 corpusSetCounts["Default"] = corpusSetCounts["Default"] ?: [:].withDefault { 0 }
 setCounts = [:].withDefault { 0 }
 doc.getAnnotations().each{anno ->
   (setCounts[(anno.getType())])++
   (docCounts[(anno.getType())])++
   (corpusTotalCounts[(anno.getType())])++
   (corpusSetCounts["Default"][(anno.getType())])++
 }
 println "      ${setCounts}"
 
 // Named AnnotationSets
 doc.getAnnotationSetNames().each{asName ->
   println "   ${asName} AnnotationSet"
   corpusSetCounts[(asName)] = corpusSetCounts[(asName)] ?: [:].withDefault { 0 }
   setCounts = [:].withDefault { 0 }
   doc.getAnnotations(asName).each{anno ->
     (setCounts[(anno.getType())])++
     (docCounts[(anno.getType())])++
     (corpusTotalCounts[(anno.getType())])++
     (corpusSetCounts[(asName)][(anno.getType())])++
   }
   println "      ${setCounts}"
 }
 println "   Document totals: ${docCounts}"
}
println ""
println "Corpus totals: ${corpusTotalCounts}"
println "Corpus AnnotationSet totals:"
corpusSetCounts.each{
  println "   ${it.key} ${it.value}"
}

An alternative might be to collect the counts in a single nested map, keyed by document name, then annotation set name, then annotation type, and sum over it afterwards as needed:

defaultSetName = "Default"
counts = [:]
docs.each{doc ->
  docName = doc.getName()
  counts[(docName)] = counts[(docName)] ?: [:]
  
  counts[(docName)][(defaultSetName)] = counts[(docName)][(defaultSetName)] ?: [:].withDefault { 0 }
  doc.getAnnotations().each{anno ->
      (counts[(docName)][(defaultSetName)][(anno.getType())])++
  }
    
  doc.getAnnotationSetNames().each{setName ->
    counts[(docName)][(setName)] = counts[(docName)][(setName)] ?: [:].withDefault { 0 }
    doc.getAnnotations(setName).each{anno ->
      (counts[(docName)][(setName)][(anno.getType())])++
    }
  }   
}
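
The summing step could look like the following sketch. It assumes a `counts` map with the document, set, type nesting built by the script above. The sample values are made up so that the snippet runs on its own.

```groovy
// Sample data in the same shape as the nested counts map (values are made up)
def counts = [
  doc1: [Default: [Token: 10, Person: 2], Key: [Person: 3]],
  doc2: [Default: [Token: 7], Key: [Person: 1, Location: 4]]
]

// Total per annotation type across all documents and sets
def typeTotals = [:].withDefault { 0 }
// Total per annotation set across all documents
def setTotals = [:].withDefault { 0 }

// Walk the three levels of nesting and accumulate both totals
counts.each { docName, sets ->
  sets.each { setName, types ->
    types.each { type, n ->
      typeTotals[type] += n
      setTotals[setName] += n
    }
  }
}

println "Type totals: ${typeTotals}"
println "Set totals: ${setTotals}"
```

Keeping the raw nested map and deriving totals from it means other views, such as per-document totals, can be computed later without walking the corpus again.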

9. Rename annotations

This one is for the Groovy PR, but could easily be adapted to the console using the above ideas. You could also parameterise it if needed - see the user guide for details.

inputAS.findAll{
  it.type == "OldName"
}.each{
  outputAS.add(it.start(), it.end(),
               "NewName",
               it.features.toFeatureMap()) // clone the feature map
}.each{
  inputAS.remove(it)
}

10. Copy annotations with new name taken from a feature

This takes all annotations of a given type with a given feature (in this case Mention annotations in the Key set with a class feature), and copies each one to a new annotation whose type is the value of that feature (in this case, the value of the class feature).

For example, if you have Mention annotations with class=Person or class=Organisation, you would end up with Person and Organisation annotations.

docs.each{d ->
  key = d.getAnnotations("Key")
  key.get("Mention").each{m ->
    key.add(m.start(), m.end(), m.getFeatures().get("class"), m.getFeatures().toFeatureMap())
  }
}

11. Remove near-duplicate annotations

This script for the Groovy PR deletes extra annotations with the same type and start and end offsets, leaving one in each place. You could also test features using the GATE API.

List<Annotation> annList = new ArrayList<Annotation>(inputAS.get("MultiWord"));
Collections.sort(annList, new OffsetComparator());

for (int i=0 ; i < annList.size() - 1 ; i++) {
  Annotation annI = annList.get(i);
  
  for (int j=i+1 ; j < annList.size() ; j++) {
    Annotation annJ = annList.get(j);
    
    if (annJ.getStartNode().getOffset().equals(annI.getStartNode().getOffset())
        && annJ.getEndNode().getOffset().equals(annI.getEndNode().getOffset()) ) {
      inputAS.remove(annI);
      break;
    }
  }
}
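
The same idea can be expressed by grouping annotations on their start and end offsets and treating everything after the first member of each group as a duplicate. The sketch below uses plain maps as stand-ins for annotations so that it runs outside GATE. The data is an assumption made for illustration. Note that it keeps the first annotation of each group, whereas the loop above keeps the last.

```groovy
// Stand-in annotations (an assumption for illustration): maps with id, start and end
def annots = [
  [id: 1, start: 0, end: 5],
  [id: 2, start: 0, end: 5],   // same span as id 1
  [id: 3, start: 6, end: 9],
  [id: 4, start: 0, end: 5],   // same span as id 1
  [id: 5, start: 6, end: 12]
]

// Group by the (start, end) pair, preserving the original order within each group
def bySpan = annots.groupBy { [it.start, it.end] }

// Every member of a group after the first is a duplicate
def duplicates = bySpan.values().collectMany { group -> group.drop(1) }

println duplicates*.id
```

In the Groovy PR, the equivalent of `duplicates` would then be removed from inputAS one by one.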



12. Calculate TF-IDF score for each document of a corpus

The tfidf.groovy script in plugins/Groovy/resources/scripts creates a TF-IDF frequency map for the whole corpus, and stores this as a feature on the corpus. If you have a large number of documents, you might prefer to create a TF-IDF map for each document, and store each map as a feature on the corresponding document. If so, modify the afterCorpus() method as follows:

void afterCorpus(c) {
  def tfIdfDoc = [:]  // Map to store TF-IDF scores for each document
  docMap.each { term, docsWithTerm ->
    def idf = Math.log((double)docNum / docsWithTerm.size())
    docsWithTerm.each { docId ->
      tfIdfDoc[docId] = frequencies[docId]
      tfIdfDoc[docId][term] = frequencies[docId][term] * idf
    }
  }
  // Store document TF-IDF scores as a feature on each document
  c.eachWithIndex { doc, i ->
    doc.features.freqTable = tfIdfDoc[i]
  }
}
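
To see what the calculation produces, here is a small stand-alone sketch with made-up data. The shapes of `docMap` (term to list of document ids), `frequencies` (document id to term counts) and `docNum` (number of documents) are inferred from the method above and have not been checked against tfidf.groovy. Unlike the method above, which reuses the frequencies maps, this sketch writes the scores into separate maps.

```groovy
// Assumed shapes, inferred from the method above (not checked against tfidf.groovy)
def docNum = 2
def frequencies = [
  0: [cat: 2, dog: 1],
  1: [cat: 1]
]
def docMap = [
  cat: [0, 1],
  dog: [0]
]

// Build a separate TF-IDF map per document, leaving frequencies untouched
def tfIdfDoc = [:].withDefault { [:] }
docMap.each { term, docsWithTerm ->
  // Inverse document frequency: log(total documents / documents containing the term)
  def idf = Math.log((double) docNum / docsWithTerm.size())
  docsWithTerm.each { docId ->
    tfIdfDoc[docId][term] = frequencies[docId][term] * idf
  }
}

tfIdfDoc.each { docId, scores -> println "${docId}: ${scores}" }
```

A term that appears in every document, such as "cat" here, gets an IDF of log(1) = 0 and so scores zero in every document.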

13. Annotating the interesting parts of a document, and the bits around it

docEnd = gate.Utils.end(doc);

// get the existing "Interesting" annotations; their overall span defines the zones
interestingAnnots = inputAS.get("Interesting");

intStart = gate.Utils.start(interestingAnnots);
intEnd   = gate.Utils.end(interestingAnnots);

// add new annotations for 4 different zones
outputAS.add( 0L, docEnd, "WholeDoc", Factory.newFeatureMap());

outputAS.add( 0L, intStart, "BeforeInt", Factory.newFeatureMap());

outputAS.add(intStart, intEnd, "IntZone", Factory.newFeatureMap());

outputAS.add(intEnd, docEnd, "AfterInt", Factory.newFeatureMap());