Chapter 11
Profiling Processing Resources
11.1 Overview
There is a reporting tool for GATE processing resources. It reports the total time taken by processing resources and the time taken for each document to be processed by an application of type corpus pipeline.
GATE uses log4j, a logging system, to write profiling information to a file. The GATE profiling reporting tool reads the file generated by log4j and produces a report on the processing resources. It profiles JAPE grammars at the rule level, enabling the user to identify performance bottlenecks precisely. It also produces a report on the time taken to process each document.
The initial code for the reporting tool was written by Intelius employees Andrew Borthwick and Chirag Viradiya and generously released under the LGPL licence to be part of GATE.
11.1.1 Features
- Ability to generate the following two reports:
- Report 1: Total time taken for each level of processing [Application, Multiphase transducer (MPT), processing resource (PR), and Phase], subtotalled at each level.
- Report 2: Top N most expensive documents in terms of total time taken by a given PR element for those documents (sorted in descending order of time taken by the processing element for that document).
- Report 1 specific features
- A command line option to sort the time statistics in order of amount of time consumed by processing elements or in order of execution.
- A command line option to hide the processing elements which took 0 milliseconds for processing.
- Ability to generate HTML with a collapsible tree.
- Report 2 specific features
- A command line option to specify the number of most expensive (in terms of processing time for given PR) documents to be included in the report.
- A command line option to specify the PR for which the processing time of a document is to be considered.
- Features common to both reports
- Ability to generate the report in different formats: as indented text or as HTML.
- Ability to generate a report only on the log entries from the last logical run of GATE.
- All processing times are reported in seconds and as a percentage of total time (rounded to the nearest 0.1%).
- An API counterpart to the command line interface, for using the tool from within other Java projects.
- The tool halts if the benchmark.txt file is modified while generating the report.
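As an illustration of the figures both reports show, the sketch below converts millisecond timings into seconds and a percentage of the total rounded to the nearest 0.1%, with optional sorting by time taken and hiding of zero-time entries. It is not the tool's actual implementation: the class TimingSummary, its method names and the output layout are hypothetical, and only the rounding and sorting rules are taken from the feature list above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of GATE: formats per-element timings in the
// style described for Report 1 (seconds plus percentage of total time).
class TimingSummary {

  // Share of the total as a percentage, rounded to the nearest 0.1%.
  static double percentOfTotal(long millis, long totalMillis) {
    if (totalMillis == 0) {
      return 0.0;
    }
    return Math.round(millis * 1000.0 / totalMillis) / 10.0;
  }

  // Returns one line per processing element. The map's iteration order is
  // taken to be the execution order (use a LinkedHashMap to preserve it).
  static List<String> summarize(Map<String, Long> millisByElement,
                                boolean sortByTimeTaken,
                                boolean suppressZeroTimeEntries) {
    long total = 0;
    for (long ms : millisByElement.values()) {
      total += ms;
    }

    List<Map.Entry<String, Long>> entries =
        new ArrayList<>(millisByElement.entrySet());
    if (sortByTimeTaken) {
      // Descending by time; the sort is stable, so ties keep execution order.
      entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
    }

    // Mirrors the documented rule: zero-time suppression is ignored when
    // sorting by time taken.
    boolean hideZeros = suppressZeroTimeEntries && !sortByTimeTaken;

    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, Long> e : entries) {
      long ms = e.getValue();
      if (hideZeros && ms == 0) {
        continue;
      }
      lines.add(String.format(Locale.ROOT, "%s: %.3f s (%.1f%%)",
          e.getKey(), ms / 1000.0, percentOfTotal(ms, total)));
    }
    return lines;
  }
}
```

Passing a LinkedHashMap keeps the entries in execution order when sortByTimeTaken is false.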
11.1.2 Limitations
Be aware that profiling is only supported for applications of type corpus pipeline. Other application types work on a single document or on none at all, so there is little to gain from profiling them. To get meaningful results you should run your corpus pipeline on at least 10 documents.
11.2 Graphical User Interface
Profiling can be activated, and profiling reports created, from the ‘Profiling reports’ submenu of the ‘Tools’ menu in GATE.
You can ‘Start recording’ and ‘Stop recording’ the activity of the processing resources at any time. The logging is cumulative, so if you want a fresh report you must first use the ‘Clear the log’ menu item.
Two types of reports are available: ‘Report on processing resources’ and ‘Report on document processed’. See the previous section for more information.
11.3 Command Line Interface
Report 1 Usage: java gate.util.reporting.PRTimeReporter [Options]
Options:
-i input file path (default: benchmark.txt in the execution directory)
-m print media - html/text (default: html)
-z supressZeroTimeEntries - true/false (default: true)
-s sorting order - exec_order/time_taken (default: exec_order)
-o output file path (default: report.html/txt in the system temporary directory)
-l logical start (not set by default)
-h show help
Note that supressZeroTimeEntries is ignored if the sorting order is ‘time_taken’.
Report 2 Usage: java gate.util.reporting.DocTimeReporter [Options]
Options:
-i input file path (default: benchmark.txt in the execution directory)
-m print media - html/text (default: html)
-d number of docs, use -1 for all docs (default: 10 docs)
-p processing resource name to be matched (default: all_prs)
-o output file path (default: report.html/txt in the system temporary directory)
-l logical start (not set by default)
-h show help
- Run report 1: report on total time taken by each processing element across the corpus
- java -cp "gate/bin:gate/lib/GnuGetOpt.jar" gate.util.reporting.PRTimeReporter -i benchmark.txt -o report.txt -m text
- Run report 2: report on time taken by each document within a given corpus
- java -cp "gate/bin:gate/lib/GnuGetOpt.jar" gate.util.reporting.DocTimeReporter -i benchmark.txt -o report.html -m html
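To make the effect of the -d and -p options of Report 2 more concrete, here is a minimal sketch of the selection they describe: keep the timings whose PR name contains the search string, total them per document, sort in descending order and keep the top N, with -1 meaning all documents. This only illustrates the idea and is not GATE's implementation. The class TopDocuments, the Timing type and the way timings are represented are hypothetical, and substring matching is assumed from the setSearchString description in Section 11.4.3.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of GATE: picks the most expensive documents
// for the processing resources whose name matches a search string.
class TopDocuments {

  static final int ALL_DOCS = -1;

  // One timing record: how long one PR spent on one document.
  static final class Timing {
    final String document;
    final String prName;
    final long millis;

    Timing(String document, String prName, long millis) {
      this.document = document;
      this.prName = prName;
      this.millis = millis;
    }
  }

  // Returns document names ordered by descending total time.
  // prSubstring == null matches every PR; a negative noOfDocs
  // (such as ALL_DOCS) returns every document.
  static List<String> topDocuments(List<Timing> timings,
                                   String prSubstring,
                                   int noOfDocs) {
    Map<String, Long> totals = new LinkedHashMap<>();
    for (Timing t : timings) {
      if (prSubstring == null || t.prName.contains(prSubstring)) {
        totals.merge(t.document, t.millis, Long::sum);
      }
    }

    List<Map.Entry<String, Long>> sorted = new ArrayList<>(totals.entrySet());
    sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());

    int limit = (noOfDocs < 0) ? sorted.size() : Math.min(noOfDocs, sorted.size());
    List<String> names = new ArrayList<>();
    for (int i = 0; i < limit; i++) {
      names.add(sorted.get(i).getKey());
    }
    return names;
  }
}
```

Calling topDocuments(timings, "HTML", 2) in this sketch corresponds roughly to running DocTimeReporter with -p HTML -d 2.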
11.4 Application Programming Interface
11.4.1 Log4j.properties
This configuration is required to direct the profiling information to the benchmark.txt file. The benchmark.txt file generated by GATE is then used as input for the GATE profiling report tool.
- # File appender that outputs only benchmark messages
- log4j.appender.benchmarklog=org.apache.log4j.RollingFileAppender
- log4j.appender.benchmarklog.Threshold=DEBUG
- log4j.appender.benchmarklog.File=benchmark.txt
- log4j.appender.benchmarklog.MaxFileSize=5MB
- log4j.appender.benchmarklog.MaxBackupIndex=1
- log4j.appender.benchmarklog.layout=org.apache.log4j.PatternLayout
- log4j.appender.benchmarklog.layout.ConversionPattern=%m%n
- # Configure the Benchmark logger so that it only goes to the benchmark log file
- log4j.logger.gate.util.Benchmark=DEBUG, benchmarklog
- log4j.additivity.gate.util.Benchmark=false
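Since the settings above are listed one per bullet, it can help to see them as a plain properties fragment. The sketch below embeds the same keys and values in a Java string and loads them with java.util.Properties, as a quick sanity check that the gate.util.Benchmark logger is routed only to the benchmarklog appender. The class BenchmarkLogConfigCheck and its methods are hypothetical, and the check only parses the text; it does not start or configure log4j.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sanity check, not part of GATE or log4j.
class BenchmarkLogConfigCheck {

  // The configuration from this section as it would appear in log4j.properties.
  static final String FRAGMENT = String.join("\n",
      "# File appender that outputs only benchmark messages",
      "log4j.appender.benchmarklog=org.apache.log4j.RollingFileAppender",
      "log4j.appender.benchmarklog.Threshold=DEBUG",
      "log4j.appender.benchmarklog.File=benchmark.txt",
      "log4j.appender.benchmarklog.MaxFileSize=5MB",
      "log4j.appender.benchmarklog.MaxBackupIndex=1",
      "log4j.appender.benchmarklog.layout=org.apache.log4j.PatternLayout",
      "log4j.appender.benchmarklog.layout.ConversionPattern=%m%n",
      "# Configure the Benchmark logger so that it only goes to the benchmark log file",
      "log4j.logger.gate.util.Benchmark=DEBUG, benchmarklog",
      "log4j.additivity.gate.util.Benchmark=false");

  // Parses the fragment with the standard properties parser.
  static Properties load() {
    Properties props = new Properties();
    try {
      props.load(new StringReader(FRAGMENT));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  // True if the Benchmark logger names exactly one appender, benchmarklog,
  // and additivity is switched off so messages do not reach other appenders.
  static boolean routesOnlyToBenchmarkLog(Properties props) {
    String logger = props.getProperty("log4j.logger.gate.util.Benchmark", "");
    String additivity = props.getProperty("log4j.additivity.gate.util.Benchmark", "true");
    String[] parts = logger.trim().split("\\s*,\\s*"); // "LEVEL, appender"
    return parts.length == 2
        && parts[1].equals("benchmarklog")
        && additivity.trim().equals("false");
  }
}
```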
11.4.2 Enabling profiling
There are two ways to enable profiling of the processing resources:
- In gate/build.properties, add the line: run.gate.enable.benchmark=true
- In your Java code, use the method: Benchmark.setBenchmarkingEnabled(true)
11.4.3 Reporting tool
- Instantiate the Class PRTimeReporter
- PRTimeReporter report1 = new PRTimeReporter();
- Set the input benchmark file
- File benchmarkFile = new File("benchmark.txt");
- report1.setBenchmarkFile(benchmarkFile);
- Set the output report File
- File reportFile = new File("report.txt"); or
- File reportFile = new File("report.html");
- report1.setReportFile(reportFile);
- Set the following optional parameters:
- PrintMedia : Generate report in html or text format (default: both)
- report1.setPrintMedia(PRTimeReporter.MEDIA_TEXT); or
- report1.setPrintMedia(PRTimeReporter.MEDIA_HTML);
- SortOrder: Sort in order of execution or descending order of time taken (default: exec_order)
- report1.setSortOrder(PRTimeReporter.SORT_TIME_TAKEN); or
- report1.setSortOrder(PRTimeReporter.SORT_EXEC_ORDER);
- SupressZeroTimeEntries: True/False (default: True). This parameter is ignored if the SortOrder specified is ‘SORT_TIME_TAKEN’
- report1.setSupressZeroTimeEntries(true);
- LogicalStart: A string indicating the logical start to be operated upon for generating reports
- report1.setLogicalStart("InteliusPipelineStart");
- Generate the text/html report
- report1.executeReport();
- Instantiate the Class DocTimeReporter
- DocTimeReporter report2 = new DocTimeReporter();
- Set the input path for benchmark.txt
- File benchmarkFile = new File("benchmark.txt");
- report2.setBenchmarkFile(benchmarkFile);
- Set the output report path
- File reportFile = new File("report.txt"); or
- File reportFile = new File("report.html");
- report2.setReportFile(reportFile);
- Set the following optional parameters:
- PrintMedia : Generate report in html or text format (default: both)
- report2.setPrintMedia(DocTimeReporter.MEDIA_TEXT); or
- report2.setPrintMedia(DocTimeReporter.MEDIA_HTML);
- NoOfDocs: Number of documents to be displayed in the report (default: 10 docs)
- report2.setNoOfDocs(2); // 2 docs or
- report2.setNoOfDocs(DocTimeReporter.ALL_DOCS); // All documents
- SearchString: The PR name for which the documents are to be considered (default: MATCH_ALL_PR_REGEX).
- report2.setSearchString("HTML"); // match all PRs having HTML as a substring
- LogicalStart: A string indicating the logical start to be operated upon for generating reports
- report2.setLogicalStart("InteliusPipelineStart");
- Generate the text/html report
- report2.executeReport();