Bad News
Dr Jeremy Black, one of the world's foremost scholars of Sumerian, has died. This is a source of great sadness to all who worked with him. We hope that our future efforts in this area will serve as a small (though inadequate) memorial to his excellent work.
Summary
The GATE/ETCSL project is based at the University of Sheffield and involves collaboration with the Oriental Institute at the University of Oxford.
The project will enhance an existing, world-leading computational infrastructure to create generic tools for language researchers annotating and analysing diverse electronic corpora. As a test-bed application, the Electronic Text Corpus of Sumerian Literature (ETCSL) will be enriched with linguistic information, so that its full potential can be exploited through more wide-ranging and sophisticated analysis. The enhanced ETCSL, with the associated analytical software, will be made available over the Web, thus sharing freely with other researchers not only a valuable textual resource but also powerful software facilitating its analysis. The generic tools adapted for the project will be transferable to other corpora.
Contact: Hamish Cunningham (PI).
The project will enhance two resources: 1) a corpus of Sumerian literature, and 2) a computational infrastructure for advanced language processing:
1) The project is centred on the Electronic Text Corpus of Sumerian Literature (ETCSL), edited and published by Jeremy Black and his colleagues with funding from Oxford University and the Leverhulme Trust. This digital library is the largest and most coherent corpus of Sumerian literature that is currently possible: so far approximately 36,000 lines of verse have been edited and 350 compositions published at the Web site, which currently receives some 60,000 accesses a month, from over 130 countries.
Prior to publication of this corpus, the rich literature of Sumerian was unavailable to all but a few specialist scholars. Sumerian itself is a long-dead ancient Near Eastern language; it was the first language to have a writing system and subsequently had a written history of approximately 3,000 years. Its literature includes mythological narratives, hymns of praise, didactic compositions, folk tales and proverbs. This literature is published at the Web site in transliteration in roman script (Sumerian was written in cuneiform script). Each composition has been editorially assembled from individual manuscripts, but includes details of substantive textual variants, and is accompanied by an English prose translation and bibliography. As part of ongoing AHRB funding of the ETCSL, the corpus is being expanded as new compositions are edited.
2) Over the last seven years, the University of Sheffield Computer Science Department has developed a General Architecture for Text Engineering (GATE), which supports advanced language analysis, data visualisation, and information sharing in languages from English to Urdu. A key feature of the work is its wide user community: the software has been used at hundreds of sites worldwide. This project will adapt GATE to benefit a new constituency of humanities researchers sharing language resources. This adaptation will be driven in the first instance by the needs of the ETCSL resource, but we will also emphasise applicability to other such resources (e.g., the Newton digital library), and make all the results freely available to the community. As part of a large research group we can provide technical support to our users beyond the bounds of this project, and as an open source system GATE's future is unconstrained by institutional boundaries. The system is free for all users.
Under the current AHRB-funded stylistic, lexical and grammatical research on the ETCSL, work is already in hand to provide the Web site with simple search, collocation and concordance facilities. What is envisaged here is a quite distinct undertaking: linguistic annotation at an altogether higher level. Adapting the GATE infrastructure to the ETCSL will substantially enhance the value of the resource by automatically annotating it with the type of linguistic data that enables more sophisticated language analysis, and will provide easy-to-use tools enabling researchers to view and analyse this data over the Web. The more extensive and sophisticated this linguistic annotation is, the more useful the resource will be for a whole range of types of analysis; and the more user-friendly the Web-based tools are, the more accessible the corpus will be to a wider range of scholars. The annotation will mark features such as morphology and part of speech, as well as lemmatic information, allowing homographs to be distinguished and variant writings of the same lemma to be grouped together. Since the annotation process will be automated to a large degree, the corpus will be more comprehensively annotated than would otherwise be possible, and the significance of the results will increase accordingly.
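To illustrate what lemmatic annotation makes possible, the following is a minimal sketch in Python. It does not use GATE or the actual ETCSL data format; the `Token` record, the function names and the part-of-speech labels are assumptions made for illustration only, and the word forms and lemma names are invented placeholders, not real Sumerian.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word (hypothetical record, not the ETCSL schema)."""
    form: str        # the writing as it appears in the transliteration
    lemma: str       # the dictionary headword assigned by the annotation
    pos: str         # part-of-speech label, e.g. "N" or "V"
    morph: str = ""  # optional morphological analysis


def group_by_lemma(tokens):
    """Collect every variant writing found for each (lemma, pos) pair."""
    groups = defaultdict(set)
    for token in tokens:
        groups[(token.lemma, token.pos)].add(token.form)
    return {key: sorted(forms) for key, forms in groups.items()}


def homographs(tokens):
    """Return forms that are annotated with more than one (lemma, pos) pair."""
    readings = defaultdict(set)
    for token in tokens:
        readings[token.form].add((token.lemma, token.pos))
    return {form: sorted(r) for form, r in readings.items() if len(r) > 1}


# Invented placeholder data: two writings of one lemma, and one homograph.
tokens = [
    Token("form-a", "lemma1", "N"),
    Token("form-b", "lemma1", "N"),
    Token("form-c", "lemma2", "N"),
    Token("form-c", "lemma3", "V"),
]

variants = group_by_lemma(tokens)
ambiguous = homographs(tokens)
```

With this placeholder data, `variants` groups "form-a" and "form-b" under the same lemma, while `ambiguous` reports "form-c" as a homograph with two distinct readings. A search by lemma rather than by written form can therefore find every writing of a word and exclude unrelated words that happen to be written identically.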
The benefit of having morphological and lexical information as part of cultural heritage digital libraries has been dramatically demonstrated by the success of the Perseus Digital Library (http://www.perseus.tufts.edu/), where this information is used not only by scholars, but also by students and interdisciplinary researchers who want to study the electronic collections but are not proficient in the language. As demonstrated by the popularity of the language tools for ancient Greek and Latin in Perseus, the creation of Web-based tools for Sumerian is likely to enhance the accessibility of this collection to students and the general public.
The project has arisen in the context of a European Union/National Science Foundation Working group on ePhilology, in which Black and Bontcheva are participants. The goal of this group is to maximise ways in which the study of cultural heritage languages can be enhanced by the language technology developed in computer science.
Selected Publications