0.3, November 11, 2013
http://www.ucomp.eu/
A piece of information, such as a musical composition, a text, a word, a picture, independently from how it is concretely realized.
A linguistic object consisting of a string (independently of its physical realization).
Its topological unity can change according to its physical realization: in a written realization, its boundaries are blank spaces; in a spoken realization, the boundary is sometimes silence and sometimes not, and higher-order features intervene.
Grammatical notions, such as noun, verb, adjective, etc., are roles defined by a grammar, and words (or larger linguistic objects) can play those roles in a given language. E.g., the word 'share' can play both 'verb' and 'noun' roles in contemporary English, while the word 'come' can only play the 'verb' role in English, and the 'adverb' or 'conjunction' roles in Italian (but if we consider a word as only realized by phonemes, i.e. if we consider the oral realizations of 'come', there is no common word 'come' in the two languages).
Opinion/Sentiment/Emotion
Working definitions:
- opinion as the positive, negative or neutral intellective statement of an individual person (the opinion holder) about a specific entity that we call the opinion target.
- sentiment as the positive or negative affective-intellective judgement of an individual person about a specific entity, characterized by polarity and intensity.
- emotion as the positive or negative affective state of an individual person, characterized by polarity and intensity. Unlike sentiment, an emotion does not necessarily have a target entity.
In its most general sense, the term is synonymous with vocabulary. A dictionary can be seen as a set of lexical entries. The lexicon has a special status in generative grammar, where it refers to the component containing all the information about the structural properties of the lexical items in a language. In linguistics, ... we don't normally speak of the vocabulary of a particular language; instead, we speak of the lexicon, the total store of words available to a speaker. Very commonly, the lexicon is not regarded merely as a long list of words. Rather, we conceive the lexicon as a set of lexical resources, including the morphemes of the languages, plus the processes available in the language for constructing words from those resources. Apart from the lexicon of a language as a whole, psycholinguists are interested in the mental lexicon, the words and lexical resources stored in an individual brain.
opinion mining, sentiment analysis, subjectivity analysis, emotion recognition, emotion mining, emotion detection
NamedEntity refers to (a mention of) an instance of a person, organization, or location name, e.g. “Turkey”, “Austrian Parliament”, “David Cameron”.
Other types of named entity that we can provide will play a supportive role for adding information about InformationObjects, e.g. temporal expressions (dates and times) such as “2010”, “May 27 2011”, “Tuesday 4pm”, and certain types of numerical expressions (monetary values and percentages).
A collection of documents/texts.
Data resources that are produced or consumed in uComp
Equivalent to http://en.wikipedia.org/wiki/Ontology_(information_science)
In computer science and information science, a formal representation of knowledge as a set of concepts within a domain, and the relationships between those concepts that can be used to reason about the entities within that domain, and may be used to describe the domain.
a social media submission, e.g. a tweet.
ugc
organization
news
TODO: -
twitter
googleplus
blog
forum
other
image
video
flickr
facebook
youtube
external
mixed
internal
(D5.1) Language Samples annotated with uComp OSE annotations.
JSON
UTF-8
UTF-8
https://developers.google.com/+/api/
List of search terms, REST request
external
Following Google’s terms of service.
https://developers.google.com/+/api/
opinion mining, sentiment analysis, subjectivity analysis, emotion recognition, emotion mining, emotion detection
blog
socialnetwork
forum
other
wiki
List of search terms, REST request
JSON
Following Google’s terms of service.
external
UTF-8
API for accessing and retrieving YouTube data, see https://developers.google.com/youtube
UTF-8
UTF-8
Following Facebook’s terms of service.
UTF-8
List of search terms, REST request
external
JSON
API for accessing and retrieving Facebook data, see https://developers.facebook.com
Following Twitter’s terms of service.
UTF-8
JSON
API for accessing and retrieving Twitter data, see https://dev.twitter.com
UTF-8
external
List of search terms, REST request
internal
Modular open-source Python API that retrieves social data from Web sources such as Delicious, Flickr, Yahoo! and Wikipedia, including various helper classes for effective caching and data management. The toolkit provides components for content acquisition and caching, low-level natural language processing functionalities such as language detection, phonetic string similarity measures, and methods for string normalization.
http://www.weblyzard.com/ewrt/
Weichselbraun, A., Scharl, A. and Lang, H.-P. (2013). Knowledge Capture from Multiple Online Sources with the Extensible Web Retrieval Toolkit (eWRT). Seventh International Conference on Knowledge Capture (K-CAP 2013). Banff, Canada.
Total number of documents from relevant project-sources grouped by type (e.g. news, blogs, Twitter)
1456 documents were crawled from Twitter within the last month.
Measured in absolute number of documents.
Total number of unstructured data-sources relevant for the project grouped by their type (e.g. news, blogs)
German news media consist of 42 sources.
346 documents from Twitter are relevant for the project.
After applying the relevance and redundancy check, this number defines the relevant documents for each source (e.g. news, blogs, Twitter)
uComp’s human computation framework that accepts tasks from knowledge or language engineering platforms (e.g., Protege, Gate) and crowdsources these through games and/or mechanised labour marketplaces
CSV file with results
internal
Task template + CSV file with task da
An optimization engine deciding when and how certain tasks will be deployed depending on their urgency, status and potential costs.
approx. 48% of Climate Quiz players are female
Luis von Ahn and Laura Dabbish. 2008. Designing games with a purpose. Commun. ACM 51, 8 (August 2008), 58-67.
in the ESP game each player plays for a total of 91 minutes
“The overall amount of time the game is played by each player averaged across all people who have played it.”
ALP
The average amount of money spent for solving a problem instance (e.g., can be computed by dividing the total money paid to workers for a job by the number of problem instances that make up that job).
if $3 is paid for 50 translations, each translation task costs $0.06
MPT
Measured (but not formally defined) first in Luis von Ahn, Mihir Kedia, and Manuel Blum. 2006. Verbosity: a game for collecting common-sense facts.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '06), ACM, 75-78.
“The maximum amount of time the game is played by a given player”
Verbosity was played sometimes in sittings of over 3 hours.
Captures a financial aspect of instances of HumanComputing subclasses (= the amount of money needed to design and set up an HC application such as a Game or a job on MTurk)
Game development is in the range of thousands of dollars; setting up jobs on CrowdFlower could be achieved with a few hundred dollars (e.g., a few working days from a research assistant)
measured in number of basic evaluation items.
set of quality measures for OSE annotated corpus (quantitative objective measures, e.g. precision, recall, F-measure for OSE annotation)
P. Paroubek and A. Pak and D. Mostefa, Annotations for opinion mining evaluation in the industrial context of the DOXA project, LREC 2010
number of sentences correctly annotated with polarity.
Hanne Fersøe: http://www.elra.info/services/validation_manual_lexica.pdf
measured in number of lexical entries
size, linguistic coverage, consistency of annotations etc.
Although originally defined for games, this measure is also used by the CrowdFlower crowdsourcing marketplace to estimate the “speed” at which a task is performed.
“The average number of problem instances solved, or input-output mappings performed, per human hour.”
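The ratio-style measures defined above (throughput, average lifetime play, and the average cost of a problem instance) can be sketched in a few lines of Python. This is only an illustration of the definitions as quoted; the function and parameter names are assumptions and are not part of the uComp ontology.

```python
def throughput(problem_instances_solved: int, human_hours: float) -> float:
    """Average number of problem instances solved per human hour."""
    if human_hours <= 0:
        raise ValueError("human_hours must be positive")
    return problem_instances_solved / human_hours


def average_lifetime_play(total_minutes_played: float, player_count: int) -> float:
    """Overall play time averaged across all people who have played (ALP)."""
    if player_count <= 0:
        raise ValueError("player_count must be positive")
    return total_minutes_played / player_count


def cost_per_problem_instance(total_paid: float, problem_instances: int) -> float:
    """Total money paid to workers divided by the number of problem instances."""
    if problem_instances <= 0:
        raise ValueError("problem_instances must be positive")
    return total_paid / problem_instances


if __name__ == "__main__":
    # Example from the text: $3 paid for 50 translations.
    print(round(cost_per_problem_instance(3.0, 50), 2))
    # Hypothetical inputs, chosen only to exercise the functions.
    print(throughput(466, 2.0))
    print(average_lifetime_play(182.0, 2))
```

All three measures share the same shape (a total divided by a count), so the only real design decision is which quantity goes in the denominator: human hours, players, or problem instances.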
(Some measure the amount of collected contributions per hour before these are aggregated into actual task solutions.)
Luis von Ahn and Laura Dabbish. 2008. Designing games with a purpose. Commun. ACM 51, 8 (August 2008), 58-67.
the ESP game assigned labels with the throughput of approx 233 labels/human hour
detects Topic in Text
Detects if a language sample expresses opinions, sentiments or emotions as opposed to purely objective text.
Analytics for detecting cheaters and ensuring high quality output.
mixed
XML, WordNet, Linked Data
Weichselbraun, Albert, Wohlgenannt, Gerhard, Scharl, Arno. 2010. Refining Non-Taxonomic Relation Labels with External Structured Data to Support Ontology Learning. Data and Knowledge Engineering 69 (8): 763-778.
Wohlgenannt, Gerhard, Belk, Stefan, Schett, Matthias. 2013. A Prototype for Automating Ontology Learning and Ontology Evolution. In 5th International Conference on Knowledge Engineering and Ontology Development (KEOD-2013), eds. Joaquim Filipe and Jan Dietz, 407-412. Vilamoura, Portugal: SciTePress.
Learns ontologies from heterogeneous sources: text, social media, linked data. Uses spreading activation and other techniques to integrate evidence.
Graph, OWL
Selects a class of given input.
37% in Climate Quiz
Percentage of tasks performed by the top 10 contributors (e.g., the top 10 scoring players in games or the 10 most active workers in a crowdsourcing project)
Introduced and discussed in “Chamberlain, J., Fort, K., Kruschwitz, U., Lafourcade, M. & Poesio, M. (2013) Using Games to Create Language Resources: Successes and Limitations of the Approach. In I. Gurevych & K. Jungi (Eds.) The People’s Web Meets NLP. Collaboratively Constructed Language Resources. Springer.”
The amount of time invested in building the HC application (including testing)
Number of players registering per month as a way to assess the success of advertisement.
Phrase Detectives recruited 2000 players over 32 months, so 62 players/month
The total number of contributors in a HC system. This parameter is particularly interesting for GWAPs as a way to assess their success.
Hundreds to thousands of players for GWAPs; many thousands of participants registered in crowdsourcing marketplaces.
In Verbosity players provide on average 29 facts each
The average number of items/unit tasks (e.g., labels, rankings) performed by one participant
22 out of 45 concepts relevant for domain in question
measured in number of relevant vs irrelevant concepts
measures an aspect of the quality of the ontology learning system.
The higher the ratio of relevant concepts suggested by the ontology learning system, the better.
measured in correct suggestions of HC compared to domain experts
20 of 40 relation types confirmed by HC
measures the quality of taxonomic and non-taxonomic relation detection in ontology learning. Suggests relation types for unlabelled relations
Percentage of women in player population.
Introduced and discussed in “Chamberlain, J., Fort, K., Kruschwitz, U., Lafourcade, M. & Poesio, M. (2013) Using Games to Create Language Resources: Successes and Limitations of the Approach. In I. Gurevych & K. Jungi (Eds.) The People’s Web Meets NLP. Collaboratively Constructed Language Resources. Springer.”
Host XY had an up-time of 99.216%.
Up-time and availability for all hosts involved in the content acquisition pipeline throughout the whole project cycle.
internal
On average it took 42 minutes from the publication of a news media item until the architecture had fully annotated the document.
Average time it takes from a new publication until our architecture acquires it and completes the annotation pipeline.
external
csv
txt
Open-source project and de facto industry standard for monitoring systems and processes. WU will host a Nagios instance to monitor the content acquisition pipeline and related servers.
http://www.nagios.org/
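The two pipeline measures described above (host up-time as a percentage, and the average time from publication to completed annotation) could be computed roughly as follows. This is a minimal sketch under stated assumptions: the function names and the input shapes are hypothetical, and it does not use or imitate the Nagios API.

```python
from datetime import datetime, timedelta
from statistics import mean


def uptime_percentage(observed: timedelta, downtime: timedelta) -> float:
    """Share of the observation window during which a host was up, in percent."""
    if observed <= timedelta(0):
        raise ValueError("observed window must be positive")
    if downtime < timedelta(0) or downtime > observed:
        raise ValueError("downtime must lie within the observed window")
    up = observed - downtime
    # Dividing two timedeltas yields a plain float ratio.
    return 100.0 * (up / observed)


def mean_annotation_latency(events: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time between publication and completed annotation.

    Each event is a (published_at, annotated_at) pair (assumed input format).
    """
    if not events:
        raise ValueError("at least one event is required")
    seconds = [(annotated - published).total_seconds()
               for published, annotated in events]
    return timedelta(seconds=mean(seconds))


if __name__ == "__main__":
    # Hypothetical observation window and two hypothetical documents.
    print(uptime_percentage(timedelta(hours=100), timedelta(hours=1)))
    t0 = datetime(2013, 11, 11, 9, 0)
    docs = [(t0, t0 + timedelta(minutes=40)), (t0, t0 + timedelta(minutes=44))]
    print(mean_annotation_latency(docs))
```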
Weichselbraun, A., Scharl, A. and Lang, H.-P. (2013). Knowledge Capture from Multiple Online Sources with the Extensible Web Retrieval Toolkit (eWRT). Seventh International Conference on Knowledge Capture (K-CAP 2013). Banff, Canada.
Platform integrates
CSV, txt, json, html, xls, odt
CSV, xls, txt, json, html
UTF-8
UTF-8
internal
UTF-8
Web interface for easy configuration of the content acquisition pipeline.
CSV, xls, txt
UTF-8
CSV, txt, JSON
The control dimension relates to the degree of power over the affect, and helps to distinguish emotions initiated by the subject from those elicited by the environment, e.g., contempt versus fear; this has also been called the strength, dominance, or confidence dimension in other models.
Mechanized Labour
Valence dimension refers to how positive or negative the affect is; this is also referred to as subjective feeling of pleasantness or unpleasantness
UTF-8
See http://demos.gate.ac.uk/trendminer/lodie/
for documentation and access.
Multilingual Entity Linking web service, produced by the TrendMiner project
XML, HTML, PDF, Word, possibly JSON
external
XML
Supports German and English now. French in development.
XML, HTML, DOC, PDF, JSON, CONLL
tweet tokenisation, POS tagging, tweet normalisation, and English named entity recognition
Any GATE supported: XML stand off or inline
https://gate.ac.uk/wiki/twitter-postagger.html for the English POS tagger
https://gate.ac.uk/wiki/twitie.html for the rest of the modules
internal
The affectivity dimension relates to the degree of the affectivity over the opinion, sentiment or emotion. According to this dimension, we distinguish between intellective, affective-intellective and affective expressions, e.g., approval (intellective) versus joy (affective) or satisfaction (affective-intellective) versus happiness (affective).
1. The value of n of the semantic category attribute is fixed to 3 for the moment but it can be changed later.
2. The fineness of the semantic description will depend on the results of the crowdsourcing experiments.
A list of words. Each word has 2 attributes:
Polarity: will hold valence information (positive, negative)
Semantic Category: up to n (1 <= n <= 3) semantic categories taken from the 20 uComp OSE classes.
Example:
good, positive, [VALORIZATION, SATISFACTION, APPEASEMENT].
boring, negative, [BOREDOM].
One of a group of traditional classifications of words according to their functions in context, including noun, pronoun, verb, adjective, adverb, preposition, and conjunction.
An orthographic unit in text.
The largest syntactically independent unit of grammar.
Games With A Purpose
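The lexicon entry format described above (a word, a polarity, and between 1 and n semantic categories, with n currently 3) can be modelled with a small Python data class. This is a hedged sketch: the class, constant, and function names are assumptions, and because the 20 uComp OSE classes are not listed in this section, category names are not validated against them.

```python
from dataclasses import dataclass
from typing import Tuple

MAX_CATEGORIES = 3  # "n" in the description; stated to be fixed to 3 for now
POLARITIES = {"positive", "negative"}


@dataclass(frozen=True)
class LexiconEntry:
    word: str
    polarity: str
    categories: Tuple[str, ...]

    def __post_init__(self) -> None:
        if self.polarity not in POLARITIES:
            raise ValueError(f"unknown polarity: {self.polarity!r}")
        if not 1 <= len(self.categories) <= MAX_CATEGORIES:
            raise ValueError(f"expected 1 to {MAX_CATEGORIES} semantic categories")


def parse_entry(line: str) -> LexiconEntry:
    """Parse a line such as 'good, positive, [CAT1, CAT2].' (assumed layout)."""
    head, _, tail = line.strip().rstrip(".").partition("[")
    word, polarity = [part.strip() for part in head.split(",")[:2]]
    categories = tuple(c.strip() for c in tail.rstrip("]").split(",") if c.strip())
    return LexiconEntry(word, polarity, categories)


if __name__ == "__main__":
    for raw in ("good, positive , [VALORIZATION, SATISFACTION, APPEASEMENT].",
                "boring, negative, [BOREDOM]."):
        print(parse_entry(raw))
```

Keeping the limit in a single constant reflects the note that n may be changed later; only `MAX_CATEGORIES` would need to be updated.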