LLOD schema for Simplified Offensive Language Taxonomy in multilingual detection and applications

Barbara Lewandowska-Tomaszczyk; Anna Bączkowska; Olga Dontcheva-Navrátilová; Chaya Liebeskind; Giedrė Valūnaitė Oleškevičienė; Slavko Žitnik; Marcin Trojszczak; Renata Povolná; Linas Selmistraitis; Andrius Utka; Dangis Gudelis

doi:10.1515/lpp-2023-0016

Publicly Available Published by De Gruyter Mouton December 12, 2023

From the journal Lodz Papers in Pragmatics

Abstract

The goal of the paper is to present a Simplified Offensive Language (SOL) Taxonomy, together with its application and testing in the Second Annotation Campaign, conducted between March and May 2023 on four languages (English, Czech, Lithuanian, and Polish), so that it can be verified and located in LLOD. With reference to the previous offensive language taxonomic models, proposed mostly by the same COST Action Nexus Linguarum WG 4.1.1 team, the number and variety of the categories underwent definitional revision, and the present typology was tested by annotating publicly available offensive language datasets in each of the four languages. The results of the annotation are presented and, as the inter-annotator agreement for the SOL categories and their aspects falls within accepted statistical values, we propose this taxonomy as a core ontology which represents the encoding of the supported offensive languages, and we justify its use on new data in terms of a more universal Linguistic Linked Open Data (LLOD) schema.

Keywords: offensive language; offensive language taxonomy; annotation; LLOD; linguistic linked open data; hate speech

1 Introduction

This paper is a presentation of the background and application of a Simplified Taxonomy of Offensive Language (SOL) (Lewandowska-Tomaszczyk 2022) in the identification of offensive language in four languages: Czech, Lithuanian, Polish, and English. The ultimate objective is to verify the taxonomy for the development of a schema for LLOD.

2 Previous models proposed by the Nexus Linguarum WG 4.1.1. team

2.1 SALLD-1 (Lewandowska-Tomaszczyk et al. 2021)

A definitional revision and enrichment of offensive language typology were the main objectives of Lewandowska-Tomaszczyk et al.’s (2021) publication. We reviewed over 60 existing corpora, compared the tagging schemas of the existing offensive (also called abusive, toxic, etc.) language tag set systems, and exemplified their classes in a proposed schema. Similar schemas generated with Sketch Engine data and non-contextual word embeddings, i.e., Word2Vec and GloVe, were consulted to gain better insight into their semantic differences in English. In the 2021 paper we developed a taxonomy covering a finite set of categories and aspects of offensive language representation along with linguistically sound explanations. We proposed a core ontology which represented the encoding of the defined offensive language schema. A survey of computational models for detecting offensive language was also presented, based on the HatEval Task 5 of SemEval-2019 (Basile et al. 2019) and on the OffensEval Tasks of SemEval-2019 and SemEval-2020 (Liu et al. 2019; Zampieri et al. 2019a and 2019b). We did not consider using the O-Dang! or similar ontologies (https://aclanthology.org/2022.salld-1.2/), as these proposals are rather broad and represent metadata of a corpus.

The ontology of offensive language we propose in this research provides defined classes for each concept. It was originally inspired by a three-level hierarchy of offensive language put forward by Zampieri et al. (2019a, 2019b). Unlike in Zampieri et al. (2019a, 2019b), however, in our research offensive language is further refined and divided into two basic levels of analysis (Level I and II), and four sublevels (A, B, C, D) within Level I. Level I first distinguishes lexical items that are offensive from those that are not (Level A: offensive vs. non-offensive). Secondly (Level B: targeted vs. non-targeted), the question whether the selected items are targeted at some addressee should be answered. If there is no identifiable addressee, then the use of offensive language is an example of self-expression, which has an exclamatory function, e.g., the use of swear words to express anger, frustration, pain, etc. (abusive swearing in Andersson and Trudgill 2007: 197). Targeted offensive items are further divided into either implicit or explicit cases of offensive language (Level C: implicit vs. explicit language). While implicitness may be encoded by, for example, sarcasm and irony, whereby offence is veiled, explicitness entails more straightforward forms of verbal attack. Classes of explicit targeted categories of offence are further subcategorized into types characterized by varying kinds of internal or external targets as well as a partly distinct characterization of the lexicon.

2.2 Integrated explicit and implicit offensive language taxonomy (Lewandowska-Tomaszczyk et al. 2023; Bączkowska et al. 2022)

The first attempt at proposing an integrated explicit and implicit offensive language taxonomy was made in Lewandowska-Tomaszczyk et al. (2023) and supported by analyses of implicit offensive language categorization as discussed in Bączkowska (2022) and Bączkowska et al. (2022). The implicit offensive language model that we proposed (Bączkowska et al. 2022a, 2022b) is rooted in Grice’s (1989) four categories of implicitness, i.e., metaphor, irony, hyperbole (overstatement) and meiosis (understatement), and was enriched by the category of indirectness, understood in the Searlian sense of Indirect Speech Acts (Searle 1975). Additionally, sarcasm was added as a subtype of irony, along with rhetorical question and simile. Overall, 8 main implicit categories were distinguished in our model. Whilst essentially two models of offence (explicit and implicit) have been proposed in our project, we also identify an area which lies between the two models and integrates them. The status of the items in this area is less obvious in terms of typology, as they seem to share features typical of both explicitness and implicitness. This primarily includes dead metaphors, e.g., hand of a clock, to fall/be in love, which, having a high frequency of occurrence and, as a result, being conventionalized, are no longer seen by the receivers as opaque. In fact, they are perfectly understandable, i.e., explicit in meaning, though at the same time, technically speaking, they are instances of implicit meaning.

In the paper by Lewandowska-Tomaszczyk et al. (2023), the concept of offensive language as a superordinate category was proposed with 17 hierarchically arranged subcategories, taxonomically structured into 4 levels and verified with the use of neural-based (lexical) embeddings, which automatically encode generic semantic relatedness as well as hypernymy, synonymy, and other types of relationships. The graphs included in the paper visualize the relationship between the embeddings of the concepts.

Together with a taxonomy of implicit offensive language and its subcategorization levels, which had received little scholarly attention before, the categories were divided into offensive category levels (types of offence, targets, etc.) and aspects (offensive language property clusters) as well as the main categories of explicitness and implicitness.

2.3 A short survey of results of the First Annotation Campaign (Lewandowska-Tomaszczyk et al. accepted for Rasprave)

The integrated explicit-implicit offensive language category schema was verified on a large body of English data, consisting of 25 publicly available English datasets of offensive language, with the INCEpTION tool (https://github.com/inception-project/inception), a semantic annotation platform offering assistance in the annotation. The annotation categories were defined according to the annotation guidelines. The annotation setup was the same in both campaigns: annotators selected one or more consecutive sentences (possibly the whole passage) that they identified as offensive and then assigned the appropriate annotation categories to the selected sentences.

The list of the English datasets used in this First Annotation Campaign is available as an Appendix in Lewandowska-Tomaszczyk et al. (2023). The results (Lewandowska-Tomaszczyk et al. accepted) partly support the proposed ontology of explicit offence and positive implicitness types to provide more variance among widely recognized types of figurative language (e.g., metaphorical, metonymic, ironic, etc.). However, further results and a series of the annotators’ comments in a questionnaire showed that for some of the categories there was low or medium inter-annotator agreement. It was also more challenging for annotators to distinguish between category items than between aspect items, with offensive, insulting and abusive being the most difficult to tell apart. The need for taxonomic simplification was thus recognized for further annotation practice and offensive language identification.

3 A Simplified Taxonomy of Offensive Language (SOL)

3.1 Introduction

The need to simplify the taxonomy for the purposes of computationally effective offensive language annotation and recognition was acknowledged, and a proposal of a Simplified Offensive Language taxonomy (SOL) was soon put forward (Lewandowska-Tomaszczyk 2022). The linguistic and computational limitations of any attempt to provide watertight categorization schemas in any language were discussed in that paper, and what was proposed there was a carefully supervised simplification of the Extended Model with respect to both the number and the types of the key categories. To verify the taxonomy, we resorted to the word embedding correlations discussed in Section 3.2 below.

3.2 Word embeddings for the English SOL taxonomy keywords

We used the approach given in the integrated explicit and implicit offensive language taxonomy publication (Lewandowska-Tomaszczyk et al. 2023) to examine the correlations between offensive language categories. To represent the categories as vectors in a lower-dimensional space, we utilized the Word2Vec word embedding method and calculated the cosine similarity between them. The word embeddings were learned from scratch on a corpus of offensive language. We also experimented with contextual embeddings such as ELMo, BERT, KeyBERT, USE and ConceptNet Numberbatch embeddings. Only the results of the last of these revealed some structures similar to the Word2Vec ones, while the results of the others were blurred and overlapping in the resulting 2D visualizations.

We first calculated the pairwise cosine similarity of our categories and aspects. The cosine similarity heatmap is shown in Figure 1. A heatmap is a color-coded visual representation of data, with red indicating high similarity and blue indicating low similarity. The figure is mostly blue. Since words without similar contexts have a low cosine value, most of these categories and aspects may be readily isolated and form independent offensive categories.

Figure 1

Cosine similarity heatmap
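
The pairwise comparison behind Figure 1 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' code: the three-dimensional vectors and the helper names (cosine_similarity, similarity_matrix) are hypothetical placeholders standing in for trained Word2Vec embeddings of the category and aspect keywords.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Pairwise cosine similarity for a {label: vector} mapping."""
    labels = list(embeddings)
    return {
        (a, b): cosine_similarity(embeddings[a], embeddings[b])
        for a in labels
        for b in labels
    }


# Toy vectors (hypothetical values, NOT taken from the paper's model).
toy_embeddings = {
    "racist": [0.9, 0.1, 0.0],
    "xenophobic": [0.8, 0.2, 0.1],
    "threat": [0.0, 0.1, 0.9],
}

matrix = similarity_matrix(toy_embeddings)
for (a, b), score in sorted(matrix.items()):
    print(f"{a:>10} vs {b:<10} {score:.2f}")
```

Each cell of such a matrix corresponds to one cell of the heatmap; values close to 1 would be drawn in red and values close to 0 in blue.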

While the categories can be separated readily, certain aspects cannot. Homophobic and racist aspects are the most similar (0.87), and xenophobic is very close to both (0.86 and 0.82, respectively). Sexist is also close to homophobic and racist (0.77 and 0.75, respectively). As evidenced by their proximity to the hateful category, the homophobic, racist, and xenophobic (0.85, 0.83 and 0.80, respectively) aspects are likely the most prevalent expressions of hatred within the corpus.

There is a high degree of similarity between ageism and ableism (0.76), and both ageism and ableism are close to classism (0.67 and 0.68, respectively). Ageism and ableism are also predominant aspects of the implicit category simile (0.77 and 0.74, respectively).

The lemma forms of the categories were then analyzed. We extracted the 30 most similar terms for each category, excluding words that contain the category, its lemma, or its stem as a substring. The t-SNE (t-distributed Stochastic Neighbor Embedding) method was then applied to the embeddings of the categories and their 30 most similar terms.
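
The neighbour selection step described above can be sketched as follows. This is a hedged illustration only: the function name nearest_terms, its parameters, and the two-dimensional toy vectors are assumptions made for the example, and the subsequent t-SNE projection is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_terms(category, forms, vectors, k=30):
    """Return up to k terms most similar to `category`.

    Terms containing any string in `forms` (the category name, its lemma,
    or its stem) as a substring are skipped, so the category word itself
    and its inflectional variants never appear in the result.
    """
    blocked = [form.lower() for form in forms]
    target = vectors[category]
    scored = [
        (cosine(target, vec), term)
        for term, vec in vectors.items()
        if not any(form in term.lower() for form in blocked)
    ]
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]


# Toy vocabulary (hypothetical vectors, for illustration only).
toy_vectors = {
    "threat": [1.0, 0.0],
    "threaten": [0.9, 0.1],
    "intimidate": [0.8, 0.2],
    "warn": [0.7, 0.4],
    "insult": [0.0, 1.0],
}

neighbours = nearest_terms("threat", ["threat"], toy_vectors, k=2)
print(neighbours)
```

In this toy run "threaten" is filtered out because it contains the stem, which mirrors the exclusion rule in the text.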

Figure 2 shows the fifty-to-two-dimensional t-SNE transformation of our embedding vectors. The t-SNE plot reveals that the discredit and threat categories form relatively well-defined clusters, whereas the hateful and insult categories overlap with various aspects of offensive language. The ideology aspect forms a relatively distinct cluster, but the remaining aspects are dispersed and overlap.

Figure 2

Fifty-to-two-dimensional t-SNE transformation of embedding vectors

3.3 SOL Taxonomy

The simplified taxonomy was presented as a step-by-step hierarchical procedure (Figure 3). It was intended to prepare the ground for the Second Annotation Campaign in a multilingual context (Czech, Polish, Lithuanian, and English).

Figure 3

Step-by-step hierarchical procedure of the simplified offensive language (SOL) taxonomy

It is crucial to establish first the overall offensiveness status of the selected sample. The categorical answer yes or no, like the other yes-no choices proposed in the categories below, is not a reflection of actual conceptual-linguistic reality (Lakoff 1987) but, rather, reflects the annotation requirement that a computer program be given dichotomous judgments (e.g., offensive or not, vulgar or not).

Levels 2 and 3 refer to the targets of an offensive act: an individual (target a), a group (target b), or else target c, where a group is addressed through a particular individual or an individual is meant to be a group representative. The main criterial property of target c is the use of gender, race, and other stereotypes in the offensive language sample, which paves the way to the category of hate speech as one of the offence types in the hierarchy. Target 2 is a tag which represents the presence or absence of the offensive language target at the locus of the interactional encounter.

The selection between vulgar and non-vulgar language (i.e., words, phrases) is taxonomically connected to the first-level selection between the offensive and non-offensive type, i.e., it concerns further judgments of the vulgarity of a particular sample provided in a larger linguistic context, as in the Yep usual bullshit response (Lewandowska-Tomaszczyk 2017). The lower distinctions are definitional with respect to the character of the offence. The category of insult marks an offence against an individual or a group that makes no reference to group stereotypes (e.g., “it’s the state of your own mental health you should be VERY concerned about” presents INSULT and the Aspect of ableism), as juxtaposed to the concept of hate speech, whose discriminating property is precisely the reference to a group or individual via discriminatory group stereotypes. The discredit tag signifies an offensive act addressed at an individual or a group on grounds of accusation of lying, immorality, unprofessionalism, or unfairness, while threat is a statement intended to frighten or intimidate a person or a group into believing in prospective harm they will experience (Brenner 2002).

The level of Aspects as defined in this model is considered a type of property targeted by a discriminating act; it can be addressed at the offendee’s race, gender, age, religion, ethnicity, ability, or any other kind of physical property or behavioural conduct (dubbed other in the Aspects compartment), or else at any combination of these.

The last-level distinction differentiates between linguistically explicit and implicit types of utterances (category types 7); for implicit utterances, one or more of the linguistically implicit categories is selected (cf. Bączkowska et al. 2022).
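
The hierarchical procedure described in this section can be pictured as a single annotation record whose lower-level tags only make sense once the sample has been judged offensive. The sketch below is an illustrative assumption rather than the schema used in the campaign: the class and field names are hypothetical, the Aspect values are free strings because the full list of 10 Aspects is not reproduced here, and the single consistency rule is our reading of the hierarchy.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Target(Enum):
    INDIVIDUAL = "a"            # target a
    GROUP = "b"                 # target b
    GROUP_VIA_INDIVIDUAL = "c"  # target c: stereotype-based address


class OffenceType(Enum):
    HATE_SPEECH = "hate speech"
    INSULT = "insult"
    DISCREDIT = "discredit"
    THREAT = "threat"


@dataclass
class SolAnnotation:
    """One annotated sample (hypothetical record layout)."""
    text: str
    offensive: bool
    target: Optional[Target] = None
    target_present: Optional[bool] = None  # Target 2: present/absent
    vulgar: Optional[bool] = None
    offence_types: List[OffenceType] = field(default_factory=list)
    aspects: List[str] = field(default_factory=list)
    implicit_categories: List[str] = field(default_factory=list)

    def problems(self):
        """Return consistency problems; an empty list means none found."""
        issues = []
        lower_level_tags = (
            self.target is not None
            or self.offence_types
            or self.aspects
            or self.implicit_categories
        )
        # Assumed rule: a non-offensive sample carries no lower-level tags.
        if not self.offensive and lower_level_tags:
            issues.append("non-offensive sample carries lower-level tags")
        return issues


# The insult example from the text; only INSULT and ableism come from the
# paper, the remaining fields are deliberately left unset.
example = SolAnnotation(
    text="it's the state of your own mental health you should be VERY concerned about",
    offensive=True,
    offence_types=[OffenceType.INSULT],
    aspects=["ableism"],
)
print(example.problems())
```

Keeping the offence types, aspects and implicit categories as lists reflects the statement that one or more values may be selected at those levels.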

4 Second Annotation Campaign and its results

In this section we present the results of the Second Annotation Campaign performed with the use of the SOL taxonomy on the four languages independently. In the tables below we quote positive results of each annotation, while the few problematic cases are discussed in Section 5 below.

4.1 English – a comparison with the First Annotation Campaign results

The first attempts to identify a satisfactory Offensive Language annotation system were carried out by our team on the example of English. In this section we present the result of applying the SOL taxonomy to English to see how the Extended and Simplified systems compare. The present annotation of English with the SOL taxonomy was conducted on a smaller set of 50 samples annotated by two annotators. The results are as follows:

Table 1

SOL taxonomy inter-rater annotation results for English

Annotation type	Agreement
Target 1 – Individual/group	0.82
Target 2 – present/absent	0.84
Vulgar	1.00
Offensive type – hate speech/insult	0.21
Offensive type discredit	0.57
Offensive type threat	1.00
Aspect 05	0.48
Aspect 05a	0.42
Aspect 05b	0.00 i.e., no tags given by annotators
Category 06	0.64
Category 06a	0.78
Category 06b	1.00

The SOL taxonomy annotation for English gives rather solid results. When contrasted with the results achieved in the First Annotation Campaign (Lewandowska-Tomaszczyk et al. accepted for publication), the present outcomes show a higher positive consistency in the annotators’ selections of the categories and Aspects, and higher values of their annotation agreement. Only one category (Offensive type - hate speech/insult) met with some problems, which may be remedied by means of training sessions more intensive than was possible in this first test period. The others are mostly above the standard, which qualifies them to be proposed as an LLOD standard.

The proposed LLOD schema will update the initial schema (Lewandowska-Tomaszczyk et al. 2021). It defines a hierarchical structure of SOL offensive types and categories along with defined lists of targets and aspects. In this work we extended the taxonomy over multiple languages, so each concept will contain rdfs:label and skos:definition in multiple languages. As the schema represents a collection of datasets, we will connect the classes with existing schemas such as DCAT-AP, so that the data is more interoperable. Apart from the schema, instances will represent exemplars of each concept for all languages that achieved the highest annotation agreement and are selected by the curators.
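
To make the idea of multilingual rdfs:label and skos:definition values concrete, the following sketch serialises one concept as Turtle. It is a hypothetical illustration, not the published schema: the sol: namespace URI, the local names Threat and OffensiveLanguage, the typing as skos:Concept, the use of skos:broader, and the Czech, Lithuanian and Polish labels (our own translations) are all assumptions. The English definition is the one given for threat in Section 3.3. Links to DCAT-AP are not shown.

```python
PREFIXES = (
    "@prefix sol: <http://example.org/sol#> .\n"  # placeholder namespace
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
)


def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def concept_to_turtle(local_name, labels, definitions, broader=None):
    """Serialise one concept with language-tagged labels and definitions."""
    lines = [f"sol:{local_name} a skos:Concept"]
    for lang, label in sorted(labels.items()):
        lines.append(f'    rdfs:label "{escape_literal(label)}"@{lang}')
    for lang, definition in sorted(definitions.items()):
        lines.append(f'    skos:definition "{escape_literal(definition)}"@{lang}')
    if broader:
        lines.append(f"    skos:broader sol:{broader}")
    return " ;\n".join(lines) + " .\n"


turtle = PREFIXES + "\n" + concept_to_turtle(
    "Threat",
    labels={"en": "threat", "cs": "hrozba", "lt": "grasinimas", "pl": "groźba"},
    definitions={
        "en": "a statement intended to frighten or intimidate a person or "
              "a group into believing in prospective harm they will experience",
    },
    broader="OffensiveLanguage",
)
print(turtle)
```

A full implementation would more likely rely on an RDF library and on the namespace eventually published with the schema; plain string building is used here only to keep the example dependency-free.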

4.2 Czech

The Czech Corpus of Offensive Language comprises 400 comments extracted from online discussions in ten Czech national newspapers and news platforms and is located in the Sketch Engine software tool. The corpus was annotated by two annotators who are linguists and share a similar social background, age, and profession. Prior to annotating the corpus, the two annotators carried out several training sessions focused on discussing potential problems in applying the simplified offensive language taxonomy.

The Cohen’s Kappa results for inter-rater agreement summarised in Table 2 show that the annotator agreement is high. More specifically, it is almost perfect for the categories Target 1 (0.89), Target 2 (0.93) and Vulgar (0.85), and substantial for the Offensive type categories (0.74 for both insult and discredit); the slight agreement for the Threat category (0.11) may be explained by its very low occurrence in the annotations. As to Aspects of offensive language and categories of figurativeness, there is substantial agreement at the level of the main category (0.70 and 0.61, respectively), but only moderate agreement in the sub-classes (0.52 and 0.53). This result is possibly affected by the absence of more specific instructions, as the annotators ranked the subclasses differently: for instance, annotator 1 classified Metaphor as sub-category 6a and Irony as sub-category 6b, while annotator 2 considered Metaphor as sub-category 6b and Irony as sub-category 6a. This is a rather technical issue that needs to be attended to in order to improve inter-rater agreement.

Table 2

SOL taxonomy inter-rater annotation results for Czech

Annotation type	Agreement
Target 1 – Individual/group	0.89
Target 2 – present/absent	0.93
Vulgar	0.85
Offensive type – hate speech/insult	0.74
Offensive type discredit	0.74
Offensive type threat	0.11
Aspect 05	0.70
Aspect 05a	0.52
Category 06	0.61
Category 06a	0.53
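
The agreement values reported in the tables are Cohen's Kappa scores. As a reminder of how such a score is obtained, the sketch below computes Kappa for two annotators and maps it to a verbal label. The verbal bands follow the widely used Landis and Koch scale, which matches the wording used above (slight, moderate, substantial, almost perfect) but is our assumption about the scale applied; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined: both annotators used one label only
    return (p_o - p_e) / (1 - p_e)


def interpret(kappa):
    """Verbal label for a Kappa value (Landis and Koch bands, assumed)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy example: 1 = tag assigned, 0 = tag not assigned (made-up labels).
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa, interpret(kappa))
```

The undefined case is worth noting for rarely used tags: when neither annotator assigns a tag in a sample, chance agreement equals 1 and Kappa cannot be computed from the formula.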

Overall, in the case of the Czech Corpus of Offensive Language, the use of the simplified taxonomy (Lewandowska-Tomaszczyk 2022) has yielded a considerably higher degree of inter-rater agreement in comparison with annotation campaigns using a more elaborate taxonomy of offensive language (e.g., Lewandowska-Tomaszczyk 2022). Other factors leading to a higher level of agreement might be a more intensive training campaign and careful preparation of the annotators.

4.3 Lithuanian

The pilot annotation applying the SOL taxonomy to Lithuanian was carried out on a sample of 200 items taken from the Lithuanian dataset LITIS, which is freely available in the CLARIN-LT repository at http://hdl.handle.net/20.500.11821/11. The LITIS corpus contains user-generated comments collected from two Lithuanian news portals: www.delfi.lt and www.lrytas.lt. Each comment is stored in a separate TXT file, which contains the comment, its date and time, the nickname of the author, and the URL and title of the article commented on. Comments from www.delfi.lt amount to 17,909 items, dated 2014, and comments from www.lrytas.lt amount to 182,000 items, dated 2010-2014.

The Cohen’s Kappa results for inter-annotator agreement (Table 3) provide comparable results to the other languages of the annotation experiment.

Table 3

SOL taxonomy inter-rater annotation results for Lithuanian

Annotation type	Agreement
Target 1 – Individual/group	0.4
Vulgar	0.78
Offensive type threat	1.00
Aspect 05	0.45
Aspect 05a	0.65
Aspect 05b	1.00
Category 06	1.00
Category 06a	1.00

The Cohen’s Kappa values range between 0.4 and 0.78, showing solid agreement on the identification of the vulgar category, with 1.00 as the maximum reached. A Cohen’s Kappa value of 0.00 could also be observed for offensive type threat and for the supplemental aspects and categories which were not tagged by the annotators in the annotated sample. A closer look at the annotated data revealed that the lower annotator agreement for the categories Target 1 - Individual/group and Aspect 5 could be explained by differences in the chunks chosen by the annotators. It could also indicate a need for annotator pre-training sessions devoted to discussing the annotated samples and the offensive language taxonomy, comparing the results, and resolving disagreements.

4.4 Polish

The Polish annotation was performed by two raters on 100 samples derived from two Polish offensive language corpora (Ptaszyński and Masui 2018, Ptaszyński et al. 2019, Troszyński and Wawer 2017). The two datasets provide ca. 10,000 words, which makes them suitable for our annotation. Inter-annotator agreement for the majority of the categories, their types and aspects is satisfactory, with values ranging from 0.20-0.30 to 1.00, i.e., perfect agreement.

Table 4

SOL taxonomy inter-rater annotation results for Polish

Annotation type	Agreement
Target 1 – Individual/group	1.00
Vulgar	0.62
Offensive type – hate speech/insult	0.30
Offensive type discredit	0.20
Offensive type threat	1.00
Aspect 05	0.61
Aspect 05a	0.33
Category 06	0.58
Category 06a	0.71
Category 06b	1.00

5 Problematic cases in the annotation

Some of Cohen’s Kappa results for inter-annotator agreement show two basic problematic areas. The closer analysis revealed that the values of the inter-annotator agreement could be influenced by the differences in the length of the linguistic material selected for the annotation, from one word to full sentences or larger units. This is by far the basic reason for the discrepancies in the annotation practice. It indicates that there is a need for more careful and more intensive annotator training sessions, which will certainly be considered for the further regular annotator practice in order to resolve disagreements and come to the common grounds prior to the annotation implementation. Another, independent reason, is the uneven sampling length of particular corpus excerpts.

Interestingly, the differences are particularly visible between the standard for Target 2 – present/absent, proposed in SOL taxonomy and the practice in Lithuanian (L), 0.05, are also partly visible in Polish. Lithuanian also showed lower counts on the two other categories Offensive type_hate-speech/insult, and Offensive type discredit, both acquiring values lower than random.

On the one hand, the Target 2 problems can be accounted for by the rather uncertain status of these properties, which cannot be recovered without context, as presented in the language samples. On the other hand, they stem from the general problems, mentioned above, encountered in the automatic division of the language in the data sets into particular samples for annotation. Besides, the problem of dividing the main offensive language category into the sub-categories Hate speech, Insult and, independently, Discredit, reflected in the lower agreement values for these categories in all languages, was not sufficiently considered in the preparatory stages of the annotation process. A more intensive training session might be proposed for future tasks to remedy that.

The annotation results for Polish and for the other two languages (all except English) showed a lower agreement value (Polish: 0.19) for the second Aspect type, Offensive language Aspect 5b. This is typical when annotators select among a larger number of Aspects (10 in SOL) and have only a rather small and not very varied set of exemplary annotated samples to rely on.

On the other hand, the general categorization problem illustrates very well the absence of strict boundaries in linguistic categorization, noted as early as the 1950s and 1960s in the first publications of a philosophical and logical orientation (Wittgenstein 1953; Zadeh 1964) and followed by a surge of such research in cognitive linguistics (Lakoff 1987).

Furthermore, as observed by Lewandowska-Tomaszczyk (2011/2012), unless asked for detailed semantic analyses, language speakers generally use approximative meanings in their natural interactions rather than meanings involving minute sense discrimination. Computer applications, as is well known, also require a more definite judgement on such issues, although one that is congruent only in a more general sense, unlike in a number of previous ontology schemes, with the minute sense differentiation achieved in professional linguistic analyses.

6 Gold standard offensive language examples in 4 languages

The present section shows “gold standard” offensive language examples in English, Czech, Lithuanian and Polish as identified in the SOL taxonomy and used in the annotation referred to in the previous sections. The examples are presented in the tables below.

ENGLISH

TARGET 1	Language: English
Individual	You are such a fucking moron.
Group	Snotty 17 year olds projecting their daddy and mammy issues on the world.
Vulgar	This is the first time I’m actually replying to your shit.
OFFENSIVE TYPE
hate speech	Black people tend to be quite uncivilised.
Insult	Those wimps are the reason why we’re losing more and more rights by the day.
Discredit	It would be irresponsible for Tory MPs to opt for #BorisJohnsonShouldNotBePM.
Threat	Consider yourself reported to the admin.
Aspect 01 racist/xenophobic	This is typical nigger territorial behavior.
Aspect 02 homophobic	He may be good at anal. You never know!
Aspect 03 physical/mental disabilities/behavioural properties	No wonder I’m being uncivil, when you’re stupid.
Aspect 04 Sexist	All women are too emotional and idiotic to form rational opinions so they just copy the opinions of the most dominant male in their lives.
Aspect 05 Social class [classism]	Kabir Singh’s character was shown as irresponsible rich spoilt brat.
Aspect 06 Ideologism	Trump bullying Ukraine has as much subtlety as Harvey Weinstein friendly conversations with the women he raped.
CATEGORY (IMPLICIT)
rhetorical questions	How does it feel to be an unbearable self-centred douche-nozzle?
Metaphor	You must be lower than excrement at the bottom of a municipal sewage system.
Simile	They are almost as bad as CNN.
Irony	I am sad when people don’t make fun of him for being Arab.
Exaggeration	Congress has taken Indian polity to a new low.

CZECH

TARGET 1	Language: Czech
Individual	Narozdíl od vás Josefe mi paměť ještě slouží. ) [Unlike you, Joseph, my memory still serves me.]
Group	Pro vládnoucí nenažrance, kteří cpou peníze do černých děr a vlastních kapes, je to málo, tak chtějí ožebračit důchodce. [It is not enough for the ruling greedy elite, who stuff money into black holes and their own pockets, so they want to impoverish the pensioners.]
Vulgar	Těžko se diskutuje s proruským trotlem. [It is hard to argue with a pro-Russian idiot.]
OFFENSIVE TYPE
Hate speech	Jednou Rusák, vždycky Rusák. To není národnost, to je diagnóza [Once a Russian, always a Russian. It is not a nationality, it is a diagnosis.]
Insult	Je to banda zlodějů [It’s a bunch of thieves]
Discredit	Má asi vyoperovaný mozek. [He must have had brain surgery]
Threat	V hloubi duše doufám, že existuje peklo, kde se tenhle hnus bude navěky smažit. [Deep in my heart, I hope there’s a hell where this shit will roast forever.]
Aspect 01 racist/xenophobic	Trapní jsou servilní zádelezci Asiatů … Ještě že je opilej Eman v tahu. [It is the servile backstabbers of the Asians that are embarrassing ... Good thing that the drunk Eman (Zeman, the Czech ex-president) is gone.]
Aspect 02 homophobic	Morální úpadek společnosti právě nastal. LGBT je zhouba, slepá vývojová větev.
	[The moral decline of society has just occurred. LGBT is a blight, a blind branch of evolution.]
Aspect 03 physical/mental disabilities/behavioural properties	Někteří lidé jsou fakt jednodušší. [Some people are really rather simple.]
Aspect 04 sexist	No to je hnus zelenej, místo mozku silikonovou kostku.
	[Well, that’s so disgusting, a silicone cube for a brain.]
Aspect 05 social class [classism]	To je ta stejná lůza, která volila Babiše. [This is the very same mob that voted for Babiš.]
Aspect 06 ideologism	A toto je naprosto stejný případ, kolaborantský hlupáku [And this is exactly the same case, you collaborating fool]
CATEGORY (IMPLICIT)
Rhetorical questions	Problémy jsou všude … nejste klaun spíš vy?
	[There are problems everywhere... aren’t you the clown actually?]
Metaphor	Co kdyby jsi šel bojovat na Ukrajinu, mudrci?
	[Why don’t you go fight in the Ukraine, wise guy?]
Simile	Ta by byla dobrá jako záchranný člun na Titaniku, ale asi bych se raději topil, než se jí chytil.
	[She would be as good as a lifeboat on the Titanic, but I’d probably rather drown than cling on her.]
Irony	Zato v podhradí je to jeden chytrák vedle druhého … experti bez jakékoli zodpovědnosti
	[But in the sub-castle (i.e. government) it’s one smart guy after another ... experts without any responsibility]
Exaggeration	Co vy jste to za pablba slepého? [What kind of a blind idiot are you?]

LITHUANIAN

TARGET 1	Language: Lithuanian
Individual	Užsidaryk savo srėbtuvę, nes supuvę dantys matosi (Close your mouth not to show your rotten teeth)
Group	Degradų kompanija tame seime sėdi.... gaila tos LIETUVOS. (The company of degraded persons is sitting in that parliament.... it's a pity for LITHUANIA.)
Vulgar	Šūdas is tos bandos, levakų komanda (Shit from that herd, a team of losers)
OFFENSIVE TYPE
Hate speech	Bijokite pedai, turime sąrašą su visais visais pedofilais Lietuvoje (Fear pedophiles, we have a list with all pedophiles in Lithuania)
Insult	eik tu seno kuino subine miegot (go to sleep, you old horse ass)
Discredit	šita ministerija - amžina klapčiukų prieglauda. (this ministry is an eternal shelter for flappers.)
Threat	Bijokite pedai (Be afraid pedophiles)
Aspect 01 racist/xenophobic	Jau vien del slaviškos pavardės tokį reikia kuo toliau nuo tarnybos pasiųsti. (Just because of his Slavic surname, he should be sent as far away from the service as possible.)
Aspect 02 homophobic	Tegu vyras su vyru gyvena mėnulyje, čia jiems ne vieta. (Let a man with a man live on the moon, there is no place for them here.)
Aspect 03 physical/mental disabilities/behavioural properties	Už grotų tuos daunus ir kuo greičiau! (Put those with Down syndrome behind the bars as soon as possible!)
Aspect 04 sexist	Moterytė tai tik kuklus debesėlis prie to atsakingojo. (The woman is just a modest cloud next to the person in charge.)
Aspect 05 social class [classism]	Manau, kada visi bambaliniai išmirs (I think when will all the vagrants die out)
Aspect 06 ideologism	Tu aišku už raudonsniukius balsuosi. (You obviously vote for the redfaced communists)
CATEGORY (IMPLICIT)
Rhetorical questions	Ar seniai skaityt išmokai? (Has it been a long time since you learned to read?)
Metaphor	Čia vieno asilo galvos tereikia Motiejūno, jis griauna komandą savo idiotiškais sprendimais ir pasiteisinimais. (One donkey's head is all you need here, Motiejūnas, he destroys the team with his idiotic decisions and excuses)
Simile	Moterėlės veidelis kaip Uspaskio, gal kiek gražesnis. (The woman's face is like Uspaski's, maybe a little prettier)
Irony	pedofilai violetinės spalvos bijo kaip velnias kryžiaus. (pedophiles are afraid of the purple color like the devil of the cross)
Exaggeration	Jei jau debilas,tai iki amžiaus galo. (If he is already a moron, then until the end of the days.)

POLISH

TARGET 1	Language: Polish
Individual	Odezwał się fan Rydzyka.
	[The fan of Rydzyk has just spoken out.]
Group	Odradzam wszystkim to oszuści, chamy i brudasy.
	[I advise all of you against them they are crooks, yahoos, and slobs.]
Vulgar	Niedojebane katechetki które swoją erotyczną frustrację wyładowują strasząc dzieci ogniem piekielnym.
	[Retarded catechists who unload their erotic frustration by scaring children with hell fire.]
OFFENSIVE TYPE
Hate speech	Żydzi to jednak najbardziej podła rasa.
	[Jews are actually the meanest of all human races.]
Insult	Oszołomy chcą koniecznie oczyścić Rosjan ze współudziału w tej katastrofie.
	[Nuts absolutely want to cleanse Russian of their complicity in that catastrophe.]
Discredit	Podobała mu się moja bardzo dobra koleżanka chwalił się jej ile to na tych Żydach zarobił.
	[He was into my very good friend boasting to her about how much money he made on these Jews.]
Threat	Żydku lepiej nie zabieraj głosu.
	[Better not speak up, you little Jew, a sheeny.]
Aspect 01 racist/xenophobic	Jest żydkiem tylko dla korzystnych układów. [He is a little Jew, a sheeny only for profit]
Aspect 02 homophobic	A Piroga nie lubię nie dlatego że jest pedziem tylko dlatego ze jest kiepskim tancerzem. [And I dislike Pirog not because he is a faggot but because he is a poor dancer.]
Aspect 03 physical/mental disabilities/behavioural properties	Ludzie nie posiadający elementarnej wiedzy bądź niedouczeni czy po prostu tępi znajdują się wszędzie. [People without basic knowledge or ignorants or just simply stupid can be found everywhere.]
Aspect 04 sexist	Najgorętsze są Hiszpanki i Brazylijki. [Spanish and Brazilian women are the hottest.]
Aspect 05 social class [classism]	Pierdolone brudasy z biedaszybów. [Fucking bootleg mining slobs.]
Aspect 06 ideologism	UE to ostoja Żydów i Muzułmanów prowadząca do zbydlęcenia byłych wyznawców chrześcijaństwa. [EU is a Jewish and Muslim stronghold that leads to the bastardisation of former Christians.]
CATEGORY (IMPLICIT)
Rhetorical questions	No i co panie wielki trenerze rozpracowałeś już Rosjan? [So what Sir Great Coach have you already worked out the Russians?]
Metaphor	Szanuję pracę innych nie jestem typem pasożyta. [I appreciate other hard working people I’m not a parasite.]
Simile	Ja też bardzo nie wiem jak zareagować gdy ktoś lata jak Żyd po pustym sklepie. [I also don’t know how to react when someone is running like a Jew in an empty shop.]
Irony	Ten Rosjanin się nadaje do pchania karuzeli jak zabraknie prądu a nie do boksu.
	[This Russian is good for pushing the carrousel when there is no electricity and not for boxing.]
Exaggeration	Gojowie to padlina która ma służyć Żydom.
	[Gentiles are carcasses who are supposed to serve Jews.]

7 Conclusions

Taking into consideration the three tests described in Section 2 above, from the first proposal, via the Extended Integrated system, to the Simplified Taxonomy of Offensive Language (SOL), all aimed at reaching measures adequate to serve as a standard for the notoriously problematic categorization of offensive language, we propose that the most recent SOL taxonomy model, implemented and verified on four languages (English, Czech, Lithuanian and Polish), can function as an LLOD standard for an offensive language taxonomy in computational applications. In future development, we might also try to achieve good performance on a number of target languages in parallel (translated) sets by training on a source language with a multilingual transformer model.

What we have proposed here is an ontology schema (Lewandowska-Tomaszczyk et al. 2021) that will be presented in terms of the Linguistic Linked Open Data (LLOD) system, with instances from multiple languages, so that language resources can be shared and commonly (re-)used.
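To give a rough idea of what multilingual instances in a linked-data form might look like, the following Python sketch writes three of the gold standard examples from Section 6 as N-Triples statements with language-tagged literals. It is an illustration only: the namespace http://example.org/sol#, the class AnnotatedInstance, the properties hasText and hasCategory, and the instance identifiers are all hypothetical and do not reproduce the published ontology schema.

```python
SOL = "http://example.org/sol#"        # hypothetical namespace, not the published schema
DATA = "http://example.org/sol/data/"  # hypothetical base IRI for instances
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(text, lang):
    """Format a language-tagged literal in N-Triples syntax."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def instance_triples(instance_id, text, lang, category):
    """Return N-Triples lines describing one annotated instance."""
    subject = f"<{DATA}{instance_id}>"
    return [
        f"{subject} <{RDF_TYPE}> <{SOL}AnnotatedInstance> .",
        f"{subject} <{SOL}hasText> {literal(text, lang)} .",
        f"{subject} <{SOL}hasCategory> <{SOL}{category}> .",
    ]


# Example sentences taken from the tables in Section 6; identifiers are made up.
examples = [
    ("en-simile-1", "They are almost as bad as CNN.", "en", "Simile"),
    ("cs-insult-1", "Je to banda zlodějů", "cs", "Insult"),
    ("pl-metaphor-1", "Szanuję pracę innych nie jestem typem pasożyta.", "pl", "Metaphor"),
]

triples = [line for example in examples for line in instance_triples(*example)]
print("\n".join(triples))
```

Because every instance points to a shared category IRI and carries its own language tag, examples from different languages can be retrieved under one taxonomy node, which is the kind of shared reuse that the LLOD presentation aims at.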

Acknowledgments

The present study has been conducted within the Use Case WG 4.1.1. Incivility in Media and Social Media, COST Action CA 18209 European network for Web-centred linguistic data science Nexus Linguarum.

About the authors

Barbara Lewandowska-Tomaszczyk

Barbara Lewandowska-Tomaszczyk is Professor Ordinarius Dr Habil. in Linguistics and English Language at the Department of Language and Communication at the University of Applied Sciences in Konin (Poland). Her research focuses on cognitive semantics and pragmatics of language contrasts, corpus linguistics and their applications in translation studies, lexicography and online discourse analysis. She is invited to read papers at international conferences and to lecture and conduct seminars at universities. She publishes extensively, supervises dissertations and also organizes international conferences and workshops.

Anna Bączkowska

Anna Bączkowska, Dr Habil. Prof. UG, holds an MA in English Philology, which she received from Adam Mickiewicz University in Poznan, as well as a PhD in linguistics and a D.Litt. in English Linguistics, which she received from the University of Lodz. Her research interests revolve around translation studies (film subtitles), cognitive semantics, corpus and computational linguistics, and discourse studies (media discourse). She has given guest lectures in Italy, Spain, Portugal, UK, Norway, Kazakhstan and Slovakia, and she has also conducted research during her scientific stays in Ireland, Iceland, Norway, Austria and Luxembourg.

Olga Dontcheva-Navrátilová

Olga Dontcheva-Navrátilová is Associate Professor of English Linguistics at the Faculty of Education, Masaryk University, Czech Republic. Her research interests include English for academic and specific purposes and political discourse. She has published the books Analysing Genre: The Colony Text of UNESCO Resolutions (2009), Coherence in Political Speeches (2011) and coauthored Persuasion in Specialised Discourses (2020). She is co-editor of the journal Discourse and Interaction.

Chaya Liebeskind

Chaya Liebeskind is a lecturer and researcher in the Department of Computer Science at the Jerusalem College of Technology. Her research interests span both Natural Language Processing and data mining. Especially, her scientific interests include Semantic Similarity, Language Technology for Cultural Heritage, Morphologically rich languages (MRL), Multi-word Expressions (MWEs), Information Retrieval (IR), and Text Classification (TC). Much of her recent work has been focusing on analysing offensive language. She has published a variety of studies and a few of her articles are under review or in preparation. She is a member of several international research actions funded by the EU.

Giedrė Valūnaitė Oleškevičienė

Giedrė Valūnaitė Oleškevičienė is Vice-Dean for Scientific Research of the Faculty of Public Governance and Business and a professor at the Institute of Humanities, Mykolas Romeris University. Her scientific interests in the humanities include discourse analysis, professional English, legal English, linguistics and translation research, while in the domain of social sciences, her scientific interests include social research methodology, modern education, philosophical issues, creativity development in the modern education system, and second language teaching and learning. She has coordinated international research projects funded by the EU, publishes scientific articles, and presents at scientific conferences.

Slavko Žitnik

Slavko Žitnik is Assistant Professor and Vice-dean for Education at the University of Ljubljana, Faculty for Computer and Information Science. His research focuses on natural language processing, information extraction, databases, semantic technologies, and information systems. He is actively collaborating with Université Paris 1 Sorbonne, Harvard University, University of South Florida, and University of Belgrade. He is engaged in multiple research and professional projects. As a chairman of Slovenian Language Technologies Society he is organizing lectures related to language technologies and provides grants to students to visit summer schools. He is also Chairman of the Slovene Society INFORMATIKA, and organizes national conferences on informatics and is editor of a scientific journal.

Marcin Trojszczak

Marcin Trojszczak holds PhD in Linguistics and MA in Philosophy. He is Assistant Professor at the University of Applied Sciences in Konin (Poland). He is also actively cooperating with University of Lodz and University of Economics and Human Sciences in Warsaw. His research interests include metaphorical conceptualisations of mental and emotional processes, the impact of translation technologies on translation education, normativity and genericity in language and cognition, as well as offensive language.

Renata Povolná

Renata Povolná is Associate Professor of English Linguistics at the Faculty of Education, Masaryk University, Czech Republic. Her research lies in the area of discourse analysis, pragmatics and conversation analysis. She has published the books Spatial and Temporal Adverbials in English Authentic Face-to-Face Conversation (2003), Interactive Discourse Markers in Spoken English (2010) and co-authored Persuasion in Specialised Discourses (2020). She is co-editor of the journal Discourse and Interaction.

Linas Selmistraitis

Linas Selmistraitis has over 24 years of experience in higher education, specifically in developing and implementing quality assurance systems for higher education institutions. He earned his PhD in Humanities. Currently, Professor Dr Linas Selmistraitis holds the position of Vice-Dean for Studies at the Faculty of Human and Social Studies at Mykolas Romeris University and the position of Professor at the Institute of Humanities at Mykolas Romeris University. His research interests are semantics, morphology and cognitive linguistics. He publishes research articles and gives presentations at conferences.

Andrius Utka

Andrius Utka is Associate Professor at the Department of Lithuanian Studies and a senior researcher at the Institute of Digital Resources and Interdisciplinary Research (SITTI), Vytautas Magnus University (Kaunas). He defended his doctoral dissertation Statistical Identification of Text Functions in 2004 (VMU, Kaunas). He was the head of the Centre of Computational Linguistics in 2010–2022. He has coordinated a number of national and international research projects. His research interests include statistical text analysis, language resources, computer-assisted translation, automatic summarisation, terminology extraction, and the language of disinformation.

Dangis Gudelis

Dangis Gudelis is a professor at Mykolas Romeris University, specializing in public administration and governance. He earned his PhD in Social Sciences, focusing on performance measurement in Lithuanian municipalities. Gudelis has led and contributed to various national and international research projects, particularly in public governance and public policy. His current research interests include applications of big data and AI technologies in the public sector. He is a prolific writer, with numerous publications in scientific journals and presentations at conferences. He teaches courses at both undergraduate and graduate levels. Additionally, he has played a role in policy analysis and consultancy, advising governmental and non-governmental organizations on strategic development and public sector innovation.

References

Amilevičius, Darius & Mažvydas Petkevičius. 2016. LITIS v.1, CLARIN-LT digital library in the Republic of Lithuania. Available at: http://hdl.handle.net/20.500.11821/11 (accessed 12 March 2022).

Andersson, Lars-Gunnar & Peter Trudgill. 1990. Bad Language. London: Penguin Books Ltd.

Basile, Valerio, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Vivian Patti, Francisco Manuel Rangel Pardo, Paolo Rosso & Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In Jonathan May, Ekaterina Shutova, Aurelie Herbelot, Xiaodan Zhu, Marianna Apidianaki & Saif M. Mohammad (eds.), Proceedings of the 13th International Workshop on Semantic Evaluation, 54–63. Stroudsburg, PA: Association for Computational Linguistics. doi:10.18653/v1/S19-2007

Bączkowska, Anna. 2022. Explicit and implicit offensiveness in dialogical film discourse in Brigit Jones films. International Review of Pragmatics 14(2). 198–225. doi:10.1163/18773109-01402003

Bączkowska, Anna, Barbara Lewandowska-Tomaszczyk, Slavko Žitnik, Chaya Liebeskind, Giedre Valunaite Oleskeviciene & Marcin Trojszczak. 2022. Implicit offensive language taxonomy and its application to automatic extraction and ontology. Presentation at LLOD Approaches to language data research and management, Vilnius, 21–22 September 2022, Lithuania.

Brenner, Jennifer L. 2002. True threats: A more appropriate standard for analyzing First Amendment protection and free speech when violence is perpetrated over the Internet. North Dakota Law Review 78(4). 753–784.

Grice, H. Paul. 1989. Studies in the Way of Words. Cambridge, MA: Harvard University Press.

Lakoff, George. 1987. Cognitive models and prototype theory. In Ulric Neisser (ed.), Concepts and conceptual development: Ecological and intellectual factors in categorization, 63–100. Cambridge: Cambridge University Press.

Landis, J. Richard & Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33. 159–174. doi:10.2307/2529310

Lewandowska-Tomaszczyk, Barbara. 2012. Approximative spaces and the tolerance threshold in communication. International Journal of Cognitive Linguistics 2(2). 1–19.

Lewandowska-Tomaszczyk, Barbara. 2017. Conflict radicalization and emotions in English and Polish online discourses on immigration and refugees. In Stephen M. Croucher, Barbara Lewandowska-Tomaszczyk & Paul A. Wilson (eds.), Conflict, mediated message and group dynamics: intersections of communication, 1–24. New York: Rowman & Littlefield.

Lewandowska-Tomaszczyk, Barbara. 2022. A simplified taxonomy of offensive language (SOL) for computational applications. Konin Language Studies 10(3). 213–227.

Lewandowska-Tomaszczyk, Barbara, Slavko Žitnik, Anna Bączkowska, Chaya Liebeskind, Jelena Mitrović & Giedre Valunaite Oleskeviciene. 2021. Lod-connected offensive language ontology and tagset enrichment. In Sara Carvalho & Renato Rocha Souza (eds.), Proceedings of the workshops and tutorials held at LDK 2021 co-located with the 3rd Language, Data and Knowledge Conference, 135–150. CEUR Workshop Proceedings.

Lewandowska-Tomaszczyk, Barbara, Anna Bączkowska, Chaya Liebeskind, Giedre Valunaite Oleskeviciene & Slavko Žitnik. 2023. An integrated explicit and implicit offensive language taxonomy. Lodz Papers in Pragmatics 23(1). 7–48. doi:10.1515/lpp-2023-0002

Lewandowska-Tomaszczyk, Barbara, Slavko Žitnik, Chaya Liebeskind, Giedre Valunaite Oleskevicienė, Anna Bączkowska, Paul A. Wilson, Marcin Trojszczak, Ivana Brač, Lobel Filipić, Ana Ostroški Anić, Olga Dontcheva-Navratilova, Agnieszka Borowiak, Kristina Despot & Jelena Mitrović. (accepted). Annotation scheme and evaluation: The case of OFFENSIVE language. Rasprave.

Liu, Ping, Wen Li & Liang Zou. 2019. nlpUP at SemEval-2019 Task 6: Transfer learning for offensive language detection using bidirectional transformers. In Jonathan May, Ekaterina Shutova, Aurelie Herbelot, Xiaodan Zhu, Marianna Apidianaki & Saif M. Mohammad (eds.), Proceedings of the 13th international workshop on semantic evaluation, 87–91. Stroudsburg, PA: Association for Computational Linguistics. doi:10.18653/v1/S19-2011

Ptaszyński, Michał & Fumito Masui. 2018. Automatic Cyberbullying Detection: Emerging Research and Opportunities. Hershey, PA: IGI Global Publishing. doi:10.4018/978-1-5225-5249-9

Ptaszyński, Michał, Agata Pieciurkiewicz & Paweł Dyba. 2019. Results of the Poleval 2019 shared task 6: First dataset and open shared task for automatic cyberbullying detection in Polish Twitter. Warsaw: Institute of Computer Science, Polish Academy of Sciences.

Searle, John. 1975. Indirect Speech Acts. In Peter Cole & Jerry L. Morgan (eds.), Syntax and Semantics 3: Speech Acts, 59–82. New York: Academic Press. doi:10.1163/9789004368811_004

Troszyński, Marek & Aleksander Wawer. 2017. Czy komputer rozpozna hejtera? Wykorzystanie uczenia maszynowego (ML) w jakościowej analizie danych. Przegląd Socjologii Jakościowej XIII(2). 62–80. doi:10.18778/1733-8069.13.2.04

Wittgenstein, Ludwig. 1953. Philosophical investigations. New York: Macmillan.

Zadeh, Lotfi. 1964. Fuzzy sets. Information and Control 8(3). 338–353. doi:10.1016/S0019-9958(65)90241-X

Zampieri, Marcos, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra & Ritesh Kumar. 2019a. Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, 1415–1420. Stroudsburg, PA: Association for Computational Linguistics. doi:10.18653/v1/N19-1144

Zampieri, Marcos, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra & Ritesh Kumar. 2019b. Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval). In Jonathan May, Ekaterina Shutova, Aurelie Herbelot, Xiaodan Zhu, Marianna Apidianaki & Saif M. Mohammad (eds.), Proceedings of the 13th international workshop on semantic evaluation, 75–86. Stroudsburg, PA: Association for Computational Linguistics. doi:10.18653/v1/S19-2010

Datasets and tools

25 English offensive language and hate speech data sets (for the itemized list cf. Lewandowska-Tomaszczyk et al. 2023, Appendix 1).

Sketch Engine corpus Czech Offensive Language. Available at: https://ske.fi.muni.cz/#dashboard?corpname=user%2Fsso_259%2Fczech_offensive_language (accessed 5 April 2022).

Amilevičius, Darius & Mažvydas Petkevičius. 2016. LITIS v.1, CLARIN-LT digital library in the Republic of Lithuania. Available at: http://hdl.handle.net/20.500.11821/11 (accessed 12 March 2022).

Troszyński, Marek & Aleksander Wawer. 2017. Available at: http://zil.ipipan.waw.pl/HateSpeech (accessed 1 March 2022).

Ptaszyński, Michał & Fumito Masui. 2018. Available at: http://ptaszynski/cyberbullying-Polish (accessed 10 April 2022).

Ptaszyński, Michał et al. 2019. Available at: http://ptaszynski/cyberbullying-Polish (accessed 10 April 2022). doi:10.15804/pbs.2022.11

Annotation INCEpTION platform. Available at: https://inception-project.github.io/ (accessed 20 February 2022).

Sketch Engine webcorpus of English. Available at: https://www.sketchengine.eu/ententen-english-corpus (accessed February 2022).

Stranisci, Marco A., Simona Frenda, Mirko Lai, Oscar Araque, Alessandra T. Cignarella, Valerio Basile, Viviana Patti & Cristina Bosco. 2022. O-Dang! The ontology of dangerous speech messages. In Ilan Kernerman, Sara Carvalho, Carlos A. Iglesias & Rachele Sprugnoli (eds.), Proceedings of the 2nd Workshop on Sentiment Analysis and Linguistic Linked Data, 2–8. Paris: European Language Resources Association.

Published Online: 2023-12-12

Published in Print: 2023-12-15
