1 Introduction

The allocation of the physical custody of children in cases of divorce or separation is an issue of great social relevance, and it gives rise to a large number of cases before the courts. It is estimated that in 2020, there were 800,000 divorces in Europe (Eurostat 2020) and 630,505 in the United States of America, of which 51.1% involved children (National Center for Health Statistics 2020). The experience presented in this paper is based on rulings made by Spanish courts. In 2021, there were 86,851 divorces in Spain, of which 21.2% went to court. In 53.2% of these divorce and separation cases, there were minors involved, and their custody was awarded to the mother in 53.1% of the cases and to the father in 3.5%, with joint custody in 43.1% of the cases (Instituto Nacional de Estadística 2021).

From the point of view of legal sociology, and for those who work on legislation and public policies in this area, it is of great interest to gain detailed knowledge of the way in which judges decide on this issue, and yet the available statistics provide only a limited set of data. Moreover, up-to-date data are needed at all times, as this matter is in continuous evolution. We propose that artificial intelligence (AI) could help in this task by automatically analyzing court rulings and extracting information from their text about the circumstances of each case, the requests of the parents, the decisions of the judge, and the facts that were taken into account. We consider that this methodology can also be useful in other areas in which policymakers often commission experts in legal sociology to carry out studies monitoring the effects of laws, such as juvenile law or gender violence, since the studies conducted so far are based on the analysis of a very small set of judgments.

Law is language and, therefore, natural language processing (NLP) occupies a prominent place among the applications of AI to the legal field. NLP technology is used not only to identify entities that are designated by a name—a technique called named entity recognition (NER)—but also to search for entities of a more complex nature, such as arguments, through argument mining. Our study uses NLP to characterize the nature of the court procedure by identifying certain elements in the court rulings (Watson et al. 2022). These elements are the request (what is being requested), the decision (what is decided by the court), and some of the arguments used to justify the decision, selecting eight arguments from among those set forth in the custody laws. We developed a neural network model to extract this information from the text of the judgments and trained it on a set of 3047 annotated judgments. Each judgment was independently labeled by two annotators, and the inter-annotator agreement (IAA) was then calculated (Yamada et al. 2019). The same set was used for training the neural network and analyzing the results, randomly dividing it into a training set, a development set, and a test set in a ratio of 7:2:1. Finally, the agreement between the model and humans was calculated and compared with the IAA.

The study makes several contributions to the literature. The first is the approach of using NLP to characterize a complete document—the court ruling—and, through this, to extract knowledge about the proceedings concluded by the ruling. However, a substantial part of a ruling's content is usually devoted to the verbatim quotation of precedents and, in appeal hearings, to the narration of what happened in the first instance proceedings. The model must learn to distinguish these parts, which contain no information on the current procedure. To do this, the task is divided into two stages: the binary classification of sentences in long texts and the multilabel classification of sentences. In the first stage, the model analyzes the context and decides which sentences are relevant and which are not. This analysis is a novelty and constitutes the second contribution of this research because of the complexity of the natural language understanding required for long texts. Finally, the third contribution is the information provided on the consensus between the annotators and the neural network when dealing with legal concepts. The effort involved here is considerable, since we worked on a set of more than 3000 court rulings, each with a double annotation.

The objectives of the experiment and the research questions are explained in the next section (Sect. 2). Next, some similar previous studies are presented (Sect. 3). This is followed by a detailed explanation of the methodology used for the annotation and the tasks performed to train the model (Sect. 4). Subsequently, the results obtained are presented (Sect. 5) and analyzed (Sect. 6), both as to the consensus of the annotators and the performance of the neural network and its agreement with humans. Finally, some conclusions are presented (Sect. 7).

2 Research objectives

The aim of the research was to develop a tool that is able to select, from a given set of court rulings, those in which a custody request is resolved and then to extract information from those rulings to answer questions such as: “How often does the court agree to the form of custody requested by the plaintiff?”; “Is this percentage higher when individual or when joint custody is requested?”; and “How often is a certain argument used to justify joint or individual custody?”.

In order to achieve this objective, three main research questions were posed:

RQ1: Is it possible to distinguish in the text of a court ruling those sentences that refer to the proceedings that are concluded by the ruling from those sentences that refer to earlier proceedings or are verbatim quotations from judicial precedents?

RQ2: How good is the performance of a transformer-based model at characterizing a judicial proceeding by identifying in the text of the court ruling the plaintiff’s request, the court’s decision, and some predetermined arguments?

RQ3: Is it possible to characterize judicial proceedings by processing only the legal grounds of the court ruling?

3 Background and related works

3.1 Legal knowledge extraction using NLP

The extraction of information from legal texts is one of the main applications of NLP to the field of law. NER is used to identify entities like people, judges, lawyers, cities, organizations, institutions, courts, brands, laws, ordinances, contracts, court decisions, and legal literature (Correia et al. 2022; Leitner et al. 2020). Additional information about the entities is also obtained, like the role of a person in court judgments (Gupta et al. 2018; Samarawickrama et al. 2020) or cross-references between entities (Ji et al. 2020a). Two main strategies are followed in NER. One uses knowledge-based and feature-engineered NER systems that combine in-domain knowledge, gazetteers, and orthographic and other features with supervised or semi-supervised learning. The other uses neural network architectures based on minimal feature engineering. Within this, word- and character-level architectures—and different combinations of the two—are used, depending on how the text content is introduced into the neural network (Yadav and Bethard 2018). The recent use of different text preprocessing methods—especially word embeddings—that allow the performance of neural networks to be improved is one of the factors explaining the rapid increase in the use of deep learning in NLP (Chalkidis and Kampas 2019).

For its part, argument (or argumentation) mining allows arguments in legal texts to be identified (Lytos et al. 2019). Two types of widely used legal argumentation are the adoption of a legal norm and the reference to a precedent, which are usually incorporated into legal texts by citation of the document (e.g., the law or judgment) containing the norm or precedent. The labeling of instances of citations according to their rhetorical roles in the discourse is called citation mining and is integrated within argumentation mining (Lawrence and Reed 2020). Other types of norms are principles or directives. Shulayeva et al. (2017) used machine learning for the automatic identification of legal principles and facts associated with case citations. Our study is very similar to theirs for several reasons. First, its objective is to identify legal principles and facts within legal texts. Second, it has a similar methodology, since it considers that the allocation of categories is subjective and varies according to the people who analyze the texts. Finally, sentences associated with cited cases that are neither principles nor facts are annotated as neutral, and the IAA is quantified using the Kappa index (Cohen 1960). The values obtained by Shulayeva et al. were K = 0.65 for the IAA and K = 0.72 for the agreement between the annotators and the model. Yamada et al. (2019) used the F1 coefficient to measure the agreement between annotators and a model for Japanese civil law judgment documents. The annotation of the legal arguments was performed at several levels, with the identification of issue topic units as the first level, for which the performance was evaluated at F = 0.52, and rhetorical classification as the second level, performing at F = 0.63.

Other works have focused on the extraction of factual information (Lyu et al. 2022; Zhou et al. 2023). One example is a study by Ji et al. (2020b), in which five types of information about the evidence brought to the proceedings were extracted from court record documents. This type of information may span multiple sentences; thus, as in our study, two tasks were performed: a classification task, which assigns the sentences to one of the two categories of production of evidence and cross-examination of evidence, and an extraction task. These authors used 1128 court rulings in the Chinese language containing 36,364 paragraphs, of which 6446 contained information about the evidence, thereby obtaining a result of F = 0.72. Another type of content found in a court ruling is the decision reached by the court. This information is extracted in some commercial systems to produce statistics on the courts, although the information that is extracted is limited to determining whether the plaintiff’s requests were upheld, partially upheld, or denied. Fernandes and colleagues (2020) identified the court’s decision in appeal judgments, as we did in our experiment. These authors looked for changes from the first instance decision in Brazilian judgments dealing with damage awards. For this purpose, they used the following labels: the value of the moral damage, the increase or decrease in the moral damage value, the initial date of arrears of interest, the initial date of monetary correction, and the legal fees due from the defeated party. With a team of 19 annotators, 3022 documents containing 221,820 tokens—a mean of 73 tokens per document—were annotated. The performance obtained in the extraction was F = 0.94.

The extraction of information from the text of court rulings is used for predictive justice applications, and Alcántara Francia et al. (2022) and Rosili et al. (2021) reviewed a good number of studies carried out along this line. However, few of these focus on family law. An early precedent is Split-Up, developed by Zeleznikow (2004) for Australian law, which determines property division upon divorce and integrates neural networks with rule-based reasoning. Li et al. (2018) extracted certain variables, including the type of custody and the parent to whom custody was assigned, from Chinese court judgments on divorce and then used Markov networks to develop predictions. Huang et al. (2021) developed a model to predict which parent would be assigned custody based on 19 variables, using CHAID (Chi-squared Automatic Interaction Detector). Their sample consisted of 3028 child custody decisions issued by Taiwanese family and district courts of first instance between 2012 and 2017; the F1 score obtained was 0.9783, indicating that the model is quite satisfactory.

3.2 Transformers in NLP

Deep learning is achieving ground-breaking improvements in several NLP tasks, such as entity recognition (Otter et al. 2020), text classification (Chen et al. 2022; Minaee et al. 2021), machine translation (Popel et al. 2020), question answering (Huang et al. 2020), and language generation (Iqbal and Qureshi 2020).

Long short-term memory networks (LSTM; Hochreiter and Schmidhuber 1997) and models such as Word2Vec (Church 2017) are widely used, but models based on the Transformer architecture introduced by Vaswani et al. (2017) now receive the most attention. Models like ELMo, Bidirectional Encoder Representations from Transformers (BERT), GPT-1, GPT-2, and GPT-3, or the Robustly Optimized BERT Pre-training Approach (RoBERTa) are the current focus of many investigations in this field (Braşoveanu and Andonie 2020). Transformers, which are built on attention mechanisms, are the core of this type of computational language model since they are able to generalize and to generate their own model of the language. In our experiment, we leveraged Google’s BERT architecture (Devlin et al. 2019), one of the most commonly used models and one of the few with both a multilingual and a Spanish version. BERT’s key technical innovation is to apply the bidirectional unsupervised training of transformers to language modeling. In contrast to previous efforts that either looked at a text sequence from left to right or combined left-to-right and right-to-left training, BERT is deeply bidirectional. It is pretrained on a large plain-text corpus with two novel objectives: masked language modeling (MLM) and next sentence prediction (NSP). While early models such as Word2Vec or GloVe generated a single embedding representation for each token in the vocabulary, BERT considers the context of each occurrence of a given word to encode its meaning.
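As a concrete illustration of the MLM objective, the short sketch below queries a Spanish BERT checkpoint for the token hidden behind a mask. The Hugging Face model identifier (here the BETO checkpoint discussed later in Sect. 4.4) and the example sentence are illustrative assumptions, not part of the original experiment.

```python
# Minimal sketch of BERT-style masked language modeling with a Spanish checkpoint
# (the BETO model id below is an assumption; any Spanish BERT would behave similarly).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="dccuchile/bert-base-spanish-wwm-cased")

# The model predicts the token behind [MASK] from both the left and the right context.
example = "Se atribuye la custodia [MASK] de los menores a ambos progenitores."
for prediction in fill_mask(example):
    print(f"{prediction['token_str']}\t{prediction['score']:.3f}")
```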

Mapping wider text spans, such as sentences or short paragraphs, to a dense vector space in such a way that similar sentences are close to each other has wide applications in NLP. Despite this, only a few sentence-embedding models exist for multi- and cross-lingual scenarios (Reimers and Gurevych 2020). Moreover, the BERT model suffers from fixed input length limitations, wordpiece embedding problems, and computational complexity (Yang et al. 2019). The Generalized Autoregressive Pretraining for Language Understanding (XLNet), RoBERTa, and DistilBERT pretrained models were proposed to mitigate the different problems underlying BERT. Like BERT, XLNet is a bidirectional transformer-based model trained to predict hidden words in a sentence. XLNet improves on BERT by predicting every word in a sequence from any combination of other words in that sequence, whereas BERT’s masked language model predicts only the masked words (15%). Thus, XLNet improves performance at the price of a higher computational cost. The second tool, RoBERTa, which was developed by Facebook, outperforms BERT on the GLUE benchmark (Briskilal and Subalalitha 2022; Liu et al. 2019). It improves on BERT’s MLM with a strategy that removes the NSP objective and introduces dynamic masking, so that the masked tokens change at each iteration, which forces the model to learn to predict the hidden tokens rather than memorize them. It also improves BERT’s hyperparameter settings, training with larger mini-batches and much higher learning rates. Finally, DistilBERT provides a tradeoff between performance and computational cost (Sanh et al. 2019). In this approach, the BERT model is used for “knowledge distillation,” a compression technique in which a small model is trained to reproduce the behavior of a larger model. In this way, the small model learns a reduced and more adjusted version of the bigger model, one that retains its essential knowledge.

BERT is beginning to be used successfully in legal applications. Chalkidis et al. (2020) developed Legal-BERT, which builds the language model on legal texts. Their findings indicate that the common guidelines for pretraining and finetuning, which are often blindly followed, do not always generalize well in the legal domain. Thus, they propose a systematic investigation of the available strategies to apply BERT in specialized domains. These strategies are: (a) use the original BERT out of the box; (b) adapt BERT by additional pretraining on domain-specific corpora; and (c) pretrain BERT from scratch on domain-specific corpora. Chalkidis et al. (2019) used BERT for large-scale multilabel text classification (LMTC), that is, the task of assigning to each document all the relevant labels from a large set, and developed a public dataset called EURLEX57K, labeled with the EuroVoc vocabulary. Their study assessed the efficacy of different transformer-based models paired with techniques such as generative pretraining, gradual unfreezing, and discriminative learning rates to achieve a noteworthy classification performance. Their study also introduced new, cutting-edge results, with F = 0.661 for JRC-Acquis and F = 0.754 for EURLEX57K. A similar experiment with U.S. legal texts was performed by Bambroo and Awasthi (2021). Transformers are also being used to generate the timeline of a court ruling, with the system offering a visual summary of the key legal events and their associated timeframes within a case (Xu et al. 2020). As for argument mining, there are a large number of references on the use of deep learning (Galassi et al. 2020), although there are few examples in the literature of the use of transformers for the extraction of argumentation (Chernodub et al. 2019; Fromm et al. 2019) and even fewer in the legal field. However, the success of transformers today, in particular GPT-3, is behind the emergence of legal assistants, such as LawGPT (Nguyen 2023).

4 Methodology

4.1 The labels

A fundamental part of the design of the experiment was the selection and definition of the elements to be identified in the text of the court rulings. Three main factors were taken into account for this task: (1) the number of elements had to be limited in order for the model to obtain relevant results; (2) the elements had to appear with sufficient frequency in the court rulings; and (3) the elements had to correspond to the criteria established in the laws for the assignment of custody. The elements that were finally used were grouped into three classes (request/decision, legal principles, and factual arguments) and are as follows.

4.1.1 The request made by the plaintiff and the decision that the court adopts regarding this request

The only request/decision pair that was considered is the one referring to the type of physical custody, which can be individual (when the child or children live with one of the parents and a visiting regime is established for the other parent) or joint (when the child or children live separately with each parent but for more or less equivalent periods). The corresponding labels are RQ_JOIN (Request) and DEC_JOIN (Court Decision), accompanied by a “+” sign if individual custody is requested or decided and a “−” sign if joint custody is requested or decided.

4.1.2 Legal principle

Only the principle of the best interests of the child was considered because other principles that were proposed (such as equality between parents) appeared very infrequently. This is the most important legal principle in custody matters and implies that the interests of the child should prevail, especially over those of the parents. However, beyond establishing this priority, the principle is extremely ambiguous and can be used to justify decisions either way (Kelly 1997). The corresponding label is BEST_INT.

4.1.3 Factual arguments

In Spain, there are different pieces of legislation on custody because some regions have competence in family law matters (Hayden 2011). By analyzing these laws, we identified the factual criteria that judges must consider. Of these, some have to do with the children: their circumstances (CHILD_CIRC; for example, their age, the existence of possible special needs or illnesses, and school performance); their roots (CHILD_ROOT) in a certain locality, school, etc. or with one of the two parents; and their opinions regarding custody (CHILD_OPIN). Other criteria have to do with the parents: the relationship between them (whether it is good or bad) and their attitudes concerning the children (PAR_RELAT); the availability of time and material means to care for the children (PAR_RDNS); and the dedication they have previously shown to the children (PAR_DED). Finally, the opinion expressed by the experts in the psychological–social report (PSY_REP) was included.

These categories have different functions in the analysis of the judgments. Our first objective was to identify the request and to determine whether or not it was granted by the court. There should be no contradiction in these labels since a procedure can have only one main request by the plaintiff and one decision on the request by the court. The second objective was to identify some of the arguments used by the court to support its decision. Among the eight arguments that were defined, there was one of a legal nature—the principle of the best interests of the child—and others that corresponded to factual arguments. This information will make it possible to know how often the courts use the different arguments and whether there is a relationship between this frequency and other variables, such as the request/decision or the sex of the parties.

4.2 The court ruling

In Spain, the court rulings are made up of four sections. First, there is the heading, which contains the details of the court, the parties involved, and the professionals who represent and defend the parties. Second, there is the facts section, in which the court describes the process followed in the procedure, the parties’ requests, the alleged facts, and the evidence on those facts. The appellate court rulings also describe the first instance proceedings, including the parties’ requests and the judge’s decision. Third come the legal grounds, which contain the ratio decidendi, which “can be identified as those statements of law that are based on the facts found and on which the decision is based” (Raz 2002, p. 21). In addition, they often contain reasoning on procedural incidents and other subsidiary questions. Finally, the fourth part is the verdict section that contains only the court’s decision. A basic outline of the court ruling structure and the analyzed categories is represented in Fig. 1.

Fig. 1 Court ruling structure and analyzed categories

In our study, we also needed to determine the section or sections of the court ruling that should be analyzed to extract the different elements. The decision is expressed in the verdict section in relation to the request, which is accepted, partially accepted, or denied. The same words are normally used; thus, it is relatively easy to identify through regular expressions whether a judgment has been favorable or unfavorable to the applicant. But if the request is not known, this information is incomplete and does not allow a determination of whether the decision is for individual or joint custody. For its part, in the facts section, the court explains all the requests and decisions of the first instance, the requests of the appeal, and all the facts alleged by the parties, many of which are not taken into account for the decision. As a consequence, it tends to be a very extensive section that contains a lot of information that is only “noise” for the purposes of our experiment. However, the court very often states in the legal grounds the requests on appeal and what its decision on them will be. In this part of the ruling, the court also selects, from among all the proven facts, those that it considers relevant for its decision and uses them to build the justification. Finally, the legal grounds contain the legal principles and regulations that support the decision. Consequently, we hypothesized that it would be possible to obtain all the searched-for elements from the legal grounds alone, and we decided that this was the only section that would be labeled and analyzed.
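As an illustration of this point, the sketch below shows how regular expressions over the verdict section could flag whether an appeal was upheld or dismissed. The patterns are simplified examples for illustration only, not the rules actually used in the study.

```python
# Illustrative sketch: detecting whether a Spanish appeal verdict upholds or dismisses
# the appeal with regular expressions. The patterns below are simplified examples.
import re

UPHELD = re.compile(r"\b(estimamos|se estima|estimando)\b.*\brecurso\b",
                    re.IGNORECASE | re.DOTALL)
DISMISSED = re.compile(r"\b(desestimamos|se desestima|desestimando)\b.*\brecurso\b",
                       re.IGNORECASE | re.DOTALL)

def verdict_outcome(verdict_text: str) -> str:
    """Classify the verdict section as upheld, dismissed, or unknown."""
    if DISMISSED.search(verdict_text):
        return "dismissed"
    if UPHELD.search(verdict_text):
        return "upheld"
    return "unknown"

print(verdict_outcome("FALLAMOS: Que desestimamos el recurso de apelación interpuesto..."))
```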

One other issue was that not all the content of the legal grounds is part of the ratio decidendi. There are multiple reasons for this, but a few stand out. One is that the court usually describes the arguments put forward by the parties, regardless of whether or not it later makes them its own. Another is that very often, fragments of judgments of higher courts or of previous judgments of the court itself are quoted. In addition, in appeal judgments, the courts usually start their argumentation by summarizing the requests, the arguments, and the decisions in the previous procedures. All this content contains expressions identical to those used in the ratio decidendi, and only an analysis of the context allows the annotators to know whether or not a given mention of one of the categories should be labeled.

4.3 Dataset

In Spain, family trials are first heard by the lower courts. If there is an appeal against the lower court’s judgment, this is resolved by a provincial court, one of which is located in each province for a total of 50 provincial courts. The rulings in the dataset were obtained from the Spanish Centre for Judicial Documentation (CENDOJ), which, with some exceptions, only holds rulings from the provincial courts and other higher courts. For this reason, the study had to be carried out on appellate rulings, although in light of the objectives, it would probably have been better to use first instance rulings. Another important limitation was that CENDOJ does not provide court rulings in bulk, and it was necessary to download them one by one from their website. This entailed significant additional work and prevented an exhaustive selection of the entire collection of court rulings. Thus, the dataset does not include all the custody court rulings in the period.

The dataset consists of 3047 court rulings dealing with all or some of the following issues: the type of custody, alimony, and the allocation of the use of the family home to one of the parents. Rulings from all the provincial courts were included, and we tried to distribute them in proportion to the number of inhabitants of each province. As for the date, priority was given to the selection of the most recent cases, and 80% corresponded to the period 2015‒2020, although there were judgments going back to 2006, the first year in which joint custody was legally recognized in Spain. This dataset is one of the most important results of the study, as it may be useful for other research (Riera et al. 2023); it is available at https://github.com/labje/bidaraciv. More detailed information about the dataset and its spatial and temporal distribution, as well as about the set of labels and the labeling methodology, is also available at this URL.

The rulings were downloaded in PDF format. A process then extracted the information from the metadata and, through a system of rules, identified the gender of the plaintiff and the defendant from the header. It then converted the rulings to plain text and separated out the legal grounds, which had a mean length of 2067 words. Next, two jurists, one with a PhD (annotator 1) and one with a master’s degree (annotator 2), annotated the judgments using Brat (Stenetorp et al. 2012; see Fig. 2). Each of the annotators independently annotated all the rulings, a task that lasted 10 months, with a throughput of about 15 rulings per person per day. Prior to the start of labeling, a guide was prepared that established the format of the labels and defined in detail the concepts associated with each of the categories. This guide was improved during the first weeks of labeling through research team meetings in which doubtful cases were analyzed. In the first and second months, we conducted intermediate analyses of the IAA, adding one of the authors as a third annotator, to refine the methodology and ensure consistent results.

Fig. 2 Labeling example

The function of the labeling was twofold. First, the presence of the labels in a court ruling made it possible to characterize the procedure to which it corresponded and to know what the petition and the decision had been, as well as the main arguments used by the court. Second, the text associated with the labels was to be used to train the neural network. For this purpose, two criteria were established: (1) the selected text fragments should be those that most clearly reflected the category associated with the label; and (2) the relationship between the fragments and the associated categories should be deducible without the need for additional context. With these criteria, the resulting fragments were quite long, with a mean length of 87 words, and diverse, with a standard deviation of 61. Table 1 shows examples of text fragments associated with each category.

Table 1 Examples of text labeled in the court rulings for each of the categories

Another issue that arose was how to proceed when there were several mentions of the same category in a court ruling. The criterion that was established was that when the court dealt extensively with the same category, only the most representative fragment was labeled. For example, it was common for a court to deal with the financial means of the parents, and sometimes the evidence on this issue was very lengthy, giving rise to protracted reasoning. The annotator had to look for a single fragment that reflected the court's conclusion on this issue and tag it with the PAR_RDNS label. If the court dealt with other arguments related to the same label, such as the availability of housing or time for child care, each instance of reasoning was treated in the same way. As a result, if the same label appeared several times in the same court ruling, it could be interpreted that the argument had been used from different perspectives and, therefore, with greater intensity.

In the present work, we used a subset formed by the 2394 court rulings that dealt with child custody as the main issue, although they may also have dealt with alimony and family housing. There were 36,087 labels in these court rulings, making a mean of 15.07 per court ruling. The number of labels for each category can be seen in Table 2.

Table 2 Number of annotations for each category

4.4 Model development

The categories we wanted to identify can be expressed in many different ways. For example, the model should be able to detect when the court is taking into consideration the financial situation of the parents to make the decision. But the court can refer to this with phrases as diverse as: “both parents have the same economic capacity”, “both parents have resources to adequately care for the common child”, “Mrs. Isidora does not earn income”, “both parents lack stability in the labor field, depending on the economic support of their families”, and so on. For this reason, the use of rule-based lexical analysis was discarded and we opted to use a transformer-based language model in order to assess the semantic capacity of this technology to identify certain categories in the court rulings, even though they were expressed in very different ways.

First, we divided the text into sentences using a standard rule-based segmentation system. There were some difficulties due to the intrinsic characteristics of legal language, which often uses very long sentences, as well as to frequent errors and inaccuracies in the transcriptions of the court rulings. Once sentence segmentation was performed, sentences with three words or fewer (2.88%) were excluded. Sentences with 300 words or more (2.02%) were also excluded, as it was found that this length was generally due to punctuation errors. Finally, the neural network training was performed over 72,261 sentences that were randomly divided, by whole court rulings, into training data (72%), test data (18%), and validation data (10%).
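A minimal sketch of this preprocessing step is shown below. It uses spaCy's rule-based sentencizer as a stand-in for the segmenter actually employed and applies the length thresholds described above; the function name is ours.

```python
# Sketch of sentence segmentation and length filtering, using spaCy's rule-based
# sentencizer as a stand-in for the segmenter used in the study.
import spacy

nlp = spacy.blank("es")
nlp.add_pipe("sentencizer")

def segment_and_filter(legal_grounds: str, min_words: int = 4, max_words: int = 299):
    """Split the legal grounds into sentences and drop extreme-length ones."""
    doc = nlp(legal_grounds)
    kept = []
    for sent in doc.sents:
        n_words = len([tok for tok in sent if not tok.is_punct and not tok.is_space])
        if min_words <= n_words <= max_words:
            kept.append(sent.text.strip())
    return kept
```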

The overall classification is based on two steps. The first consists of a binary classification to decide whether a request, a decision, or an argument is present in a sentence (i.e., to decide whether the individual sentence contains a label or not). This discrimination is difficult to resolve because a sentence is annotated only if its content is related to the current court case and not if it relates to previous or higher court rulings. As a consequence, the classification task becomes more complex because it requires the context, including the previous and subsequent sentences, to be analyzed to identify whether the content of the sentence refers to the current case. Accordingly, the model generates a BERT-based sentence embedding, and this encoded representation feeds a bidirectional LSTM (Bi-LSTM) network for the final classification (Graves et al. 2005; Liu et al. 2021). This approach, in which the output of the transformer model generates a contextualized sentence embedding that is used to initialize the model in charge of the classification task, is similar to the one used by Phang and colleagues (2018).

This strategy suffered from one drawback: the sentences were, on average, at least 100 tokens long, and if context information was appended, the total length of the data to be encoded by the BERT model far exceeded the maximum length permitted (512 tokens). This required us to follow a more elaborate strategy to classify a sentence alongside its context. First, we used a transformer-based model instead of the traditional Doc2Vec model to represent each sentence as a vector. In this way, each document is encoded as a sequence of vectors. Afterward, a fixed-length sliding window passes through the sequence, retrieving the set of sentence embeddings that feeds the Bi-LSTM network to classify the central sentence, as depicted in Fig. 3. The window length was set to three sentences. Although longer window sizes were tested, no significant improvements in classification were noticed, while the computational cost increased. A Bi-LSTM can extract information features from backward and forward data input at the same time (Schuster and Paliwal 1997), and by using this type of model in our approach, we attempted to gather the grammatical information about the context that is required to classify each sentence. This architecture addressed the challenge of exceeding the maximum length permitted by the BERT model when all the contextual information of a sentence was encoded.
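The following PyTorch sketch outlines this architecture under simplifying assumptions: sentence embeddings produced by the multilingual sentence transformer named in Sect. 6.1 are grouped into windows of three and fed to a Bi-LSTM that classifies the central sentence. The hyperparameters shown are illustrative, not the tuned values.

```python
# Sketch of the stage-1 classifier: windowed sentence embeddings + Bi-LSTM.
# Hyperparameters are illustrative; the encoder is the model named in Sect. 6.1.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("distiluse-base-multilingual-cased")  # 512-dim embeddings

class WindowedBiLSTMClassifier(nn.Module):
    def __init__(self, emb_dim: int = 512, hidden: int = 100):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # annotated vs. non-annotated

    def forward(self, windows: torch.Tensor) -> torch.Tensor:
        # windows: (batch, window_size=3, emb_dim)
        out, _ = self.lstm(windows)
        central = out[:, out.size(1) // 2, :]  # hidden state aligned with the central sentence
        return self.head(central)

def make_windows(sentences, size: int = 3) -> torch.Tensor:
    """Encode a ruling's sentences and build zero-padded windows around each one."""
    emb = torch.tensor(encoder.encode(sentences))             # (n, emb_dim)
    pad = torch.zeros(size // 2, emb.size(1))
    emb = torch.cat([pad, emb, pad])                          # pad document boundaries
    return torch.stack([emb[i:i + size] for i in range(len(sentences))])
```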

Fig. 3 Stage 1: classification of annotated and non-annotated sentences based on sentence context

Our approach is based on sentence transformers, an architecture that provides an efficient method for computing dense vector representations for various types of content, including sentences, paragraphs, and images. Sentence transformers are based on transformer networks, such as BERT, RoBERTa, and XLM-RoBERTa, that encode texts into a vector space where similar texts are close to each other, enabling efficient retrieval through cosine similarity. We utilized Sentence-BERT (SBERT), which achieves state-of-the-art performance for various sentence-embedding tasks (Reimers and Gurevych 2020), to encode each sentence in our dataset into a fixed-length vector that captures its semantic meaning. SBERT is based on the BERT model and applies mean pooling on the output to produce sentence embeddings (Devlin et al. 2019). In addition, we incorporated XLM-R, a pretrained network that supports 100 languages, including Spanish, as a student model in our experiments (Conneau et al. 2020). This allowed us to leverage the multilingual capabilities of XLM-R and utilize its pretrained language models to enhance our classification performance on non-English texts. While we only examined the SBERT approach in our study, this strategy can be applied to other network architectures. Future research can investigate the efficacy of using different network architectures and optimization strategies to enhance our approach.

In the second step (Fig. 4) of the approach, the model must identify the particular elements (request, decision, or arguments) mentioned in every annotated sentence. In this case, each sentence belongs to one or more classes, which is referred to as a multilabel classification problem. Multilabel, fine-tuned, transformer-based models were selected to classify the annotated sentences, seeking the best possible performance on a test set. We benchmarked the different models in the literature by fine-tuning pretrained models on our downstream task of classifying Spanish legal documents. Whereas in English there are several benchmarks for multilabel classification on standard texts, to the best of our knowledge, there is no such benchmarking for Spanish and, in particular, for judicial language. Thus, it was necessary to compare different options. Specifically, a careful comparison between multilingual (various languages including Spanish) and monolingual (exclusively Spanish) pretrained language models was performed. Multilingual models, such as the multilingual version of BERT, XLM, DistilBERT, and XLM-RoBERTa, were evaluated. As for a monolingual model, BETO, the Spanish pretrained BERT model trained on a large unannotated monolingual corpus of about three billion tokens (Canete et al. 2020), was utilized.
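A sketch of this second stage is given below, fine-tuning BETO for multilabel classification with the Hugging Face Trainer. The checkpoint name and the exact label inventory are assumptions made for illustration, and the training details are simplified relative to the runs described later.

```python
# Sketch of the stage-2 multilabel classifier: fine-tuning a Spanish BERT with one
# sigmoid output per label. Checkpoint name and label inventory are assumptions.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["RQ_JOIN+", "RQ_JOIN-", "DEC_JOIN+", "DEC_JOIN-", "BEST_INT",
          "CHILD_CIRC", "CHILD_ROOT", "CHILD_OPIN", "PAR_RELAT",
          "PAR_RDNS", "PAR_DED", "PSY_REP"]

tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # uses BCE loss, one sigmoid per label
)

args = TrainingArguments(output_dir="stage2-multilabel", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()  # datasets omitted; labels must be float multi-hot vectors
```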

Fig. 4 Stage 2: multilabel classification based on transformers (the sentences with X are non-annotated)

One major challenge encountered during the first training phase was the significant imbalance between annotated and non-annotated sentences, with only 24% of annotated sentences. Csányi and Orosz (2022) showed that text augmentation can improve performance in some cases of imbalance, but it can lead to overfitting and decreased performance in others. Legal language is unique and context-dependent, making it difficult to generate appropriate synthetic examples. Thus, they suggest using label-balancing procedures rather than text augmentation to address imbalanced data in legal NLP tasks. Based on our experience, we also deemed it inappropriate to engage in class balancing through synthetic text generation for a couple of reasons. First, generating synthetic textual data that maintains the subtleties and nuances of natural language is exceedingly complex. Second, artificially balanced datasets can potentially introduce noise and reduce the model’s ability to generalize from the training data to unseen data, as it may learn from features that are not representative of real-world data.

Because of this, a naive down-sampling strategy was adopted to balance the annotated and non-annotated datasets. This method involved a random selection of an equal number of sentences from both categories to ensure a representative baseline in the respective training and development sets. However, the balancing method did not substantially improve the F1 score. In the second stage (multilabel tagging), we decided against implementing this balancing method. The reasons for this decision were twofold: to maintain an equitable comparison ground with other architectural setups, and to portray a true reflection of the intrinsic data distribution during the model's training and validation processes.
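The down-sampling itself reduces to a few lines; the sketch below is an illustrative version (not the actual script), where labels[i] = 1 marks an annotated sentence.

```python
# Naive down-sampling sketch: keep all annotated sentences and an equal-sized
# random sample of non-annotated ones (illustrative helper, not the study's code).
import random

def downsample(sentences, labels, seed=13):
    annotated = [i for i, y in enumerate(labels) if y == 1]
    rest = [i for i, y in enumerate(labels) if y == 0]
    random.Random(seed).shuffle(rest)
    keep = sorted(annotated + rest[:len(annotated)])
    return [sentences[i] for i in keep], [labels[i] for i in keep]
```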

To improve the performance of the model, 600 runs were carried out using the Optuna library. Each run corresponded to a different parameterization of the model. In stage 1, the parameters were adjusted within the following ranges: epochs, 1–20; batch size, 1–32; learning rate, 1e−5 to 1e−2; LSTM hidden size, 100–500; LSTM layers, 1–3; dropout, 0.0–0.5; and learning rate adaptation, true/false. In addition, the window size was fixed at 3, the bidirectional setting at true, and early stopping was applied. In stage 2, the ranges were as follows: epochs, 1–5; batch size, 1–32; learning rate, 1e−5 to 1e−2; maximum sequence length, 20–40; threshold, 0.1–0.9. Here, too, we adhered to a fixed set of parameters: early stopping was engaged, a warmup proportion of 0.1 was maintained, and the gradient accumulation steps were capped at 1.
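The sketch below shows how such a search can be set up with Optuna for stage 1. The ranges mirror those listed above, while train_stage1() is a hypothetical function that trains one configuration and returns its validation F1.

```python
# Optuna search sketch for stage 1; train_stage1() is a hypothetical placeholder.
import optuna

def objective(trial: optuna.Trial) -> float:
    params = {
        "epochs": trial.suggest_int("epochs", 1, 20),
        "batch_size": trial.suggest_int("batch_size", 1, 32),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "lstm_hidden_size": trial.suggest_int("lstm_hidden_size", 100, 500),
        "lstm_layers": trial.suggest_int("lstm_layers", 1, 3),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "adapt_learning_rate": trial.suggest_categorical("adapt_learning_rate", [True, False]),
    }
    # Fixed settings: window of 3 sentences, bidirectional LSTM, early stopping.
    return train_stage1(**params, window_size=3, bidirectional=True, early_stopping=True)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=600)
```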

5 Results

To evaluate performance, we first used the F1 index. However, as the sample was highly unbalanced, it was easy to reach high values on this index, which can be misleading. Therefore, we also used the Kappa index, a consensus metric that is less sensitive to this phenomenon. For the interpretation of the Kappa index, we followed the guidelines of Viera and Garrett (2005), who categorized values between 0.41 and 0.60 as indicating a moderate level of agreement, values between 0.61 and 0.80 as indicating substantial agreement, and values between 0.81 and 0.99 as representing almost perfect agreement. Kappa values are presented in the tables together with their standard errors. We also evaluated the statistical significance of the F1 and K values using the chi-square test, and a significant association between the variables was obtained in all cases (p < 0.001).
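For reference, both agreement metrics can be computed with scikit-learn as in the sketch below; the vectors shown are toy data, not values from the study.

```python
# Toy example of the agreement metrics used in this section (scikit-learn).
from sklearn.metrics import cohen_kappa_score, f1_score, precision_score, recall_score

y_a = [1, 0, 1, 1, 0, 0, 1, 0]   # e.g., annotator 1 (toy data)
y_b = [1, 0, 0, 1, 0, 1, 1, 0]   # e.g., annotator 2 or the model (toy data)

print("P =", precision_score(y_a, y_b))
print("R =", recall_score(y_a, y_b))
print("F1 =", f1_score(y_a, y_b))
print("K =", cohen_kappa_score(y_a, y_b))
```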

As explained in Sect. 4.4, the model performs two tasks: first, the classification of annotated and unannotated sentences; and second, the multilabel classification of annotated sentences. We evaluated the IAA of the first task using the entire corpus (N = 72,261) and the performance of the model using the test set (N = 7017). A sentence was considered to be annotated by a human if it had been labeled by at least one annotator. Table 3 shows the results.

Table 3 IAA and agreement between network and humans in the classification of annotated and non-annotated sentences (P, precision; R, recall; F, F1 score; K, Kappa index)

The two annotators labeled 24% (12% + 7% + 5%) of the sentences, with a low IAA (K = 0.5782), although close to the threshold of substantial agreement. The model labeled a higher percentage of sentences (15% + 20%), and the value of K = 0.4991 indicates a moderate level of agreement. The F1 values are almost identical (F = 0.65) in both cases, suggesting a better performance than the Kappa values indicate. However, this value is reached in opposite ways: for the human annotators, it is mostly due to precision, which is quite high (P = 0.7044), whereas for the model, it is due to recall (R = 0.7628). This fact, rather than the imbalance of the sample, which is not very pronounced, may explain the difference between the two metrics. As for the impact of these results on the reliability of the information extracted from the court rulings, the 15% of false positives indicates the proportion of decisions, requests, or arguments that were identified but were not relevant, while the 6% of false negatives corresponds to possible losses of relevant information.

We continue with the analysis of the second task. In this task, a sentence is regarded as falling within a specific category if it contains at least one instance of the corresponding label. To measure the IAA, we used the sentences labeled by the two annotators (N = 8414), and alignment between annotators was deemed present when both assigned the same category to a sentence. For the model performance, we took the sentences from the test set (N = 3328), and alignment was established between the model and annotators if the model's categorization aligned with at least one of the two annotators. Table 4 shows the results for the type of custody requested (RQ_JOIN) and the type granted (DEC_JOIN). The “+” sign corresponds to individual custody and the “−” sign to joint custody. Table 5 shows the results for the different arguments, ranking them from the highest to the lowest value of K in the IAA.

Table 4 IAA and agreement between network and humans on the requests and decisions contained in the sentences (P, precision; R, recall; F, F1 score; K, Kappa index)
Table 5 IAA and agreement between network and humans on the arguments contained in the sentences (P, precision; R, recall; F, F1 score; K, Kappa index)

The IAA for requests corresponds to almost perfect agreement (K = 0.9243), and for decisions (K = 0.7924), it is near the threshold of almost perfect agreement. The performance of the model, with K = 0.8398 and K = 0.6343, is in the range of almost perfect agreement and substantial agreement for requests and decisions, respectively. For the arguments, the value of K obtained by the model exceeds the threshold of substantial agreement in each case, except for the rootedness of the children (CHILD_ROOT), although its value of K (0.6022) is very close. The IAA presents similar K values, with six arguments in the range of substantial agreement. The highest values of K for the model and for the IAA are very similar and correspond to the children's opinions (CHILD_OPIN; K = 0.8005) and to the availability of time and material means (PAR_RDNS; K = 0.8063), respectively. The lowest model performance is observed for the rootedness of the children (CHILD_ROOT; K = 0.6022), and the lowest IAA value for the children's circumstances (CHILD_CIRC; K = 0.5275).

RQ3 aims to determine whether it is possible to characterize judicial proceedings by processing only the legal grounds of the court ruling. We did not have a set of court rulings in which the request, the decision, and the arguments were known with total certainty: first, because these concepts are often ambiguous and open to interpretation, even in the case of the custody modality; and second, because resources were not available to develop a gold standard. Therefore, the same strategy was followed as in the previous analyses, and the IAA was compared with the performance of the model. We chose to calculate the IAA on the entire corpus of labeled rulings (N = 2394) in order to obtain a more accurate measure, even though this means that the comparison with the model performance, calculated on the rulings in the test set (N = 595), is not perfect. Table 6 shows the results for the type of custody requested (RQ_JOIN) and the type granted (DEC_JOIN).

Table 6 IAA and agreement between network and humans on the requests and decisions contained in the court rulings (P, precision; R, recall; F, F1 score; K, Kappa index)

The two annotators agreed in 76% of the court rulings: 31% in which individual custody was requested (RQ_JOIN+) and 45% in which joint custody was requested (RQ_JOIN−). There were also some court rulings for which annotator 1 included the request but annotator 2 did not (7% of RQ_JOIN+ and 5% of RQ_JOIN−), while in other cases, the opposite was true (1% of RQ_JOIN+ and 1% of RQ_JOIN−). Thus, although all court rulings are expected to have a request and a decision, it was only possible to know the request in 90% of the cases. For the court decisions, the two annotators agreed in 63% of the cases (39% of DEC_JOIN+ and 24% of DEC_JOIN−), while in 19% of the cases, only annotator 1 included a label (13% of DEC_JOIN+ and 6% of DEC_JOIN−) and in 4% of the cases (2% of DEC_JOIN+ and 2% of DEC_JOIN−), only annotator 2 included a label. Consequently, with human annotations, it was possible to know the decision in 86% of the court rulings.

The model performs significantly worse because it did not predict any request in 38% (2% + 2% + 34%) of court rulings and did not predict any decision in 52% (6% + 2% + 44%). However, its accuracy in predicting requests was quite high; while the IAA for the requested custody mode has a value of K = 0.7354, the agreement between the model and the human annotators is K = 0.8001, a value close to the threshold of almost perfect agreement. For decisions, the relationship is inverse because the IAA has a value of K = 0.6360, which implies substantial agreement, and the model agreement with the annotators has a value of K = 0.5659, a lower value, although it remains relatively close to the substantial agreement limit.

Table 7 shows the results for the identification of arguments, ordered according to the Kappa value obtained for the IAA.

Table 7 IAA and agreement between network and humans on the arguments contained in the court rulings (P, precision; R, recall; F, F1 score; K, Kappa index)

The IAA ranges from F = 0.6000 to F = 0.8549, and the model performance ranges from F = 0.6557 to F = 0.9284. Using the Kappa index, the model exceeds the threshold of substantial agreement for six of the arguments and is very close to this threshold for the remaining two. The model's best performance (K = 0.7371) was for the availability of time and material means (PAR_RDNS), and its worst performance (K = 0.5950) was for the previous dedication to the children (PAR_DED). These values are significantly higher than the IAA values, among which only the children's opinion (CHILD_OPIN) was in the range of substantial agreement (K = 0.7007). The lowest IAA value (K = 0.3251) corresponds to the best interests of the child (BEST_INT).

6 Discussion

6.1 Benchmarking algorithms

In this experiment, we evaluated several open-source models from the Hugging Face platform (https://huggingface.co/models), using several baseline models to foster a robust analysis. The configurations for these models were carefully chosen not only to optimize performance but also to facilitate a comparative analysis across different architectural paradigms.

For the classification of annotated and non-annotated sentences, two groups of configurations were explored. The first comprised TF-IDF (Robertson 2004) + Random Forest (Breiman 2001) and TF-IDF + Bi-LSTM. For both, TF-IDF was configured with min_ngrams and max_ngrams set at 1 and 5, respectively, and 500 features. The default settings were used for min_df and max_df, while the stopwords were set to the standard Spanish list. The Random Forest parameters were: number of trees (n_estimators) = 100; number of splitting variables = default (the square root of the number of TF-IDF features); and maximum tree depth = 10. The Bi-LSTM setup was finalized with 5 epochs, a batch size of 15, a learning rate of 8.2e−4, one LSTM layer, and a hidden size of 100 units.
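A scikit-learn sketch of this baseline configuration is shown below; the Spanish stop-word list is an assumption (scikit-learn ships no built-in Spanish list, so NLTK's is used here).

```python
# Baseline sketch matching the TF-IDF + Random Forest configuration described above.
# Requires: nltk.download("stopwords")
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 5), max_features=500,
                    stop_words=stopwords.words("spanish")),
    RandomForestClassifier(n_estimators=100, max_depth=10),
)
# baseline.fit(train_sentences, train_labels)  # annotated vs. non-annotated sentences
```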

In the second group, we focused on models incorporating a sentence transformer paired with either a Bi-LSTM or a non-recurrent NN, as well as a setup involving a windowed sentence transformer coupled with a Bi-LSTM. The sentence transformer model used was “distiluse-base-multilingual-cased”. The configuration of the Bi-LSTM was the same as described above. In contrast, the non-recurrent NN configuration was structured with 15 epochs, a batch size of 31, a learning rate of 8.2e−4, and a dropout rate of 0.35. Table 8 shows the results. The best test score (F = 0.79) was obtained using windows made up of three sentences to feed a Bi-LSTM network. This was the architecture chosen to develop the model, and it is described in detail in Sect. 4.4.

Table 8 Results for stage 1: classification of annotated and non-annotated sentences

Table 9 shows the results for the multilabel classification. The F1 coefficient was used to assess the extent to which the model agrees with the annotators in assigning a given label to a sentence. However, in this scenario, the classes were not mutually exclusive, which means a sentence may contain several labels (multilabel classification problem). The Label Ranking Average Precision (LRAP; Madjarov et al. 2012) was also calculated to quantify the extent to which the model and the annotators coincided when assigning all the labels to the same sentence, and it is also displayed in the table.
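Both metrics can be computed with scikit-learn, as in the illustrative sketch below (the label matrices are toy data, not results from the study).

```python
# Toy computation of the two multilabel metrics reported in this section.
import numpy as np
from sklearn.metrics import f1_score, label_ranking_average_precision_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])                # gold multi-hot labels per sentence
y_score = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]])   # model probabilities
y_pred = (y_score >= 0.5).astype(int)                    # thresholded predictions

print("micro-F1 =", f1_score(y_true, y_pred, average="micro"))
print("LRAP =", label_ranking_average_precision_score(y_true, y_score))
```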

Table 9 Results for Stage 2: multilabel classification based on transformers

In this case, the best results were obtained with BETO (the pretrained version of BERT for Spanish) and with XLM-RoBERTa. Given the peculiarities of legal language, it can be assumed that developing a version of BERT pretrained specifically for this type of language, leveraging GPT-3 or other large language models, or training our own model on all the CENDOJ data would significantly improve the results obtained. Future research should focus on this direction to explore the potential of these language models.

6.2 Knowledge extraction

Table 10 summarizes the F1 score and Kappa index values for the IAA and for the model's agreement with the human annotators in assigning requests, decisions, and arguments to sentences and court rulings. The table also includes the difference between the IAA and the model's performance, as well as the mean of the absolute values of these differences. Furthermore, the table presents the standard deviations of the IAA series and the model's agreement series and the Pearson correlation coefficient between the two. Bold values indicate instances in which the model's performance falls below the IAA.
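The summary statistics in Table 10 can be reproduced with NumPy as in the sketch below; the per-label values shown are toy data, not the figures in the table.

```python
# Sketch of the Table 10 summary statistics: per-label differences between IAA and
# model agreement, mean absolute difference, standard deviations, and Pearson r.
import numpy as np

iaa_k   = np.array([0.74, 0.64, 0.33, 0.53])   # toy per-label Kappa values (IAA)
model_k = np.array([0.80, 0.57, 0.69, 0.62])   # toy per-label Kappa values (model)

diff = iaa_k - model_k
print("mean |diff| =", np.mean(np.abs(diff)))
print("std IAA =", iaa_k.std(ddof=1), " std model =", model_k.std(ddof=1))
print("Pearson r =", np.corrcoef(iaa_k, model_k)[0, 1])
```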

Table 10 Summary of F1 score and Kappa index values for IAA and agreement between model and humans

The first research question (RQ1) asked whether the model could distinguish the sentences in the text of a court ruling that are relevant to characterizing it from those that are not, a task that corresponds to the classification of annotated and non-annotated sentences (described as stage 1 in Sect. 4.4). As seen in Sect. 5, in this classification, the agreement of the model with the annotators is not very high, with a value of F = 0.6519 and a K coefficient corresponding to moderate agreement (K = 0.4991). To obtain more information about the performance of the model in this task, a sample of 100 annotated sentences was randomly selected, and one of the authors evaluated the model's output, finding it correct in 79% of cases. This corresponds to a 21% false positive rate, somewhat higher than the 15% obtained in the comparison with the annotators.

However, in the experiment, we were mainly focused on the impact that the selection of relevant sentences would have on the classification of court rulings, the performance of which depends both on this task and on the classification of sentences. A low performance in the selection of relevant sentences should make the performance of the model in classifying court rulings appreciably lower than in classifying sentences. In this sense, the mean performance of the model in both tasks was almost identical: the mean F1 differed by 0.01 (F = 0.78 vs. F = 0.77), and the mean K by 0.06 (K = 0.72 vs. K = 0.66). However, when considering the data for each label, the correlation between the IAA and the model results, which was 0.97 for F1 and 0.74 for K in the classification of sentences, dropped to 0.65 for F1 and 0.29 for K in the classification of court rulings. This difference might be due to the selection of relevant sentences, but, as will be explained later, we assume that it is mainly due to the superior performance of the model compared with the IAA. Therefore, in response to RQ1, it can be stated that the selection of relevant sentences was performed with an acceptable level of efficiency for the purposes of the experiment.

The second research question (RQ2) addressed the model's ability to identify the plaintiff's petition, the court's decision, and some predetermined arguments in a court ruling. Regarding the petition and the decision, it is noteworthy that the model only managed to assign a petition in 62% of the court rulings and a decision in 48%. Also, when the model assigned a request and/or a decision, the performance was significantly better for requests (F = 0.87, K = 0.80) than for decisions (F = 0.75, K = 0.57). One possible explanation is that during labeling, the annotators often had difficulties in finding expressions in the legal grounds that explicitly indicated the request and, above all, the decision. These results indicate that these difficulties considerably increased for the neural network and would suggest a negative answer (at least partially) to the third research question (RQ3), which asked whether it was appropriate to use only the legal grounds to characterize the court rulings.

We consider that the performance of the model could be improved by including other parts of the court ruling in the analysis. The request is always described in the facts section, and more clearly and directly so than in the legal grounds (see Sect. 4.2). Therefore, it can be assumed that also analyzing the facts section would improve the identification of the request, although it should be borne in mind that the facts section of appeal court rulings also describes all the requests made in previous instances. As for the decision, as already stated, it is always included in the verdict section using standard formulas, identifiable through regular expressions, that indicate whether the request was accepted or not. If this section were also analyzed, then once the request is known and the sense of the verdict is identified through regular expressions, the decision could be deduced, improving the performance of the system in this respect.

Returning to RQ2, the answer depends, first, on the model's performance in the multilabel classification of the sentences (described as stage 2 in Sect. 4.4). In this task, the F1 values obtained by the model ranged from 0.62 to 0.96, with a mean of 0.78, and the K values ranged from 0.60 to 0.84, with a mean of 0.72. Except for one label, these values were within the range of substantial agreement. The performance in the classification of court rulings was, as already stated, slightly lower: F1 values ranged from 0.66 to 0.93, with a mean of 0.77, and K values ranged from 0.57 to 0.80 and were, except for three labels, within the range of substantial agreement. These results were in the range of those obtained in other comparable studies, such as that of Yamada et al. (2019), who, as stated, worked with Japanese court rulings and obtained an IAA of F = 0.52 for the annotation of subject thematic units and of F = 0.63 for the rhetorical classification. Zhang et al. (2019) annotated newspaper articles with a similar methodology and obtained values between F = 0.52 and F = 0.60.

The standard deviation reveals that the results of the model are more uniform than the IAA in all tasks. In particular, for the K index of the court rulings classification, the standard deviation of the model was 0.07, while that of the IAA was 0.11. As already mentioned, in this classification, the agreement between the annotators for each label and their agreement with the model showed a low correlation coefficient. This could be due to lower annotator efficiency, since in eight of the ten labels, the agreement of the model with the annotators was higher than that of the annotators with each other. The biggest differences appear in the more generic and indeterminate concepts, such as the best interest of the child (BEST_INT) and the circumstances of the child (CHILD_CIRC), where the K value obtained by the model is higher by 0.36 and 0.29, respectively. But there is also a difference in very specific labels, such as the psychosocial report (PSY_REP), where K is higher by 0.12. One possible explanation for this better performance of the model could be that the annotators acted independently of each other, while the neural network was trained with the sentences that both had annotated. However, if this were the main cause, the difference should also occur in the classification of the sentences, and this is not the case.

For very general concepts, such as BEST_INT or CHILD_CIRC, a possible explanation could be that these are labels with a high number of annotations, which allows better training of the model (see Table 1). For other labels, such as PSY_REP, the factor favoring the better performance of the neural network could be the high degree of concreteness of the concept. The effect of these two factors could also be observed in the sentence classification. There, the highest performance of the model was for the availability of time and material means (PAR_RDNS), which was the label with the highest number of instances; moreover, the concept is quite specific, as it refers to economic and material means, such as housing or available time. The lowest performance corresponded to the child's rootedness (CHILD_ROOT), which had a low number of labels, and to the child's circumstances (CHILD_CIRC). Both are indeterminate concepts and, in addition, the boundary between them is quite blurred, which probably worsened the performance of both.

Analyzing the confusion matrices of the court rulings classification, the model's lower percentage of false negatives compared to the human annotators stands out; the difference was very marked for some labels, for example, the child's circumstances (CHILD_CIRC), with 9% and 25%, respectively. This suggests that another possible reason for the better performance of the neural network is that it performed a more exhaustive analysis of the texts. Annotators economize on labels and, moreover, their opinions differ as to whether a certain argument was important enough in a court ruling to label it. Consequently, in many cases, they would choose not to label a given sentence, especially for the more indeterminate and frequently occurring concepts. In contrast, the model would perform a much more thorough analysis of the text and would tend to label each argument whenever it is mentioned, decreasing false negatives.
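The kind of per-label false negative analysis described here can be derived from multilabel confusion matrices, as sketched below with placeholder data; the label subset and indicator matrices are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Placeholder binary indicator matrices (rulings x labels); not the study's data.
labels = ["BEST_INT", "CHILD_CIRC", "PSY_REP"]
y_reference = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]])
y_predicted = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 0, 0]])

for name, cm in zip(labels, multilabel_confusion_matrix(y_reference, y_predicted)):
    tn, fp, fn, tp = cm.ravel()
    fn_rate = fn / (fn + tp) if (fn + tp) else 0.0
    print(f"{name}: false negative rate = {fn_rate:.0%}")
```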

6.3 Applications

The objective of the experiment was to develop a tool for the analysis of custody cases resolved in Spain in order to obtain information on how the laws on this matter are applied by the judges. The processing of the court rulings, from the original PDF format to the final results, was automated, which would allow comparative studies between territories and between different periods of time at a reduced cost. These studies could provide relevant information and, in particular, help to detect differences between territories and trends over time. However, for such studies to be carried out, it would be necessary to have direct access to the repository of court rulings or the possibility of bulk downloading, which is currently not available through CENDOJ.

A reason for using neural networks in this task is that, once they have been trained, they perform a more exhaustive analysis and apply the criteria more uniformly than human annotators, an ability that becomes more important when processing a large number of court rulings. Another reason is that they have proven capable of identifying concepts even when they are mentioned in the text with very different expressions. As for precision, the result obtained for requests (F = 0.87) could be considered sufficient, while for decisions (F = 0.75), there are measures that, as mentioned above, could foreseeably improve performance. With respect to the arguments, it has been seen that if they are defined with sufficient precision, adequate performance can be obtained. It can also be assumed that applying the tool to first instance court rulings, which have a much simpler structure, would yield higher performance.

The labels defined in the present study can serve as variables in predictive justice work. One of the authors, together with another researcher (Muñoz Soro and Serrano-Cinca 2021), developed a predictive model using data obtained from manually labeled sentences, which uses nine independent variables and has an accuracy rate of 86.4%. This or similar predictive models could be updated over time by feeding them with data extracted from more recent judgments using the tool outlined in this paper. However, since the errors of the knowledge extraction and of the predictive model would accumulate, the reliability of the predictions would be rather low. One option to improve the results would be to use in the predictive model only those variables that obtained the highest hit rates in identification and which, in turn, have the greatest explanatory capacity, as is the case, for example, for the relationship between the parents (PAR_RELAT).
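As a rough illustration of this error accumulation (with an assumed extraction accuracy, since none is implied here), the combined reliability can be approximated as sketched below, under the pessimistic assumption that any extraction error spoils the prediction.

```python
# Assumed extraction accuracy (illustrative); not a figure reported in the study.
p_extract = 0.80
# Accuracy reported for the predictive model built on manually labeled data.
p_model = 0.864

# Pessimistic approximation: every extraction error is assumed to cause a wrong prediction.
combined = p_extract * p_model
print(f"Approximate combined reliability: {combined:.1%}")  # about 69%
```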

7 Conclusion

The aim of the experiment was to develop a neural network model capable of identifying the plaintiff's request, the court's decision, and the main arguments used by the court in custody proceedings. A two-stage process was chosen, the first stage of which filters, within the text of the court ruling, the relevant sentences that characterize the proceeding concluded by the ruling. The agreement between the model and the annotators suggests a reasonably good result for this task. The main interest of this result is that the selection was made on the basis of the context of the sentences, which demonstrates the ability of a BiLSTM in combination with transformers to process that context.
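The following is a minimal sketch, not the authors' implementation, of the kind of architecture referred to here: pretrained transformer sentence embeddings are passed through a BiLSTM so that each sentence is classified as relevant or not in the context of the whole ruling; all dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class ContextualRelevanceClassifier(nn.Module):
    """Sentence-relevance classifier over transformer sentence embeddings (illustrative)."""

    def __init__(self, emb_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)  # relevant / not relevant

    def forward(self, sentence_embeddings: torch.Tensor) -> torch.Tensor:
        # sentence_embeddings: (batch, n_sentences, emb_dim), one vector per
        # sentence, e.g. obtained from a pretrained transformer encoder.
        context, _ = self.bilstm(sentence_embeddings)
        return self.classifier(context)  # (batch, n_sentences, 2) logits

# One ruling with 40 sentences represented by 768-dimensional embeddings.
logits = ContextualRelevanceClassifier()(torch.randn(1, 40, 768))
print(logits.shape)  # torch.Size([1, 40, 2])
```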

The least accurate results were obtained in the identification of the court's decision. This could probably be improved by processing other parts of the court ruling (not only the legal grounds) and by developing specific strategies, especially ones that take into account the complementarity between request and decision. For its part, the identification of the arguments yielded better results, and the F1 and Kappa values obtained were in the range of other similar pieces of research (Rosili et al. 2021). The model's agreement even exceeded the IAA, which is probably because it performs a more exhaustive analysis of the text and applies the criteria more uniformly. These results can be considered indicative of the high capacity of transformers to identify abstract concepts in legal texts, even without having been specifically trained in Spanish legal language. The results point to two main factors influencing performance: the precision with which each argument category was defined and the number of labels corresponding to it in the training set. These are sometimes opposing factors, since the greater generality of a concept corresponds to a higher number of annotations. However, the lack of precision of a concept was not compensated for by a high number of labels, so we consider that the key factor for good performance is the proper selection and definition of the arguments to be identified.