IE Seminar

Dataset 3: Annotation procedure

Institute of Computer Science, Brandenburgische Technische Universität Cottbus-Senftenberg
Juan-Francisco Reyes
pacoreyes@protonmail.com

### THIS PAGE WILL BE UPDATED CONTINUOUSLY BASED ON INTERACTIONS ON THE FORUM, CHECK BACK OFTEN

Overview

This document delineates the process of annotating a dataset for training a stance classification model with Hugging Face technology. The objective is to build the dataset using bootstrapping, a semi-supervised learning approach in which a small set of labeled data (the "seed dataset") is used to train a preliminary model, which then annotates more data. This newly annotated data, after validation, is added to the training set, and the process iterates. The primary goal of bootstrapping in this project is to expand a dataset efficiently, starting from a small, manually labeled dataset. Bootstrapping is particularly useful when large-scale, hand-labeled data is scarce or expensive.

This time we use a SetFit (Sentence Transformer Fine-tuning) model for this classification task; SetFit is known for its efficiency in few-shot learning and its reliance on minimal training data.

The representativeness and quality of the seed dataset significantly impact the final model's performance. Over multiple iterations of bootstrapping, the dataset grows as new data is annotated based on the predictions of the model trained on the seed dataset. This iterative process gradually improves the model's accuracy and its ability to handle a wider variety of data. Finally, the process stops when performance no longer improves.

Each participant will generate subsets of "Dataset 3". The project is fully individual.

The project will be completed in three weeks, including a weekly presentation of progress, based on this schedule:

  1. Week 1: submit Dataset 3 v1 with 100 new examples (datapoints), and a SetFit model fine-tuned with Optuna, achieving an accuracy above 0.944.

  2. Week 2: submit Dataset 3 v2 with 150 new examples (datapoints), and a SetFit model fine-tuned with Optuna, achieving an accuracy above 0.944.

  3. Week 3: submit Dataset 3 v3 with 200 new examples (datapoints), and a SetFit model fine-tuned with Optuna, achieving an accuracy above 0.944.

Participants will use the "seed" model (provided by the lecturer) to grow the dataset with representative new examples, yielding by the third week a total of 450 examples in addition to the "seed" dataset.

Background

Stance

A political stance, also known as a political position or viewpoint, is an individual's or group's set of beliefs, opinions, and values regarding politics and governance. It reflects how one thinks about various political issues, policies, and ideologies. Political stances can cover a wide range of topics including economic policy, social issues, foreign policy, environmental concerns, and the role of government. Stance is integral to understanding political discourse, encapsulating the attitudes and beliefs underlying political actions and communications.

Stance detection typically operates by categorizing the stance as either support/favor or oppose/against with respect to the target.

  1. Support stance: The speaker shows agreement, approval, or positive feelings towards the target. For example, "I fully support renewable energy initiatives." Here, the support stance is towards renewable energy initiatives. Words often used: agree, love, support, advocate, praise.

  2. Oppose stance: The speaker expresses disagreement, disapproval, or negative feelings. Example: "I am against deforestation." The oppose stance is towards deforestation. Common words: disagree, hate, oppose, criticize, condemn.

The target in stance refers to the specific subject, entity, or idea that a speaker or writer is expressing their attitude, feelings, or viewpoint towards. It is essentially what the stance is about. Here are some key points to understand about the target in stance:

  1. Subject: The target can be a person, a policy, an event, a concept, or any other subject matter that is being discussed or referenced.

  2. Focus: The target is the focus of the speaker's or writer's evaluative or affective position. When someone takes a stance, they are doing so in relation to this target.

  3. Identification: To identify the target in a piece of communication, look for what or whom the opinions, feelings, or attitudes are being directed at. It can often be found as the subject of the sentence or can be understood from the context.

  4. Explicitness: Sometimes the target is explicitly mentioned in the discourse, while other times it may be implied and needs to be inferred from the context.

The complexity of identifying stance arises from the need to understand the explicit expressions of opinion and capture the nuances and implicit cues that convey stance in natural language. Political discourse is characterized by its strategic ambiguity, where statements are carefully crafted to convey certain messages while maintaining a degree of vagueness. This ambiguity allows politicians to appeal to diverse audiences without committing to specific stances. For example, a politician might use generalized statements or rhetorical questions that imply a stance without explicitly stating it.

Additionally, the use of metaphors, euphemisms, and other figurative language can obscure the true stance, making it challenging for listeners or readers to discern the actual position being taken. This complexity is further amplified by the context in which the language is used, as the same phrase might imply different stances in different situations or when spoken by different individuals. Therefore, understanding political stance requires not only a close examination of the language itself but also a deep understanding of the broader social, cultural, and political context.

In this project we will focus on sentences of explicit stance, and sometimes on sentences of moderate stance complexity, excluding highly implicit expressions loaded with euphemisms, metaphors, and other figurative or rhetorical devices that make the statement less obvious.

For instance, in the sentence "I would never criticize anyone taking vigorous action against terrorism," the support stance is towards those fighting terrorism (the real target). We will avoid sentences with such a high level of ambiguity.

On the contrary, we should prefer a sentence like "We Republicans are agreed that full employment shall be a first objective of national policy." Target: employment.

Or a sentence like "Another important problem in our foreign intercourse relates to China," where the speaker says that China is a problem, with only a moderate level of ambiguity.

To make this exercise clearer, our stance classification model will be built on the following types of explicit stance:

  1. Opinion: The target of the opinion must be clearly identified, and the opinion about something/someone must be stated directly and unambiguously. The identification of the opinion may rely on positive or negative adjectives or verbs that leave no doubt about the stance. Avoid sentences where opinions are implied or require inference from the context.

    Example of oppose stance: And I think the threats of terrorism and the hatred that presently exists, the threat of war, the threat of economic boycotts and punishment against Egypt, are certainly not conducive to realizing the hopes of the Palestinian people. Target: terrorism.

    Example of support stance: We've made real, continuing investments in science and technology, which I think are pivotal to the long-term health of the economy and the continuation of this productivity increase. Target: economy.

  2. Direct: Focus on sentences that contain clear statements of support or opposition, such as "I oppose," "I reject," "I support," or "I stand for." Ensure that the object of the stance (the thing being supported or opposed) is explicitly mentioned or unambiguously referenced in the sentence. Exclude sentences where opposition or support is conditional or qualified by concessions.

    Example of oppose stance: And so, yes, I'm absolutely committed to helping fight poverty. Target: poverty.

    Example of support stance: It's - we're in a war against these terrorists who will bring great harm to America, and I've asked these young ones to sacrifice for that. Target: terrorists.

  3. Slightly indirect: The target is referred to indirectly through another subject; however, the complexity of the (implicit) reference is moderate. In this category, the stance is conveyed subtly, without directly referring to the primary target. The goal of using this kind of stance is to appeal to a wider audience: it is less confrontational and maintains some level of diplomatic or rhetorical flexibility.

    In the following example, the target (war) is reached indirectly through another subject (goal).

    Example of oppose stance: Our goal must be to deter war of any kind. Target: war.

    Example of support stance: The more we bring China into the world, the more the world will bring change and freedom to China. Target: China.

A note on Stance vs Sentiment: While closely related and often overlapping, the two have distinct differences in the realm of linguistics and communication analysis. Stance is always directed towards a target, while sentiment may not always have a clear target and can be a general expression of feeling.

Stance: Stance refers to a speaker's attitude, position, or evaluation towards a specific target. Stance can be complex and multifaceted, often encompassing not just sentiment but also beliefs, commitments, and alignments. It can be expressed explicitly or implicitly.

Sentiment: Sentiment generally refers to the emotional tone or affective state expressed in a piece of communication. Compared to stance, sentiment is usually more straightforward, focusing mainly on the emotional valence of the language.

More rules: The stance in the selected sentence must come from the speaker ("I") or the plural form of the speaker ("we"), in any variation; exclude stances attributed to third persons (he, she, they, it, etc.). Also, the target of the stance should not be a third-person pronoun, but a clear someone (a named entity) or something (a concept). For instance, "I oppose it." is a statement with no clear target, contrary to "I oppose green energy."

SetFit

SetFit (Tunstall et al., 2022), introduced by a collaboration between Intel Labs, UKP Lab, and Hugging Face, represents a significant advancement in the field of natural language processing (NLP), particularly in the context of text classification with limited labeled data. This technical analysis aims to elucidate the nature of SetFit, its technical underpinnings, the problems it addresses in comparison to existing solutions, and its distinct advantages.

SetFit, fundamentally, is a methodology for the efficient fine-tuning of sentence transformers for specific classification tasks. The core of SetFit lies in its ability to adapt pre-trained sentence transformers, such as those built on BERT, RoBERTa, or DistilBERT, to unique text classification requirements. This process is particularly advantageous when data availability is minimal, as SetFit can effectively utilize small datasets for training, unlike traditional models that depend on large volumes of labeled data. Its computational efficiency is another standout feature: SetFit can be roughly 1,600 times smaller than models like OpenAI's GPT-3, offering a more accessible and scalable solution without compromising performance and without the costs involved in MLaaS. This attribute makes SetFit particularly suitable for applications with limited computational resources or those requiring rapid model deployment.

SetFit addresses several challenges that are prevalent in the current landscape of NLP and machine learning:

  1. Overcoming Data Scarcity: Traditional models, including the likes of GPT-3 or BERT, require extensive labeled data for training. SetFit, conversely, excels in environments where labeled data is scarce, making it a more practical choice for many real-world applications where data availability is a limitation.

  2. Resource Efficiency: Large Language Models (LLMs) demand significant computational power, which translates into high costs and limited access. SetFit presents an alternative that is not only less resource-intensive but also more cost-effective, broadening the scope of its applicability.

  3. Customized Text Classification: The adaptability of SetFit enables it to cater to specific classification needs more effectively than general-purpose models. This capability is particularly beneficial for niche or specialized applications that require tailored classification approaches.

Despite its notable advantages, SetFit, like any technological tool, has its limitations, especially when compared to current Large Language Models (LLMs). One significant limitation is its scope: SetFit is primarily designed for text classification tasks, which means its applicability is narrower than more general-purpose LLMs that can handle a wide range of NLP tasks, including text generation, translation, and more complex language understanding challenges.

Additionally, while SetFit's smaller size and efficiency are advantageous in terms of resource utilization, they may also imply a trade-off in terms of the depth of linguistic understanding and contextual nuance that larger models, with their extensive training on diverse and voluminous datasets, can offer. Likewise, the performance of SetFit is heavily reliant on the quality of the pre-trained sentence transformers it fine-tunes; if these base models are not adequately trained or are biased, SetFit's output will inherit these limitations.

Thus, while SetFit presents a valuable solution for specific scenarios, particularly in constrained-resource environments, it is not a one-size-fits-all solution and should be chosen considering the specific requirements and constraints of the intended application.

SetFit operates at the sentence level by leveraging advanced techniques in NLP to classify text with minimal training data. The core of SetFit's functionality lies in the concept of sentence transformers and sentence embeddings.

  • Sentence Transformers: These are models specifically designed to understand and represent the meaning of entire sentences, as opposed to traditional word-level models. Sentence transformers take into account the context and semantics of the whole sentence, providing a more nuanced understanding than word-level analysis alone. They do this by processing the input sentence and converting it into a numerical form (embedding) that captures its essence.
  • Sentence Embedding: This is the process where a sentence is transformed into a vector (a list of numbers). Each sentence, regardless of its length, is converted into a vector of fixed size. This vector is a numerical representation of the sentence's semantic content – its meaning, context, and nuances. The embedding process ensures that sentences with similar meanings have similar vector representations, making it easier to compare and classify them.
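
To make the embedding idea concrete, the following minimal sketch (assuming the sentence-transformers package is installed; the checkpoint is the one chosen for this project below) encodes three sentences and compares their vectors by cosine similarity:

    from sentence_transformers import SentenceTransformer, util

    # Load a pre-trained sentence transformer.
    model = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")

    sentences = [
        "I fully support renewable energy initiatives.",
        "I strongly back investments in clean energy.",
        "I am against deforestation.",
    ]

    # Each sentence becomes a fixed-size vector (768 dimensions for this model).
    embeddings = model.encode(sentences)

    # Similar meanings yield similar vectors, hence higher cosine similarity.
    print(util.cos_sim(embeddings[0], embeddings[1]))  # high: both back clean energy
    print(util.cos_sim(embeddings[0], embeddings[2]))  # lower: different topic/stance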

Different flavors of SetFit

  1. sentence-transformers/all-mpnet-base-v2: This model is based on the MPNet architecture. MPNet stands for "Masked and Permuted Pre-training for Language Understanding" and is designed to understand the context and semantics of a sentence. It's part of the Sentence Transformers library, which is optimized for generating sentence embeddings, meaning it converts sentences into numerical vectors that can be used in various NLP tasks.
  2. sentence-transformers/paraphrase-mpnet-base-v2: This is a variation of the MPNet model that has been specifically fine-tuned for paraphrase identification. The training involves identifying whether two sentences are paraphrases of each other, which enhances the model's ability to understand the nuances in sentence meanings and structures. This model is also part of the Sentence Transformers library and excels in tasks that require understanding the subtle differences and similarities in sentence meanings.
  3. BAAI/bge-small-en-v1.5: BAAI stands for Beijing Academy of Artificial Intelligence, and this model is part of their BGE (BAAI General Embedding) series. It's designed to capture semantic relationships and is trained on a diverse set of text sources, giving it a broad understanding of language. This model might be particularly effective in tasks involving a diverse range of topics or in cases where understanding complex relationships between entities is crucial.

Our choice: sentence-transformers/paraphrase-mpnet-base-v2, as it is more suitable for tasks that involve distinguishing subtle differences in opinions or paraphrased stances.
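
As a minimal sketch (assuming the setfit package is installed), this is how the chosen checkpoint is loaded as the body of a SetFit model; the actual fine-tuning is covered in the bootstrapping procedure below:

    from setfit import SetFitModel

    # Build a SetFit model on top of the chosen sentence-transformer body; a
    # classification head (logistic regression by default) is added for the
    # binary stance labels and trained later during fine-tuning.
    model = SetFitModel.from_pretrained(
        "sentence-transformers/paraphrase-mpnet-base-v2"
    )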

Bootstrapping in NLP

Bootstrapping is a semi-supervised learning methodology that iteratively improves the performance of a model or the quality of a dataset. This method is particularly useful when labeled data is scarce, but unlabeled data is abundant. The bootstrapping process generally involves the following steps:

  1. Fine-tune the Seed Model using the "seed dataset": First, run the hyperparameter optimization script (using Optuna) with the dataset split into three JSONL files: training, validation, and testing. Then, use the SetFit training script to fine-tune your Seed Model; you should obtain an accuracy score of 1.0. Save the model according to SetFit's official documentation (a sketch follows this list). The characteristics and quality of the seed dataset are crucial for the success of the bootstrapping process, since they define the "gold standard" of the dataset and the performance of the model.
  2. Apply the Seed Model to Unlabeled Data: The trained model is then applied to a larger set of unlabeled data. The goal here is to make predictions or annotations on this data, effectively generating new, albeit potentially noisy, labeled instances. Download your assigned segment of examples.

  3. Error Analysis: Examine the model's performance on the broader dataset in depth in order to make the necessary refinements.
  4. Selecting High-Confidence Predictions: From these predictions, a subset is chosen based on certain confidence criteria. For example, instances where the model predicts labels with high probability or where multiple models agree on the label might be selected.

  5. Iterative Refinement: The high-confidence instances are added to the original labeled dataset, thereby enlarging it. The model is then retrained on this augmented dataset. This cycle of predicting, selecting, and retraining is repeated multiple times. With each iteration, the model is exposed to more data and ideally learns to make better predictions.

  6. Convergence Criterion: The process continues until a stopping criterion is met. This could be a set number of iterations, a performance threshold on a validation set, or when the model's improvements fall below a certain margin.
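
The sketch below illustrates the first step: hyperparameter search with Optuna followed by fine-tuning. It assumes the pre-1.0 setfit API (SetFitTrainer, whose hyperparameter_search uses Optuna as its backend) and illustrative JSONL file names with "text" and "label" fields; adapt the search space, trial count, and paths to your setup.

    from datasets import load_dataset
    from setfit import SetFitModel, SetFitTrainer

    # Seed dataset split into three JSONL files (illustrative names), each line
    # holding a "text" and a "label" field.
    data = load_dataset(
        "json",
        data_files={
            "train": "seed_train.jsonl",
            "validation": "seed_validation.jsonl",
            "test": "seed_test.jsonl",
        },
    )

    # A fresh model per trial, so trials do not leak state into each other.
    def model_init(params):
        return SetFitModel.from_pretrained(
            "sentence-transformers/paraphrase-mpnet-base-v2"
        )

    # Search space explored by Optuna; keys are SetFit training hyperparameters.
    def hp_space(trial):
        return {
            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
            "num_epochs": trial.suggest_int("num_epochs", 1, 3),
            "batch_size": trial.suggest_categorical("batch_size", [8, 16, 32]),
        }

    trainer = SetFitTrainer(
        model_init=model_init,
        train_dataset=data["train"],
        eval_dataset=data["validation"],
    )
    best_run = trainer.hyperparameter_search(direction="maximize",
                                             hp_space=hp_space, n_trials=10)

    # Retrain with the best hyperparameters and evaluate on the validation split.
    trainer.apply_hyperparameters(best_run.hyperparameters, final_model=True)
    trainer.train()
    print(trainer.evaluate())  # expect accuracy close to 1.0 on the seed data

    trainer.model.save_pretrained("seed-model")  # illustrative local path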

Note: the main challenge to consider with bootstrapping is that the model might reinforce its own errors (confirmation bias), especially if the high-confidence predictions are not accurate. Also, the diversity of the dataset might not increase significantly, as the model tends to predict labels similar to those in the initial training set.

Error Analysis

In NLP, error analysis is critical to understanding, diagnosing, and improving the model's performance. After running the model with unseen data:

  1. Store the model's predictions along with their confidence scores (a sketch covering steps 1-3 follows this list).

  2. Randomly sample a subset of the predictions.

  3. This subset should include a mix of high, medium, and low-confidence predictions to get a comprehensive view of the model's performance across different confidence levels. Ensure each class is equally represented in the subset, with 20 examples (datapoints) per class, giving a total of 40.

  4. Manually annotate this subset with the correct labels.

  5. Compare the model's predictions with your annotations to identify errors.

  6. Classify the errors into categories, which might include misinterpretation of semantic context, incorrect assessment of target, complex structure of the sentence, high grade of ambiguity, or misalignment with the stated stance (e.g., classifying a 'support' statement as 'oppose'). Develop your own categorization based on your findings.

  7. Analyze the patterns in these errors. Investigate, for example, if the model consistently errs in identifying the stance in complex sentence structures or in distinguishing neutral statements from those expressing subtle support or opposition. Elaborate conclusions and post them on the forum.

  8. Based on your findings, make adjustments to the model. This might include retraining with additional or different seed data, choosing another model or applying different preprocessing techniques.

  9. Adjust your criteria for selecting high-confidence predictions. For example, if the model frequently misclassifies a certain class, you might lower the confidence threshold for those instances or exclude them from the high-confidence selection in the next iteration.

  10. Apply the refined model to the remaining unlabeled data (making sure to exclude the examples already added to the training set).
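
A minimal sketch of steps 1-3, assuming the seed model saved earlier, an illustrative list of unlabeled sentences, and that predict_proba returns one probability per class for each sentence:

    import json

    from setfit import SetFitModel

    # Load the fine-tuned seed model saved earlier (illustrative local path).
    model = SetFitModel.from_pretrained("seed-model")

    # Your assigned segment of unlabeled examples (illustrative; load the real file).
    unlabeled = [
        "I fully support renewable energy initiatives.",
        "I am against deforestation.",
    ]

    # Step 1: store predictions with a confidence score (the winning probability).
    probs = model.predict_proba(unlabeled)
    records = []
    for text, p in zip(unlabeled, probs):
        p = [float(x) for x in p]
        records.append({
            "text": text,
            "label": p.index(max(p)),   # predicted class
            "confidence": max(p),       # confidence of the prediction
        })

    # Steps 2-3: a balanced review subset, 20 examples per class, spread evenly
    # from low to high confidence (assumes enough predictions per class).
    subset = []
    for cls in (0, 1):
        pool = sorted((r for r in records if r["label"] == cls),
                      key=lambda r: r["confidence"])
        stride = max(1, len(pool) // 20)
        subset.extend(pool[::stride][:20])

    # Write the subset to a JSONL file for manual annotation (step 4).
    with open("review_subset.jsonl", "w") as f:
        for r in subset:
            f.write(json.dumps(r) + "\n")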

References

  1. Tunstall, L., Reimers, N., Jo, U. E. S., Bates, L., Korat, D., Wasserblat, M., & Pereg, O. (2022). Efficient Few-Shot Learning Without Prompts. arXiv. Retrieved from https://arxiv.org/abs/2209.11055

  2. Fast Fine-tuning of Text Classification with SetFit (Medium): https://medium.com/@dhtien/fast-fine-tuning-of-text-classification-with-setfit-f98e0a39e631