IE Seminar

Dataset 2: Annotation procedure

Institute of Computer Science, Brandenburgische Technische Universität Cottbus-Senftenberg
Juan-Francisco Reyes
pacoreyes@protonmail.com

### THIS PAGE WILL BE UPDATED CONTINUALLY BASED ON INTERACTIONS ON THE FORUM; RETURN OFTEN

Overview

This document delineates the process of annotating a dataset for training a passage boundary detection model using Hugging Face technology. The objective is to construct a dataset that enables the model to distinctly identify the boundaries of a passage about a topic.

Teams will generate subsets of "Dataset 2". Utilizing an annotation tool, teams will edit passages that will be turned into datapoints for a subsequent BERT fine-tuning task.

The project is collaborative, with a collective outcome determining the grade; however, the annotation process is individualized.

Background

Passage Boundary Detection

Passage boundary detection is a task in NLP whose goal is to identify the boundaries between distinct passages or sections within a text. It involves determining where one passage ends and another begins, which is essential for tasks like document summarization, text segmentation, and improving readability by structuring text into coherent units.

This process is crucial for understanding and organizing large text volumes by breaking them into more manageable and contextually distinct sections.

Next Sentence Prediction (NSP) in BERT Models

BERT models were trained with two tasks that enable the development of a sophisticated understanding of language context and sentence relationships: Masked Language Model (MLM) and Next Sentence Prediction (NSP).

Traditional language models before BERT primarily processed text in one direction (either left-to-right or right-to-left); BERT, however, reads text bi-directionally. This means it gains a deeper understanding of the context of a word based on all its surrounding text, not just what comes before it. This enables BERT to predict very accurately whether one sentence logically and coherently follows another in a text.

This feature can be leveraged to determine the boundaries of a passage. In the NSP task, BERT is given two sentences (A and B) and has to predict if B logically follows A. This capability is essential for understanding sentence relationships, which is a crucial aspect of language comprehension.

During training, 50% of the time, sentence B is the actual next sentence that follows A, and 50% of the time, it's a random sentence from the corpus. BERT then learns to predict whether these two sentences are related or not.

NSP helps BERT understand the narrative flow and how ideas are connected in text, which is useful in tasks that require understanding the relationship between sentences, like passage boundary detection, question answering, and summarization.
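
To make NSP concrete, the following minimal sketch shows how the feature can be explored with the Hugging Face transformers library. It is an illustration assuming the bert-base-uncased checkpoint; the snippet posted on the Moodle forum may differ:

    import torch
    from transformers import BertTokenizer, BertForNextSentencePrediction

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

    sentence_a = "Solar power is a key component of renewable energy."
    sentence_b = "However, it faces challenges like storage and variability."

    # Encode the pair as "[CLS] A [SEP] B [SEP]" and score both NSP labels.
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # index 0 = "B follows A", 1 = "B is random"

    probs = torch.softmax(logits, dim=1)
    print(f"P(continuation) = {probs[0, 0].item():.3f}")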

Levels of Language

Four levels of language allow us to understand the linguistics of sentence continuity (a brief spaCy sketch after this list shows how three of them surface in NLP tooling):

  1. Morphology deals with the structure and formation of words, studying how morphemes (the smallest grammatical units in a language) are combined to form words.

    Example: The word "exobiology" is formed by combining the prefix "exo-" (meaning 'outside' or 'external') with "biology" (the study of living organisms). This morphological combination reflects the study of life beyond Earth, focusing on the possibility of extraterrestrial life.

  2. Syntax is concerned with how words are arranged to form sentences, including the rules and principles that govern sentence structure.

    Example: Consider the sentence, "Quantum computers, unlike traditional computers, leverage qubits for computations." This sentence demonstrates syntactic structure, where the introductory phrase "unlike traditional computers" is inserted to modify and contrast "Quantum computers," followed by the main clause "leverage qubits for computations."

  3. Semantics involves the meaning of words, phrases, and sentences, including the study of how meaning is conveyed through language and of the relationships between the words that represent entities.

    Example: In the sentence, "The blockchain securely records all cryptocurrency transactions," the semantics involves understanding that "blockchain" refers to a digital ledger technology, while "cryptocurrency transactions" implies the exchange or transfer of digital currency assets. The relationship between these terms conveys a specific meaning about the security and nature of digital transactions.

  4. Pragmatics studies how language is used in context and how context affects language interpretation. Pragmatics aims to understand the speaker's intentionality and the effect of context on meaning.

    Example: Imagine a conversation where one person says, "It’s getting very warm in here", and the other opens a window. While the first sentence could be a simple statement about temperature, it's understood as a request to cool down the room in this context.
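
Three of these four levels surface directly in NLP tooling. The following minimal sketch (assuming spaCy's en_core_web_md model, which ships word vectors) inspects morphology, syntax, and a vector-based proxy for semantics; pragmatics has no direct spaCy counterpart and still requires human judgment:

    import spacy

    nlp = spacy.load("en_core_web_md")
    doc = nlp("Quantum computers, unlike traditional computers, "
              "leverage qubits for computations.")

    # Morphology: lemma and inflectional features of each token.
    for token in doc:
        print(token.text, token.lemma_, token.morph)

    # Syntax: the dependency structure connecting the words.
    for token in doc:
        print(token.text, token.dep_, token.head.text)

    # Semantics: vector similarity as a rough proxy for relatedness.
    other = nlp("The blockchain securely records all cryptocurrency transactions.")
    print(doc.similarity(other))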

Cohesion

Cohesion refers to the grammatical and lexical links within a text or speech that hold it together and give it meaning. Cohesion is achieved through various means, such as pronoun reference, conjunctions, lexical ties, and ellipsis.

Example: "Humans have always been fascinated by space. They have dreamt of exploring the stars for centuries. This dream has led to significant advancements in technology."

In this example, cohesion is achieved through the pronoun "They," which refers back to "Humans." The lexical tie is the space exploration theme maintained throughout the sentences. The cohesive flow is further supported by the progression from a general fascination to specific advancements.

Coherence

Coherence refers to the logical connections and the overall sense of understandability in a text. A text is coherent if its content is organized in a way that makes sense to the reader or listener, with ideas and arguments flowing logically.

Example: "Sustainable fashion aims to reduce the environmental impact of the clothing industry. To achieve this, designers use eco-friendly materials. Furthermore, they adopt ethical labor practices. As a result, the environmental footprint of clothing production is significantly reduced."

The coherence in this example is established by presenting a clear, logical sequence of ideas: the goal of sustainable fashion, the methods employed (using eco-friendly materials and ethical labor practices), and the outcome (reduced environmental footprint).

Transition Markers

Words or phrases that help to link sentences and paragraphs together, such as "however", "furthermore", or "in conclusion", are important for maintaining sentence continuity.

Example: "Solar power is a key component of renewable energy. However, it faces challenges like storage and variability. Despite these challenges, advancements in battery technology are making solar more reliable. Furthermore, government incentives are encouraging its adoption. In conclusion, solar power, while not without its hurdles, is a promising part of the future energy mix."

In this passage, transitional devices such as "However," "Despite these challenges," "Furthermore," and "In conclusion" are used to link sentences and paragraphs. These devices help to contrast points, add additional information, and provide a summarizing statement, contributing to the text's overall flow and continuity.

Examples of transition markers include (a small lexicon-based tagger sketch follows this list):

  1. Additive Transitions: Indicate addition or introduction of information. Examples include "and," "also," "furthermore," "in addition," "moreover."

  2. Adversative Transitions: Signal contrast or contradiction. Examples are "but," "however," "on the other hand," "nevertheless," "yet."

  3. Causal Transitions: Denote cause-effect relationships. Examples include "because," "therefore," "as a result," "thus," "consequently."

  4. Temporal Transitions: Indicate time or sequence. Examples are "then," "later," "after," "subsequently," "meanwhile."

  5. Exemplification Transitions: Used for giving examples. Examples include "for instance," "for example," "namely," "specifically."

  6. Summarization Transitions: Used to summarize or conclude. Examples are "in conclusion," "to sum up," "in summary," "overall."
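
The following minimal sketch shows how such a lexicon can be used to tag markers in a sentence. The function name find_transition_markers is hypothetical, and the marker lists are illustrative excerpts from the categories above, not an official lexicon:

    # Hypothetical lexicon-based transition-marker tagger.
    TRANSITION_MARKERS = {
        "additive": ["also", "furthermore", "in addition", "moreover"],
        "adversative": ["but", "however", "on the other hand", "nevertheless", "yet"],
        "causal": ["because", "therefore", "as a result", "thus", "consequently"],
        "temporal": ["then", "later", "after", "subsequently", "meanwhile"],
        "exemplification": ["for instance", "for example", "namely", "specifically"],
        "summarization": ["in conclusion", "to sum up", "in summary", "overall"],
    }

    def find_transition_markers(sentence: str) -> dict:
        """Return the marker categories found in a sentence."""
        lowered = sentence.lower()
        return {
            category: [m for m in markers if m in lowered]
            for category, markers in TRANSITION_MARKERS.items()
            if any(m in lowered for m in markers)
        }

    print(find_transition_markers("However, it faces challenges like storage."))
    # {'adversative': ['however']}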

Coreference

Coreference occurs when two or more expressions in a text refer to the same person or thing (pronoun reference). Coreference plays an essential role in cohesion.

  1. Anaphora: a word or phrase that refers back to another word or phrase used earlier in the text. The earlier word or phrase is called the antecedent.

    Example: "In quantum computing, the qubit is fundamental. This concept revolutionizes how we think about processing information." Using "this concept" helps to link the sentence back to the specific idea of a qubit in quantum computing.

  2. Cataphora: a word or phrase that refers forward to another word or phrase that appears later in the text.

    Example: "Before his discovery, Einstein was relatively unknown. The theory of relativity changed that." In this case, "His discovery" is a cataphoric reference that refers forward to "The theory of relativity," which appears later in the sentence. It introduces the subject of Einstein's significant achievement before specifying what it is.

  3. Exophora: a word or phrase that refers to something outside the text. Unlike anaphora and cataphora, which refer to elements within the text (intra-textual), exophora is extra-textual: it relies on the listener's or reader's knowledge of the context. Pronouns like "this" or "that" in a conversation might refer to objects or situations in the immediate physical environment or in a shared situational context. When someone says, "We have to meet early tomorrow", "we" is understood through the implicit shared knowledge of speaker and listener. Likewise, "In God we trust" shows a use of "we" that does not need to be resolved.

    Common exophoric references are "we", "that", "this", "now", "then", "here", "where", "you".

Coreference happens on the pragmatics level of language, going beyond the literal meaning of words (semantics), morphology and syntax, requiring an understanding of context, speaker intention, and the ability to make inferences.
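
In practice, coreference chains can be extracted automatically. The following minimal sketch uses spaCy's experimental coreference model, which this project also relies on. It assumes the spacy-experimental package and the en_coreference_web_trf pipeline are installed; the span-group keys may vary across versions:

    import spacy

    nlp = spacy.load("en_coreference_web_trf")
    doc = nlp("Humans have always been fascinated by space. "
              "They have dreamt of exploring the stars for centuries.")

    # Coreference clusters are exposed as span groups named "coref_clusters_*".
    for key, spans in doc.spans.items():
        if key.startswith("coref_clusters"):
            print(key, [(span.text, span.start, span.end) for span in spans])
    # Expected output along the lines of:
    # coref_clusters_1 [('Humans', 0, 1), ('They', 8, 9)]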

Bootstrapping Approach using Rules

In NLP, bootstrapping is a semi-supervised learning method that iteratively improves a model's performance by using its predictions to generate new training data. This approach is often used when there is a limited amount of labeled data but an abundance of unlabeled data. In this project, we will follow the bootstrapping approach by automatically creating the first version of the dataset by using rules based on linguistic features.

Rule-Based Classification Model 1

This rule-based classification model will generate the initial version of the dataset to define the baseline of our BERT model for NSP.

The six rules are built on the following linguistic features:

  1. Coreference Resolution: Using spaCy's experimental model for coreference resolution (en_coreference_web_trf) we identify if two sentences have coreference links, meaning if there are words (like pronouns) in one sentence that refer to words in another sentence. It is crucial for understanding the continuity and flow of ideas between sentences.

  2. Semantic Chain Detection: Using spaCy's linguistic features, we evaluate if two sentences are semantically related or discuss the same topic. It uses spaCy's semantic similarity metrics and linguistic features like lemmatization and dependency parsing to identify common referents and subjects, ensuring textual coherence across sentences.

  3. Parallelism: Using spaCy's linguistic features, we evaluate if two sentences follow a similar syntactic structure. It compares the dependency structures of sentences to see if they align, which helps determine if the sentences are part of the same cohesive passage.

  4. Transition Marker: Using spaCy's linguistic features, we evaluate transition markers or phrases in a sentence, like "however", "furthermore", or "in conclusion", which are key indicators of the sentence's relation to the surrounding text, showing continuity or shifts in the discourse.

  5. Logical continuity (from BERT's NSP): Using the Next Sentence Prediction (NSP) feature of pre-trained BERT models, we determine if a sentence is a logical continuation of another. It provides insight into whether two sentences should be part of the same passage based on the flow and context of the discourse. The code to explore NSP was posted on the Moodle forum.

  6. Tense and Aspect Change: Using spaCy's linguistic features, we evaluate if the verbs in two sentences change in tense and aspect. Consistency in tense and aspect between sentences can indicate cohesive narrative flow, thereby helping define passage boundaries. While "tense" refers to when an action happens (past, present, or future), "aspect" refers to the state of the action (completed, ongoing, or recurring). A minimal sketch of this rule follows this list.
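
The following minimal sketch illustrates the idea behind rule 6. It assumes spaCy's en_core_web_sm model and a hypothetical helper name; the project's actual implementation may differ:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def tense_aspect(sentence: str) -> set:
        """Collect the (tense, aspect) combinations of the verbs in a sentence."""
        doc = nlp(sentence)
        pairs = set()
        for token in doc:
            if token.pos_ in ("VERB", "AUX"):
                tense = tuple(token.morph.get("Tense"))
                aspect = tuple(token.morph.get("Aspect"))
                if tense or aspect:
                    pairs.add((tense, aspect))
        return pairs

    s1 = "Norway will pay $150 million to battle deforestation there."
    s2 = "The move comes as millions of people are taking to the streets."
    # Overlapping tense/aspect combinations suggest narrative consistency;
    # disjoint sets suggest a tense and aspect change.
    print(tense_aspect(s1) & tense_aspect(s2))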

Dataset Building Workflow (Round 1)

  1. We begin with a selection of political issues of interest (topics), e.g., climate change, employment, women, etc. The system searches for these political issues in thousands of political discourse texts.

  2. When a political issue is string-matched in a sentence (the root sentence), the six rules are used to navigate the preceding and subsequent sentences, evaluating where the topic ends (topic shift).

  3. When the leading and trailing boundaries are defined, the surrounding sentences outside of the passage are added to mark the boundaries.

  4. Extracted passages are loaded into the Annotation Tool.

Annotation Task (Round 2)

Teams will refine the dataset by annotating 250 valid passages per team member, extracted from audio/transcribed political discourses. Dataset 2 aims to set a gold standard for fine-tuning BERT's NSP feature in the domain of American political discourse.

IMPORTANT: The goal is to collect relevant passages of political discourse, speaking about political issues, with well-defined boundaries, to fine-tune BERT's NSP feature in a specific domain (American political discourse).

Each passage will allow us to generate multiple pairs of sentences to fine-tune the NSP feature of a BERT model. Each pair comprises either

  • two sentences defining the continuation of a topic ("continue" class), or

  • two sentences defining a topic shift ("not_continue" class).

Consequently, a good theoretical understanding of the linguistics of sentence continuity is crucial.

Protocol

The annotation process involves using the annotation tool and a shared spreadsheet on Google Drive. Initiate the process by providing the lecturer with the Gmail accounts of all team members to secure Editor access to the spreadsheet.

Within the spreadsheet, locate the tab named with your last name. In the "participants" tab, find who will review whose dataset. As all teams share the spreadsheet, exercise consideration towards your peers.

Each tab has two columns, "id" and "notes". For example:

    id          notes
    0000000336  Post on the forum to ask for feedback

  • The "id" column is used to add the ID of each passage you annotate.

  • "notes" (optional) is reserved for any noteworthy annotation observations.

IMPORTANT: log only valid datapoints, neither rejected nor ignored ones. Therefore, the spreadsheet will contain only the IDs of valid datapoints.

Step 1: Access the Annotation Tool

Go to the Annotation Tool, and log in with credentials supplied by the lecturer. Select your last name from the dropdown menu and launch the tool to annotate Dataset 2.

Step 2: Recognize Passage Editor

Upon entry, the editor will be shown.

Annotation Tool, editor.

Recognize the following areas and elements:

  • Editor: The text editing area, where the passage range is shown.

  • Sidebar: Houses the document ID, text source link, the "shorten" and "extend" buttons, and the "action" buttons.

Recognize the components of a passage:

  • Outside (O) sentences: are not part of the passage.

  • Root (R) sentence: where the main topic is mentioned.

  • Inside (I) sentence: others that complete the passage.

Elements of a valid passage.

A valid passage must include at least one "inside" sentence and two "outside" sentences, i.e., a minimum of three sentences. In total, a passage ranges from three to eight sentences: one to six internal sentences plus the two external ones.

IMPORTANT: It is possible that, after analyzing the passage, its main topic is not the matched political issue but a different one. That is not a problem; adjust the boundaries if necessary, and the passage can be accepted as valid.

The sidebar will show the following elements:

  • Passage ID: the unique ID of the passage document.

  • Source text: the web page where the text was taken.

  • Issue: the political issue found by the string matcher.

  • Wikidata entity: a link to the Wikidata entity that maps the political issue (e.g., health care).

  • Shorten/extend buttons: the buttons to shorten or extend the passage.

  • Action buttons: handle the actions to process each passage:

    • Accept: records the passage in the database as valid.

    • Reject: records the passage in the database as invalid. Other annotators won't ever find the same passage.

    • Ignore: skips the passage. Other annotators may find the same passage later.

    • Undo: reverts the passage to its original state (reset), removing any annotated information.

Step 3: Identify political issue

Legend:

  1. Yellow: outside passage.

  2. White: inside passage.

  3. Gray: surrounding text.

  4. Orange: matched political issue.

You will recognize the "root" sentence because the political issue appears in orange. The first step is to disambiguate the political issue the passage refers to. Each political issue has many aliases; for instance, the political issue "mass media" has the alias "media", which can also be found in other contexts. For example:

A very heated exchange there between Kellyanne Conway and a reporter.

That reporter joining me now.

He is Andrew Feinberg, a White House reporter for Breakfast Media.

Andrew, you say that, by asking about your ethnicity, which she did very clear there 
at the beginning of that clip, that Kellyanne Conway confirms what the president meant.

Explain that.

"Breakfast media" is a television program, and, therefore, it is not a valid reference to a political issue of interest. If the matched political issue is not a valid reference to a political issue, the passage must be rejected. On the other hand, the passage is cohesive and coherent but it is NOT part of a political discourse; therefore, the passage is not valid and must be rejected.

Step 4: Identify noisy text

The combination of algorithms described above will load good candidate passages into the Annotation Tool. However, the source may contain noisy text that confuses the algorithms. A first visual inspection helps to save effort when identifying valid passages. For instance, consider the passage:

MARQUARDT:

A very heated exchange there between Kellyanne Conway and a reporter.

That reporter joining me now.

He is Andrew Feinberg, a White House reporter for Breakfast Media.

Andrew, you say that, by asking about your ethnicity, which she did very clear there at 
the beginning of that clip, that Kellyanne Conway confirms what the president meant.

Explain that.

ANDREW FEINBERG, WHITE HOUSE REPORTER, BREAKFAST MEDIA:

Both "outside" sentences are speaker labels from an interview, which makes the passage invalid and must be rejected. Recall that after a "Reject" action, any other annotator won't see that passage again.

Step 5: Define Passage Boundaries

Pay attention to this example of a valid passage:

NOBILO:

Climate change may be changing the planet, but it's also changing our politics.

We've seen people around the world push for leaders to translate their intentions and words 
into action.

And now, the U.N. is taking action.

This weekend, Gabon became the first African country to get funding to preserve its rainforests.

Norway will pay $150 million to battle deforestation there.

The move comes as millions of people are taking to the streets.

Open the JS console in your browser and find the continuation links between sentences corresponding to the six (6) linguistic features described before.

Annotator tool with the JS console open, showing information about the six (6) linguistic features that define sentence continuity between each pair of sentences.
  1. Notice that the passage seems to speak about the matched political issue of the United Nations ("U.N."). Let's review each pair of sentences and analyze which linguistic features link them.

  2. In the first pair of sentences, the first ("outside") sentence speaks about climate change. However, the second ("inside") sentence introduces the subject of people pushing leaders (governments) to take action in their favor. See the JS console and observe the "outside" sentence (in yellow letters) that defines the passage boundary.

    Notice that the only continuity link found is logical continuity (see NSP above), but that is not enough, since most pairs of "outside" and "inside" sentences have logical continuity. In the algorithm's logic, logical continuity must always be accompanied by another continuity link to connect two sentences.

    We accept the unresolved coreference "we" since it is an exophora. Exophoric references must be evaluated by understanding the context of the passage. If the missing (exophoric) reference does not affect the understanding of the text, we can accept the boundary of the passage containing it.

    Climate change may be changing the planet, but it's also changing our politics.
    
    We've seen people around the world push for leaders to translate their intentions and words 
    into action.
  3. In the next pair of sentences, the continuity links between them are: 1) the semantic chain defined by "action", and 2) the transition marker "and". We can confirm this by seeing "true" for "semantic chain" and "transition markers" in the JS console.

    The following is the list of nouns the algorithm evaluates for the semantic chain:

    • Sentence 1: "people", "world", "leader", "intention", and "action".

    • Sentence 2: "U.N." and "action".

    This information is not visible in the annotation tool, but the annotator should evaluate if the semantic chain is creating a valid continuity link between two sentences by reviewing the nouns.

    Notice that nouns have been converted to their lemma (see lemmatization).

    We've seen people around the world push for leaders to translate their intentions and words
    into action.
    
    And now, the U.N. is taking action.
  4. The continuity link between the next pair of sentences is the semantic chain. Although the explainability is low here, both sentences make sense together: the second sentence is a consequence of the first and adds cohesion and coherence to the passage; therefore, we keep it.

    And now, the U.N. is taking action.
    
    This weekend, Gabon became the first African country to get funding to preserve its rainforests.
  5. Finally, the algorithm didn't find any continuity link in the next pair of sentences, defining an "outside" sentence and, consequently, the boundary of the passage.

    This weekend, Gabon became the first African country to get funding to preserve its rainforests.
    
    Norway will pay $150 million to battle deforestation there.

Overall, the passage suggests that the United Nations, in response to this global push, is starting to take concrete steps or measures. This implies a shift from planning or discussing to executing real-world actions, providing a specific instance of the U.N.'s action.

This example shows how annotators can rely on the information the algorithm provides about topic shifts and continuity. However, the algorithm is imperfect; therefore, the annotator evaluates whether any sentence must be removed from or included in the passage, guided by a clear understanding of the linguistics of passage boundary definition and the theory described in this document.

Step 6: Use the Action Buttons

When deciding whether to accept a passage, recall that rejecting passages is better than adding mediocre passages that will decrease the BERT model's performance.

Annotation Checklist

Follow this checklist in your annotation process:

  1. Confirm presence of a political issue of interest: Disambiguate synonyms or terms with another meaning. For instance, take the term "work": the political issue of interest is "employment" (whose aliases are "work", "job", etc.), not the action of working (a verb) or the use of "work" in another context, like "you did good work". Disambiguation, in many cases, is necessary.

  2. Identify if noisy text is in "outside" or "inside" sentences: Confirm that only natural language is part of the extracted passage ("inside" sentences) and its "outside" sentences. Speaker labels, dates, links, etc., are not natural language and should never be part of an annotated passage. "Outside" sentences must also be composed of natural language text.

  3. Confirm if the passage is part of political discourse: Ensure the passage is part of transcribed conferences, interviews, speeches, remarks, etc. Recall that many political discourses come with contextual information, like summaries, credits, introductions, abstracts, etc., usually surrounding the political discourse itself. Avoid written text, such as press releases or similar.

  4. Confirm the passage focuses on one (1) or two (2) political issues: Avoid passages that refer to many topics simultaneously. We want our model to extract passages related to one or two topics at most (especially if both are closely related) to capture the important facts of political discourse.

  5. Confirm cohesion: Recall that a cohesive text segment holds different continuity links (explained above) that capture a distinguishable, clear meaning about something in particular. No element should be missing or left over. Use the metadata shown in the JS console as a guide; recall that this data is only referential, and never use it as a conclusive way to delimit passage boundaries. A passage of only one sentence that conveys a whole cohesive idea is allowed.

  6. Confirm coherence: Recall that you are dealing with raw text massively scraped from the WWW, and undesirable text may have been included. Therefore, you must confirm the passage is coherent, makes sense, and presents a logical sequence of ideas.

  7. Confirm topic shift: A topic shift may indicate the location of the passage boundaries. Although a topic shift is a good indicator, use cohesion as the most important indicator to define passage boundaries.

  8. Avoid missing coreferences: Avoid passages that have coreferences to issues (entities or nouns) previously introduced outside the passage. Allow exophoric coreferences, evaluating first whether they rely on the reader's/listener's common or implicit knowledge of the context. Avoid passages that have unresolved intra-textual coreference, like anaphora or cataphora. Treat "that" with care; sometimes it may refer to something exophoric. Use your understanding of context and linguistics to make decisions.

  9. Shorten or extend passage if necessary: Sometimes, a passage can be improved by shortening or extending it by one or two sentences. Do this by evaluating all the previous checks in this list.

Examples

Example 1 (Accepted, with no adjustments):

a) This includes seeking the advice of healthcare providers, who can better educate patients of the 
importance of getting appropriate cancer screening tests at the right time, knowing their family 
history and other risk factors, and making lifestyle changes that may reduce the possibility of 
breast cancer.

b) My Administration is committed to supporting our Nation's dedicated researchers in their diligent 
efforts to advance medical breakthroughs that will save and improve lives.

c) Earlier this year, I signed into law Federal Right to Try legislation, which provides those 
diagnosed with a terminal illness expanded options for treatment that could save their lives.

d) Cutting-edge developments in the fight against breast cancer include interventions and treatments 
that are more effective and less debilitating.

e) Recently, a groundbreaking national study found that most women with an early-stage diagnosis of 
the most common type of breast cancer can safely forgo chemotherapy.

The passage is about a law that gives more treatment options for patients with breast cancer.

  1. The first "inside" sentence (b) shows a good starting ("My Administration...") indicating that it does not have any continuity marker that depends on the previous ("outside") sentence. Although the topic is the same (breast cancer), the passage shows a topic shift in the discourse flow: Sentence a focuses on "the need to educate people about breast cancer's risk factors". In contrast, sentence b discusses "the given support to research on medical innovation."

  2. The rule-based extraction algorithm found coreference between "I" in sentence c and "My" in sentence b, and a semantic chain between nouns ("lives", "illness", "research", "treatment", "researcher", "legislation", "administration", etc.). The main links are between "administration commitment" and "legislation", one being a consequence of the other.

  3. Between sentences c and d, the algorithm found a semantic chain between nouns ("treatment", "terminal illness", "breast cancer", "intervention", etc.).

  4. Sentence e represents a topic shift towards a study of women diagnosed with cancer.

  5. We also observe that:
    • One political issue of interest is present: disease (illness).

    • The passage is part of a political discourse.

    • The passage is cohesive, capturing a whole idea around the political issue "illness".

    • The passage is coherent, presenting a logical order of ideas.

    • Within the main topic (a law that gives more treatment options for patients with breast cancer), this passage represents a sub-topic with clear topic shifts.

Example 2 (Rejected):

a) Senator Baldwin's legislation addresses many of the serious problems that deter young scientists 
from pursuing careers in biomedical research and commits essential resources for supporting a 
strong and vibrant pipeline of future scientists.

b) - Robert Golden, M.D., Dean at the University of Wisconsin School of Medicine and Public Health, 
Vice Chancellor for Medical Affairs

c) Years of efforts by physicians, scientists, and other investigators have provided us the high 
quality of medical care we enjoy today.

d) Researchers at the University of Wisconsin School of Medicine and Public Health are at the 
forefront of this revolution in medical knowledge.

This is a good example of an incoherent passage. Here are the two reasons why the passage must be rejected:

  1. The two "inside", b and c, sentences lack cohesion, since sentence b does not convey any meaning as part of a discourse.

  2. Sentence c conveys a cohesive meaning about "medical care"; however, if we shortened the passage from the upper edge, sentence b would become the upper "outside" sentence, and we would still not satisfy the requirement that "inside" and "outside" sentences be composed of natural language.

Example 3 (Accepted, with no adjustments):

a) And look at the world on this bright August night.

b) The spirit of democracy is sweeping the Pacific rim.

c) China feels the winds of change.

d) New democracies assert themselves in South America.

e) And one by one, the unfree places fall, not to the force of arms but to the force of an idea: 
Freedom works.

f) And we have a new relationship with the Soviet Union: the INF Treaty, the beginning of the Soviet 
withdrawal from Afghanistan, the beginning of the end of the Soviet proxy war in Angola and, with 
it, the independence of Namibia.

g) Iran and Iraq move toward peace.

Example 4 (Accepted with adjustments):

a) And the United States and its partners are working around the clock, literally, to move food and 
release supplies into Afghanistan from surrounding countries, positioning it where it will be 
needed most as the harsh winter weather approaches.

b) Administrator Natsios just returned from a week in Central Asia.

c) He was reviewing humanitarian operations in the region, as well as in the staging areas where the 
aid is stockpiled for the purpose of getting it onto site and helping the people who need help the 
most.

d) The United States has supplied more than 80 percent of all food aid to vulnerable Afghans through 
the United Nations World Food Program.

e) Last year, the United States government provided over $178 million that year alone to aid the 
Afghan people, and the United States government has provided $237 million in aid to Afghanistan 
thus far in 2002.

f) One more example on that.

g) The U.S. has airlifted 20, 000 wool blankets, 100 rolls of plastic sheeting, 200 metric tons of 
high-energy biscuits, and 1 metric ton of sugar to Turkmenistan for distribution in Afghanistan.

After shortening the passage and turning sentence f into an "outside" sentence, we make the topic shift obvious and avoid referencing another instance of international aid.

Resulting passage after adjustment:

a) And the United States and its partners are working around the clock, literally, to move food and 
release supplies into Afghanistan from surrounding countries, positioning it where it will be 
needed most as the harsh winter weather approaches.

b) Administrator Natsios just returned from a week in Central Asia.

c) He was reviewing humanitarian operations in the region, as well as in the staging areas where the 
aid is stockpiled for the purpose of getting it onto site and helping the people who need help the 
most.

d) The United States has supplied more than 80 percent of all food aid to vulnerable Afghans through 
the United Nations World Food Program.

e) Last year, the United States government provided over $178 million that year alone to aid the 
Afghan people, and the United States government has provided $237 million in aid to Afghanistan 
thus far in 2002.

f) One more example on that.

g) The U.S. has airlifted 20, 000 wool blankets, 100 rolls of plastic sheeting, 200 metric tons of 
high-energy biscuits, and 1 metric ton of sugar to Turkmenistan for distribution in Afghanistan.

Tips

  1. The Annotation Tool is loaded with thousands of passages; therefore, do not be afraid to ignore or reject passages that demand further adjustments, show ambiguity or incoherence, or clearly are not the best examples for our dataset. Always choose the best examples.

  2. Do you want to load a specific passage? Replace the Passage ID in the URL after "passage_id="
    https://annotation-nlp-rfqv643p3a-lm.a.run.app/dataset2/edit?passage_id=

    with the Passage ID you want to load.

    https://annotation-nlp-rfqv643p3a-lm.a.run.app/dataset2/edit?passage_id=0000000025
  3. Do you need to see information on an annotated passage? Load an annotated passage ("accepted" or "rejected") as described before. Although the UI won't show the current state of the passage, it is possible to see the datapoint data in the JS console:

    Annotator tool, Passage Info in two states: Rejected and Accepted.
  4. Do you need to know which datapoints you have tagged? Just download your dataset by clicking the "Download Dataset 2" button.

Hybrid Reclassification Approach (Round 3)

This annotation approach combines refined rules and manual annotation to reclassify pairs of sentences.

Rule-Based Classification Model 2

Goal: select examples of pairs of sentences where a topic continues or shifts at a granular level. Therefore, every sentence pair must be classified based on easily distinguishable features that define whether continuity holds.

This rule-based classification model will generate an improved version of the dataset using refined rules. This version aims to capture more fine-grained linguistic features that define topic continuity and shift using check functions that leverage spaCy's pre-trained NLP models.

The five rules are built on the following linguistic features:

  1. Coreference Continuity: As in the rule-based classification model 1, using spaCy's experimental model for coreference resolution (en_coreference_web_trf), we identify if two sentences have coreference links, meaning if there are words (like pronouns) in one sentence that refer to words in another sentence. It is crucial for understanding the continuity and flow of ideas between sentences.

    spaCy's coreference model is very effective in capturing anaphoric and cataphoric references:

    [0000000014] The Iraqis have been trying to acquire weapons of mass destruction. [SEP] That's the only explanation for why Saddam Hussein does not want inspectors in from the U.N.

    The anaphoric reference "That" in the second sentence refers back to "acquire" (weapons) in the first sentence.

    [
      {
        "coreference": {
          "coreference_group_1": [
            {
              "coref": "acquire",
              "start": 6,
              "end": 7
            },
            {
              "coref": "That",
              "start": 12,
              "end": 13
            }
          ]
        }
      }
    ]

    However, not every coreference is captured by spaCy's coreference model:

    * [0000000563] How can we hope to keep our military strong if our service members are forced to choose between their love of country and their love of family? [SEP] That's why supporting your physical, social and emotional health is a national security imperative.

    [0000000655] And it's a very sad thing. [SEP] That's one of many things, but it would have never happened.

    [0000000532] We're going to start drilling in ANWR - one of the largest oil reserves in the world - that for 40 years this country was unable to touch. [SEP] That by itself would be a massive bill.

    Therefore, pairs of sentences with no other continuity markers but with a clear presence of coreference must be evaluated cautiously to be classified accordingly (*).

    A note on "that": Given the pair of sentences:

    The President has also made clear that he believes that the acquisition of an effective missile defense system for the United States and its allies is one of his highest priorities, that he believes the only way to get there is a robust testing and evaluation system, and that he is not prepared to permit the treaty to get in the way of doing that robust testing. All that is unexpected.

    The "that"s in the first sentence, "that he believes" and "that he is not prepared" are relative pronouns introducing a clause to specify what the President does. It serves to give additional information about the President's priorities and actions. The second "that" in the sentence "All that is unexpected." is a demonstrative pronoun referring to the preceding situation or statement: coreference. It is used here to express an opinion or judgment about the President's previously mentioned actions or policies. Essentially, the first two "that"s are used to introduce detail, while the second "that" refers back to and comments on the entire statement.

    Use the JSON viewer to make the JSON structures more human-readable.

  2. Lexical Continuity: Using spaCy's linguistic features, we evaluate if two sentences share common linguistic elements, specifically nouns, proper nouns, verbs, adjectives, adverbs, subordinating conjunctions, and numbers. In essence, it seeks to capture the presence of words central to sentence meaning and structure. The check function check_lexical_continuity returns the common elements between both sentences that are indicators of lexical continuity (a sketch of this check appears after this list). Lexical continuity is vital to cohesion and coherence, ensuring that ideas are expressed and interconnected, offering a smoother and more logical transition between thoughts.

    While lexical continuity is concerned with what words are used, syntactic continuity is about how these words are structurally interrelated within the sentence, being able to detect syntactic patterns that may predict continuity.

    Example:

    • "The President announced a new climate policy focusing on renewable energy sources."

    • "This policy could dramatically shift the country's reliance on fossil fuels."

    In these sentences, the shared syntactic unit is "policy." The first sentence introduces a specific action by the President – announcing a new climate policy. The second sentence refers back to this policy, discussing its potential impact. This shared noun ("policy") establishes lexical continuity between the two sentences, indicating a clear, contextual link in the discourse about environmental policy.

    The check function would identify "policy" as the common element, confirming a lexical bridge between the two sentences, demonstrating that both are related, and ensuring the conversation remains focused on a specific topic.

    [0000000021] That doesn't even count all that the rest of the world is trying to do for the starving people of Afghanistan, or those who need food. [SEP] The problem in getting food to the people of Afghanistan is the Taliban tries to tax it, they threaten and harass and take away the equipment of U.N. workers.

    {
      'lexical_continuity': [
        'try',
        'people',
        'Afghanistan',
        'food'
      ]
    }

    The presence of several common syntactic units is a strong indicator of the continuity of both sentences since they share many aspects of the same topic. More than two concurrent syntactic units in a pair of sentences can be considered a good number to predict continuity.

    On the other hand, usually, the presence of only one syntactic unit when lexical continuity is the only continuity feature captured by the rule-based model is not enough to predict continuity. See the next example.

    [0000000832] The G20 countries could mitigate the vast majority of climate change, keeping warming to 1.7 degrees Celsius if we act now - and we must. [SEP] When our nations meet in Glasgow next month - and thank you, Mr. Speaker, for the hospitality of your country, Speaker of the U.K. and the cooperation of Italy for the G26 - the COP26.

    Despite the presence of the lemma "country" (the lemmatized form of both "country" and "countries"), there is a clear topic shift that determines the classification of this pair of sentences.

    [
      {
        'lexical_continuity': [
          'country'
        ]
      }
    ]

    Another example:

    [0000000845] And their parents, low-income Americans who desperately want to work, will have more ladders out of poverty. [SEP] Pass this jobs bill, and companies will get a $4, 000 tax credit if they hire anyone who has spent more than 6 months looking for a job.

    [
      {
        'lexical_continuity': [
          'more'
        ]
      }
    ]
  3. Syntactic Continuity: Using spaCy's linguistic features, we evaluate whether two sentences show commonality in the way words relate to each other within each sentence, through their dependency patterns, extending beyond the simple presence of shared words or phrases and delving into the deeper structure of sentences. The check function check_syntactic_continuity finds dependency patterns shared by two sentences to conclude that they exhibit syntactic continuity. A dependency pattern is a triad representing the relationship between words in a sentence, consisting of a head word, a dependent word, and the type of dependency that connects them.

    Example:

    [0000000959] It's hard to run a business if you're marching to war. [SEP] It's not conducive to capital investment.

    First, we examine the syntactic structures present in both sentences.

    The following visualization shows the syntactic structure of the relationships between words in the first sentence, "It's hard to run a business if you're marching to war.", based on dependency parsing. The term "ROOT" in dependency parsing refers to the central node of the parse tree, from which all words are connected either directly or indirectly. We can visually recognize the ROOT node in a dependency parsing visualization because only outgoing arrows come from it, connecting directly or indirectly all the other nodes. In the first sentence, the ROOT word is "'s", the contraction of "is", whose lemma is "be".

    Additionally, we can see that the word "it" is the nominal subject (nsubj) of the ROOT verb "'s" (lemma "be").

    Note: Scroll horizontally in the visualization to have a complete view of the dependency parsing.

    [Dependency parse visualization of the first sentence: the ROOT is "'s" (lemma "be"), with "It" as its nominal subject (nsubj).]

    The visualization of the second sentence, "It's not conducive to capital investment.", shows the same two patterns: the ROOT node is "'s" and the word "it" is the nominal subject (nsubj) of the lemmatized form of the verb "be".

    [Dependency parse visualization of the second sentence: the ROOT is "'s" (lemma "be"), with "It" as its nominal subject (nsubj).]

    The presence of both dependency patterns is represented as triads. The first triad says that the head word is the lemma "be" ("'s"), and since it is the ROOT node of the sentence, its dependent word is also "be". The type of dependency in this case is ROOT.

    The second triad expresses that the head word is again the lemma "be" (from "'s"), and the word "it" is the nominal subject (nsubj) of the verb "'s", the ROOT node expressed as the lemma "be".

    [
      {
        'syntactic_continuity': {
          ('be', 'be', 'ROOT'),
          ('be', 'it', 'nsubj')
        }
      },
      {
        'coreference': {
          'coreference_group_1': [
            {
              'coref': 'run',
              'start': 4,
              'end': 5
            },
            {
              'coref': 'It',
              'start': 14,
              'end': 15
            }
          ]
        }
      }
    ]

    Is the presence of both patterns enough to determine syntactic continuity? In this case, yes. However, the coreference also determines continuity since "it", from the second sentence, refers to "run".

    Let's see an example in which syntactic continuity defines continuity by its own means.

    [0000002833] That's the plan that will move us forward. [SEP] That's why I'm running for a second term.

    Both triads show that the two sentences have similar structures, each referring back to a preceding referent with "That" and using the verb "be" as the ROOT word.

    [
      {
        'syntactic_continuity': {
          ('be', 'be', 'ROOT'),
          ('be', 'that', 'nsubj')
        }
      }
    ]

    By reading both sentences, we find syntactic parallelism (or structural parallelism), where elements of sentences are structurally similar or symmetrical, particularly in terms of syntax. In general terms, parallelism refers to using components in a sentence that are grammatically the same or similar in their construction, sound, meaning, or meter.

    Parallelism is a powerful device used to create rhythm and emphasis, and to reinforce relationships between ideas in discourse or text. It makes a text or discourse more engaging and easier to understand. In political communication, it is common to find parallelism, which is used to articulate ideas more effectively and persuasively.

    Syntactic continuity is a nuanced continuity feature that may be used to define a classification even when both sentences do not share nouns (entities), verbs (relations), etc. However, the check function may capture irrelevant similar syntactic structures. This continuity must be evaluated together with other features, and with extra attention when other features are absent.

    The next example shows a pair of sentences with the presence of one syntactic pattern that is not enough to define syntactic continuity:

    [0000003149] That's because the economy is so good, and we've given incentives, and it's been an incredible success. [SEP] Employers are really, really happy.

    [
      {
        'syntactic_continuity': {
          ('be', 'be', 'ROOT')
        }
      }
    ]

    The dependency visualization of the first sentence:

    [Dependency parse visualization of the first sentence: the ROOT is "'s" (lemma "be").]

    And the dependency visualization of the second sentence:

    [Dependency parse visualization of the second sentence: the ROOT is "are" (lemma "be"), with "Employers" as its nominal subject (nsubj).]

    As observed, one syntactic pattern present in both sentences is not enough to define continuity. A hypothesis to confirm is that the presence of more than one syntactic pattern in both sentences is a better predictor of syntactic continuity, especially if the ROOT node is present in these patterns.

  4. Semantic Continuity: Leveraging spaCy's linguistic capabilities, the check_semantic_continuity function determines whether two sentences exhibit semantic connections by examining key semantic units such as nouns, verbs, adjectives, and adverbs. The function extracts key semantic units from each sentence, targeting noun chunks and individual tokens while explicitly excluding personal pronouns and other stop words. This selective process ensures that only the sentences' most relevant and meaningful components are considered. The function then vectorizes these semantic units using spaCy's vectorization features, converting textual information into a numerical format that can be computationally analyzed for similarity. This semantic comparison discerns a degree of continuity between the sentences. The approach ensures a focused and nuanced understanding of the sentences' meanings, complementing the lexical continuity function, which focuses on identical words: this check searches for the same meaning expressed lexically differently.

    Example:

    • "The administration is focusing on healthcare reform to improve access to medical services for all citizens."

    • "Efforts to expand medical coverage have been a central pillar of the government's agenda."

    Vectorizing these key terms would reveal a high degree of semantic similarity between both sentences. Though not immediately obvious through word choice or grammar, these semantic connections indicate a strong thematic link between the sentences.

    {
      'semantic_continuity': [
        ('medical services', 'medical coverage')
      ]
    }

    More examples:

    [0000000676] It's been a headache for everybody. [SEP] It's been a nightmare for many.

    The check function check_semantic_continuity found the phrases "a headache" and "a nightmare" semantically similar, creating a continuity link.

    [
      {
        'syntactic_continuity': {
          ('be', 'be', 'ROOT'),
          ('be', 'it', 'nsubjpass'),
          ('be', 'be', 'auxpass')
        }
      },
      {
        'semantic_continuity': [
          ('a headache', 'a nightmare')
        ]
      }
    ]

    [0000001642] A noose. [SEP] A gallows.

    This time, the phrases "a noose" and "a gallows" are semantically related concepts since a gallows is a frame, usually wood, made up of a horizontal crossbeam from which a noose or rope is suspended to hang a person.

    [
      {
        'semantic_continuity': [
          ('a noose', 'a gallows')
        ]
      }
    ]

    However, semantic continuity between two sentences involves understanding not only the literal meanings of words and phrases but also the contextual, pragmatic, and often subtle implications that span across sentences. Therefore, critical assessment to resolve the continuation between sentences is necessary, and the information provided by this check function must be evaluated in a wider context of the pair of sentences.

  5. Transition Marker Continuity: Transition markers are words or phrases that guide the reader through the flow of ideas, signaling whether the sentence continues a topic or shifts to a new one. The function check_transition_markers_continuity uses a lexicon of transition markers to categorize transition markers into different types based on their function and position in the sentence. The categories include "leading_markers" and "flexible_markers," each subdivided into "topic_continuity" and "topic_shift" markers.

    Leading markers are typically found at the beginning of a sentence and are definitive signals of the sentence’s intent, whether continuing a topic or introducing a new one. For instance, markers like "furthermore" or "however" clearly indicate the continuation or shift of the topic, respectively. Flexible markers, on the other hand, can appear anywhere in a sentence and often serve dual roles. They can signal continuity or shift depending on their usage and the sentence context. Markers like "as well" or "including" can either support the ongoing topic or subtly introduce a shift in the discourse.

    This analysis of transition markers is vital for understanding the text's structural and rhetorical strategies. It helps dissect complex arguments, follow narrative flows, and appreciate written communication's subtleties.

    Example:

    • "The government has increased funding for public education."

    • "However, many schools are still facing resource shortages."

    • "Despite these challenges, some educational institutions have made significant improvements."

    In these sentences, the transition markers "However" and "Despite" are key in signaling the direction of the discourse. The second sentence begins with "However," a leading marker from the "topic_shift" category, indicating that the following sentence will present a contrasting or opposing point to the one just made. This sets up the expectation for a shift in the discussion, moving from a statement about increased funding to a caveat about persistent issues in schools. The third sentence begins with "Despite," another leading marker, but this time indicating a continuation of the contrasting idea introduced by "However." It acknowledges the challenges mentioned in the first sentence but shifts the focus to a more positive aspect, highlighting improvements in some educational institutions.
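
The following sketch reconstructs the spirit of two of the check functions named above, check_lexical_continuity and check_syntactic_continuity. The project's real implementations are not published here, so treat the bodies as plausible approximations, assuming spaCy's en_core_web_md model:

    import spacy

    nlp = spacy.load("en_core_web_md")

    # Content-word POS tags named in the lexical continuity description.
    CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV", "SCONJ", "NUM"}

    def check_lexical_continuity(s1: str, s2: str) -> set:
        """Lemmas of content words shared by both sentences."""
        lemmas1 = {t.lemma_ for t in nlp(s1) if t.pos_ in CONTENT_POS}
        lemmas2 = {t.lemma_ for t in nlp(s2) if t.pos_ in CONTENT_POS}
        return lemmas1 & lemmas2

    def check_syntactic_continuity(s1: str, s2: str) -> set:
        """Dependency triads (head lemma, dependent lemma, relation) shared by both."""
        triads1 = {(t.head.lemma_, t.lemma_, t.dep_) for t in nlp(s1)}
        triads2 = {(t.head.lemma_, t.lemma_, t.dep_) for t in nlp(s2)}
        return triads1 & triads2

    s1 = "That's the plan that will move us forward."
    s2 = "That's why I'm running for a second term."
    print(check_syntactic_continuity(s1, s2))
    # Expected to include ('be', 'be', 'ROOT') and ('be', 'that', 'nsubj'),
    # matching the triads shown in the syntactic continuity example above.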

We dropped parallelism, logical continuity (BERT's NSP), and tense and aspect change from rule-based classification Model 1 because they either overlapped with features captured by other rules or contributed only marginally.

Model baseline

(soon)

Conclusions

Based on observation, we conclude that:

  1. Coreference: spaCy's coreference model performs very well, providing granular information about the references between sentences that is useful in the annotation process. However, a few obvious coreferences are not captured by the model and need human evaluation to classify the dataset instance correctly, especially in examples where coreference continuity is the only continuity feature between sentences. The presence of a coreference does not by itself define continuity: both sentences may refer to the same noun while still shifting the specific topic.

  2. Lexical continuity: The presence of one or more common syntactic units (identical words) is a strong indicator of continuity, and more than one shared unit is an even stronger indicator. However, critical assessment and the concurrence of other continuity features must be used to reach the correct classification. Lexical continuity is about "identical words," not identical meaning: meaning is the ultimate criterion for continuity.

  3. Syntactic continuity: This feature is very useful for disambiguating continuity, especially when it is aligned with other features.

  4. Transition markers: THIS FEATURE WILL BE HANDLED BY REYES; skip every example containing the transition_marker_continuity feature. Most of the leading continuation markers that define the "continue" class are "and" and "so".

  5. Semantic continuity: Present in only a few cases in Dataset 2; it may help disambiguate continuity once the deeper meaning and context of both sentences are understood.
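The sketch below shows one way to query the spaCy coreference model mentioned in conclusion 1 for a sentence pair. It assumes the experimental coreference component distributed through the spacy-experimental package and its en_coreference_web_trf pipeline; the model name and the "coref_clusters_*" span keys follow that experimental release and may change.

    # Sketch: inspecting spaCy coreference clusters for a sentence pair.
    # Assumes: pip install spacy-experimental and the experimental
    # en_coreference_web_trf pipeline; span keys may change in later releases.
    import spacy

    nlp = spacy.load("en_coreference_web_trf")

    s1 = "We're begging we have blackouts all over our country."
    s2 = "We've never had anything like that."
    doc = nlp(s1 + " " + s2)

    # Each cluster groups mentions of the same entity; a cluster whose
    # mentions span both sentences is a coreference cue for continuity.
    for key, mentions in doc.spans.items():
        if key.startswith("coref_clusters"):
            print(key, [(m.text, m.start_char) for m in mentions])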

Classification rules

  1. Flexibility and Critical Assessment: Since we aim to capture subtleties of natural language that strict rules miss, this classification task requires an understanding of the complexity and subjectivity of language, especially in a domain like political discourse, where rhetoric and persuasion play significant roles. This approach allows us to capture nuances that rigid rule-based systems would overlook.

  2. Topic Shift: A topic shift occurs when the main subject of the first sentence (S1) does not continue in the second sentence (S2). When the topic shifts, the entities of S1 may still be present in S2, but the focus (main topic) is different.

  3. Rules Cascade: Coreferences and syntactic parallelism should be used primarily in the assessment of continuity; lexical and semantic similarity between the sentences should be used secondarily. (A sketch of this cascade in code appears after this list.)

  4. Coreference: The presence of any coreference based on demonstrative, object, and possessive pronouns between the sentences defines continuity. This rule is strict and has higher weight than the others whenever any of the following anaphoric expressions refers to an entity mentioned in the previous sentence.

    • Demonstrative Pronoun: used to refer to specific things or people mentioned before. Example: "We're begging we have blackouts all over our country. We've never had anything like that."

      • That (not the relative pronoun "that"): refers to a singular noun that is farther away from the speaker.

      • Those: plural form of "that".

      • This: refers to a singular noun that is close to the speaker.

      • These: plural form of "this".

    • Object Pronoun: used as the object of a verb or preposition. Example: [0000001213] "I was very intrigued by dealing with a very strong woman who had been raised in a Communist country and what it meant-what it meant. I spent some time with her upstairs in the private dining quarters here in the White House complex, listening to her."

      • Me

      • You

      • Her

      • Him

      • It

      • Us

      • Them

    • Possessive Pronoun: used to indicate that something belongs to or is associated with an entity. Example: [0000000519] "He has proven instead only his contempt for the United Nations and for all his pledges. By breaking every pledge, by his deceptions, and by his cruelties, Saddam Hussein has made the case against himself."

      • Mine

      • Yours

      • Hers

      • His

      • Its

      • Ours

      • Theirs

  5. Coreference by Subject Pronoun: While subject pronouns also establish coreference, their evaluation requires nuanced critical assessment, because a sentence pair linked only by subject pronouns may still introduce a new topic or subtopic, breaking the line of topic continuity. Subject pronouns are used as the subject of a verb.

    • I

    • You

    • She

    • He

    • It

    • We

    • They

    Yes, if the subject pronoun refers to a named entity: "Jane is a worker. She's heard your stories."

    Yes, if the subject pronoun refers to another subject pronoun and the main topic continues and/or other continuity features are present: [0000000960] "Now we're marching to peace. We took the tough decision, but now we're marching to peace." The coreference among "we," "We," and "we" is accepted because the main topic continues, observable in the identical words (lexical continuity) and similar syntactic structures.

    No, if the subject pronoun refers to another subject pronoun and the main topic shifts: [0000000511] Now, we moved to west Texas 40 years ago, 40 years ago this year. [SEP] The war was over, and we wanted to get out and make it on our own.

  6. Syntactic Structure Similarity: Syntactic parallelism serves as a robust marker for discerning sentence continuity. The alignment of syntactic constructions reflects coherent and strategic text composition, so parallel structures expose the intentional and rhetorical organization of a text and the interplay between linguistic form and communicative function in discourse.

    Yes, if the syntactic structure is easily distinguishable in both sentences. In these examples the main topic continues along with the parallel syntactic structure: [0000000649] And now we're begging everybody for energy. We're begging we have blackouts all over our country. [0000001018] This is a crisis. It is a crisis that the Republicans in Congress are refusing to address.

    However, in this second example, the main topic ("inflation") does not continue in the second sentence, but the parallel syntactic structure continues: [0000000960] Inflation is low. Interest rates are low.

    No, if the syntactic structure is present in both sentences but is neither dominant nor clearly distinguishable. Example: [0000001129] There's never been anything like this. When the pandemic struck, there were zero tests for the China virus, but we've marshaled all of America's resources to achieve these unparalleled capabilities.

  7. Lexical Cohesion: The presence of identical words may define continuity, but other continuity features must be evaluated to conclude if the topic continues.

    In the "No" example below, the second sentence shifts from the main topic, One Warm Coat (an organization that collects donations), to another aspect of it: its location.

    Yes, if the main topic continues being the subject of the sentence. [0000001129] We slashed redtape and approved emergency use authorizations for 243 type of tests. [SEP] That's how many tests we have.

    No, if the main topic shifts to another topic or subtopic. [0000008389] Thousands of people participate in One Warm Coat, and there are over 450 distribution centers in all 50 states. The location for today's One Warm Coat coat drive is Pathways to Housing, D.C., and they are one of the distribution agencies across the country.

  8. Semantic Similarity: This involves the analysis of the meaning behind the words and sentences. It's crucial for determining thematic progression in a text.

    Yes, if the semantic similarity is about the main topic in both sentences. [] (find example)

    No, if there is semantic similarity with an entity in the other sentence but not with the main topic. [0000004235] On Pakistan more broadly, you've got the recent missile strike that has caused so much difficulty. You've got the floods that have caused so much instability. Another example: [0000001135] We have not seen hospital overcrowding. We're doubling down now with testing to protect vulnerable people not just inside nursing homes, but in residential settings and senior daycare centers.
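The rules above can be approximated in code to pre-compute cues for the annotator. The sketch below is a simplified, illustrative version of the cascade, assuming spaCy's en_core_web_md model; the pronoun sets come from rules 4 and 5, but the overlap, parallelism, and similarity measures are stand-ins, not the project's definitive implementation.

    # Sketch of the rules cascade (rules 3-8) for a sentence pair.
    # Assumes spaCy's en_core_web_md model (tokens, POS tags, word vectors);
    # all measures are illustrative cues for the annotator, not a verdict.
    import spacy

    nlp = spacy.load("en_core_web_md")

    # Rule 4 pronouns. Note: this naive check does not exclude the
    # relative pronoun "that".
    STRICT_COREF = {
        "that", "those", "this", "these",                         # demonstrative
        "me", "you", "her", "him", "it", "us", "them",            # object
        "mine", "yours", "hers", "his", "its", "ours", "theirs",  # possessive
    }
    SUBJECT_PRONOUNS = {"i", "you", "she", "he", "it", "we", "they"}  # rule 5

    def continuity_cues(s1: str, s2: str) -> dict:
        d1, d2 = nlp(s1), nlp(s2)
        cues = {}
        # Rule 4: strict coreference pronouns appearing in S2 (resolving
        # what they refer to is left to the annotator or a coref model).
        cues["strict_coref"] = [t.text for t in d2 if t.lower_ in STRICT_COREF]
        # Rule 5: subject pronouns need nuanced assessment, so only flag them.
        cues["subject_pronouns"] = [t.text for t in d2
                                    if t.lower_ in SUBJECT_PRONOUNS
                                    and t.dep_ == "nsubj"]
        # Rule 7: lexical cohesion as content-word (lemma) overlap.
        def content(d):
            return {t.lemma_.lower() for t in d
                    if t.pos_ in ("NOUN", "PROPN", "VERB", "ADJ")}
        cues["lexical_overlap"] = sorted(content(d1) & content(d2))
        # Rule 6: crude parallelism as the length of the shared POS prefix.
        prefix = 0
        for a, b in zip([t.pos_ for t in d1], [t.pos_ for t in d2]):
            if a != b:
                break
            prefix += 1
        cues["pos_prefix_match"] = prefix
        # Rule 8: semantic similarity from the model's word vectors.
        cues["semantic_similarity"] = round(d1.similarity(d2), 3)
        return cues

    # The [0000000960] example: parallel structure, but the topic shifts.
    print(continuity_cues("Inflation is low.", "Interest rates are low."))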

Guiding Principles

  1. In a classification task, it is better to remove an ambiguous example than to accept a datapoint that would teach the model something wrong or confusing. The ambiguous example must be flagged as "discuss" (in red - see the legend tab), and Reyes will make the final decision.

  2. We aim to capture passages about a specific topic at a granular level. The main principle is therefore to decide topic continuity/non-continuity by finding the minimal sign of topic shift, even when the main topic remains the same but the discourse moves to another aspect of it. Extracting passages at a granular level means extracting passages of only a few sentences about something very specific within a topic of interest.

Error Analysis Task

  1. Create gold standard: The gold standard dataset is composed of manually labelled examples that have been reliably annotated and are considered the benchmark or reference for the correct outcome. It will consist of 900 manually annotated examples that represent perfect examples of both classes. Each annotator (Nitin, Pranesh, and Reyes) will annotate 300 examples, divided into 150 datapoints per class. Since the "not_continue" class is scarcer than the "continue" class, we will begin by classifying examples labelled as "not_continue". We will not follow a sequential order but a random one that ensures a variety of examples from different passages - no sequential examples will be annotated. This task will be completed in 1 week.

  2. Baseline fine-tuning: The 900 gold examples will be split into a "training" dataset (800 examples) and a "validation" dataset (90 examples), which will be used to fine-tune a BERT model for binary classification in 2 or 3 iterations. The remaining datapoints will compose the "test" dataset, which is imperfect and contains misclassifications.

  3. Prediction and Error Logging: In each iteration, using the trained model, we will predict on a portion of the test set while logging the instances where the model's predictions do not match the labels. Each flagged classification will be assessed individually and reclassified if necessary (see the sketch after this list).

  4. Iterative Approach: The test dataset will be split into three parts that will be processed weekly, iterating until the whole dataset is correctly classified. The error analysis task will be completed in three weeks.
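The sketch below illustrates the baseline fine-tuning and error-logging steps with Hugging Face Transformers. The file names and the "text"/"label" column names are hypothetical placeholders; the model checkpoint and hyperparameters are assumptions for illustration, not the project's final configuration.

    # Sketch: baseline fine-tuning + logging test instances to reassess.
    # File names and column names ("text", "label") are placeholders.
    import numpy as np
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)  # continue / not_continue

    data = load_dataset("csv", data_files={
        "train": "gold_train.csv",            # 800 gold examples
        "validation": "gold_validation.csv",  # 90 gold examples
        "test": "test_unverified.csv",        # remaining, imperfect labels
    })

    def tokenize(batch):
        # Sentence pairs are stored joined with [SEP] in the "text" column.
        return tokenizer(batch["text"], truncation=True, padding="max_length")

    data = data.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="baseline", num_train_epochs=3),
        train_dataset=data["train"],
        eval_dataset=data["validation"],
    )
    trainer.train()

    # Error logging: keep the test instances whose prediction disagrees
    # with the (imperfect) label, so they can be reassessed manually.
    preds = trainer.predict(data["test"])
    pred_labels = np.argmax(preds.predictions, axis=-1)
    mismatches = [i for i, (p, y)
                  in enumerate(zip(pred_labels, preds.label_ids)) if p != y]
    print(f"{len(mismatches)} instances to reassess:", mismatches[:10])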

Procedure

  1. Assess the datapoint and define class.

  2. Select a value in the "reclass_3" attribute.

  3. The corresponding color will be added automatically to the row, according to the legend.

  4. Select your name in the "annotator_3" attribute.

  5. In case of an ambiguous datapoint, add "discuss" in the "note" attribute, or add a clarifying note if necessary.