IE Seminar

Dataset 2: Annotation procedure

Institute of Computer Science, Brandenburgische Technische Universität Cottbus-Senftenberg
Juan-Francisco Reyes
pacoreyes@protonmail.com

### THIS PAGE WILL BE UPDATED CONTINUALLY BASED ON INTERACTIONS ON THE FORUM; RETURN OFTEN

Overview

This document delineates the process of annotating a dataset for training a passage boundary detection model using Hugging Face technology. The objective is to construct a dataset that enables the model to distinctly identify the boundaries of a passage about a topic.

Teams will generate subsets of "Dataset 2". Utilizing an annotation tool, teams will edit passages that will be turned into datapoints for a subsequent BERT fine-tuning task.

The project is collaborative, with a collective outcome determining the grade; however, the annotation process is individualized.

Background

Passage Boundary Detection

Passage boundary detection is a task in NLP whose goal is to identify the boundaries between distinct passages or sections within a text. It involves determining where one passage ends and another begins, which is essential for tasks like document summarization, text segmentation, and improving readability by structuring text into coherent units.

This process is crucial for understanding and organizing large text volumes by breaking them into more manageable and contextually distinct sections.

Next Sentence Prediction (NSP) in BERT Models

BERT models were trained with two tasks that enable the development of a sophisticated understanding of language context and sentence relationships: Masked Language Model (MLM) and Next Sentence Prediction (NSP).

Traditional language models before BERT primarily processed text in one direction (either left-to-right or right-to-left); BERT, however, reads text bi-directionally. This means it gains a deeper understanding of the context of a word based on all its surrounding text, not just what comes before it. This enables BERT to predict very accurately whether one sentence logically and coherently follows another in a text.

This feature can be leveraged to determine the boundaries of a passage. In the NSP task, BERT is given two sentences (A and B) and has to predict if B logically follows A. This capability is essential for understanding sentence relationships, which is a crucial aspect of language comprehension.

During training, 50% of the time, sentence B is the actual next sentence that follows A, and 50% of the time, it's a random sentence from the corpus. BERT then learns to predict whether these two sentences are related or not.

NSP helps BERT understand the narrative flow and how ideas are connected in text, which is useful in tasks that require understanding the relationship between sentences, like passage boundary detection, question answering, and summarization.
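
To make NSP concrete, the following minimal sketch shows how the feature can be explored with the Hugging Face transformers library. It is an illustration assuming the bert-base-uncased checkpoint; the snippet posted on the Moodle forum may differ:

    import torch
    from transformers import BertTokenizer, BertForNextSentencePrediction

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

    sentence_a = "Solar power is a key component of renewable energy."
    sentence_b = "However, it faces challenges like storage and variability."

    # Encode the pair as "[CLS] A [SEP] B [SEP]" and score both NSP labels.
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # index 0 = "B follows A", 1 = "B is random"

    probs = torch.softmax(logits, dim=1)
    print(f"P(continuation) = {probs[0, 0].item():.3f}")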

Levels of Language

Four levels of language allow us to understand the linguistics of sentence continuity (a brief spaCy sketch after this list shows how three of them surface in NLP tooling):

  1. Morphology deals with the structure and formation of words, studying how morphemes (the smallest grammatical units in a language) are combined to form words.

    Example: The word "exobiology" is formed by combining the prefix "exo-" (meaning 'outside' or 'external') with "biology" (the study of living organisms). This morphological combination reflects the study of life beyond Earth, focusing on the possibility of extraterrestrial life.

  2. Syntax is concerned with how words are arranged to form sentences, including the rules and principles that govern sentence structure.

    Example: Consider the sentence, "Quantum computers, unlike traditional computers, leverage qubits for computations." This sentence demonstrates syntactic structure, where the introductory phrase "unlike traditional computers" is inserted to modify and contrast "Quantum computers," followed by the main clause "leverage qubits for computations."

  3. Semantics involves the meaning of words, phrases, and sentences, including the study of how meaning is conveyed through language and of the relationships between the words that represent entities.

    Example: In the sentence, "The blockchain securely records all cryptocurrency transactions," the semantics involves understanding that "blockchain" refers to a digital ledger technology, while "cryptocurrency transactions" implies the exchange or transfer of digital currency assets. The relationship between these terms conveys a specific meaning about the security and nature of digital transactions.

  4. Pragmatics studies how language is used in context and how context affects language interpretation. Pragmatics aims to understand the speaker's intentionality and the effect of context on meaning.

    Example: Imagine a conversation where one person says, "It’s getting very warm in here", and the other opens a window. While the first sentence could be a simple statement about temperature, it's understood as a request to cool down the room in this context.
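
Three of these four levels surface directly in NLP tooling. The following minimal sketch (assuming spaCy's en_core_web_md model, which ships word vectors) inspects morphology, syntax, and a vector-based proxy for semantics; pragmatics has no direct spaCy counterpart and still requires human judgment:

    import spacy

    nlp = spacy.load("en_core_web_md")
    doc = nlp("Quantum computers, unlike traditional computers, "
              "leverage qubits for computations.")

    # Morphology: lemma and inflectional features of each token.
    for token in doc:
        print(token.text, token.lemma_, token.morph)

    # Syntax: the dependency structure connecting the words.
    for token in doc:
        print(token.text, token.dep_, token.head.text)

    # Semantics: vector similarity as a rough proxy for relatedness.
    other = nlp("The blockchain securely records all cryptocurrency transactions.")
    print(doc.similarity(other))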

Cohesion

Cohesion refers to the grammatical and lexical links within a text or speech that hold it together and give it meaning. Cohesion is achieved through various means, such as pronoun reference, conjunctions, lexical ties, and ellipsis.

Example: "Humans have always been fascinated by space. They have dreamt of exploring the stars for centuries. This dream has led to significant advancements in technology."

In this example, cohesion is achieved through the pronoun "They," which refers back to "Humans." The lexical tie is the space exploration theme maintained throughout the sentences. The cohesive flow is further supported by the progression from a general fascination to specific advancements.

Coherence

Coherence refers to the logical connections and the overall sense of understandability in a text. A text is coherent if its content is organized in a way that makes sense to the reader or listener, with ideas and arguments flowing logically.

Example: "Sustainable fashion aims to reduce the environmental impact of the clothing industry. To achieve this, designers use eco-friendly materials. Furthermore, they adopt ethical labor practices. As a result, the environmental footprint of clothing production is significantly reduced."

The coherence in this example is established by presenting a clear, logical sequence of ideas: the goal of sustainable fashion, the methods employed (using eco-friendly materials and ethical labor practices), and the outcome (reduced environmental footprint).

Transition Markers

Words or phrases that help to link sentences and paragraphs together, such as "however", "furthermore", or "in conclusion", are important for maintaining sentence continuity.

Example: "Solar power is a key component of renewable energy. However, it faces challenges like storage and variability. Despite these challenges, advancements in battery technology are making solar more reliable. Furthermore, government incentives are encouraging its adoption. In conclusion, solar power, while not without its hurdles, is a promising part of the future energy mix."

In this passage, transitional devices such as "However," "Despite these challenges," "Furthermore," and "In conclusion" are used to link sentences and paragraphs. These devices help to contrast points, add additional information, and provide a summarizing statement, contributing to the text's overall flow and continuity.

Examples of transition markers include (a small lexicon-based tagger sketch follows this list):

  1. Additive Transitions: Indicate addition or introduction of information. Examples include "and," "also," "furthermore," "in addition," "moreover."

  2. Adversative Transitions: Signal contrast or contradiction. Examples are "but," "however," "on the other hand," "nevertheless," "yet."

  3. Causal Transitions: Denote cause-effect relationships. Examples include "because," "therefore," "as a result," "thus," "consequently."

  4. Temporal Transitions: Indicate time or sequence. Examples are "then," "later," "after," "subsequently," "meanwhile."

  5. Exemplification Transitions: Used for giving examples. Examples include "for instance," "for example," "namely," "specifically."

  6. Summarization Transitions: Used to summarize or conclude. Examples are "in conclusion," "to sum up," "in summary," "overall."
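
The following minimal sketch shows how such a lexicon can be used to tag markers in a sentence. The function name find_transition_markers is hypothetical, and the marker lists are illustrative excerpts from the categories above, not an official lexicon:

    # Hypothetical lexicon-based transition-marker tagger.
    TRANSITION_MARKERS = {
        "additive": ["also", "furthermore", "in addition", "moreover"],
        "adversative": ["but", "however", "on the other hand", "nevertheless", "yet"],
        "causal": ["because", "therefore", "as a result", "thus", "consequently"],
        "temporal": ["then", "later", "after", "subsequently", "meanwhile"],
        "exemplification": ["for instance", "for example", "namely", "specifically"],
        "summarization": ["in conclusion", "to sum up", "in summary", "overall"],
    }

    def find_transition_markers(sentence: str) -> dict:
        """Return the marker categories found in a sentence."""
        lowered = sentence.lower()
        return {
            category: [m for m in markers if m in lowered]
            for category, markers in TRANSITION_MARKERS.items()
            if any(m in lowered for m in markers)
        }

    print(find_transition_markers("However, it faces challenges like storage."))
    # {'adversative': ['however']}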

Coreference

Coreference occurs when two or more expressions in a text refer to the same person or thing (pronoun reference). Coreference plays an essential role in cohesion.

  1. Anaphora: a word or phrase that refers back to another word or phrase used earlier in the text. The earlier word or phrase is called the antecedent.

    Example: "In quantum computing, the qubit is fundamental. This concept revolutionizes how we think about processing information." Using "this concept" helps to link the sentence back to the specific idea of a qubit in quantum computing.

  2. Cataphora: a word or phrase that refers forward to another word or phrase that appears later in the text.

    Example: "Before his discovery, Einstein was relatively unknown. The theory of relativity changed that." In this case, "His discovery" is a cataphoric reference that refers forward to "The theory of relativity," which appears later in the sentence. It introduces the subject of Einstein's significant achievement before specifying what it is.

  3. Exophora: a word or phrase that refers to something outside the text. Unlike anaphora and cataphora, which refer to elements within the text (intra-textual), exophora is extra-textual: it relies on the listener's or reader's knowledge of the context. Pronouns like "this" or "that" in a conversation might refer to objects or situations in the immediate physical environment or in a shared situational context. When someone says, "We have to meet early tomorrow", "we" is understood through the implicit shared knowledge of speaker and listener. Likewise, "In God we trust" shows a use of "we" that does not need to be resolved.

    Common exophoric references are "we", "that", "this", "now", "then", "here", "where", "you".

Coreference happens on the pragmatics level of language, going beyond the literal meaning of words (semantics), morphology and syntax, requiring an understanding of context, speaker intention, and the ability to make inferences.
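
In practice, coreference chains can be extracted automatically. The following minimal sketch uses spaCy's experimental coreference model, which this project also relies on. It assumes the spacy-experimental package and the en_coreference_web_trf pipeline are installed; the span-group keys may vary across versions:

    import spacy

    nlp = spacy.load("en_coreference_web_trf")
    doc = nlp("Humans have always been fascinated by space. "
              "They have dreamt of exploring the stars for centuries.")

    # Coreference clusters are exposed as span groups named "coref_clusters_*".
    for key, spans in doc.spans.items():
        if key.startswith("coref_clusters"):
            print(key, [(span.text, span.start, span.end) for span in spans])
    # Expected output along the lines of:
    # coref_clusters_1 [('Humans', 0, 1), ('They', 8, 9)]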

Bootstrapping Approach using Rules

In NLP, bootstrapping is a semi-supervised learning method that iteratively improves a model's performance by using its predictions to generate new training data. This approach is often used when there is a limited amount of labeled data but an abundance of unlabeled data. In this project, we will follow the bootstrapping approach by automatically creating the first version of the dataset by using rules based on linguistic features.

Rule-Based Classification Model 1

This rule-based classification model will generate the initial version of the dataset to define the baseline of our BERT model for NSP.

The six rules are built on the following linguistic features:

  1. Coreference Resolution: Using spaCy's experimental model for coreference resolution (en_coreference_web_trf) we identify if two sentences have coreference links, meaning if there are words (like pronouns) in one sentence that refer to words in another sentence. It is crucial for understanding the continuity and flow of ideas between sentences.

  2. Semantic Chain Detection: Using spaCy's linguistic features, we evaluate if two sentences are semantically related or discuss the same topic. It uses spaCy's semantic similarity metrics and linguistic features like lemmatization and dependency parsing to identify common referents and subjects, ensuring textual coherence across sentences.

  3. Parallelism: Using spaCy's linguistic features, we evaluate if two sentences follow a similar syntactic structure. It compares the dependency structures of sentences to see if they align, which helps determine if the sentences are part of the same cohesive passage.

  4. Transition Marker: Using spaCy's linguistic features, we evaluate transition markers or phrases in a sentence, like "however", "furthermore", or "in conclusion", which are key indicators of the sentence's relation to the surrounding text, showing continuity or shifts in the discourse.

  5. Logical continuity (from BERT's NSP): Using the Next Sentence Prediction (NSP) feature of pre-trained BERT models, we determine if a sentence is a logical continuation of another. It provides insight into whether two sentences should be part of the same passage based on the flow and context of the discourse. The code to explore NSP was posted on the Moodle forum.

  6. Tense and Aspect Change: Using spaCy's linguistic features, we evaluate if the verbs in two sentences change in tense and aspect. Consistency in tense and aspect between sentences can indicate cohesive narrative flow, thereby helping define passage boundaries. While "tense" refers to when an action happens (past, present, or future), "aspect" refers to the state of the action (completed, ongoing, or recurring). A minimal sketch of this rule follows this list.
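
The following minimal sketch illustrates the idea behind rule 6. It assumes spaCy's en_core_web_sm model and a hypothetical helper name; the project's actual implementation may differ:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def tense_aspect(sentence: str) -> set:
        """Collect the (tense, aspect) combinations of the verbs in a sentence."""
        doc = nlp(sentence)
        pairs = set()
        for token in doc:
            if token.pos_ in ("VERB", "AUX"):
                tense = tuple(token.morph.get("Tense"))
                aspect = tuple(token.morph.get("Aspect"))
                if tense or aspect:
                    pairs.add((tense, aspect))
        return pairs

    s1 = "Norway will pay $150 million to battle deforestation there."
    s2 = "The move comes as millions of people are taking to the streets."
    # Overlapping tense/aspect combinations suggest narrative consistency;
    # disjoint sets suggest a tense and aspect change.
    print(tense_aspect(s1) & tense_aspect(s2))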

Dataset Building Workflow (Round 1)

  1. We begin with a selection of political issues of interest (topics), e.g., climate change, employment, women, etc. The system searches for these political issues in thousands of political discourse texts.

  2. When a political issue is string-matched in a sentence (the root sentence), the six rules are used to navigate the preceding and subsequent sentences, evaluating where the topic ends (topic shift).

  3. When the leading and trailing boundaries are defined, the surrounding sentences outside of the passage are added to mark the boundaries.

  4. Extracted passages are loaded into the Annotation Tool.

Annotation Task (Round 2)

Teams will refine the dataset by annotating 250 valid passages per team member, extracted from audio/transcribed political discourses. Dataset 2 aims to set a gold standard for fine-tuning BERT's NSP feature in the domain of American political discourse.

IMPORTANT: The goal is to collect relevant passages of political discourse, speaking about political issues, with well-defined boundaries, to fine-tune BERT's NSP feature in a specific domain (American political discourse).

Each passage will allow us to generate multiple pairs of sentences to fine-tune the NSP feature of a BERT model. Each pair comprises either

  • two sentences defining the continuation of a topic ("continue" class), or

  • two sentences defining a topic shift ("not_continue" class).

Consequently, a good theoretical understanding of the linguistics of sentence continuity is crucial.

Protocol

The annotation process involves using the annotation tool and a shared spreadsheet on Google Drive. Initiate the process by providing the lecturer with the Gmail accounts of all team members to secure Editor access to the spreadsheet.

Within the spreadsheet, locate the tab named with your last name. In the "participants" tab, find who will review whose dataset. As all teams share the spreadsheet, exercise consideration towards your peers.

Each tab has two columns, "id" and "notes". For example:

    id          notes
    0000000336  Post on the forum to ask for feedback

  • The "id" column is used to add the ID of each passage you annotate.

  • "notes" (optional) is reserved for any noteworthy annotation observations.

IMPORTANT: log only valid datapoints, neither rejected nor ignored ones. Therefore, the spreadsheet will contain only the IDs of valid datapoints.

Step 1: Access the Annotation Tool

Go to the Annotation Tool, and log in with credentials supplied by the lecturer. Select your last name from the dropdown menu and launch the tool to annotate Dataset 2.

Step 2: Recognize Passage Editor

Upon entry, the editor will be shown.

Annotation Tool, editor.

Recognize the following areas and elements:

  • Editor: The text editing area, where the passage range is shown.

  • Sidebar: Houses the document ID, text source link, the "shorten" and "extend" buttons, and the "action" buttons.

Recognize the components of a passage:

  • Outside (O) sentences: are not part of the passage.

  • Root (R) sentence: where the main topic is mentioned.

  • Inside (I) sentence: others that complete the passage.

Elements of a valid passage.

A valid passage must include at least one "inside" sentence and two "outside" sentences, i.e., a minimum of three sentences. In total, a passage ranges from three to eight sentences: one to six internal sentences plus the two external ones.

IMPORTANT: It is possible that, after analyzing the passage, its main topic is not the matched political issue but a different one. That is not a problem; adjust the boundaries if necessary, and the passage can be accepted as valid.

The sidebar will show the following elements:

  • Passage ID: the unique ID of the passage document.

  • Source text: the web page where the text was taken.

  • Issue: the political issue found by the string matcher.

  • Wikidata entity: a link to the Wikidata entity that maps the political issue (e.g., health care).

  • Shorten/extend buttons: the buttons to shorten or extend the passage.

  • Action buttons: handle the actions to process each passage:

    • Accept: records the passage in the database as valid.

    • Reject: records the passage in the database as invalid. Other annotators won't ever find the same passage.

    • Ignore: skips the passage. Other annotators may find the same passage later.

    • Undo: reverts the passage to its original state (reset), removing any annotated information.

Step 3: Identify political issue

Legend:

  1. Yellow: outside passage.

  2. White: inside passage.

  3. Gray: surrounding text.

  4. Orange: matched political issue.

You will recognize the "root" sentence because the political issue appears in orange. The first step is to disambiguate the political issue the passage refers to. Each political issue has many aliases; for instance, the political issue "mass media" has the alias "media", which can also be found in other contexts. For example:

A very heated exchange there between Kellyanne Conway and a reporter.

That reporter joining me now.

He is Andrew Feinberg, a White House reporter for Breakfast Media.

Andrew, you say that, by asking about your ethnicity, which she did very clear there 
at the beginning of that clip, that Kellyanne Conway confirms what the president meant.

Explain that.

"Breakfast media" is a television program, and, therefore, it is not a valid reference to a political issue of interest. If the matched political issue is not a valid reference to a political issue, the passage must be rejected. On the other hand, the passage is cohesive and coherent but it is NOT part of a political discourse; therefore, the passage is not valid and must be rejected.

Step 4: Identify noisy text

The combination of algorithms described above will load good candidate passages into the Annotation Tool. However, the source may contain noisy text that confuses the algorithms. A first visual inspection helps to save effort when identifying valid passages. For instance, consider the passage:

MARQUARDT:

A very heated exchange there between Kellyanne Conway and a reporter.

That reporter joining me now.

He is Andrew Feinberg, a White House reporter for Breakfast Media.

Andrew, you say that, by asking about your ethnicity, which she did very clear there at 
the beginning of that clip, that Kellyanne Conway confirms what the president meant.

Explain that.

ANDREW FEINBERG, WHITE HOUSE REPORTER, BREAKFAST MEDIA:

Both "outside" sentences are speaker labels from an interview, which makes the passage invalid and must be rejected. Recall that after a "Reject" action, any other annotator won't see that passage again.

Step 5: Define Passage Boundaries

Pay attention to this example of a valid passage:

NOBILO:

Climate change may be changing the planet, but it's also changing our politics.

We've seen people around the world push for leaders to translate their intentions and words 
into action.

And now, the U.N. is taking action.

This weekend, Gabon became the first African country to get funding to preserve its rainforests.

Norway will pay $150 million to battle deforestation there.

The move comes as millions of people are taking to the streets.

Open the JS console in your browser and find the continuation links between sentences corresponding to the six (6) linguistic features described before.

Annotator tool with the JS console open, showing information about the six (6) linguistic features that define sentence continuity between each pair of sentences.
  1. Notice that the passage seems to speak about the matched political issue of the United Nations ("U.N."). Let's review each pair of sentences and analyze which linguistic features link them.

  2. In the first pair of sentences, the first ("outside") sentence speaks about climate change. However, the second ("inside") sentence introduces the subject of people pushing leaders (governments) to take action in their favor. See the JS console and observe the "outside" sentence (in yellow letters) that defines the passage boundary.

    Notice that the only continuity link found is logical continuity (see NSP above), but that is not enough, since most pairs of "outside" and "inside" sentences have logical continuity. In the algorithm's logic, logical continuity must always be accompanied by another continuity link to connect two sentences.

    We accept the unresolved coreference "we" since it is an exophora. Exophoric references must be evaluated by understanding the context of the passage. If the missing (exophoric) reference does not affect the understanding of the text, we can accept the boundary of the passage containing it.

    Climate change may be changing the planet, but it's also changing our politics.
    
    We've seen people around the world push for leaders to translate their intentions and words 
    into action.
  3. In the next pair of sentences, the continuity links between them are: 1) the semantic chain defined by "action", and 2) the transition marker "and". We can confirm this by seeing "true" for "semantic chain" and "transition markers" in the JS console.

    The following is the list of nouns the algorithm evaluates for the semantic chain:

    • Sentence 1: "people", "world", "leader", "intention", and "action".

    • Sentence 2: "U.N." and "action".

    This information is not visible in the annotation tool, but the annotator should evaluate if the semantic chain is creating a valid continuity link between two sentences by reviewing the nouns.

    Notice that nouns have been converted to their lemma (see lemmatization).

    We've seen people around the world push for leaders to translate their intentions and words
    into action.
    
    And now, the U.N. is taking action.
  4. The continuity link between the next pair of sentences is the semantic chain. Although the explainability is low here, both sentences make sense together: the second sentence is a consequence of the first and adds cohesion and coherence to the passage; therefore, we keep it.

    And now, the U.N. is taking action.
    
    This weekend, Gabon became the first African country to get funding to preserve its rainforests.
  5. Finally, the algorithm didn't find any continuity link in the next pair of sentences, defining an "outside" sentence and, consequently, the boundary of the passage.

    This weekend, Gabon became the first African country to get funding to preserve its rainforests.
    
    Norway will pay $150 million to battle deforestation there.

Overall, the passage suggests that the United Nations, in response to this global push, is starting to take concrete steps or measures. This implies a shift from planning or discussing to executing real-world actions, providing a specific instance of the U.N.'s action.

This example shows how annotators can rely on the information the algorithm provides about topic shifts and continuity. However, the algorithm is imperfect; therefore, the annotator evaluates whether any sentence must be removed from or included in the passage, guided by a clear understanding of the linguistics of passage boundary definition and the theory described in this document.

Step 6: Use the Action Buttons

When deciding whether to accept a passage, recall that rejecting passages is better than adding mediocre passages that will decrease the BERT model's performance.

Annotation Checklist

Follow this checklist in your annotation process:

  1. Confirm presence of a political issue of interest: Disambiguate synonyms or terms with another meaning. For instance, take the term "work": the political issue of interest is "employment" (whose aliases are "work", "job", etc.), not the action of working (a verb) or the use of "work" in another context, like "you did good work". Disambiguation, in many cases, is necessary.

  2. Identify if noisy text is in "outside" or "inside" sentences: Confirm that only natural language is part of the extracted passage ("inside" sentences) and its "outside" sentences. Speaker labels, dates, links, etc., are not natural language and should never be part of an annotated passage. "Outside" sentences must also be composed of natural language text.

  3. Confirm if the passage is part of political discourse: Ensure the passage is part of transcribed conferences, interviews, speeches, remarks, etc. Recall that many political discourses come with contextual information, like summaries, credits, introductions, abstracts, etc., usually surrounding the political discourse itself. Avoid written text, such as press releases or similar.

  4. Confirm the passage focuses on one (1) or two (2) political issues: Avoid passages that refer to many topics simultaneously. We want our model to extract passages related to one or two topics at most (especially if both are closely related) to capture the important facts of political discourse.

  5. Confirm cohesion: Recall that a cohesive text segment holds different continuity links (explained above) that capture a distinguishable, clear meaning about something in particular. No element should be missing or left over. Use the metadata shown in the JS console as a guide; recall that this data is only referential, and never use it as a conclusive way to delimit passage boundaries. A passage of only one sentence that conveys a whole cohesive idea is allowed.

  6. Confirm coherence: Recall that you are dealing with raw text massively scraped from the WWW, and undesirable text may have been included. Therefore, you must confirm the passage is coherent, makes sense, and presents a logical sequence of ideas.

  7. Confirm topic shift: A topic shift may indicate the location of the passage boundaries. Although a topic shift is a good indicator, use cohesion as the most important indicator to define passage boundaries.

  8. Avoid missing coreferences: Avoid passages that have coreferences to issues (entities or nouns) previously introduced outside the passage. Allow exophoric coreferences, evaluating first whether they rely on the reader's/listener's common or implicit knowledge of the context. Avoid passages that have unresolved intra-textual coreference, like anaphora or cataphora. Treat "that" with care; sometimes it may refer to something exophoric. Use your understanding of context and linguistics to make decisions.

  9. Shorten or extend passage if necessary: Sometimes, a passage can be improved by shortening or extending it by one or two sentences. Do this by evaluating all the previous checks in this list.

Examples

Example 1 (Accepted, with no adjustments):

a) This includes seeking the advice of healthcare providers, who can better educate patients of the 
importance of getting appropriate cancer screening tests at the right time, knowing their family 
history and other risk factors, and making lifestyle changes that may reduce the possibility of 
breast cancer.

b) My Administration is committed to supporting our Nation's dedicated researchers in their diligent 
efforts to advance medical breakthroughs that will save and improve lives.

c) Earlier this year, I signed into law Federal Right to Try legislation, which provides those 
diagnosed with a terminal illness expanded options for treatment that could save their lives.

d) Cutting-edge developments in the fight against breast cancer include interventions and treatments 
that are more effective and less debilitating.

e) Recently, a groundbreaking national study found that most women with an early-stage diagnosis of 
the most common type of breast cancer can safely forgo chemotherapy.

The passage is about a law that gives more treatment options for patients with breast cancer.

  1. The first "inside" sentence (b) shows a good starting ("My Administration...") indicating that it does not have any continuity marker that depends on the previous ("outside") sentence. Although the topic is the same (breast cancer), the passage shows a topic shift in the discourse flow: Sentence a focuses on "the need to educate people about breast cancer's risk factors". In contrast, sentence b discusses "the given support to research on medical innovation."

  2. The rule-based extraction algorithm found coreference between "I" in sentence c and "My" in sentence b, and a semantic chain between nouns ("lives", "illness", "research", "treatment", "researcher", "legislation", "administration", etc.). The main links are between "administration commitment" and "legislation", one being a consequence of the other.

  3. Between sentences c and d, the algorithm found a semantic chain between nouns ("treatment", "terminal illness", "breast cancer", "intervention", etc.).

  4. Sentence e represents a topic shift towards a study of women diagnosed with cancer.

  5. We also observe that:
    • One political issue of interest is present: disease (illness).

    • The passage is part of a political discourse.

    • The passage is cohesive, capturing a whole idea around the political issue "illness".

    • The passage is coherent, presenting a logical order of ideas.

    • Within the main topic (a law that gives more treatment options for patients with breast cancer), this passage represents a sub-topic with clear topic shifts.

Example 2 (Rejected):

a) Senator Baldwin's legislation addresses many of the serious problems that deter young scientists 
from pursuing careers in biomedical research and commits essential resources for supporting a 
strong and vibrant pipeline of future scientists.

b) - Robert Golden, M.D., Dean at the University of Wisconsin School of Medicine and Public Health, 
Vice Chancellor for Medical Affairs

c) Years of efforts by physicians, scientists, and other investigators have provided us the high 
quality of medical care we enjoy today.

d) Researchers at the University of Wisconsin School of Medicine and Public Health are at the 
forefront of this revolution in medical knowledge.

This is a good example of an incoherent passage. Here are the two reasons why the passage must be rejected:

  1. The two "inside", b and c, sentences lack cohesion, since sentence b does not convey any meaning as part of a discourse.

  2. Sentence c conveys a cohesive meaning about "medical care"; however, if we shortened the passage from the upper edge, sentence b would become the upper "outside" sentence, and we would still not satisfy the requirement that "inside" and "outside" sentences be composed of natural language.

Example 3 (Accepted, with no adjustments):

a) And look at the world on this bright August night.

b) The spirit of democracy is sweeping the Pacific rim.

c) China feels the winds of change.

d) New democracies assert themselves in South America.

e) And one by one, the unfree places fall, not to the force of arms but to the force of an idea: 
Freedom works.

f) And we have a new relationship with the Soviet Union: the INF Treaty, the beginning of the Soviet 
withdrawal from Afghanistan, the beginning of the end of the Soviet proxy war in Angola and, with 
it, the independence of Namibia.

g) Iran and Iraq move toward peace.

Example 4 (Accepted with adjustments):

a) And the United States and its partners are working around the clock, literally, to move food and 
release supplies into Afghanistan from surrounding countries, positioning it where it will be 
needed most as the harsh winter weather approaches.

b) Administrator Natsios just returned from a week in Central Asia.

c) He was reviewing humanitarian operations in the region, as well as in the staging areas where the 
aid is stockpiled for the purpose of getting it onto site and helping the people who need help the 
most.

d) The United States has supplied more than 80 percent of all food aid to vulnerable Afghans through 
the United Nations World Food Program.

e) Last year, the United States government provided over $178 million that year alone to aid the 
Afghan people, and the United States government has provided $237 million in aid to Afghanistan 
thus far in 2002.

f) One more example on that.

g) The U.S. has airlifted 20, 000 wool blankets, 100 rolls of plastic sheeting, 200 metric tons of 
high-energy biscuits, and 1 metric ton of sugar to Turkmenistan for distribution in Afghanistan.

After shortening the passage and turning sentence f into an "outside" sentence, we make the topic shift obvious and avoid referencing another instance of international aid.

Resulting passage after adjustment:

a) And the United States and its partners are working around the clock, literally, to move food and 
release supplies into Afghanistan from surrounding countries, positioning it where it will be 
needed most as the harsh winter weather approaches.

b) Administrator Natsios just returned from a week in Central Asia.

c) He was reviewing humanitarian operations in the region, as well as in the staging areas where the 
aid is stockpiled for the purpose of getting it onto site and helping the people who need help the 
most.

d) The United States has supplied more than 80 percent of all food aid to vulnerable Afghans through 
the United Nations World Food Program.

e) Last year, the United States government provided over $178 million that year alone to aid the 
Afghan people, and the United States government has provided $237 million in aid to Afghanistan 
thus far in 2002.

f) One more example on that.

g) The U.S. has airlifted 20, 000 wool blankets, 100 rolls of plastic sheeting, 200 metric tons of 
high-energy biscuits, and 1 metric ton of sugar to Turkmenistan for distribution in Afghanistan.

Tips

  1. The Annotation Tool is loaded with thousands of passages; therefore, do not be afraid to ignore or reject passages that demand further adjustments, show ambiguity or incoherence, or clearly are not the best examples for our dataset. Always choose the best examples.

  2. Do you want to load a specific passage? Replace the Passage ID in the URL after "passage_id="
    https://annotation-nlp-rfqv643p3a-lm.a.run.app/dataset2/edit?passage_id=

    with the Passage ID you want to load.

    https://annotation-nlp-rfqv643p3a-lm.a.run.app/dataset2/edit?passage_id=0000000025
  3. Do you need to see information on an annotated passage? Load an annotated passage ("accepted" or "rejected") as described before. Although the UI won't show the current state of the passage, it is possible to see the datapoint data in the JS console:

    Annotator tool, Passage Info in two states: Rejected and Accepted.
  4. Do you need to know which datapoints you have tagged? Just download your dataset by clicking the "Download Dataset 2" button.

Hybrid Reclassification Approach (Round 3)

This annotation approach combines refined rules and manual annotation to reclassify pairs of sentences.

Rule-Based Classification Model 2

Goal: select examples of pairs of sentences where a topic continues or shifts at a granular level. Therefore, every sentence pair must be classified based on easily distinguishable features that define whether continuity holds.

This rule-based classification model will generate an improved version of the dataset using refined rules. This version aims to capture more fine-grained linguistic features that define topic continuity and shift using check functions that leverage spaCy's pre-trained NLP models.

The five rules are built on the following linguistic features:

  1. Coreference Continuity: As in the rule-based classification model 1, using spaCy's experimental model for coreference resolution (en_coreference_web_trf), we identify if two sentences have coreference links, meaning if there are words (like pronouns) in one sentence that refer to words in another sentence. It is crucial for understanding the continuity and flow of ideas between sentences.

    spaCy's coreference model is very effective in capturing anaphoric and cataphoric references:

    [0000000014] The Iraqis have been trying to acquire weapons of mass destruction. [SEP] That's the only explanation for why Saddam Hussein does not want inspectors in from the U.N.

    The anaphoric reference "That" in the second sentence refers back to "acquire" (weapons) in the first sentence.

    [
      {
        "coreference": {
          "coreference_group_1": [
            {
              "coref": "acquire",
              "start": 6,
              "end": 7
            },
            {
              "coref": "That",
              "start": 12,
              "end": 13
            }
          ]
        }
      }
    ]

    However, not every coreference is captured by spaCy's coreference model:

    * [0000000563] How can we hope to keep our military strong if our service members are forced to choose between their love of country and their love of family? [SEP] That's why supporting your physical, social and emotional health is a national security imperative.

    [0000000655] And it's a very sad thing. [SEP] That's one of many things, but it would have never happened.

    [0000000532] We're going to start drilling in ANWR - one of the largest oil reserves in the world - that for 40 years this country was unable to touch. [SEP] That by itself would be a massive bill.

    Therefore, pairs of sentences with no other continuity markers but with a clear presence of coreference must be evaluated cautiously to be classified accordingly (*).

    A note on "that": Given the pair of sentences:

    The President has also made clear that he believes that the acquisition of an effective missile defense system for the United States and its allies is one of his highest priorities, that he believes the only way to get there is a robust testing and evaluation system, and that he is not prepared to permit the treaty to get in the way of doing that robust testing. All that is unexpected.

    The "that"s in the first sentence, "that he believes" and "that he is not prepared" are relative pronouns introducing a clause to specify what the President does. It serves to give additional information about the President's priorities and actions. The second "that" in the sentence "All that is unexpected." is a demonstrative pronoun referring to the preceding situation or statement: coreference. It is used here to express an opinion or judgment about the President's previously mentioned actions or policies. Essentially, the first two "that"s are used to introduce detail, while the second "that" refers back to and comments on the entire statement.

    Use the JSON viewer to make the JSON structures more human-readable.

  2. Lexical Continuity: Using spaCy's linguistic features, we evaluate if two sentences share common linguistic elements, specifically nouns, proper nouns, verbs, adjectives, adverbs, subordinating conjunctions, and numbers. In essence, it seeks to capture the presence of words central to sentence meaning and structure. The check function check_lexical_continuity returns the common elements between both sentences that are indicators of lexical continuity (a sketch of this check appears after this list). Lexical continuity is vital to cohesion and coherence, ensuring that ideas are expressed and interconnected, offering a smoother and more logical transition between thoughts.

    While lexical continuity is concerned with what words are used, syntactic continuity is about how these words are structurally interrelated within the sentence, being able to detect syntactic patterns that may predict continuity.

    Example:

    • "The President announced a new climate policy focusing on renewable energy sources."

    • "This policy could dramatically shift the country's reliance on fossil fuels."

    In these sentences, the shared syntactic unit is "policy." The first sentence introduces a specific action by the President – announcing a new climate policy. The second sentence refers back to this policy, discussing its potential impact. This shared noun ("policy") establishes lexical continuity between the two sentences, indicating a clear, contextual link in the discourse about environmental policy.

    The check function would identify "policy" as the common element, confirming a lexical bridge between the two sentences, demonstrating that both are related, and ensuring the conversation remains focused on a specific topic.

    [0000000021] That doesn't even count all that the rest of the world is trying to do for the starving people of Afghanistan, or those who need food. [SEP] The problem in getting food to the people of Afghanistan is the Taliban tries to tax it, they threaten and harass and take away the equipment of U.N. workers.

    {
      'lexical_continuity': [
        'try',
        'people',
        'Afghanistan',
        'food'
      ]
    }

    The presence of several common syntactic units is a strong indicator of the continuity of both sentences since they share many aspects of the same topic. More than two concurrent syntactic units in a pair of sentences can be considered a good number to predict continuity.

    On the other hand, usually, the presence of only one syntactic unit when lexical continuity is the only continuity feature captured by the rule-based model is not enough to predict continuity. See the next example.

    [0000000832] The G20 countries could mitigate the vast majority of climate change, keeping warming to 1.7 degrees Celsius if we act now - and we must. [SEP] When our nations meet in Glasgow next month - and thank you, Mr. Speaker, for the hospitality of your country, Speaker of the U.K. and the cooperation of Italy for the G26 - the COP26.

    Despite the presence of the lemma "country" (the lemmatized form of both "country" and "countries"), there is a clear topic shift that determines the classification of this pair of sentences.

    [
      {
        'lexical_continuity': [
          'country'
        ]
      }
    ]

    Another example:

    [0000000845] And their parents, low-income Americans who desperately want to work, will have more ladders out of poverty. [SEP] Pass this jobs bill, and companies will get a $4, 000 tax credit if they hire anyone who has spent more than 6 months looking for a job.

    [
      {
        'lexical_continuity': [
          'more'
        ]
      }
    ]
  3. Syntactic Continuity: Using spaCy's linguistic features, we evaluate whether two sentences show commonality in the way words relate to each other within each sentence, through their dependency patterns, extending beyond the simple presence of shared words or phrases and delving into the deeper structure of sentences. The check function check_syntactic_continuity finds dependency patterns shared by two sentences to conclude that they exhibit syntactic continuity. A dependency pattern is a triad representing the relationship between words in a sentence, consisting of a head word, a dependent word, and the type of dependency that connects them.

    Example:

    [0000000959] It's hard to run a business if you're marching to war. [SEP] It's not conducive to capital investment.

    First, we examine the syntactic structures present in both sentences.

    The following visualization shows the syntactic structure of the relationships between words in the first sentence, "It's hard to run a business if you're marching to war.", based on dependency parsing. The term "ROOT" in dependency parsing refers to the central node of the parse tree, from which all words are connected either directly or indirectly. We can visually recognize the ROOT node in a dependency parsing visualization because only outgoing arrows come from it, connecting directly or indirectly all the other nodes. In the first sentence, the ROOT word is "'s", the contraction of "is", whose lemma is "be".

    Additionally, we can see that the word "it" is the nominal subject (nsubj) of the ROOT verb "'s" (lemma "be").

    Note: Scroll horizontally in the visualization to have a complete view of the dependency parsing.

    [Dependency parse visualization of the first sentence: the ROOT is "'s" (lemma "be"), with "It" as its nominal subject (nsubj).]

    The visualization of the second sentence, "It's not conducive to capital investment.", shows the same two patterns: the ROOT node is "'s" and the word "it" is the nominal subject (nsubj) of the lemmatized form of the verb "be".

    [Dependency parse visualization of the second sentence: the ROOT is "'s" (lemma "be"), with "It" as its nominal subject (nsubj).]

    The presence of both dependency patterns is represented as triads. The first triad says that the head word is the lemma "be" ("'s"), and since it is the ROOT node of the sentence, its dependent word is also "be". The type of dependency in this case is ROOT.

    The second triad expresses that the head word is again the lemma "be" (from "'s"), and the word "it" is the nominal subject (nsubj) of the verb "'s", the ROOT node expressed as the lemma "be".

    [
      {
        'syntactic_continuity': {
          ('be', 'be', 'ROOT'),
          ('be', 'it', 'nsubj')
        }
      },
      {
        'coreference': {
          'coreference_group_1': [
            {
              'coref': 'run',
              'start': 4,
              'end': 5
            },
            {
              'coref': 'It',
              'start': 14,
              'end': 15
            }
          ]
        }
      }
    ]

    Is the presence of both patterns enough to determine syntactic continuity? In this case, yes. However, the coreference also determines continuity since "it", from the second sentence, refers to "run".

    Let's see an example in which syntactic continuity defines continuity by its own means.

    [0000002833] That's the plan that will move us forward. [SEP] That's why I'm running for a second term.

    Both triads show that the two sentences have similar structures, each referring back to a preceding referent with "That" and using the verb "be" as the ROOT word.

    [
      {
        'syntactic_continuity': {
          ('be', 'be', 'ROOT'),
          ('be', 'that', 'nsubj')
        }
      }
    ]

    By reading both sentences, we find syntactic parallelism (or structural parallelism), where elements of sentences are structurally similar or symmetrical, particularly in terms of syntax. In general terms, parallelism refers to using components in a sentence that are grammatically the same or similar in their construction, sound, meaning, or meter.

    Parallelism is a powerful device used to create rhythm and emphasis, and to reinforce relationships between ideas in discourse or text. It makes a text or discourse more engaging and easier to understand. In political communication, it is common to find parallelism, which is used to articulate ideas more effectively and persuasively.

    Syntactic continuity is a nuanced continuity feature that may be used to define a classification even when both sentences do not share nouns (entities), verbs (relations), etc. However, the check function may capture irrelevant similar syntactic structures. This continuity must be evaluated together with other features, and with extra attention when other features are absent.

    The next example shows a pair of sentences with the presence of one syntactic pattern that is not enough to define syntactic continuity:

    [0000003149] That's because the economy is so good, and we've given incentives, and it's been an incredible success. [SEP] Employers are really, really happy.

    [
      {
        'syntactic_continuity': {
          ('be', 'be', 'ROOT')
        }
      }
    ]

    The dependency visualization of the first sentence:

    [Dependency parse visualization of the first sentence: the ROOT is "'s" (lemma "be").]

    And the dependency visualization of the second sentence:

    [Dependency parse visualization of the second sentence: the ROOT is "are" (lemma "be"), with "Employers" as its nominal subject (nsubj).]

    As observed, one syntactic pattern present in both sentences is not enough to define continuity. A hypothesis to confirm is that the presence of more than one syntactic pattern in both sentences is a better predictor of syntactic continuity, especially if the ROOT node is present in these patterns.

  4. Semantic Continuity: Leveraging spaCy's linguistic capabilities, the check_semantic_continuity function determines whether two sentences exhibit semantic connections by examining key semantic units such as nouns, verbs, adjectives, and adverbs. The function extracts key semantic units from each sentence, targeting noun chunks and individual tokens while explicitly excluding personal pronouns and other stop words. This selective process ensures that only the sentences' most relevant and meaningful components are considered. The function then vectorizes these semantic units using spaCy's vectorization features, converting textual information into a numerical format that can be computationally analyzed for similarity. This semantic comparison discerns a degree of continuity between the sentences. The approach ensures a focused and nuanced understanding of the sentences' meanings, complementing the lexical continuity function, which focuses on identical words: this check searches for the same meaning expressed lexically differently.

    Example:

    • "The administration is focusing on healthcare reform to improve access to medical services for all citizens."

    • "Efforts to expand medical coverage have been a central pillar of the government's agenda."

    Vectorizing these key terms would reveal a high degree of semantic similarity between both sentences. Though not immediately obvious through word choice or grammar, these semantic connections indicate a strong thematic link between the sentences.

    {
      'semantic_continuity': [
        ('medical services', 'medical coverage')
      ]
    }

    More examples:

    [0000000676] It's been a headache for everybody. [SEP] It's been a nightmare for many.

    The check function check_semantic_continuity found the phrases "a headache" and "a nightmare" semantically similar, creating a continuity link.

    [
      {
        'syntactic_continuity': {
          ('be', 'be', 'ROOT'),
          ('be', 'it', 'nsubjpass'),
          ('be', 'be', 'auxpass')
        }
      },
      {
        'semantic_continuity': [
          ('a headache', 'a nightmare')
        ]
      }
    ]

    [0000001642] A noose. [SEP] A gallows.

    This time, the phrases "a noose" and "a gallows" are semantically related concepts since a gallows is a frame, usually wood, made up of a horizontal crossbeam from which a noose or rope is suspended to hang a person.

    [
      {
        'semantic_continuity': [
          ('a noose', 'a gallows')
        ]
      }
    ]

    However, semantic continuity between two sentences involves understanding not only the literal meanings of words and phrases but also the contextual, pragmatic, and often subtle implications that span across sentences. Therefore, critical assessment to resolve the continuation between sentences is necessary, and the information provided by this check function must be evaluated in a wider context of the pair of sentences.

  5. Transition Marker Continuity: Transition markers are words or phrases that guide the reader through the flow of ideas, signaling whether the sentence continues a topic or shifts to a new one. The function check_transition_markers_continuity uses a lexicon of transition markers to categorize transition markers into different types based on their function and position in the sentence. The categories include "leading_markers" and "flexible_markers," each subdivided into "topic_continuity" and "topic_shift" markers.

    Leading markers are typically found at the beginning of a sentence and are definitive signals of the sentence’s intent, whether continuing a topic or introducing a new one. For instance, markers like "furthermore" or "however" clearly indicate the continuation or shift of the topic, respectively. Flexible markers, on the other hand, can appear anywhere in a sentence and often serve dual roles. They can signal continuity or shift depending on their usage and the sentence context. Markers like "as well" or "including" can either support the ongoing topic or subtly introduce a shift in the discourse.

    This analysis of transition markers is vital for understanding the text's structural and rhetorical strategies. It helps dissect complex arguments, follow narrative flows, and appreciate written communication's subtleties.

    Example:

    • "The government has increased funding for public education."

    • "However, many schools are still facing resource shortages."

    • "Despite these challenges, some educational institutions have made significant improvements."

    In these sentences, the transition markers "However" and "Despite" are key in signaling the direction of the discourse. The second sentence begins with "However," a leading marker from the "topic_shift" category, indicating that the following sentence will present a contrasting or opposing point to the one just made. This sets up the expectation for a shift in the discussion, moving from a statement about increased funding to a caveat about persistent issues in schools. The third sentence begins with "Despite," another leading marker, but this time indicating a continuation of the contrasting idea introduced by "However." It acknowledges the challenges mentioned in the first sentence but shifts the focus to a more positive aspect, highlighting improvements in some educational institutions.
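
The following sketch reconstructs the spirit of two of the check functions named above, check_lexical_continuity and check_syntactic_continuity. The project's real implementations are not published here, so treat the bodies as plausible approximations, assuming spaCy's en_core_web_md model:

    import spacy

    nlp = spacy.load("en_core_web_md")

    # Content-word POS tags named in the lexical continuity description.
    CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV", "SCONJ", "NUM"}

    def check_lexical_continuity(s1: str, s2: str) -> set:
        """Lemmas of content words shared by both sentences."""
        lemmas1 = {t.lemma_ for t in nlp(s1) if t.pos_ in CONTENT_POS}
        lemmas2 = {t.lemma_ for t in nlp(s2) if t.pos_ in CONTENT_POS}
        return lemmas1 & lemmas2

    def check_syntactic_continuity(s1: str, s2: str) -> set:
        """Dependency triads (head lemma, dependent lemma, relation) shared by both."""
        triads1 = {(t.head.lemma_, t.lemma_, t.dep_) for t in nlp(s1)}
        triads2 = {(t.head.lemma_, t.lemma_, t.dep_) for t in nlp(s2)}
        return triads1 & triads2

    s1 = "That's the plan that will move us forward."
    s2 = "That's why I'm running for a second term."
    print(check_syntactic_continuity(s1, s2))
    # Expected to include ('be', 'be', 'ROOT') and ('be', 'that', 'nsubj'),
    # matching the triads shown in the syntactic continuity example above.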

We dropped parallelism, logical continuity (BERT's NSP), and tense and aspect change from rule-based classification Model 1 because they either overlapped with features captured by other rules or contributed only marginally.

Model baseline

(soon)

Conclusions

Based on observation, we conclude that:

  1. Coreference: spaCy's coreference model performs very well, providing granular information about the references between sentences that is useful in the annotation process. However, a few obvious coreferences are not captured by the model and need human evaluation to classify the dataset instance correctly, especially in examples where coreference continuity is the only continuity feature between sentences. The presence of a coreference does not by itself define continuity: both sentences may refer to the same noun while still shifting the specific topic.

  2. Lexical continuity: The presence of one or more common syntactic units (identical words) is a strong indicator of continuity, and more than one shared unit is an even stronger indicator. However, critical assessment and the concurrence of other continuity features must be used to reach the correct classification. Lexical continuity is about "identical words," not identical meaning: meaning is the ultimate criterion for continuity.

  3. Syntactic continuity: This feature is very useful for disambiguating continuity, especially when it is aligned with other features.

  4. Transition markers: THIS FEATURE WILL BE HANDLED BY REYES; skip every example containing the transition_marker_continuity feature. Most of the leading continuation markers that define the "continue" class are "and" and "so".

  5. Semantic continuity: Present in only a few cases in Dataset 2; it may help disambiguate continuity once the deeper meaning and context of both sentences are understood.
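The sketch below shows one way to query the spaCy coreference model mentioned in conclusion 1 for a sentence pair. It assumes the experimental coreference component distributed through the spacy-experimental package and its en_coreference_web_trf pipeline; the model name and the "coref_clusters_*" span keys follow that experimental release and may change.

    # Sketch: inspecting spaCy coreference clusters for a sentence pair.
    # Assumes: pip install spacy-experimental and the experimental
    # en_coreference_web_trf pipeline; span keys may change in later releases.
    import spacy

    nlp = spacy.load("en_coreference_web_trf")

    s1 = "We're begging we have blackouts all over our country."
    s2 = "We've never had anything like that."
    doc = nlp(s1 + " " + s2)

    # Each cluster groups mentions of the same entity; a cluster whose
    # mentions span both sentences is a coreference cue for continuity.
    for key, mentions in doc.spans.items():
        if key.startswith("coref_clusters"):
            print(key, [(m.text, m.start_char) for m in mentions])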

Classification rules

  1. Flexibility and Critical Assessment: Since we aim to capture subtleties of natural language that strict rules miss, this classification task requires an understanding of the complexity and subjectivity of language, especially in a domain like political discourse, where rhetoric and persuasion play significant roles. This approach allows us to capture nuances that rigid rule-based systems would overlook.

  2. Topic Shift: A topic shift occurs when the main subject of the first sentence (S1) does not continue in the second sentence (S2). When the topic shifts, the entities of S1 may still be present in S2, but the focus (main topic) is different.

  3. Rules Cascade: Coreferences and syntactic parallelism should be used primarily in the assessment of continuity; lexical and semantic similarity between the sentences should be used secondarily. (A sketch of this cascade in code appears after this list.)

  4. Coreference: The presence of any coreference based on demonstrative, object, and possessive pronouns between the sentences defines continuity. This rule is strict and has higher weight than the others whenever any of the following anaphoric expressions refers to an entity mentioned in the previous sentence.

    • Demonstrative Pronoun: used to refer to specific things or people mentioned before. Example: "We're begging we have blackouts all over our country. We've never had anything like that."

      • That (not the relative pronoun "that"): refers to a singular noun that is farther away from the speaker.

      • Those: plural form of "that".

      • This: refers to a singular noun that is close to the speaker.

      • These: plural form of "this".

    • Object Pronoun: used as the object of a verb or preposition. Example: [0000001213] "I was very intrigued by dealing with a very strong woman who had been raised in a Communist country and what it meant-what it meant. I spent some time with her upstairs in the private dining quarters here in the White House complex, listening to her."

      • Me

      • You

      • Her

      • Him

      • It

      • Us

      • Them

    • Possessive Pronoun: used to indicate that something belongs to or is associated with an entity. Example: [0000000519] "He has proven instead only his contempt for the United Nations and for all his pledges. By breaking every pledge, by his deceptions, and by his cruelties, Saddam Hussein has made the case against himself."

      • Mine

      • Yours

      • Hers

      • His

      • Its

      • Ours

      • Theirs

  5. Coreference by Subject Pronoun: While subject pronouns also establish coreference, their evaluation requires nuanced critical assessment, because a sentence pair linked only by subject pronouns may still introduce a new topic or subtopic, breaking the line of topic continuity. Subject pronouns are used as the subject of a verb.

    • I

    • You

    • She

    • He

    • It

    • We

    • They

    Yes, if the subject pronoun refers to a named entity: "Jane is a worker. She's heard your stories."

    Yes, if the subject pronoun refers to another subject pronoun and the main topic continues and/or other continuity features are present: [0000000960] "Now we're marching to peace. We took the tough decision, but now we're marching to peace." The coreference among "we," "We," and "we" is accepted because the main topic continues, observable in the identical words (lexical continuity) and similar syntactic structures.

    No, if the subject pronoun refers to another subject pronoun and the main topic shifts: [0000000511] Now, we moved to west Texas 40 years ago, 40 years ago this year. [SEP] The war was over, and we wanted to get out and make it on our own.

  6. Syntactic Structure Similarity: Syntactic parallelism serves as a robust marker for discerning sentence continuity. The alignment of syntactic constructions reflects coherent and strategic text composition, so parallel structures expose the intentional and rhetorical organization of a text and the interplay between linguistic form and communicative function in discourse.

    Yes, if the syntactic structure is easily distinguishable in both sentences. In these examples the main topic continues along with the parallel syntactic structure: [0000000649] And now we're begging everybody for energy. We're begging we have blackouts all over our country. [0000001018] This is a crisis. It is a crisis that the Republicans in Congress are refusing to address.

    However, in this second example, the main topic ("inflation") does not continue in the second sentence, but the parallel syntactic structure continues: [0000000960] Inflation is low. Interest rates are low.

    No, if the syntactic structure is present in both sentences but is neither dominant nor clearly distinguishable. Example: [0000001129] There's never been anything like this. When the pandemic struck, there were zero tests for the China virus, but we've marshaled all of America's resources to achieve these unparalleled capabilities.

  7. Lexical Cohesion: The presence of identical words may define continuity, but other continuity features must be evaluated to conclude if the topic continues.

    In the "No" example below, the second sentence shifts from the main topic, One Warm Coat (an organization that collects donations), to another aspect of it: its location.

    Yes, if the main topic continues being the subject of the sentence. [0000001129] We slashed redtape and approved emergency use authorizations for 243 type of tests. [SEP] That's how many tests we have.

    No, if the main topic shifts to another topic or subtopic. [0000008389] Thousands of people participate in One Warm Coat, and there are over 450 distribution centers in all 50 states. The location for today's One Warm Coat coat drive is Pathways to Housing, D.C., and they are one of the distribution agencies across the country.

  8. Semantic Similarity: This involves the analysis of the meaning behind the words and sentences. It's crucial for determining thematic progression in a text.

    Yes, if the semantic similarity is about the main topic in both sentences. [] (find example)

    No, if there is semantic similarity with an entity in the other sentence but not with the main topic. [0000004235] On Pakistan more broadly, you've got the recent missile strike that has caused so much difficulty. You've got the floods that have caused so much instability. Another example: [0000001135] We have not seen hospital overcrowding. We're doubling down now with testing to protect vulnerable people not just inside nursing homes, but in residential settings and senior daycare centers.
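The rules above can be approximated in code to pre-compute cues for the annotator. The sketch below is a simplified, illustrative version of the cascade, assuming spaCy's en_core_web_md model; the pronoun sets come from rules 4 and 5, but the overlap, parallelism, and similarity measures are stand-ins, not the project's definitive implementation.

    # Sketch of the rules cascade (rules 3-8) for a sentence pair.
    # Assumes spaCy's en_core_web_md model (tokens, POS tags, word vectors);
    # all measures are illustrative cues for the annotator, not a verdict.
    import spacy

    nlp = spacy.load("en_core_web_md")

    # Rule 4 pronouns. Note: this naive check does not exclude the
    # relative pronoun "that".
    STRICT_COREF = {
        "that", "those", "this", "these",                         # demonstrative
        "me", "you", "her", "him", "it", "us", "them",            # object
        "mine", "yours", "hers", "his", "its", "ours", "theirs",  # possessive
    }
    SUBJECT_PRONOUNS = {"i", "you", "she", "he", "it", "we", "they"}  # rule 5

    def continuity_cues(s1: str, s2: str) -> dict:
        d1, d2 = nlp(s1), nlp(s2)
        cues = {}
        # Rule 4: strict coreference pronouns appearing in S2 (resolving
        # what they refer to is left to the annotator or a coref model).
        cues["strict_coref"] = [t.text for t in d2 if t.lower_ in STRICT_COREF]
        # Rule 5: subject pronouns need nuanced assessment, so only flag them.
        cues["subject_pronouns"] = [t.text for t in d2
                                    if t.lower_ in SUBJECT_PRONOUNS
                                    and t.dep_ == "nsubj"]
        # Rule 7: lexical cohesion as content-word (lemma) overlap.
        def content(d):
            return {t.lemma_.lower() for t in d
                    if t.pos_ in ("NOUN", "PROPN", "VERB", "ADJ")}
        cues["lexical_overlap"] = sorted(content(d1) & content(d2))
        # Rule 6: crude parallelism as the length of the shared POS prefix.
        prefix = 0
        for a, b in zip([t.pos_ for t in d1], [t.pos_ for t in d2]):
            if a != b:
                break
            prefix += 1
        cues["pos_prefix_match"] = prefix
        # Rule 8: semantic similarity from the model's word vectors.
        cues["semantic_similarity"] = round(d1.similarity(d2), 3)
        return cues

    # The [0000000960] example: parallel structure, but the topic shifts.
    print(continuity_cues("Inflation is low.", "Interest rates are low."))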

Guiding Principles

  1. In a classification task, it is better to remove an ambiguous example than to accept a datapoint that would teach the model something wrong or confusing. The ambiguous example must be flagged as "discuss" (in red - see the legend tab), and Reyes will make the final decision.

  2. We aim to capture passages about a specific topic at a granular level. The main principle is therefore to decide topic continuity/non-continuity by finding the minimal sign of topic shift, even when the main topic remains the same but the discourse moves to another aspect of it. Extracting passages at a granular level means extracting passages of only a few sentences about something very specific within a topic of interest.

Error Analysis Task

  1. Create gold standard: The gold standard dataset is composed of manually labelled examples that have been reliably annotated and are considered the benchmark or reference for the correct outcome. It will consist of 900 manually annotated examples that represent perfect examples of both classes. Each annotator (Nitin, Pranesh, and Reyes) will annotate 300 examples, divided into 150 datapoints per class. Since the "not_continue" class is scarcer than the "continue" class, we will begin by classifying examples labelled as "not_continue". We will not follow a sequential order but a random one that ensures a variety of examples from different passages - no sequential examples will be annotated. This task will be completed in 1 week.

  2. Baseline fine-tuning: The 900 gold examples will be split into a "training" dataset (800 examples) and a "validation" dataset (90 examples), which will be used to fine-tune a BERT model for binary classification in 2 or 3 iterations. The remaining datapoints will compose the "test" dataset, which is imperfect and contains misclassifications.

  3. Prediction and Error Logging: In each iteration, using the trained model, we will predict on a portion of the test set while logging the instances where the model's predictions do not match the labels. Each flagged classification will be assessed individually and reclassified if necessary (see the sketch after this list).

  4. Iterative Approach: The test dataset will be split into three parts that will be processed weekly, iterating until the whole dataset is correctly classified. The error analysis task will be completed in three weeks.
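The sketch below illustrates the baseline fine-tuning and error-logging steps with Hugging Face Transformers. The file names and the "text"/"label" column names are hypothetical placeholders; the model checkpoint and hyperparameters are assumptions for illustration, not the project's final configuration.

    # Sketch: baseline fine-tuning + logging test instances to reassess.
    # File names and column names ("text", "label") are placeholders.
    import numpy as np
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)  # continue / not_continue

    data = load_dataset("csv", data_files={
        "train": "gold_train.csv",            # 800 gold examples
        "validation": "gold_validation.csv",  # 90 gold examples
        "test": "test_unverified.csv",        # remaining, imperfect labels
    })

    def tokenize(batch):
        # Sentence pairs are stored joined with [SEP] in the "text" column.
        return tokenizer(batch["text"], truncation=True, padding="max_length")

    data = data.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="baseline", num_train_epochs=3),
        train_dataset=data["train"],
        eval_dataset=data["validation"],
    )
    trainer.train()

    # Error logging: keep the test instances whose prediction disagrees
    # with the (imperfect) label, so they can be reassessed manually.
    preds = trainer.predict(data["test"])
    pred_labels = np.argmax(preds.predictions, axis=-1)
    mismatches = [i for i, (p, y)
                  in enumerate(zip(pred_labels, preds.label_ids)) if p != y]
    print(f"{len(mismatches)} instances to reassess:", mismatches[:10])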

Procedure

  1. Assess the datapoint and define class.

  2. Select a value in the "reclass_3" attribute.

  3. The corresponding color will be added automatically to the row, according to the legend.

  4. Select your name in the "annotator_3" attribute.

  5. In case of an ambiguous datapoint, add "discuss" in the "note" attribute, or add a clarifying note if necessary.