IE Seminar

Information Extraction and Knowledge Graph Population

Institute of Computer Science, Brandenburgische Technische Universität Cottbus-Senftenberg
Juan-Francisco Reyes
pacoreyes@protonmail.com

### THIS PAGE WILL BE UPDATED CONTINUALLY BASED ON INTERACTIONS IN THE FORUM, SO CHECK BACK OFTEN

Overview

This document describes the process of building an information extraction system that extracts political statements at the passage level, together with information in the form of triples, to populate a knowledge graph.

This project is completed individually, not in groups.

This tutorial provides only parts of the code needed to complete the project. Participants are expected to put the parts together and fill in the gaps to achieve the project's final goal. Use the forum to request guidance from the lecturer and classmates.

File Structure

Part 1: Build the Pipeline for Passage Extraction

  1. Collect the models:

  2. Collect political discourses: For this project, you will use the Rev.com library exclusively as the source to collect 10 texts, 5 monologic (i.e. speeches) and 5 dialogic (i.e. interviews). Examples: monologic, Trump Speaks After New York Civil Fraud Ruling Transcript; dialogic, Fed Chair Jerome Powell 2024 60 Minutes Interview Transcript. All political discourses MUST BE from the year 2024, not before.

  3. Preprocess political discourse texts:

    • Cleaning: Remove contextual information, such as the introduction to the interview, speeches by other speakers, dates, and notes at the end, so that you strip the text down to its monological or dialogical content. We recommend using any plain-text editor.

    • Normalize text: Use the preprocess_text function and enable/disable parameters according to your needs. We recommend the following setup of the function:

      import spacy
      
      nlp = spacy.load("en_core_web_trf")
      
      sent = preprocess_text(sent, nlp,
                                 with_remove_known_unuseful_strings=True,
                                 with_remove_parentheses_and_brackets=True,
                                 with_remove_text_inside_parentheses=True,
                                 with_remove_leading_patterns=True,
                                 with_remove_timestamps=True,
                                 with_replace_unicode_characters=True,
                                 with_expand_contractions=False,
                                 with_remove_links_from_text=True,
                                 with_put_placeholders=False,
                                 with_final_cleanup=False)

      The use of spaCy's transformer model (en_core_web_trf) is optional, but it yields improved results in tasks like named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, and text classification. For development, you can use en_core_web_sm, en_core_web_md, or en_core_web_lg.

  4. Infer the text class using a sliding-window approach: Split the text into windows of no more than 512 tokens, measured with the Transformers library tokenizer BertTokenizer. Discard windows with fewer than roughly 400-450 tokens, because the model may not be sensitive enough to classify only a few tokens. Pass each window through the model, average the prediction scores across windows, and compare the average against a threshold to decide whether the text is monologic or dialogic.

    The code for the sliding window approach was provided in Project 1.

    The following code makes predictions using a BERT model. Use it as a template to develop your own code, including the sliding-window approach; a sketch of the window splitting and score averaging follows this code. Use this code for Model 1.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification
    import torch.nn.functional as F
    
    # Replace this with your model's name or path if you used a different model
    MODEL_NAME = 'bert-base-uncased'
    
    # 1. Load Pre-trained Model
    model = BertForSequenceClassification.from_pretrained(MODEL_NAME)
    
    # 2. Load Saved Weights
    # map_location avoids device-mismatch errors when the weights were saved on a GPU
    model.load_state_dict(torch.load('models/example.pth', map_location='cpu'))
    
    # 3. Prepare Model for Evaluation
    model.eval()
    
    # 4. Preprocess Input Data
    tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
    
    file_path = 'new_text.txt'
    with open(file_path, 'r') as file:
        lines = file.readlines()
    
    def preprocess(text):
        inputs = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt")
        return inputs
    
    
    # 5. Make Predictions
    def predict(text):
        inputs = preprocess(text)
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        probabilities = F.softmax(logits, dim=1)
        return probabilities
    
    for line in lines:
        data_point = line.strip()
        result = predict(data_point)
        print(result)
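
    The following is a minimal sketch of how the window splitting and score averaging could look. It reuses the tokenizer, the predict function, and the lines read in the template above; the 400-token minimum, the 0.5 threshold, and the class order (index 0 = monologic) are assumptions that you should adapt to your own model.

    import torch

    def split_into_windows(text, tokenizer, max_tokens=512, min_tokens=400):
        # Tokenize the whole text once, then slice the token ids into fixed-size windows
        token_ids = tokenizer.encode(text, add_special_tokens=False)
        windows = []
        for start in range(0, len(token_ids), max_tokens):
            chunk = token_ids[start:start + max_tokens]
            if len(chunk) < min_tokens:
                continue  # too few tokens for a reliable prediction
            windows.append(tokenizer.decode(chunk))
        return windows

    # Average the class probabilities over all windows and apply a threshold
    full_text = " ".join(line.strip() for line in lines)
    window_scores = [predict(window)[0] for window in split_into_windows(full_text, tokenizer)]
    avg_scores = torch.stack(window_scores).mean(dim=0)
    label = "monologic" if avg_scores[0] > 0.5 else "dialogic"  # assumed class order and threshold
    print(avg_scores, label)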

  5. Handle Speaker-Specific Text Extraction: Since we will focus on political statements by specific politicians, in the case of dialogic texts we need to remove the text segments that do not belong to the politicians of interest. For example, from the interview Fed Chair Jerome Powell 2024 60 Minutes Interview Transcript we will extract statements by Jerome Powell; therefore, we will remove all segments unrelated to him. This step is important for attributing the authorship of the statement later—we don't want a politician's statement to be attributed to another politician.

    Use any means to identify the parts belonging to the politician of interest by matching speaker labels, such as spaCy's matchers or regular expressions (REGEX); consider using ChatGPT for REGEX creation. A minimal REGEX sketch follows Figure 1.

    In Figure 1, you can see how the original text is processed to remove the parts (shown dimmed) that do not belong to the politician of interest (Jerome Powell), while also excluding speaker labels and timestamps.
    Figure 1. Speaker Specific Text Extraction.
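
    The following is a minimal REGEX sketch, assuming Rev.com-style speaker labels of the form "Name (MM:SS):". The transcript excerpt and the exact label format are illustrative placeholders; adapt the pattern to your files.

    import re

    # Hypothetical transcript excerpt with Rev.com-style speaker labels and timestamps
    transcript = (
        "Scott Pelley (00:10): Welcome, Mr. Chairman.\n"
        "Jerome Powell (00:15): Thank you. It is good to be here.\n"
        "Scott Pelley (00:20): Let's talk about inflation.\n"
        "Jerome Powell (00:25): Inflation has come down significantly."
    )

    SPEAKER = "Jerome Powell"

    # Capture the speech after the speaker label and timestamp, up to the next label or the end of the text
    pattern = re.compile(
        re.escape(SPEAKER) + r"\s*\(\d{1,2}:\d{2}(?::\d{2})?\):\s*(.*?)(?=\n[A-Z][^\n]*\(\d{1,2}:\d{2}|\Z)",
        re.DOTALL,
    )

    segments = [match.group(1).strip() for match in pattern.finditer(transcript)]
    print(segments)  # statements by the politician of interest, without labels or timestamps
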
  6. Detect Stance with Model 3: At this point, we should know which politician authored a discourse, and we will use spaCy to iterate sentence by sentence to find political issues of interest. Use the match_issues function for that purpose.

    import spacy
    
    # Load the spaCy model
    nlp = spacy.load("en_core_web_sm")
    
    text = "..."  # replace with an individual text from one political discourse
    
    # Process the text with spaCy
    doc = nlp(text)
    
    # Iterate over the sentences
    for sent in doc.sents:
      print(sent.text)
    
      # check if sentence has any political issue
      matches = match_issues(sent.text)
      matches = list({_match[4] for _match in matches})
      if not matches:
        continue

    Whenever a sentence contains a political issue, pass it to the SetFit stance model (Model 3) for stance classification. Choose the class with the higher prediction confidence score, keep the best sentences, and save the stance (support/oppose).
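
    The following is a minimal sketch of the SetFit call. The model path models/3/setfit_stance and the label order (support, oppose) are placeholders; adapt them to your own trained model.

    from setfit import SetFitModel

    # Hypothetical path to the trained SetFit stance model (Model 3)
    stance_model = SetFitModel.from_pretrained("models/3/setfit_stance")

    sentence = "We must expand access to affordable healthcare."
    probabilities = stance_model.predict_proba([sentence])[0]  # one probability per class

    labels = ["support", "oppose"]  # assumed label order; check your training setup
    stance = labels[int(probabilities.argmax())]
    confidence = float(probabilities.max())
    print(stance, confidence)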

  7. Detect passage boundaries with Model 2: Once you detect a sentence with a stance toward a political issue of interest, navigate the text (upwards and downwards) to find out whether adjacent sentences continue each other, defining the boundaries of a passage. Use Model 2 to evaluate pairs of sentences and navigate the text to identify where a passage begins and ends.

    The following code is used to make inferences with Model 2, which was trained on pairs of sentences. Notice the use of the tokenizer method encode_plus, which handles the two sentences separately, contrary to the Model 1 inference code above, which expects only one text segment and calls the tokenizer directly.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification
    
    
    def get_device():
      """Returns the appropriate device available in the system: CUDA, MPS, or CPU"""
      if torch.backends.mps.is_available():
        return torch.device("mps")
      elif torch.cuda.is_available():
        return torch.device("cuda")
      else:
        return torch.device("cpu")
    
    
    device = get_device()
    print(f"\nUsing device: {str(device).upper()}\n")
    
    # Replace this with your model's name or path if you used a different model
    MODEL_NAME = 'bert-base-uncased'
    
    # Load Pre-trained Model
    model = BertForSequenceClassification.from_pretrained(MODEL_NAME)
    
    # Load Saved Weights
    # map_location avoids device-mismatch errors when the weights were saved on a different device
    model.load_state_dict(torch.load('models/2/paper_b_hop_bert_reclass.pth', map_location=device))
    
    # Move model to device
    model.to(device)
    
    # Prepare Model for Evaluation
    model.eval()
    
    # Initialize Tokenizer
    tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
    
    
    def predict_with_bert(_model, _tokenizer, _text):
        # Tokenize and preprocess the input text
        sentence1, sentence2 = _text.split('[SEP]')
        encoded_input = _tokenizer.encode_plus(
            text=sentence1.strip(),
            text_pair=sentence2.strip(),
            add_special_tokens=True,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
    
        # Move the tensors to the same device as the model
        input_ids = encoded_input['input_ids'].to(device)
        attention_mask = encoded_input['attention_mask'].to(device)
    
        # Tell model not to compute or store gradients, saving memory and speeding up prediction
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            outputs = _model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
    
        # Convert logits to probabilities
        _probabilities = torch.nn.functional.softmax(logits, dim=1)
    
        # Optionally, convert probabilities to class labels
        # pred_class = _probabilities.argmax(dim=1)
    
        # Convert to numpy arrays for easier handling
        _probabilities = _probabilities.cpu().numpy()
        # pred_class = pred_class.cpu().numpy()
    
        return _probabilities  # , pred_class
    
    
    # Example usage
    text = "Tokyo is rich. [SEP] Yes, it is rich."
    probabilities = predict_with_bert(model, tokenizer, text)
    print("Probabilities:", probabilities)
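
    As a minimal sketch of the boundary navigation (not a required implementation), a passage could be grown around the seed sentence by repeatedly asking Model 2 whether adjacent sentence pairs continue each other. It reuses model, tokenizer, and predict_with_bert from the code above; the 0.5 threshold and the class order (index 1 = "continue") are assumptions to adapt to your trained model.

    CONTINUE_THRESHOLD = 0.5  # assumed decision threshold

    def sentences_continue(sent_a, sent_b):
        # True if Model 2 predicts that sent_b continues sent_a
        probs = predict_with_bert(model, tokenizer, f"{sent_a} [SEP] {sent_b}")
        return probs[0][1] > CONTINUE_THRESHOLD  # assumed class order: index 1 = "continue"

    def expand_passage(sentences, seed_index):
        """Grow the passage around the seed sentence detected in step 6."""
        start, end = seed_index, seed_index
        # Navigate upwards while the previous sentence continues into the current one
        while start > 0 and sentences_continue(sentences[start - 1], sentences[start]):
            start -= 1
        # Navigate downwards while the current sentence continues into the next one
        while end < len(sentences) - 1 and sentences_continue(sentences[end], sentences[end + 1]):
            end += 1
        return " ".join(sentences[start:end + 1])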

    Save the passage, including the author (politician), the stance, and the political issues found.

Part 2: Build the Pipeline for Triple Extraction and Knowledge Graph Population

From the extracted passages, we will extract named entities and relations, creating triples that will be stored in an instance of the Neo4j graph database.

Follow these steps:

  1. Create a Neo4j graph database instance. You have two free options:

    1. Create a cloud-based database using Neo4j Aura Professional.

    2. Create a local database using Neo4j Desktop.

    Your notebook will process the passages to extract semantic triples that will populate your KG (a minimal sketch of writing triples to Neo4j with the Python driver appears after this list).

  2. In the notebook 8_spacy_rule-based-matching_for_re.ipynb, you will find a way to extract entities and relations based on matching rules. Adapt the code to your use case and add the entities and relations that you find in the extracted passages.

  3. Create a single Colab notebook where you can test your NLP pipeline. The expected output of this project is at least 10 triples.

  4. Add meaningful comments to increase readability.

  5. In the final presentation, you will execute each cell to populate your KG with at least 10 triples.
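
As referenced in step 1, the following is a minimal sketch of how extracted triples could be written to Neo4j with the official neo4j Python driver (version 5.x assumed). The connection URI, the credentials, the Entity/RELATION labels, and the example triple are placeholders to adapt to your own instance and schema.

  from neo4j import GraphDatabase

  # Placeholder connection details; use the credentials of your Aura or Desktop instance
  URI = "neo4j+s://<your-instance>.databases.neo4j.io"
  AUTH = ("neo4j", "<password>")

  def add_triple(tx, subj, rel, obj):
      # MERGE avoids creating duplicate nodes and relationships for repeated triples;
      # the relation name is stored as a property because Cypher cannot parameterize relationship types
      tx.run(
          "MERGE (s:Entity {name: $subj}) "
          "MERGE (o:Entity {name: $obj}) "
          "MERGE (s)-[:RELATION {type: $rel}]->(o)",
          subj=subj, rel=rel, obj=obj,
      )

  # Example triple extracted from a passage (placeholder)
  triples = [("Jerome Powell", "CHAIR_OF", "Federal Reserve")]

  with GraphDatabase.driver(URI, auth=AUTH) as driver:
      with driver.session() as session:
          for subj, rel, obj in triples:
              session.execute_write(add_triple, subj, rel, obj)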

IMPORTANT: It is recommended to review the Linguistic Features section of the spaCy documentation.