Information Extraction and Knowledge Graph Population
Institute of Computer Science, Brandenburgische
Technische Universität Cottbus-Senftenberg
Juan-Francisco
Reyes
pacoreyes@protonmail.com
### THIS PAGE WILL BE UPDATED CONTINUALLY BASED ON INTERACTIONS IN THE FORUM; RETURN OFTEN
This document describes the process of building an information extraction system that extracts political statements at the passage level, together with information in the form of triples, to create a knowledge graph.
This project is individual, not in groups.
This tutorial provides only parts of the code necessary to complete this project. It is expected that participants put the parts together and fill in the gaps to achieve the project's final goal. Therefore, you are advised to use the forum to request guidance from the lecturer and classmates.
lib/
text_utils.py, use the preprocess_text function to clean and prepare text for downstream NLP tasks.
issues_matcher.py, use it to match political issues of interest in text.
shared_data/
political_issues.jsonl, the list of political issues of interest; queried by the match_issues function in issues_matcher.py.
political_issues_gazetteers.jsonl, used by the issues_matcher.
Collect the models:
Download the BERT model for discourse classification, Model 1 (1_bert_monologic_dialogic_classification.pth).
Download the BERT model for passage boundary detection, Model 2 (3_bert_passage_boundary_detection.pth).
Prepare the SetFit model for stance detection, Model 3 (trained by you in Project 3).
Collect political discourses: For this project, you will use the Rev.com library exclusively as the source to collect 10 texts, 5 monologic (i.e., speeches) and 5 dialogic (i.e., interviews). Examples: monologic, Trump Speaks After New York Civil Fraud Ruling Transcript; dialogic, Fed Chair Jerome Powell 2024 60 Minutes Interview Transcript. All political discourses MUST BE from the year 2024, not before.
Preprocess political discourse texts:
Cleaning: Remove contextual information, such as the introduction to the interview or speech by another speaker, dates, and notes at the end, so that the text is stripped down to its monologic or dialogic content. We recommend using any plain-text editor.
Normalize text: Use the preprocess_text function, enabling or disabling parameters according to your needs. We recommend the following setup:
```python
import spacy

from lib.text_utils import preprocess_text

nlp = spacy.load("en_core_web_trf")

sent = preprocess_text(sent, nlp,
                       with_remove_known_unuseful_strings=True,
                       with_remove_parentheses_and_brackets=True,
                       with_remove_text_inside_parentheses=True,
                       with_remove_leading_patterns=True,
                       with_remove_timestamps=True,
                       with_replace_unicode_characters=True,
                       with_expand_contractions=False,
                       with_remove_links_from_text=True,
                       with_put_placeholders=False,
                       with_final_cleanup=False)
```
The use of spaCy's transformer model is optional, but it yields improved results in tasks like named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, and text classification. For development, you can use en_core_web_sm, en_core_web_md, or en_core_web_lg.
Make inferences of the text class using the sliding-window approach: Split the text into windows of no more than 512 tokens, measured with the Transformers BertTokenizer. Discard windows with fewer than 400-450 tokens, because the model may not be sensitive enough to classify just a few tokens. Pass each window through the model and average the prediction scores; compare the average against a threshold to decide whether the text is monologic or dialogic.
The code for the sliding window approach was provided in Project 1.
The following code makes predictions using a BERT model. Use it as a template to develop your own code, including the sliding-window approach. Use this code for Model 1.
```python
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertForSequenceClassification

# Replace this with your model's name or path if you used a different model
MODEL_NAME = 'bert-base-uncased'

# 1. Load Pre-trained Model
model = BertForSequenceClassification.from_pretrained(MODEL_NAME)

# 2. Load Saved Weights (map_location allows loading on CPU-only machines)
model.load_state_dict(torch.load('models/example.pth', map_location='cpu'))

# 3. Prepare Model for Evaluation
model.eval()

# 4. Preprocess Input Data
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

file_path = 'new_text.txt'
with open(file_path, 'r') as file:
    lines = file.readlines()


def preprocess(text):
    inputs = tokenizer(text, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    return inputs


# 5. Make Predictions
def predict(text):
    inputs = preprocess(text)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    probabilities = F.softmax(logits, dim=1)
    return probabilities


for line in lines:
    data_point = line.strip()
    result = predict(data_point)
    print(result)
```
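As a starting point for the sliding-window logic (not the Project 1 code), the following is a minimal sketch. It assumes it can reuse the tokenizer and predict names from the template above; the window size, minimum-window threshold, decision threshold, and the choice of class index 1 as "dialogic" are all assumptions you should adapt to your setup.

```python
# Minimal sliding-window sketch (assumption: reuses `tokenizer` and `predict`
# from the template above; sizes, thresholds, and class index are placeholders).
WINDOW_SIZE = 510        # leave room for [CLS] and [SEP]
MIN_WINDOW_TOKENS = 400  # discard windows shorter than this


def classify_text(text, threshold=0.5):
    # Tokenize the full text once, then split the token ids into windows
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    windows = [token_ids[i:i + WINDOW_SIZE]
               for i in range(0, len(token_ids), WINDOW_SIZE)]
    scores = []
    for window in windows:
        if len(window) < MIN_WINDOW_TOKENS:
            continue  # too few tokens for a reliable prediction
        window_text = tokenizer.decode(window)
        probabilities = predict(window_text)  # tensor of shape (1, num_classes)
        scores.append(probabilities[0][1].item())  # assumption: class 1 = dialogic
    if not scores:
        return None  # text too short to classify
    avg_score = sum(scores) / len(scores)
    return "dialogic" if avg_score >= threshold else "monologic"
```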
Handle Speaker-Specific Text Extraction: Since we will focus on political statements by specific politicians, in the case of dialogic texts we need to remove the text segments that do not belong to those politicians of interest. For example, say that from the interview Fed Chair Jerome Powell 2024 60 Minutes Interview Transcript we will extract statements by Jerome Powell; therefore, we will remove all segments unrelated to him. This step is important for attributing the authorship of the statement later: we don't want a politician's statement to be attributed to another politician.
Use any means to identify the parts belonging to the politician of interest by matching speaker labels, e.g., spaCy's matchers or regular expressions (consider using ChatGPT for regex creation).
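A minimal regex sketch is shown below. It assumes Rev.com transcripts label each turn on its own line in the form "Speaker Name (HH:MM:SS):", which you should verify against your downloaded files and adjust accordingly.

```python
import re

# Assumption: each speaker turn starts with a line like "Jerome Powell (01:23:45):";
# adjust the pattern to the actual format of your transcripts.
SPEAKER_PATTERN = re.compile(r"^(?P<speaker>[A-Z][\w.' -]+?)\s*\((?:\d{1,2}:)?\d{1,2}:\d{2}\):\s*$")


def extract_speaker_turns(transcript, speaker_of_interest):
    """Return only the text segments spoken by the politician of interest."""
    segments, current_speaker, buffer = [], None, []
    for line in transcript.splitlines():
        match = SPEAKER_PATTERN.match(line.strip())
        if match:
            # A new speaker label starts a new turn; flush the previous one
            if current_speaker == speaker_of_interest and buffer:
                segments.append(' '.join(buffer))
            current_speaker, buffer = match.group('speaker'), []
        elif line.strip():
            buffer.append(line.strip())
    if current_speaker == speaker_of_interest and buffer:
        segments.append(' '.join(buffer))
    return segments
```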
In Figure 1 you see how the original text is processed to remove the parts (shown as opaque) that do not belong to the politician of interest (Jerome Powell), also excluding speaker labels and timestamps.
Detect Stance with Model 3: At this point, we know which politician is the author of a discourse, and we use spaCy to iterate sentence by sentence to find political issues of interest. Use the match_issues function for that purpose.
```python
import spacy

from lib.issues_matcher import match_issues

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

text = ...  # an individual text from one political discourse

# Process the text with spaCy
doc = nlp(text)

# Iterate over the sentences
for sent in doc.sents:
    print(sent.text)
    # Check if the sentence mentions any political issue
    matches = match_issues(sent.text)
    matches = list({_match[4] for _match in matches})
    if not matches:
        continue
```
Whenever a sentence contains a political issue, pass it to the SetFit model (Model 3) for stance classification. Choose the label with the higher prediction confidence score and save the stance (support/oppose).
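A minimal sketch of that step follows. The models/setfit_stance path and the label order (class 0 = oppose, class 1 = support) are assumptions; use the path and label mapping from your own Project 3 training.

```python
from setfit import SetFitModel

# Assumption: local path where you saved the SetFit stance model from Project 3
model = SetFitModel.from_pretrained("models/setfit_stance")


def detect_stance(sentence):
    # predict_proba returns one probability per class for each input text
    probabilities = model.predict_proba([sentence])[0]
    # Assumption: class 0 = "oppose", class 1 = "support"; check your training labels
    label = "support" if probabilities[1] > probabilities[0] else "oppose"
    confidence = float(max(probabilities))
    return label, confidence


label, confidence = detect_stance("We must expand access to affordable healthcare.")
print(label, confidence)
```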
Detect passage boundaries with Model 2: Once you detect a sentence with a stance toward a political issue of interest, navigate the text (upwards and downwards) to find out whether adjacent sentences continue the same passage, thereby defining its boundaries. Use Model 2 to evaluate pairs of sentences and identify where a passage begins and ends; a sketch of this expansion loop follows the inference code below.
The following code is used to make inferences with Model 2, which was trained on pairs of sentences. Notice the use of the tokenizer method encode_plus, which handles the two sentences separately, contrary to the previous inference code, which expects a single text segment and uses the regular tokenizer call.
```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification


def get_device():
    """Returns the appropriate device available in the system: CUDA, MPS, or CPU"""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    elif torch.cuda.is_available():
        return torch.device("cuda")
    else:
        return torch.device("cpu")


device = get_device()
print(f"\nUsing device: {str(device).upper()}\n")

# Replace this with your model's name or path if you used a different model
MODEL_NAME = 'bert-base-uncased'

# Load Pre-trained Model
model = BertForSequenceClassification.from_pretrained(MODEL_NAME)

# Load Saved Weights
model.load_state_dict(torch.load('models/2/paper_b_hop_bert_reclass.pth',
                                 map_location=device))

# Move model to device
model.to(device)

# Prepare Model for Evaluation
model.eval()

# Initialize Tokenizer
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)


def predict_with_bert(_model, _tokenizer, _text):
    # Tokenize and preprocess the input text
    sentence1, sentence2 = _text.split('[SEP]')
    encoded_input = _tokenizer.encode_plus(
        text=sentence1.strip(),
        text_pair=sentence2.strip(),
        add_special_tokens=True,
        max_length=512,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    # Move the tensors to the same device as the model
    input_ids = encoded_input['input_ids'].to(device)
    attention_mask = encoded_input['attention_mask'].to(device)
    # Tell the model not to compute or store gradients,
    # saving memory and speeding up prediction
    with torch.no_grad():
        # Forward pass, calculate logit predictions
        outputs = _model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
    # Convert logits to probabilities
    _probabilities = torch.nn.functional.softmax(logits, dim=1)
    # Optionally, convert probabilities to class labels
    # pred_class = _probabilities.argmax(dim=1)
    # Convert to numpy arrays for easier handling
    _probabilities = _probabilities.cpu().numpy()
    # pred_class = pred_class.cpu().numpy()
    return _probabilities  # , pred_class


# Example usage
text = "Tokyo is rich. [SEP] Yes, it is rich."
probabilities = predict_with_bert(model, tokenizer, text)
print("Probabilities:", probabilities)
```
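To turn the pairwise predictions into passage boundaries, one possible expansion loop is sketched below. It assumes the predict_with_bert function above, a list of the discourse's sentences, and that class index 1 means "the two sentences continue the same passage"; verify which index your Model 2 actually assigns to that class.

```python
# Sketch of passage-boundary expansion around a seed sentence.
# Assumptions: `sentences` is the list of the discourse's sentences, `seed` is
# the index of the sentence with a detected stance, and class index 1 of
# Model 2 means "same passage" (verify against your training labels).
def expand_passage(sentences, seed, threshold=0.5):
    start, end = seed, seed
    # Navigate upwards while the previous sentence continues the passage
    while start > 0:
        pair = f"{sentences[start - 1]} [SEP] {sentences[start]}"
        if predict_with_bert(model, tokenizer, pair)[0][1] < threshold:
            break
        start -= 1
    # Navigate downwards while the next sentence continues the passage
    while end < len(sentences) - 1:
        pair = f"{sentences[end]} [SEP] {sentences[end + 1]}"
        if predict_with_bert(model, tokenizer, pair)[0][1] < threshold:
            break
        end += 1
    return ' '.join(sentences[start:end + 1])
```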
Save the passage, including the author (politician), the stance, and the political issues found.
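For example, one minimal way to persist each passage is one JSON object per line; the passages.jsonl path and the field names are assumptions, so choose your own schema.

```python
import json

# Assumption: illustrative field names and output path; adapt to your schema
passage_record = {
    "author": "Jerome Powell",
    "stance": "support",
    "issues": ["inflation"],
    "passage": "..."
}

# Append one JSON object per line (JSONL format)
with open("passages.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(passage_record) + "\n")
```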
From the extracted passages, we will extract named entities and relations, creating triples that will be stored in an instance of the Neo4j graph database.
Follow these steps:
Create a Neo4j graph database instance. You have two free options:
Create a cloud-based database using Neo4j Aura Professional.
Create a local database using Neo4j Desktop.
Your notebook will process the passages to extract semantic triples that will populate your KG.
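To write triples into the database, you can use the official neo4j Python driver. The following is a minimal sketch, assuming a local instance and placeholder credentials; for Aura, use the neo4j+s:// URI and the password generated when the instance was created.

```python
from neo4j import GraphDatabase

# Assumptions: local Neo4j instance and placeholder credentials
URI = "bolt://localhost:7687"
AUTH = ("neo4j", "your-password")


def add_triple(tx, subject, relation, obj):
    # MERGE avoids duplicate nodes/relationships when a triple is inserted twice
    tx.run(
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        "MERGE (s)-[:RELATION {type: $relation}]->(o)",
        subject=subject, relation=relation, object=obj,
    )


with GraphDatabase.driver(URI, auth=AUTH) as driver:
    with driver.session() as session:
        session.execute_write(add_triple, "Jerome Powell", "DISCUSSES", "inflation")
```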
In the notebook 8_spacy_rule-based-matching_for_re.ipynb, you will find a way to extract entities and relations based on matching rules. Adapt the code to your use case, adding the entities and relations that you find in the extracted passages.
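As a flavor of what such rule-based extraction can look like (a simplified sketch, not the notebook's code), the following pulls naive subject-verb-object triples from spaCy's dependency parse:

```python
import spacy

nlp = spacy.load("en_core_web_sm")


def extract_svo_triples(text):
    """Naive subject-verb-object extraction from the dependency parse."""
    triples = []
    doc = nlp(text)
    for sent in doc.sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children if c.dep_ == "nsubj"]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples


# Example: yields [('senator', 'support', 'bill')]
print(extract_svo_triples("The senator supports the new healthcare bill."))
```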
Create a single Colab notebook where you can test your NLP pipeline. The expected output of this project is at least 10 triples.
Add meaningful comments to increase readability.
In the final presentation, you will execute each cell to populate your KG with at least 10 triples.
IMPORTANT: It is recommended to review the Linguistic Features section of the spaCy documentation.