Hierarchical Text Summarization Using Semantic Similarity
Research Highlights in Language, Literature and Education Vol. 6, 2 June 2023, Pages 1-12
https://doi.org/10.9734/bpi/rhlle/v6/19261D
The growth of the internet has produced huge volumes of data, so users spend more time gathering information that is scattered across many documents. Summarizing the documents into a compressed version reduces reading time. Text summarization is the practice of condensing larger text documents into an organized summary of informative sentences. The term "embedding" refers to the representation of each sentence in the text as a vector of real values. When a text is embedded, the values of particular attributes are examined and plotted in n-dimensional space, so that semantically similar sentences lie closer to one another. By calculating the distance between the vectors, unsupervised summarization groups similar sentences and determines whether or not they should be included in the summary. From the input text, hierarchical summarization builds a tree structure whose length is determined by the number of clusters. Three different hierarchical clustering models are used to perform the clustering, and each cluster collects sentences with similar semantic content. A predetermined number of sentences is extracted from each cluster by locating its closest neighbor, and these sentences are added to a summary that is at least half the size of the original document(s). Hierarchical summarization is evaluated on the CNN/Daily Mail dataset using performance indicators. The evaluation shows that the BIRCH algorithm, with an F1 score of 0.75, performs better than the other clustering algorithms.
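The embed, cluster, and select steps described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation, and it makes several assumptions: bag-of-words count vectors stand in for semantic sentence embeddings, a simple average-linkage agglomerative procedure stands in for the three hierarchical clustering models (including BIRCH), one sentence is taken per cluster, and "closest" is read as the sentence nearest to the cluster centroid. The names `embed`, `agglomerative`, and `summarize` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentences):
    """Toy embedding (assumption): one bag-of-words count vector per sentence."""
    tokens = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({t for toks in tokens for t in toks})
    vectors = []
    for toks in tokens:
        counts = Counter(toks)
        vectors.append([float(counts[w]) for w in vocab])
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of sentence indices."""
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(
                    cosine_distance(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters (b > a, so deleting b keeps a valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def summarize(sentences, n_clusters):
    """Pick, from each cluster, the sentence nearest the cluster centroid."""
    if not sentences:
        return []
    vectors = embed(sentences)
    dim = len(vectors[0])
    picked = []
    for members in agglomerative(vectors, n_clusters):
        centroid = [
            sum(vectors[i][k] for i in members) / len(members) for k in range(dim)
        ]
        picked.append(min(members, key=lambda i: cosine_distance(vectors[i], centroid)))
    # Keep the selected sentences in their original document order.
    return [sentences[i] for i in sorted(picked)]


if __name__ == "__main__":
    demo = [
        "The cat sat on the mat.",
        "A cat was sitting on the mat.",
        "Stock prices fell sharply today.",
        "Stock markets fell as prices dropped.",
    ]
    print(summarize(demo, 2))
```

With two clusters, the demo input should yield one sentence about the cat and one about the stock market. Swapping `embed` for a real sentence-embedding model and `agglomerative` for a library clustering routine would bring the sketch closer to the setup the abstract describes.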