October 17, 2018

What is automatic text summarization?

Automatic text summarization is the data science problem of creating a short, accurate, and fluent summary from a longer document. Summarization methods are greatly needed to consume the ever-growing amount of text data available online. In essence, summarization is meant to help us consume relevant information faster.

While summarization has been a field of study for decades, it has certainly grown in popularity in recent years. In 2017, Salesforce announced certain breakthroughs in the field of abstractive automatic summarization, and the use cases have proliferated across the enterprise.

Earlier, in 2014, data scientist Juan Manuel Torres Moreno published a full book on the subject, titled “Automatic Text Summarization”, in which he gave six reasons why we need automatic text summarization tools:

  • Summaries reduce reading time.
  • When researching documents, summaries make the selection process easier.
  • Automatic summarization improves the effectiveness of indexing.
  • Automatic summarization algorithms are less biased than human summarizers.
  • Personalized summaries are useful in question-answering systems as they provide personalized information.
  • Using automatic or semi-automatic summarization systems enables commercial abstract services to increase the number of texts they are able to process.

As described by Agolo, a Microsoft-backed summarization startup, a document summarizer must generally overcome a set of challenges:

  • Determining which sentences are the most salient.
  • Making the summary cohesive and readable.
  • Minimizing the number of references to ideas and entities not mentioned in the summary (i.e., coreference resolution).

Types of automatic summarization

Automatic summarization can be used in a variety of applications. Depending on the use case and type of documents, summarization systems can fall into different categories.

Abstractive vs. Extractive

When a human is given a corpus of text to summarize, they might rewrite the main points in their own words. This is called abstractive summarization, and it requires high-level human skills like the ability to combine multiple perspectives into coherent natural language. As of 2018, abstractive methods still fall short of human-written summaries, so many automatic summarization systems opt for a technique called extractive summarization.

Extractive summaries are excerpts taken directly from the input documents and presented in a readable way. The summary does not contain any rephrasing of the ideas presented in the original text. Extractive summarization methods employ AI-powered techniques to identify the most important sentences directly from the source.
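To make the extractive idea concrete, here is a minimal sketch in Python. It is not how any particular product works; it assumes a simple word-frequency heuristic, a naive sentence splitter, and a tiny stopword list, all chosen only for illustration. Each sentence is scored by the average frequency of its words across the document, and the top-scoring sentences are returned verbatim in their original order.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in",
             "it", "that", "on", "for", "was", "are"}

def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase words, minus stopwords.
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]

def extractive_summary(text, k=2):
    """Return the k highest-scoring sentences, unchanged, in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

text = "Cats sleep. Cats chase mice and cats climb trees. The weather was nice."
print(extractive_summary(text, k=2))
```

Because the output is copied straight from the input, nothing is rephrased; the quality of the summary depends entirely on how well the scoring function captures salience.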

Illustration of Salesforce’s model generating a multi-sentence summary from a news article. For each generated word, the model pays attention to specific words of the input and the previously generated output.

Single-document vs. Multi-document summarization

When summarizing a single document, the summarization system can rely on a cohesive piece of text with very little repetition of facts. However, the chance of redundancy increases with multi-document summarization systems. An ideal multi-document summarizer maximizes the important information included in the summary while minimizing repetition.
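One way to picture the trade-off between importance and repetition is a greedy selection loop, sketched below in Python. This is only an illustration under stated assumptions: the importance scores are taken as given, redundancy is measured with plain word-set (Jaccard) overlap, and the `max_overlap` threshold and function names are hypothetical choices rather than part of any system mentioned here.

```python
import re

def words(sentence):
    # Set of lowercase word tokens in a sentence.
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def jaccard(a, b):
    # Overlap between two word sets: |intersection| / |union|.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_non_redundant(scored_sentences, k=2, max_overlap=0.5):
    """Pick up to k sentences, most important first, skipping near-duplicates.

    scored_sentences: list of (sentence, importance) pairs pooled from
    several documents. The importance scores are assumed to come from
    some upstream scoring step.
    """
    chosen = []
    for sentence, _ in sorted(scored_sentences, key=lambda p: p[1], reverse=True):
        # Keep the sentence only if it differs enough from everything chosen so far.
        if all(jaccard(words(sentence), words(c)) <= max_overlap for c in chosen):
            chosen.append(sentence)
        if len(chosen) == k:
            break
    return chosen

pool = [
    ("The company reported record profits this quarter.", 0.9),
    ("Record profits were reported by the company this quarter.", 0.8),
    ("The CEO announced a new product line.", 0.6),
]
print(select_non_redundant(pool, k=2))
```

In this example the second sentence restates the first, so it is skipped in favor of the less important but novel third sentence.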

Indicative vs. informative

The taxonomy of summaries largely depends on the user’s end goal. For example, journalists or analysts looking to skim information as fast as possible would be interested in the high-level points of an article. This use case calls for an ‘indicative’ summary.

On the other hand, when the reader is looking to get more granular, summaries may require more detail. For example, a summary might need to allow topic filtering so the reader can drill down into it further. This type of summary is considered ‘informative’.

Document length and type

The length of the input text heavily impacts the sort of approaches a summarization system can take. The largest summarization datasets, like Newsroom by Cornell University, have focused on news articles, which usually range from 300 to 1,000 words. Extractive summarizers can be very effective when dealing with relatively short documents like news or blog articles. On the other hand, a 20-page report or a chapter of a book typically calls for more advanced approaches like hierarchical clustering or discourse analysis.

In addition to length, documents may also fall into different genres. Summarizing a news article is very different from summarizing a financial earnings report or a technical white paper. These types of documents may require entirely distinct summarization approaches.

Recommended articles

As a recap, here is a list of articles that cover the basics of automatic summarization. These articles were actually summarized by Frase’s summarization engine, which uses AI-powered extractive summarization.

New AI Breakthrough from Salesforce Research Boosts Productivity with Text Summarization (salesforce.com)

  • Salesforce Research is tackling this exact challenge and today we’re excited to announce two new breakthroughs in natural language processing towards the goal of automatically summarizing a long text and serving up coherent, digestible highlights that help you stay informed in a fraction of the time.
  • Text summarization is a very tough challenge, especially for longer texts such as news articles, and the work we are doing at Salesforce Research is pushing the state of the art.
  • I’m honored to work with Caiming Xiong and Richard Socher to introduce a more contextual word generation model and a new way of training summarization models with reinforcement learning (RL).

Introduction to Automatic Text Summarization (blog.algorithmia.com) – Jan 05 2017

  • Without an abstract or summary, it can take minutes just to figure out what the heck someone is talking about in a paper or report.
  • Automatic text summarization is part of the field of natural language processing, which is how computers can analyze, understand, and derive meaning from human language.
  • By keeping things simple and general purpose, the automatic text summarization algorithm is able to function in a variety of situations that other implementations might struggle with, such as documents containing foreign languages or unique word associations that aren’t found in standard english language corpuses.

A Gentle Introduction to Text Summarization (machinelearningmastery.com) – Nov 28 2017

  • Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online to both better help discover relevant information and to consume relevant information faster.
  • After reading this post, you will know: Why text summarization is important, especially given the wealth of text available on the internet.
  • Automatic text summarization, or just text summarization, is the process of creating a short and coherent version of a longer document.
  • These deep learning approaches to automatic text summarization may be considered abstractive methods and generate a wholly new description by learning a language generation model specific to the source documents.

Taming Recurrent Neural Networks for Better Summarization (abigailsee.com)

  • Abstractive approaches use natural language generation techniques to write novel sentences.
  • In the past few years, the Recurrent neural network (RNN) – a type of neural network that can perform calculations on sequential data (e.g. sequences of words) – has become the standard approach for many Natural Language Processing tasks.
  • The decoder’s ability to freely generate words in any order – including words such as beat that do not appear in the source text – makes the sequence-to-sequence model a potentially powerful solution to abstractive summarization.
  • Explanation for Problem 2: Repetition may be caused by the decoder’s over-reliance on the decoder input (i.e. previous summary word), rather than storing longer-term information in the decoder state.

An Overview of Summarization – agolo (blog.agolo.com) – Nov 03 2016

  • A summarization system with what’s called a generic trigger will find the most important topics in a given input text and summarize it without further guidance.
  • A generic trigger for summarization is useful in cases where the user does not yet know the contents of the text to be summarized.
  • Agolo’s summarizer takes these factors into account at various points in the summarization process.

A Quick Introduction to Text Summarization in Machine Learning (towardsdatascience.com) – Sep 18 2018

  • Text summarization refers to the technique of shortening long pieces of text.
  • Machine learning models are usually trained to understand documents and distill the useful information before outputting the required summarized texts.
  • With such a big amount of data circulating in the digital space, there is need to develop machine learning algorithms that can automatically shorten longer texts and deliver accurate summaries that can fluently pass the intended messages.
  • However, the text summarization algorithms required to do abstraction are more difficult to develop; that’s why the use of extraction is still popular.
  • As research in this area continues, we can expect to see breakthroughs that will assist in fluently and accurately shortening long text documents.

How to Make a Text Summarizer – Intro to Deep Learning (YouTube)
