String Patterns with Keywords

Extracting the Sentences with Particular Keywords.

December 26, 2016

We often read and learn new information from a given text, and our minds tend to pay closer attention to the relevant keywords. It would be convenient to extract the string patterns associated with those keywords.

The function below, regex_rules(), builds regular expressions that extract the string patterns containing the provided keywords:

def regex_rules(words, extract=0):
    """A function which returns the regular expression (regex) pattern for the
    provided keywords.
 
    Parameters
    ----------
    words : list
        A list of keywords or regex patterns.
 
    extract : int, default 0
        - 0 : Only the provided keywords or regex patterns.
        - 1 : Including words/word adjacent to the keywords or regex patterns.
        - 2 : Including the whole sentence having the keywords or regex
              patterns. It assumes a sentence is sandwiched by two periods
              or start/end boundaries.
 
    Returns
    -------
    patterns : list
        A list of regex rules containing the keywords in the 'words',
        according to the rules defined by 'extract'.
    """
    assert isinstance(words, list), "words must be a list"
    for word in words:
        assert isinstance(word, str), "every element of words must be a string"
 
    patterns = []
    for word in words:
        if extract == 0:
            # In a normal string '\b' is a backspace (\x08), so every regex
            # fragment is written as a raw string.
            pattern = r'\b(' + word + r')\b'
        elif extract == 1:
            # (?: ) is a non-capturing group, so re.findall() returns the
            # whole match rather than the contents of a group.
            pattern = r'(?:\w+\s+|\b)(?:' + word + r')(?:\s+\w+|\b)'
        elif extract == 2:
            pattern = r"(?:\.?|^)[^.]*\b(?:" + word + r")\b[^.]*(?:\.?|$)"
        else:
            raise ValueError("extract must be 0, 1 or 2")
        patterns.append(pattern)
    return patterns

Consider a short paragraph:

Johnathan likes to play badminton. John likes to play basketball. Max likes computer games, but Johnathan prefer board games instead. Michael and Max love to play soccer.

To check whether the paragraph mentions "John" or "Max", I can pass ['John|Max'] to the regex_rules function and use the 'keywords only' extraction rule (extract = 0 in regex_rules). The resulting pattern matches the words "John" and "Max", but not "Johnathan". To find out what kinds of "games" are mentioned in the paragraph, we can search for the words adjacent to the keyword "games". Passing ['games'] to regex_rules (extract = 1 in regex_rules) produces a regular expression that matches the string patterns ['computer games', 'board games instead']. In this way, we learn that two different types of games are mentioned in the text.
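The two lookups described above can be tried with a short script. This is only a sketch: it uses a condensed restatement of regex_rules() without the input checks, and applying the patterns through re.findall() is my assumption about how they are meant to be used.

```python
import re


def regex_rules(words, extract=0):
    """Condensed restatement of regex_rules() (input checks omitted)."""
    # (prefix, suffix) wrapped around each keyword for every extraction rule.
    wrappers = {
        0: (r'\b(', r')\b'),
        1: (r'(?:\w+\s+|\b)(?:', r')(?:\s+\w+|\b)'),
        2: (r'(?:\.?|^)[^.]*\b(?:', r')\b[^.]*(?:\.?|$)'),
    }
    prefix, suffix = wrappers[extract]
    return [prefix + word + suffix for word in words]


text = (
    "Johnathan likes to play badminton. John likes to play basketball. "
    "Max likes computer games, but Johnathan prefer board games instead. "
    "Michael and Max love to play soccer."
)

# Keywords only: "Johnathan" is skipped because of the trailing \b.
names = re.findall(regex_rules(['John|Max'], extract=0)[0], text)
# Keyword plus the adjacent words on either side, when there are any.
games = re.findall(regex_rules(['games'], extract=1)[0], text)

print(names)  # ['John', 'Max', 'Max']
print(games)  # ['computer games', 'board games instead']
```

"Max" is listed twice because it occurs twice in the paragraph; wrapping the result in set() would give the distinct keywords.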

Sometimes we want to find the sentences that contain a particular keyword. Consider a paragraph like

I visited ABC company a few years ago. They visit ABC company in this coming weekend, and they would like to have another visit. Visit to ABC company makes people feeling great. I am keen to arrange a new visit by the end of the year.

In the raw input, its sentences may be broken up by unwanted line breaks (\n) or extra white space, or may lack a properly placed period, because the input data was not entered cleanly. regex_rules extracts sentences sandwiched between two periods (or the start/end boundaries of the string), so this assumption breaks down for sentences that themselves contain periods, such as URL links. Nevertheless, passing ['visit'] with the sentence rule (extract = 2 in regex_rules) and matching case-insensitively (needed for the capitalized "Visit") extracts the following sentences.

'.\nThey visit ABC company in this coming\nweekend, and they would like to have another visit.'

' Visit to ABC company \nmakes people feeling great.'

' I am keen to arrange a\n\nnew visit\n\nby the end of the year.'
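A minimal sketch of this extraction is given here. The raw input is not shown in the post, so the placement of the line breaks in raw_text is reconstructed from the extracted sentences and is an assumption, as is the use of re.IGNORECASE, which is needed for the capitalized "Visit" to match.

```python
import re

# Assumed raw input: line breaks are placed where the extracted
# sentences above show them.
raw_text = (
    "I visited ABC company a few years ago.\n"
    "They visit ABC company in this coming\nweekend, and they would like "
    "to have another visit. Visit to ABC company \nmakes people feeling "
    "great. I am keen to arrange a\n\nnew visit\n\nby the end of the year."
)

# The pattern that regex_rules(['visit'], extract=2) builds.
sentence_pattern = r'(?:\.?|^)[^.]*\b(?:visit)\b[^.]*(?:\.?|$)'

# [^.]* also matches '\n', so stray line breaks do not split a sentence.
sentences = re.findall(sentence_pattern, raw_text, flags=re.IGNORECASE)
for sentence in sentences:
    print(repr(sentence))
```

The first sentence is skipped because \b prevents "visit" from matching inside "visited". The leading period of the first match is the closing period of that skipped sentence, picked up by the optional \.? at the start of the pattern.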

This example demonstrates that we can extract all the sentences containing the word 'visit', even when the paragraph has stray line breaks. For a more practical example, let's say we have an advertisement for an air-conditioning maintenance service:

Hi there, we are from an international company. We provide excellent services and some of our clients are from MNC. We have been in the business for more than 50 years. Enroll into a yearly contract with us and get your air-conditioning serviced at $25 per unit.

A regular expression pattern such as

keyword = [r'(?:\$?\d+\.?\d?\d?|price|charges?)(?: is| nett| per)?(?:/| per | an | one | half | every | each )(?:unit|(?!hour |hr )\w+)\b']

will help to search for a quote in the format of "price per unit". Key phrases such as "$25 per unit", "price per every session" and "$50.00 nett every class" are matched, while hourly rates such as "$25 per hour" are skipped, provided "hour" or "hr" is followed by a space, as the lookahead is written. In this manner, we can learn the pricing format preferred by a service provider ($25 per unit) by extracting any sentence that contains this type of keyword with the regex_rules function.
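The behaviour of this pattern can be checked against a few phrases and against the advertisement itself. The sketch below is illustrative only: first_match() is a hypothetical helper, the test phrases with "hour" are my own, and the sentence wrapper is the extract = 2 template written out by hand.

```python
import re

price_pattern = re.compile(
    r'(?:\$?\d+\.?\d?\d?|price|charges?)(?: is| nett| per)?'
    r'(?:/| per | an | one | half | every | each )'
    r'(?:unit|(?!hour |hr )\w+)\b'
)


def first_match(phrase):
    """Hypothetical helper: the matched text, or None if nothing matches."""
    found = price_pattern.search(phrase)
    return found.group(0) if found else None


print(first_match("$25 per unit"))              # $25 per unit
print(first_match("price per every session"))   # price per every session
print(first_match("$50.00 nett every class"))   # $50.00 nett every class
print(first_match("$25 per hour on weekdays"))  # None
# The lookahead only rejects "hour" or "hr" followed by a space.
print(first_match("$25 per hour"))              # $25 per hour

advert = (
    "Hi there, we are from an international company. We provide excellent "
    "services and some of our clients are from MNC. We have been in the "
    "business for more than 50 years. Enroll into a yearly contract with us "
    "and get your air-conditioning serviced at $25 per unit."
)

# Same wrapper that regex_rules(keyword, extract=2) puts around a keyword.
sentence_pattern = (
    r'(?:\.?|^)[^.]*\b(?:' + price_pattern.pattern + r')\b[^.]*(?:\.?|$)'
)
quotes = re.findall(sentence_pattern, advert)
print(quotes)
```

Only the last sentence of the advertisement is returned: "50 years" is not followed by one of the separators in the pattern, so the third sentence is not treated as a price quote.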

The example code of this post is available on my GitHub page.