In this tutorial, we will go over some useful pandas functions that you can combine with regular expressions to process text.

function     description
contains()   Test if pattern or regex is contained within a string of a Series or Index.
count()      Count occurrences of pattern in each string of the Series/Index.
findall()    Find all occurrences of pattern or regular expression in the Series/Index.
replace()    Replace each occurrence of pattern/regex in the Series/Index with a custom string.
split()      Split strings around given pattern.
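All of the functions in the table are reached through the .str accessor of a Series (or Index) and operate on each element. As a minimal sketch, assuming a tiny made-up Series rather than the dataset used in this tutorial:

```python
import pandas as pd

# Hypothetical sample data, not taken from the AG News dataset.
s = pd.Series(["Stocks fall today", "Oil prices rise"])

# String methods live under the .str accessor and are applied row by row.
has_oil = s.str.contains(r"\boil\b", case=False)
print(has_oil.tolist())
```

The result is a boolean Series with one entry per input string, which is why it can be used directly as a row filter.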




Create a DataFrame if you'd like to follow along with the tutorial:

from datasets import load_dataset
agnews = load_dataset('ag_news')

agnews.set_format(type="pandas")
df = agnews['train'][:]
df.head()
text label
0 Wall St. Bears Claw Back Into the Black (Reute... 2
1 Carlyle Looks Toward Commercial Aerospace (Reu... 2
2 Oil and Economy Cloud Stocks' Outlook (Reuters... 2
3 Iraq Halts Oil Exports from Main Southern Pipe... 2
4 Oil prices soar to all-time record, posing new... 2

contains

  • find texts containing the word "business"
df[df['text'].str.contains(r'\bbusiness\b')].head()
text label
42 Technology company sues five ex-employees A M... 2
62 Downhome Pinoy Blues, Intersecting Life Paths,... 2
63 The Real Time Modern Manila Blues: Bill Monroe... 2
65 What are the best cities for business in Asia?... 2
74 HP to Buy Synstar Hewlett-Packard will pay \$2... 2
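The pattern above is case-sensitive, so it misses "Business" at the start of a sentence. The sketch below, on a small made-up Series (the sample strings are assumptions, not rows from the dataset), shows how the case and na parameters of contains() might be used to match either casing and to treat missing values as non-matches:

```python
import pandas as pd

# Hypothetical sample data, including a missing value.
texts = pd.Series(
    ["Business is booming", "Small business loans", "busy day", None]
)

# case=False ignores letter case; na=False makes missing rows count as False
# so the mask can be used for boolean indexing without errors.
mask = texts.str.contains(r"\bbusiness\b", case=False, na=False)
matches = texts[mask]
print(matches.tolist())
```

Without na=False, a missing value would produce a missing entry in the mask, which cannot be used directly to filter rows.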

count

  • count the total number of times the word "business" occurs in texts
df['text'].str.count(r'\bbusiness\b').sum()
2759
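count() returns one number per row, and summing those numbers gives the total. The following sketch uses invented strings to compare the default case-sensitive count with a case-insensitive one; passing re.IGNORECASE through the flags parameter is one way to do the latter:

```python
import re

import pandas as pd

# Hypothetical sample data.
texts = pd.Series(
    ["Business news: business is up", "No match here", "BUSINESS as usual"]
)

# Per-row counts, ignoring case.
per_row = texts.str.count(r"\bbusiness\b", flags=re.IGNORECASE)

# Default behaviour is case-sensitive, so only lowercase "business" is counted.
case_sensitive_total = texts.str.count(r"\bbusiness\b").sum()

print(per_row.tolist(), int(per_row.sum()), int(case_sensitive_total))
```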

findall

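  • find every occurrence of the standalone word "a" in each text; the function returns a list of matches for each row
df['text'].str.findall(r'\ba\b')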
0                   []
1                  [a]
2                   []
3                  [a]
4                  [a]
              ...     
119995             [a]
119996    [a, a, a, a]
119997             [a]
119998              []
119999             [a]
Name: text, Length: 120000, dtype: object
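Because findall() returns a list for each row, the result usually needs one more step before it is useful. A minimal sketch on made-up strings, showing that the word boundaries exclude the "a" inside longer words and how the lists might be turned into per-row counts:

```python
import pandas as pd

# Hypothetical sample data.
texts = pd.Series(["a cat and a dog", "no articles here", "a"])

# \b on both sides keeps only the standalone word "a";
# the "a" in "cat", "and", or "articles" is not matched.
found = texts.str.findall(r"\ba\b")

# Turn each list of matches into its length.
n_found = found.map(len)

print(found.tolist())
print(n_found.tolist())
```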

replace

  • replace all occurrences of "today" or "Today" with "TODAYYYYYY"
  • pass regex=True explicitly, since newer pandas versions treat the pattern as a literal string by default
  • check the second-to-last row!
df['text'].str.replace(r'\b[Tt]oday\b', 'TODAYYYYYY', regex=True)
0         Wall St. Bears Claw Back Into the Black (Reute...
1         Carlyle Looks Toward Commercial Aerospace (Reu...
2         Oil and Economy Cloud Stocks' Outlook (Reuters...
3         Iraq Halts Oil Exports from Main Southern Pipe...
4         Oil prices soar to all-time record, posing new...
                                ...                        
119995    Pakistan's Musharraf Says Won't Quit as Army C...
119996    Renteria signing a top-shelf deal Red Sox gene...
119997    Saban not going to Dolphins yet The Miami Dolp...
119998    TODAYYYYYY's NFL games PITTSBURGH at NY GIANTS...
119999    Nets get Carter from Raptors INDIANAPOLIS -- A...
Name: text, Length: 120000, dtype: object
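The regex parameter decides whether the pattern is interpreted as a regular expression or as a literal substring. The sketch below, using invented strings, contrasts the two; it is an illustration of the parameter rather than part of the original notebook:

```python
import pandas as pd

# Hypothetical sample data.
texts = pd.Series(["Today's games", "Results from today", "Todays news"])

# regex=True: the pattern is a regular expression, so both casings match,
# while "Todays" is left alone because of the trailing word boundary.
replaced = texts.str.replace(r"\b[Tt]oday\b", "TODAY", regex=True)

# regex=False: the pattern is searched for as literal text, which never
# appears in these strings, so nothing changes.
literal = texts.str.replace(r"\b[Tt]oday\b", "TODAY", regex=False)

print(replaced.tolist())
print(literal.tolist())
```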

split

  • split texts on the word "the"; the function returns a list of strings for each row
  • check the first row of the output
df['text'].str.split(r"\bthe\b")
0         [Wall St. Bears Claw Back Into ,  Black (Reute...
1         [Carlyle Looks Toward Commercial Aerospace (Re...
2         [Oil and Economy Cloud Stocks' Outlook (Reuter...
3         [Iraq Halts Oil Exports from Main Southern Pip...
4         [Oil prices soar to all-time record, posing ne...
                                ...                        
119995    [Pakistan's Musharraf Says Won't Quit as Army ...
119996    [Renteria signing a top-shelf deal Red Sox gen...
119997    [Saban not going to Dolphins yet The Miami Dol...
119998    [Today's NFL games PITTSBURGH at NY GIANTS Tim...
119999    [Nets get Carter from Raptors INDIANAPOLIS -- ...
Name: text, Length: 120000, dtype: object
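Splitting on \bthe\b alone leaves the surrounding spaces attached to the pieces, as the first row above shows. A minimal sketch on made-up strings, assuming you want the whitespace consumed as well and, optionally, only the first split (the n parameter):

```python
import pandas as pd

# Hypothetical sample data.
texts = pd.Series(["over the hill and the river", "nothing to split"])

# \s* on both sides swallows the spaces around "the".
# A multi-character pattern is treated as a regular expression by default.
parts = texts.str.split(r"\s*\bthe\b\s*")

# n=1 limits the operation to the first split in each string.
first_split = texts.str.split(r"\s*\bthe\b\s*", n=1)

print(parts.tolist())
print(first_split.tolist())
```

Rows that do not contain the pattern come back as a single-element list holding the original string.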