In this post we retrieve text from web pages and prepare it for analysis.
Topics covered:
Retrieving Text from Static Website
Beautiful Soup
Using Newspaper3K to handle text cleanup
Several Web Examples
Processing Local Text File
Basic WordCloud with WordCloud
Readability with Textatistic
Sentiment Analysis with TextBlob
Be able to:
Download text from (some) web pages and prep for text analysis.
Clean up the text with Beautiful Soup, if possible.
Learn to use a library like Newspaper3k (through its Article class) to extract articles from most news sites and blogs, including key metadata.
Practice manipulating speech transcript data from Rev.com.
Perform sentiment analysis and plot sentence-level subjectivity and polarity data with matplotlib and Plotly Express.
Wrangling Text from Web-pages is Hard!
Each web-site stores data differently, so you need to be a sleuth.
Most modern sites no longer store the text directly in the page's HTML; it is often loaded separately by scripts.
Static web pages are hard to find.
You could spend a semester just on retrieving data from web-pages or other APIs.
Many web-pages have restrictions on what you can retrieve. (See robots.txt before making heavy use of a web-page.)
Most book examples will use a static, locally stored text file as input.
Some newer tools (e.g. Newspaper3k's Article class) can make it "easier" to retrieve properly formatted pages.
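The robots.txt check mentioned in the list above can be sketched with Python's standard library. This is a minimal sketch, not a complete scraping policy: the helper name is_allowed, the sample rules, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that are already in memory
    return parser.can_fetch(user_agent, url)

# Hypothetical rules; a real file lives at https://<site>/robots.txt
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "https://example.com/speeches/index.htm"))
print(is_allowed(sample_rules, "https://example.com/private/data.htm"))
```

To check a live site instead of in-memory rules, RobotFileParser also offers set_url() followed by read(), which downloads the file before can_fetch() is called.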
Install from Command or Terminal Prompt (not Jupyter Notebook)
TextBlob Module
conda install -c conda-forge textblob
ipython -m textblob.download_corpora
Note: Windows users may need to run as administrator
wordcloud Module
conda install -c conda-forge wordcloud
Install from Command or Terminal Prompt (continued)
Newspaper3k
https://github.com/codelucas/newspaper
Reliable text scraping
pip3 install newspaper3k
Textatistic Module
Not required for our assignments but good for practice and examples
pip install textatistic
Note: Windows users may need to run as administrator
Some students have reported needing to install VS Code to get Textatistic to work (YMMV)
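After installing, it can help to confirm that each package is visible to the Python environment your notebook uses. The snippet below is a small sketch using only the standard library; the helper name missing_modules is made up, and the list holds the import names used in this post (newspaper3k is imported as newspaper, Beautiful Soup as bs4).

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]

# Top-level import names, which can differ from the pip/conda package names
required = ["requests", "bs4", "html5lib", "wordcloud", "textblob",
            "textatistic", "newspaper"]

missing = missing_modules(required)
print("Missing:", missing if missing else "none")
```

Anything reported as missing needs to be installed from the command or terminal prompt as described above.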
Importing All Related Libraries
import requests # import from web
from bs4 import BeautifulSoup # clean up text
from wordcloud import WordCloud # create word clouds
from textblob import TextBlob # basic NLP, install first
from textatistic import Textatistic # readability, install first
from newspaper import Article # article extraction, install newspaper3k first
from pathlib import Path # for quick import of text file for NLP
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from plotly import express as px
# Magics
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
Example 1: Extraction Blocked or Not Allowed
## 403 Forbidden Error, extract blocked / not allowed
url = 'https://www.americanrhetoric.com/speeches/mlkihaveadream.htm'
response = requests.get(url) # retrieve the webpage
response.content # show content from the retrieved page
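Rather than reading the raw content to spot a blocked request, you can look at the status code first. The sketch below uses only the standard library to turn a code into a readable message; the helper name describe_status is made up, and in the notebook you would pass it response.status_code.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return (ok, message) for an HTTP status code such as response.status_code."""
    ok = 200 <= status_code < 300  # 2xx codes mean the request succeeded
    message = f"{status_code} {HTTPStatus(status_code).phrase}"
    return ok, message

# With the request above this would be: describe_status(response.status_code)
print(describe_status(403))  # (False, '403 Forbidden')
print(describe_status(200))  # (True, '200 OK')
```

The requests library also provides response.raise_for_status(), which raises an exception for 4xx and 5xx responses if you prefer to stop immediately.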
Example 2: Static, Predominately Text-based Web-Page
## Static web page - JFK speech re: moon
url = 'https://er.jsc.nasa.gov/seh/ricetalk.htm'
response = requests.get(url)
response.content # Notice moderate amount of HTML code
soup = BeautifulSoup(response.content, 'html5lib')
text = soup.get_text(strip=True) # text without tags
text #BeautifulSoup has done a decent job on this page removing HTML
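To see roughly what removing tags involves, here is a standard-library sketch built on html.parser. It is not how Beautiful Soup works internally and it handles far fewer cases; the class name TextExtractor, the helper html_to_text, and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

sample_html = ("<style>p{color:red}</style><h1>Title</h1>"
               "<p>Hello <b>world</b></p><script>var x = 1;</script>")
print(html_to_text(sample_html))
```

For real pages, Beautiful Soup remains the better choice because it copes with malformed markup that this sketch does not attempt to handle.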
Example 3: Sometimes it is easier to copy and paste to a file
## Somewhat hidden text
url = 'https://www.whitehouse.gov/briefings-statements/remarks-president-trump-2020-united-states-military-academy-west-point-graduation-ceremony/'
response = requests.get(url)
response.content # UGLY! significant amount of code -- Where's the text??
## Soup doesn't help much in this case
soup = BeautifulSoup(response.content, 'html5lib')
text = soup.get_text(strip=True)
text
Let's try Article
Steps:
Install newspaper3k via pip (only do this once per machine)
Import Article from newspaper (once per notebook)
Create an article object and set it to the URL of the web-page (required once per web-page)
Download (required after creating article object)
Parse the downloaded object (required once per download, separate data into text, authors, title, date, etc.)
Now you are ready for other tasks (view text, check authors and publication date; perform NLP tasks).
Try it!
Try the newspaper article code on your own link to a site that is likely to have all of the attributes.
# Change the url to your own and comment out all urls but one. Note: the full article may not be returned if it is behind a paywall.
url = 'https://hbr.org/2020/04/bringing-an-analytics-mindset-to-the-pandemic'
# url = 'https://www.wsj.com/articles/ceos-increasingly-see-sustainability-as-path-to-profitability-11602535250'
# url = 'https://www.cnn.com/2020/10/13/health/us-coronavirus-tuesday/index.html'
article = Article(url)
article.download()
article.parse()
print("Title: ", article.title)
print("Authors: ", article.authors) # a list of authors; may be empty if the page lists none
print("Publication Date: ", article.publish_date) # may be None if the page has no publish date
print("First Image:", article.top_image)
print("Video Links:", article.movies)
print()
print(article.text)
article.nlp() # builds keywords and summary; needs NLTK's 'punkt' tokenizer data
print("Keywords: ", article.keywords)
print()
print("Summary: ", article.summary)
Processing a Transcript with Newspaper3k
We can use Article to retrieve the text of transcribed speeches, though we may need to process the data a bit to prepare it for analysis.
Most transcripts include speaker names, time stamps and other information.
Speech Transcript
These examples are specifically for the transcript site https://www.rev.com/blog/transcripts
Other speech sources will likely need modifications.
# Set the url
# From rev.com
url = 'https://www.rev.com/blog/transcripts/donald-trump-mosinee-wi-rally-speech-transcript-september-17'
event = '-mosinee-2020' # this will be part of the file name for a text file we create
#url = 'https://www.rev.com/blog/transcripts/ruth-bader-ginsburg-stanford-rathbun-lecture-transcript-2017'
#event = '-stanfordlecture-2017'
# Minimum code needed to get to the text of the speech
article = Article(url)
article.download()
article.parse()
print(article.text)
# write the text to a file
with open('speech.txt', 'w') as f:
    f.write(article.text) # write the parsed transcript, not the earlier `text` variable
with open('speech.txt', 'r') as f:
    for cnt, line in enumerate(f):
        print(f'Line {cnt}: {line}')
# Custom processing for rev site
# line 0 = speaker and time
# line 2 = what speaker said
# lines 1 and 3 = blanks
# create four lists of the components of speech
with open('speech.txt', 'r') as f:
    speech = f.readlines()
tmp = []
speaker = []
time = []
words = []
for cnt, line in enumerate(speech):
    if cnt % 2 == 0:
        tmp.append(line.rstrip()) # temp list of just the text lines 0,2
for i in range(0, len(tmp), 2):
    speaker.append(tmp[i].split(': ')[0]) # split speaker line into 2 parts
    time.append(tmp[i].split(': ')[1])
    words.append(tmp[i+1]) # words from speaker
# find unique speaker names for later filter
set(speaker)
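The parsing above depends on lines strictly alternating between text and blanks. A regular expression can be more forgiving when a transcript has extra blank lines or a remark that spans several lines. The sketch below assumes header lines look like "Name: (MM:SS)" or "Name: (HH:MM:SS)"; that pattern, the function name parse_transcript, and the sample text are assumptions for illustration, so check them against the actual output of print(article.text).

```python
import re

# Assumed header format: "Speaker Name: (MM:SS)" or "Speaker Name: (HH:MM:SS)"
HEADER = re.compile(r"^(?P<speaker>[^:]+): \((?P<time>\d{1,2}:\d{2}(?::\d{2})?)\)$")

def parse_transcript(text):
    """Return a list of dicts with 'speaker', 'time', and 'words' keys."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines wherever they occur
        match = HEADER.match(line)
        if match:
            current = {"speaker": match["speaker"],
                       "time": match["time"],
                       "words": ""}
            records.append(current)
        elif current is not None:
            # Any other line belongs to the most recent speaker
            current["words"] = (current["words"] + " " + line).strip()
    return records

# Made-up sample in the assumed format
sample = "Speaker A: (00:05)\n\nFirst remark.\n\nSpeaker B: (01:10:02)\n\nSecond remark.\nContinued.\n"
for record in parse_transcript(sample):
    print(record)
```

Each record keeps the speaker, time, and words together, which also makes it straightforward to load the result into a pandas DataFrame with pd.DataFrame(records).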
spkr = 'Donald Trump'
file = spkr.split()[-1] + event + '.txt' # last name + event, e.g. 'Trump-mosinee-2020.txt'
file
# use write instead of writelines since we don't want entire list
# remember to add new line
with open(file, 'w') as f:
    for i in range(len(speaker)):
        if speaker[i] == spkr:
            f.write(words[i] + '\n')
# Confirm good file
text = Path(file).read_text()
text
Web scraping with Beautiful Soup is used in machine learning and data science to extract useful data and then run algorithms or tasks on it, such as sentiment analysis in NLP.
If you need help with any of this, contact us at: contact@codersarts.com