News in a word cloud

This tutorial was originally published on DataCareer.

In this tutorial we will retrieve the latest news and visualise it in a word cloud, using Python 3.

NewsAPI.org is an easy-to-use API for retrieving news from over 30,000 sources all over the world. The API is free for all non-commercial projects (including open-source) and in-development commercial projects. You do need to register, though, to get an 'API key'. You can do this in a few seconds at https://newsapi.org/register.

Let's start by importing the required packages for this tutorial.

In [1]:
import pprint
import requests     # 2.19.1

After registering at NewsAPI.org, you can find your API key at https://newsapi.org/docs/authentication. The key below is a dummy, so please replace it with your own.

In [2]:
secret = 'your-api-key'
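
If you would rather not paste the key into the notebook itself, one option is to read it from an environment variable. This is only a minimal sketch: the variable name NEWSAPI_KEY is an assumption chosen for illustration, not something NewsAPI prescribes.

```python
import os

# NEWSAPI_KEY is a hypothetical variable name; pick whatever suits your setup.
# Fall back to the dummy value so the rest of the notebook still runs.
secret = os.environ.get('NEWSAPI_KEY', 'your-api-key')

if secret == 'your-api-key':
    print('No API key found in the environment; using the dummy key.')
```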

NewsAPI offers three endpoints:

  1. '/v2/top-headlines', for the most important headlines per country and category
  2. '/v2/everything', for all the news articles from over 30,000 sources
  3. '/v2/sources', for information on the various sources

We will use the 'everything' endpoint, to get news about 'Big Data'.

In [3]:
# Define the endpoint
url = 'https://newsapi.org/v2/everything'

In [4]:
# Specify the query and number of returns
parameters = {
    'q': 'big data', # query phrase
    'pageSize': 20,  # maximum is 100
    'apiKey': secret # your own API key
}
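
To get a feel for how such a dictionary ends up in the request URL, the sketch below encodes it with urlencode from the standard library. The names example_endpoint and example_parameters are made up here so they do not overwrite the notebook's own variables, and the key is the dummy one. The requests package does the encoding for you, so this is purely for inspection.

```python
from urllib.parse import urlencode

# Illustrative copies of the endpoint and parameters (dummy API key)
example_endpoint = 'https://newsapi.org/v2/everything'
example_parameters = {
    'q': 'big data',
    'pageSize': 20,
    'apiKey': 'your-api-key',
}

# urlencode turns the dictionary into a query string; spaces become '+'
query_string = urlencode(example_parameters)
print(f'{example_endpoint}?{query_string}')
# https://newsapi.org/v2/everything?q=big+data&pageSize=20&apiKey=your-api-key
```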

Now we can retrieve the news with the requests package.

In [5]:
# Make the request
response = requests.get(url, params=parameters)

# Convert the response to JSON format
response_json = response.json()

# Check out the dictionary's keys
print(response_json.keys())
dict_keys(['status', 'totalResults', 'articles'])
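
Before digging into the articles, it can be worth checking that the request actually succeeded. The sketch below assumes the error format described in the NewsAPI documentation, where a failed request comes back with 'status': 'error' plus a 'code' and a 'message'. The helper name check_newsapi_response and both sample dictionaries are made up for illustration.

```python
def check_newsapi_response(payload):
    """Return the list of articles, or raise if the API reported an error."""
    if payload.get('status') != 'ok':
        # 'code' and 'message' are assumed to be present on error responses
        raise RuntimeError(
            f"NewsAPI error {payload.get('code')}: {payload.get('message')}"
        )
    return payload.get('articles', [])


# Hypothetical payloads standing in for response.json()
ok_payload = {'status': 'ok', 'totalResults': 1,
              'articles': [{'title': 'Example'}]}
error_payload = {'status': 'error', 'code': 'apiKeyInvalid',
                 'message': 'Your API key is invalid.'}

print(len(check_newsapi_response(ok_payload)))

try:
    check_newsapi_response(error_payload)
except RuntimeError as err:
    print(err)
```

In the notebook itself you would pass response_json to the helper instead of the sample dictionaries.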

The news can probably be found under the key 'articles', so let's print the first entry.

In [6]:
pprint.pprint(response_json['articles'][0])
{'author': 'Dylan Haas',
 'content': "Unless you've been living under a rock the past few years, you've "
            'likely heard of big data. This illusive term can be intimidating, '
            'but put simply, it refers to the fact that consumers are now '
            'producing more raw data than companies can keep up with.\r\n'
            'Instead … [+1214 chars]',
 'description': "Unless you've been living under a rock the past few years, "
                "you've likely heard of big data. This illusive term can be "
                'intimidating, but put simply, it refers to the fact that '
                'consumers are now producing more raw data than companies can '
                'keep up with. Instead o…',
 'publishedAt': '2019-05-10T09:00:00Z',
 'source': {'id': 'mashable', 'name': 'Mashable'},
 'title': 'Want a competitive edge in the job market? Master big data.',
 'url': 'https://mashable.com/shopping/may-10-big-data-online-course-sale/',
 'urlToImage': 'https://mondrian.mashable.com/2019%252F05%252F10%252Fb7%252Fadc180fa9eec4e17a3e92dd5a5263807.c5a07.jpg%252F1200x630.jpg?signature=WwJTuQ2rPNmzAqXoBwUSYbHV3rg='}

That seems right. Let's loop over all the articles and print just the titles.

In [7]:
for article in response_json['articles']:
    print(article['title'])
Want a competitive edge in the job market? Master big data.
Senator proposes strict Do Not Track rules in new bill
Sisense acquires Periscope Data to build integrated data science and analytics solution
Details emerge of China’s ‘Big Brother’ surveillance app targeting Muslims
Adtech veteran Quantcast is latest tech giant to face GDPR privacy probe
Protecting your computer against Intel’s latest security flaw is easy, unless it isn’t
Why We Should Stop Fetishizing Privacy
Here Are the Best Account Security Methods, According to Google
Tink, the European banking platform, partners with British incumbent NatWest
Automakers have a choice: Become data companies or become irrelevant
The Dumb Truth About Google's Privacy Push
How Big Data Can Help Teach Us About Infectious Diseases
Get an Apple Watch for $200 Right Now
Senator Introduces Do Not Track Bill to Block Companies From Collecting Your Data
Does Customer Data Privacy Actually Matter? It Should.
Big revenues, huge valuations and major losses: charting the era of the unicorn IPO
U.K. Police Have a Message for Crime Victims: Hand Over Your Private Data
Explainer video: When does data become big data?
Ookla's New Interactive Map Is a Helpful Tool for Understanding 5G Coverage
Galaxy S10 5G's insane Verizon data speeds restore my faith in 5G - CNET

Pretty easy, right? Feel free to try out some other queries and the different endpoints. You can find the documentation at https://newsapi.org/docs and, if you're logged in, all the example queries are already filled in with your own API key.

Now that we have the headlines for 'Big Data', let's do something fun with them. We can visualise them with the wordcloud package. You can install the package via pip or conda-forge. Then import the wordcloud and matplotlib packages:

In [8]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

Now put all the headlines together in one string:

In [9]:
# Create an empty string
text_combined = ''
# Loop through all the headlines and add them to 'text_combined'.
# Add a space after every headline, so the last word of one headline
# is not glued to the first word of the next.
for article in response_json['articles']:
    text_combined += article['title'] + ' '
# Print the first 300 characters to screen for inspection
print(text_combined[0:300])
Want a competitive edge in the job market? Master big data. Senator proposes strict Do Not Track rules in new bill Sisense acquires Periscope Data to build integrated data science and analytics solution Details emerge of China’s ‘Big Brother’ surveillance app targeting Muslims Adtech veteran Quantca
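
An alternative way to build such a string is str.join, which also makes it easy to skip entries that have no title. Whether the API ever returns an article without a title is an assumption made here for illustration; sample_articles is a made-up stand-in for response_json['articles'].

```python
# Hypothetical sample standing in for response_json['articles']
sample_articles = [
    {'title': 'Master big data.'},
    {'title': None},  # assumed edge case: an article without a title
    {'title': 'Senator proposes new bill'},
]

# Join the non-empty titles with a single space between them
joined_titles = ' '.join(
    article['title'] for article in sample_articles if article.get('title')
)
print(joined_titles)
# Master big data. Senator proposes new bill
```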

Now that we have all the headlines together in one variable, we can use it to generate the word cloud with the following code:

In [10]:
# Build the word cloud from the combined headlines
wordcloud = WordCloud(max_font_size=40).generate(text_combined)

# Show the word cloud as an image, without axes
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
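
The word cloud scales each word by how often it occurs in the text. To see rough counts behind the picture, here is a minimal sketch with collections.Counter on three of the headlines above. The tokenisation (lowercasing and keeping only runs of letters) and the small stop-word set are simplifications chosen here; the wordcloud package applies its own rules, so its numbers will not match exactly.

```python
import re
from collections import Counter

# Three headlines from the output above, used as sample text
sample_text = (
    'Want a competitive edge in the job market? Master big data. '
    'How Big Data Can Help Teach Us About Infectious Diseases '
    'Explainer video: When does data become big data?'
)

# A tiny, hand-picked stop-word set (an assumption for this example)
stop_words = {'a', 'in', 'the', 'us', 'about', 'can', 'how', 'when', 'does'}

# Lowercase the text and keep only runs of letters as words
words = re.findall(r'[a-z]+', sample_text.lower())
counts = Counter(word for word in words if word not in stop_words)

print(counts.most_common(2))
# [('data', 4), ('big', 3)]
```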

What other cool things can you do with the NewsAPI and the wordcloud package?