The newscatcheR package provides three simple functions for reading RSS feeds from news outlets and returning them conveniently as tibbles.
The first function, get_news(),
returns a tibble of the RSS feed of a given site.
library(newscatcheR)

get_news(website = "news.ycombinator.com")
#> GET request successful. Parsing...
#> # A tibble: 30 x 9
#> feed_title feed_link feed_description feed_pub_date item_title
#> <chr> <chr> <chr> <dttm> <chr>
#> 1 Hacker Ne… https://… Links for the i… 2020-05-06 12:59:58 No cookie…
#> 2 Hacker Ne… https://… Links for the i… 2020-05-06 12:59:58 Uber is l…
#> 3 Hacker Ne… https://… Links for the i… 2020-05-06 12:59:58 Show HN: …
#> 4 Hacker Ne… https://… Links for the i… 2020-05-06 12:59:58 Simple Ho…
#> 5 Hacker Ne… https://… Links for the i… 2020-05-06 12:59:58 WinUI – T…
#> 6 Hacker Ne… https://… Links for the i… 2020-05-06 12:59:58 Systems S…
#> 7 Hacker Ne… https://… Links for the i… 2020-05-06 12:59:58 Bats in P…
#> 8 Hacker Ne… https://… Links for the i… 2020-05-06 12:59:58 Take care…
#> 9 Hacker Ne… https://… Links for the i… 2020-05-06 12:59:58 Point of …
#> 10 Hacker Ne… https://… Links for the i… 2020-05-06 12:59:58 Säkkijärv…
#> # … with 20 more rows, and 4 more variables: item_link <chr>,
#> # item_description <chr>, item_pub_date <dttm>, item_comments <chr>
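Because the result is an ordinary tibble, the usual data-frame tools apply to it. As a small sketch (assuming the item_pub_date datetime column shown above), the feed can be filtered down to items published today:

```r
library(newscatcheR)

hn <- get_news(website = "news.ycombinator.com")

# item_pub_date is a <dttm> column, so it can be compared by date directly
todays_items <- hn[as.Date(hn$item_pub_date) == Sys.Date(), ]
```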
The second function, get_headlines(),
is a helper that returns a tibble of just the headlines instead of the full RSS feed.
# adding a small time delay to avoid sending simultaneous requests to the server
Sys.sleep(1)
get_headlines(website = "news.ycombinator.com")
#> GET request successful. Parsing...
#> feed_entries$item_title
#> 1 No cookie consent walls, scrolling isn’t consent, says EU data protection body
#> 2 Uber is laying off 3,700, as rides plummet due to Covid-19
#> 3 Show HN: Axiom – No-code Browser Automation
#> 4 Simple Homemade TEA Laser (2007)
#> 5 WinUI – The modern native UI platform of Windows
#> 6 Systems Smart Enough to Know When They're Not Smart Enough (2017)
#> 7 Bats in Portuguese Libraries (2018)
#> 8 Take care editing bash scripts (2019)
#> 9 Point of WebGPU on Native
#> 10 Säkkijärven polkka
#> 11 Running a full IBM System/370 Mainframe on a Raspberry Pi Zero for ~5 years
#> 12 Initial Impressions of WSL 2
#> 13 The Framing of the Developer
#> 14 Visual Macros in TeXmacs [video]
#> 15 Onfim
#> 16 Early treatment of Covid-19 with HCQ and AZ:retrospective analysis of 1061 cases
#> 17 Circle – A C++ compiler with compile-time imperative metaprogramming
#> 18 Inkscape 1.0
#> 19 Ask HN: Any good FOSS alternative to Google's reCAPTCHA?
#> 20 I was tricked into thinking I had “grit”
#> 21 Retrofuturism
#> 22 Two Heads: A marriage devoted to the mind-body problem (2007)
#> 23 Rosetta: The Engine Behind Cray’s Slingshot Exascale-Era Interconnect
#> 24 Van Gogh's Favorite Books
#> 25 React-flow: a library to create interactive node-based graphs
#> 26 Post Mortem on Salt Incident
#> 27 Covering Science at Dangerous Speeds
#> 28 What we get wrong about Machiavelli
#> 29 Textile Hub: databases, storage, and remote IPFS for app builders
#> 30 Junk-Bond Sellers Desperate for Funding Swallow Yields over 10%
The third function, tld_sources(),
is a helper for browsing news sites by top-level domain. It is useful for seeing which news sites from a given country are present in the package's database.
tld_sources("de")
#> # A tibble: 40 x 2
#> url rss_endpoint
#> <chr> <chr>
#> 1 spiegel.de https://www.spiegel.de/international/index.rss
#> 2 zeit.de http://newsfeed.zeit.de/index
#> 3 thelocal.de https://www.thelocal.de/feeds/rss.php
#> 4 deutschland.de https://www.deutschland.de/en/feed-news/rss.xml
#> 5 raccoon.onyxbits.de https://raccoon.onyxbits.de/blog/index.xml
#> 6 abendblatt.de http://www.abendblatt.de/?service=Rss
#> 7 berliner-zeitung.de https://www.berliner-zeitung.de/feed/index.rss
#> 8 bild.de http://www.bild.de/rss-feeds/rss-16725492,feed=home.bild…
#> 9 bz-berlin.de http://www.bz-berlin.de/rss
#> 10 capital.de http://www.capital.de/rss
#> # … with 30 more rows
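Since tld_sources() returns the site names in the url column, its output can be fed straight back into get_news(). A minimal sketch, assuming the feed of the first listed site is reachable:

```r
library(newscatcheR)

# list the German news sites in the database, then fetch the feed of the first one
de_sites <- tld_sources("de")
get_news(website = de_sites$url[1])
```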
This package can be convenient if you need to fetch news from various websites for further analysis and you don't want to search manually for the URL of each site's RSS feed.
Assuming we have a vector of the news sites we want to follow:

sites <- c("bbc.com", "spiegel.de", "washingtonpost.com")

we can get a list of data frames with:
lapply(sites, get_news)
#> GET request successful. Parsing...
#>
#> GET request successful. Parsing...
#>
#> GET request successful. Parsing...
#> [[1]]
#> # A tibble: 26 x 13
#> feed_title feed_link feed_description feed_language feed_pub_date
#> <chr> <chr> <chr> <chr> <dttm>
#> 1 BBC News … https://… BBC News - World en-gb 2020-05-06 08:12:52
#> 2 BBC News … https://… BBC News - World en-gb 2020-05-06 08:12:52
#> 3 BBC News … https://… BBC News - World en-gb 2020-05-06 08:12:52
#> 4 BBC News … https://… BBC News - World en-gb 2020-05-06 08:12:52
#> 5 BBC News … https://… BBC News - World en-gb 2020-05-06 08:12:52
#> 6 BBC News … https://… BBC News - World en-gb 2020-05-06 08:12:52
#> 7 BBC News … https://… BBC News - World en-gb 2020-05-06 08:12:52
#> 8 BBC News … https://… BBC News - World en-gb 2020-05-06 08:12:52
#> 9 BBC News … https://… BBC News - World en-gb 2020-05-06 08:12:52
#> 10 BBC News … https://… BBC News - World en-gb 2020-05-06 08:12:52
#> # … with 16 more rows, and 8 more variables: feed_last_build_date <dttm>,
#> # feed_generator <chr>, feed_ttl <chr>, item_title <chr>, item_link <chr>,
#> # item_description <chr>, item_pub_date <dttm>, item_guid <chr>
#>
#> [[2]]
#> # A tibble: 20 x 11
#> feed_title feed_link feed_description feed_language feed_pub_date
#> <chr> <chr> <chr> <chr> <dttm>
#> 1 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 2 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 3 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 4 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 5 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 6 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 7 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 8 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 9 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 10 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 11 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 12 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 13 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 14 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 15 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 16 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 17 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 18 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 19 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> 20 DER SPIEG… https://… Deutschlands fü… de 2020-05-06 16:11:49
#> # … with 6 more variables: feed_last_build_date <dttm>, item_title <chr>,
#> # item_link <chr>, item_description <chr>, item_pub_date <dttm>,
#> # item_guid <chr>
#>
#> [[3]]
#> # A tibble: 25 x 10
#> feed_title feed_link feed_description feed_language feed_pub_date
#> <chr> <chr> <chr> <chr> <dttm>
#> 1 World http://w… The Washington … en-US 2020-05-06 07:00:54
#> 2 World http://w… The Washington … en-US 2020-05-06 07:00:54
#> 3 World http://w… The Washington … en-US 2020-05-06 07:00:54
#> 4 World http://w… The Washington … en-US 2020-05-06 07:00:54
#> 5 World http://w… The Washington … en-US 2020-05-06 07:00:54
#> 6 World http://w… The Washington … en-US 2020-05-06 07:00:54
#> 7 World http://w… The Washington … en-US 2020-05-06 07:00:54
#> 8 World http://w… The Washington … en-US 2020-05-06 07:00:54
#> 9 World http://w… The Washington … en-US 2020-05-06 07:00:54
#> 10 World http://w… The Washington … en-US 2020-05-06 07:00:54
#> # … with 15 more rows, and 5 more variables: item_title <chr>, item_link <chr>,
#> # item_description <chr>, item_pub_date <dttm>, item_guid <chr>
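The individual feeds differ in their feed_* metadata columns, but the output above shows they share the core item_* columns. Under that assumption, the list returned by lapply() can be collapsed into one tibble with base R:

```r
library(newscatcheR)

sites <- c("bbc.com", "spiegel.de", "washingtonpost.com")
feeds <- lapply(sites, get_news)
names(feeds) <- sites

# keep only the columns present in every feed, then stack the rows;
# rbind() preserves the list names as row-name prefixes, marking each item's source
common <- c("item_title", "item_link", "item_pub_date")
all_items <- do.call(rbind, lapply(feeds, function(x) x[common]))
```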