What do Wikipedia’s readers care about? Is Britney Spears more popular than Brittany? Is Asia Carrera more popular than Asia? How many people looked at the article on Santa Claus in December? How many looked at the article on Ron Paul?
What can you find?
Source: http://stats.grok.se/
The wikipediatrend package provides convenient access to daily page view counts (Wikipedia article traffic statistics) stored at http://stats.grok.se/ .
If you want to know how often an article has been viewed over time, and want to work with that data from within R, this package is for you. Maybe you want to compare how much attention articles in different languages received, and when; this package is for you. Are you into policy studies or epidemiology? Have a look at page counts for Flu, Ebola, Climate Change, or Millennium Development Goals and maybe build a model or two. Again, this package is for you.
If you simply want to browse Wikipedia page view statistics without all that coding, visit http://stats.grok.se/ and have a look around.
If nothing less than big data will do, get the raw data in its entirety at http://dumps.wikimedia.org/other/pagecounts-raw/ .
If days are too crude a measure of time for you, seconds will do if need be, and you do not care which articles the views belong to, go to http://datahub.io/dataset/english-wikipedia-pageviews-by-second .
For further information on the data source (Who? When? How? How good?) there are two Wikipedia articles: http://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics and http://en.wikipedia.org/wiki/Wikipedia:About_page_view_statistics .
stable CRAN version (http://cran.rstudio.com/web/packages/wikipediatrend/)
install.packages("wikipediatrend")
development version (https://github.com/petermeissner/wikipediatrend)
devtools::install_github("petermeissner/wikipediatrend")
… and load it via:
library(wikipediatrend)
The workhorse of the package is the wp_trend() function, which allows you to get page view counts as neat data frames like this:
page_views <- wp_trend("main_page")
page_views
## date count lang page rank month title
## 1 2015-06-27 13473954 en Main_page 2 201506 Main_page
## 2 2015-06-24 13899246 en Main_page 2 201506 Main_page
## 3 2015-06-25 13641545 en Main_page 2 201506 Main_page
## 4 2015-06-30 13067233 en Main_page 2 201506 Main_page
## 5 2015-06-11 19625134 en Main_page 2 201506 Main_page
## 6 2015-06-10 21152587 en Main_page 2 201506 Main_page
## 7 2015-06-13 10598467 en Main_page 2 201506 Main_page
## 8 2015-06-12 12423777 en Main_page 2 201506 Main_page
## 9 2015-06-15 13380283 en Main_page 2 201506 Main_page
## 10 2015-06-14 10842527 en Main_page 2 201506 Main_page
## 11 2015-06-16 13920109 en Main_page 2 201506 Main_page
## 12 2015-06-19 12892579 en Main_page 2 201506 Main_page
## 13 2015-06-06 19829956 en Main_page 2 201506 Main_page
## 14 2015-06-07 21789719 en Main_page 2 201506 Main_page
## 15 2015-06-04 15823541 en Main_page 2 201506 Main_page
## 16 2015-06-05 15293358 en Main_page 2 201506 Main_page
## 17 2015-06-18 13172261 en Main_page 2 201506 Main_page
## 18 2015-06-02 21715975 en Main_page 2 201506 Main_page
## 19 2015-06-03 15563456 en Main_page 2 201506 Main_page
## 20 2015-06-26 12925544 en Main_page 2 201506 Main_page
## 21 2015-06-01 21682574 en Main_page 2 201506 Main_page
## 22 2015-06-20 12246739 en Main_page 2 201506 Main_page
## 23 2015-06-21 12494026 en Main_page 2 201506 Main_page
## 24 2015-06-22 14479200 en Main_page 2 201506 Main_page
## 25 2015-06-23 13479181 en Main_page 2 201506 Main_page
## 26 2015-06-08 22099974 en Main_page 2 201506 Main_page
## 27 2015-06-09 21713790 en Main_page 2 201506 Main_page
## 28 2015-06-28 13224876 en Main_page 2 201506 Main_page
## 29 2015-06-29 13973454 en Main_page 2 201506 Main_page
… that can easily be turned into a plot …
library(ggplot2)
ggplot(page_views, aes(x=date, y=count)) +
geom_line(size=1.5, colour="steelblue") +
geom_smooth(method="loess", colour="#00000000", fill="#001090", alpha=0.1) +
scale_y_continuous( breaks=seq(5e6, 50e6, 5e6) ,
label= paste(seq(5,50,5),"M") ) +
theme_bw()
wp_trend() options
wp_trend() has several options, most of which are set to defaults:
page
from = Sys.Date() - 30
to = Sys.Date()
lang = "en"
file = ""
friendly
requestFrom
userAgent
page
The page option allows you to specify one or more article titles for which data should be retrieved.
These titles should be in the same format as shown in the address bar of your browser to ensure that the pages are found. If we want to get page views for the United Nations Millennium Development Goals and the article is found at “http://en.wikipedia.org/wiki/Millennium_Development_Goals”, the page title to pass to wp_trend()
should be Millennium_Development_Goals, not Millennium Development Goals or Millennium_development_goals or any other ‘mostly-like-the-original’ variation.
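If titles come from somewhere else, e.g. user input, a small helper can bring them closer to this format. The function below is not part of wikipediatrend, just a sketch: it replaces spaces with underscores and capitalizes the first letter, which matches many article titles but not all (the Millennium Development Goals article, for instance, capitalizes every word):

```r
# hypothetical helper, not part of the wikipediatrend package:
# turn "climate change" into "Climate_change"
wp_titleize <- function(x){
  x <- gsub(" ", "_", x, fixed = TRUE)
  substr(x, 1, 1) <- toupper(substr(x, 1, 1))
  x
}

wp_titleize("climate change")
## [1] "Climate_change"
```

When in doubt, checking the browser's address bar remains the safest way to get a title right.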
To ease data gathering, the page argument of wp_trend() accepts whole vectors of page titles and will retrieve data for each one after another.
page_views <-
wp_trend(
page = c( "Millennium_Development_Goals", "Climate_Change")
)
library(ggplot2)
ggplot(page_views, aes(x=date, y=count, group=page, color=page)) +
geom_line(size=1.5) + theme_bw()
from and to
These two options determine the time frame for which data shall be retrieved. The defaults gather the last 30 days but may be set to cover larger time frames as well. Note that there is no data prior to December 2007, so any earlier date will be set to this minimum.
page_views <-
wp_trend(
page = "Millennium_Development_Goals" ,
from = "2000-01-01",
to = prev_month_end()
)
library(ggplot2)
ggplot(page_views, aes(x=date, y=count, color=wp_year(date))) +
geom_line() +
stat_smooth(method = "lm", formula = y ~ poly(x, 22), color="#CD0000a0", size=1.2) +
theme_bw()
lang
This option determines from which language edition of Wikipedia the page views shall be retrieved: English, German, Chinese, Spanish, … . The default is "en" for the English Wikipedia. The option accepts either a single language shorthand, which is then used for all pages, or one shorthand per page.
page_views <-
wp_trend(
page = c("Objetivos_de_Desarrollo_del_Milenio", "Millennium_Development_Goals") ,
lang = c("es", "en"),
from = Sys.Date()-100
)
library(ggplot2)
ggplot(page_views, aes(x=date, y=count, group=lang, color=lang, fill=lang)) +
geom_smooth(size=1.5) +
geom_point() +
theme_bw()
file
This last option allows for storing the data retrieved by a call to wp_trend() in a file, e.g. file = "MyCache.csv". While MyCache.csv will be created if it does not exist already, it will never be overwritten by wp_trend(), thus allowing you to accumulate data over subsequent calls to wp_trend(). To get the stored data back into R, use wp_load(file = "MyCache.csv").
wp_trend("Cheese", file="cheeeeese.csv")
wp_trend("K\u00e4se", lang="de", file="cheeeeese.csv")
cheeeeeese <- wp_load( file="cheeeeese.csv" )
cheeeeeese
## date count lang page rank month title
## 43 2015-06-07 226 de K%C3%A4se 6057 201506 Käse
## 55 2015-06-08 329 de K%C3%A4se 6057 201506 Käse
## 56 2015-06-09 292 de K%C3%A4se 6057 201506 Käse
## 35 2015-06-10 294 de K%C3%A4se 6057 201506 Käse
## 34 2015-06-11 331 de K%C3%A4se 6057 201506 Käse
## 36 2015-06-13 196 de K%C3%A4se 6057 201506 Käse
## 41 2015-06-19 261 de K%C3%A4se 6057 201506 Käse
## 52 2015-06-21 199 de K%C3%A4se 6057 201506 Käse
## 53 2015-06-22 313 de K%C3%A4se 6057 201506 Käse
## 31 2015-06-24 299 de K%C3%A4se 6057 201506 Käse
## 49 2015-06-26 226 de K%C3%A4se 6057 201506 Käse
## 30 2015-06-27 137 de K%C3%A4se 6057 201506 Käse
## 57 2015-06-28 169 de K%C3%A4se 6057 201506 Käse
## 33 2015-06-30 258 de K%C3%A4se 6057 201506 Käse
## 21 2015-06-01 1818 en Cheese 705 201506 Cheese
## 15 2015-06-04 2089 en Cheese 705 201506 Cheese
## 14 2015-06-07 1401 en Cheese 705 201506 Cheese
## 6 2015-06-10 1849 en Cheese 705 201506 Cheese
## 10 2015-06-14 1036 en Cheese 705 201506 Cheese
## 9 2015-06-15 1375 en Cheese 705 201506 Cheese
## 11 2015-06-16 1172 en Cheese 705 201506 Cheese
## 17 2015-06-18 1021 en Cheese 705 201506 Cheese
## 12 2015-06-19 1044 en Cheese 705 201506 Cheese
## 22 2015-06-20 802 en Cheese 705 201506 Cheese
## 23 2015-06-21 810 en Cheese 705 201506 Cheese
## 2 2015-06-24 1123 en Cheese 705 201506 Cheese
## 3 2015-06-25 1227 en Cheese 705 201506 Cheese
## 20 2015-06-26 1092 en Cheese 705 201506 Cheese
## 1 2015-06-27 1061 en Cheese 705 201506 Cheese
##
## ... 29 rows of data not shown
When using wp_trend() you will notice that subsequent calls to the function might take considerably less time than earlier ones, given that the later calls ask for data that has been downloaded already. This is due to the caching system running in the background, which keeps track of everything downloaded so far. You can tell that wp_trend() had to download something when it reports one or more links to the stats.grok.se server, e.g. …
wp_trend("Cheese")
## http://stats.grok.se/json/en/201506/Cheese
wp_trend("Cheese")
## http://stats.grok.se/json/en/201506/Cheese
… but …
wp_trend("Cheese", from = Sys.Date()-60)
## http://stats.grok.se/json/en/201505/Cheese
## http://stats.grok.se/json/en/201506/Cheese
The current cache in memory can be accessed via:
wp_get_cache()
## date count lang page rank month
## 3417 2014-09-20 6851 de Islamischer_ ... -1 201409
## 2967 2014-12-19 20771 en Islamic_Stat ... -1 201412
## 3072 2015-04-08 31832 en Islamic_Stat ... -1 201504
## 778 2008-05-12 459 en Millennium_D ... 7435 200805
## 819 2008-06-14 343 en Millennium_D ... 7435 200806
## 1288 2009-10-03 574 en Millennium_D ... 7435 200910
## 1494 2010-05-15 759 en Millennium_D ... 7435 201005
## 1556 2010-07-25 863 en Millennium_D ... 7435 201007
## 1719 2010-12-24 567 en Millennium_D ... 7435 201012
## 1867 2011-05-20 1450 en Millennium_D ... 7435 201105
## 2318 2012-08-17 1228 en Millennium_D ... 7435 201208
## 2510 2013-02-27 3901 en Millennium_D ... 7435 201302
## 2546 2013-03-24 2167 en Millennium_D ... 7435 201303
## 400 2014-11-24 1938 en Millennium_D ... 7435 201411
## 4384 2010-07-05 0 en Syria 1802 201007
## 4784 2011-08-15 7010 en Syria 1802 201108
## 4877 2011-11-10 6282 en Syria 1802 201111
## 5025 2012-04-27 6380 en Syria 1802 201204
## 5061 2012-05-07 5700 en Syria 1802 201205
## 5127 2012-07-06 9037 en Syria 1802 201207
## 5617 2013-11-16 4971 en Syria 1802 201311
## 5625 2013-12-23 3908 en Syria 1802 201312
## 5765 2014-04-05 3551 en Syria 1802 201404
## 5963 2014-11-24 3826 en Syria 1802 201411
## 597 2015-03-01 534 es Objetivos_de ... 4160 201503
## 585 2015-03-14 470 es Objetivos_de ... 4160 201503
## 3922 2014-08-31 4120 ru %D0%98%D1%81 ... -1 201408
## 3974 2014-09-05 6691 ru %D0%98%D1%81 ... -1 201409
## 4173 2015-04-26 2644 ru %D0%98%D1%81 ... -1 201504
## title
## 3417 Islamischer_ ...
## 2967 Islamic_Stat ...
## 3072 Islamic_Stat ...
## 778 Millennium_D ...
## 819 Millennium_D ...
## 1288 Millennium_D ...
## 1494 Millennium_D ...
## 1556 Millennium_D ...
## 1719 Millennium_D ...
## 1867 Millennium_D ...
## 2318 Millennium_D ...
## 2510 Millennium_D ...
## 2546 Millennium_D ...
## 400 Millennium_D ...
## 4384 Syria
## 4784 Syria
## 4877 Syria
## 5025 Syria
## 5061 Syria
## 5127 Syria
## 5617 Syria
## 5625 Syria
## 5765 Syria
## 5963 Syria
## 597 Objetivos_de ...
## 585 Objetivos_de ...
## 3922 Исламское_го ...
## 3974 Исламское_го ...
## 4173 Исламское_го ...
##
## ... 6347 rows of data not shown
… and emptied by a call to wp_cache_reset().
While everything that is downloaded during a session is cached in memory, it can come in handy to mirror the cache on disk and reuse it in the next R session. To activate disk-caching for a session, simply use:
wp_set_cache_file( file = "myCache.csv" )
The function will reload whatever is stored in the file, and subsequent calls to wp_trend() will automatically add data as it is downloaded. The file used for disk-caching can be changed by another call, e.g. wp_set_cache_file(file = "myOtherCache.csv"), or disk-caching can be turned off completely by leaving the file argument empty.
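Put together, managing the disk cache over a session might look like this (a sketch with made-up file names; as described above, calling wp_set_cache_file() without a file turns disk-caching off again):

```r
library(wikipediatrend)

# start writing downloads to a cache file
wp_set_cache_file(file = "myCache.csv")

# switch to another cache file later on
wp_set_cache_file(file = "myOtherCache.csv")

# leave the file argument empty to turn disk-caching off;
# the in-memory cache of the session stays intact
wp_set_cache_file()
```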
If disk-caching should be enabled by default, one can store a path in the system/environment variable WP_CACHE_FILE. When the package is loaded, it looks for this variable via Sys.getenv("WP_CACHE_FILE") and uses the path for caching as if …
wp_set_cache_file( Sys.getenv("WP_CACHE_FILE") )
… had been typed in by the user.
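If the variable should be set permanently, one way is an entry in your .Renviron file, which R reads at startup (the path here is just an example):

```
WP_CACHE_FILE=~/wikipediatrend_cache.csv
```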
If comparing languages is important, one needs the exact article title for each language: while the article about the Millennium Goals has an English title in the English Wikipedia, it is of course named differently in Spanish, German, Chinese, … . One might look these titles up by hand or use the handy wp_linked_pages() function like this:
titles <- wp_linked_pages("Islamic_State_of_Iraq_and_the_Levant", "en")
titles <- titles[titles$lang %in% c("de", "es", "ar", "ru","zh-min-nan"),]
titles
## page lang title
## 1 %D8%AA%D9%86 ... ar تنظيم_الدولة ...
## 2 Iraq_kap_Lev ... zh-min-nan Iraq_kap_Lev ...
## 3 Islamischer_ ... de Islamischer_ ...
## 4 Estado_Isl%C ... es Estado_Islám ...
## 5 %D0%98%D1%81 ... ru Исламское_го ...
… then we can use the information to get data for several languages …
page_views <-
wp_trend(
page = titles$page,
lang = titles$lang,
from = "2014-08-01"
)
library(ggplot2)
for(i in unique(page_views$lang) ){
iffer <- page_views$lang==i
page_views[iffer, ]$count <- scale(page_views[iffer, ]$count)
}
ggplot(page_views, aes(x=date, y=count, group=lang, color=lang)) +
geom_line(size=1.2, alpha=0.5) +
ylab("standardized count\n(by lang: m=0, var=1)") +
theme_bw() +
scale_colour_brewer(palette="Set1") +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
AnomalyDetection
Currently the AnomalyDetection package is not available on CRAN, so we have to install it from GitHub, e.g. via install_github() from the devtools package.
# install.packages( "AnomalyDetection", repos="http://ghrr.github.io/drat", type="source")
library(AnomalyDetection)
library(dplyr)
library(ggplot2)
The package is a little picky about the data it accepts for processing, so we have to build a new data frame. It should contain only the date and count variables. Furthermore, date should be renamed to timestamp and transformed to type POSIXct.
page_views <- wp_trend("Syria", from = "2010-01-01")
page_views_br <-
page_views %>%
select(date, count) %>%
rename(timestamp=date) %>%
unclass() %>%
as.data.frame() %>%
mutate(timestamp = as.POSIXct(timestamp))
Having transformed the data, we can detect anomalies via AnomalyDetectionTs(). The function offers various options, e.g. the significance level for rejecting normal values (alpha); the maximum fraction of the data that may be flagged as anomalies (max_anoms); whether upward deviations, downward deviations, or irregularities in both directions should count as anomalies (direction); and, last but not least, whether the time frame under scrutiny is longer than one month (longterm).
Let's choose a greedy set of parameters and detect possible anomalies:
res <-
AnomalyDetectionTs(
x = page_views_br,
alpha = 0.05,
max_anoms = 0.40,
direction = "both",
longterm = TRUE
)$anoms
res$timestamp <- as.Date(res$timestamp)
head(res)
… and play the detected anomalies back into our page_views data set:
page_views <-
page_views %>%
mutate(normal = !(date %in% res$timestamp)) %>%
mutate(anom = date %in% res$timestamp)
class(page_views) <- c("wp_df", "data.frame")
Now we can plot counts and anomalies …
(
p <-
ggplot( data=page_views, aes(x=date, y=count) ) +
geom_line(color="steelblue") +
geom_point(data=filter(page_views, anom), color="red2", size=2) +
theme_bw()
)
… as well as compare running means:
p +
geom_line(stat = "smooth", size=2, color="red2", alpha=0.7) +
geom_line(data=filter(page_views, !anom),
stat = "smooth", size=2, color="dodgerblue4", alpha=0.5)
It seems that upward and downward anomalies mostly cancel each other out, since the two smooth lines (with and without anomalies) do not differ much. Nonetheless, keeping the anomalies in would bias the counts slightly upward, so we proceed with a cleaned-up data set:
page_views_clean <-
page_views %>%
filter(!anom) %>%
select(date, count, lang, page, rank, month, title)
page_views_br_clean <-
page_views_br %>%
filter(!page_views$anom)
BreakoutDetection
BreakoutDetection is a package that searches data for mean level shifts, dividing it into timespans of change and timespans of stability in the presence of seasonal noise. Like AnomalyDetection, the BreakoutDetection package is not available on CRAN but has to be obtained from GitHub.
# install.packages( "BreakoutDetection", repos="http://ghrr.github.io/drat", type="source")
library(BreakoutDetection)
library(dplyr)
library(ggplot2)
library(magrittr)
… again the workhorse function (breakout()) is picky and requires “a data.frame which has ‘timestamp’ and ‘count’ components”, like our page_views_br_clean.
The function has two general options: min.size tweaks the minimum length of a timespan, and method determines how many mean level changes may occur over the whole time frame. In addition there are several method-specific options, e.g. degree, beta, and percent, which control the sensitivity for adding further breakpoints. In the following call, percent tells the function that adding a breakpoint must improve the overall model fit by at least 5 percent.
br <-
breakout(
page_views_br_clean,
min.size = 30,
method = 'multi',
percent = 0.05,
plot = TRUE
)
br
In the following snippet we combine the break information with our page view data and have a look at the dates at which the breaks occurred.
breaks <- page_views_clean[br$loc,]
breaks
Next, we add a span variable capturing which page_views observations belong to which span, which allows us to aggregate the data.
page_views_clean$span <- 0
for (d in breaks$date ) {
page_views_clean$span[ page_views_clean$date > d ] %<>% add(1)
}
page_views_clean$mcount <- 0
for (s in unique(page_views_clean$span) ) {
iffer <- page_views_clean$span == s
page_views_clean$mcount[ iffer ] <- mean(page_views_clean$count[iffer])
}
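As a design note, the two loops above can also be expressed with dplyr's grouped operations. The following self-contained sketch uses made-up numbers to show the equivalent per-span mean via group_by() and mutate():

```r
library(dplyr)

# toy data standing in for page_views_clean (made-up counts)
toy <- data.frame(
  span  = c(0, 0, 1, 1),
  count = c(10, 20, 100, 200)
)

# per-span mean count, replacing the explicit loop above
toy <-
  toy %>%
  group_by(span) %>%
  mutate(mcount = mean(count)) %>%
  ungroup()

toy$mcount
## [1]  15  15 150 150
```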
spans <-
page_views_clean %>%
as_data_frame() %>%
group_by(span) %>%
summarize(
start = min(date),
end = max(date),
length = end-start,
mean_count = round(mean(count)),
min_count = min(count),
max_count = max(count),
var_count = var(count)
)
spans
Also, we can now plot the shifting mean.
ggplot(page_views_clean, aes(x=date, y=count) ) +
geom_line(alpha=0.5, color="steelblue") +
geom_line(aes(y=mcount), alpha=0.5, color="red2", size=1.2) +
theme_bw()