The R source code comparison is based on similarity coefficients for the names used in R programs or expressions. Use cases would be the detection of similar code sequences within a code base and the detection of plagiarism.
In the first case, detecting similar code sequences can lead to better code quality if the shared code is moved into a function rather than repeated in different places. In the second case, the aim is to find cheating.
The goal, however, is not perfect detection of similar code sequences, but rather to give clues as to where similar code sequences might be.
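The workflow consists of several steps: load the source code with sourcecode, calculate the similarity coefficients with similarities, inspect the coefficients numerically and graphically, and finally compare the relevant source codes side by side with browse.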
The makers of the package SimilaR (R Source Code Similarity Evaluation) have provided some sample files for testing:
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE)
#>
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/aa.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/aa1.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1_addLines.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1_variables.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2_addLines.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2_callReverse.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kendall4.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kendall4_variables.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kombinuj1.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kombinuj1_variables.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kwantyle1.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kwantyle1_variables.R
names(prgs)
#> [1] "aa.R" "aa1.R"
#> [3] "bucketSort1.R" "bucketSort1_addLines.R"
#> [5] "bucketSort1_variables.R" "isPrime2.R"
#> [7] "isPrime2_addLines.R" "isPrime2_callReverse.R"
#> [9] "kendall4.R" "kendall4_variables.R"
#> [11] "kombinuj1.R" "kombinuj1_variables.R"
#> [13] "kwantyle1.R" "kwantyle1_variables.R"
The parameter basename=TRUE ensures that the names of the list elements are the base names of the files and not the file names including the path.
The parameter silent=TRUE suppresses the output of the names of the parsed files. If an error occurs during parsing, the file is not loaded and is not included in the following steps.
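The behaviour on parse errors can be illustrated with base R alone. The following is only a sketch of the idea, not the code used by sourcecode; the helper safe_parse and the two in-memory "files" are hypothetical.

```r
# Hypothetical helper: parse R code given as text and return NULL
# instead of stopping when the code contains a syntax error.
safe_parse <- function(text) {
  tryCatch(parse(text = text), error = function(e) NULL)
}

# Two in-memory "files"; the second one has an unbalanced parenthesis.
sources <- list(
  good.R = "f <- function(x) x + 1",
  bad.R  = "g <- function(x) (x + 1"
)

parsed <- lapply(sources, safe_parse)

# Drop everything that failed to parse before any further analysis.
parsed <- Filter(Negate(is.null), parsed)
names(parsed)
```

Only good.R remains in the list, so later steps never see the file that could not be parsed.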
If you want to consider expressions and not the whole R file, you have to set the parameter minlines. sourcecode checks whether an expression in the source file has more than minlines lines. If so, the expression is kept for further analysis. The names of the list items in prgs then have the form filename[number]. For example, you could access the expression named prgs[["aa.R[1]"]].
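How expressions might be filtered by their number of lines can be sketched with base R source references. This is an illustration under assumptions, not the implementation in sourcecode: the file name example.R is made up, and numbering the kept expressions consecutively is an assumption about the naming scheme.

```r
# A small in-memory source "file" with two top-level expressions.
src <- c(
  "x <- 1",
  "f <- function(a) {",
  "  b <- a + 1",
  "  b * 2",
  "}"
)

# keep.source = TRUE attaches source references to the parsed expressions.
exprs <- parse(text = src, keep.source = TRUE)

# A source reference starts with: first line, first byte, last line, last byte.
nlines <- vapply(attr(exprs, "srcref"), function(ref) {
  pos <- as.integer(ref)
  pos[3] - pos[1] + 1L
}, integer(1))

minlines <- 3
keep <- exprs[nlines > minlines]  # "more than minlines lines", as worded above

# Assumed naming scheme: kept expressions are numbered consecutively.
names(keep) <- sprintf("example.R[%d]", seq_along(keep))
keep[["example.R[1]"]]
```

The one-line assignment is dropped and only the four-line function definition is kept.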
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE, minlines=3)
#>
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/aa.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/aa1.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1_addLines.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1_variables.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2_addLines.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2_callReverse.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kendall4.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kendall4_variables.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kombinuj1.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kombinuj1_variables.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kwantyle1.R
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kwantyle1_variables.R
names(prgs)
#> [1] "aa.R[1]" "aa1.R[1]"
#> [3] "bucketSort1.R[1]" "bucketSort1_addLines.R[1]"
#> [5] "bucketSort1_variables.R[1]" "isPrime2.R[1]"
#> [7] "isPrime2_addLines.R[1]" "isPrime2_callReverse.R[1]"
#> [9] "kendall4.R[1]" "kendall4_variables.R[1]"
#> [11] "kombinuj1.R[1]" "kombinuj1_variables.R[1]"
#> [13] "kwantyle1.R[1]" "kwantyle1_variables.R[1]"
The next step is to calculate similarity coefficients between all source text segments based on the names used:
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE, silent=TRUE)
simy <- similarities(prgs)
head(simy)
#> row col jaccard
#> 1 aa.R aa1.R 1.0000000
#> 2 bucketSort1.R bucketSort1_addLines.R 1.0000000
#> 3 isPrime2.R isPrime2_addLines.R 1.0000000
#> 4 kombinuj1.R kombinuj1_variables.R 0.3000000
#> 5 kwantyle1.R kwantyle1_variables.R 0.2000000
#> 6 bucketSort1.R bucketSort1_variables.R 0.1428571
This calculates the Jaccard coefficients based on the variable names.
The output can be interpreted line by line:
- aa.R and aa1.R use identical sets of variable names. Each variable name used in aa.R is also used in aa1.R and vice versa.
- bucketSort1_addLines.R and bucketSort1.R use identical sets of variable names.
- isPrime2_addLines.R and isPrime2.R use identical sets of variable names.
- kombinuj1_variables.R and kombinuj1.R do not use identical sets of variable names. Of all variable names occurring in either file, 30% occur in both.
The interpretation will be different if a different similarity coefficient is used! But in any case, a higher similarity coefficient corresponds to a larger proportion of variable names shared by both files (or expressions).
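To make the Jaccard interpretation concrete, here is a small worked example in base R. The two expressions are invented for illustration, and the coefficient is computed by hand rather than with similarities.

```r
# Two made-up expressions that share some variable names.
e1 <- quote(total <- price * count)
e2 <- quote(total <- price + tax)

v1 <- all.vars(e1)  # variable names used in the first expression
v2 <- all.vars(e2)  # variable names used in the second expression

# Jaccard coefficient: shared names divided by all names in either set.
shared  <- intersect(v1, v2)
either  <- union(v1, v2)
jaccard <- length(shared) / length(either)
jaccard
```

Here two names (total, price) are shared out of four names in total, so the coefficient is 0.5. A value of 1 would mean that both expressions use exactly the same variable names.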
type
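With the type parameter you can distinguish between different types of names: variable names (type="v", the default), all names (type="n"), and function names, that is, all names that are not variable names (type="f"):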
cat(as.character(prgs[[1]])) # source code
#> asd <- function(x) {
#> for (i in x) {
#> cat(i)
#> x[i] <- 3
#> }
#> }
all.vars(prgs[[1]]) # type="v", default
#> [1] "asd" "i" "x"
all.names(prgs[[1]]) # type="n"
#> [1] "<-" "asd" "function" "{" "for" "i"
#> [7] "x" "{" "cat" "i" "<-" "["
#> [13] "x" "i"
setdiff(all.names(prgs[[1]]), all.vars(prgs[[1]])) # type="f"
#> [1] "<-" "function" "{" "for" "cat" "["
minlen and ignore.case
With the parameter minlen you can exclude names that are shorter than minlen. The default is minlen=2 because the name of an index variable in loops often consists of only one letter, for example for (i in 1:n). The parameter ignore.case is either TRUE or FALSE. If TRUE (default), names are compared case-insensitively, so "A" and "a" are treated as the same name.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE, silent=TRUE)
simy <- similarities(prgs, minlen=4)
#> Warning in similarities(prgs, minlen = 4): no names found in aa.R, aa1.R
head(simy)
#> row col jaccard
#> 1 bucketSort1.R bucketSort1_addLines.R 1.0000000
#> 2 bucketSort1.R bucketSort1_variables.R 1.0000000
#> 3 bucketSort1_addLines.R bucketSort1_variables.R 1.0000000
#> 4 isPrime2.R isPrime2_addLines.R 1.0000000
#> 5 kombinuj1.R kombinuj1_variables.R 0.3333333
#> 6 kwantyle1.R kwantyle1_variables.R 0.2000000
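The effect of minlen and ignore.case on a set of names can be sketched as follows. The name vector is invented, and the filtering is a plain base R illustration of the description above, not the package code.

```r
# Invented names as they might be extracted from some expression.
nms <- c("i", "x", "asd", "Total", "total", "n")

minlen      <- 2
ignore.case <- TRUE

# Exclude names that are shorter than minlen.
kept <- nms[nchar(nms) >= minlen]

# With ignore.case = TRUE, names differing only in case are merged.
if (ignore.case) kept <- tolower(kept)
kept <- unique(kept)
kept
```

The single-letter names are removed and "Total" and "total" collapse into one name, so the remaining set is "asd" and "total". Raising minlen further can leave a file without any names, which is what the warning in the output above reports for aa.R and aa1.R.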
same.file
If you are only interested in similarities between different files, you can set the similarities between expressions from the same file to zero with same.file=FALSE. The use case here is to detect plagiarism across different files.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE, silent=TRUE, minlines=1)
simy <- similarities(prgs)
attr(simy, "similarity")[1:3,1:3]
#> aa.R[1] aa1.R[1] bucketSort1.R[1]
#> aa.R[1] 1 1 0
#> aa1.R[1] 1 1 0
#> bucketSort1.R[1] 0 0 1
simy <- similarities(prgs, same.file=FALSE)
attr(simy, "similarity")[1:3,1:3]
#> aa.R[1] aa1.R[1] bucketSort1.R[1]
#> aa.R[1] 0 1 0
#> aa1.R[1] 1 0 0
#> bucketSort1.R[1] 0 0 0
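One way to picture same.file=FALSE is as a mask over the similarity matrix. The following sketch assumes item names of the form filename[number], as described earlier; the matrix values are invented and the code is not taken from the package.

```r
# Invented similarity matrix for three expressions from two files.
nm <- c("a.R[1]", "a.R[2]", "b.R[1]")
m  <- matrix(c(1.0, 0.5, 0.2,
               0.5, 1.0, 0.3,
               0.2, 0.3, 1.0),
             nrow = 3, byrow = TRUE, dimnames = list(nm, nm))

# Recover the file name by stripping the trailing "[number]".
file <- sub("\\[[0-9]+\\]$", "", nm)

# TRUE wherever row and column expression come from the same file.
same <- outer(file, file, "==")

# Set all within-file similarities (including the diagonal) to zero.
m[same] <- 0
m
```

After masking, only the similarities between a.R and b.R remain non-zero, so within-file matches no longer dominate the sorted output.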
coeff (similarity)
With the parameter coeff a specific similarity coefficient can be selected (default: jaccard).
If you specify two sets with unique names set1, set2 and one set setfull with predefined names, four numbers will be calculated (default: setfull <- unique(c(set1,set2))):
inset1 <- setfull %in% unique(set1)
inset2 <- setfull %in% unique(set2)
p <- length(setfull)
n11 <- sum(inset1 & inset2)
n10 <- sum(inset1 & !inset2)
n01 <- sum(!inset1 & inset2)
n00 <- sum(!inset1 & !inset2)
The following coefficients can be calculated:
- braun = n11/max(n01+n11, n10+n11)
- dice = 2*n11/(n01+n10+2*n11)
- jaccard = n11/(n01+n10+n11) (default)
- kappa = 1/(1+p/2*(n01+n10)/(n00*n11-n01*n10))
- kulczynski = n11/(n01+n10)
- matching = (n00+n11)/p
- ochiai = n11/sqrt((n11+n10)*(n11+n01))
- phi = (n11*n00-n10*n01)/sqrt((n11+n10)*(n11+n01)*(n00+n10)*(n00+n01))
- russelrao = n11/p
- simpson = n11/min(n01+n11, n10+n11)
- sneath = n11/(n11+2*n01+2*n10)
- tanimoto = (n11+n00)/(n11+2*n01+2*n10+n00)
- yule = (n11*n00-n01*n10)/(n11*n00+n01*n10)
If a coefficient name is not found or a NaN is generated, zero is returned.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE, silent=TRUE)
simy <- similarities(prgs, coeff="m")
head(simy)
#> row col matching
#> 1 aa.R aa1.R 1.0000000
#> 2 bucketSort1.R bucketSort1_addLines.R 1.0000000
#> 3 isPrime2.R isPrime2_addLines.R 1.0000000
#> 4 kombinuj1.R kombinuj1_variables.R 0.3000000
#> 5 kwantyle1.R kwantyle1_variables.R 0.2000000
#> 6 bucketSort1.R bucketSort1_variables.R 0.1428571
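The four counts and a few of the coefficients can be computed by hand for a toy example. This is a sketch only: the helper counts and the name sets are hypothetical, and the formulas are the standard definitions of the Jaccard, Dice and simple matching coefficients.

```r
set1 <- c("a", "b", "c")
set2 <- c("b", "c", "d")

# Hypothetical helper returning the four counts and p for a reference set.
counts <- function(set1, set2, setfull = unique(c(set1, set2))) {
  inset1 <- setfull %in% unique(set1)
  inset2 <- setfull %in% unique(set2)
  list(n11 = sum(inset1 & inset2),  n10 = sum(inset1 & !inset2),
       n01 = sum(!inset1 & inset2), n00 = sum(!inset1 & !inset2),
       p   = length(setfull))
}

# Default reference set: the union of both sets, hence n00 is always 0.
n <- counts(set1, set2)
jaccard  <- with(n, n11 / (n01 + n10 + n11))
dice     <- with(n, 2 * n11 / (n01 + n10 + 2 * n11))
matching <- with(n, (n00 + n11) / p)

# A larger, predefined reference set makes n00 positive.
nfull <- counts(set1, set2, setfull = letters[1:6])
matching_full <- with(nfull, (n00 + n11) / p)

c(jaccard = jaccard, dice = dice, matching = matching,
  matching_full = matching_full)
```

With the default reference set, n00 is zero and p equals n01+n10+n11, so the matching coefficient coincides with the Jaccard coefficient. This is consistent with the matching output above showing the same values as the earlier Jaccard output. Coefficients that use n00 only become informative with a predefined setfull.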
decreasing and tol
The matrix m of similarities is checked for symmetry. It is considered symmetric if abs(m-t(m))<=tol holds for all entries; the result is stored in the attribute symmetrical. The matrix m is then transformed into a data frame, where the first column identifies the row, the second column identifies the column and the third column contains the similarity coefficient. If decreasing is TRUE (default), the data frame is sorted in descending order by the third column.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE, silent=TRUE)
simy <- similarities(prgs)
head(simy)
#> row col jaccard
#> 1 aa.R aa1.R 1.0000000
#> 2 bucketSort1.R bucketSort1_addLines.R 1.0000000
#> 3 isPrime2.R isPrime2_addLines.R 1.0000000
#> 4 kombinuj1.R kombinuj1_variables.R 0.3000000
#> 5 kwantyle1.R kwantyle1_variables.R 0.2000000
#> 6 bucketSort1.R bucketSort1_variables.R 0.1428571
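The symmetry check and the conversion into a sorted data frame can be sketched in base R. The matrix values and the value of tol are invented, and reporting only the upper triangle for a symmetric matrix is an assumption based on the output above, where each pair appears once.

```r
# Invented symmetric similarity matrix for three files.
nm <- c("a.R", "b.R", "c.R")
m  <- matrix(c(1.0, 0.3, 0.8,
               0.3, 1.0, 0.1,
               0.8, 0.1, 1.0),
             nrow = 3, byrow = TRUE, dimnames = list(nm, nm))

tol <- 1e-8  # assumed tolerance for this sketch
symmetrical <- all(abs(m - t(m)) <= tol)

# Assumption: for a symmetric matrix each pair is listed once (upper triangle).
idx <- which(upper.tri(m), arr.ind = TRUE)
df  <- data.frame(row     = rownames(m)[idx[, "row"]],
                  col     = colnames(m)[idx[, "col"]],
                  jaccard = m[idx],
                  stringsAsFactors = FALSE)

# decreasing = TRUE: highest similarity first.
df <- df[order(df$jaccard, decreasing = TRUE), ]
rownames(df) <- NULL
df
```

The most similar pair (a.R and c.R) ends up in the first row, which is why head(simy) is a convenient way to see the strongest candidates.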
The first step is to look at the distribution of the calculated coefficients, for example with a strip chart.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE, silent=TRUE)
simy <- similarities(prgs, type="n", minlen=3)
stripchart(simy$jaccard, "jitter", pch=19, xlab="Jaccard")
In a second step you can plot the coefficients in a diagram, where thicker edges correspond to higher similarity coefficients.
library("igraph")
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE, silent=TRUE)
simy <- similarities(prgs, type="n", minlen=3)
graph <- as_igraph(simy, diag=FALSE)
# color all edges with a large similarity coefficient in red
E(graph)$color <- ifelse(E(graph)$weight>0.4, "red", "grey")
plot(graph, edge.width=1+3*E(graph)$weight)
box()
The package igraph is used for the graphical representation. as_igraph uses the function igraph::graph_from_adjacency_matrix. If the coefficient matrix is symmetric, an undirected graph is created; otherwise a directed graph is created.
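The choice between an undirected and a directed graph can be sketched directly with igraph::graph_from_adjacency_matrix. This is not the code of as_igraph; the matrix and the tolerance are invented, and the example only runs if igraph is installed.

```r
# Invented symmetric similarity matrix for three files.
nm <- c("a.R", "b.R", "c.R")
m  <- matrix(c(1.0, 0.3, 0.8,
               0.3, 1.0, 0.1,
               0.8, 0.1, 1.0),
             nrow = 3, byrow = TRUE, dimnames = list(nm, nm))

symmetrical <- all(abs(m - t(m)) <= 1e-8)  # assumed tolerance

if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_adjacency_matrix(
    m,
    mode     = if (symmetrical) "undirected" else "directed",
    weighted = TRUE,   # store the coefficients as edge attribute "weight"
    diag     = FALSE   # ignore the similarity of a file with itself
  )
  print(igraph::E(g)$weight)
}
```

Storing the coefficients as edge weights is what allows expressions such as E(graph)$weight in the plotting code above to control edge colour and width.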
The last step is to compare the relevant source codes. The command browse creates and opens an HTML page that shows the source codes side by side.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE, silent=TRUE)
simy <- similarities(prgs, type="n", minlen=3)
if (interactive()) browse(prgs, simy, sum(simy$jaccard>0.4))