The conjurer package generates synthetic data that resembles real data. Its functions generate distributions parametrically: the data generation stays random, while the user defines the constraints on that randomness. This controlled randomness makes it possible to generate multiple data distributions that simulate realistic as well as deliberately unrealistic data. This paper describes the functions and their usage in more detail than the package manual. Each function is presented in its own subsection, with an overview of its purpose and examples with source code.

The function *buildNum* is used to generate a continuous data
distribution. In the context of this package, continuous data refers
to the float data type, not to continuity in the signal processing
sense. Although the generated distribution is of float data type, it
can be rounded to simulate a discrete data distribution. At its core,
the function uses a modified form of a sine curve, which lends itself
to manipulation so that the dispersion of the data can be skewed on
purpose. The dispersion is controlled by the parameter *disp*, which
takes a value between *(-pi/2)* and *(pi/2)*. To make the data more
realistic, the parameter *outliers* can be set to *1*. Note that with
outliers enabled, some values may fall outside the requested range,
i.e. outside *st* and *en*. This functionality can be used to generate
univariate distributions.

The following code illustrates the process of generating continuous data with and without outliers.

```
#invoke library
library("conjurer")
set.seed(123)
continuousData <- buildNum(n = 10, st = 0, en = 1, disp = (pi/3), outliers = 0)
continuousDataOutlier <- buildNum(n = 10, st = 0, en = 1, disp = (pi/3), outliers = 1)
par(mfrow=c(1,2))
plot(continuousData)
plot(continuousDataOutlier)
```
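The rounding and range caveats described above can be sketched in base R. This is a minimal illustration, not package code: the vector `x` is a made-up stand-in for values returned by *buildNum* with *st = 0*, *en = 1* and outliers enabled, and the clamping step is just one possible way to handle out-of-range values.

```r
# Stand-in values (hypothetical); in practice these would come from
# buildNum(..., st = 0, en = 1, outliers = 1)
st <- 0
en <- 1
x <- c(0.12, 0.57, 0.98, 1.34, -0.08)

# Flag values outside the requested range [st, en]
outOfRange <- x < st | x > en

# One option: clamp the outliers back into [st, en]
clamped <- pmin(pmax(x, st), en)

# Simulate a discrete 0..10 scale by rescaling and rounding the floats
discrete <- round(clamped * 10)
```

Whether to clamp, drop, or keep the outliers depends on what the simulated data is meant to represent.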

The function *buildNames* is used to generate string data. This
function uses a probabilistic distribution of letter sequences.
Unlike more advanced algorithms such as conditional random fields, this
function uses a more basic approach: the probability of a letter given
the letter preceding it. To this end, the function takes a data frame
of string data, from which the posterior probabilities are estimated.
Since the generation is based on these probabilities, the data frame
needs to be large enough that all relevant letter transitions are
present. If no data frame is provided, a default data frame containing
a predetermined set of baby names is used.

The following code illustrates the process of generating letter sequences, first from the default data frame provided in the package and then from mock data consisting of three short parts of a fictitious genome sequence.

```
#invoke library
library("conjurer")
set.seed(123)
buildNames(numOfNames = 3, minLength = 5, maxLength = 7)
#> [1] "jonnet" "jaceyn" "ronni"
d <- data.frame (first_column = c("ATGACGAGAGAGAGCA", "ATGACGAGAGAGCAGAGA","TACTGCTCTCTCGTAAATCG"))
buildNames(dframe=d, numOfNames = 3, minLength = 5, maxLength = 5)
#> Warning in buildNames(dframe = d, numOfNames = 3, minLength = 5, maxLength = 5):
#> Training data is not large enough. Expect less than minimum length names and/or
#> names that do not seem like training data
#> [1] "tt" "tt" "aaac"
```

*Note: Because the data frame of genome sequences is small, the
package warns that there is not enough training data.*
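The idea of choosing each letter according to the letter preceding it can be sketched as a first-order Markov chain in base R. This is a minimal sketch under stated assumptions, not the implementation used by *buildNames*: the helper names `buildTransitions` and `generateName`, the start/end markers `^` and `$`, and the toy training names are all hypothetical.

```r
# Toy training data (hypothetical)
trainingNames <- c("anna", "hannah", "nathan", "ethan")

# Count letter-to-letter transitions; "^" marks the start, "$" the end
buildTransitions <- function(words) {
  symbols <- c("^", letters, "$")
  counts <- matrix(0, nrow = length(symbols), ncol = length(symbols),
                   dimnames = list(symbols, symbols))
  for (w in tolower(words)) {
    chars <- strsplit(w, "")[[1]]
    chars <- c("^", chars[chars %in% letters], "$")  # ignore non-letters
    for (i in seq_len(length(chars) - 1)) {
      counts[chars[i], chars[i + 1]] <- counts[chars[i], chars[i + 1]] + 1
    }
  }
  counts
}

# Walk the chain: sample each letter given the one before it
generateName <- function(counts, maxLength = 7) {
  current <- "^"
  out <- character(0)
  while (length(out) < maxLength) {
    weights <- counts[current, ]
    if (sum(weights) == 0) break   # no observed continuation for this letter
    nxt <- sample(colnames(counts), size = 1, prob = weights)
    if (nxt == "$") break          # end-of-word marker reached
    out <- c(out, nxt)
    current <- nxt
  }
  paste(out, collapse = "")
}

counts <- buildTransitions(trainingNames)
set.seed(123)
generated <- generateName(counts, maxLength = 7)
```

With so few training names, many transitions have a count of zero, which is the same data sparsity that the warning above points to.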

The function *buildId* is used to generate alphanumeric identifiers.
In its current state, each identifier consists of a string prefix
followed by an incrementing number. This data can be used as a unique
identifier of an element or, in a database schema, as the primary key
of a table.

The following code illustrates the process of generating a unique
specimen id for a given number of elements.

```
#invoke library
library("conjurer")
buildId(numOfItems = 3, prefix = "specID")
#> [1] "specID1" "specID2" "specID3"
```
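The prefix-plus-counter scheme can be reproduced in base R. The helper `makeIds` below is hypothetical and not part of the package; its `width` argument for zero-padding is an added assumption that the example above does not use.

```r
# Hypothetical helper, not part of conjurer: prefix + incrementing counter
makeIds <- function(numOfItems, prefix, width = 0) {
  stopifnot(numOfItems >= 1)
  counter <- seq_len(numOfItems)
  # Optional zero-padding (an assumption for illustration)
  if (width > 0) counter <- formatC(counter, width = width, flag = "0")
  paste0(prefix, counter)
}

ids <- makeIds(3, "specID")
padded <- makeIds(3, "specID", width = 4)

# A primary key must be unique; anyDuplicated() returns 0 when it is
isUnique <- anyDuplicated(ids) == 0
```

Zero-padded identifiers sort lexically in the same order as their counters, which can be convenient when the key is stored as text.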

The function *buildPattern* is used to generate data that follows a
predetermined pattern. This function can be considered an intuitive
form of a finite state automaton or a regular expression. A pattern is
built as a probabilistic combination of *parts*.

The following code illustrates the process of generating patterns of
IP addresses and phone numbers. The *parts* are sampled according to
the respective probabilities given in *probs*.

```
#invoke library
library("conjurer")
set.seed(123)
parts <- list(c(172),c("."),c(16:31), c("."), c(0:255), c("."), c(0:255))
probs <- list(c(), c(),c(),c(), c(), c(), c())
buildPattern(n=5,parts = parts, probs = probs)
#> [1] "159.18.194.49" "118.20.13.152" "90.31.242.184" "92.24.98.25"
#> [5] "7.24.210.77"
parts <- list(c("+11","+44","+64"), c("-"), c(491,324,211), c(7821:8324))
probs <- list(c(0.25,0.25,0.50), c(), c(0.30,0.60,0.10), c())
buildPattern(n=5,parts = parts, probs = probs)
#> [1] "+64-3248193" "+64-3248310" "+64-3248245" "+64-4918264" "+64-3248231"
```
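The probabilistic combination of parts can be sketched in base R as follows. The helpers `drawPart` and `makePattern` are hypothetical, and treating an empty `c()` as equal weights is an assumption. The sketch samples by index because base R's `sample()` treats a single number *k* as the range *1:k*; the first octets in the IP output above, which vary instead of staying at 172, are consistent with that behaviour, though the source does not confirm that this is what happens inside *buildPattern*.

```r
# Hypothetical helper: draw one value from a part, optionally weighted
drawPart <- function(part, prob = NULL) {
  if (length(prob) == 0) prob <- NULL  # c() is NULL: assume equal weights
  # Sampling an index avoids sample() expanding a single number k to 1:k
  part[sample.int(length(part), size = 1, prob = prob)]
}

# Hypothetical helper: concatenate one draw from each part, n times
makePattern <- function(n, parts, probs) {
  stopifnot(length(parts) == length(probs))
  replicate(n, {
    pieces <- vapply(seq_along(parts), function(j) {
      as.character(drawPart(parts[[j]], probs[[j]]))
    }, character(1))
    paste(pieces, collapse = "")
  })
}

set.seed(123)
parts <- list(c("+11", "+44", "+64"), c("-"), c(491, 324, 211), c(7821:8324))
probs <- list(c(0.25, 0.25, 0.50), c(), c(0.30, 0.60, 0.10), c())
phones <- makePattern(5, parts, probs)

# A single-element numeric part stays fixed with index-based sampling
fixedOctet <- drawPart(c(172))
```

If a fixed first octet is wanted from *buildPattern* itself, passing the value as a string such as `"172"` may be worth trying, since a character vector is not expanded into a range by `sample()`.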

The function *buildHierarchy* is used to generate graph data
i.e. hierarchical data. Based on the number of levels and splits, the
tree structure is built. The graph data is then presented in the form of
a data frame.

The following code illustrates the process of generating a tree with two splits at each node and a depth of three levels.

```
#invoke library
library("conjurer")
buildHierarchy(splits = 2, numOfLevels = 3)
#>              level1            level2            level3
#> 1 Level_1_element_1 Level_2_element_1 Level_3_element_1
#> 2 Level_1_element_2 Level_2_element_2 Level_3_element_2
#> 3 Level_1_element_1 Level_2_element_3 Level_3_element_3
#> 4 Level_1_element_2 Level_2_element_4 Level_3_element_4
#> 5 Level_1_element_1 Level_2_element_1 Level_3_element_5
#> 6 Level_1_element_2 Level_2_element_2 Level_3_element_6
#> 7 Level_1_element_1 Level_2_element_3 Level_3_element_7
#> 8 Level_1_element_2 Level_2_element_4 Level_3_element_8
```
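The layout of the output above can be reproduced with a short base R sketch. The helper `makeHierarchy` is hypothetical; it mirrors the element numbering visible in the printed data frame and is not necessarily how *buildHierarchy* is implemented.

```r
# Hypothetical helper: one row per leaf, one column per level
makeHierarchy <- function(splits, numOfLevels) {
  stopifnot(splits >= 1, numOfLevels >= 1)
  nLeaves <- splits^numOfLevels
  cols <- lapply(seq_len(numOfLevels), function(k) {
    # Level k has splits^k elements; rows cycle through them in order
    idx <- (seq_len(nLeaves) - 1) %% splits^k + 1
    paste0("Level_", k, "_element_", idx)
  })
  names(cols) <- paste0("level", seq_len(numOfLevels))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

h <- makeHierarchy(splits = 2, numOfLevels = 3)

# In a tree, every level-2 element has exactly one level-1 parent
parentsPerChild <- tapply(h$level1, h$level2, function(p) length(unique(p)))
```

Because each row is a full root-to-leaf path, the data frame can be joined to other data at any level of the hierarchy.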

The function *buildPareto* is used to map data elements to
each other, i.e. to link variables. Such a mapping supports multiple
use cases, such as building a data frame from a set of variables or
building the data distribution of one variable in relation to
another.

The following code illustrates the process of generating a mapping between two factors such that 30 percent of one factor is linked to 70 percent of another factor.

```
#invoke library
library("conjurer")
set.seed(123)
f1 <- factor(c(1:10))
f2 <- factor(letters[1:12], labels = "f")
buildPareto(factor1 = f1, factor2 = f2, pareto = c(70,30))
#>    factor2 factor1
#> 1      f10       5
#> 2       f8       6
#> 3       f4       6
#> 4       f9       7
#> 5       f3       5
#> 6       f6       5
#> 7      f11       6
#> 8      f12       5
#> 9       f5       9
#> 10      f1      10
#> 11      f7       2
#> 12      f2       4
```
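One way to read such a skewed mapping is as weighted sampling: a small "head" group of levels in one factor receives most of the links from the other factor. The base R sketch below follows that reading. It is an assumption for illustration only: the helper `paretoMap`, the choice of which factor forms the head, and the weighting scheme are hypothetical and may differ from what *buildPareto* does internally.

```r
# Hypothetical helper: link every level of factor2 to a level of factor1,
# giving pareto[1] percent of the weight to pareto[2] percent of factor1
paretoMap <- function(factor1, factor2, pareto = c(70, 30)) {
  stopifnot(length(pareto) == 2, sum(pareto) == 100)
  lv1 <- levels(factor1)
  lv2 <- levels(factor2)
  nHead <- max(1, round(length(lv1) * pareto[2] / 100))
  headLevels <- sample(lv1, nHead)
  nTail <- max(1, length(lv1) - nHead)
  # Head levels share pareto[1] percent, the rest share pareto[2] percent
  w <- ifelse(lv1 %in% headLevels,
              (pareto[1] / 100) / nHead,
              (pareto[2] / 100) / nTail)
  data.frame(factor2 = lv2,
             factor1 = sample(lv1, length(lv2), replace = TRUE, prob = w),
             stringsAsFactors = FALSE)
}

set.seed(123)
f1 <- factor(c(1:10))
f2 <- factor(letters[1:12], labels = "f")
mapping <- paretoMap(f1, f2, pareto = c(70, 30))

# Frequency of each factor1 level in the mapping shows the skew
linkCounts <- table(mapping$factor1)
```

With only twelve links, the realised split will fluctuate around the requested percentages; larger factors give a closer match.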