Reading an xlsb workbook

Michael Allen

2023-03-03

library(readxlsb)

read_xlsb attempts to import a region from a binary format Excel workbook (xlsb)

The parameters are

read_xlsb(path, sheet, range, col_names, col_types, na, trim_ws, skip, ...)

sheet

Either a name, or the index of the sheet to read. Index of the first sheet is 1. If the sheet name is embedded in the range argument, or implied if range is a named range, then this argument is ignored

range

range can be specified as

col_names

col_types

Can be implied from the spreadsheet or specified in advanced. When specifying types, options are

When implying types from the underlying spreadsheet data, the resultant type is the regarded as the ‘least fragile’.

Effectively the order is logicaldatetimeintegerdoublestring

Currently ‘date’ is implied from cell formatting. It is either one of the built-in Excel datetime formats or a custom format where the format string contains only the characters Y, M, D, H, S, y, m, d, h, s and - (dash), : (colon), (space), . (dot). That should be good enough to identify any dates.

If ‘date’ is specified as the column type, then any strings are converted to dates. The format, I’m afraid, isn’t flexible at this stage - it’s assumed to be of the type “%Y-%m-%dT%H:%M:%S”.

I think there’s a 1900 leap year bug in Excel. I haven’t corrected for that.

If any of the cells in a ‘date’ column contain time, then a POSIXct object is returned, otherwise a Date object is returned. The timezone for POSIXct is set as UTC. Seems like the sensible thing to do.

na

A character string that is interpret as NA. This does not effect the implied data type for a column.

trim_ws

Should leading and trailing whitespaces be trimmed from character strings?

Additional options. At present just debug = TRUE is supported. This returns a list with fields ‘result’ set to the resulting data.frame and ‘env’ set to an internal environment that may be useful for debugging.

res = read_xlsb(path = system.file("extdata", "TestBook.xlsb", package = "readxlsb"), range = "PORTFOLIO", debug = TRUE)

ls(res$env)
#> [1] "content"      "named_ranges" "sheets"       "stream"

res$env$named_ranges
#>             name                     range sheet_idx first_column first_row
#> 1   INFO_RELEASE          FirstSheet!$A$11         0            1        11
#> 2        OUTLOOK 'My SecondTab'!$A$1:$C$13         1            1         1
#> 3      PORTFOLIO      FirstSheet!$A$3:$C$9         0            1         3
#> 4 SAVED_DATETIME          FirstSheet!$C$14         0            3        14
#> 5          TITLE           FirstSheet!$A$1         0            1         1
#> 6    UNICODE_STR          FirstSheet!$A$12         0            1        12
#>   last_column last_row
#> 1           1       11
#> 2           3       13
#> 3           3        9
#> 4           3       14
#> 5           1        1
#> 6           1       12