The ASCII files provided by the
EPA, used for building the local database
(download_ecotox_data()
), contain all required data. As
documented by the EPA most table fields are stored as text (with a few
exceptions). During the build process, all fields are kept as is,
without any cleanup or standardisation. This is done to avoid any data
loss or corruption and keep it as close to the source as possible.
Therefore, it is likely that you need to post-process data after
querying the locally built database.
Although it is the user’s responsibility to evaluate the correctness
and validity of the data, the ECOTOXr
package provides some
tools to make the cleanup process easier. This vignette presents
important aspects for cleaning:
In general there are two types of sanitising functions; those named
as_..._ecotox()
and those starting with
process_ecotox_...s()
. Where ‘...
’ is the data
type being sanitised (e.g. unit
, numeric
, or
date
). The first function type (‘as’) handles vectors of
character
s. The second function type (‘process’), handles
data.frame
s, where relevant columns are automatically
detected and processed with the ‘as’ type functions.
Note that the sanitation routines are subject to development, so they
may change. For reproducible results you should therefore always report
the version of ECOTOXr
you are using
(cite_ecotox()
).
Quantity units are vital for the interpretation of measurements. The
ECOTOX database contains units as reported by its source publication. As
a result, the units are often not stored consistently and are not
standardised. The ECOTOXr
package implements a function
that sanitises the unit text fields and then parses them with the units package. This
package provides instruments to convert units using the UNIDATA udunits
library.
The advantage of using the units
package is that it
provides a mechanism to apply arithmetic manipulations of data and
conversion between compatible units. Or as the documentation of the
package puts it: “When used in expression, it automatically converts
units, and simplifies units of results when possible; in case of
incompatible units, errors are raised”
So the goal of the sanitation steps here is to create a format that
can be parsed by the units
package. In order to achieve
this the following steps are performed:
units
package are converted such that they are handled
correctly. For instance ‘C’ in the ECOTOX database frequently stands for
‘degrees Celsius’ (although it is also used to indicate Carbon). So if
it’s not an annotation, ‘C’ is replaced with ‘Celsius’. Another example
is ‘sqft’ (square feet) which can not be interpreted by the units
package is replaced with ‘ft2’.units::mixed_units(1, "unit")
).The documentation of the as_unit_ecotox()
function has a
more detailed description of the cleanup procedure. If you need even
more details you can check the source
code.
In order to demonstrate how unit sanitation works in this packages, let’s first initialise a vector of mishmash units. These are actually a random sample from the ECOTOX database, not necessarily the most common ones:
library(ECOTOXr) |> suppressMessages()
library(dplyr) |> suppressMessages()
mishmash <-
c("ppm-d", "ml/2.5 cm eu", "fl oz/10 gal/1k sqft", "kg/100 L",
"mopm", "ng/kg", "ug", "AI ng/g", "PH", "pm", "uM/cm3", "1e-4 mM",
"degree", "fs", "mg/TI", "RR", "ug/g org/d", "1e+4 IU/TI", "pg/mg TE",
"pmol/mg", "1e-9/l", "no >15 cm", "umol/mg pro", "cc/org/wk", "PIg/L",
"ug/100 ul/org", "ae mg/kg diet/d", "umol/mg/h", "cmol/kg d soil",
"ug/L diet", "kg/100 kg sd", "1e+6 cells", "ul diet", "S", "mmol/h/g TI",
"g/70 d", "vg", "ng/200 mg diet", "uS/cm2", "AI ml/ha", "AI pt/acre",
"mg P/h/g TI", "no/m", "kg/ton sd", "ug/g wet wt", "AI mg/2 L diet",
"nmol/TI", "umol/g wet wt", "PSU", "Wijs number")
With as_unit_ecotox()
, the mishmash of units,
represented by character
strings are cleaned and coerced to
units::mixed_units()
. As units
objects have a
numeric component, but the character
strings from the
database do not, each unit is given a value of 1
. As you
can see not all units in the mishmash
vector can be
interpreted and are just returned as arbitrary 1 unit
.
as_unit_ecotox(mishmash, warn = FALSE)
#> Mixed units: d*ppm (1), ml/2.5cm (1), oz/1000ft2/10gal (1), kg/100L (1), month (1), ng/kg (1), ug (1), ng/g (1), unit (13), umol/L/cm3 (1), 0.0001mmol/L (1), ° (1), ug/d/g (1), pg/mg (1), pmol/mg (1), umol/mg (1), counts/counts/week (1), ug/100ul/counts (1), mg/d/kg (1), umol/h/mg (1), cmol/kg (1), ug/L (1), kg/100kg (1), 1000000counts (1), ul (1), S (1), mmol/g/h (1), g/70d (1), ng/200mg (1), uS/cm2 (1), ml/ha (1), pt/acre (1), mg/g/h (1), counts/m (1), kg/ton (1), ug/g (1), mg/2L (1), umol/g (1)
#> 1 [d*ppm], 1 [ml/2.5cm], 1 [oz/1000ft2/10gal], 1 [kg/100L], 1 [month], 1 [ng/kg], 1 [ug], 1 [ng/g], 1 [unit], 1 [unit], 1 [umol/L/cm3], 1 [0.0001mmol/L], 1 [°], 1 [unit], 1 [unit], 1 [unit], 1 [ug/d/g], 1 [unit], 1 [pg/mg], 1 [pmol/mg], 1 [unit], 1 [unit], 1 [umol/mg], 1 [counts/counts/week], 1 [unit], 1 [ug/100ul/counts], 1 [mg/d/kg], 1 [umol/h/mg], 1 [cmol/kg], 1 [ug/L], 1 [kg/100kg], 1 [1000000counts], 1 [ul], 1 [S], 1 [mmol/g/h], 1 [g/70d], 1 [unit], 1 [ng/200mg], 1 [uS/cm2], 1 [ml/ha], 1 [pt/acre], 1 [mg/g/h], 1 [counts/m], 1 [kg/ton], 1 [ug/g], 1 [mg/2L], 1 [unit], 1 [umol/g], 1 [unit], 1 [unit]
With process_ecotox_units()
you can process an entire
data.frame
/tibble
, where each column ending
with _unit
is processed (i.e. as_unit_ecotox()
is called on them). By setting the .names
argument, you can
preserve the original unit column:
tibble(mishmash_unit = mishmash) |>
process_ecotox_units(.names = "{.col}_parsed", warn = FALSE)
#> # A tibble: 50 × 2
#> mishmash_unit mishmash_unit_parsed
#> <chr> <mixed_units>
#> 1 ppm-d 1 [d*ppm]
#> 2 ml/2.5 cm eu 1 [ml/2.5cm]
#> 3 fl oz/10 gal/1k sqft 1 [oz/1000ft2/10gal]
#> 4 kg/100 L 1 [kg/100L]
#> 5 mopm 1 [month]
#> 6 ng/kg 1 [ng/kg]
#> 7 ug 1 [ug]
#> 8 AI ng/g 1 [ng/g]
#> 9 PH 1 [unit]
#> 10 pm 1 [unit]
#> # ℹ 40 more rows
As the database contains over 6,000 unique unit codes, it is likely that not all units are processable. Also, because the codes are not always consistent, some of them may not be interpreted correctly. Most frequently occurring units should parse correctly. If you think a specific code is not parsed correctly, and it is not highly outlandish, you could file an issue report. Furthermore, you should always inspect automatically parsed units for correctness.
Another point of attention is the removal of annotations from the unit. Consider the concentration unit with the following annotations:
as_unit_ecotox(c("mg/L CO3", "mg/L CaCO3", "mg/L HCO3"))
#> Mixed units: mg/L (3)
#> 1 [mg/L], 1 [mg/L], 1 [mg/L]
Note that they are all interpreted as [mg/L]
. Although
technically the same unit, they are definitely not directly compatible.
The units
package does not support annotations, so you need
to keep track of them yourself.
First let me explain what is meant by ‘numerics’ in the ECOTOX
database. These are all records that have a accompanying measurement
unit in the database. This includes, concentrations, durations and many
others. These records are stored as text fields in the database. So in
order to interpret them as actual numerics in R, they need to be coerced
to numerics. You could use a simple call to as.numeric()
to
do this, but that will not always work.
The text fields may contain operators such as ‘<’, ‘>’, ‘~’, etc. I think this is a mistake and not by design, because many of the numeric fields have a corresponding operator field where this operator could be stored. Text fields can also contain labelling text (such as asterisk symbol) or inconsistent decimal or thousand separators.
This is why there is as_numeric_ecotox()
which first
cleans the text records before coercing them to numerics:
## Text fields as possibly encountered in the database
text_records <-
c("10", " 2", "3 ", "~5", "9.2*", "2,33",
"2,333", "2.1(1.0 - 3.2)", "1-5", "1e-3")
as_numeric_ecotox(text_records)
#> Warning in as_numeric_ecotox(text_records): NAs introduced by coercion
#> [1] 10.000 2.000 3.000 5.000 9.200 2.330 2333.000 2.100
#> [9] NA 0.001
#> attr(,"has_notation")
#> [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> attr(,"has_brackets")
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
You can use process_ecotox_numerics()
to process a
data.frame
/tibble
resulting from a search
query. It automatically applies as_numeric_ecotox()
to
columns containing numeric information:
text_tbl <- tibble(conc1_mean = text_records)
process_ecotox_numerics(text_tbl, warn = FALSE)
#> # A tibble: 10 × 1
#> conc1_mean
#> <dbl>
#> 1 10
#> 2 2
#> 3 3
#> 4 5
#> 5 9.2
#> 6 2.33
#> 7 2333
#> 8 2.1
#> 9 NA
#> 10 0.001
As indicated above all notations and operators included with numerics
are stripped in the cleaning process. These notations and operators are
potentially important for the interpretation of the values. It may be
wise to keep track of them. One way to do this is by first trying to
coerce texts to numerics with as.numeric()
and then with
as_numeric_ecotox()
. The cases where the first returns
NA
but the latter returns a value, is likely to contain
notations or operators (or is just inconsistently formatted). You could
also use the .names
argument in
process_ecotox_numerics()
to rename the numeric columns and
keep the original text fields.
process_ecotox_numerics(text_tbl, warn = FALSE, .names = "{.col}_num") |>
mutate(
test = is.na(as.numeric(conc1_mean)) &
!is.na(as_numeric_ecotox(conc1_mean, warn = FALSE))
)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `test = &...`.
#> Caused by warning:
#> ! NAs introduced by coercion
#> # A tibble: 10 × 3
#> conc1_mean conc1_mean_num test
#> <chr> <dbl> <lgl>
#> 1 "10" 10 FALSE
#> 2 " 2" 2 FALSE
#> 3 "3 " 3 FALSE
#> 4 "~5" 5 TRUE
#> 5 "9.2*" 9.2 TRUE
#> 6 "2,33" 2.33 TRUE
#> 7 "2,333" 2333 TRUE
#> 8 "2.1(1.0 - 3.2)" 2.1 TRUE
#> 9 "1-5" NA FALSE
#> 10 "1e-3" 0.001 FALSE
The ECOTOX contains several date fields. They can represent
meta-information about the record (date created and modified),
administrative information (publication date), or excremental
information (application date). These dates are stored as text in the
database. As not all records are consistent or complete, some cleaning
is required before coercing the text to a Date object
(?Date
).
The example below shows some typical date formats as encountered in
the database and how to coerce them to Date
objects using
as_date_ecotox()
:
char_date <- c("5-19-1987 ", "5/dd/2021", "3/19/yyyy", "1985", "mm/19/1999",
"October 2004", "nr/nr/2015")
as_date_ecotox(char_date)
#> [1] "1987-05-19" "2021-05-01" NA "1985-01-01" "1999-01-19"
#> [6] "2004-10-01" "2015-01-01"
The only date that cannot be coerced is the one with an unspecified
year. It is returned as NA
.
You can use process_ecotox_dates()
to process a
data.frame
/tibble
as returned by a search
query. Date columns are automatically coerced with
as_date_ecotox()
. Column names ending with
_date
are recognised as date records.