Sanitising ECOTOX

The ASCII files provided by the EPA, used for building the local database (download_ecotox_data()), contain all required data. As documented by the EPA most table fields are stored as text (with a few exceptions). During the build process, all fields are kept as is, without any cleanup or standardisation. This is done to avoid any data loss or corruption and keep it as close to the source as possible. Therefore, it is likely that you need to post-process data after querying the locally built database.

Although it is the user’s responsibility to evaluate the correctness and validity of the data, the ECOTOXr package provides some tools to make the cleanup process easier. This vignette presents important aspects for cleaning:

units
numeric data
dates

In general there are two types of sanitising functions; those named as_..._ecotox() and those starting with process_ecotox_...s(). Where ‘...’ is the data type being sanitised (e.g. unit, numeric, or date). The first function type (‘as’) handles vectors of characters. The second function type (‘process’), handles data.frames, where relevant columns are automatically detected and processed with the ‘as’ type functions.

Note that the sanitation routines are subject to development, so they may change. For reproducible results you should therefore always report the version of ECOTOXr you are using (cite_ecotox()).

Sanitising units

Quantity units are vital for the interpretation of measurements. The ECOTOX database contains units as reported by its source publication. As a result, the units are often not stored consistently and are not standardised. The ECOTOXr package implements a function that sanitises the unit text fields and then parses them with the units package. This package provides instruments to convert units using the UNIDATA udunits library.

The advantage of using the units package is that it provides a mechanism to apply arithmetic manipulations of data and conversion between compatible units. Or as the documentation of the package puts it: “When used in expression, it automatically converts units, and simplifies units of results when possible; in case of incompatible units, errors are raised”

So the goal of the sanitation steps here is to create a format that can be parsed by the units package. In order to achieve this the following steps are performed:

Annotations are stripped (i.e., prefixes and suffixes to units, indicating for instance ‘active ingredient’, the medium it refers to (e.g. soil), etc.)
Ambiguous units like percentages and ‘parts per …’ are converted to more explicit units where possible. This is only possible when it is recorded if the units indicate if they are w/w, w/v, v/v or v/w. If this conversion is not possible, the unit is kept as is.
Units that are not known or interpreted incorrectly by the units package are converted such that they are handled correctly. For instance ‘C’ in the ECOTOX database frequently stands for ‘degrees Celsius’ (although it is also used to indicate Carbon). So if it’s not an annotation, ‘C’ is replaced with ‘Celsius’. Another example is ‘sqft’ (square feet) which can not be interpreted by the units package is replaced with ‘ft2’.
Units that are not reported or interpretable are set as generic ‘unit’ (units::mixed_units(1, "unit")).

The documentation of the as_unit_ecotox() function has a more detailed description of the cleanup procedure. If you need even more details you can check the source code.

In order to demonstrate how unit sanitation works in this packages, let’s first initialise a vector of mishmash units. These are actually a random sample from the ECOTOX database, not necessarily the most common ones:

library(ECOTOXr) |> suppressMessages()
library(dplyr)   |> suppressMessages()

mishmash <-
  c("ppm-d", "ml/2.5 cm eu", "fl oz/10 gal/1k sqft", "kg/100 L",
    "mopm", "ng/kg", "ug", "AI ng/g", "PH", "pm", "uM/cm3", "1e-4 mM",
    "degree", "fs", "mg/TI", "RR", "ug/g org/d", "1e+4 IU/TI", "pg/mg TE",
    "pmol/mg", "1e-9/l", "no >15 cm", "umol/mg pro", "cc/org/wk", "PIg/L",
    "ug/100 ul/org", "ae mg/kg diet/d", "umol/mg/h", "cmol/kg d soil",
    "ug/L diet", "kg/100 kg sd", "1e+6 cells", "ul diet", "S", "mmol/h/g TI",
    "g/70 d", "vg", "ng/200 mg diet", "uS/cm2", "AI ml/ha", "AI pt/acre",
    "mg P/h/g TI", "no/m", "kg/ton sd", "ug/g wet wt", "AI mg/2 L diet",
    "nmol/TI", "umol/g wet wt", "PSU", "Wijs number")

With as_unit_ecotox(), the mishmash of units, represented by character strings are cleaned and coerced to units::mixed_units(). As units objects have a numeric component, but the character strings from the database do not, each unit is given a value of 1. As you can see not all units in the mishmash vector can be interpreted and are just returned as arbitrary 1 unit.

as_unit_ecotox(mishmash, warn = FALSE)
#> Mixed units: d*ppm (1), ml/2.5cm (1), oz/1000ft2/10gal (1), kg/100L (1), month (1), ng/kg (1), ug (1), ng/g (1), unit (13), umol/L/cm3 (1), 0.0001mmol/L (1), ° (1), ug/d/g (1), pg/mg (1), pmol/mg (1), umol/mg (1), counts/counts/week (1), ug/100ul/counts (1), mg/d/kg (1), umol/h/mg (1), cmol/kg (1), ug/L (1), kg/100kg (1), 1000000counts (1), ul (1), S (1), mmol/g/h (1), g/70d (1), ng/200mg (1), uS/cm2 (1), ml/ha (1), pt/acre (1), mg/g/h (1), counts/m (1), kg/ton (1), ug/g (1), mg/2L (1), umol/g (1) 
#> 1 [d*ppm], 1 [ml/2.5cm], 1 [oz/1000ft2/10gal], 1 [kg/100L], 1 [month], 1 [ng/kg], 1 [ug], 1 [ng/g], 1 [unit], 1 [unit], 1 [umol/L/cm3], 1 [0.0001mmol/L], 1 [°], 1 [unit], 1 [unit], 1 [unit], 1 [ug/d/g], 1 [unit], 1 [pg/mg], 1 [pmol/mg], 1 [unit], 1 [unit], 1 [umol/mg], 1 [counts/counts/week], 1 [unit], 1 [ug/100ul/counts], 1 [mg/d/kg], 1 [umol/h/mg], 1 [cmol/kg], 1 [ug/L], 1 [kg/100kg], 1 [1000000counts], 1 [ul], 1 [S], 1 [mmol/g/h], 1 [g/70d], 1 [unit], 1 [ng/200mg], 1 [uS/cm2], 1 [ml/ha], 1 [pt/acre], 1 [mg/g/h], 1 [counts/m], 1 [kg/ton], 1 [ug/g], 1 [mg/2L], 1 [unit], 1 [umol/g], 1 [unit], 1 [unit]

With process_ecotox_units() you can process an entire data.frame/tibble, where each column ending with _unit is processed (i.e. as_unit_ecotox() is called on them). By setting the .names argument, you can preserve the original unit column:

tibble(mishmash_unit = mishmash) |>
  process_ecotox_units(.names = "{.col}_parsed", warn = FALSE)
#> # A tibble: 50 × 2
#>    mishmash_unit        mishmash_unit_parsed
#>    <chr>                       <mixed_units>
#>  1 ppm-d                           1 [d*ppm]
#>  2 ml/2.5 cm eu                 1 [ml/2.5cm]
#>  3 fl oz/10 gal/1k sqft 1 [oz/1000ft2/10gal]
#>  4 kg/100 L                      1 [kg/100L]
#>  5 mopm                            1 [month]
#>  6 ng/kg                           1 [ng/kg]
#>  7 ug                                 1 [ug]
#>  8 AI ng/g                          1 [ng/g]
#>  9 PH                               1 [unit]
#> 10 pm                               1 [unit]
#> # ℹ 40 more rows

Consequences of unit sanitation

As the database contains over 6,000 unique unit codes, it is likely that not all units are processable. Also, because the codes are not always consistent, some of them may not be interpreted correctly. Most frequently occurring units should parse correctly. If you think a specific code is not parsed correctly, and it is not highly outlandish, you could file an issue report. Furthermore, you should always inspect automatically parsed units for correctness.

Another point of attention is the removal of annotations from the unit. Consider the concentration unit with the following annotations:

as_unit_ecotox(c("mg/L CO3", "mg/L CaCO3", "mg/L HCO3"))
#> Mixed units: mg/L (3) 
#> 1 [mg/L], 1 [mg/L], 1 [mg/L]

Note that they are all interpreted as [mg/L]. Although technically the same unit, they are definitely not directly compatible. The units package does not support annotations, so you need to keep track of them yourself.

Sanitising numerics

First let me explain what is meant by ‘numerics’ in the ECOTOX database. These are all records that have a accompanying measurement unit in the database. This includes, concentrations, durations and many others. These records are stored as text fields in the database. So in order to interpret them as actual numerics in R, they need to be coerced to numerics. You could use a simple call to as.numeric() to do this, but that will not always work.

The text fields may contain operators such as ‘<’, ‘>’, ‘~’, etc. I think this is a mistake and not by design, because many of the numeric fields have a corresponding operator field where this operator could be stored. Text fields can also contain labelling text (such as asterisk symbol) or inconsistent decimal or thousand separators.

This is why there is as_numeric_ecotox() which first cleans the text records before coercing them to numerics:

## Text fields as possibly encountered in the database
text_records <-
  c("10", " 2", "3 ", "~5", "9.2*", "2,33",
    "2,333", "2.1(1.0 - 3.2)", "1-5", "1e-3")
  
as_numeric_ecotox(text_records)
#> Warning in as_numeric_ecotox(text_records): NAs introduced by coercion
#>  [1]   10.000    2.000    3.000    5.000    9.200    2.330 2333.000    2.100
#>  [9]       NA    0.001
#> attr(,"has_notation")
#>  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#> attr(,"has_brackets")
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

You can use process_ecotox_numerics() to process a data.frame/tibble resulting from a search query. It automatically applies as_numeric_ecotox() to columns containing numeric information:

text_tbl <- tibble(conc1_mean = text_records)

process_ecotox_numerics(text_tbl, warn = FALSE)
#> # A tibble: 10 × 1
#>    conc1_mean
#>         <dbl>
#>  1     10    
#>  2      2    
#>  3      3    
#>  4      5    
#>  5      9.2  
#>  6      2.33 
#>  7   2333    
#>  8      2.1  
#>  9     NA    
#> 10      0.001

Consequences of numeric sanitation

As indicated above all notations and operators included with numerics are stripped in the cleaning process. These notations and operators are potentially important for the interpretation of the values. It may be wise to keep track of them. One way to do this is by first trying to coerce texts to numerics with as.numeric() and then with as_numeric_ecotox(). The cases where the first returns NA but the latter returns a value, is likely to contain notations or operators (or is just inconsistently formatted). You could also use the .names argument in process_ecotox_numerics() to rename the numeric columns and keep the original text fields.

process_ecotox_numerics(text_tbl, warn = FALSE, .names = "{.col}_num") |>
  mutate(
    test = is.na(as.numeric(conc1_mean)) &
      !is.na(as_numeric_ecotox(conc1_mean, warn = FALSE))
    )
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `test = &...`.
#> Caused by warning:
#> ! NAs introduced by coercion
#> # A tibble: 10 × 3
#>    conc1_mean       conc1_mean_num test 
#>    <chr>                     <dbl> <lgl>
#>  1 "10"                     10     FALSE
#>  2 " 2"                      2     FALSE
#>  3 "3 "                      3     FALSE
#>  4 "~5"                      5     TRUE 
#>  5 "9.2*"                    9.2   TRUE 
#>  6 "2,33"                    2.33  TRUE 
#>  7 "2,333"                2333     TRUE 
#>  8 "2.1(1.0 - 3.2)"          2.1   TRUE 
#>  9 "1-5"                    NA     FALSE
#> 10 "1e-3"                    0.001 FALSE

Sanitising dates

The ECOTOX contains several date fields. They can represent meta-information about the record (date created and modified), administrative information (publication date), or excremental information (application date). These dates are stored as text in the database. As not all records are consistent or complete, some cleaning is required before coercing the text to a Date object (?Date).

The example below shows some typical date formats as encountered in the database and how to coerce them to Date objects using as_date_ecotox():

char_date <- c("5-19-1987   ", "5/dd/2021", "3/19/yyyy", "1985", "mm/19/1999",
               "October 2004", "nr/nr/2015")

as_date_ecotox(char_date)
#> [1] "1987-05-19" "2021-05-01" NA           "1985-01-01" "1999-01-19"
#> [6] "2004-10-01" "2015-01-01"

The only date that cannot be coerced is the one with an unspecified year. It is returned as NA.

You can use process_ecotox_dates() to process a data.frame/tibble as returned by a search query. Date columns are automatically coerced with as_date_ecotox(). Column names ending with _date are recognised as date records.

tibble(
  my_date = char_date
) |>
  process_ecotox_dates()
#> # A tibble: 7 × 1
#>   my_date   
#>   <date>    
#> 1 1987-05-19
#> 2 2021-05-01
#> 3 NA        
#> 4 1985-01-01
#> 5 1999-01-19
#> 6 2004-10-01
#> 7 2015-01-01