Data Quality Assessment¶

The Data Quality Assessment capability automatically detects and surfaces common data quality issues and, in many cases, handles them with little or no action required from the user. The assessment not only saves the time otherwise spent finding and addressing issues, but also provides transparency into the automated processing that has been applied. Each issue is assigned a warning level to help you determine its severity.

See the associated considerations for important additional information.

As part of EDA1, DataRobot runs the checks that don't require date/time or target information. Once EDA2 starts, DataRobot runs additional checks. In the end, the following baseline checks are run:

  • Outliers
  • Multicategorical format errors
  • Inliers
  • Excess zeros
  • Disguised missing values
  • Target leakage

Time series projects run all the baseline data quality checks as well as checks for:

  • Imputation leakage
  • Pre-derived lagged features
  • Irregular time steps (inconsistent gaps)
  • Leading or trailing zeros
  • Infrequent negative values
  • New series in validation

Visual AI projects run the same baseline checks and an additional missing images check.

Once EDA1 completes, the Data Quality Assessment appears just above the feature listing on the Data page.

In addition to the baseline data quality assessment, DataRobot provides additional detail for time series and Visual AI projects. Once model building completes, you can view the Data Quality Handling Report for additional imputation information.

For more information, refer to the following reference material:

  • Detailed descriptions of each check.
  • A summary of the logic behind each of the data quality checks.

The Data Quality Assessment provides information about data quality issues relevant to your stage of model building. Initially run as part of EDA1 (data ingest), the results are reported for the All Features list. The assessment runs again and updates after EDA2, displaying information for the selected feature list (or, by default, All Features). For checks that are not applicable to individual features (for example, Inconsistent Gaps), the report provides a general summary. Click View Info to view the report (and Close Info to dismiss it):

Each data quality check provides an issue status flag, a short description of the issue, and a recommendation message, if appropriate:

  • Warning: Attention or action required

  • Informational: No action required

  • No issue

Because the results are feature-list based, changing the selected feature list on the Data page can cause new checks to appear in, or current checks to disappear from, the assessment. For example, if feature list List 1 contains a feature problem, which contains outliers, the outliers check shows in the assessment. If you change lists to List 2, which does not include problem (or any other feature with outliers), the outliers check reports “no issue.”

From within the assessment modal, you can filter by issue type to see which features triggered the checks. Toggle on Show only affected features and select the checkboxes next to the check names to choose which checks to display:

DataRobot then displays, on the Data page, only the features in the selected feature list that violate the selected data quality checks. Hover over an icon for more detail:

For multilabel and Visual AI projects, Preview Log displays at the top if the assessment detects multicategorical format errors or missing images in the dataset. Click Preview Log to open a window with a detailed view of each error, so you can more easily find and fix them in the dataset.

Explore the assessment¶

Once EDA1 completes (and you have optionally filtered the display), view the list of features affected by the issues you want to investigate. To see the values that triggered a warning or informational notification, expand a feature and review the Histogram and Frequent Values visualizations.

Interpret the Histogram tab¶

The Histogram chart is the default display for numeric features. It “buckets” numeric feature values into equal-sized ranges along the X-axis to show the frequency distribution of the variable; the height of each bar (Y-axis) represents the number of rows with values in that range.

Histogram display variations

The display differs depending on whether the data quality issue “Outliers” was found.

Without data quality issues:

With data quality issues:

Initially, the display shows the bucketed data:

Select the Show outliers checkbox to calculate and display outliers:

The traditional box plot above the chart (shown in gold) highlights the middle quartiles of the data. To determine whisker length, DataRobot uses Ueda’s algorithm to identify the outlier points; the whiskers then depict the full range of the lowest and highest data points in the dataset, excluding those outliers. This helps you determine whether the distribution is skewed and whether the dataset contains a problematic number of outliers.

Note the change in the X-axis scale and the compression of the box plot to allow for outlier display. Because there tend to be fewer rows recording an outlier value (it’s what makes them outliers), the blue bar may not be visible. Hover over that column to display a tooltip with the actual row count.

After EDA2 completes, the histogram also displays an average target value overlay.

Interpret Frequent Values¶

The Frequent Values chart, in addition to showing common values, reports inliers, disguised missing values, and excess zeros.

The following sections provide:

  • Detailed descriptions of each check.
  • A summary of the logic behind each of the data quality checks.

Quality check descriptions¶

The sections below detail the checks DataRobot runs for the potential data quality issues. The table that follows summarizes this information.

Outliers¶

Outliers, the observation points at the far extremes of the sample distribution, may be the result of data variability. DataRobot automatically creates blueprints that handle outliers; each blueprint applies a handling method appropriate to the modeling algorithm it uses. For linear models, DataRobot adds a binary column inside the blueprint to flag rows with outliers. Tree-based models handle outliers automatically.

How they are detected: DataRobot uses its own implementation of Ueda’s algorithm for automatic detection of discordant outliers.

How they are handled: The data quality tool only checks for outliers; to view them, use the feature’s histogram.
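
DataRobot’s Ueda implementation isn’t reproduced in this documentation. As a rough stand-in that shows the general idea of discordancy flagging with a binary indicator column (the linear-model handling described above), here is a minimal IQR-based sketch; the function name and fence multiplier are illustrative, not DataRobot’s:

```python
import pandas as pd

def flag_outliers_iqr(values: pd.Series) -> pd.Series:
    """Binary indicator for rows outside the IQR fences.

    A simple stand-in only: DataRobot actually uses Ueda's algorithm
    for discordancy detection, which is not reproduced here.
    """
    q1, q3 = values.quantile([0.25, 0.75])
    fence = 1.5 * (q3 - q1)
    return ((values < q1 - fence) | (values > q3 + fence)).astype(int)

sales = pd.Series([12, 15, 14, 13, 900, 11, 16])
print(flag_outliers_iqr(sales).tolist())  # [0, 0, 0, 0, 1, 0, 0]
```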

Multicategorical format errors¶

Multilabel modeling is a classification task that allows each row to contain one, several, or zero labels. To create a training dataset that can be used for multilabel modeling, you must follow the requirements for multicategorical features.

How they are detected: From a sample of 100 random rows, DataRobot checks every feature that might qualify as multicategorical, looking for at least one value in the proper multicategorical format. If one is found, each row is checked to determine whether it complies with the multicategorical format. If at least one row does not comply, a “multicategorical format error” is reported for the feature. The logic for the check (sketched after the list below) is:

  • Value must be a valid JSON.
  • Value must represent a list of non-empty strings.
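
A minimal sketch of that per-cell validation, assuming values arrive as raw strings (the function name is illustrative):

```python
import json

def is_valid_multicategorical(cell) -> bool:
    """Apply the two documented rules to a single cell: the value must
    be valid JSON and must represent a list of non-empty strings."""
    try:
        parsed = json.loads(cell)
    except (json.JSONDecodeError, TypeError):
        return False
    return (isinstance(parsed, list)
            and all(isinstance(item, str) and item for item in parsed))

print(is_valid_multicategorical('["red", "blue"]'))  # True
print(is_valid_multicategorical('["red", ""]'))      # False: empty string
print(is_valid_multicategorical('red, blue'))        # False: not valid JSON
```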

How they are handled: A selection of errors is reported to the data quality tool. If a feature has a multicategorical format error, it is not detected as multicategorical. View the assessment log for details of the errors:

Inliers¶

Inliers are values that are neither above nor below the range of common values for a feature but that are anomalously frequent compared to nearby values (for example, 55555 as a zip code value, entered by people who don’t want to disclose their real zip code). If not handled, they can negatively affect model performance.

How they are detected: For each value recorded for a feature, DataRobot computes the value’s frequency for that feature and makes an array of the results. Inlier candidates are the outliers in that array. To reduce false positives, DataRobot then applies another condition, keeping as inliers only those values for which:

frequency > 50 * (number of non-missing rows in the feature) / (number of unique non-missing values in the feature)

The algorithm allows inlier detection in numeric features with many unique values where, due to the number of values, inliers wouldn’t be noticeable in a histogram plot. Note that this is a conservative approach for features with a smaller number of unique values. Additionally, it does not detect inliers in features with fewer than 50 unique values.
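
The condition can be sketched as follows, using pandas value counts as the frequency array; the preliminary requirement that the frequency also be an outlier in that array is abstracted away here:

```python
import pandas as pd

def inlier_candidates(feature: pd.Series) -> list:
    """Values passing the documented frequency threshold.

    Simplified: a full implementation would first require the value's
    frequency to be an outlier in the frequency array.
    """
    values = feature.dropna()
    n_unique = values.nunique()
    if n_unique < 50:  # the check skips low-cardinality features
        return []
    counts = values.value_counts()
    threshold = 50 * len(values) / n_unique
    return counts[counts > threshold].index.tolist()

zips = pd.Series(list(range(1000)) + [55555] * 200)
print(inlier_candidates(zips))  # [55555]
```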

How they are handled: A binary column is automatically added inside of a blueprint to flag rows with inliers. This allows the model to incorporate possible patterns behind abnormal values. No additional user action is required.

Excess zeros¶

Repeated zeros in a column could be regular values, but they could also represent missing values. For example, sales could be zero for a given item either because there was no demand for the item or because it was out of stock. Using 0s to impute missing values is often suboptimal, potentially leading to decreased model accuracy.

How they are detected: Using the frequency array described for inliers, DataRobot flags the feature if the frequency of the value 0 is an outlier.

How they are handled: A binary column is automatically added inside of a blueprint to flag rows with excess zeros. This allows the model to incorporate possible patterns behind abnormal values. No additional user action is required.

Disguised missing values¶

“Disguised missing value” is the term for a value (for example, -999) inserted to encode what would otherwise be a missing value. Because machine learning algorithms do not treat these values automatically, they can negatively affect model performance if not handled.

How they are detected: DataRobot finds values that repeat with greater frequency than other values and are also detected as outliers. To be considered a disguised missing value, a repeated outlier must match one of the following heuristics (sketched after the list below):

  • All digits in the value are the same and repeat at least twice (e.g., 99, 88, 9999).
  • The value begins with 1 and is then followed by two or more zeros.
  • The value is equal to -1, 98, or 97.
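
A sketch of those three patterns (the helper is hypothetical and assumes the sign is ignored for the all-same-digits test):

```python
def matches_dmv_pattern(value: int) -> bool:
    """Pattern tests from the list above; assumes the value has already
    been found to be both an outlier and anomalously frequent."""
    if value in (-1, 97, 98):
        return True
    digits = str(abs(value))
    if len(digits) >= 2 and len(set(digits)) == 1:        # 99, 888, 9999, ...
        return True
    if digits[0] == "1" and len(digits) >= 3 and set(digits[1:]) == {"0"}:
        return True                                        # 100, 1000, ...
    return False

print([v for v in (-999, 99, 100, 42, -1) if matches_dmv_pattern(v)])
# [-999, 99, 100, -1]
```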

How they are handled: Disguised missing values are handled in the same way as standard missing values: the median value is imputed, and a binary column flags the rows where imputation occurred.
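
In pandas terms, that handling looks roughly like the following sketch (the column names and -999 sentinel are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, 48000, -999, 61000, -999]})

# Treat the detected disguised missing value (-999) as truly missing.
mask = df["income"] == -999
df["income_imputed_flag"] = mask.astype(int)  # flag rows where imputation occurs
df.loc[mask, "income"] = np.nan
df["income"] = df["income"].fillna(df["income"].median())  # median of the rest
print(df)
```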

Target leakage¶

The goal of predictive modeling is to develop a model that makes accurate predictions on new data, unseen during training. Because you cannot evaluate the model on data you don’t have, DataRobot estimates model performance on unseen data by saving off a portion of the historical dataset to use for evaluation.

A problem can occur, however, if the dataset uses information that is not known until the event occurs, causing target leakage. Target leakage refers to a feature whose value cannot be known at the time of prediction (for example, using the value for “churn reason” from the training dataset to predict whether a customer will churn). Including the feature in the model’s feature list would incorrectly influence the prediction and can lead to overly optimistic models.

How they are detected: DataRobot checks for target leakage during EDA2 by calculating ACE importance scores (Gini Norm metric) for each feature with regard to the target. Features that exceed the moderate-risk (0.85) threshold are flagged; features exceeding the high risk (0.975) threshold are removed.
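
A sketch of that thresholding (the ACE scores here are invented for illustration):

```python
MODERATE_RISK = 0.85   # flagged with a yellow warning
HIGH_RISK = 0.975      # removed via the "Leakage Removed" feature list

ace_scores = {"churn_reason": 0.99, "tenure_months": 0.41, "late_payments": 0.90}

removed = [f for f, s in ace_scores.items() if s > HIGH_RISK]
flagged = [f for f, s in ace_scores.items() if MODERATE_RISK < s <= HIGH_RISK]
print(removed)  # ['churn_reason']
print(flagged)  # ['late_payments']
```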

How they are handled: If the advanced option for leakage removal is enabled (which it is by default), DataRobot automatically creates a feature list (Informative Features – Leakage Removed) that removes the high-risk columns. Moderate-risk features are marked with a yellow warning to alert you that you may want to investigate further.

After DataRobot detects leakage and creates Informative Features – Leakage Removed, it behaves according to the Advanced Option “Run Autopilot on feature list with target leakage removed” setting. If enabled (the default):

  • Quick, full, or Comprehensive Autopilot: DataRobot runs Autopilot on the newly created feature list unless you specified a user-created list. To run on one of the other default lists, rebuild models after the initial build with any list you select.
  • Manual mode: DataRobot makes the list available so that you can apply it, at your discretion, from the Repository.
  • The target leakage list will be available when adding models after the initial build.

If disabled, DataRobot applies the above to Informative Features (with potential target leakage remaining) or to any user-created list you specified.

Imputation leakage¶

The time series data prep tool can impute target and feature values for dates that are missing in the original dataset. The data quality check ensures that the imputed features are not leaking the imputed target. This is only a potential problem for features that are known in advance (KA), since the feature value is concurrent with the target value DataRobot is predicting.

How they are detected: DataRobot derives a binary classification target is_imputed = (aggregated_row_count == 0). Prior to deriving time series features, it applies the target leakage check to each KA feature, using is_imputed as the target.
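
Expressed as a sketch, assuming the aggregated row counts produced by time series data prep are available as a column:

```python
import pandas as pd

prepped = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "aggregated_row_count": [3, 0, 2, 0, 1],   # 0 means the row was imputed
})
# Binary target for the leakage check: was this row imputed?
prepped["is_imputed"] = (prepped["aggregated_row_count"] == 0).astype(int)
# The standard target leakage check then runs on each known-in-advance
# feature with is_imputed in place of the real target.
print(prepped)
```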

How they are handled: Any features identified as high or moderate risk for imputation leakage are removed from the set of KA features. Subsequently, time series feature derivation proceeds as normal.

Pre-derived lagged feature¶

When a time series project starts, DataRobot automatically creates multiple date/time-related features, like lags and rolling statistics. There are times, however, when you do not want to automate time-based feature engineering (for example, if you have extracted your own time-oriented features and do not want further derivation performed on them). In this case, you should flag those features as Excluded from derivation or Known in advance. The “Lagged feature” check helps detect features that should have been flagged but were not, which would lead to duplicated columns.

How they are detected: DataRobot compares each non-target feature with target(t-1), target(t-2) … target(t-8).
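
A sketch of that comparison for a single candidate feature (exact equality is used here; DataRobot’s actual comparison may be more tolerant):

```python
import numpy as np
import pandas as pd

def is_pre_derived_lag(feature: pd.Series, target: pd.Series,
                       max_lag: int = 8) -> bool:
    """True if the feature equals target(t-1) ... target(t-max_lag)."""
    for lag in range(1, max_lag + 1):
        if lag >= len(target):
            break  # nothing left to compare
        shifted = target.shift(lag)
        if np.array_equal(feature.iloc[lag:].to_numpy(dtype=float),
                          shifted.iloc[lag:].to_numpy(dtype=float)):
            return True
    return False

target = pd.Series([10, 12, 11, 15, 14])
lagged = pd.Series([0, 10, 12, 11, 15])    # target shifted by one step
print(is_pre_derived_lag(lagged, target))  # True
```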

How they are handled: All features detected as lags are automatically set as excluded from derivation to prevent “double derivation.” As a best practice, review the other uploaded features and set all pre-derived features to “Excluded from derivation” or “Known in advance,” as applicable.

Irregular time steps¶

The “inconsistent gaps” check is flagged when a time series dataset has irregular time steps. These gaps cause inaccurate rolling statistics. Some examples:

  • Transactional data is not aggregated for a time series project and raw transactional data is used.

  • Transactional data is aggregated into a daily sales dataset, and dates with zero sales are not added to the dataset.

How they are detected: DataRobot detects when there are expected timestamps missing.

It is important to understand that gaps could be consistent (for example, no sales for each weekend). DataRobot accounts for that and only detects inconsistent or unexpected gaps.

How they are handled: Because gaps degrade rolling statistics, if more than 20% of expected time steps are missing, the project runs in row-based mode (that is, as a regular project with out-of-time validation (OTV)). If that is not the intended behavior, correct the dataset and recreate the project.
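
A sketch of the missing-step estimate, assuming a daily time step is expected (DataRobot infers the time step itself):

```python
import pandas as pd

observed = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05",
                           "2024-01-09", "2024-01-10"])
expected = pd.date_range(observed.min(), observed.max(), freq="D")

missing_frac = 1 - len(observed) / len(expected)
print(f"{missing_frac:.0%} of expected time steps are missing")  # 50%
# Above 20%, the project falls back to row-based (OTV) mode.
```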

Leading or trailing zeros¶

Just as with excess zeros, this check detects zeros used to fill in missing values. It covers the special case where 0s fill in missing values at the beginning or end of a series that started later or finished earlier than other series.

How they are detected: DataRobot estimates a total rate for zeros in each series and performs a statistical test to identify the number of consecutive zeros that cannot be considered a natural sequence of zeros.
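
One way to sketch the test for a leading run of zeros, assuming independent observations so that a run of k zeros under the series-wide zero rate p has probability p**k (the 5% cutoff comes from the summary table below):

```python
def suspicious_leading_zeros(series: list, alpha: float = 0.05) -> bool:
    """Flag a leading run of zeros that is too long to be a natural
    sequence, given the overall zero rate in the series."""
    p = sum(1 for v in series if v == 0) / len(series)   # zero rate
    run = 0
    for v in series:
        if v != 0:
            break
        run += 1
    return run > 0 and p ** run < alpha

print(suspicious_leading_zeros([0, 0, 0, 0, 0, 5, 7, 6, 8, 9]))  # True
```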

How they are handled: If that is not the intended behavior, make corrections in the dataset and recreate the project.

Infrequent negative values¶

Data with excess zeros in the target can be modeled with a special two-stage model for zero-inflated cases. This model is only available when the minimum value of the target is zero (that is, a single negative value invalidates its use). In sales data, for example, negative values can appear when returns are recorded along with sales. This data quality check identifies negative values when two-stage models would otherwise be appropriate and warns you to correct the target if you want to enable zero-inflated modeling and the additional blueprints it unlocks.

How they are detected: DataRobot checks whether fewer than 2% of the target values are negative; if so, it treats the project as an otherwise zero-inflated case.
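
As a sketch of that condition (the function name is illustrative):

```python
def infrequent_negatives(target: list) -> bool:
    """Warning condition: some, but fewer than 2%, of target values are
    negative, blocking otherwise-applicable zero-inflated blueprints."""
    neg_frac = sum(1 for v in target if v < 0) / len(target)
    return 0 < neg_frac < 0.02

sales = [0] * 60 + [5, 12, 7, 30] * 10 + [-3]   # one return among 101 rows
print(infrequent_negatives(sales))  # True (about 1% negative)
```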

How they are handled: DataRobot surfaces a warning message.

New series in validation¶

Depending on the project settings (training and validation partition sizes), a multiseries project might be configured so that a new series is introduced at the end of the dataset and therefore isn’t part of the training data. For example, this could happen when a new store opens. This check returns an information message indicating that the new series is not within the training data.

How they are detected: DataRobot checks whether more than 20% of series are new (meaning that they are not in the training data).
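
A minimal sketch of that membership check (the series IDs are invented):

```python
training_series = {"store_1", "store_2", "store_3", "store_4"}
all_series = training_series | {"store_5", "store_6"}   # stores opened late

new_frac = len(all_series - training_series) / len(all_series)
print(f"{new_frac:.0%} of series are unseen in training")  # 33%
# Above 20%, DataRobot surfaces the informational message.
```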

How they are handled: DataRobot surfaces an informational message.

Missing images¶

When an image dataset is used to build a Visual AI project, the CSV contains paths to images contained in the provided ZIP archive. These paths can be missing, refer to an image that does not exist, or refer to an invalid image. A missing path is not necessarily an issue as a row could contain a variable number of images or simply not have an image for that row and column. Click Preview Log for a more detailed view:

In this example, row 1 reports a referenced file name that did not exist in the uploaded archive (1). Row 2 reports a row that was missing an image path (2). The log provides both the nature of the issue and the row in which the problem occurred. The log previews up to 100 rows; choose Download to export the log and view additional rows.

How they are detected: DataRobot checks each image path provided to ensure it refers to an image that exists and is valid.

How they are handled: For paths that fail to resolve, DataRobot attempts to find the intended image and replace the problematic path. If auto-correction is not possible, the problematic path is removed. If the image is invalid, the path is removed.

All missing images, paths that fail to resolve (even when automatically fixed), and invalid images are logged and available for viewing.

Data quality check logic summary¶

The following table summarizes the logic behind each data quality check:

Outliers / EDA2
  Detection logic: Ueda’s algorithm.
  Handling: Linear models: flag added to the feature in the blueprint; tree models: handled automatically.
  Reported in: Data > Histogram.

Multicategorical format error / EDA1
  Detection logic: Meets any of the following three conditions: value is not valid JSON; value does not represent a list; a list entry contains an empty string.
  Handling: Feature is not identified as multicategorical.
  Reported in: Data Quality Assessment log.

Inliers / EDA1
  Detection logic: Value is not an outlier; frequency is an outlier.
  Handling: Flag added to the feature in the blueprint.
  Reported in: Data > Frequent Values.

Excess zeros / EDA1
  Detection logic: Frequency is an outlier; value is 0.
  Handling: Flag added to the feature in the blueprint.
  Reported in: Data > Frequent Values.

Disguised missing values / EDA1
  Detection logic: Meets the following three conditions: value is an outlier; frequency is an outlier; value matches one of these patterns: is two or more digits, all the same; begins with “1” followed by multiple zeros; equals -1, 98, or 97.
  Handling: Median imputed; flag added to the feature in the blueprint.
  Reported in: Data > Frequent Values.

Target leakage / EDA2
  Detection logic: Importance score for each feature, calculated using the Gini Norm metric; threshold levels for reporting are moderate risk (0.85) and high risk (0.975).
  Handling: High-risk leaky features excluded from Autopilot (using the “Leakage Removed” feature list).
  Reported in: Data page; optionally, filter by issue type.

Missing images / EDA1
  Detection logic: Empty cell, missing file, or broken link.
  Handling: Links are fixed automatically.
  Reported in: Data Quality Assessment log.

Imputation leakage / EDA2 (pre-feature derivation)
  Detection logic: Target leakage check with is_imputed as the target, applied to KA features; only checked for projects with time series data prep applied to the dataset.
  Handling: Feature removed from the set of KA features.
  Reported in: Data page; optionally, filter by issue type.

Pre-derived lagged features / EDA2
  Detection logic: Features equal to target(t-1), target(t-2) … target(t-8).
  Handling: Excluded from derivation.
  Reported in: Data page; optionally, filter by issue type.

Inconsistent gaps / EDA2
  Detection logic: Irregular (missing expected) time steps.
  Handling: Project runs in row-based mode.
  Reported in: Message in the time-aware modeling configuration.

Leading/trailing zeros / EDA2
  Detection logic: For series starting/ending with 0, compute the probability of the run of consecutive 0s; flag series with <5% probability.
  Handling: User correction.
  Reported in: Data page; optionally, filter by issue type.

Infrequent negative values / EDA1
  Detection logic: Fewer than 2% of values are negative.
  Handling: User correction.
  Reported in: Warning message.

New series in validation / EDA1
  Detection logic: More than 20% of series not seen in training data.
  Handling: User correction.
  Reported in: Informational message.

Feature considerations¶

Consider the following when working with the Data Quality Assessment capability:

  • For disguised missing value, inlier, and excess zeros issues, automated handling is only enabled for linear and Keras blueprints, where it has proven to reduce model error. Detection is applied to all blueprints.
  • You cannot disable automated imputation handling.
  • A public API is not yet available.
  • Automated feature engineering runs on raw data (instead of removing all excess zeros and disguised missing values before calculating rolling averages).