
Even assuming that you use field constraints to prevent completely invalid data from being collected in the first place, you will still want to monitor the overall quality of your incoming data and respond to any potential issues that arise. The more quickly you identify and respond to issues, the more easily you can solve them – and the higher your final data quality will be. For this reason, we've made it easy to monitor incoming data using both SurveyCTO's built-in Data Explorer and automated quality checks. This help topic focuses on automated quality checks; see Using the Data Explorer to monitor incoming data for more on reviewing incoming data in the Data Explorer.

In just a few minutes, you can configure quality checks for any of your forms. Just go to the Automated quality checks section of your server console's Monitor tab, click the Checks button for the appropriate form, and click Create quality check to get started.

Automated quality checks can be configured to warn you about different kinds of issues:

  1. Individual field values that are too low or too high. For example, perhaps your form technically allows respondents to be up to 120 years old – but you really don't expect respondents to be over 100. You could create a quality check that warns about any cases where an age is above 100, so that your team can follow up and confirm that it wasn't a mistake.
  2. Individual field values that are outliers. Rather than setting a specific threshold for what is too high or too low, you can ask SurveyCTO to use statistics to determine when field values are unusually high or unusually low. You define a simple statistical threshold for what counts as "unusual," and then you get warned about submissions with values that are unusual in that way (i.e., that are outliers).
  3. Individual field values that are too frequent or too infrequent. Looking at the full dataset as it comes in, you might instead want to monitor the frequency of certain response values. For example, you might not want a gender field to contain "female" less than 30% of the time, or a "don't know" response to appear more than 10% of the time.
  4. Field means that are too low or too high. Instead of looking at individual submissions, you might want to consider the overall mean or average of a field and warn if it is above or below a certain threshold. For example, if the average respondent income reported is above or below what you expect, there might be some problem with how it's being measured.
  5. Mean values that differ from one sub-group to another. Instead of looking at the overall mean for a field, you might want to consider how that mean differs across sub-groups. For example, you might want to look for interviewer effects by checking to make sure that average income doesn't differ significantly depending on the interviewer.
  6. Response distributions that differ from one sub-group to another. If you're working with discrete or categorical data rather than continuous numeric data, then you can consider the full distribution of responses in a field rather than just a single mean. As with means, you can check whether the distribution of responses differs across sub-groups. For example, you can see whether there are enumerator effects in the reported occupation of respondents.

SurveyCTO will report warnings to you whenever submission values, frequencies, means, or distributions in your data cause configured quality checks to fail. Read on to learn specific details about each type of quality check available, and about the quality-check reports that are generated for your review.

Types of quality check

Value is too low

What it checks: the numeric field or fields that you select, for every submission in your form data.

What it checks for: any response values that are below a numeric threshold that you specify.

Warnings it issues: value-too-low warnings for individual fields in individual submissions.

Example: check to see if the "income" field has a response value less than 1000.

Options: can specify a list of special response values (like -777 or -888) to exclude them from triggering warnings. Can specify whether the warnings should be classified as "critical" in quality-check reports.

Value is too high

What it checks: the numeric field or fields that you select, for every submission in your form data.

What it checks for: any response values that are above a numeric threshold that you specify.

Warnings it issues: value-too-high warnings for individual fields in individual submissions.

Example: check to see if the "age" field has a response value greater than 100.

Options: can specify a list of special response values (like 999 or 9999) to exclude them from triggering warnings. Can specify whether the warnings should be classified as "critical" in quality-check reports.
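To make the mechanics concrete, here is a minimal sketch of how both threshold checks behave, including the option to exclude special response values. This is purely illustrative Python, not SurveyCTO's actual implementation, and the field data is hypothetical:

```python
# Illustrative sketch of the "too low" / "too high" checks: flag any
# response outside the configured bounds, skipping special response
# codes (like "don't know") so they never trigger warnings.
def threshold_warnings(values, low=None, high=None, special=(-777, -888)):
    """Return (value, warning) tuples for responses outside the thresholds."""
    warnings = []
    for v in values:
        if v in special:
            continue  # excluded special codes never trigger warnings
        if low is not None and v < low:
            warnings.append((v, "too low"))
        elif high is not None and v > high:
            warnings.append((v, "too high"))
    return warnings

# Hypothetical "age" data; -777 is an excluded "don't know" code.
print(threshold_warnings([34, 101, -777, 56], high=100))  # [(101, 'too high')]
```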

Value is an outlier

What it checks: the numeric field or fields that you select, for every submission in your form data.

What it checks for: any response values that are outliers because they fall more than x times the interquartile range (IQR) below the first quartile or above the third quartile, for the value of x that you specify (1.5 is a common choice).

Warnings it issues: value-is-outlier warnings for individual fields in individual submissions.

Example: check to see if the "income" field has a response value that is more than 1.5 times outside the interquartile range.

Options: can specify the multiple of the interquartile range to use for identifying outliers (1.5 is the default), plus a list of special response values (like -777 or -888) to exclude them from triggering warnings. Can specify whether the warnings should be classified as "critical" in quality-check reports.
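The IQR rule above can be sketched as follows. This is an illustration only, not SurveyCTO's code: quartile conventions vary between tools, and this sketch uses the convention built into Python's statistics module:

```python
# Sketch of the outlier rule: a value is flagged when it falls more
# than x * IQR below the first quartile or above the third quartile.
from statistics import quantiles

def iqr_outliers(values, x=1.5):
    """Return the values flagged as outliers under the x * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # first and third quartiles
    iqr = q3 - q1
    low, high = q1 - x * iqr, q3 + x * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical "income" data: the extreme value 5000 is flagged.
print(iqr_outliers([900, 1000, 1100, 1050, 980, 1020, 5000]))  # [5000]
```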

Value is too frequent

What it checks: the overall frequency of specific field values, for the field or fields that you select, across all submissions in your form data.

What it checks for: any response values that are more frequent than a percentage frequency that you specify.

Warnings it issues: value-too-frequent warnings for individual response values within individual fields.

Example: check to see if the "gender" field has a response value of "male" more than 70% of the time.

Options: can specify whether the warnings should be classified as "critical" in quality-check reports.

Value is too infrequent

What it checks: the overall frequency of specific field values, for the field or fields that you select, across all submissions in your form data.

What it checks for: any response values that are less frequent than a percentage frequency that you specify.

Warnings it issues: value-too-infrequent warnings for individual response values within individual fields.

Example: check to see if the "gender" field has a response value of "female" less than 30% of the time.

Options: can specify whether the warnings should be classified as "critical" in quality-check reports.
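Both frequency checks boil down to comparing each response value's share of all submissions against a percentage threshold. The sketch below is a simplification (in SurveyCTO you configure thresholds for specific response values; this illustrative version applies the same thresholds to every value it sees):

```python
# Illustrative sketch of the frequency checks: compute each response
# value's percentage share of all submissions and compare it against
# the configured "too frequent" / "too infrequent" thresholds.
from collections import Counter

def frequency_warnings(responses, too_frequent=70.0, too_infrequent=30.0):
    """Return (value, percent, warning) tuples for values outside the thresholds."""
    counts = Counter(responses)
    total = len(responses)
    warnings = []
    for value, count in counts.items():
        pct = 100.0 * count / total
        if pct > too_frequent:
            warnings.append((value, pct, "too frequent"))
        elif pct < too_infrequent:
            warnings.append((value, pct, "too infrequent"))
    return warnings

# Hypothetical "gender" data: "male" is 80% of responses (> 70%) and
# "female" is 20% (< 30%), so both values are flagged.
print(frequency_warnings(["male"] * 8 + ["female"] * 2))
```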

Mean is too low

What it checks: the overall mean of the numeric field or fields that you select, considering all submissions in your form data.

What it checks for: an overall mean that is below a numeric threshold that you specify.

Warnings it issues: mean-too-low warnings for individual fields.

Example: check to see if the "income" field has a mean below 20000.

Options: can specify a list of special response values (like -777 or -888) to exclude when calculating field means. Can specify whether the warnings should be classified as "critical" in quality-check reports.

Mean is too high

What it checks: the overall mean of the numeric field or fields that you select, considering all submissions in your form data.

What it checks for: an overall mean that is above a numeric threshold that you specify.

Warnings it issues: mean-too-high warnings for individual fields.

Example: check to see if the "income" field has a mean above 100000.

Options: can specify a list of special response values (like -777 or -888) to exclude when calculating field means. Can specify whether the warnings should be classified as "critical" in quality-check reports.
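The mean checks can be sketched similarly (again purely illustrative, with hypothetical data), with special response values dropped before averaging:

```python
# Illustrative sketch of the mean checks: drop special response codes,
# compute the field mean, and compare it against the thresholds.
from statistics import mean

def mean_warning(values, low=None, high=None, special=(-777, -888)):
    """Return a warning string if the field mean crosses a threshold, else None."""
    usable = [v for v in values if v not in special]  # exclude special codes
    m = mean(usable)
    if low is not None and m < low:
        return "mean too low ({:.1f})".format(m)
    if high is not None and m > high:
        return "mean too high ({:.1f})".format(m)
    return None

# Hypothetical "income" data; -777 ("don't know") is excluded before
# averaging, so the mean of the remaining values is 18000.0.
print(mean_warning([15000, 18000, -777, 21000], low=20000))  # mean too low (18000.0)
```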

Group mean is different

What it checks: sub-group means for the numeric field or fields that you select, considering all submissions in your form data.

What it checks for: any sub-groups with means that differ significantly from the means of other sub-groups, based on an ANOVA test and a significance threshold (a p-value) that you specify (usually 0.05).

Warnings it issues: group-mean-different warnings for individual sub-groups and individual fields.

Example: check to see if the mean of the "income" field differs by the sub-group defined by the "enumerator_id" field (i.e., look for enumerator differences in reported income, using the "enumerator_id" field to identify different enumerators).

Options: can specify the statistical significance threshold (the p value) to use for identifying differences, plus a list of special response values (like -777 or -888) to exclude when calculating sub-group means. Can specify whether the warnings should be classified as "critical" in quality-check reports.
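The logic of this check can be sketched with a standard one-way ANOVA. The sketch below uses scipy's implementation and hypothetical data; SurveyCTO's internal implementation may differ in details:

```python
# Sketch of the group-mean check: group submissions by the sub-group
# field, run a one-way ANOVA across the groups, and warn when the
# p-value falls below the configured significance threshold.
from scipy.stats import f_oneway

def group_mean_differs(groups, p_threshold=0.05):
    """groups: dict mapping sub-group id -> list of numeric responses."""
    _, p_value = f_oneway(*groups.values())
    return p_value < p_threshold  # True means "warn"

# Hypothetical income data keyed by enumerator_id: enumerator "E3"
# reports consistently higher incomes than the others.
incomes = {
    "E1": [1000, 1100, 950, 1050, 1020],
    "E2": [980, 1010, 1060, 990, 1040],
    "E3": [2000, 2100, 1950, 2050, 2020],
}
print(group_mean_differs(incomes))  # True
```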

Group distribution is different

What it checks: sub-group distributions for the discrete/categorical field or fields that you select, considering all submissions in your form data.

What it checks for: any sub-groups with distributions that differ significantly from the distributions of other sub-groups, based on a chi-squared test and a significance threshold (a p-value) that you specify (usually 0.05).

Warnings it issues: group-distribution-different warnings for individual sub-groups and individual fields.

Example: check to see if the distribution of the "occupation" field differs by the sub-group defined by the "enumerator_id" field (i.e., look for enumerator differences in reported occupation, using the "enumerator_id" field to identify different enumerators).

Options: can specify the statistical significance threshold (the p value) to use for identifying differences. Can specify whether the warnings should be classified as "critical" in quality-check reports.
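This check can be sketched with a chi-squared test of independence: build a contingency table of response counts per sub-group, then test whether the response distribution is independent of the sub-group. The sketch below uses scipy and hypothetical data, and may differ from SurveyCTO's internal implementation:

```python
# Sketch of the group-distribution check: count responses per category
# within each sub-group, then run a chi-squared test on the resulting
# contingency table and warn when the p-value is below the threshold.
from collections import Counter
from scipy.stats import chi2_contingency

def group_distribution_differs(groups, p_threshold=0.05):
    """groups: dict mapping sub-group id -> list of categorical responses."""
    categories = sorted({r for responses in groups.values() for r in responses})
    table = [[Counter(responses)[c] for c in categories]
             for responses in groups.values()]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < p_threshold  # True means "warn"

# Hypothetical occupation data keyed by enumerator_id: enumerator "E2"
# records far more "farmer" responses than "E1" does.
occupations = {
    "E1": ["farmer"] * 5 + ["trader"] * 15,
    "E2": ["farmer"] * 15 + ["trader"] * 5,
}
print(group_distribution_differs(occupations))  # True
```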

Quality check options

You can configure as many quality checks as you wish, and each of them can cover as many fields as you like.

For each form within the Automated quality checks section, you can also click Options to configure some overall settings that apply for all quality checks configured for that form. This includes an option to Run all checks nightly (uncheck to pause) and an option to send email summaries of all quality-check results to a list of email addresses. If the review and correction workflow is enabled for the form, it also includes options to choose which types of submissions (approved, rejected, etc.) to include when running quality checks.

Unless you have turned off the option to run nightly, all configured quality checks will run automatically once per night. You can also run them at any time by clicking the Run now button.

Quality check reports

If any of your quality checks are triggered (i.e., if some aspect of the form's data "fails"), then SurveyCTO will issue data-quality warnings in a report. Every time the checks are run, in fact, any new warnings are appended to the previous report – so warnings will accumulate over time, but duplicate copies of existing warnings will not be added.

You can download the quality-check report by clicking Report in the Automated quality checks section, or by clicking the link to the full report included in any email notifications you receive (if you have configured report summaries to be emailed to you).

The report itself is a .csv file that you can open in Microsoft Excel or Google Sheets. Its most important columns are:

  1. warning: the human-readable warning resulting from each failed quality check, including which check failed and why.
  2. last-reported: the date and time when the warning was most recently issued. Since it's common for the same data to generate the same warnings each time your quality checks are run, existing warning rows are simply updated with a new last-reported date and time rather than being duplicated with each run. You can identify "current warnings" because their last-reported date and time will coincide with the last time a full report was generated; "old warnings," on the other hand, will have older last-reported values, perhaps because the offending data was corrected, the quality-check configuration changed, or the underlying statistical distributions changed such that the warnings no longer trigger.

Additional columns in the report include:

  1. critical: 1 if you'd configured the relevant quality check to issue critical warnings; otherwise 0.
  2. dataset-id: the unique ID of the internal dataset that was checked, which will include your unique form ID.
  3. id: a more machine-readable unique ID for the human-readable warning in the warning column.
  4. warning-id, field, group-field, value: details relating to which quality check you configured and with what parameters. In addition to the human-readable warning text in the warning column, these can help you to distinguish warnings from different configured quality checks. You might, for example, find it helpful to sort by some of these columns when reviewing long lists of warnings.
  5. row-id, group-id: the unique row and/or group ID that caused the warning. When warnings are issued for particular rows or for particular groups, these columns indicate the unique IDs that identify those rows or groups.
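As an example of working with a downloaded report, the sketch below separates current warnings from old ones using the last-reported column. The file name is an assumption, and timestamps are assumed to sort lexicographically (as ISO-style dates do):

```python
# Sketch: split a downloaded quality-check report into "current"
# warnings (re-issued on the latest run) and "old" warnings.
import csv

def split_warnings(report_path):
    """Return (current, old) warning rows, treating the most recent
    last-reported timestamp as the marker for the latest run."""
    with open(report_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Assumes timestamps sort lexicographically (e.g. ISO format).
    latest = max(row["last-reported"] for row in rows)
    current = [row for row in rows if row["last-reported"] == latest]
    old = [row for row in rows if row["last-reported"] != latest]
    return current, old
```

You could then review only the current warnings, or sort the old ones to confirm that previously flagged issues were resolved.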

Note that if you use the "Monitor form data" action to enter the Data Explorer, results from your quality-check reports will be summarized along with your data. See Using the Data Explorer to monitor incoming data for more.

Quality check reports as server datasets

Quality check reports are actually server datasets (see Advanced publishing with datasets). This means that, in addition to manually downloading them as described above, you can export them using SurveyCTO Desktop, publish them to the cloud, merge them into Excel workbooks, or even attach them as pre-loaded data for one or more survey forms. We understand that back-office operations vary widely, so we try to keep things as flexible as possible.

Each quality check report will publish to a dataset with an ID like "formid.sampleid_qc" – but with your form's unique ID instead of the "sampleid". The first time a set of quality checks is run for a form, this report dataset will be automatically created. If you do publish a report dataset to the cloud, use the id column as the unique ID; that way, when some of the same warnings are issued each time the data checks are run, you will end up with one row per warning rather than multiple rows.

Limitations

When configuring quality checks for your forms, there are two key limitations to keep in mind: you can't configure checks for encrypted fields, and you can't configure checks for repeated fields.

If you have encrypted your form data with your own encryption keys, you can configure quality checks for only those form fields that were explicitly marked as publishable (i.e., fields for which you indicated "yes" in the publishable column of your survey worksheet). This is because SurveyCTO simply can't read encrypted data. But do take care not to mark sensitive, highly confidential fields as publishable, as publishable fields will not be as strongly protected as other fields in encrypted forms. (They will still be encrypted in transit, but they will be readable by SurveyCTO.)

If you would like to run quality checks on one or more fields within repeat groups, you can design your form to include non-repeated fields to selectively pull data out of those repeat groups. For example, you can use the indexed-repeat() function to pull out a single value from a repeated field, or a function like join() or sum() to pull out a summary or aggregate representation of a repeated field. Quality checks can then be configured on those non-repeated fields. See Using expressions in your forms for more details.
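For instance, calculate fields placed outside a repeat group could use expressions like these (the field and group names here are hypothetical):

```
indexed-repeat(${member_age}, ${household_members}, 1)   pulls the first member's age
sum(${member_age})                                       totals the ages across all repeats
join(" ", ${member_name})                                joins all names into one string
```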
