How scripters see survey design

Introduction

This article is aimed at survey researchers and offers advice on how to write survey specifications from a survey scripter’s point of view. It was presented at the 2021 Social Research Association annual conference.

GIDE script all kinds of surveys, from small market research questionnaires to long and complex social research studies, and have seen thousands of survey specifications of varying style and quality since we started delivering web surveys in 1998. We use our own in-house survey software, based around a powerful data description and scripting language, but this article is intended to apply to any survey specification and implementation.

My role within GIDE is to deal with the larger and more complex projects, which are typically social research, and these can often be a challenge, both for a survey researcher to specify and for the survey scripter to implement.

As a researcher, your focus will naturally be on the content of the questions, and less so on the details of the survey scripting. Survey scripters will generally not be familiar with the subject area of the questionnaire and need the specification to be as verbose and precise as possible. This article describes the common omissions from a survey specification, why they are important, and how you can easily include them in your specification.

This article assumes that the survey specification is authored as a Word document. However, this is not the only way that GIDE can program your survey. We can take specifications in DDI, Excel and Triple-S, and even scripts from other survey systems like Unicom Intelligence (IBM/SPSS Data Collection), and we have tools (or will write tools) to automate the conversion to our scripting language.

Conventions

The most important thing when writing a survey specification is to have conventions for layout and formatting, so that the various components are easily differentiated. If you work for a large research organisation, you may already have a house style, which is extremely helpful to anyone writing or using the specification.

In cases where such conventions are followed, GIDE has tools to enable it to automatically create an initial version of the survey script directly from the specification in Word – increasing the speed and accuracy of the programming.

However, it’s often the case that the conventions are not consistently followed, or that they do not cover every part of the survey specification. The examples presented in this article can be used to help you extend and improve your organisation’s conventions, and if you don’t have such a template, they will hopefully show why one is helpful.

There are also simple yet powerful features in most scripting languages that researchers may not be aware of. These features, such as the re-use of response lists, loops, and routing constructs such as IF … THEN … ELSE, will be used by your survey scripter, and you can simplify the specification (and make it closer to the survey script) if you include them in your specification.

Some examples

What follows are a few example question definitions, all following the same conventions. These are all examples of what I would call a good specification.

(The answer to the riddle is given at the end of this article.)

The first item is the filter for the question, which in this case is “Ask if age > 10”. You don’t have to have a filter for every question, although some people like to write “Ask all” above questions without a filter, which removes any ambiguity.

As this is an instruction for the scripter, this should be formatted differently to all the other text.

The next item relates to the type of the response. By this, I don’t mean how the question is displayed, but the underlying type of data the question is collecting. There are very few basic data types – 99% of questions will be either single response, multi-response, text or numeric, even if there are 1001 ways of presenting such questions.

Given the choice, the survey scripter will typically define a question in the way that is easiest to get the desired display. But this may not be how you want the data delivered. If, for example, you want a numeric value in the data but the question displayed as a drop-down list, then be explicit, so that the scripter doesn’t define it as a single-response list.

Next is the question identifier. Scripting languages will all have their own rules over the format of question identifiers, but they will typically need to start with a letter and only contain letters, numbers and the underscore symbol. So for example in GIDE’s scripting language, the number five by itself can’t be used as a question identifier, but Q5 could be. Similarly, full stops are not allowed, so Q5.5 would be programmed by us as Q5_5.
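As a rough illustration of the identifier rule just described, the following Python sketch checks and normalises identifiers. It is not GIDE’s scripting language; the function names are hypothetical, and the rule encoded here (a letter first, then letters, numbers or underscores) is only the typical one mentioned above, so your own survey system may differ.

```python
import re

# Assumed rule, taken from the text: start with a letter, then only
# letters, digits and underscores. Other systems may have different rules.
VALID_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_identifier(name):
    """Return True if the whole name satisfies the assumed rule."""
    return VALID_IDENTIFIER.fullmatch(name) is not None


def normalise_identifier(name):
    """Replace disallowed characters (such as full stops) with underscores.

    Names that still break the rule afterwards, for example a bare number,
    are rejected rather than silently renamed.
    """
    candidate = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if not is_valid_identifier(candidate):
        raise ValueError(f"{name!r} cannot be used as a question identifier")
    return candidate
```

With these assumptions, normalise_identifier("Q5.5") gives "Q5_5", while a bare "5" is rejected because it does not start with a letter.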

Next are the different types of text associated with the question. A typical question would have the following pieces of text (also called labels):

Main question text (shown on-screen and/or read by the interviewer)
Additional instructions for the respondent (also shown on-screen and/or read by the interviewer)
Interviewer instructions – not shown to the respondent, nor read out loud by the interviewer
Scripter instructions – instructions for the scripter to program, not displayed on screen

The difference between these types of text is not always obvious.

Finally, there is the list of responses, which will typically contain both a code and a label. It’s useful to have standard codes for common non-responses such as don’t know or refusal.

If you have the same list used in multiple questions, then tell the survey scripter that this is the case. You could even omit the list and just say “same list as question X”. This removes the possibility of accidental inconsistencies, and makes it easier for you to modify the list – you will only have one copy to update. Scripting languages all have the ability to define lists once and reuse them in different questions – help the survey scripter make the most of that feature.
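To make the components of a question definition concrete, here is a minimal Python sketch of how a scripter might hold them, with one response list defined once and shared by two questions. Everything in it is an assumption for illustration: the class and field names, the question identifiers Q10 and Q11, the placeholder question texts, and the non-response codes 98 and 99.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Standard codes for common non-responses (98 and 99 are assumed values).
NON_RESPONSE = {98: "Don't know", 99: "Refused"}

# A response list defined once and shared by several questions (hypothetical).
AGREEMENT = {1: "Agree", 2: "Neither agree nor disagree", 3: "Disagree", **NON_RESPONSE}


@dataclass
class Question:
    identifier: str                    # e.g. "Q5_5"
    data_type: str                     # "single", "multi", "text" or "numeric"
    text: str                          # main question text
    respondent_instruction: str = ""   # shown on-screen and/or read out
    interviewer_instruction: str = ""  # not shown to the respondent
    scripter_instruction: str = ""     # never displayed
    responses: Dict[int, str] = field(default_factory=dict)
    ask_if: Optional[Callable[[dict], bool]] = None  # None means "Ask all"


# Both questions point at the same list object, so there is one copy to update.
Q10 = Question("Q10", "single", "Hypothetical statement A", responses=AGREEMENT)
Q11 = Question(
    "Q11",
    "single",
    "Hypothetical statement B",
    responses=AGREEMENT,
    ask_if=lambda answers: answers.get("AGE", 0) > 10,  # "Ask if age > 10"
)
```

Because Q10 and Q11 refer to the same list, a change to AGREEMENT appears in both questions, which is the behaviour that “same list as question X” asks the scripter for.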

The above example is a numeric question, and shows some examples of more detailed instructions that you may wish to give to the survey scripter. Note that these are all in the same colour.

For numeric values, the main constraints which are useful to include are the minimum and maximum values. Typically the maximum value is just some arbitrarily high number such as 9999 and just helps the scripter when defining the question (and some data export formats will also require it).

But the minimum can almost always be defined with a real value, and the choice is often between 0 and 1.

You will also see that for this question we specify two different pairs of minimum and maximum values. The first, next to the word NUMERIC, is the overall type of the question – this is the type of the resulting data item when it is exported from the survey tool, and is the same for every record, i.e. the value for every respondent will be between 1900 and 2022.

We then also specify further rules to apply to individual respondents – that the value is not before the year of birth of the respondent, and is not a value in the future.

This is also an example of how you can differentiate between the type of the data that you’re collecting, and how it is displayed – in this case the question is a drop-down list with the most recent year first, and the data will contain a numeric value in the range 1900 to 2022.
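The two levels of constraint, and the separation of data type from display, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the current year is passed in as a parameter rather than read from the clock, and the error messages are invented.

```python
# Overall type of the data item, the same for every record (from the text).
TYPE_MIN, TYPE_MAX = 1900, 2022


def validate_year(value, birth_year, current_year):
    """Return a list of problems with one respondent's answer (empty if none)."""
    errors = []
    if not TYPE_MIN <= value <= TYPE_MAX:
        errors.append(f"must be between {TYPE_MIN} and {TYPE_MAX}")
    if value < birth_year:
        errors.append("cannot be before the respondent's year of birth")
    if value > current_year:
        errors.append("cannot be in the future")
    return errors


def dropdown_options(birth_year, current_year):
    """Years offered in the drop-down list, most recent year first.

    The display is a list, but the stored value is still a plain number.
    """
    last = min(current_year, TYPE_MAX)
    first = max(birth_year, TYPE_MIN)
    return list(range(last, first - 1, -1))
```

For a respondent born in 2019 and interviewed in 2022, dropdown_options(2019, 2022) offers 2022, 2021, 2020 and 2019 in that order, and the exported value is one of those numbers.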

As well as questions, you may want to include variables (we call them recodes) which are computed in the survey script whilst the interview is running. For example, if you ask the respondent for her age (in years) or date of birth, you may then want to create a recoded variable putting the respondent into age bands.

This can then either be used later in the survey script (for example in a question filter), or just when you are analysing the data.

When such variables are programmed, the scripter can include all the information normally associated with questions – a question label, constraints on the accepted values etc. So it’s useful to think of recodes in the same way as normal questions, and describe them using the same style.

We sometimes see recodes defined without any labels, which means the labels will be missing from the data you receive at the end of the fieldwork.

The distinction between single-response and multi-response recodes is often forgotten as well – for example if this question was summarising the ages of all the members of a household, then it should be a multi-response, not single.
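The difference between a single-response and a multi-response recode can be shown with a short Python sketch. The age bands, their codes and labels, and the function names are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical bands: (code, label, lowest age, highest age or None if open-ended).
AGE_BANDS = [
    (1, "Under 18", 0, 17),
    (2, "18 to 34", 18, 34),
    (3, "35 to 54", 35, 54),
    (4, "55 and over", 55, None),
]

# Labels kept alongside the codes so they are available when the data is exported.
AGE_BAND_LABELS = {code: label for code, label, _, _ in AGE_BANDS}


def age_band(age):
    """Single-response recode: exactly one band code for one respondent's age."""
    for code, _, low, high in AGE_BANDS:
        if age >= low and (high is None or age <= high):
            return code
    raise ValueError(f"age {age} is outside every band")


def household_age_bands(ages):
    """Multi-response recode: every band code present among household members."""
    return sorted({age_band(age) for age in ages})
```

A household with members aged 40, 42, 9 and 12 would produce the two codes 1 and 3 from household_age_bands, which a single-response variable could not hold.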

On some surveys, you may also be linking with external data sources and in effect these are just pre-answered questions, and should be defined by the survey scripter in the same way as normal questions and recodes. It’s therefore helpful to include these external variables at the start of the specification, using the same style as the rest of your question definitions.

The reason for this level of detail is two-fold: firstly, it helps the survey scripter program things correctly, and secondly, it ensures that these variables have the same kind of metadata available as the other questions when you come to analyse and report on the data.

Loops

Loops are something that researchers can struggle to describe in specifications. A loop is when the same question (or block of questions) is asked repeatedly – for example for each person in a household, or for each product or service that you have used.

If you are describing such loops, then there are three important things the survey scripter needs to know:

– Where the loop starts and ends

– The set of things which are looped through – in this case all the people in the household

– And finally, some way to refer to the current “item” being dealt with inside the loop.

A common convention in computer programming and mathematics is just to use the letter “i” for this purpose. Very rarely, you may need to use loops inside loops, in which case it’s useful to give them names (e.g. loop A and loop B) and you should use a different letter (e.g. j, k) to refer to the second set of things you are looping through.

A common problem is when a specification refers to a question that was asked inside a loop, but without being explicit about which of the many answers to that question it is referring to.

For example, the specification might say to do something if “EDUCATION=4” – this typically means “if EDUCATION=4 for at least one person in the household”, but you might have meant “if EDUCATION=4 for every person in the household”. So it helps to be explicit.
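The two readings of “EDUCATION=4” can be written out explicitly. The Python sketch below uses made-up household data and hypothetical function names; it simply shows that “at least one person” and “every person” are different conditions, and how the letter i can refer to one particular pass through the loop.

```python
# Hypothetical answers collected inside a loop over household members.
HOUSEHOLD = [
    {"person": 1, "EDUCATION": 4},
    {"person": 2, "EDUCATION": 2},
    {"person": 3, "EDUCATION": 4},
]


def education_is_4_for_anyone(household):
    """True if EDUCATION=4 for at least one person in the household."""
    return any(member["EDUCATION"] == 4 for member in household)


def education_is_4_for_everyone(household):
    """True only if EDUCATION=4 for every person in the household."""
    return all(member["EDUCATION"] == 4 for member in household)


def education_of(household, i):
    """EDUCATION for loop item i, counting people from 1 as a specification would."""
    return household[i - 1]["EDUCATION"]
```

For this household the first condition is true and the second is false, so a filter written simply as “EDUCATION=4” would route respondents differently depending on which reading the scripter chose.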

Grid questions

Grid questions, sometimes called matrix questions, are typically presented as a two-dimensional table of radio buttons or check boxes. In GIDE, we call them “array” questions, which is the way the questions are defined in our scripting language.

These questions are also sometimes referred to as loops, as they involve asking the same question about multiple items, but unlike the type of loops described in the previous section, they are typically displayed as a single grid on the same screen.

The natural way to describe these questions in a survey specification would be to show it as a table, the same way as it is presented on-screen. Typically, the “items” being asked about are shown as rows in the table, and the possible answers (either single or multiple response per item) will be shown as the columns.

However, if we take the following example, it is not clear if the rows or the columns are the “items” in the question, and which are the answer responses.

The above question can be considered in one of two ways:

For each of the following activities, which devices do you use?
For each of the following devices, which activities do you use it for?

In the first case, the activities (columns) are the question items, and the devices (rows) are the answer options. In the second case, the devices are the question items and the activities are the answer options.

There are various reasons why this distinction is important:

For small displays (e.g. a mobile phone), the survey software may have a “responsive” mode of asking the question. In this mode, the question might be presented as a series of individual questions, one per “item”. This feature relies on the question being defined the “right way round”.
You should allow for the case where the respondent either doesn’t own or use every kind of device, or doesn’t engage in every listed activity. How you do this will depend on what you have specified as the question items and the answer responses. For example, in the first case you could add (in the list of devices) the options “Other” and “I don’t do this activity”. However, these wouldn’t make sense in the second case, where you could add “None of these” and “I don’t use this kind of device”.
It may help guide the respondent when he/she is answering the question – encouraging the respondent to look at each row in turn, or each column in turn (depending on how you have designed the question).
It will also affect the data you receive. If this was a single-choice response, you would typically receive one data item for each item the question is being asked about.

An alternative way to specify the above question would be as follows:

This is what I would consider a more “logical” definition of the question – it describes the structure of the question, not its presentation – and there is only one way a survey scripter can program it (but various ways it can be presented).
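One way to picture such a logical definition is as a data structure that names the items and the answer responses separately. The Python sketch below is only an illustration, not GIDE’s array syntax: the class, the identifier Q20, the question wording and the lists of devices and activities are all hypothetical, and it takes the second reading (devices as items, activities as responses).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GridQuestion:
    identifier: str
    text: str
    items: Dict[int, str]      # the things being asked about
    responses: Dict[int, str]  # the answer options offered for each item
    multi_response: bool       # True: check boxes, False: radio buttons

    def per_item_questions(self) -> List[Tuple[str, str]]:
        """One (identifier, text) pair per item, as a responsive layout might ask them."""
        return [
            (f"{self.identifier}_{code}", f"{self.text} {label}")
            for code, label in self.items.items()
        ]


DEVICE_USE = GridQuestion(
    identifier="Q20",
    text="Which of these activities do you use this device for:",
    items={1: "Smartphone", 2: "Tablet", 3: "Laptop"},
    responses={
        1: "Email",
        2: "Shopping",
        3: "Watching video",
        4: "None of these",
        5: "I don't use this kind of device",
    },
    multi_response=True,
)
```

Because the items are named explicitly, the same definition can be shown as one grid on a large screen or as three separate questions on a phone, and the data contains one variable per device either way.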

IF … THEN … ELSE …

A very common programming construct, which all survey scripting languages will have, is what is known as IF … THEN … ELSE …

All survey specifications will make use of the first part “if something then do something”. However, it is rare to see a specification say “if something then do something, otherwise do something else”.

This typically happens with recodes, which are defined along the following lines:

RECODE M1 AS FOLLOWS

1 – Train. IF Q1=1
2 – Bus, Tram. IF Q1=2 OR Q2=1
3 – Other. IF Q1<>1 AND Q1<>2 AND Q2<>1

The literal translation into a typical scripting language would be as follows:

IF Q1=1 THEN M1=1
IF Q1=2 OR Q2=1 THEN M1=2
IF Q1<>1 AND Q1<>2 AND Q2<>1 THEN M1=3

A complicated recode could have lots of conditions for each value, with a “catch-all” value at the end, where the specification will try to identify all the cases which are not covered by the previous rules.

This is a common cause of mistakes, where that final rule does not cover all eventualities. If it’s the case that “for all other respondents, code as 3”, then you can just write that, and the scripter will use the ELSE construct:

IF Q1=1 THEN M1=1
ELSE IF Q1=2 OR Q2=1 THEN M1=2
ELSE M1=3

In your survey specification, you could write this as:

RECODE M1 AS FOLLOWS

1 – Train. IF Q1=1
2 – Bus, Tram. IF Q1=2 OR Q2=1
3 – Other. OTHERWISE
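For readers who want to see the same logic in a general-purpose language, here is a minimal Python sketch of the OTHERWISE form of the specification. The function name is hypothetical, and unanswered questions are represented by None, which is an assumption about how missing answers are stored.

```python
M1_LABELS = {1: "Train", 2: "Bus, Tram", 3: "Other"}


def recode_m1(q1, q2):
    """Recode M1 from Q1 and Q2; the final else is the catch-all 'OTHERWISE'."""
    if q1 == 1:
        return 1  # Train
    elif q1 == 2 or q2 == 1:
        return 2  # Bus, Tram
    else:
        return 3  # Other: everyone not covered above, including missing answers
```

Because the last branch has no condition of its own, every respondent receives exactly one code, and nobody can fall through the rules uncoded if the earlier conditions are edited later.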

TO GO TO OR NOT TO GO TO

The “GO TO” or “skip logic” has traditionally been used in paper questionnaires to tell the respondent which questions they need to answer and it is tempting to specify the routing logic in your questionnaire in the same way.

However, one of the first things I was taught when learning computer programming was that the “go to” statement is bad, because it leads to hard-to-follow and hard-to-maintain code. Many modern programming languages don’t have it, and those that do will mostly discourage its use. The same applies to survey scripting languages.

The following example shows how the same survey logic can be specified either with or without the use of “GO TO” instructions.

If you look at the example on the left and try to understand which respondents are asked Q4, it is not immediately clear, and this problem just gets worse the larger the survey is. In the example on the right, the filter for Q4 is specified directly.

It is also more difficult to modify a questionnaire defined with “GO TO” instructions – if you wanted to add a new question around question Q4, which is asked to all respondents, you would need to modify a lot of the GO TO instructions. If you follow the style on the right, then you can simply insert new questions at any point, with their own independent filters, and easily change the order of the questions.

That’s not to say that “go to” is never useful, but as a survey scripter I would generally try to avoid adding a go to into a survey script, and instead program it as a filter around the block of code being skipped, i.e. turn a negative condition (“skip this block if X is false”) into a positive condition (“ask this block if X is true”).
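The filter style can be sketched in Python as a list of questions, each carrying its own condition. The questions Q1 to Q4, their filters and the function name are hypothetical and are not taken from the example in this article; the point is only that each question states directly who is asked it.

```python
# Each question has its own independent filter; there are no "go to" jumps.
# The filters below are invented purely for illustration.
QUESTIONNAIRE = [
    ("Q1", lambda a: True),                                   # Ask all
    ("Q2", lambda a: a.get("Q1") == 1),                       # Ask if Q1=1
    ("Q3", lambda a: a.get("Q1") == 1 and a.get("Q2") == 2),  # Ask if Q1=1 AND Q2=2
    ("Q4", lambda a: a.get("Q1") == 2),                       # Ask if Q1=2
]


def questions_asked(scripted_answers):
    """Walk the questionnaire in order and return the questions actually asked.

    scripted_answers supplies the answer a test respondent would give to each
    question, so that the routing can be checked without a live interview.
    """
    answers = {}
    asked = []
    for identifier, ask_if in QUESTIONNAIRE:
        if ask_if(answers):
            answers[identifier] = scripted_answers[identifier]
            asked.append(identifier)
    return asked
```

Inserting a new question is then a matter of adding one entry with its own filter at the right position; none of the existing filters need to change.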

And finally…

If you only take three things away from reading this article:

Try and think about your questions from a logical point of view first, and a presentational view second. This helps ensure you’ve defined all the different components of your question and also how you want to receive the data at the end of the fieldwork;
Whilst survey scripters do not expect you to write computer code in your specification, the more formal and precise you can be, the less chance there is of the survey scripter misunderstanding and getting it wrong when they turn it into computer code;
Create conventions in terms of the style and layout of your specification and use them consistently.

And finally… (2)

The answer to the riddle given in the first example is 5 – None of the above. This is the only answer where, if this is true and all the other answers are false, there are no contradictions.