Loading Data with Repeating Groups

KoboToolbox enables grouping of questions, allowing them to be answered multiple times. This feature is particularly useful during household surveys where a set of questions is designed to be answered by each member of the household.

Repeating groups are a powerful tool in survey design, offering several advantages:

Efficiency: A single set of questions can be used for multiple respondents.
Flexibility: Surveys can accommodate varying numbers of respondents.
Data consistency: The same questions are asked for each repetition, ensuring uniform data collection.
Simplified analysis: The structured format facilitates easier data analysis across respondents.

These benefits make repeating groups essential for surveys dealing with multi-member units like households, schools, or organizations.

Loading data

KoboToolbox implements this feature by incorporating the concept of repeat group, enabling the repetition of a group of questions.

In KoboToolbox forms, begin_repeat and end_repeat are special commands that define the boundaries of a repeating group:

begin_repeat: Marks the start of a group of questions that can be repeated multiple times.
end_repeat: Signals the end of the repeating group.

Any questions placed between these commands will be repeated as a set, allowing for multiple responses to the same group of questions. This method involves enclosing the questions intended for repetition within a begin_repeat/end_repeat loop. Furthermore, repeat group allows for nesting, thus enabling the repetition of a question group within another repeat group. This concept can be demonstrated using the project and associated form below.

Survey questions

type	name	label::English (en)	label::Francais (fr)	repeat_count	calculation
start	start
end	end
today	today
begin_repeat	demo	Demographic Characteristics	Caracteristique Demographique
text	name	Name	Nom
integer	age	Age	Age
select_one sex	sex	Sex	Sexe
integer	hobby	How many hobbies does ${name} have?	Combien de hobbies ${name} a ?
select_one yesno	morelang	Does ${name} speak more than one language?	Est-ce que ${name} parle plus d’une langue ?
calculate	name_individual				indexed-repeat(${name}, ${demo}, position(..))
begin_repeat	hobbies_list	List of Hobbies	Liste de hobbies	${hobby}
text	hobbies	Hobbies of ${name_individual}	Hobbies de ${name_individual}
end_repeat
begin_repeat	lang_list	List of Languages	Liste de langues	${morelang}
select_multiple lang	langs	Languages spoken by ${name_individual}	Langue parle par ${name_individual}
end_repeat
end_repeat
calculate	family_count				count(${demo})
note	family_count_note	Number of family members: ${family_count}	Nombre de membre dans la famille: ${family_count}
begin_repeat	education	Education information	Information sur l’education	${family_count}
calculate	name_individual2				indexed-repeat(${name}, ${demo}, position(..))
select_one edu_level	edu_level	What is ${name_individual2}’s level of education	Quel est le niveau d’education de ${name_individual2}
end_repeat

Choices

list_name	name	label::English (en)	label::Francais (fr)
sex	1	Male	Homme
sex	2	Female	Femme
sex	3	Prefer not to say	Prefere ne pas dire
edu_level	1	Primary	Primaire
edu_level	2	Secondary	Secondaire
edu_level	3	Higher Secondary & Above	Lycee et superieur
yesno	1	Yes	Oui
yesno	0	No	Non
lang	1	French	Francais
lang	2	Spanish	Espagnol
lang	3	Arabic	Arabe
lang	99	Other	Autre

Loading the survey

The aforementioned survey, named nested_roster, was uploaded to the server. It can be accessed from the list of assets asset_list.

library(robotoolbox)
library(dplyr)

# Retrieve a list of all assets (projects) from your KoboToolbox server
asset_list <- kobo_asset_list()

# Filter the asset list to find the specific project and get its unique identifier (uid)
uid <- filter(asset_list, name == "nested_roster") |>
  pull(uid)

# Load the specific asset (project) using its uid
asset <- kobo_asset(uid)
asset

#> <robotoolbox asset>  aANhxwX9S6BCsiYMgQj9kV 
#>   Asset name: nested_roster
#>   Asset type: survey
#>   Asset owner: dickoa
#>   Created: 2022-01-05 21:22:51
#>   Last modified: 2023-09-07 07:27:04
#>   Submissions: 3

In this code:

kobo_asset_list() retrieves a list of all assets (projects) available on your KoboToolbox server.
kobo_asset() loads a specific asset (project) using its unique identifier (uid), allowing you to work with that particular project data and metadata.

Extracting the data

The output here deviates from a standard data.frame. It contains a table for each repeat group in the form.

df <- kobo_data(asset)
df

#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `main`, `demo`, `hobbies_list`, `lang_list`, `education`
#> Columns: 51
#> Primary keys: 5
#> Foreign keys: 4

#> [1] "dm"

The output is a dm object, sourced from the dm package. A dm object is a collection of related data frames that preserves the relationships between different levels of data in repeating groups. It’s particularly useful for repeating groups because:

It maintains the hierarchical structure of the data, reflecting how repeating groups are nested within the survey.
It allows for efficient storage and manipulation of data from different levels of the survey without losing the relationships between these levels.
It provides tools for working with related tables, making it easier to analyze data across different repeating groups.

Using a dm object helps preserve the complex structure of surveys with repeating groups, allowing for more intuitive and accurate data analysis.

Manipulating `repeat group` as `dm` object

A dm object, which is a list of interconnected data.frame instances, can be manipulated using the dm package.

Visualizing the relationship between tables

To comprehend the data storage structure, we can visualize the relationships among tables (repeat group loops) and the schema of the dataset. This schema can be depicted using the dm_draw function.

library(dm)
dm_draw(df)

This visual representation of table relationships can significantly aid in planning your data analysis strategy and ensuring that you’re working with the data in a way that respects its inherent structure.

Number of rows of each table

The dm package offers numerous helper functions for manipulating dm objects. For instance, the dm_nrow function can be used to ascertain the number of rows in each table.

dm_nrow(df)
#>         main         demo hobbies_list    lang_list    education 
#>            3            7           14            4            7

A `dm` object is a list of `data.frame`

A dm object is a list of data.frame. Similar to any list of data.frame, you can extract each table (data.frame), and analyze it separately. The principal table, containing the data outside any repeat group, is named main.

glimpse(df$main)
#> Rows: 3
#> Columns: 17
#> $ start                <dttm> 2022-01-06 15:16:21, 2022-01-06 15:17:18, 2022-0…
#> $ end                  <dttm> 2022-01-06 15:17:18, 2022-01-06 15:25:11, 2022-0…
#> $ today                <date> 2022-01-06, 2022-01-06, 2022-01-06
#> $ `_index`             <int> 1, 2, 3
#> $ `_id`                <int> 17727380, 17727538, 17727576
#> $ uuid                 <chr> "ee485fd6655b4e328fdd895ac0451656", "ee485fd6655…
#> $ family_count         <dbl> 2, 2, 3
#> $ education_count      <dbl> 2, 2, 3
#> $ `__version__`        <chr> "vcs3hEpGKxBo8G5uQa94oD", "vcs3hEpGKxBo8G5uQa94oD…
#> $ instanceID           <chr> "uuid:c2fbd800-f9d9-4a68-a9da-3917da86c318", "uui…
#> $ `_xform_id_string`   <chr> "aANhxwX9S6BCsiYMgQj9kV", "aANhxwX9S6BCsiYMgQj9kV…
#> $ `_uuid`              <chr> "c2fbd800-f9d9-4a68-a9da-3917da86c318", "06552d3d…
#> $ `_status`            <chr> "submitted_via_web", "submitted_via_web", "submit…
#> $ `_submission_time`   <dttm> 2022-01-06 15:17:28, 2022-01-06 15:25:23, 2022-01…
#> $ `_validation_status` <chr> NA, NA, NA
#> $ `_submitted_by`      <lgl> NA, NA, NA
#> $ `_attachments`       <list> <NULL>, <NULL>, <NULL>

The other tables are named following the names of their associated repeat groups. For instance, the education table is named after the education repeat group.

glimpse(df$education)
#> Rows: 7
#> Columns: 6
#> $ name_individual2     <chr> "Ahmad", "Myriam", "Shannon", "Skip", "Jemelle", …
#> $ edu_level            <chr+lbl> "3", "3", "3", "3", "3", "3", "1"
#> $ `_index`             <int> 1, 2, 3, 4, 5, 6, 7
#> $ `_parent_index`      <int> 1, 1, 2, 2, 3, 3, 3
#> $ `_parent_table_name` <chr> "main", "main", "main", "main", "main", "main…
#> $ `_validation_status` <chr> NA, NA, NA, NA, NA, NA, NA

Filtering data

One key benefit of using the dm package is its capability to dynamically filter tables while maintaining their interconnections. For example, filtering the main table will automatically extend to the education and demo tables. As the hobbies_list and lang_list tables are linked to the demo table, they will be filtered as well.

df |>
  dm_filter(main = (`_index` == 2)) |>
  dm_nrow()
#>         main         demo hobbies_list    lang_list    education 
#>            1            2            4            0            2

This approach ensures that your filtered dataset maintains the structural integrity of your survey data, leading to more reliable and consistent analysis results.

Joining tables

In certain instances, analyzing joined data may prove simpler. The dm_flatten_to_tbl function can be used to join data safely while preserving its structure and the connections between tables. We can merge the education table with the main table using the dm_flatten_to_tbl function, with the operation starting from education.

df |>
  dm_flatten_to_tbl(.start = education,
                    .join = left_join) |>
  glimpse()
#> Rows: 7
#> Columns: 22
#> $ name_individual2               <chr> "Ahmad", "Myriam", "Shannon", "Skip", "…
#> $ edu_level                      <chr+lbl> "3", "3", "3", "3", "3", "3", "1"
#> $ `_index`                       <int> 1, 2, 3, 4, 5, 6, 7
#> $ `_parent_index`                <int> 1, 1, 2, 2, 3, 3, 3
#> $ `_parent_table_name`           <chr> "main", "main", "main", "main", "ma…
#> $ `_validation_status.education` <chr> NA, NA, NA, NA, NA, NA, NA
#> $ start                          <dttm> 2022-01-06 15:16:21, 2022-01-06 15:16:2…
#> $ end                            <dttm> 2022-01-06 15:17:18, 2022-01-06 15:17:1…
#> $ today                          <date> 2022-01-06, 2022-01-06, 2022-01-06, 202…
#> $ `_id`                          <int> 17727380, 17727380, 17727538, 17727538,…
#> $ uuid                           <chr> "ee485fd6655b4e328fdd895ac0451656", "e…
#> $ family_count                   <dbl> 2, 2, 2, 2, 3, 3, 3
#> $ education_count                <dbl> 2, 2, 2, 2, 3, 3, 3
#> $ `__version__`                  <chr> "vcs3hEpGKxBo8G5uQa94oD", "vcs3hEpGKxB…
#> $ instanceID                     <chr> "uuid:c2fbd800-f9d9-4a68-a9da-3917da86…
#> $ `_xform_id_string`             <chr> "aANhxwX9S6BCsiYMgQj9kV", "aANhxwX9S6BC…
#> $ `_uuid`                        <chr> "c2fbd800-f9d9-4a68-a9da-3917da86c318",…
#> $ `_status`                      <chr> "submitted_via_web", "submitted_via_web…
#> $ `_submission_time`             <dttm> 2022-01-06 15:17:28, 2022-01-06 15:17:2…
#> $ `_validation_status.main`      <chr> NA, NA, NA, NA, NA, NA, NA
#> $ `_submitted_by`                <lgl> NA, NA, NA, NA, NA, NA, NA
#> $ `_attachments`                 <list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>,…

This logic can be extended to create the widest possible table through a cascade of joins, commencing from a deeper table (.start argument) and ending at the main table. Taking .start = hobbies_list as an example, two joins will be performed: hobbies_list will be merged with the demo table, and subsequently, the demo table will be combined with the main table.

df |>
  dm_flatten_to_tbl(.start = hobbies_list,
                    .join = left_join,
                    .recursive = TRUE) |>
  glimpse()
#> Rows: 14
#> Columns: 32
#> $ hobbies                           <chr> "Basketball", "Video games", "Karaok…
#> $ `_index`                          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1…
#> $ `_parent_index.hobbies_list`      <int> 1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 6, …
#> $ `_parent_table_name.hobbies_list` <chr> "demo", "demo", "demo", "demo", "dem…
#> $ `_validation_status.hobbies_list` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ name                              <chr> "Ahmad", "Ahmad", "Myriam", "Shannon…
#> $ age                               <dbl> 30, 30, 20, 40, 40, 40, 65, 35, 35, …
#> $ sex                               <chr+lbl> "1", "1", "2", "1", "1", "1", "1…
#> $ hobby                             <dbl> 2, 2, 1, 3, 3, 3, 1, 3, 3, 3, 2, 2, …
#> $ morelang                          <chr+lbl> "1", "1", "0", "0", "0", "0", "0…
#> $ name_individual                   <chr> "Ahmad", "Ahmad", "Myriam", "Shannon…
#> $ `_parent_index.demo`              <int> 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, …
#> $ hobbies_list_count                <dbl> 2, 2, 1, 3, 3, 3, 1, 3, 3, 3, 2, 2, …
#> $ lang_list_count                   <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, …
#> $ `_parent_table_name.demo`         <chr> "main", "main", "main", "main", "mai…
#> $ `_validation_status.demo`         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ start                             <dttm> 2022-01-06 15:16:21, 2022-01-06 15:…
#> $ end                               <dttm> 2022-01-06 15:17:18, 2022-01-06 15:…
#> $ today                             <date> 2022-01-06, 2022-01-06, 2022-01-06,…
#> $ `_id`                             <int> 17727380, 17727380, 17727380, 177275…
#> $ uuid                              <chr> "ee485fd6655b4e328fdd895ac0451656", …
#> $ family_count                      <dbl> 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, …
#> $ education_count                   <dbl> 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, …
#> $ `__version__`                     <chr> "vcs3hEpGKxBo8G5uQa94oD", "vcs3hEpGK…
#> $ instanceID                        <chr> "uuid:c2fbd800-f9d9-4a68-a9da-3917da…
#> $ `_xform_id_string`                <chr> "aANhxwX9S6BCsiYMgQj9kV", "aANhxwX9S…
#> $ `_uuid`                           <chr> "c2fbd800-f9d9-4a68-a9da-3917da86c31…
#> $ `_status`                         <chr> "submitted_via_web", "submitted_via_…
#> $ `_submission_time`                <dttm> 2022-01-06 15:17:28, 2022-01-06 15:…
#> $ `_validation_status.main`         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `_submitted_by`                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `_attachments`                    <list> <NULL>, <NULL>, <NULL>, <NULL>, <NU…

Conclusion

The integration of robotoolbox with the dm package provides a powerful toolkit for handling complex survey data with repeating groups from KoboToolbox. This approach preserves the hierarchical structure of your data, allows for efficient manipulation and analysis, and offers flexibility in how you view and work with your survey results. By maintaining the relationships between different levels of your survey data, it ensures accurate and meaningful analyses, from simple filtering to complex joins. Whether you’re dealing with household surveys, multi-level organizational data, or any other nested data structure, this workflow offers a robust solution for managing and analyzing your KoboToolbox data in R.

You can gain extensive knowledge about the dm package by going through its detailed documentation.