Yoda Portal Intake
Note that this chapter is only important for projects using the Intake module. At this moment only “Youth” uses the Intake.
Purpose of the Intake
The purpose of the intake is threefold:
1) Recognize data sets and enrich them with metadata.
2) Perform quality checks on data sets, especially completeness checks.
3) Insert data sets into the Vault while converting the supplied files / folders to the applicable folder organization.
Naming of datasets
Automatic recognition of data sets depends on the Wave, Experiment type, Pseudocode and optional Version being mentioned in the names of folders and files.
Yoda can retrieve the W, E, P (and V) elements automatically only when they are present in the folder names according to the preset rules.
The W, E, P (and V) elements can be separated by “-” or “_”.
Yoda will recognize
- Wave: 15y
- Experiment type: peabody
- Pseudocode: A46501
- Version: raw
In the following filename
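As a rough illustration of the naming rule, the sketch below splits a folder name on "-" or "_" and reads the elements by position. The function name parse_dataset_name, the order Wave, Experiment type, Pseudocode, Version, and the sample name (assembled here from the four values listed above) are assumptions made for this example; they are not taken from Yoda's actual recognition rules.

```python
import re
from typing import NamedTuple, Optional


class DatasetId(NamedTuple):
    wave: str
    experiment_type: str
    pseudocode: str
    version: str


def parse_dataset_name(name: str) -> Optional[DatasetId]:
    """Split a folder name into W, E, P and optional V (assumed order)."""
    # The elements may be separated by "-" or "_".
    parts = re.split(r"[-_]", name)
    # W, E and P are required, V is optional; empty elements are rejected.
    if len(parts) not in (3, 4) or not all(parts):
        return None
    wave, experiment_type, pseudocode = parts[:3]
    # Without an explicit version the data set is treated as "Raw".
    version = parts[3] if len(parts) == 4 else "Raw"
    return DatasetId(wave, experiment_type, pseudocode, version)


# Hypothetical folder name, built from the example values above.
print(parse_dataset_name("15y_peabody_A46501_raw"))
```

A name that does not contain three or four elements returns None, which corresponds to a folder Yoda could not recognize.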
Two types of datasets
Two types of data sets can be recognized:
a) A data set characterized by a main folder (containing everything that belongs to the data).
b) A data set consisting of single loose files in a folder (where the folder can also contain files from other data sets).
Type b) is used by Youth to store, for example, single video files as data sets.
The first input of a data set is placed in the Vault as the “Raw” version, unless the name explicitly mentions a version. Later additions to the data set are added with an explicit version (for integrity reasons, Yoda never overwrites an existing version in the Vault).
The Intake module is an optional module to perform a quality check on primary research data before it is stored in the Vault.
Imagine, for instance, that you expect a device to upload a set of 17 lab tests every morning at 8.00. The Intake module can be programmed to check whether the upload indeed took place at 8.00, whether the type of data is labtest and whether the number of files is 17, as expected.
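The lab test example could be sketched as follows. This is only an illustration of the kind of check described: the Upload class, the check_upload function and their fields are hypothetical names, not part of the Intake module.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List


@dataclass
class Upload:
    """Hypothetical description of one upload from the device."""
    received_at: datetime
    data_type: str
    file_names: List[str]


def check_upload(upload: Upload,
                 expected_time: time = time(8, 0),
                 expected_type: str = "labtest",
                 expected_count: int = 17) -> List[str]:
    """Return a list of problems; an empty list means the upload is as expected."""
    problems = []
    received = (upload.received_at.hour, upload.received_at.minute)
    if received != (expected_time.hour, expected_time.minute):
        problems.append(
            f"uploaded at {upload.received_at:%H:%M}, expected {expected_time:%H:%M}")
    if upload.data_type != expected_type:
        problems.append(f"data type is {upload.data_type}, expected {expected_type}")
    if len(upload.file_names) != expected_count:
        problems.append(
            f"{len(upload.file_names)} files received, expected {expected_count}")
    return problems


late = Upload(datetime(2024, 1, 1, 8, 5), "labtest", ["test.csv"] * 16)
for problem in check_upload(late):
    print(problem)
```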
Currently the Intake module is in use for the Youth data project, where it automatically extracts:
- Experiment type
Furthermore, the Intake module, as in use for the Youth project, performs checks like:
- The type of data
- Whether the file isn’t empty
- Count of the number of files
For other studies the Intake module could be programmed to perform checks like:
- The time of the data upload
- Whether the resolution of an image is sufficient
Requests can be discussed with Yoda’s Business Information Manager.
During the Intake process the data can have different statuses:
Unrecognized applies to loose files. Yoda is not able to scan them.
Unscanned data resides in the Intake, is recognized by Yoda and can be scanned.
Scanned data is either ready to be sent to the Vault or still contains errors. In the latter case the data can be reworked and scanned again.
Locked data can’t be changed, but unlocking is possible. Data can be locked by:
- the user
- the system
If a user locked the data, it can be unlocked at any time.
If the data was locked by the system because it is waiting to be sent to the Vault in the next run, unlocking for a rollback is possible until that run starts. Runs take place every 5 minutes, so you have at most 5 minutes to unlock the data.
Frozen data can’t be changed and defrosting or unlocking is not possible. Frozen data is locked by the system and is waiting to be actually transferred to the Vault. This status is hardly relevant to the user. It means the system is busy processing the data.
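The statuses above can be summarized as a small lookup table. The sketch below is one reading of the descriptions in this chapter, not Yoda's implementation; the action names "scan", "lock" and "unlock" and the helper can are chosen for the example.

```python
from enum import Enum


class Status(Enum):
    UNRECOGNIZED = "Unrecognized"
    UNSCANNED = "Unscanned"
    SCANNED = "Scanned"
    LOCKED = "Locked"
    FROZEN = "Frozen"


# Actions available per status, as read from the text above (an assumption).
ALLOWED_ACTIONS = {
    Status.UNRECOGNIZED: set(),         # cannot be scanned
    Status.UNSCANNED: {"scan"},         # recognized, can be scanned
    Status.SCANNED: {"scan", "lock"},   # rework and rescan, or lock when error-free
    Status.LOCKED: {"unlock"},          # no changes, but a rollback is possible
    Status.FROZEN: set(),               # system is busy, nothing is possible
}


def can(status: Status, action: str) -> bool:
    """Tell whether the given action is available for data in this status."""
    return action in ALLOWED_ACTIONS[status]


print(can(Status.LOCKED, "unlock"))
print(can(Status.FROZEN, "unlock"))
```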
Scanning occurs on the user’s initiative to achieve goals 1 and 2 as mentioned above.
During scanning the data is checked. The checks performed on the Youth data are:
- Completeness: are all expected files of this experiment type present?
- File size: are the files of the expected size?
- Metadata structure: are the following items readable?
  - experiment type
  - date of creation
  - date of transfer to the Vault
  - wave (measuring moment)
- Presence of a lab report
- Non-duplication: aren’t there files already in the Vault with the same:
- Validity of the pseudocode.
Without a pseudocode the Youth data cannot be scanned.
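To give an idea of what a completeness check involves, here is a minimal sketch. The experiment type "example_experiment" and its expected file suffixes are placeholders invented for this example; the real expectations per experiment type are defined in the Youth configuration and are not reproduced here.

```python
from typing import Dict, Iterable, List

# Placeholder table: which file suffixes are expected per experiment type.
EXPECTED_SUFFIXES: Dict[str, List[str]] = {
    "example_experiment": [".csv", ".pdf"],
}


def missing_files(experiment_type: str, file_names: Iterable[str]) -> List[str]:
    """Return the expected suffixes for which no file is present."""
    if experiment_type not in EXPECTED_SUFFIXES:
        raise ValueError(f"unknown experiment type: {experiment_type}")
    names = [name.lower() for name in file_names]
    return [suffix for suffix in EXPECTED_SUFFIXES[experiment_type]
            if not any(name.endswith(suffix) for name in names)]


# The lab report (.pdf) is absent, so the data set is incomplete.
print(missing_files("example_experiment", ["results.csv"]))
```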
Errors and Warnings
The result of the check is displayed in the portal.
The Research Assistant (RA) checks the notifications, corrects any errors (rework) and decides which warnings can be disregarded.
The RA completes this step by scanning again. A log entry will appear in the database.
During the scan, the recognized metadata is added to the files.
Users cannot add metadata manually. If you add metadata with the icommands, it will be ignored / removed when the data is moved to the Vault.
Lock, reserved for Data Managers, serves the above-mentioned goal 3.
After approval by the Research Assistant, the data waits in Yoda for final approval by the Data Manager. The Data Manager will “Lock” the data to be stored in the Vault. A batch run that moves the data to the Vault is scheduled every 5 minutes. Between clicking “Lock” and the moment the data is actually moved to the Vault, the status is “Locked”: no changes can be made to the data, but a rollback (“Unlock”) is still possible. Note that this delay can be anything between 0 seconds and 5 minutes.
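The length of the unlock window can be worked out with a little arithmetic. The sketch below assumes that the batch runs fall on whole multiples of 5 minutes on the clock (9.40, 9.45, ...); the actual schedule of the run is not documented here, and the function name next_run is made up for the example.

```python
from datetime import datetime, timedelta

RUN_INTERVAL = timedelta(minutes=5)


def next_run(locked_at: datetime) -> datetime:
    """First batch run at or after locked_at (assumed 5-minute clock boundaries)."""
    midnight = locked_at.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = locked_at - midnight
    # Ceiling division: number of whole intervals needed to reach locked_at.
    intervals = -(-elapsed // RUN_INTERVAL)
    return midnight + intervals * RUN_INTERVAL


locked_at = datetime(2024, 1, 1, 9, 42)
run_at = next_run(locked_at)
print("next run:", run_at.strftime("%H:%M"))
print("time left to unlock:", run_at - locked_at)
```

Under this assumption, data locked at 9.42 can be unlocked for 3 minutes, while data locked exactly at 9.45 is picked up immediately.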
Move not Copy
Note that the data is moved from Intake to Vault and not copied: the data will disappear from the Intake.
Workflow to store data from the Intake in the Vault
Today at 9.42 the following folder was sent to the Intake:
You want the data to be stored in the Vault.
Steps Research Assistant
In the Yoda Portal open the tab for the Intake.
Then check the applicable study, in this case “Youth”.
From the dropdown “Change folder”, choose your folder.
Now you are in the applicable folder.
Within this folder choose “Scan all files”
The scan is performed and Pseudocode, Experiment type and Wave are automatically extracted from the folder name.
Steps Data Manager
The Data Manager will regularly check for files to be moved to the Vault.
Only scanned data without errors will have a checkbox in front of it, so that it can be sent to the Vault.
The Data Manager places a check in front of the data to be stored permanently and sends it to the Vault by choosing “Lock datasets”.
The data set will get status “Locked” until it is moved to the Vault in the next run (every 5 minutes).
After a maximum of 5 minutes the data is moved from the Intake to the Vault. Note that the data in the Vault is ordered in a folder structure: Wave\Experiment type\Pseudocode.
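The folder structure in the Vault can be illustrated with a short sketch that builds the target location from the three elements. The root "/vault" and the function name vault_path are placeholders, and forward slashes are used here as an assumption about how the underlying storage writes paths.

```python
from pathlib import PurePosixPath


def vault_path(vault_root: str, wave: str, experiment_type: str,
               pseudocode: str) -> PurePosixPath:
    """Compose Wave / Experiment type / Pseudocode below a given root."""
    return PurePosixPath(vault_root) / wave / experiment_type / pseudocode


# "/vault" is a placeholder root, not the real location of the Vault.
print(vault_path("/vault", "15y", "peabody", "A46501"))
```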
Errors and Warnings
In the scenario above all went smoothly. In case scanning the files is not successful due to incomplete information you will get a notification like:
Rework can solve the problems and the data can be scanned again.
While scanning, a couple of checks are performed on the data. In the following example the wave 35y doesn’t exist in the Youth project.
You will see a notification.
The column Nr. Of errors/warnings gives two values, e/w. The e is the number of errors; these must be solved. The w is the number of warnings; you can decide what to do with them.
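For readers who process these values outside the portal, the e/w notation can be read as in the sketch below. It assumes the cell contains two whole numbers separated by "/"; the function names are made up for this example. The rule that only data without errors can be sent to the Vault follows the workflow described above.

```python
from typing import Tuple


def parse_error_warning_count(cell: str) -> Tuple[int, int]:
    """Split a value such as "2/1" into (errors, warnings)."""
    errors, warnings = (int(part) for part in cell.split("/"))
    return errors, warnings


def can_be_sent_to_vault(cell: str) -> bool:
    """Errors must be solved first; warnings may be disregarded."""
    errors, _warnings = parse_error_warning_count(cell)
    return errors == 0


print(parse_error_warning_count("2/1"))
print(can_be_sent_to_vault("0/3"))
```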
Double clicking on the notification will reveal the problem(s):
The comment you write here will be seen by the Data Manager.