Loading Data¶
The current version offers two loading alternatives: (v0) loading of clinical and genomic data based on MAF datasets; and (v1) loading of generic i2b2 data. Each loader currently supports one dataset:
- v0: a genomic dataset (tcga_cbio publicly available in cbio portal)
- v1: the i2b2 demodata.
Future releases of this software will allow other data sources, provided that they follow a specific structure (e.g. BAM format).
Pre-Requisites¶
First get the repository containing the MedCo loader software, which already contains some test data for you to work with. Note that you need git-lfs installed for the data to be retrieved together with the repository.
$ cd ~
$ git clone -b v0.1 https://github.com/lca1/medco-loader.git
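Because the test data are stored with git-lfs, it can be useful to confirm that the tool is installed before cloning. The following is a minimal sketch for a POSIX shell, not part of the MedCo tooling: the helper names `require_cmd` and `clone_medco_loader` are illustrative assumptions.

```shell
#!/bin/sh
# require_cmd NAME: succeed if NAME is on the PATH, otherwise print a hint and fail.
# Hypothetical helper, not part of the MedCo loader.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required command: $1" >&2
    return 1
}

# clone_medco_loader: clone the repository only when git and git-lfs are both available.
clone_medco_loader() {
    require_cmd git || return 1
    require_cmd git-lfs || return 1
    git lfs install    # register the LFS filters for the current user
    git clone -b v0.1 https://github.com/lca1/medco-loader.git "$HOME/medco-loader"
}
```

Calling `clone_medco_loader` performs the clone. If the repository was already cloned without git-lfs, running `git lfs pull` inside it should fetch the missing data files.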
Building Application
To get the MedCo loader application, pull it with Docker:
docker pull medco/medco-loader:v0.1
v0 (Genomic Data)¶
The v0 loader expects an ontology, with mutation and clinical data in the MAF format.
For the ontology data, you must use data/genomic/tcga_cbio/clinical_data.csv and data/genomic/tcga_cbio/mutation_data.csv.
For the clinical data, you can keep using the same two files or a subset of the data (e.g. 8_clinical_data.csv).
More information about how to generate sample data files can be found below.
All the data is encrypted and ‘deterministically tagged’ in compliance with the MedCo data model.
Example¶
The following example shows how to load data into a running MedCo development deployment (dev-3nodes-1host), on node 0.
For the other two nodes, adapt the network, entryPointIdx and dbName arguments accordingly.
cd ~/medco-loader/docker
docker pull medco/medco-loader:v0.1
docker run --network="dev-3nodes-1host_medco-network" --network="dev-3nodes-1host_medco-srv0" \
-v ~/medco-loader/data/genomic:/dataset \
-v ~/medco-deployment/configuration-profiles/dev-3nodes-1host/group.toml:/group.toml \
medco/medco-loader:v0.1 -debug 2 v0 --group /group.toml --entryPointIdx 0 \
--ont_clinical /dataset/tcga_cbio/8_clinical_data.csv --sen /dataset/sensitive.txt \
--ont_genomic /dataset/tcga_cbio/8_mutation_data.csv --clinical /dataset/tcga_cbio/8_clinical_data.csv \
--genomic /dataset/tcga_cbio/8_mutation_data.csv --output /dataset/ --dbHost postgresql --dbPort 5432 \
--dbName i2b2medcosrv0 --dbUser i2b2 --dbPassword i2b2
Explanation of the arguments:
NAME:
   medco-loader v0 - Load genomic data (e.g. tcga_bio dataset)
USAGE:
   medco-loader v0 [command options] [arguments...]
OPTIONS:
   --group value, -g value               UnLynx group definition file
   --entryPointIdx value, --entry value  Index (relative to the group definition file) of the collective authority server to load the data
   --sensitive value, --sen value        File containing a list of sensitive concepts
   --dbHost value, --dbH value           Database hostname
   --dbPort value, --dbP value           Database port (default: 0)
   --dbName value, --dbN value           Database name
   --dbUser value, --dbU value           Database user
   --dbPassword value, --dbPw value      Database password
   --ont_clinical value, --oc value      Clinical ontology to load
   --ont_genomic value, --og value       Genomic ontology to load
   --clinical value, --cl value          Clinical file to load
   --genomic value, --gen value          Genomic file to load
   --output value, -o value              Output path to the .csv files
For more help simply type
./medco-loader v0 -help
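The example above targets node 0 only. As a sketch of how the three per-node arguments might be derived for the other nodes, the helper below sets them from a node index and prints them. It assumes that the network and database names follow the same pattern as for node 0 (`dev-3nodes-1host_medco-srvN` and `i2b2medcosrvN`); this naming is inferred from the example and should be checked against your deployment. The function name `set_node_vars` and the `NODE_*` variables are hypothetical.

```shell
#!/bin/sh
# set_node_vars IDX: set NODE_IDX, NODE_NETWORK and NODE_DB for node IDX.
# ASSUMPTION: per-node names follow the node-0 pattern used in the example.
set_node_vars() {
    NODE_IDX="$1"
    NODE_NETWORK="dev-3nodes-1host_medco-srv${NODE_IDX}"
    NODE_DB="i2b2medcosrv${NODE_IDX}"
}

# Print the arguments that change from node to node (nothing is executed).
for idx in 0 1 2; do
    set_node_vars "$idx"
    echo "node ${NODE_IDX}: --network=\"${NODE_NETWORK}\" --entryPointIdx ${NODE_IDX} --dbName ${NODE_DB}"
done
```

To actually load the data, the `echo` line can be replaced by the `docker run` command from the example, with the three variables substituted for the node-0 values.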
Data Manipulation¶
Inside data/scripts/ you can find a small Python application to extract (or replicate) data out of the original tcga_cbio dataset.
You can decide which patients you want to include in your ‘new’ dataset, or simply pick a random sample.
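To illustrate the idea of picking a random sample, the sketch below keeps the header line of a CSV file and selects a number of data rows at random. It is not the script shipped in data/scripts/: the function name `sample_rows` is hypothetical, it assumes a single header line with one record per row, and it relies on `shuf` from GNU coreutils. It also does not keep the clinical and mutation files consistent with each other, which would require filtering both files by the same patient identifiers.

```shell
#!/bin/sh
# sample_rows FILE N: print the header line of FILE followed by N random data rows.
# ASSUMPTION: FILE has exactly one header line and one record per row.
sample_rows() {
    file="$1"
    n="$2"
    head -n 1 "$file"                      # keep the header
    tail -n +2 "$file" | shuf -n "$n"      # pick N of the remaining rows at random
}
```

For example, `sample_rows clinical_data.csv 8 > my_clinical_data.csv` would write a file containing the header and eight randomly chosen rows (the output file name is arbitrary).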
v1 (I2B2 Demodata)¶
v1 expects an already existing i2b2 database (in .csv format) that will be converted in a way that is compliant with the MedCo data model. This involves encrypting and ‘deterministically tagging’ some of the data.
List of input (‘original’) files:
- all i2b2metadata files (e.g. i2b2.csv)
- dummy_to_patient.csv
- patient_dimension.csv
- visit_dimension.csv
- concept_dimension.csv
- modifier_dimension.csv
- observation_fact.csv
- table_access.csv
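Before running the loader, it can help to confirm that the files listed above are present. The sketch below checks one directory for the named files and reports any that are missing. It assumes that all of them sit in the same directory, which may not match your layout; the function name `check_i2b2_files` is illustrative. The i2b2metadata files are not checked, because their names depend on the dataset.

```shell
#!/bin/sh
# check_i2b2_files DIR: report each missing input file on stderr;
# return 0 if all are present, 1 otherwise.
# ASSUMPTION: all input files are located directly in DIR.
check_i2b2_files() {
    dir="$1"
    missing=0
    for f in dummy_to_patient.csv patient_dimension.csv visit_dimension.csv \
             concept_dimension.csv modifier_dimension.csv observation_fact.csv \
             table_access.csv; do
        if [ ! -f "${dir}/${f}" ]; then
            echo "missing: ${dir}/${f}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `check_i2b2_files /path/to/i2b2/files && echo "all input files found"` prints the confirmation only when every file exists (the path is a placeholder).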
Dummy Generation¶
The provided example data set files come with dummy data pre-generated.
Those data are random dummy entries whose purpose is to prevent frequency attacks.
For more information on how this dummy generation is done please refer to data/scripts/import-tool/report/report.pdf.
In a future release, the generation will be done dynamically by the loader.
Example¶
The following example shows how to load data into a running MedCo development deployment (dev-3nodes-1host), on node 0.
For the other two nodes, adapt the network, entryPointIdx and dbName arguments accordingly.
docker run --network="dev-3nodes-1host_medco-network" --network="dev-3nodes-1host_medco-srv0" \
-v ~/medco-loader/data/i2b2:/dataset -v ~/medco-deployment/configuration-profiles/dev-3nodes-1host/group.toml:/group.toml \
medco/medco-loader:v0.1 -debug 2 v1 --group /group.toml --entryPointIdx 0 --sen /dataset/sensitive.txt \
--files /dataset/files.toml --dbHost postgresql --dbPort 5432 --dbName i2b2medcosrv0 --dbUser i2b2 --dbPassword i2b2
NAME:
   medco-loader v1 - Convert existing i2b2 data model
USAGE:
   medco-loader v1 [command options] [arguments...]
OPTIONS:
   --group value, -g value               UnLynx group definition file
   --entryPointIdx value, --entry value  Index (relative to the group definition file) of the collective authority server to load the data
   --sensitive value, --sen value        File containing a list of sensitive concepts
   --dbHost value, --dbH value           Database hostname
   --dbPort value, --dbP value           Database port (default: 0)
   --dbName value, --dbN value           Database name
   --dbUser value, --dbU value           Database user
   --dbPassword value, --dbPw value      Database password
   --files value, -f value               Configuration toml with the path of the all the necessary i2b2 files
   --empty, -e                           Empty patient and visit dimension tables (y/n)
For more help simply type
./medco-loader v1 -help