v0 (Genomic Data)¶
The v0 loader expects an ontology, with mutation and clinical data in the MAF format.
For the ontology data, you must use ~/medco-loader/data/genomic/tcga_cbio/clinical_data.csv and ~/medco-loader/data/genomic/tcga_cbio/mutation_data.csv.
For clinical data you can keep using the same two files or a subset of the data (e.g. 8_clinical_data.csv).
More information about how to generate sample datafiles can be found below.
After the following script is executed all the data is encrypted and ‘deterministically tagged’ in compliance with the MedCo data model.
Loading from the same host¶
If you are using the same host machine to deploy and load the data, you can use the table below to adapt some of the script parameters to the deployment scenario.
This includes the test-network scenario, in which you load the data of each node from the machine that hosts that node.
You need to repeat the loading process for all nodes, by modifying the arguments “network”, “entryPointIdx” and “dbName”.
| Deployment Profile | --network | -v (volumes) | --dbHost | --dbName |
|---|---|---|---|---|
| test-local-3nodes | test-local-3nodes_medco-network + test-local-3nodes_medco-srv<node index> | ~/medco-loader/data/genomic:/dataset + ~/medco-deployment/configuration-profiles/test-local-3nodes/group.toml:/group.toml | postgresql | i2b2medcosrv<node index> |
| test-network | test-network-<network name>-node<node index>_default | ~/medco-loader/data/genomic:/dataset + ~/medco-deployment/configuration-profiles/test-network-<network name>-node<node index>/group.toml:/group.toml | postgresql | i2b2medco |
| dev-local-3nodes | dev-local-3nodes_medco-network + dev-local-3nodes_medco-srv<node index> | ~/medco-loader/data/genomic:/dataset + ~/medco-deployment/configuration-profiles/dev-local-3nodes/group.toml:/group.toml | postgresql | i2b2medcosrv<node index> |
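The per-node values in the table above follow a simple naming pattern, which can be sketched as two small shell helpers. The function names `loader_networks` and `loader_db_name` are hypothetical and not part of medco-loader; they only reproduce the table's patterns, and the third argument stands in for `<network name>`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of medco-loader): they only encode the
# naming patterns from the same-host table.

# Print the Docker network name(s) for a profile and node index, one per line.
# Usage: loader_networks <profile> <node index> [<network name>]
loader_networks() {
  local profile="$1" node="$2" netname="${3:-}"
  case "$profile" in
    test-local-3nodes|dev-local-3nodes)
      printf '%s\n' "${profile}_medco-network" "${profile}_medco-srv${node}"
      ;;
    test-network)
      printf '%s\n' "test-network-${netname}-node${node}_default"
      ;;
    *)
      echo "unknown profile: ${profile}" >&2
      return 1
      ;;
  esac
}

# Print the database name for a profile and node index.
# Usage: loader_db_name <profile> <node index>
loader_db_name() {
  local profile="$1" node="$2"
  case "$profile" in
    test-local-3nodes|dev-local-3nodes) printf '%s\n' "i2b2medcosrv${node}" ;;
    test-network) printf '%s\n' "i2b2medco" ;;
    *) echo "unknown profile: ${profile}" >&2; return 1 ;;
  esac
}
```

For example, `loader_db_name dev-local-3nodes 1` prints `i2b2medcosrv1`, which is the value to pass as --dbName when loading node 1.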
Loading from a different host¶
If you are using an external machine (e.g. your laptop) to load the data into one of the nodes, you can use the table below to adapt some of the script parameters to the deployment scenario. In this case you do not need to specify the --network parameter.
You need to repeat the loading process for all nodes, by modifying the arguments “entryPointIdx” and “dbName” (the “network” argument is not used in this case).
| Deployment Profile | -v (volumes) | --dbHost | --dbName |
|---|---|---|---|
| test-local-3nodes | ~/medco-loader/data/genomic:/dataset + ~/medco-deployment/configuration-profiles/test-local-3nodes/group.toml:/group.toml | <domain name> | i2b2medcosrv<node index> |
| test-network | ~/medco-loader/data/genomic:/dataset + ~/medco-deployment/configuration-profiles/test-network-<network name>-node<node index>/group.toml:/group.toml | <domain name> | i2b2medco |
Example¶
The following example loads data into node 0 of a running MedCo development deployment (dev-local-3nodes).
Adapt the arguments network, entryPointIdx and dbName accordingly for the other two nodes.
cd ~/medco-loader/deployment
docker run --network="dev-local-3nodes_medco-network" --network="dev-local-3nodes_medco-srv0" \
-v ~/medco-loader/data/genomic:/dataset \
-v ~/medco-deployment/configuration-profiles/dev-local-3nodes/group.toml:/group.toml \
medco/medco-loader:v0.1.1 medco-loader -debug 2 v0 --group /group.toml --entryPointIdx 0 \
--ont_clinical /dataset/tcga_cbio/8_clinical_data.csv --sen /dataset/sensitive.txt \
--ont_genomic /dataset/tcga_cbio/8_mutation_data.csv --clinical /dataset/tcga_cbio/8_clinical_data.csv \
--genomic /dataset/tcga_cbio/8_mutation_data.csv --output /dataset/ --dbHost postgresql --dbPort 5432 \
--dbName i2b2medcosrv0 --dbUser i2b2 --dbPassword i2b2
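Since the same command has to be repeated for every node, the per-node changes can be sketched as a dry-run loop. This is only an illustration: the helper name `print_load_command` is hypothetical, the script prints the commands instead of running them, and it assumes the dev-local-3nodes values from the same-host table (including `postgresql` as the database host).

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the loader command for each node of dev-local-3nodes
# without executing anything. The helper name is hypothetical.
PROFILE="dev-local-3nodes"

# Print the docker command for one node; only the second network,
# --entryPointIdx and --dbName depend on the node index.
print_load_command() {
  local node="$1"
  cat <<EOF
docker run --network="${PROFILE}_medco-network" --network="${PROFILE}_medco-srv${node}" \\
  -v ~/medco-loader/data/genomic:/dataset \\
  -v ~/medco-deployment/configuration-profiles/${PROFILE}/group.toml:/group.toml \\
  medco/medco-loader:v0.1.1 medco-loader -debug 2 v0 --group /group.toml --entryPointIdx ${node} \\
  --ont_clinical /dataset/tcga_cbio/8_clinical_data.csv --sen /dataset/sensitive.txt \\
  --ont_genomic /dataset/tcga_cbio/8_mutation_data.csv --clinical /dataset/tcga_cbio/8_clinical_data.csv \\
  --genomic /dataset/tcga_cbio/8_mutation_data.csv --output /dataset/ --dbHost postgresql --dbPort 5432 \\
  --dbName i2b2medcosrv${node} --dbUser i2b2 --dbPassword i2b2
EOF
}

# Print one command per node (0, 1 and 2), separated by blank lines.
for node in 0 1 2; do
  print_load_command "$node"
  echo
done
```

Printing first makes it easy to review the three commands before anything touches the databases; once they look right, each one can be pasted into a shell and run.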
Explanation of the arguments:
NAME:
medco-loader v0 - Load genomic data (e.g. tcga_bio dataset)
USAGE:
medco-loader v0 [command options] [arguments...]
OPTIONS:
--group value, -g value UnLynx group definition file
--entryPointIdx value, --entry value Index (relative to the group definition file) of the collective authority server to load the data
--sensitive value, --sen value File containing a list of sensitive concepts
--dbHost value, --dbH value Database hostname
--dbPort value, --dbP value Database port (default: 0)
--dbName value, --dbN value Database name
--dbUser value, --dbU value Database user
--dbPassword value, --dbPw value Database password
--ont_clinical value, --oc value Clinical ontology to load
--ont_genomic value, --og value Genomic ontology to load
--clinical value, --cl value Clinical file to load
--genomic value, --gen value Genomic file to load
--output value, -o value Output path to the .csv files
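Several of these options point at files inside the mounted dataset directory, so it can help to confirm that they exist before starting the loader. The helper below is a minimal sketch of such a pre-flight check; `check_inputs` is a hypothetical name and not part of medco-loader, and the paths in the usage comment are the ones from the example command.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of medco-loader).
# Returns 0 if every listed file is readable, 1 otherwise.
# Usage: check_inputs <dataset dir on the host> <relative path>...
check_inputs() {
  local dir="$1"
  shift
  local missing=0 rel
  for rel in "$@"; do
    if [ ! -r "${dir}/${rel}" ]; then
      echo "missing or unreadable: ${dir}/${rel}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call, using the host directory mounted as /dataset:
#   check_inputs ~/medco-loader/data/genomic \
#     tcga_cbio/8_clinical_data.csv tcga_cbio/8_mutation_data.csv sensitive.txt \
#     && echo "inputs OK"
```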
Data Manipulation¶
Inside ~/medco-loader/data/scripts/ you can find a small Python application to extract (or replicate) data out of the original tcga_cbio dataset.
You can decide which patients you want to include in your ‘new’ dataset or simply pick a random sample.
To check that it is working you can query for:
-> MedCo Genomic Ontology -> Gene Name -> BRPF3
For the small dataset 8_xxxx you should obtain 3 matching subjects (one at each site).