Discover

Discover allows users to create a new Polygenic Risk Score (PRS) using state-of-the-art algorithms.

Log in to saas.allelica.com to access these functions in the PRS Discovery App:

Discover Populations

You may choose to use your own population data or the UK Biobank population dataset. Both paths are detailed below.

Step 1 – Upload Summary Statistics

1.1 Verify your data format

To be processed successfully, your file must contain 7 tab-delimited fields per line. The file extension does not matter (we expect most users will submit a .csv file), but the fields must be separated by tabs, not commas. Your column layout must be:

  1. SNP ID

  2. Minor allele

  3. Major allele

  4. Effective allele

  5. Effective allele frequency

  6. Weight

  7. p-value

Matching your data to the expected structure is a vital step. Summary statistics (SS) come in many different output formats, so you must set up your data appropriately before submission.

A tab-delimited file separates these 7 columns with a single tab character between them, such as:

rs12345 A T T 0.12 0.0003 1*E^-8

A comma-delimited file, such as the following, will be rejected:

rs12345,A,T,T,0.12,0.0003,1*E^-8
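The format rules above can be checked locally before uploading. The following is a minimal sketch, not part of the Allelica platform: the function names `check_summary_stats` and `csv_to_tsv` are hypothetical, and the checks assume the file has no header row and that the allele frequency is a plain decimal between 0 and 1. The p-value column is only checked for presence, since the accepted notation is not specified here.

```python
import csv
import io

# Assumed layout from Step 1.1: SNP ID, minor allele, major allele,
# effective allele, effective allele frequency, weight, p-value.
EXPECTED_COLUMNS = 7


def check_summary_stats(text):
    """Return a list of (line_number, message) problems found in the text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != EXPECTED_COLUMNS:
            problems.append(
                (number, f"expected {EXPECTED_COLUMNS} tab-delimited fields, found {len(fields)}")
            )
            continue
        if any(not field.strip() for field in fields):
            problems.append((number, "empty field"))
            continue
        try:
            frequency = float(fields[4])
        except ValueError:
            problems.append((number, "allele frequency is not a number"))
            continue
        if not 0.0 <= frequency <= 1.0:
            problems.append((number, "allele frequency is outside 0..1"))
    return problems


def csv_to_tsv(text):
    """Rewrite comma-delimited text as tab-delimited text."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


comma_line = "rs12345,A,T,T,0.12,0.0003,1*E^-8\n"
print(check_summary_stats(comma_line))              # one problem: wrong field count
print(check_summary_stats(csv_to_tsv(comma_line)))  # no problems after conversion
```

Running the check on the converted text confirms that the comma-delimited example passes once its commas are replaced with tabs.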

1.2 Upload your data file

1.2.1 Once you have your data in the correct format, use our file picker to upload it. Click "Browse". Navigate to your file and upload.

1.2.2 Click "Done".

Finalizing Upload Summary Statistics

1.3 A successful upload is confirmed with a "Task completed" verification providing a date/time stamp.

If you want to revert the upload of this data at any time, simply use the "Reset" option.

Step 2 – Upload Validation Population

There are 2 paths available to users.

2.1 Use the UK Biobank data as the population for comparison.

2.2 Provide your own custom population for comparison.

If you choose 2.1, the 1st UK Biobank data release will be used as the validation population and the 2nd release as the testing population.

Path 2.1

2.1.1 To use the UK Biobank population dataset as your validation population, simply select the UK Biobank option.

2.1.2 The UK Biobank is a large dataset containing epidemiological, biometric, and clinical data from a population sample of approximately 400,000 European individuals. Each member of the UK Biobank population is also linked to Hospital Episode Statistics (HES) data, as well as national death and cancer registries. This vast amount of data allows you to formulate both simple and complex phenotypes based on a single biometric parameter or a combination of multiple data sources (e.g., hospital diagnoses and surgical procedures received by the patient).

Each data source in the UK Biobank is identified by a specific Data-Field number. For example, the heights of UK Biobank participants are specified by Data-Field 12144. You can specify any desired phenotype by entering all the phenotype-defining Data-Fields as a comma-separated list. You can browse the Data-Field IDs in the UK Biobank Showcase.

Please note that a Data-Field in the UK Biobank may contain data referring to multiple conditions. For example, Data-Field 20002 (non-cancer illness code, self-reported) contains a wide spectrum of self-reported illnesses, each one specified by a different numerical code. In these cases, you must enter the phenotype-defining codes as a comma-separated list enclosed in brackets after the field of interest.

As an example, to specify a self-reported phenotype of diabetes (illness codes: 1220, 1222, and 1223), you must enter the following Data-Field and codes: 20002 (1220, 1222, 1223). When specifying multiple Data-Fields and codes, separate each entry from the next with a comma placed after the preceding closing bracket.
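The Data-Field syntax described above can be assembled programmatically when a phenotype draws on several fields. This is a minimal sketch under stated assumptions: the helper name `format_data_fields` is hypothetical, and the layout for multiple entries (a comma between entries, with fields that need no codes written as a bare number) is a reading of the description above rather than a confirmed specification.

```python
def format_data_fields(fields):
    """Build a Data-Field string from a mapping of field ID -> list of codes.

    An empty code list means the field is listed on its own, without brackets.
    """
    parts = []
    for field_id, codes in fields.items():
        if codes:
            code_list = ", ".join(str(code) for code in codes)
            parts.append(f"{field_id} ({code_list})")
        else:
            parts.append(str(field_id))
    return ", ".join(parts)


# Self-reported diabetes, as in the example above.
print(format_data_fields({20002: [1220, 1222, 1223]}))
# prints: 20002 (1220, 1222, 1223)

# Combining a coded field with a field that needs no codes (assumed layout).
print(format_data_fields({20002: [1220, 1222, 1223], 12144: []}))
# prints: 20002 (1220, 1222, 1223), 12144
```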

2.1.3 Click "Confirm".

2.1.4 Click "Done".

Path 2.2

2.2.1 As in Step 1.1, the data you upload must match the expected format for your model to run successfully.

For example, a VCF must provide the data in 2 tab-delimited columns as previously described.

The phenotype ID must correspond to that applied by the UK Biobank. They provide a search engine for this purpose.

Other data formats that can be parsed include:

  • Oxford genotype format (bgen/bfam/bsam)

  • PLINK genotype format (pgen/psam/pfam)

  • PLINK binary format (bim/bed/fam)

Null values are not acceptable.
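A quick local check can flag missing values before upload. The sketch below is an illustration only: it assumes that "null values" includes missing genotype calls, which the VCF format writes as `.` (for example `./.`), and it assumes an uncompressed, tab-delimited VCF held in memory. The function name `find_missing_genotypes` and the sample data are hypothetical.

```python
def find_missing_genotypes(vcf_text):
    """Return (variant_id, sample_name) pairs whose genotype call is missing."""
    samples = []
    missing = []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        columns = line.split("\t")
        if line.startswith("#CHROM"):
            samples = columns[9:]  # sample names follow the 9 fixed columns
            continue
        if len(columns) < 10:
            continue  # no per-sample genotype columns on this line
        format_keys = columns[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_index = format_keys.index("GT")
        for sample, entry in zip(samples, columns[9:]):
            values = entry.split(":")
            genotype = values[gt_index] if gt_index < len(values) else "."
            alleles = genotype.replace("|", "/").split("/")
            if "." in alleles:
                missing.append((columns[2], sample))
    return missing


# Made-up two-variant, two-sample example; the IDs are illustrative only.
example = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\trs1\tA\tT\t.\tPASS\t.\tGT\t0/1\t./.",
    "1\t200\trs2\tG\tC\t.\tPASS\t.\tGT:DP\t1|1:12\t0/0:9",
])
print(find_missing_genotypes(example))
# prints: [('rs1', 'S2')]
```

For large files, a dedicated tool would be a better fit than reading the whole file into memory; the sketch only shows what a missing call looks like.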

2.2.2 Once you have your data in the correct format, use our file picker to upload it. Click "Browse". Navigate to your file and upload.

2.2.3 Click "Done".

Finalizing Upload Validation Population

2.3 A successful upload is confirmed with an "Actions done" verification providing a date/time stamp.

Step 3 – Upload Testing Population

There are 2 paths available to users.

3.1 You may use the UK Biobank data as the test population for validating the predictive power of the model.

3.2 You may provide your own custom population for comparison.

Path 3.1

3.1.1 To use the UK Biobank population dataset as your testing population, simply select the UK Biobank option.

3.1.2 Click "Confirm".

3.1.3 Click "Done".

Path 3.2

3.2.1 As with the previous uploads, your data must match the expected format for your model to run successfully.

A VCF must be tab-delimited as previously described.

Null values are not acceptable.

3.2.2 Click "Done".

Finalizing Upload Testing Population

3.3 A successful upload is confirmed with an "Actions done" verification providing a date/time stamp.

Step 4 – Algorithm Selection

4.1 Use the checkbox to choose from the available algorithm options:

4.2 Click "Done".

Step 5 – Run the Model

You can return to any of the previous steps at any time to verify or update your choices.

Once you have finalized all your selections, click "Run".

Step 6 – Download the Report

The processing power required to run the analysis is substantial. The main factors that influence your run time are the algorithm selected and your population size. Reports are typically available within 3–5 days, and you will receive an email notification when yours is ready to download.

Troubleshooting?

If you need assistance, please reach out.
