Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raremetal filter out all the variants #31

Open
stefanucci-luca opened this issue Apr 8, 2021 · 17 comments
Open

Raremetal filter out all the variants #31

stefanucci-luca opened this issue Apr 8, 2021 · 17 comments

Comments

@stefanucci-luca
Copy link

Dear raremetal developers,

I really appreciate the work you are doing to develop and maintain raremetal. I am reaching out because I need some help to understand raremetal behaviour.

Recently, I am trying to run rare metal on a set of variants (~50) on 3 genes. The cohort has about 130,000 individuals. I am using the commands I pasted below. The single variants association runs ok, and I get the score and cov files. When I run raremetal burden and skat (but I also tried the other weighted approaches), all the variants are filtered out; see log file below.

Do you have any suggestion on how to fix this behaviour?

Any suggestion is much appreciated,

Best,

Luca

PED

1000150 1000150 0 0 0 0 1 30 67 1
1000184 1000184 0 0 0 0 1 39 42 1
1000200 1000200 0 0 0 0 0 24 42 0
1000261 1000261 0 0 0 0 1 25 49 2

DAT

T Trait
C SEX
C BMI
C AGE
C SMOKE

VCF

chr3    93874276        chr3:93874276:G:A       G       A       50      .       AF=2e-06;AQ=50;AC=1;AN=262044   GT:DP:AD:GQ:PL:RNC      0/0:16:16 .........
chr3    93874282        chr3:93874282:G:A       G       A       47      .       AF=1.2e-05;AQ=47;AC=4;AN=262044 GT:DP:AD:GQ:PL:RNC      0/0:16:16 ..........

Commands

raremetalworker --ped VTE.ped --dat ${out_dir}/${gn_name}_variopath_burden.dat --vcf ${out_dir}/${gn_name}_variopath_variants.vcf.gz \
		--traitName Trait --inverseNormal --makeResiduals --prefix ${out_dir}/${gn_name}_rare_metalworker \
		--dominant --geneMap ${reflat}

and

raremetal --summaryFiles ${out_dir}/score_file --covFiles ${out_dir}/cov_file \
--geneMap ${reflat} --groupFile ${group} \
--SKAT  --burden --longOutput --altMAF --tabulateHits --prefix ${out_dir}/burden_test_cumulative 

Raremetalworker output score

##ProgramName=RareMetalWorker
##Version=4.15.1
##Samples=131022
##AnalyzedSamples=130458
##Families=131022
##AnalyzedFamilies=130458
##Founders=131022
##MakeResiduals=True
##AnalyzedFounders=130458
##Covariates=SEX,BMI,AGE,SMOKE
##CovariateSummaries	min	25th	median	75th	max	mean	variance
##SEX	0	0	0	1	1	0.455488	0.248021
##BMI	12	24	26	29	68	26.8007	22.4515
##AGE	38	50	58	63	72	56.5656	64.4416
##SMOKE	0	0	0	1	2	0.546168	0.436638
##InverseNormal=ON
##TraitSummaries	min	25th	median	75th	max	mean	variance
##Trait	0	0	0	0	1	0.0440832	0.0421402
## - NullModelEstimates
## - Name	BetaHat	SE(BetaHat)
## - Intercept	-7.04406e-08	0.00276861
##AnalyzedTrait	-4.47433	-0.674924	0.000758956	0.674972	4.47433	-7.04406e-08	0.999996
##Sigma_e2_Hat	0.999988
#CHROM	POS	REF	ALT	N_INFORMATIVE	FOUNDER_AF	ALL_AF	INFORMATIVE_ALT_AC	CALL_RATE	HWE_PVALUE	N_REF	N_HET	N_ALT	U_STAT	SQRT_V_STAT	ALT_EFFSIZE	PVALUE
chr3	93874276	G	A	130458	0.0000038	0.0000038	1	1	1	130457	1	0	0.0556638	10.0556635	0.95561

Raremetalworker output cov

##ProgramName=RareMetalWorker
##Version=4.15.1
#CHROM	CURRENT_POS	MARKERS_IN_WINDOW	COV_MATRICES
chr3	93874276	93874276,93874282,93874387,93877004,93877074,93877089,93877159,93879213,93879279,93879306,93879315,93884755,93884868,93884869,93884886,93884889,93884899,93884905,93886407,93892993,93893024,93893025,93893082,93893108,93893139,93896594,93896595,93898500,93900875,93905828,93905889,93906016,93906059,93906104,93910618,93910664,93910672,93910681,93910696,93910697,93910706,93927245,93927251,93927284,93927336,93927362,93927363,93973667,93973753,	7.66533e-06,-2.3503e-10,-4.11303e-10,-2.9379e-10,-3.4668e-09,-4.40692e-09,-9.40265e-10,-1.35143e-09,-1.17518e-10,-1.17517e-10,-5.87589e-11,-5.8758e-11,-2.35036e-10,-1.17519e-10,-7.05101e-10,-2.87934e-09,-1.5865e-09,-1.17519e-10,-1.17515e-10,-8.8137e-10,-5.87585e-11,-5.87594e-11,-5.87585e-11,-4.70075e-10,-1.76294e-10,-2.93813e-10,-4.70107e-10,-5.87589e-11,-5.8758e-11,-6.46378e-10,-8.81498e-10,-1.17515e-10,-8.1673e-09,-5.87576e-11,-5.87765e-11,-2.93801e-10,-2.76169e-09,-1.28687e-08,-1.17539e-10,-5.87639e-11,-5.87657e-11,-4.17179e-09,-1.05765e-09,-3.52564e-10,-2.23279e-09,-2.3503e-10,-5.87576e-11,-3.93685e-09,-5.87576e-11,

Group file

SERPINC1	chr1:173903903:G:T chr1:173903969:G:T chr1:173903973:G:C chr1:173903996:T:C chr1:173904038:C:A chr1:173904044:C:T chr1:173909648:G:A chr1:173909665:A:T chr1:173909791:G:T chr1:173909819:C:G chr1:173909858:T:C chr1:173909900:C:T chr1:173910797:T:C chr1:173910831:G:A chr1:173910861:T:C chr1:173911851:T:C chr1:173911852:G:A chr1:173911918:A:G chr1:173911941:C:T chr1:173911942:G:A chr1:173911974:T:G chr1:173912015:C:A chr1:173914611:G:A chr1:173914659:G:C chr1:173914668:A:G chr1:173914696:G:A chr1:173914725:C:T chr1:173914726:G:A chr1:173914743:G:A chr1:173914788:G:A chr1:173914795:G:A chr1:173914818:G:T chr1:173914872:A:T 

log output raremetal

The following parameters are in effect:

List of Studies:
============================
--summaryFiles [./test_dir/score_file]
--covFiles [./test_dir/cov_file]

Grouping Methods:
============================
--groupFile [./test_dir/variant_group_file.txt]
--annotatedVcf []
--annotation []
--writeVcf [OFF]

QC Options:
============================
--hwe [1e-05]
--callRate [0.95]

Association Methods:
============================
--burden [true]
--MB [false]
--SKAT [true]
--VT [false]
--condition []

Other Options:
============================
--tabix [false]
--correctGC [false]
--prefix [./test_dir/burden_test_cumulative]
--maf [0.05]
--longOutput [true]
--tabulateHits [true]
--dosage [OFF]
--hitsCutoff [1e-06]
--altMAF [ON]
Analysis started at: Thu Apr  8 12:13:51 2021

Warning: variant 2:127426090 from study ./test_dir/PROC_rare_metalworker.Trait.singlevar.score.txt.gz is skipped because of duplicate records in the same study.

Warning: 1 variants are skipped from analysis due to duplicate records in study ./test_dir/PROC_rare_metalworker.Trait.singlevar.score.txt.gz. Please check log for details.

Warning: variant chr1:173903903:G:T is excluded from group SERPINC1 for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173903969:G:T is excluded from group SERPINC1 for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173903973:G:C is excluded from group SERPINC1 for the following reasons:monomorphic or failed QC.
...
Warning: variant chr1:173914668:A:G is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914696:G:A is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914725:C:T is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914726:G:A is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914743:G:A is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914788:G:A is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914795:G:A is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914818:G:T is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914872:A:T is excluded from group Triplet for the following reasons:monomorphic or failed QC.

Checking if all groups are analyzed...
Warning: the following groups has no qualifed variants to group and are skipped:
	SERPINC1, PROC, PROS1, Triplet, 
	done!
@jonathonl
Copy link
Collaborator

Is it possible that you are mixing b37 and b38 datasets when meta-analyzing? Or mixing chromosome names (chr1 vs 1) between datasets?

@stefanucci-luca
Copy link
Author

Hi Jonathon,

Thanks for your quick reply.

All my data are in b38 and use chr* format for chromosomes.
Is there any part of the script that sources the chromosome information somewhere else? I am not using the --geneMap option and the standard path (i.e. ../reflath19) won't work.

If this is the problem won't also reflect in the single variant association?

Thanks

Luca

@jonathonl
Copy link
Collaborator

jonathonl commented Apr 8, 2021

So to clarify, when you say that "single variant association" works and gene-level does not, do you mean that raremetalworker works and raremetal does not? Or are you saying that running raremetal with --useExact works and running raremetal with --burden, --SKAT, etc. does not? I'm assuming you mean the former.

My earlier suspicion was that one of the studies in ./test_dir/score_file and ./test_dir/cov_file was in b38 and another was in b37, which would prevent raremetal from finding overlapping variants between the two studies.

What are the contents of ./test_dir/score_file? Can you verify that there are overlapping variants between the files in ./test_dir/score_file and that those variants overlap with the those in your group file?

@stefanucci-luca
Copy link
Author

stefanucci-luca commented Apr 8, 2021

Yes, that raremetalworker works and raremetal does not.

score file has the path to the files

./test_dir/SERPINC1_rare_metalworker.Trait.singlevar.score.txt.gz
./test_dir/PROC_rare_metalworker.Trait.singlevar.score.txt.gz
./test_dir/PROS1_rare_metalworker.Trait.singlevar.score.txt.gz

similarly cov_file

./test_dir/SERPINC1_rare_metalworker.Trait.singlevar.cov.txt.gz
./test_dir/PROC_rare_metalworker.Trait.singlevar.cov.txt.gz
./test_dir/PROS1_rare_metalworker.Trait.singlevar.cov.txt.gz

The variants for /test_dir/score_file overlap with the one in group file.

for instance, the variant in ./test_dir/SERPINC1_rare_metalworker.Trait.singlevar.score.txt.gz are:

#CHROM	POS	REF	ALT
chr1	173903903	G	T
chr1	173903969	G	T
chr1	173903973	G	C
chr1	173903996	T	C
chr1	173904038	C	A
chr1	173904044	C	T
chr1	173909648	G	A
chr1	173909665	A	T
chr1	173909791	G	T
chr1	173909819	C	G
chr1	173909858	T	C
chr1	173909900	C	T
chr1	173910797	T	C
chr1	173910831	G	A
chr1	173910861	T	C
chr1	173911851	T	C
chr1	173911852	G	A
chr1	173911918	A	G
chr1	173911941	C	T
chr1	173911942	G	A
chr1	173911974	T	G
chr1	173912015	C	A
chr1	173914611	G	A
chr1	173914659	G	C
chr1	173914668	A	G
chr1	173914696	G	A
chr1	173914725	C	T
chr1	173914726	G	A
chr1	173914743	G	A
chr1	173914788	G	A
chr1	173914795	G	A
chr1	173914818	G	T
chr1	173914872	A	T

the variants in the group file for that gene are:

SERPINC1	chr1:173903903:G:T chr1:173903969:G:T chr1:173903973:G:C chr1:173903996:T:C chr1:173904038:C:A chr1:173904044:C:T chr1:173909648:G:A chr1:173909665:A:T chr1:173909791:G:T chr1:173909819:C:G chr1:173909858:T:C chr1:173909900:C:T chr1:173910797:T:C chr1:173910831:G:A chr1:173910861:T:C chr1:173911851:T:C chr1:173911852:G:A chr1:173911918:A:G chr1:173911941:C:T chr1:173911942:G:A chr1:173911974:T:G chr1:173912015:C:A chr1:173914611:G:A chr1:173914659:G:C chr1:173914668:A:G chr1:173914696:G:A chr1:173914725:C:T chr1:173914726:G:A chr1:173914743:G:A chr1:173914788:G:A chr1:173914795:G:A chr1:173914818:G:T chr1:173914872:A:T 

@jonathonl
Copy link
Collaborator

I see. The names of those score and cov files suggest that you have run raremetalworker on three different genes (presumably using the same samples/individuals). Raremetal expects that you run raremetalworker on two or more different sets of individuals for the same region of the genome (or entire genome). It then meta-analyzes the results of each sample set.

Does this make sense? Or am I misunderstanding what's going on?

@stefanucci-luca
Copy link
Author

Thanks for pointing that out! I changed the workflow. Now, raremetalworker is performing one association with all my variants of interest producing only one score and cov file. However, raremetal still have the same problem. Any further suggestion?

@jonathonl
Copy link
Collaborator

I'm not sure that Raremetal will run for a single study, i.e., you would need more than one score file and more than one cov file. I suspect that Raremetal is not the tool you should be using. Raremetal is meant for meta-analysis of multiple studies that either have study-specific covariates and/or consent constraints that prevent the studies from being analyzed together. I'm not sure exactly what you are trying to achieve (polygenic analysis, gene-wise analysis accounting for hidden relatedness, etc.), but there are likely other tools better suited for your needs.

@stefanucci-luca
Copy link
Author

stefanucci-luca commented Apr 8, 2021

I'd like to perform a burden test. Collapse all the effect sizes of variants that are within the same gene and have a 'cumulative' effect size. Isn't that possible with raremetal?

The problem is that is filtering out all the variants. What are the reasons for raremetal to filter out variants?

@jonathonl
Copy link
Collaborator

Raremetal will do burden tests if you have multiple datasets (sets of individuals), but it sounds like you only have one. You should look into EPACTS (https://genome.sph.umich.edu/wiki/EPACTS#Gene-wise_or_group-wise_tests) if your sample size is modest or if your sample size is large but you don't want to account for hidden relatedness. Otherwise, SAIGE-GENE (https://github.com/weizhouUMICH/SAIGE) will do burden tests for sample sizes in the hundreds of thousands and account for hidden relatedness.

@stefanucci-luca
Copy link
Author

I'll look into these other softwares, thank you. Do you know why variants are filtered out then?

@welchr
Copy link
Member

welchr commented Apr 8, 2021

What happens if you run without --dominant?

@stefanucci-luca
Copy link
Author

stefanucci-luca commented Apr 8, 2021

Dear Ryan,

Thanks for your suggestion. without --dominant in raremetalworker, raremetal return this error:

FATAL ERROR - 
No p-value available for calculating Genomic Control. Check if markers were filtered!

The log file is similar, all the variants are filtered out and it ends with:

Checking if all groups are analyzed...
Warning: the following groups has no qualifed variants to group and are skipped:
	SERPINC1, PROC, PROS1, Triplet,

@welchr
Copy link
Member

welchr commented Apr 9, 2021

For the variants that are filtered out when running the burden test (such as chr1:173903903:G:T), do they have a non-missing score stat and p-value in the single variant file produced by raremetalworker?

Also, as Jonathon mentions, there are other options for single study rare variant tests, such as SAIGE-GENE, GMMAT, and GENESIS. I think SAIGE-GENE might be the only one that supports very unbalanced case/control ratios (?)

@stefanucci-luca
Copy link
Author

Hi Ryan,

Yes, majority of the variants get score and p-value in raremetalworker, then they are filtered anyway by raremetal.
If possible, I'd prefer to fix this problem because for a collaboration and other analysis we are using raremetal and I'd like to keep it consistent.

For instance:

output raremetalworker score

chr3	93927245	C	T	131022	0.000270947	0.000270947	71	1	1	130951	71	0	21.1172	19.7996	0.0538671	0.286177
chr3	93927251	G	A	131022	6.86918e-05	6.86918e-05	18	0.999985	1	131002	18	0	2.29883	9.97128	0.0231209	0.817668
chr3	93927284	T	G	131022	2.28981e-05	2.28981e-05	6	0.999947	1	131009	6	0	-3.03879	5.75719	-0.0916813	0.59762
chr3	93927329	C	T	131022	0	0	0	1	1	131022	0	0	NA	NA	NA	NA
chr3	93927336	T	C	131022	0.000145014	0.000145014	38	1	1	130984	38	0	37.8297	14.4868	0.180254	0.00901951
chr3	93927362	C	T	131022	1.52646e-05	1.52646e-05	4	1	1	131018	4	0	9.38911	4.70076	0.424902	0.045786
chr3	93927363	G	A	131022	3.81615e-06	3.81615e-06	1	1	1	131021	1	0	10.9084	2.35041	1.97459	3.46589e-06
chr3	93927365	C	A	131022	3.81615e-06	3.81615e-06	1	1	1	131021	1	0	-0.506439	2.35041	-0.0916729	0.829402
chr3	93973667	T	C	131022	0.000255688	0.000255688	67	0.999977	1	130952	67	0	57.3926	19.2341	0.155137	0.00284595
chr3	93973753	G	C	131022	3.81615e-06	3.81615e-06	1	1	1	131021	1	0	-0.506439	2.35041	-0.0916729	0.829402

Raremetal log file

Warning: variant chr3:93927245:C:T is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927251:G:A is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927284:T:G is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927329:C:T is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927336:T:C is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927362:C:T is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927363:G:A is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927365:C:A is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93973667:T:C is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93973753:G:C is excluded from group PROS1 for the following reasons:monomorphic or failed QC.

@welchr
Copy link
Member

welchr commented Apr 11, 2021

Huh, I'm at a loss there. Is your data confidential? Could you maybe send me the data for one of those genes? That way we can run it in the debugger and see what's going on.

@stefanucci-luca
Copy link
Author

Hi Ryan,

It looks like that removing chr from the #CHROM field solved the problem. Now the majority of variants is included in the analysis.

I have a few another question, possibly, easier to solve now that we know where is the problem:

When I use --binary option I get this error:

ERROR: 3rd phenotype 2.0133 exists. Is that binary trait?
MESSAGE: This script is to calcualte odds ratio of variants based on parameter estimates from LMM generated by raremetalworker4.13.4 or rvtests.
	--Shuang Feng, September 10, 2014

however, the 6th columns of the ped file is:

cat VTE.ped | cut -d' ' -f 6 | sort | uniq
0
1

I can use calculateOR.pl, but then would these values be used in the burden test? Can I swap the columns to let raremetal use the OR rather than effect sizes?

Best,

Luca

@welchr
Copy link
Member

welchr commented Apr 14, 2021

Does 2.0133 appear in your phenotype file anywhere if you grep for it? Does the command line still include --inverseNormal or --makeResiduals? Could be your binary phenotype is being transformed.

I would be cautious using --binary as it doesn't seem to be documented anywhere on the wiki, and the code contains:

* Undocumented CLI parameter: whether to treat this trait as binary. Cannot be used with --makeResiduals flag
*/
static bool binary;

So my guess is the original authors didn't intend for it to be used.

My understanding of calculateOR.pl is that it takes the single variant effect size estimates previously generated by assuming your trait is quantitative (linear model), and converts them roughly to what they would have been if you had fit a logistic model instead. From what I can tell, it appears to be doing the following:

image

But unfortunately there is no written explanation that I can find on the wiki or in the script.

I believe the burden test only looks at the score statistic, covariance, and possibly allele frequencies (weights). So it will likely not take into account your updated effect size estimates. Actually the effects are just calculated from the score stat and variance, and written out to the file.

At the end of the day, for a binary trait, probably best to use software designed (and tested) to correctly handle them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants