Last updated: 2017-07-18
Code version: 49915de
The original IUPAC code is gotten in the way that one finds the bases that has larger frequency than the background probability(default \(0.25\)), then corresponds them to the IUPAC code table. There are some problems with this method. Firstly, if the frequencies of \(A,C,G,T\) at a position are \((0.28,0.24,0.24,0.24)\), then the code would be A for the position. However, it seems that an N is more appropriate which means any base. Secondly, if the frequencies are \((0.35,0.35,0.24,0.06)\), then the code would be \(M(A\) or \(C)\) for the position. But a depletion in \(T\) may be a better representation. Also, to use \((AC)\) representing \(A\) or \(C\) is more straightforward than introducing extra letters.
We studied three possible methods that could improve the IUPAC nucleotide code so that the code could give more information about the enrichment and depletion of bases.
We first divide all possible frequencies into the following categories:
Then define five vectors whose length is 4 and the elements sum to 1. The vectors correspond to the categories above.
We sort the base frequencies at a position descendingly and then get the distance between the sorted frequencies and the five vectors defined above. Then we could know the base frequency vector belongs to the category that minimizes the distance. So according to the distances, we could determine which category the position belongs to and then assign the corresponding code.
The problem then could be formed in a decision theory problem. The action space is \(S=\{V_1,V_2,V_3,V_4,V_5\}\). The action to be taken is \(s\in S\). The loss function is the distance function, e.g. Jensen–Shannon divergence, \(L(p,s)=0.5D_{KL}(p||m)+0.5D_{KL}(s||m)\), where \(m=(p+s)/2\) and \(D_{KL}(p||q)=\Sigma_ip_i\log(p_i/q_i)\). A pseudocount \(w=0.001\) is added if any of the probability is 0 and then the probability vector is rescaled.
There are two types of distance function. The first type is the measure of the distance between two discrete distributions. For example, Kullback-Leibler divergence, JensenÄ-Shannon divergence, Bhattacharyya distance. The second type is the vector norm. For example, \(L^1\) norm ,\(L^2\) norm, infinity norm.
So for each different choice of \(V_1,V_2,V_3,V_4,V_5\) and distance function, we would have one code for the sequence. So in total, there are \(n_{v_1}*n_{v_2}*n_{v_3}*n_{v_4}*n_{v_5}*n_{dis}\) codes. To make the result robust to these choices, we take the mode of the code at the each position. And if there are two codes that happen more than \(40\%\) of the time, we report both. For example, \(A\) happens 3 times and \(At\) happens 4 times out of total 7 times, then the code is given by \([A-At]\)
The five categories are the same as above.
Define three kinds of vectors whose length is 3 and the elements sum to 1:
We sort the base frequencies at a position descendingly and divide it into two parts, the largest three and the smallest three(both are ordred descendingly), then get the distance between the each part(scaled, sum to 1) and the three kinds of vectors defined above. For example, \((1,0,0)\), \((0.5,0.5,0)\), \((1/3,1/3,1/3)\). For the \(N\) case, both of the two parts should have smallest diatance to \((1/3,1/3,1/3)\). And the for the “one enrichment” case, the largest three should have the smallest distance to \((1,0,0)\), etc. So according to the distances, we could determine which category the position belongs to and then assign the corresponding code. I take \(A\) and \(T\) as examples to illuastrate the results. If the largest three is closest to \((1,0,0)\) and the smallest three is closest to \((1/3,1/3,1/3)\), then the code should be \(A\). In the following table, I denote the case as “\((1,0,0),(0.5,0.5,0) - A\)”. Below are examples of different combinations:
Note that the cases 1,5,6,7 may not be straightforward to understand. Here are some brief explanations. Case 1: the largest one is much larger than the others. Case 5: the first three are larger than the last one. Case 6,7: the first two are larger than the last two. Further, \((0.9,0.05,0.05,0)\) would be classified as case 2 however we want it to be case 1. To avoid this, we require that the difference between the smallest and the second smallest should be at least 0.12.
Similarly, for each different choice of \(V_1,V_2,V_3\) and distance function, we would have one code for the sequence. So in total, there are \(n_{v_1}*n_{v_2}*n_{v_3}*n_{dis}\) codes. We also take the mode of the code at the each position as in method 1.
This method usus the height of the negative logo plot, in which the log height is determined by log odds. Still, we have five categories as above and below lists the way we determine the category of a probability vector:
Heuristicly, we chose \(a=0.25,b=0.12,c=0.14\)
In the implementation, for method 1, we chose \(V_1=(1,0,0,0),(0.85,0.05,0.05,0.05),(0.7,0.1,0.1,0.1)\), \(V_2=(0.5,0.5,0,0),(0.45,0.45,0.05,0.05)\), \(V_3=(1/3,1/3,1/3,0),(0.3,0.3,0.3,0.1)\), \(V_4=(0.8,0.1,0.1,0),(0.7,0.15,0.15,0),(0.6,0.2,0.2,0)\), \(V_5=(0.25,0.25,0.25,0.25)\);
for method 2, we chose \(V_1=(1,0,0),(0.9,0.05,0.05),(0.85,0.075,0.075),(0.8,0.1,0.1),(0.75,0.125,0.125),(0.7,0.15,0.15)\), \(V_2=(0.5,0.5,0),(0.475,0.475,0.05),(0.45,0.45,0.1)\), \(V_3=(1/3,1/3,1/3)\).
The distance functions are symmetric Kullback-Leibler divergence, JensenÄ-Shannon divergence, Bhattacharyya distance, \(L^1\) norm, \(L^2\) norm, infinity norm.
Then we apply the code to the depletion data we are interested in(mainly dimer ) and compare them with the IUPAC code from atSNP package.
Dimer from IRF family, mouse.
The IUPAC code is given firstly followed by method1, method2, method3 and the PWM.
for(i in c(11,12,13)){
print(GetIUPACSequence(DepMouse[[i]]$mat))
print(Getmotif1(DepMouse[[i]]$mat))
print(Getmotif2(DepMouse[[i]]$mat))
print(Getmotif3(DepMouse[[i]]$mat))
print(DepMouse[[i]]$mat)
print('-----------------------------------')
}
[1] "GGAAASYGAAASBRRRA"
[1] "GGAAA(Gt)[N/(CT)]GAAAta(GA)(AG)(AG)[N/A]"
[1] "GGAAAtNGAAAta(GA)[(AG)/t](AG)N"
[1] "GGAAA(Gt)NGAAAtactAA"
1 2 3 4 5 6
A 0.14609981 0.22921996 0.83184109 0.91496124 0.85262112 0.16687984
C 0.14609981 0.06297965 0.04219961 0.02141957 0.02141957 0.25000000
G 0.62404070 0.64482074 0.10453973 0.02141957 0.06297965 0.54092054
T 0.08375969 0.06297965 0.02141957 0.04219961 0.06297965 0.04219961
7 8 9 10 11 12
A 0.1460998 0.10453973 0.91496124 0.91496124 0.89418120 0.20843992
C 0.3539002 0.04219961 0.02141957 0.02141957 0.06297965 0.45780039
G 0.2084399 0.81106105 0.04219961 0.02141957 0.02141957 0.25000000
T 0.2915601 0.04219961 0.02141957 0.04219961 0.02141957 0.08375969
13 14 15 16 17
A 0.04219961 0.29156008 0.41624031 0.4785804 0.4785804
C 0.33312016 0.08375969 0.16687984 0.1045397 0.1668798
G 0.27078004 0.43702035 0.33312016 0.2915601 0.1668798
T 0.35390019 0.18765988 0.08375969 0.1253198 0.1876599
[1] "-----------------------------------"
[1] "AAAARAGRAAVTGARA"
[1] "(AG)AA(Ac)(AG)AG(AG)AAtTGAAA"
[1] "NAA[c/(Ac)](AG)AG(AG)AAtTGA[(AG)/A]A"
[1] "NAA(Ac)(AG)A(GA)(AG)AAtTGA(AG)A"
1 2 3 4 5 6
A 0.4549524 0.62451035 0.6317851 0.58310017 0.43872412 0.6424175
C 0.1897034 0.07162843 0.0581981 0.02182429 0.11247902 0.0581981
G 0.2355904 0.13822048 0.1796307 0.22160045 0.37381086 0.1941802
T 0.1197538 0.16564074 0.1303861 0.17347510 0.07498601 0.1052043
7 8 9 10 11 12
A 0.231113598 0.507554561 0.9955232233 0.995523223 0.372691662 0.24902071
C 0.001678791 0.001678791 0.0033575825 0.001119194 0.343033016 0.09513151
G 0.758813654 0.489087857 0.0005595971 0.001678791 0.283156128 0.09513151
T 0.008393956 0.001678791 0.0005595971 0.001678791 0.001119194 0.56071628
13 14 15 16
A 0.17123671 0.88864018 0.704532736 0.883603805
C 0.04924454 0.02518187 0.020145495 0.050923335
G 0.76273083 0.06939004 0.273642977 0.064353665
T 0.01678791 0.01678791 0.001678791 0.001119194
[1] "-----------------------------------"
[1] "GRGRAAVTGAAASYR"
[1] "G(AG)G(AG)AAtTGAAA[N/(CG)](TC)(GA)"
[1] "G(AG)G(AG)AAtTGAAAN(TC)(GA)"
[1] "GtG(AG)AAtTGAAANaN"
1 2 3 4 5 6
A 0.1704854 0.43961180 0.14601933 0.51300992 0.90446654 0.85553446
C 0.1704854 0.12155330 0.07262122 0.04815518 0.02368914 0.04815518
G 0.4885439 0.34174765 0.70873823 0.41514576 0.04815518 0.04815518
T 0.1704854 0.09708726 0.07262122 0.02368914 0.02368914 0.04815518
7 8 9 10 11 12
A 0.26834953 0.1215533 0.02368914 0.88000050 0.83106842 0.80660238
C 0.36621368 0.1215533 0.07262122 0.02368914 0.07262122 0.02368914
G 0.34174765 0.1215533 0.85553446 0.07262122 0.07262122 0.09708726
T 0.02368914 0.6353401 0.04815518 0.02368914 0.02368914 0.07262122
13 14 15
A 0.1704854 0.07262122 0.3417476
C 0.3417476 0.34174765 0.1460193
G 0.3417476 0.12155330 0.3906797
T 0.1460193 0.46407784 0.1215533
[1] "-----------------------------------"
Dimer from SOX family, mouse.
for(i in c(27,28)){
print(GetIUPACSequence(DepMouse[[i]]$mat))
print(Getmotif1(DepMouse[[i]]$mat))
print(Getmotif2(DepMouse[[i]]$mat))
print(Getmotif3(DepMouse[[i]]$mat))
print(DepMouse[[i]]$mat)
print('-----------------------------------')
}
[1] "CAMMAAWSHHCATTGTCS"
[1] "CA[(AC)/A](CA)AATNggCATTGTCN"
[1] "CA(AC)(CA)AA[c/T]NggCATTGTC[N/(CG)]"
[1] "CA(AC)(CA)A(Ac)(Tc)NggCATTGT(Ca)N"
1 2 3 4 5 6
A 0.14794692 0.75210114 0.62147320 0.37654582 0.80108662 0.70311567
C 0.65413019 0.16427541 0.31123185 0.58881622 0.16427541 0.01731898
G 0.09896144 0.06630446 0.01731898 0.01731898 0.01731898 0.06630446
T 0.09896144 0.01731898 0.04997597 0.01731898 0.01731898 0.21326089
7 8 9 10 11 12
A 0.26224637 0.2132609 0.2622464 0.26224637 0.08263295 0.94804305
C 0.01731898 0.2785749 0.3112318 0.36021732 0.67045868 0.01731898
G 0.11528994 0.2785749 0.1152899 0.01731898 0.04997597 0.01731898
T 0.60514471 0.2295894 0.3112318 0.36021732 0.19693240 0.01731898
13 14 15 16 17 18
A 0.01731898 0.01731898 0.01731898 0.01731898 0.01731898 0.1642754
C 0.01731898 0.01731898 0.01731898 0.01731898 0.67045868 0.3928743
G 0.01731898 0.01731898 0.94804305 0.01731898 0.21326089 0.2785749
T 0.94804305 0.94804305 0.01731898 0.94804305 0.09896144 0.1642754
[1] "-----------------------------------"
[1] "KCCWTTGTBYY"
[1] "aCC[(TA)/T]TTGTa(TC)(CT)"
[1] "aCCTTTGTa(TC)N"
[1] "a(Ca)(CT)(TA)TTGTa(TC)a"
1 2 3 4 5 6
A 0.05727837 0.007607848 0.009594669 0.269868209 0.001647385 0.001647385
C 0.22814497 0.683126961 0.758626156 0.007607848 0.025489236 0.003634206
G 0.45265573 0.186421731 0.001647385 0.003634206 0.003634206 0.001647385
T 0.26192093 0.122843461 0.230131791 0.718889737 0.969229173 0.993071024
7 8 9 10 11
A 0.009594669 0.003634206 0.003634206 0.05330473 0.1248303
C 0.069199296 0.003634206 0.371196076 0.40894567 0.3553015
G 0.917571829 0.001647385 0.257947284 0.08708068 0.2062899
T 0.003634206 0.991084204 0.367222434 0.45066891 0.3135783
[1] "-----------------------------------"
EFB1 family:
for (i in grep('EBF1',names(DepEncode))){
print(names(DepEncode[i]))
print(GetIUPACSequence(DepEncode[[i]]$mat))
print(Getmotif1(DepEncode[[i]]$mat))
print(Getmotif2(DepEncode[[i]]$mat))
print(Getmotif3(DepEncode[[i]]$mat))
print(DepEncode[[i]]$mat)
print('------------------------------------------')
}
[1] "EBF1_disc1"
[1] "TCCCHDGGGA"
[1] "TCCCgcGGGA"
[1] "TCCCgcGGGA"
[1] "TCCCgcGGGA"
1 2 3 4 5
A 0.0009960159 0.0009960159 0.0009960159 0.0009960159 0.29774402
C 0.1537768924 0.9970119522 0.8236643426 0.9970119522 0.35650598
G 0.0009960159 0.0009960159 0.0009960159 0.0009960159 0.02156275
T 0.8442310757 0.0009960159 0.1743436255 0.0009960159 0.32418725
6 7 8 9 10
A 0.33593924 0.0009960159 0.1772818725 0.0009960159 0.8824262948
C 0.02156275 0.0009960159 0.0009960159 0.0009960159 0.0009960159
G 0.37413446 0.9970119522 0.8207260956 0.9970119522 0.1155816733
T 0.26836355 0.0009960159 0.0009960159 0.0009960159 0.0009960159
[1] "------------------------------------------"
[1] "EBF1_known1"
[1] "VDMDABTCCCYRRRGRRBDKRB"
[1] "[N/t][N/c](Ct)c(Ac)aTCCC(CT)[(AG)/A]G(GA)G(Ac)(GA)a[N/c](TG)(Gt)[N/a]"
[1] "[N/t][N/c]tc(Ac)aTCCC(CT)(AG)G(GA)Gc(GA)a[N/c](TG)t[N/a]"
[1] "NN(Ct)c(Ac)aTCCC(CT)A(GA)(GA)G(Ac)(GA)aN(TG)(Gt)N"
1 2 3 4 5 6
A 0.250000 0.374502 0.2500000000 0.3745019920 0.7480079681 0.0009960159
C 0.374502 0.125498 0.6235059761 0.0009960159 0.0009960159 0.3745019920
G 0.250000 0.250000 0.1254980080 0.2500000000 0.1254980080 0.2500000000
T 0.125498 0.250000 0.0009960159 0.3745019920 0.1254980080 0.3745019920
7 8 9 10 11
A 0.0009960159 0.0009960159 0.0009960159 0.0009960159 0.1254980080
C 0.0009960159 0.9970119522 0.9970119522 0.8725099602 0.4990039841
G 0.0009960159 0.0009960159 0.0009960159 0.0009960159 0.0009960159
T 0.9970119522 0.0009960159 0.0009960159 0.1254980080 0.3745019920
12 13 14 15 16
A 0.499004 0.2500000000 0.3745019920 0.0009960159 0.6235059761
C 0.125498 0.0009960159 0.0009960159 0.0009960159 0.0009960159
G 0.250000 0.7480079681 0.6235059761 0.9970119522 0.2500000000
T 0.125498 0.0009960159 0.0009960159 0.0009960159 0.1254980080
17 18 19 20 21 22
A 0.3745019920 0.0009960159 0.250000 0.0009960159 0.2500000000 0.125498
C 0.1254980080 0.2500000000 0.125498 0.1254980080 0.1254980080 0.250000
G 0.4990039841 0.3745019920 0.250000 0.3745019920 0.6235059761 0.250000
T 0.0009960159 0.3745019920 0.374502 0.4990039841 0.0009960159 0.374502
[1] "------------------------------------------"
[1] "EBF1_known2"
[1] "KTCCCYWGRGA"
[1] "aTCCC(TC)(TA)G(GA)G(Ac)"
[1] "aTCCC(TC)(TA)G(GA)G[c/(Ac)]"
[1] "aTCCC(TC)(TA)G(GA)G(Ac)"
1 2 3 4 5
A 0.0009960159 0.0009960159 0.0009960159 0.0009960159 0.0009960159
C 0.2001992032 0.0009960159 0.9970119522 0.9970119522 0.9970119522
G 0.3994023904 0.0009960159 0.0009960159 0.0009960159 0.0009960159
T 0.3994023904 0.9970119522 0.0009960159 0.0009960159 0.0009960159
6 7 8 9 10
A 0.0009960159 0.3994023904 0.2001992032 0.3994023904 0.0009960159
C 0.3994023904 0.0009960159 0.0009960159 0.0009960159 0.0009960159
G 0.0009960159 0.0009960159 0.7978087649 0.5986055777 0.9970119522
T 0.5986055777 0.5986055777 0.0009960159 0.0009960159 0.0009960159
11
A 0.5986055777
C 0.0009960159
G 0.2001992032
T 0.2001992032
[1] "------------------------------------------"
[1] "EBF1_known4"
[1] "ATTCCCHWGGGAMT"
[1] "ANTCCCg(AT)GGGA(AC)T"
[1] "ANTCCCg[(AT)/c]GGGA(AC)g"
[1] "ANTCCCg(Ac)GGGA(AC)(Tg)"
1 2 3 4 5 6
A 0.71704382 0.1666763 0.0213227092 0.0009960149 0.002900398 0.0009960159
C 0.07518127 0.1789253 0.1284511952 0.9944382470 0.981142430 0.9931563745
G 0.10808068 0.2163157 0.0009960159 0.0029262948 0.003534861 0.0009960159
T 0.09969422 0.4380827 0.8492300797 0.0016394422 0.012422311 0.0048515936
7 8 9 10 11 12
A 0.36330080 0.50223008 0.0143097610 0.0652360558 0.0047978040 0.92241434
C 0.25306175 0.05066733 0.0047998008 0.0057988048 0.0009960159 0.01114143
G 0.04161052 0.13517430 0.9798944223 0.9279691235 0.9792709163 0.05112550
T 0.34202689 0.31192829 0.0009960159 0.0009960159 0.0149352590 0.01531873
13 14
A 0.39062948 0.16226793
C 0.37901793 0.23580777
G 0.09453386 0.05324795
T 0.13581972 0.54867629
[1] "------------------------------------------"
sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 15063)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] Logolas_1.1.2 TFBSTools_1.14.0 JASPAR2014_1.12.0
[4] Biostrings_2.43.8 XVector_0.15.2 IRanges_2.9.19
[7] S4Vectors_0.13.17 BiocGenerics_0.22.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 lattice_0.20-35
[3] GO.db_3.4.1 png_0.1-7
[5] Rsamtools_1.27.16 gtools_3.5.0
[7] rprojroot_1.2 digest_0.6.12
[9] R6_2.2.0 GenomeInfoDb_1.12.0
[11] plyr_1.8.4 backports_1.0.5
[13] RSQLite_1.1-2 evaluate_0.10
[15] httr_1.2.1 ggplot2_2.2.1
[17] zlibbioc_1.21.0 lazyeval_0.2.0
[19] annotate_1.54.0 R.utils_2.5.0
[21] R.oo_1.21.0 Matrix_1.2-9
[23] rmarkdown_1.6 splines_3.4.0
[25] BiocParallel_1.10.0 readr_1.1.1
[27] stringr_1.2.0 CNEr_1.12.0
[29] RCurl_1.95-4.8 munsell_0.4.3
[31] DelayedArray_0.2.0 compiler_3.4.0
[33] rtracklayer_1.35.12 seqLogo_1.42.0
[35] DirichletMultinomial_1.18.0 htmltools_0.3.5
[37] KEGGREST_1.16.0 SummarizedExperiment_1.6.0
[39] tibble_1.3.0 GenomeInfoDbData_0.99.0
[41] matrixStats_0.52.2 XML_3.98-1.6
[43] TFMPvalue_0.0.6 GenomicAlignments_1.11.12
[45] bitops_1.0-6 R.methodsS3_1.7.1
[47] grid_3.4.0 xtable_1.8-2
[49] gtable_0.2.0 DBI_0.6-1
[51] git2r_0.18.0 magrittr_1.5
[53] scales_0.4.1 stringi_1.1.5
[55] reshape2_1.4.2 tools_3.4.0
[57] BSgenome_1.44.0 Biobase_2.35.1
[59] poweRlaw_0.70.0 hms_0.3
[61] yaml_2.1.14 AnnotationDbi_1.38.1
[63] colorspace_1.3-2 GenomicRanges_1.27.23
[65] caTools_1.17.1 memoise_1.1.0
[67] VGAM_1.0-3 knitr_1.15.1
This R Markdown site was created with workflowr