title: "Local Score Package" author: "Sébastien Déjean - Sabine Mercier - Sebastian Simon - David Robelin" date: "2023-11-02" output: pdf_document: toc: yes html_document: df_print: paged toc: yes vignette: > % % % ---

library(localScore)

Introduction

This package provides functionalities for two main tasks: 1- Calculating the local Score of a given score sequence or, of a given component sequence and a given scoring scheme. 2- Calculating the statistical relevance (\(p\)-value) of a given local Score, associated to a given sequence length and a given distribution for the model.

This first version of the package deals with a model of independent and identically distributed (I.I.D.) sequences.

Local Score computation methods

First defined in Karlin and Altschul 1990, it represents the value of the highest scoring segment in a sequence of scores. It corresponds to the highest cumulated sum of all subsequences (independent of length). For the score to be relevant, the expectation of the sequence should be negative. Thus, for a sequence of interest, the possible score of sequence components can be positive or negative.

A first example: function ``localScoreC()''

Let us assume a score function taking its values in [-2, -1, 0, 1, 2]. A sample score sequence of length 100 could be

library(localScore)
help(localScore)
ls(pos = 2)
#>  [1] "Aeso"                               "CharSequence2ScoreSequence"        
#>  [3] "CharSequences2ScoreSequences"       "HydroScore"                        
#>  [5] "LongSeq"                            "MidSeq"                            
#>  [7] "MySeqList"                          "RealScores2IntegerScores"          
#>  [9] "SJSyndrome.data"                    "Seq1093"                           
#> [11] "Seq219"                             "Seq31"                             
#> [13] "SeqListSCOPe"                       "ShortSeq"                          
#> [15] "aeso.data"                          "automatic_analysis"                
#> [17] "daudin"                             "dico"                              
#> [19] "exact_mc"                           "karlin"                            
#> [21] "karlinMonteCarlo"                   "karlinMonteCarlo_double"           
#> [23] "lindley"                            "loadMatrixFromFile"                
#> [25] "loadScoreFromFile"                  "localScoreC"                       
#> [27] "localScoreC_double"                 "maxPartialSumd"                    
#> [29] "mcc"                                "monteCarlo"                        
#> [31] "monteCarlo_double"                  "scoreDictionnary2probabilityVector"
#> [33] "scoreSequences2probabilityVector"   "sequences2transmatrix"             
#> [35] "stationary_distribution"            "transmatrix2sequence"
sequence <- sample(-2:2, 100, replace = TRUE, prob = c(0.5, 0.3, 0.05, 0.1, 0.05))
sequence
#>   [1] -1 -2 -2 -1 -1 -2  0 -2 -1 -1 -1 -1 -2  1  1  0 -2 -2 -2 -1 -2 -2 -2 -2 -2
#>  [26] -2 -2  1  1  2 -2 -1 -1 -1 -2 -2 -1 -1  1 -2  1 -2 -2  1 -1 -2 -2 -1 -1  2
#>  [51] -2  1 -1 -1 -1 -1 -1 -1 -1 -2 -1 -1 -1 -2  1  1 -2  1 -2 -1 -2 -2 -2 -1 -1
#>  [76] -2 -2  1 -2 -1  1  0 -2  1 -2 -1 -2 -2 -1 -2 -2  0 -1  1 -1 -2 -1  2  1 -1
localScoreC(sequence)
#> $localScore
#> value begin   end 
#>     4    28    30 
#> 
#> $suboptimalSegmentScores
#>       value begin end
#>  [1,]     2    14  15
#>  [2,]     4    28  30
#>  [3,]     1    39  39
#>  [4,]     1    41  41
#>  [5,]     1    44  44
#>  [6,]     2    50  50
#>  [7,]     1    52  52
#>  [8,]     2    65  66
#>  [9,]     1    68  68
#> [10,]     1    78  78
#> [11,]     1    81  81
#> [12,]     1    84  84
#> [13,]     1    94  94
#> [14,]     3    98  99
#> 
#> $RecordTime
#>  [1] 14 28 39 41 44 50 52 65 68 78 81 84 94 98

The result is a maximum score and for which subsequence it has been found: the starting position "[" and the end of it ("]"). It also yields all other subsequences with a score equal or less and its position in the $suboptimalSegmentScores matrix. "Stopping Times" are local minima in the cumulated sum of the sequence and correspond to the beginning of excursions (potential segments of interest).

Another example with missing score values.

library(localScore)
sequence <- sample(c(-3, 2, 0, 1, 5), 100, replace = TRUE, prob = c(0.5, 0.3, 0.05, 0.1, 0.05))
head(sequence)
#> [1]  2 -3  1  5 -3  2
localScoreC(sequence)
#> $localScore
#> value begin   end 
#>    13    93   100 
#> 
#> $suboptimalSegmentScores
#>       value begin end
#>  [1,]     2     1   1
#>  [2,]    10     3  17
#>  [3,]     7    31  36
#>  [4,]     2    43  43
#>  [5,]     2    48  48
#>  [6,]     5    50  50
#>  [7,]     6    54  57
#>  [8,]     6    60  62
#>  [9,]     9    69  76
#> [10,]     1    84  84
#> [11,]     7    86  87
#> [12,]    13    93 100
#> 
#> $RecordTime
#>  [1]  1  3 31 43 48 50 54 60 69 84 86 93

Example with real scores: function ``localScoreC_double()''

Real scores can also be considered with a dedicated function 'localScoreC_double'.

score_reels <- c(-1, -0.5, 0, 0.5, 1)
proba_score_reels <- c(0.2, 0.3, 0.1, 0.2, 0.2)
sample_from_model <- function(score.sple, proba.sple, length.sple) {
  sample(score.sple,
    size = length.sple, prob = proba.sple, replace = TRUE
  )
}
seq.essai <- sample_from_model(score.sple = score_reels, proba.sple = proba_score_reels, length.sple = 10)
localScoreC_double(seq.essai)
#> $localScore
#> value begin   end 
#>   1.5   4.0   5.0 
#> 
#> $suboptimalSegmentScores
#>      value begin end
#> [1,]   1.5     4   5
#> [2,]   1.0     8   8
#> 
#> $RecordTime
#> [1] 4 8

Example using a scoring function

# Loading a fasta protein
data(Seq219)
Seq219
#> [1] "MSGLSGPPARRGPFPLALLLLFLLGPRLVLAISFHLPINSRKCLREEIHKDLLVTGAYEISDQSGGAGGLRSHLKITDSAGHILYSKEDATKGKFAFTTEDYDMFEVCFESKGTGRIPDQLVILDMKHGVEAKNYEEIAKVEKLKPLEVELRRLEDLSESIVNDFAYMKKREEEMRDTNESTNTRVLYFSIFSMFCLIGLATWQVFYLRRFFKAKKLIE"
# or using your own fasta sequence
# MySeqAA_P49755 <- as.character(read.table(file="P49755.fasta",skip=1)[,1])
# MySeqAA_P49755
# Loading a scoring function
data(HydroScore)
?HydroScore
# or using your own scoring function
# HydroScoreKyte<- loadScoreFromFile("Kyte1982.txt")
# Transforming the amino acid sequence into a score sequence with the score function in HydroScore file
SeqScore_P49755 <- CharSequence2ScoreSequence(Seq219, HydroScore)
head(SeqScore_P49755)
#> [1]  2 -1  0  4 -1  0
length(SeqScore_P49755)
#> [1] 219
# Computing the local score
localScoreC(SeqScore_P49755)
#> $localScore
#> value begin   end 
#>    52    14    38 
#> 
#> $suboptimalSegmentScores
#>       value begin end
#>  [1,]     5     1   4
#>  [2,]     2     9   9
#>  [3,]    52    14  38
#>  [4,]    17   121 124
#>  [5,]     7   138 139
#>  [6,]     4   144 144
#>  [7,]     4   147 147
#>  [8,]     4   149 149
#>  [9,]     4   151 151
#> [10,]     4   154 154
#> [11,]     4   157 157
#> [12,]     9   161 162
#> [13,]     2   175 175
#> [14,]    43   186 208
#> 
#> $RecordTime
#>  [1]   1   9  14 121 138 144 147 149 151 154 157 161 175 186

File HydroScore is read using read.csv() with a default option header=TRUE. It is so necessary that the file has one. See Section "File format" for more details.

\(p\)-Value computation methods

There are different methods available to establish the statistical significance, also called \(p\)-value, of the local score depending on the length and the score expectation. This value describes the probability to encounter a given local score for a given score distribution and a given sequence length. Therefore it allows to determinate if the local score in question is significant or could have been obtained by chance.

For an Identically and Independently Distributed Variables model (I.I.D.), the following methods for the calculus of the \(p\)-value are provided:

Simulating computation: functions "monteCarlo()" and "monteCarlo_double()"

The function monteCarlo() simulates a number of score sequences similar (same distribution) to the one having yielded the local score in question. Therefore, it requires a function that produces such sequences and the parameters used by this function. See the help page and the following example to use the empirical distribution of a given sequence, but any other function, such as rbinom() or custom functions, are valid too.

In the following example we search the probability to obtain a local score of 10 in the sequence we created in the previous section and which serves as a blueprint for the score sequences to produce. The return value is the \(p\)-value of the local score for the given score sequence. A plot of the distribution of all local scores simulated and the cumulative distribution function are displayed. These plots can be hidden by setting the argument "hist" to FALSE. The number of sequences simulated in our example is 1000, a default value, and can be changed by setting the argument "numSim" to an appropriate value.

Note that the cumulative distribution function plot indicates \(P( LocalScore\leq\cdot)\) and so the corresponding \(p\)-value equals 1 minus the cumulative distribution function value.

monteCarlo(local_score = 10, FUN = function(x) {
  return(sample(x = x, size = length(x), replace = TRUE))
}, x = sequence)

#> p-value 
#>   0.991

The use of this method depends of computing power, number of simulations, implementation of the simulating function and the length of the sequence. For long sequences, you may prefer the next method which combines simulating and approximated methods and called KarlinMonteCarlo() (see next section).

There also exists a special function for real scores.

# Example
score_reels <- c(-1, -0.5, 0, 0.5, 1)
proba_score_reels <- c(0.2, 0.3, 0.1, 0.2, 0.2)
sample_from_model <- function(score.sple, proba.sple, length.sple) {
  sample(score.sple,
    size = length.sple, prob = proba.sple, replace = TRUE
  )
}
monteCarlo_double(5.5,
  FUN = sample_from_model, plot = TRUE, score.sple = score_reels, proba.sple = proba_score_reels,
  length.sple = 100, numSim = 1000
)

#> p-value 
#>   0.619

A mixed method: functions "karlinMonteCarlo()" and "karlinMonteCarlo_double()"

The function karlinMonteCarlo() also uses a function supplied by the user to do simulations. However, it does not deduce directly the \(p\)-value from the cumulative distribution function. However, this function is used to estimate the parameters of the Gumbel distributon of Karlin and al. approximation. Thus, it is suited for long sequences. Note: simulated_sequence_length is the length of the simulated sequences. This value must correspond to the length of sequences yielded by FUN. The value of \(n\) is the sequence length for which we want to compute the \(p\)-value. If \(n\) too large function MonteCarlo() could be too much time consumming. UsingkarlinMonteCarlo() with a smaller sequence length simulated_sequence_length allows to extract the parameters and then to apply them to the sequence value \(n\).

fu <- function(n, size, prob, shift) {
  rbinom(size = size, n = n, prob = prob) + shift
}
karlinMonteCarlo(12,
  FUN = fu, n = 10000, size = 8, prob = 0.2, shift = -2,
  sequence_length = 1000000, simulated_sequence_length = 10000
)