Empirical distribution of k-word matches in biological sequences
Forêt, Sylvain, Wilson, Susan R., and Burden, Conrad J. (2009) Empirical distribution of k-word matches in biological sequences. Pattern Recognition, 42 (4). pp. 539-548.
View at Publisher Website: http://dx.doi.org/10.1016/j.patcog.2008....
This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D2 statistic. The advantages of the use of this statistic over alignment-based methods are firstly that it does not assume that homologous segments are contiguous, and secondly that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D2 uncharacterised for most practical cases. The work presented here bridges these two worlds to give usable approximations of the distribution of D2 for ranges of parameters most frequently encountered in the study of biological sequences.
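As an illustration of the statistic described above, the following minimal sketch counts k-word matches between two sequences. It assumes the common definition of D2 as the number of position pairs (i, j) at which the k-word starting at i in the first sequence equals the k-word starting at j in the second, which is the same as summing, over all words, the product of the word's counts in each sequence. The function names `kmer_counts` and `d2` are hypothetical and do not come from the paper.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    # A sequence shorter than k yields an empty range, hence an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b (assumed D2 definition)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Iterate over the smaller table; a Counter returns 0 for absent words.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "TTACGTAA", 3))
```

Each sequence is scanned once to build its word counts, so the work grows linearly with sequence length, in line with the runtime behaviour the abstract describes. The sketch says nothing about the distribution of D2, which is the subject of the paper.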
Item Type: Article (Refereed Research - C1)
Keywords: alignment-free sequence comparison; biological sequences; genomic data
FoR Codes: 01 MATHEMATICAL SCIENCES > 0104 Statistics > 010402 Biostatistics @ 100%
SEO Codes: 81 DEFENCE > 810101 Air Force @ 100%
Deposited On: 01 Jun 2010 08:50
Last Modified: 12 Feb 2011 21:56