Approximate Pattern Matching

Defining The Problem

This is a new project. I think (hope) readers will appreciate the opportunity to see an Engineers' approach to a technical question. I am a System Engineer. A generalist. Fundamentally, I apply scientific and mathematical principles to practical ends. I rely heavily on science and mathematics when working on a problem. Consequently, my notes and papers will reflect this dependency. I have included a list of references below as an aid to better understanding of the subject at hand and the methodology of my approach.

Approximate Pattern Matching (APM) is a method of searching a relatively large data source to find occurrences of a relatively small given pattern. A match is found if a data source sequence is identical or close to the given pattern. Just how close is close enough is part of the problem. If the data source is a text file of surnames and the pattern is "O'Keefe", we might want to count the sequence "Okeefe" as a match but if the sequence is "Okeafe", we might not want to count that as a match. The point here is that if the match criteria is too loose, we get too many matches (like Web search sites). If the criteria is too tight, we lose the advantage sought by using APM.

EBNF 4.0 is a tool for specifying linear procedural grammars. It produces parser/compilers that employ top-down successive alternation with back-tracking and conditional compilation. A pattern, in EBNF terms, is called a token and tokens must be identical to the source sequence to be counted as a match.

The question (i.e. problem space) is can I develop a reasonable model and specification for APM that, preferably, makes use of EBNF 4.0 in the implementation? The implication, of course, is that the APM model can be expressed as an EBNF 4.0 linear procedural grammar.

My thinking process about this is that every specific area of human endeavor has its own distinct language. There is, I believe, some inherent advantage in developing specialized languages for specialized subject areas. The consequence, however, is that when two areas of specialization are working the same problem in parallel, neither may be intuitively aware of that fact. The result can be duplication of effort and missed opportunities for cooperation. I suspect that APM is closely related to compiler theory so that one may benefit from the other.

Background References

Methodology and APM Sources

P.D. Terry, Compilers and Compiler Generators, Rhodes University, 1996, p.terry@ru.ac.za
Jean Bezivin, On The Unification Power Of Models, ATLAS Group (INRIA & LINA), University of Nantes, 2001, jean.bezivin@lina.univ-nantes.fr, http://www.sciences.univ-nantes.fr/lina/atl/ , Model Driven Engineering
A.V. Aho and S.C. Johnson Bell Laboratories, J.D. Ullman Princeton University, Deterministic Parsing of Ambiguous Grammars, Communications of the ACM, volumn 18 number 8 August 1975
Robert S. Boyer Stanford Research Institute, J.S. Moore Xerox Palo Alto Research Center, A Fast String Searching Algorithm, Communications of the ACM, volume 20 number 10 October 1977
M.A. Sustik, J.S. Moore, String Searching over Small Alphabets, Technical Report TR-07-62, Department of Computer Sciences, University of Texas at Austin, December 2007
W.R. Gish, Introduction To Alignment Scoring Statistics, Saint Louis, Missouri, 2003, http://www.cs.uml.edu/~grinstei/91.510/Lectures%205-8/Introduction%20to%20Alignment%20Scoring%20Statistics.htm
R. Karchin, Hidden Markov Models and Protein Sequence Analysis, http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html , Highly Recommended!
Gonzalo Navarro, Multiple Approximate String Matching by Counting, Dept. of Computer Science, University of Chile, 1997, gnavarro@dcc.uchile.cl , http://www.dcc.uchile.cl/~gnavarro/ps/wsp97.1.pdf , Highly Recommended!

Note: I will place document links in the pane on the left as they become available.