Formal Grammar Matcher:
takes in as arguments:
- query - string
- language : array of rules
- rule type :
- match - list of string: the string to match. See Match Syntax.
- matchtype - list of string: the type of the subexpression to match
- type - string ["query" | "data" | "function"]
- if type == "query" then
- query - string: a query in rdf form. %matchtype% will get replaced by that subquery's result.
- this type has been deprecated. Use a function that returns the proper query data structure instead.
- if type == "data" then
- data - string: plain rdf data.
- if type == "function" then
- code - string: lua code to execute.
returns:
- node :
- rule - rule: one matching rule
- subexpressions = {[matchtype] = array of nodes}
The matcher as a whole returns an array of these nodes.
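The argument and return shapes above could be written as Lua tables roughly like this. It is only a sketch: the field names follow the list above, but the example rules, the contents of `code` and `data`, the shape of the `vars` table used inside `code`, and the helper `new_node` are assumptions made for illustration.

```lua
-- Hypothetical rule of type "function"; field names follow the list above.
local plus_rule = {
  match = { "%number% plus %number%" },
  matchtype = { "number" },
  type = "function",
  -- Assumption: vars.number holds the values of the two matched subexpressions.
  code = "return vars.number[1] + vars.number[2]",
}

-- Hypothetical rule of type "data"; the literal is only an illustration.
local five_rule = {
  match = { "five" },
  matchtype = { "number" },
  type = "data",
  data = '"5"^^xsd:integer',
}

-- A node pairs one matching rule with the nodes matched for each matchtype.
local function new_node(rule, subexpressions)
  return { rule = rule, subexpressions = subexpressions or {} }
end

-- One possible parse of "five plus five".
local node = new_node(plus_rule, {
  number = { new_node(five_rule), new_node(five_rule) },
})
```

The matcher would return an array of such nodes, one for each way the query can be parsed.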
Some problems to solve:
- can a rule have multiple match types?
- any way to abstract above match types?
- what to do when there are multiple matches to a query?
- what to do when there are no matches to a query?
- what kind of data will these algorithms require?
Multiple matches:
Required data:
- each possible match
- if the match is a query, how many results it returns
Dealing with punctuation and whitespace variations
The following substitution is applied to all incoming queries to normalize whitespace: s/\s+/ /g, or in Lua, gsub("%s+", " ")
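A minimal sketch of that normalization step in Lua. The function name `normalize_whitespace` is hypothetical, and trimming the ends of the string is an extra assumption that the notes do not state.

```lua
local function normalize_whitespace(query)
  -- Collapse every run of whitespace (spaces, tabs, newlines) into one space.
  local collapsed = query:gsub("%s+", " ")
  -- Assumption: leading and trailing spaces are never significant, so trim them.
  return (collapsed:gsub("^ ", ""):gsub(" $", ""))
end
```

For example, `normalize_whitespace("emails   from\t bob")` gives `"emails from bob"`.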
How to deal with situations like:
- '45 + 34'
- 'emails, from bob.'
When will extra spaces or punctuation be important?
- 'emails from bob, joe or bill'
- 'emails, from bob, where something ...'
Still, when is whitespace actually necessary?
- at this point I might just want to get rid of it all... or rather collapse it: s/\s+/ /g or gsub("%s+", " ")
What about this type of case:
- 'thirty-five' => 35
- 'four-five' => -1
You don't want just '%number%-%number%'; you want '%number%-%number%' but not '%tens place number%-%single digit number%'.
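The distinction can be sketched as follows. This is not the matcher itself, only an illustration of the rule ordering: the word tables `TENS` and `ONES` and the function `interpret_hyphenated` are hypothetical, and only English number words below one hundred are covered.

```lua
local TENS = { twenty = 20, thirty = 30, forty = 40, fifty = 50,
               sixty = 60, seventy = 70, eighty = 80, ninety = 90 }
local ONES = { one = 1, two = 2, three = 3, four = 4, five = 5,
               six = 6, seven = 7, eight = 8, nine = 9 }

local function interpret_hyphenated(text)
  local left, right = text:match("^(%a+)%-(%a+)$")
  if not left then return nil end
  -- '%tens place number%-%single digit number%' is a compound number.
  if TENS[left] and ONES[right] then
    return TENS[left] + ONES[right]
  end
  -- Any other '%number%-%number%' is read as a subtraction.
  local a = TENS[left] or ONES[left]
  local b = TENS[right] or ONES[right]
  if a and b then return a - b end
  return nil
end
```

Checking the compound case first is what keeps 'thirty-five' from being read as 30 - 5.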
Complex matching
Abstracting above match types
Abstractions that would be useful:
- types (of data) that have a specific attribute ( has a dc:date )
- %person% matching when the string matches a name of a person in the contact list
- named groups of matchtypes
- rules that work as multiple match types
Simplify this to:
- match if data coming into matchtype x has (or could have?) a dc:date attribute
- this requires evaluating the sub-query to determine what kind of properties it will have...
- match the string depending on the result of a query against the database. Ex:
- "Bob" matches %person% if the query '... WHERE { ?contact ex:name "Bob" }' has a single result. (We know exactly who Bob is.)
- "Bob" matches %person% if the query '... WHERE { ?contact ex:name "Bob" }' has more than one result. (Bob is actually ambiguous.)
- "Bob" matches %person% if the query '... WHERE { ?contact ex:name "Bob" }' has no results. (We've got nothing on Bob.)
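The three cases above differ only in how many results the lookup returns, which could be sketched like this. The function `classify_person`, the labels it returns, and the in-memory `contacts` table standing in for the RDF database are all assumptions for illustration; a real implementation would run the query against the store.

```lua
-- run_query is any function that takes a name and returns an array of results.
local function classify_person(name, run_query)
  local results = run_query(name)
  if #results == 1 then
    return "unique"      -- we know exactly who this is
  elseif #results > 1 then
    return "ambiguous"   -- several contacts share the name
  else
    return "unknown"     -- nothing in the database
  end
end

-- Stand-in for the database: a plain table instead of an RDF store.
local contacts = { { name = "Bob" }, { name = "Bob" }, { name = "Joe" } }

local function fake_run_query(name)
  local found = {}
  for _, contact in ipairs(contacts) do
    if contact.name == name then found[#found + 1] = contact end
  end
  return found
end
```

Which of the three labels should count as a match for %person% is left to the rule; the sketch only reports the case.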
Matching types and the existence of attributes
- Is this going to require the definition of invariant rules that apply to matchtypes (this matchtype always has these attributes)?
- How can you test for these attributes without executing the query?
- If what we are matching against is a property of the query can we simply examine it?
- how hard is it to determine the type of each of the select variables?
- what other kind of information might you want to have access to?
- maybe the select coming in doesn't have a date selected out, but it could ... ex:
- list of emails (but the user doesn't care about the date so it isn't displayed and not in the query)
- or does this not really happen by convention? Any good examples?
- this is not an issue if all queries always select all of the possible information you might want to match against. This doesn't seem very feasible.
Matching data and the values of their properties
This can be addressed by allowing the definition of a function which is called when a match is being tested. The environment would provide functions which execute a query and return its results.
- inputs to the function are in the typical vars table.
- return true if these vars match, otherwise false.
Some examples of what you would want to use this for:
- regular expression matching on the string coming in. This could be done by making a match which is simply %name% and then testing from the match-function.
- test to see if this is a person's name by looking it up in the database.
- Are there any cases where the match isn't going to be a simple '%abc%', but have other words in it etc?
This function is called while testing whether a rule matches, and only after the simple text match has already succeeded. If it returns true, the rule really is a match; if it returns false, it overrides the simple text match and the rule is rejected.
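A small sketch of that two-stage test. The field name `match_function`, the helper `rule_matches`, and the shape of the `vars` table (placeholder name mapped to the matched string) are assumptions; the notes only say that the inputs arrive in the usual vars table.

```lua
-- text_matched is the outcome of the simple text match done beforehand.
local function rule_matches(rule, text_matched, vars)
  if not text_matched then return false end          -- the text match comes first
  if rule.match_function == nil then return true end -- nothing left to veto it
  return rule.match_function(vars) == true           -- the function has the last word
end

-- Hypothetical rule: match %name% textually, then accept only digit strings.
local digits_rule = {
  match = { "%name%" },
  match_function = function(vars)
    return vars.name:match("^%d+$") ~= nil
  end,
}
```

With this rule, `rule_matches(digits_rule, true, { name = "45" })` is true, while the same call with `name = "bob"` is false even though the text match succeeded.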
This actually requires a new kind of match type. Take the case of matching numbers entered as actual ASCII digits. What do you match against?
- The first instinct is %number%, but that is a recursive definition, because the type of the rule is itself %number%, which isn't what we want. In this case we don't want the search algorithm to try to find a match for a number; we just want a variable there called number. In some cases we may want to let it search for a number and then pass it to us for further testing. I'm not sure when, but I don't see anything wrong with allowing that. The solution is probably to add a new identifier such as _ to signify that we don't want to match against number, only to capture it as a variable for use in the match function later. So: %_number%.
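One way to tell the two kinds of placeholder apart is to split the match string into parts before searching. The sketch below assumes placeholder names consist of letters, digits, and underscores; the function `parse_match` and the `kind` labels are hypothetical.

```lua
-- Split a match string into literal text, %type% placeholders (searched
-- recursively) and %_name% placeholders (captured as plain variables).
local function parse_match(match)
  local parts, pos = {}, 1
  while pos <= #match do
    local s, e, name = match:find("%%([%w_]+)%%", pos)
    if not s then
      -- No more placeholders: the rest of the string is literal text.
      parts[#parts + 1] = { kind = "literal", text = match:sub(pos) }
      break
    end
    if s > pos then
      parts[#parts + 1] = { kind = "literal", text = match:sub(pos, s - 1) }
    end
    if name:sub(1, 1) == "_" then
      parts[#parts + 1] = { kind = "capture", name = name }
    else
      parts[#parts + 1] = { kind = "matchtype", name = name }
    end
    pos = e + 1
  end
  return parts
end
```

For "%_number% plus %number%" this yields a capture named `_number`, the literal " plus ", and a matchtype placeholder `number`. Only the last one would trigger a recursive search.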
Need to be able to match against parts of the RDF database. Ex:
... ?person contacts:person ?name ...
all of the ?name results can be matched against.
or
?contact ex:name ?name .
?contact ex:email ?email ...
Match ?name as a %person% type. The result, ?email, is of type %email address%.
match: "%person%"
match-sparql: "/intro/ SELECT ?email WHERE { ?contact ex:name %person% . ?contact ex:email ?email }"
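Filling in a match-sparql template could look like the sketch below. It assumes that each placeholder is replaced by a quoted string literal and that `vars` maps placeholder names to the matched strings; `sparql_literal` and `fill_template` are hypothetical names, and the "/intro/" part of the template is passed through untouched.

```lua
-- Quote a value as a string literal, escaping backslashes and double quotes.
local function sparql_literal(value)
  local escaped = value:gsub("\\", "\\\\"):gsub('"', '\\"')
  return '"' .. escaped .. '"'
end

-- Replace each %name% in the template with the quoted value from vars.
local function fill_template(template, vars)
  return (template:gsub("%%([%w_]+)%%", function(name)
    local value = vars[name]
    if value == nil then return nil end -- unknown placeholders stay as they are
    return sparql_literal(value)
  end))
end

local template = "/intro/ SELECT ?email WHERE { ?contact ex:name %person% . ?contact ex:email ?email }"
local query = fill_template(template, { person = "Bob" })
```

Escaping matters here because the matched string comes straight from user input.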
How does this type of search work?
Optimizations
The search is a complete search. It finds all possible matches, not just one. This makes sense now, but an optimization algorithm here is going to be closely related to the algorithm that predicts how to interpret each ambiguous statement.
- keep track of how often certain matches are used as a top level query (this is probably a hard part of the search) - I'd check it out though.
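The bookkeeping for that idea could be as small as the sketch below. It assumes every rule has some string identifier; `record_use`, `rank`, and the identifiers used with them are hypothetical, and persistence of the counts is left out.

```lua
local usage = {}

-- Call this whenever a rule ends up as the top level match the user accepted.
local function record_use(rule_id)
  usage[rule_id] = (usage[rule_id] or 0) + 1
end

-- Return a copy of the candidates, most frequently used first.
local function rank(candidates)
  local ranked = {}
  for i, id in ipairs(candidates) do ranked[i] = id end
  table.sort(ranked, function(a, b)
    local ua, ub = usage[a] or 0, usage[b] or 0
    if ua ~= ub then return ua > ub end
    return a < b -- alphabetical tie-break keeps the order deterministic
  end)
  return ranked
end
```

The complete search still runs; the counts only decide which interpretation is tried or shown first.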
Old issues (already dealt with) - here for completeness and reference
Can a rule have multiple match types?
How does this affect performance?
- It seems like it may affect performance, but that is really unavoidable. This abstraction is really useful. Tests could be done later to determine whether expanding rules with multiple matchtypes into the equivalent set of single-matchtype rules is worthwhile.
Is there any other way to handle this type of abstraction?
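The expansion mentioned under the performance question could be sketched as follows, assuming rules are plain tables with a `matchtype` list. The function `expand_rule` is hypothetical, and the copies are shallow, so they share the original `match` list.

```lua
-- Turn one rule with several matchtypes into one rule per matchtype.
local function expand_rule(rule)
  local expanded = {}
  for _, matchtype in ipairs(rule.matchtype) do
    local copy = {}
    for key, value in pairs(rule) do copy[key] = value end -- shallow copy
    copy.matchtype = { matchtype }
    expanded[#expanded + 1] = copy
  end
  return expanded
end
```

Timing the matcher on the original rule set and on the expanded one would show whether the expansion is worth doing.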