Package owltools.sim

OWLSim -- Semantic Similarity over OWL ontologies

See:
          Description

Interface Summary
SimilarityCalculator  
 

Class Summary
AsymmetricJaccardSimilarity  
AvgInformationContentBestMatchesSimilarity Deprecated.
AvgInformationContentLCSSimilarity Deprecated.
CombinedJaccardConjunctiveSetSimilarity filters first based on SimJ, if passes threshold performs full ConjunctiveSetSimilarity
CombinedSimilarity  
CombinedSimilarity2  
ConjunctiveSetInformationContentRatioSimilarity score is the IC of the intersection of all attributes divided my min IC of a or b it's recommended this used as a sub-method of a MultiSimilarity check
ConjunctiveSetSimilarity ConjunctiveSetSimilarity(a,b) = IC( LCA_1(a,b) ^ LCA_2(a,b) ...
DescriptionTreeSimilarity A similarity between two entities constructed by threading two description trees together.
DisjunctiveSetSimilarity  
EnrichmentEngine  
JaccardBloomSimilarity  
JaccardSimilarity  
MaximumInformationContentSimilarity MaximumInformationContentSimilarity(a,b) = max { IC(LCA_1(a,b)) ...
MultiSimilarity This is the standard method to use when comparing entities with multiple attributes.
OldSimpleOwlSim Inputs
OldSimpleOwlSim.EnrichmentConfig  
OverlapSimilarity  
OWLClassExpressionPair  
OWLObjectPair  
OWLResultsManager  
Reporter  
SimEngine SimEngine provides access to multiple similarity calculation procedures (see Similarity).
Similarity Represents a pairwise similarity between two OWLObjects.
SimSearch quick search for best matches
 

Enum Summary
OldSimpleOwlSim.Stage  
 

Exception Summary
 

Package owltools.sim Description

OWLSim -- Semantic Similarity over OWL ontologies

Synopsis

Command Line Use

   
owltools file:test_resources/pizza.owl --sim-method DescriptionTreeSimilarity --sim http://owl.cs.manchester.ac.uk/2009/07/sssw/pizza#MeatFishAndVegetarianPizza-Closed http://owl.cs.manchester.ac.uk/2009/07/sssw/pizza#VegetarianToppingsPizza-Closed
   
   

Introduction

One drawback with most existing bioinformatics ontology-based similarity analyses is the lack of OWL semantics. For example, most existing semantic similarity analyses treat the ontology as a simple graph. The subset of analyses that incorporate edge labels do so in ad-hoc ways, rather than taking advantage of the full semantics of the relationship.

These analyses are also dependent on all potential useful grouping classes being specified in advance, in a pre-coordinated ontology. This results in a lack of precision, as it is all but impossible to anticipate all potential grouping classes in advance.

Here we define a small number of methods for scoring similarity of two entities based on shared properties. The methods are completely neutral w.r.t the nature of these entities - they could be two individuals, two classes, or an individual and a class. They can be any type of individual or class, and can be compared with respect to any attribute. For example, organisms can be compared by phenotype or pizzas can be compared by toppings. The only requirement is that all the information is specified in OWL. The more heavily axiomatized the OWL is, the more informative the results will be.

The primary innovation is the use of a novel algorithm for computing the least common subsuming description. Unlike naive LCA computations, this also includes anonymous classes in the result. These are constructed on the fly. This has utility in a number of situations; for example, when comparing two complex descriptions, one using a human anatomy ontology and another using a zebrafish anatomy ontology, if we include a bridging ontology such as uberon then novel descriptions are generated that subsume both.

Definitions

Common Ancestors and Common Subsumers

Subsumed by

A subsumed by B if SubClassOf(A B). This can be either asserted or inferred. Subsumption is transitive, reflexive and anti-symmetric. The inverses relation is subsumes.

Descendant of

A is a descendant of B if there is an OWL Graph Edge in which A is the source and B is the target. See owltools.graph for a description of OWL Graphs. The inverse relation is is ancestor of. Descent is a super-property of subsumption (i.e. if A subsumed by B, the A is a descendant of B).

Common Subsumers

A is a common subsumer of B and C if A subsumes B and A subsumes C

As we count owl:Thing in the universe of objects, if B and C are satisfiable then it follows that they must have at least one CS, owl:Thing

Least Common Subsumers

A is a Least Common Subsumer (LCS) of B and C if:

note that the reflexivity of the subsumption relation necessitates the final clause to ensure that equivalent objects are included in the LCS set.

Common Ancestors

A is a common ancestor of B and C if A is a ancestor of B and A is a ancestor of C

Least Common Ancestors

A is a Least Common Ancestor (LCP) of B and C if:

(we use P rather than A, as we reserve A for "anonymous", see below)

note that the lack of anti-symmetric characteristic on the ancestor relation necessitates the final clause to ensure that objects related via cycles are included in the LCS set. For example:


SubClassOf(wheel SomeValuesFrom(partOf car))
SubClassOf(car SomeValuesFrom(hasPart wheel))
SubClassOf(audi car)
SubClassOf(bmw car)

P(audi) = { audi, car, wheel}
P(bmw) = { bmw, car, wheel}
LCP(audi, bmw) = { car, wheel }

Named Objects and Named Subsumers

A is a named subsumer of B if A is a subsumer of B, and A is in the set of named objects.

Example:


SubClassOf(car SomeValuesFrom(hasPart wheel))
SubClassOf(audi car)

SNamed(audi) = { audi, car, owl:Thing}

Anonymous Expression Subsumers

A is an anonymous subsumer of B if A is a subsumer of B, and A is in the set of class expressions.

With many ontologies, the number of anonymous subsumers will be larger than the set of named subsumers, because there are many combinations of expressions that can be made that subsume the source object.

Example:


SubClassOf(car SomeValuesFrom(hasPart wheel))
SubClassOf(wheel SomeValuesFrom(hasQuality round))
SubClassOf(audi car)

SAnon(audi) = { 
  SomeValuesFrom(hasPart,wheel)
  SomeValuesFrom(hasPart,owl:Thing)
  SomeValuesFrom(hasPart,SomeValuesFrom(hasQuality,round))
  SomeValuesFrom(hasPart,SomeValuesFrom(hasQuality,Thing))
}

Referenced Entities

X is a referenced entity if it is in the signature of the ontology, or set of ontologies. All named objected are referenced SubClassOf(wheel SomeValuesFrom(partOf car)) SubClassOf(car vehicle) ObjectProperty(partOf) R = { partOf car vehicle wheel SomeValuesFrom(partOf car) } Note that in the above example, SomeValuesFrom(partOf vehicle) is not in the set of referenced entities, even though it is a satisfiable class expression.

Common Referenced Ancestors

A is a Common Referenced Ancestor (CRP) of B and C if:


SubClassOf(car SomeValuesFrom(hasPart wheel))
SubClassOf(wheel SomeValuesFrom(hasQuality round))
SubClassOf(audi car)
SubClassOf(bmw car)

CRP(audi bmw) = { car wheel round SomeValuesFrom(hasPart wheel) SomeValuesFrom(hasQuality round) }

Least Common Named Subsumers

A is a Least Common Named Subsumer (LCNS) of B and C if:

if A ∈ LCNS(B,C) then A ∈ LCS(B,C)

Note that calculation of the LCNS is relatively efficient, as the set is derived from the universe of named objects

Least Common Anonymous Subsumers

A is a Least Common Anonymous Subsumer (LCAS) of B and C if:

if A ∈ LCAS(B,C) then A ∈ LCS(B,C) -- ie LCS includes anonymous classes

Calculation of LCAS is a non-trivial problem, because the universe of possible anonymous class expressions is very large and under certain models, infinite.

For any two classes B and C, under OWL-DL there is always at least one trivial LCAS: UnionOf(B C)

Least Common Meaningful Anonymous Subsumers

A is a Least Common Meaningful Subsumer (LCMS) of B and C if:

Example


EquivalentClasses(car IntersectionOf(vehicle motorized ObjectExactCardinality(hasPart car_wheel 4)))
EquivalentClasses(bicycle IntersectionOf(vehicle ComplementOf(motorized) ObjectExactCardinality(hasPart bicycle_wheel 2)))

SubClassOf(car_wheel wheel)
SubClassOf(bike_wheel wheel)

# subsumers:
LCMAS(car,bicycle) = { IntersectionOf(vehicle ObjectMinCardinality(hasPart wheel 2) ObjectMaxCardinality(hasPart wheel 4) } 
LCAS(car,bicycle) = LCMAS ∪ { UnionOf(car bicycle) }
LCMS(car,bicycle) = LCMAS ∪
LCS(car,bicycle) = LCAS
LCNS(car,bicycle) = { vehicle }
LCNP(car,bicycle) = { vehicle, wheel }
LCRP(car,bicycle) = { vehicle, wheel }

The LCMAS is the most informative of the measures, as it contains the most specific non-trivial description of what is shared between cars and bicycles based on the knowledge available (they are vehicles with between 2 and 4 wheels)

(note: def of M needs modified to eliminate the description "vehicle with 2 wheels or with 4 wheels" which has a smaller number of members, but does not generalize as well)

If we add an axiom:


SubClassOf(vehicle SomeValuesFrom(hasPart wheel))

Then this adds a reference to a new class expression changes the results as follows:

LCNS(car,bicycle) = { vehicle }
LCRS(car,bicycle) = { vehicle SomeValuesFrom(hasPart wheel) }
LCNP(car,bicycle) = { vehicle, wheel }
LCRP(car,bicycle) = { vehicle, SomeValuesFrom(hasPart wheel) }


Algorithms

Calculation of Common Referenced Ancestors (CRP and LCRP)


CRP(a,b) :
  P = {}
  for x in ancestor_of(a)
    add x to P
  for x in ancestor_of(b)
    add x to P
  RETURN P
 
 

LCRP(a,b) :
  P = CRP(a,b)
  for x in P:
    IF (exists y in P AND
       x ancestor_of y AND
       NOT(y ancestor_of x))
    THEN remove x from P
  RETURN P
 
 

Calculation of LCS (including LCAS)

This algorithm finds an approximation of the meaningful least common subsumers (including both anonymous and named). Recall that using standard LCA algorithms will only give us the least common referenced subsumer, which misses informative answers from the much wider set of possible anonymous expressions.

The basic idea can be informally sketched as follows. The descriptions of a and b can be conceived of as description trees, with each branchpoint representing set of sub-expressions join by the IntersectionOf construct. For example, the description motorized vehicle with 4 red wheels is a tree with 4 leaves and 3 branch points:

If we include axioms from the ontology and add edges derived using the OWL Graph traversal algorithm, we end up with a much bushier tree. For example, vehicle might no longer be a leaf node, if we include a SUBCLASS edge to manufactured object. Note that here we are reversing the normal terminology of trees when applied to ontologies - our root is the initial description and the leaves are ancestors.

We attempt to find a maximally informative tree that subsumes two given trees by aligning the branchpoints. We start at the roots of both trees, and traverse downwards, finding the best match for each node in the opposing tree. We calculate the LCRP at each point; if there are multiple non-redundant ancestors then we create an IntersectionOf expression. Where there is one, we use the LCRP.


LCS(a,b) :
  X = {}
  EA = all edges emanating from a
  for ea in EA:

    # find best matching edge for ea
    eb = null
    score = 0
    for bx in LCRP(b):
      for eb' in findEdge(b,bx)
        score' = scoreMatch(ea,eb')
        if score' > score:
          score = score'
          eb = eb'

    x = mkUnionExpression(ea, eb, LCS(ea.tgt. eb.tgt))
    add x to X

  # combine
  if |X| = 0 : return owl:Thing
         = 1 : return only element of X
         > 1 : return IntersectionOf(X)
 
 

Examples

Given the following ontology:


Ontology(test)
axon_terminals_degenerated_in_CA4 EquivalentTo phenotypeOf some hasPart some (CA4 and hasPart some (axon_terminal and hasQuality some degenerated))
dendrites_degenerated_in_CA3      EquivalentTo phenotypeOf some hasPart some (CA3 and hasPart some (dendrite and hasQuality some degenerated))
axon_terminal SubClassOf cell_process and partOf some neuron
dendrite SubClassOf cell_process and partOf some neuron
CA4 SubClassOf partOf some hippocampus
CA3 SubClassOf partOf some hippocampus
org1 instanceOf human and hasPhenotype some axon_terminals_degenerated_in_CA4
org2 instanceOf human and hasPhenotype some dendrites_degenerated_in_CA3
 
 
The answer to LCS(org1,org2) is: human and hasPhenotype some (partOf some hippocampus) and hasPart some ((cell_process and partOf some neuron) and hasQuality some degenerated)

Similarity Measures

Jaccard Similarity

Jaccard Similarity is the number of common attributes divided by the union of attributes. We define SimJ-RP using the set of referenced ancestors:


SimJ-RP(x,y) :
  return | CRP(x,y) | / | RP(x) ∪ RP(y) |

One limitation of this technique is that it does not take into account anonymous expressions.

We do not want to perform SimJ over all anonymous expressions, as there are potentially massive numbers of such expressions. Instead we can first compute the LCS (which includes anonymous expressions) using the above algorith, and then calculate the SimJ using only referenced ancestors.


SimJ-LCS-RP(x,y) :
  z = LCS(x,y)
  return | RP(z) | / | RP(x) ∪ RP(y) |

We illustrate these using the following ontology:

EquivalentClasses(car IntersectionOf(vehicle motorized ObjectExactCardinality(hasPart car_wheel 4)))
EquivalentClasses(bicycle IntersectionOf(vehicle ComplementOf(motorized) ObjectExactCardinality(hasPart bicycle_wheel 2)))

SubClassOf(car_wheel wheel)
SubClassOf(bike_wheel wheel)

# subsumers:
SimJ-RP(car,bicycle) =
  | {vehicle, wheel |
  -------------------
  | { car, bicycle, vehicle, bicycle_wheel, car_wheel, wheel, motorized, complementOf(motorized),
      IntersectionOf(vehicle motorized ObjectExactCardinality(hasPart car_wheel 4)), 
      ObjectExactCardinality(hasPart car_wheel 4),
      IntersectionOf(vehicle ComplementOf(motorized) ObjectExactCardinality(hasPart bicycle_wheel 2))
      ObjectExactCardinality(hasPart bicycle_wheel 2)
       |
  = 1/6
SimJ-LCS-RP(car,bicycle) =
  | {vehicle, wheel, IntersectionOf(vehicle ObjectCardinalityBetween(hasPart wheel 2 4)) ObjectCardinalityBetween(hasPart wheel 2 4) |
  -------------------
  | ... |
  = 1/3

Information Content

Average Best Match

The methods above work for any OWL individuals or classes - they are completely generic. We can achieve higher precision if we compare with respect to a particular attribute. The choice of attribute is application-dependent. For example, with phenotype comparison, we may wish to compare two entities with respect to the hasPhenotype object property. We could specify this using a DL-like query atts.x = { p : p ∈ SomeValuesFrom(phenotypeOf x) }

The Average Best Match is the average of all best matches between each attribute in x and the set of attributes in y, averaged for both x to y and y to x.


ABM(x,y) :
  score = 0;
  n = 0;
  for a in x.atts:
    score += bestMatch(a, y.atts)
    n++
  for a in y.atts:
    score += bestMatch(a, x.atts)
    n++
  RETURN score / n

bestMatch(a1, atts) :
  score = 0;
  for a2 in atts:
    score' = getScore(a1,a2)
    score = max(score, score')
  return score
 
 
ABM can be used with a number of different scoring schemes to score attribute pairs. For example, ABM with Jaccard similarity of the LCS would be called ABM-J-LCS.

SimJ-LCS(x,y) :
  z = LCS(x,y)
  return | LCRP(z) | / | RP(x) ∪ RP(y) |

Examples

OMIM

   
owltools file:imports.owl --sim-method MultiSimilarity --sim -p http://purl.obolibrary.org/obo/RO_0002200 http://purl.obolibrary.org/obo/MGI_101759 http://purl.obolibrary.org/obo/MGI_101761
   
   
   
owltools file:imports.owl --sim-method MultiSimilarity --sim -p 'has phenotype' http://purl.obolibrary.org/obo/MGI_101759 http://purl.obolibrary.org/obo/MGI_101761
   
   

Author:
cjm


Copyright © 2010-2013. All Rights Reserved.