Toward Scalable Semantic Big Data
Julian Dolby
IBM Thomas J. Watson Research Center
SPLASH-I, Vancouver, October 2017
Collaborative Work
- Bishwaranjan Bhattacharjee
- Mihaela Bornea
- James Cimino
- Patrick Dantressangle
- Achille Fokoue
- Aditya Kalyanpur
- Anastasios Kementsietsidis
- Aaron Kershenbaum
- Li Ma
- Chintan Patel
- Edith Schonberg
- Kavitha Srinivas
- Octavian Udrea
Outline
- Running Example
- Scalable expressive reasoning
- Storing RDF data in a database
- Integration of Web data
Running Example
Objects in Our Universe
Example OWL Universe
- Individuals
$\begin{array}{l}Museum(Athens), Museum(Heraklion), \\
Museum(MOMA), Minoan(LaParisienne),\\
Mycenaean(DeathMask),VanGogh(StarryNight)\end{array}$
- Roles
$\begin{array}{l}
has(Athens,DeathMask),has(MOMA,StarryNight)\\
has(Heraklion,LaParisienne)
\end{array}$
- Axioms (TGDs)
$\begin{array}{l}
Minoan \sqsubseteq \exists{creationSite.Crete}\\
Mycenaean \sqsubseteq \exists{creationSite.Mycenae}
\end{array}$
Example ABox $A$
The Summary ABox
- Map ABox $A$ to $A'$ for scalability using $f$
$\begin{array}{l}
C(a) \in A \implies C(f(a)) \in A'\\
R(a,b) \in A \implies R(f(a), f(b)) \in A'
\end{array}$
- We choose concept sets as $f$: each individual maps to the set of concepts it belongs to
J. Dolby, A. Fokoue, A. Kalyanpur, A. Kershenbaum, E. Schonberg, K. Srinivas, L. Ma
Scalable Semantic Retrieval through Summarization and Refinement.
AAAI 2007
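The mapping above can be sketched in a few lines of Python. This is only an illustration on the running example, assuming assertions are encoded as plain tuples; the names `summarize` and `f` are chosen here and are not taken from the SHER implementation.

```python
from collections import defaultdict

# Concept assertions C(a) from the running example, as (concept, individual).
concept_assertions = [
    ("Museum", "Athens"), ("Museum", "Heraklion"), ("Museum", "MOMA"),
    ("Minoan", "LaParisienne"), ("Mycenaean", "DeathMask"),
    ("VanGogh", "StarryNight"),
]
# Role assertions R(a, b), as (role, subject, object).
role_assertions = [
    ("has", "Athens", "DeathMask"), ("has", "MOMA", "StarryNight"),
    ("has", "Heraklion", "LaParisienne"),
]

def summarize(concepts, roles):
    """Build the summary ABox A' by mapping each individual to its concept set."""
    concept_sets = defaultdict(set)
    for concept, individual in concepts:
        concept_sets[individual].add(concept)

    def f(individual):
        # Assumption: the summary node is simply the frozen set of concepts.
        return frozenset(concept_sets[individual])

    summary_concepts = {(c, f(a)) for c, a in concepts}        # C(f(a))
    summary_roles = {(r, f(a), f(b)) for r, a, b in roles}     # R(f(a), f(b))
    return f, summary_concepts, summary_roles

f, summary_concepts, summary_roles = summarize(concept_assertions, role_assertions)
```

In this sketch the three museums collapse into a single summary node, so the six individuals become four nodes while every `has` edge is preserved between the corresponding nodes.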
Example Summary ABox $A'$
Pointer Analysis Analogy
class Cell {
    private Object datum;
    Cell(Object d) { datum = d; }
    static void catInTheHat() {
        Cell x = new Cell("Thing 1");
        Cell y = new Cell("Thing 2");
    }
}
Heap
Type-Based Abstraction
Example Query
- "Museums that have works from Crete"
- Answer is Heraklion, since it has the Minoan LaParisienne
- DL
- $Museum \sqcap \exists{has.\exists{creationSite.Crete}}$
- Negate query at each node, find contradictions
- The entities and edges involved in a contradiction are called its justification
- Needs DL reasoning: creationSite edge is implicit
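To see why the creationSite edge matters, here is a deliberately simplified Python sketch. It does not negate the query and search for contradictions as described above; instead it swaps in a simpler technique, making each implicit edge explicit with a fresh placeholder individual and then matching the query directly. The axiom encoding and all function names are assumptions made for this illustration.

```python
types = {
    "Athens": {"Museum"}, "Heraklion": {"Museum"}, "MOMA": {"Museum"},
    "LaParisienne": {"Minoan"}, "DeathMask": {"Mycenaean"},
    "StarryNight": {"VanGogh"},
}
edges = {
    ("has", "Athens", "DeathMask"),
    ("has", "MOMA", "StarryNight"),
    ("has", "Heraklion", "LaParisienne"),
}
# Axioms of the form  C ⊑ ∃role.D, written as (C, role, D).
axioms = [("Minoan", "creationSite", "Crete"),
          ("Mycenaean", "creationSite", "Mycenae")]

def materialize(types, edges, axioms):
    """Add one placeholder witness per individual for each applicable axiom."""
    types = {a: set(cs) for a, cs in types.items()}
    edges = set(edges)
    for sub, role, filler in axioms:
        for a in [x for x, cs in types.items() if sub in cs]:
            witness = f"_{role}_of_{a}"      # hypothetical fresh individual
            types[witness] = {filler}
            edges.add((role, a, witness))
    return types, edges

def museums_with_works_from(place, types, edges):
    """Match Museum ⊓ ∃has.∃creationSite.place against explicit edges only."""
    answers = set()
    for r1, museum, work in edges:
        if r1 != "has" or "Museum" not in types.get(museum, ()):
            continue
        for r2, src, site in edges:
            if r2 == "creationSite" and src == work and place in types.get(site, ()):
                answers.add(museum)
    return answers

all_types, all_edges = materialize(types, edges, axioms)
```

Without the materialization step the query finds nothing in the raw data, which is the sense in which the answer needs reasoning rather than plain lookup.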
Initial Query Answer
Refinement
- $J$ is the justification, i.e. the conflict
- Partition summary nodes by edges in justification
$key(a) \equiv \left\{ R(s, t) \left| \begin{array}{l}
f(a) = s \wedge\\
R(s,t) \in J \wedge\\
\exists b \; R(a,b) \in A \wedge f(b) = t
\end{array} \right. \right\}$
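The key function can be read as code. The sketch below assumes concept sets as $f$ and a justification containing only the summary edge from the museum node to the Minoan node; that justification is picked by hand for illustration and is not computed by a reasoner. The helper `refine` is likewise hypothetical.

```python
from collections import defaultdict

MUSEUM = frozenset({"Museum"})
MINOAN = frozenset({"Minoan"})

# f maps each individual to its summary node (its concept set).
concept_sets = {
    "Athens": MUSEUM, "Heraklion": MUSEUM, "MOMA": MUSEUM,
    "LaParisienne": MINOAN,
    "DeathMask": frozenset({"Mycenaean"}),
    "StarryNight": frozenset({"VanGogh"}),
}

def f(a):
    return concept_sets[a]

# Role assertions R(a, b) of the original ABox A, as (role, a, b).
abox_roles = [
    ("has", "Athens", "DeathMask"), ("has", "MOMA", "StarryNight"),
    ("has", "Heraklion", "LaParisienne"),
]

# Assumed justification J: summary edges R(s, t) involved in the conflict.
justification = {("has", MUSEUM, MINOAN)}

def key(a, justification):
    """Edges R(s, t) in J with f(a) = s that a really has in A, up to f."""
    s = f(a)
    return frozenset(
        (r, src, t) for (r, src, t) in justification
        if src == s and any(r2 == r and a2 == a and f(b) == t
                            for (r2, a2, b) in abox_roles)
    )

def refine(individuals, justification):
    """Split summary nodes: individuals stay together only if keys agree."""
    partition = defaultdict(set)
    for a in individuals:
        partition[(f(a), key(a, justification))].add(a)
    return dict(partition)

parts = refine(["Athens", "Heraklion", "MOMA"], justification)
```

Here the museum node splits in two: Heraklion, which has a real edge behind the justification, is separated from Athens and MOMA, which do not.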
Query Refinement
Refined Query Answer
Pointer Analysis Analogy
class Cell {
    private Object datum;
    Cell(Object d) { datum = d; }
    static void catInTheHat() {
        Cell x = new Cell("Thing 1");
        Cell y = new Cell("Thing 2");
    }
}
- Pointer analysis query: “what points to ‘Thing 1’?”
- refinement in the style of Plevyak or Sridharan
Reasoning Results
$${\scriptsize \begin{array}{|l|l|l|l|l|}
\hline
Reasoner & Dataset & Avg.\ Time & Std.\ Dev. & Range \\ \hline
KAON2 & UOBM1 & 20.7 & 1.2 & 18-37\\ \hline
KAON2 & UOBM10 & 447.6 & 23.3 & 414.8-530\\ \hline
SHER & UOBM1 & 4.2 & 3.8 & 2.4-23.8\\ \hline
SHER & UOBM10 & 15.4 & 25.6 & 6.4-191.1 \\ \hline
SHER & UOBM30 & 34.7 & 63.5 & 11.6-391.1 \\ \hline
\end{array}}$$
RDF in a Relational Store
- Numerous large RDF data sources
- DBPedia (>300M triples)
- Web data (>3B triples from the Billion Triple Challenge, BTC)
- Exploit scalable RDBMS technology
- query optimization
- transaction support
- concurrency
- Quetzal
M. Bornea et al., Building an efficient RDF store over a relational database. SIGMOD 2013
Museums with Locations
Quetzal Schema
- Entity-oriented schema
- properties for subject on single row
- columns for predicates and values
- secondary table for multi-valued predicates
- Fit each entity onto a database row of limited width
- generally more predicates than available columns
- graph coloring to maximize density
- spill onto multiple rows only when necessary
- Analogous tables for reverse direction
- Reduces joins for "star" queries
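The graph-coloring step can be illustrated with a small greedy sketch. It assumes that two predicates interfere when they occur on the same subject, and that a color is a predicate/value column pair; the function `assign_columns` and its spill list are hypothetical and do not reproduce Quetzal's actual algorithm or its multi-row spill handling.

```python
from itertools import combinations

def assign_columns(entity_predicates, num_columns):
    """Greedily color the predicate interference graph with column indexes."""
    interferes = {}
    for preds in entity_predicates.values():
        for p in preds:
            interferes.setdefault(p, set())
        # Predicates used by the same subject must not share a column.
        for p, q in combinations(sorted(set(preds)), 2):
            interferes[p].add(q)
            interferes[q].add(p)

    column_of, spilled = {}, []
    # Most constrained predicates first; ties broken by name for determinism.
    for p in sorted(interferes, key=lambda p: (-len(interferes[p]), p)):
        used = {column_of[q] for q in interferes[p] if q in column_of}
        free = [c for c in range(num_columns) if c not in used]
        if free:
            column_of[p] = free[0]
        else:
            spilled.append(p)   # would need an additional row in a real store
    return column_of, spilled

# Predicate usage per kind of subject in the running example.
usage = {
    "Heraklion": ["type", "has", "at"],
    "l1": ["type", "lat", "long"],
    "LaParisienne": ["type"],
}
columns, spilled = assign_columns(usage, num_columns=3)
```

With three column pairs, the five predicates of the example fit without spilling, because `has`/`at` and `lat`/`long` never occur on the same subject and can share columns.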
Example Entity-Oriented Table
${\tiny \begin{array}{|r|l|l|l|l|l|l|}\hline
{\rm{subject}} & {\rm{p1}} & {\rm{v1}} & {\rm{p2}} & {\rm{v2}} & {\rm{p3}} & {\rm{v3}}\\ \hline
LaParisienne & type & Minoan & & & &\\
DeathMask & type & Mycenaean & & & &\\
StarryNight & type & VanGogh & & & &\\
Heraklion & type & Museum & has & La\dots & at & l1\\
Athens & type & Museum & has & De\dots & at & l2\\
MOMA & type & Museum & has & St\dots & at & l3\\
l1 & type & Location & lat & 35.3 & long & 25.1\\
l2 & type & Location & lat & 38.0 & long & 23.7\\
l3 & type & Location & lat & 40.7 & long & -74.0\\ \hline
\end{array}}$
Register Allocation Analogy
var x = 7
var y = x + 7
var z = 3
y = y + z
$\begin{array}{|l|l|l|}\hline
{\rm code} & {\rm reg\ 1} & {\rm reg\ 2}\\ \hline
x = 7 & x & \\
y = x + 7 & x & y\\
z = 3 & z & y\\
y = y + z & z & y\\ \hline
\end{array}$
Quetzal Results on LUBM
Data Everywhere
- Numerous structured data sources available
- medical (Drugbank, Uniprot); general (DBpedia)
- much data in RDF, queried with SPARQL
- But data increasingly diverse
- RDF, XML, JSON, CSV formats
- accessible as dumps, query endpoints and APIs
- Powerful if integrated and queried effectively
- reuse and extend existing declarative SPARQL
J. Dolby et al., Extending SPARQL for Data Analytic Tasks. ISWC 2016
Modularize SPARQL with Functions
- "museums with some type of exhibit"
function museumsWith(?type ->
?museum ?lat ?long) {
?museum has ?art .
?art type ?type .
?museum at ?loc .
?loc geo:lat ?lat .
?loc geo:long ?long .
}
Functions called with bind
bind ?museum ?lat ?long
as museumsWith(Minoan) .
Web Service http://ip-api.com/
- Web service returns IP-based information
<query>
<status><![CDATA[success]]></status>
<country><![CDATA[United States]]></country>
<countryCode><![CDATA[US]]></countryCode>
<region><![CDATA[IL]]></region>
<regionName><![CDATA[Illinois]]></regionName>
<city><![CDATA[Chicago]]></city>
<zip><![CDATA[60605]]></zip>
<lat><![CDATA[41.8632]]></lat>
<lon><![CDATA[-87.6198]]></lon>
<timezone><![CDATA[America/Chicago]]></timezone>
<isp><![CDATA[AT&T Services]]></isp>
<org><![CDATA[Hilton Hotels Corporation]]></org>
<as><![CDATA[AS7018 AT&T Services, Inc.]]></as>
<query><![CDATA[12.218.232.8]]></query>
</query>
Web Service Example
- Web service to return latitude and longitude
function geo:getPosition( -> ?lat ?long)
service get http://ip-api.com/xml [] -> xml
"/query": "./lat" "./lon"
getPosition used with bind
select ?lat ?long where {
bind ?lat ?long as geo:getPosition()
}
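The row and column paths in the function definition behave much like an XPath extraction. A rough Python equivalent, using only the standard library and a trimmed copy of the sample response rather than a live call, might look as follows. The helper `extract_rows` is hypothetical; note that the sample response names the longitude element `lon`.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample response; no network access is needed.
SAMPLE = """<query>
  <status><![CDATA[success]]></status>
  <lat><![CDATA[41.8632]]></lat>
  <lon><![CDATA[-87.6198]]></lon>
</query>"""

def extract_rows(xml_text, row_tag, column_paths):
    """Return one tuple of floats per row element, one value per column path."""
    root = ET.fromstring(xml_text)
    # ElementTree paths are relative to the root element, so a row path such
    # as "/query" is handled by comparing against the root tag itself.
    rows = [root] if root.tag == row_tag else root.findall(f".//{row_tag}")
    return [tuple(float(row.findtext(path)) for path in column_paths)
            for row in rows]

positions = extract_rows(SAMPLE, "query", ["./lat", "./lon"])
```

Each row becomes one binding of the output variables, which is what lets a single service call feed directly into a bind.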
Combining Disparate Data
- "Nearby Minoan exhibits"
- RDF museums, http://ip-api.com locations
select ?museum where {
bind ?museum ?lat1 ?long1
as museumsWith(Minoan) .
bind ?lat2 ?long2 as geo:getPosition()
FILTER(abs(?lat2-?lat1) < .1 &&
abs(?long2-?long1) < .1)
}
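The same join can be mimicked in Python to make the data flow concrete. This sketch assumes the museum coordinates from the entity-oriented table and passes the caller's position in as a parameter instead of calling the web service; it compares absolute differences so that positions on either side of a museum count as near. All names are illustrative.

```python
# Museum data from the running example: exhibit and location per museum.
museums = {
    "Heraklion": {"has": "LaParisienne", "lat": 35.3, "long": 25.1},
    "Athens": {"has": "DeathMask", "lat": 38.0, "long": 23.7},
    "MOMA": {"has": "StarryNight", "lat": 40.7, "long": -74.0},
}
art_type = {
    "LaParisienne": "Minoan",
    "DeathMask": "Mycenaean",
    "StarryNight": "VanGogh",
}

def museums_with(kind):
    """Analogue of museumsWith(?type -> ?museum ?lat ?long)."""
    return [(name, data["lat"], data["long"])
            for name, data in museums.items()
            if art_type[data["has"]] == kind]

def nearby(kind, position, tolerance=0.1):
    """Join museum rows with a position and keep those within the tolerance."""
    lat2, long2 = position   # stand-in for the result of geo:getPosition()
    return [name for name, lat1, long1 in museums_with(kind)
            if abs(lat2 - lat1) < tolerance and abs(long2 - long1) < tolerance]
```

Calling `nearby("Minoan", (35.34, 25.13))` selects Heraklion, while the Chicago position from the sample response selects nothing.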
The Future
- Further extensions to SPARQL
- further modularization, e.g. module systems
- potential for first-class functions
- Semantic big data means reasoning
- put summarization, refinement into Quetzal
- exploit entity-oriented schema for summarization
- Would love others to get involved