Accessing Data In the Multiverse

Being with my lover (plus friend and mentor), another tiny detail crossed my mind that I thought I should clarify with Sandra. I had read all the information about camping, then missed the deadline, and wasn’t planning on going to the event anyway, so I deleted the email; it wasn’t personal. (However, based on what I’ve written below, it will probably seem like an intentional slight. Yet I was processing all the aforementioned, and am just in a major declutter mode.)

If an email doesn’t hold any useful information or important correspondence, or have nostalgic relevance, etc. (especially now that I’m so inundated), I try to get rid of it asap. While I do store or file an awful lot (see below), I actually try to keep my Inbox to a bare minimum. My main account is overflowing and hasn’t been decluttered in years, but my PM account, pre-March 2022, is a better example of how I prefer things to be. At one time, I had planned to move everything over there, even though I knew back in 2002-ish that paid privacy services of any kind are basically a sham and a type of honeypot, with equal spying on all legal users as well as on any prospective criminals or illegal activities (see below**).

And for the record, early on in my relationship with Sandra, I tried to do right by her (inviting her whenever possible; offering lifts; enquiring about her mental state; trying to potentially be a listening ear to perhaps facilitate her healing process, and help keep her on a stronger, more stable track, since she had shared with me her traumatic experiences). Yet, as with so many things since 1999 (and earlier), I’m basically just a puppet on a much larger stage, or a fly in a big web, lol.

Cybersecurity? LOL!

This article below is 25 years too late — and anyway, there has never been much public interest in the topic of online privacy and security — even though it means your offline daily life and personal safety can be affected in a heartbeat; plus, 99.999% of users lack the know-how to stay on top of this all-encompassing reality; the few true tech-savvy privacy-defenders and freedom-fighters have been safely ‘put away’ or co-opted (by 2001?); and any oversight body formed would be toothless, or any information they might provide merely offers a false sense of security:

“Cybersecurity 101: Protect your privacy from hackers, spies, and the government” by Charlie Osborne, Contributor on Jan. 21, 2022

Just another pointless activity to keep people occupied, and distract them from what’s really going on, lol.

Government Surveillance = Corporate Gain…And So Much More

All this is actually related to another thing that Big Brother observed about others and myself early on (ie, prior to Gmail’s soft launch in 2003), which I believe may have contributed to the clear and sudden change in tack by Big Business email providers. Whereas many of the larger companies at the time (Yahoo, Hotmail, AOL, etc) were limiting their users to a really small amount of data storage, and then charging them to add more — and naturally, most customers regularly deleted their emails asap to maintain free service — I tended to keep many or all emails for reference.

Big deal, right? But when you’re living under a microscope, and your email habits, content, and activities are far from typical — this can all become highly useful, even creative, fodder. Kind of like an all-access *case study* — psychological profiling of an imperfect, but innocent person, who becomes driven to extreme and desperate measures. Result: an at-times deviant or possibly devious sociopath. (I would say that I oscillate between being too blunt and personal, to being seriously indirect and not knowing how to broach issues or ask for help, especially when it mattered most.)

After 9/11 happened, Homeland Security in the U.S. were allowed to expand their powers limitlessly, and various large-scale data mining operations etc were launched. However, encouraging companies to allow customers to keep more of their emails does (or did) lighten the workload for them, and also offers a clearer picture about every individual citizen they may decide to home in on anytime (ie, their mindset, behaviours or habits, filing systems, and whatever) — both from a security and marketing perspective, lol.

Anyway, that’s all old-school. The metaverse of digital media and communications is infinite — and likewise, all electronic (and print) information may be compromised, mirrored, ghosted, captured/intercepted, redirected, edited, and/or otherwise manipulated in billions of ways — even retroactively.

No sense getting into the macro or micro of it all. Some of the information, explaining these highly concerning technical issues in layman’s terms, is probably no longer even readily available on the internet. Online content is being curated to such an extent, that 2 people can go to the same website and see different information; or, your social media content can be deliberately limited to a small audience; or, an article and weblink you saw yesterday, suddenly disappears the next day (or in my case, sometimes within minutes). Life in the bubble, lol.

[Frankly, this is a first: Seeing my birthday connected with something *remotely* (and supposedly) positive. Yes, pun intended. But of course, all the typos, nonsensical sentences and simplistic language kind of suggest that it is squarely aimed at me. LOL…]

Excerpt from article: “Towards multiverse databases”


The application makes queries on behalf of an authenticated user, but it is up to the application itself to make sure that the user only sees data they are entitled to see. [Emphasis mine.]

With multiverse databases, each user sees a consistent “parallel universe” database containing only the data that user is allowed to see. Thus an application can issue any query, and we can rest safe in the knowledge that it will only see permitted data.

**NOTE: All privacy and security-oriented services or SmartHome automation, etc — both online and offline — are simply an easy-access repository of clients and any homes or information they may wish to safeguard, as well as being a means to infiltrate, observe and potentially control or manipulate people, for and by authorities / govts / corporations / spies — but which also trickles down to many everyday hackers, who have become their incognito army of ‘digital henchmen’. Or, in the offline world, all the actors who may likewise come into play. That sounds like ‘conspiracy theory’, but it’s real. What’s reflected in various media programs or movies, for example, (w.r.t. surveillance; but also, all the elaborate ways to track, study, ‘logistically manage’ or deceive everyday, law-abiding citizens — individually or en masse — like with well-placed strangers or service staff; fake news; it’s an endless list) is actually quite true to life and where much technology and human activities are already at — and it is totally widespread, both locally and globally. Hollywood just tweaks it enough to maintain a seemingly limited fictional or fantastical veneer.



the morning paper

a random walk through Computer Science research, by Adrian Colyer

Towards multiverse databases


Towards multiverse databases Marzoev et al., HotOS’19

A typical backing store for a web application contains data for many users. The application makes queries on behalf of an authenticated user, but it is up to the application itself to make sure that the user only sees data they are entitled to see.

Any frontend can access the whole store, regardless of the application user consuming the results. Therefore, frontend code is responsible for permission checks and privacy-preserving transformations that protect user’s data. This is dangerous and error-prone, and has caused many real-world bugs… the trusted computing base (TCB) effectively includes the entire application.

The central idea behind multiverse databases is to push the data access and privacy rules into the database itself. The database takes on responsibility for authorization and transformation, and the application retains responsibility only for authentication and correct delegation of the authenticated principal on a database call. Such a design rules out an entire class of application errors, protecting private data from accidentally leaking.

It would be safer and easier to specify and transparently enforce access policies once, at the shared backend store interface. Although state-of-the-art databases have security features designed for exactly this purpose, such as row-level access policies and grants of views, these features are too limiting for many web applications.

In particular, data-dependent privacy policies may not fit neatly into row- or column-level access controls, and it may be permissible to expose aggregate or transformed information that traditional access control would prevent.

With multiverse databases, each user sees a consistent “parallel universe” database containing only the data that user is allowed to see. Thus an application can issue any query, and we can rest safe in the knowledge that it will only see permitted data.

The challenging thing of course, is efficiently maintaining all of these parallel universes. We’ll get to that, but first let’s look at some examples of privacy policies and how they can be expressed.

Expressing privacy policies

In the prototype implementation, policies are expressed in a language similar to Google Cloud Firestore security rules. A policy just needs to be a deterministic function of a given update’s record data and the database contents. Today the following are supported:

  • Row suppression policies (e.g. exclude rows matching this pattern)
  • Column rewrite policies (e.g. translate / mask values)
  • Group policies, supporting role-based (i.e., data-dependent) access controls
  • Aggregation policies, which restrict a universe to see certain tables or columns only in aggregated or differentially private form.

Consider a class discussion forum application (e.g. Piazza) in which students can post questions that are anonymous to other students, but not anonymous to instructors. We can express this policy with a combination of row suppression and column rewriting:
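The policy listing from the original article did not survive here, so what follows is only a rough Python sketch of the idea, not the prototype's policy language. The `Post` fields, the `ENROLLMENT` table and the rule that suppresses posts from classes the viewer is not enrolled in are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Post:
    post_id: int
    class_id: str
    author: str
    anonymous: bool
    body: str


# Hypothetical enrollment table: (user, class_id) -> role
ENROLLMENT = {
    ("alice", "cs101"): "student",
    ("bob", "cs101"): "student",
    ("carol", "cs101"): "instructor",
}


def visible(post, viewer):
    """Row suppression (assumed rule): hide posts from classes the viewer is not in."""
    return (viewer, post.class_id) in ENROLLMENT


def rewrite(post, viewer):
    """Column rewrite: mask the author of anonymous posts, except for
    instructors of the class and the author themselves."""
    role = ENROLLMENT.get((viewer, post.class_id))
    if post.anonymous and role != "instructor" and post.author != viewer:
        return replace(post, author="Anonymous")
    return post


def user_universe(base_posts, viewer):
    """The viewer's 'parallel universe': suppressed rows dropped, columns rewritten."""
    return [rewrite(p, viewer) for p in base_posts if visible(p, viewer)]


BASE = [
    Post(1, "cs101", "alice", True, "Is the midterm open book?"),
    Post(2, "cs101", "bob", False, "Have office hours moved?"),
    Post(3, "cs999", "dave", False, "Post in an unrelated class"),
]

for p in user_universe(BASE, "bob"):
    print(p.post_id, p.author)
```

In this sketch the application would only ever query `user_universe(BASE, viewer)`, so a forgotten permission check in frontend code could not expose the real author.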

Maybe we want to allow teaching assistants (TAs) to see anonymous posts in the classes they teach. We can define a group via a membership condition and then attach policies to that group:
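Again the original listing is missing, so here is a minimal Python sketch of a data-dependent group, under the assumption that enrollment rows carry a role. The names `ENROLLMENT`, `ta_group` and `author_for` are hypothetical and are not taken from the paper.

```python
# Hypothetical enrollment rows: (user, class_id, role)
ENROLLMENT = [
    ("tina", "cs101", "ta"),
    ("tina", "cs202", "student"),
    ("sam", "cs101", "student"),
]


def ta_group(class_id):
    """Membership condition: users enrolled in class_id with the 'ta' role.
    It is recomputed from the table, so membership follows the data."""
    return {user for (user, cls, role) in ENROLLMENT if cls == class_id and role == "ta"}


def author_for(post, viewer):
    """Policy attached to the TA group: members see the real author of
    anonymous posts, but only in the classes they teach."""
    if not post["anonymous"] or viewer == post["author"]:
        return post["author"]
    if viewer in ta_group(post["class_id"]):
        return post["author"]
    return "Anonymous"


post_101 = {"class_id": "cs101", "author": "sam", "anonymous": True}
post_202 = {"class_id": "cs202", "author": "lee", "anonymous": True}
print(author_for(post_101, "tina"), author_for(post_202, "tina"))
```

The point of the sketch is that the same user gets different treatment per class, because the group is defined by a condition over the data rather than by a fixed list of users.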

Write policies (not supported in the current implementation) permit specification of allowed updates. For example:
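The article's own example is not reproduced here. As a stand-in, this Python sketch shows one possible write policy, with a rule I have assumed purely for illustration: only an existing instructor of a class may add rows that grant a staff role in that class. None of the names below come from the paper.

```python
STAFF_ROLES = {"instructor", "ta"}


def write_allowed(enrollment, writer, new_row):
    """Assumed rule: granting a staff role requires the writer to already be
    an instructor of the same class. Other writes are allowed."""
    _user, class_id, role = new_row
    if role not in STAFF_ROLES:
        return True
    return (writer, class_id, "instructor") in enrollment


def apply_write(enrollment, writer, new_row):
    """Return a new enrollment set including new_row, or raise if the policy forbids it."""
    if not write_allowed(enrollment, writer, new_row):
        raise PermissionError(f"{writer} may not write {new_row}")
    return enrollment | {new_row}


table = {("carol", "cs101", "instructor")}
table = apply_write(table, "carol", ("tina", "cs101", "ta"))
print(sorted(table))
```

Note that this rule is itself data-dependent (it reads the table it protects), which is the kind of case the article later flags as raising tricky consistency considerations.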

An aggregation policy could be used to rewrite any matching aggregation into a differentially-private version. The basis for this could be e.g. Chan et al.’s ‘Private and continual release of statistics’. Composing such policies with other policies remains an open research question.
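To give a feel for what a differentially private aggregate looks like, here is a small Python sketch of the basic Laplace mechanism applied to a single count. This is a simpler, swapped-in technique and not the continual-release scheme of Chan et al.; the function name `dp_count` and the sample data are assumptions.

```python
import random


def dp_count(rows, predicate, epsilon, rng):
    """Count the rows matching predicate, then add Laplace noise of scale 1/epsilon.
    A count changes by at most 1 when one row is added or removed (sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for row in rows if predicate(row))
    # The difference of two i.i.d. exponentials with rate epsilon
    # is Laplace-distributed with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


posts = [{"class_id": "cs101"}] * 40 + [{"class_id": "cs202"}] * 10
rng = random.Random(7)  # seeded only so the sketch is repeatable
noisy = dp_count(posts, lambda p: p["class_id"] == "cs101", epsilon=1.0, rng=rng)
print(round(noisy, 2))
```

A smaller `epsilon` means more noise and stronger privacy. A universe restricted by such a policy would see only the noisy value, never the underlying rows.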

Managing universes

A multiverse database consists of a base universe, which represents the database without any read-side privacy policies applied, and many user universes, which are transformed copies of the database.

For good query performance we’d like to pre-compute these per-user universes. If we do that naively though, we’re going to end up with a lot of universes to store and maintain and the storage requirements alone will be prohibitive.

A space- and compute-efficient multiverse database clearly cannot materialize all user universes in their entirety, and must support high-performance incremental updates to the user universes. It therefore requires partially-materialized views that support high-performance updates. Recent research has provided this missing key primitive. Specifically, scalable, parallel streaming dataflow computing systems now support partially-stateful and dynamically-changing dataflows. These ideas make an efficient multiverse database possible.

So, we make the database tables in the base universe be the root vertices of a dataflow, and as the base universe is updated records move through the flow into user universes. Where an edge in the dataflow graph crosses a universe boundary, any necessary dataflow operators to enforce the required privacy policies are inserted. All applicable policies are applied on every edge that transitions into a given user universe, so whichever path data takes to get there we know the policies will have been enforced.
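The flow described above can be sketched as a toy in Python: a base table acts as the root vertex, and each user universe applies its policy operators on the edge where records cross into it. This is not Noria's API. `BaseTable`, `UserUniverse` and the two policies are hypothetical names, and everything is fully materialized here for simplicity.

```python
class UserUniverse:
    def __init__(self, user, policies):
        self.user = user
        self.policies = policies  # operators on the universe-boundary edge
        self.rows = []            # this user's materialized view

    def receive(self, record):
        """Apply every boundary policy before the record enters the universe."""
        for policy in self.policies:
            record = policy(self.user, record)
            if record is None:    # suppressed: never enters this universe
                return
        self.rows.append(record)


class BaseTable:
    def __init__(self):
        self.rows = []
        self.universes = []

    def attach(self, universe):
        """Extend the dataflow for a new universe and backfill existing records."""
        for record in self.rows:
            universe.receive(record)
        self.universes.append(universe)

    def insert(self, record):
        """A base update flows along every edge into the attached universes."""
        self.rows.append(record)
        for universe in self.universes:
            universe.receive(record)


def suppress_private(user, record):
    return None if record["private"] and record["owner"] != user else record


def mask_owner(user, record):
    # Return a copy so the base universe's record is never modified.
    return record if record["owner"] == user else {**record, "owner": "hidden"}


base = BaseTable()
base.insert({"id": 1, "owner": "ann", "private": True})
ann = UserUniverse("ann", [suppress_private, mask_owner])
ben = UserUniverse("ben", [suppress_private, mask_owner])
base.attach(ann)
base.attach(ben)
base.insert({"id": 2, "owner": "ann", "private": False})
print(ann.rows, ben.rows)
```

Because the policies sit on the boundary edge, both the backfill in `attach` and later updates in `insert` pass through them, which mirrors the claim that data is checked whichever path it takes.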

We can build the dataflow graph up dynamically, extending the flows for a user’s universe the first time a query is executed. The amount of computation required on a base update can be reduced by sharing computation and cached data between universes. Implementing this as a joint partially-stateful dataflow is the key to doing this safely.

By reasoning about all users’ queries as a joint dataflow, the system can detect such sharing: when identical dataflow paths exist, they can be merged.

Logically distinct, but functionally equivalent dataflow vertices can also share a common backing store. Any record reaching such a vertex in a given universe implies that universe has access to it, so the system can safely expose the shared copy.

Just as user universes can be created on demand, so inactive universes can be destroyed on demand as well. Under the covers, these are all manipulations of the dataflow graph, which partially-stateful dataflow can support without downtime.
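The sharing and on-demand lifecycle described above might look something like this toy Python sketch. It assumes each universe's path can be summarized by a hashable "signature", and that universes with the same signature can share one vertex and its backing store. `JointDataflow` and the signatures are hypothetical, and the sketch skips backfilling a newly created vertex.

```python
class JointDataflow:
    """Toy joint dataflow: universes with identical paths share a single vertex."""

    def __init__(self):
        self._vertices = {}  # signature -> shared list of rows (backing store)
        self._paths = {}     # user -> signature of that user's path

    def create_universe(self, user, signature):
        # Reuse the vertex if an identical path already exists.
        self._vertices.setdefault(signature, [])
        self._paths[user] = signature

    def destroy_universe(self, user):
        signature = self._paths.pop(user)
        # Drop the shared vertex only when no remaining universe uses it.
        if signature not in self._paths.values():
            del self._vertices[signature]

    def push(self, signature, row):
        if signature in self._vertices:
            self._vertices[signature].append(row)

    def read(self, user):
        return self._vertices[self._paths[user]]

    def vertex_count(self):
        return len(self._vertices)


student = ("class=cs101", "mask_anonymous_authors")
staff = ("class=cs101", "show_authors")

flow = JointDataflow()
flow.create_universe("alice", student)
flow.create_universe("bob", student)   # merged with alice's path
flow.create_universe("carol", staff)
flow.push(student, {"id": 1, "author": "Anonymous"})
print(flow.vertex_count(), flow.read("bob"))
```

Three universes end up with only two vertices, and destroying a universe frees state only once the last user of that path has gone.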

Prototype evaluation

The authors have built a prototype implementation of these ideas based on the Noria dataflow engine. It runs to about 2,000 lines of Rust. A Piazza-style class forum discussion application with 1M posts, 1,000 classes, and a privacy policy allowing TAs to see anonymous posts is used as the basis for benchmarking.

The team compare three configurations: the prototype with 5,000 active user universes, a MySQL implementation with inlined privacy policies (‘with AP’), and a MySQL implementation that does not enforce the privacy policy (‘without AP’).

Since the prototype serves reads from a pre-computed universe stored in memory, cached results are fast and make for a very favourable comparison against MySQL. Writes are significantly slower though (about 2x), although much of this overhead is in the implementation rather than essential. Memory footprint is 0.5GB with one universe and 1.1GB with 5,000 universes. Introducing a shared record store for identical queries reduces their space footprint by 94%.
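As a back-of-the-envelope reading of the two memory figures above, the short Python calculation below estimates the average cost of each additional universe. It assumes linear growth between the two reported data points and decimal gigabytes; the article does not state either, so treat the result as a rough illustration only.

```python
GB = 1000 ** 3  # assuming decimal gigabytes

one_universe_bytes = 0.5 * GB    # reported footprint with one universe
many_universes_bytes = 1.1 * GB  # reported footprint with 5,000 universes
universes = 5000

# Average extra memory per universe beyond the first, assuming linear growth.
per_extra_universe = (many_universes_bytes - one_universe_bytes) / (universes - 1)
print(f"{per_extra_universe / 1000:.0f} kB per additional universe")
```

On those assumptions each extra universe costs on the order of a hundred kilobytes in this benchmark, which helps explain why running millions of universes is still described as an open problem.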

These results are encouraging, but a realistic multiverse database must further reduce memory overhead and efficiently run millions of user universes across machines. Neither Noria nor any other current dataflow system supports execution of the huge dataflows that such a deployment requires. In particular, changes to the dataflow must avoid full traversals of the dataflow graph for faster universe creation.

Support for write authorization policies (with some tricky consistency considerations for data-dependent policies) is future work, as is the development of a policy-checker (perhaps similar to Amazon’s SMT-based policy checker for AWS) to help ensure policies themselves are consistent and complete.

Our initial results indicate that a large, dynamic, and partially-stateful dataflow can support practical multiverse databases that are easy to use and achieve good performance and acceptable overheads. We are excited to further explore the multiverse database paradigm and associated research directions.



About groovy777

Toronto gal. Curious about people, life, the universes.