May 18, 2015
A Quest for Syncable Private Online Storage

Apps need to sync data, whether documents or preferences, across our devices. Syncable means that a modification made on one device is transferred to the other devices quickly. Private means that data is encrypted with a user-provided key before it is uploaded to a server, so that neither the cloud provider nor the app developer can look inside your documents.

These requirements seem to contradict each other. Encrypted data is expensive to sync: with many encryption schemes, changing a single byte and re-encrypting produces ciphertext that looks entirely different, so a full sync has to copy every byte again. However, if we design the storage file to be append-only and use a stream cipher, both goals can be met.

An append-only file is opened for both reading and writing, but writes always go to the end of the file. For C programmers, such a file is opened like this:

fopen("datafile", "a+");

Because the file is append-only, it is easy to sync: compare file sizes and download only the missing data at the end. Another benefit is that existing data is never overwritten, so a bad write cannot damage what is already stored. When things go wrong, we can simply revert to an earlier version. A stream cipher encrypts data on the fly as it is appended, so there is no need to re-encrypt the whole file from the start. In effect, we also get encrypted, version-controlled storage.
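To illustrate why appending does not force re-encryption, here is a minimal sketch in which the keystream depends only on the key and the absolute file offset. The names stream_xor, keystream_byte and mix64 are made up for this example, and the keystream generator is a toy that is NOT secure. It only stands in for a real stream cipher that can be positioned at an arbitrary offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy mixing function (SplitMix64-style). NOT a secure cipher: it is a
   placeholder for the keystream generator of a real stream cipher. */
static uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Keystream byte for absolute file position pos (hypothetical helper). */
static uint8_t keystream_byte(uint64_t key, uint64_t pos) {
    uint64_t block = mix64(key ^ mix64(pos / 8));
    return (uint8_t)(block >> (8 * (pos % 8)));
}

/* Encrypt or decrypt len bytes that live at file offset `offset`.
   XOR is its own inverse, so the same call does both. */
void stream_xor(uint64_t key, uint64_t offset,
                const uint8_t *in, uint8_t *out, size_t len) {
    for (size_t i = 0; i < len; i++)
        out[i] = (uint8_t)(in[i] ^ keystream_byte(key, offset + i));
}
```

Encrypting a new chunk only needs the current file size as the offset. Bytes already in the file are left untouched, so the ciphertext that other devices have already downloaded stays valid.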

The drawback of append-only storage is that, unlike a conventional database, it needs an external index file for fast queries. The index has to be built the first time the data is loaded, and it must be updated whenever new data arrives. It can also be encrypted, so that even if other people gain access to your device, your index is still safe.
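As a rough sketch of such an index, assume each record in the (already decrypted) log is laid out as a 4-byte key, a 4-byte payload length, and then the payload. This layout and the names LogIndex, log_append, index_update and index_find are assumptions made for illustration only. The index remembers how far it has scanned, so the first call builds it from scratch and later calls only look at newly appended bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS 64
#define HEADER_SIZE 8   /* assumed layout: u32 key, u32 length, payload */

typedef struct {
    uint32_t key[MAX_KEYS];
    size_t offset[MAX_KEYS];  /* offset of the latest record for each key */
    size_t count;
    size_t indexed_upto;      /* log bytes already scanned */
} LogIndex;

static void write_u32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v; p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
}

static uint32_t read_u32(const uint8_t *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Append one record; returns the new log size (unchanged if it does not fit). */
size_t log_append(uint8_t *log, size_t size, size_t cap,
                  uint32_t key, const void *payload, uint32_t len) {
    if (size > cap || cap - size < HEADER_SIZE || len > cap - size - HEADER_SIZE)
        return size;
    write_u32(log + size, key);
    write_u32(log + size + 4, len);
    memcpy(log + size + HEADER_SIZE, payload, len);
    return size + HEADER_SIZE + len;
}

/* Scan only the bytes appended since the last call; returns records indexed. */
size_t index_update(LogIndex *idx, const uint8_t *log, size_t log_size) {
    size_t added = 0, pos = idx->indexed_upto;
    if (pos > log_size) return 0;
    while (log_size - pos >= HEADER_SIZE) {
        uint32_t key = read_u32(log + pos);
        uint32_t len = read_u32(log + pos + 4);
        if (len > log_size - pos - HEADER_SIZE) break;  /* partial record */
        size_t i = 0;
        while (i < idx->count && idx->key[i] != key) i++;
        if (i == idx->count) {
            if (idx->count == MAX_KEYS) break;          /* table is full */
            idx->key[idx->count++] = key;
        }
        idx->offset[i] = pos;                           /* newest record wins */
        pos += HEADER_SIZE + len;
        added++;
    }
    idx->indexed_upto = pos;
    return added;
}

/* Offset of the latest record for key, or -1 if the key is unknown. */
long index_find(const LogIndex *idx, uint32_t key) {
    for (size_t i = 0; i < idx->count; i++)
        if (idx->key[i] == key) return (long)idx->offset[i];
    return -1;
}
```

A linear table keeps the sketch short. A real index would more likely be a hash table or a sorted structure stored in its own file.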

Normally a remote server is needed to help devices sync with each other. The server knows only meta information, for example the size of the data file and the timestamps of client updates, and it can do some basic conflict resolution. When a client tries to push (append) new data, the server requires it to provide its local head position (the same as its file size) and a checksum. If the client's head position is not equal to the server's head, the server rejects the update. The client must first catch up with the server's head by downloading the missing data and resolving conflicts locally. Since the server has no idea what the contents mean, keeping the contents in a valid state is the responsibility of the clients. A badly behaved client could post garbage data to the server. Even then, we can still revert the data to an earlier version and, if necessary, revoke the access of the bad client.
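The push check described above might look roughly like the sketch below. It assumes the checksum is a running FNV-1a hash over the whole file, which is only one possible choice, and the names ServerMeta, PushResult, server_init and server_push are hypothetical. The server compares metadata only and never interprets the (encrypted) bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV_INIT  0x811C9DC5u   /* FNV-1a 32-bit offset basis */
#define FNV_PRIME 0x01000193u   /* FNV-1a 32-bit prime */

/* Extend a running FNV-1a checksum with newly appended bytes. */
uint32_t fnv1a_extend(uint32_t sum, const uint8_t *data, size_t len) {
    for (size_t i = 0; i < len; i++) {
        sum ^= data[i];
        sum *= FNV_PRIME;
    }
    return sum;
}

/* Everything the server needs to know about a data file. */
typedef struct {
    uint64_t head;      /* current file size */
    uint32_t checksum;  /* running checksum of bytes [0, head) */
} ServerMeta;

typedef enum { PUSH_ACCEPTED, PUSH_STALE_HEAD, PUSH_BAD_CHECKSUM } PushResult;

void server_init(ServerMeta *m) {
    m->head = 0;
    m->checksum = FNV_INIT;
}

/* Accept an append only if the client is exactly at the server's head. */
PushResult server_push(ServerMeta *m, uint64_t client_head,
                       uint32_t client_checksum,
                       const uint8_t *data, size_t len) {
    if (client_head != m->head) return PUSH_STALE_HEAD;
    if (client_checksum != m->checksum) return PUSH_BAD_CHECKSUM;
    /* A real server would also append `data` to the stored file here. */
    m->checksum = fnv1a_extend(m->checksum, data, len);
    m->head += len;
    return PUSH_ACCEPTED;
}
```

A client that receives PUSH_STALE_HEAD would download the bytes between its own head and the server's head, resolve conflicts locally, and then retry the push.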

Like git, every client has a full copy of the data, so it can switch to another remote storage provider at will.

Syncing can also be peer-to-peer. Two local stores can negotiate a common head position by exchanging checksums of different portions of their data, and then merge their differences from that point on.
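One possible way to negotiate a common head is a binary search over prefix lengths, comparing one checksum per step. The sketch below simulates both peers in a single process. In a real exchange each peer would compute prefix_checksum on its own data and send only the result. The function names are made up for this example, FNV-1a is an assumed checksum, and the search relies on the two files sharing a prefix and then diverging, with checksum collisions ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a checksum of the first len bytes (assumed checksum for this sketch). */
uint32_t prefix_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0x811C9DC5u;
    for (size_t i = 0; i < len; i++) {
        sum ^= data[i];
        sum *= 0x01000193u;
    }
    return sum;
}

/* Largest length n such that both stores have matching checksums for
   bytes [0, n). Each loop iteration is one checksum exchange. */
size_t common_head(const uint8_t *a, size_t a_len,
                   const uint8_t *b, size_t b_len) {
    size_t lo = 0;                             /* empty prefix always matches */
    size_t hi = a_len < b_len ? a_len : b_len; /* cannot exceed shorter file */
    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;
        if (prefix_checksum(a, mid) == prefix_checksum(b, mid))
            lo = mid;       /* still identical up to mid */
        else
            hi = mid - 1;   /* diverged somewhere before mid */
    }
    return lo;
}
```

The number of exchanges grows with the logarithm of the file size. In practice the result would be rounded down to a record boundary before merging.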

Going forward, this is how app developers should protect users' private data, and this is how we can close the backdoors to user data while still providing the convenience of fast syncing.