comparison of basic incremental one-way data sync mechanisms

in this post I will compare 3 data sync mechanisms that make different trade-offs in two crucial aspects: space and time.

the problem: we have 2 data sources with the same data specification, one is read/write enabled, the other is read-only. what we want is that whenever an entity is added/updated/deleted in the first source, the change gets reflected in the second source.

a good example of this is client-server database replication: a single serverDB and many clientDB instances all have the table myTable, clients constantly make changes to serverDB, and we want every change made to serverDB to propagate to all clients.

1st method: the history log method (space: big, time: small)

just as the name implies, you keep a history log of each and every change that happens to the data. the log entity can look something like:

interface Operation {
    type: "add" | "update" | "delete";
    author?: string;          // who did it, optional but useful since you are going to have an extra model anyway
    time: number;             // timestamp, !!watch out for different timezones!!
    target: string;           // the model that had the change, ex: myTable
    rowId: string;            // id of the entry that got added, updated or deleted
    change: object;           // the changed data, ex: updated/created column names + values
}
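
a minimal sketch of recording such an operation at write time — the Db/Tx interfaces here are hypothetical stand-ins for whatever data store you use; the point is that the row change and its log entry are written in the same transaction:

interface Tx {
    update(table: string, rowId: string, change: object): Promise<void>;
    insert(table: string, row: object): Promise<void>;
}

interface Db {
    transaction(fn: (tx: Tx) => Promise<void>): Promise<void>;
}

async function updateRow(db: Db, table: string, rowId: string, change: object, author: string): Promise<void> {
    await db.transaction(async (tx) => {
        await tx.update(table, rowId, change); // apply the actual change
        const op: Operation = {
            type: "update",
            author,
            time: Date.now(), // UTC millis, which sidesteps the timezone issue
            target: table,
            rowId,
            change,
        };
        await tx.insert("operations", op); // ...and log it in the same transaction
    });
}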

in this case what happens is (a code sketch of both sides follows the list):

  • client: would you please hand me all operations that happened after time x ?
  • server: retrieves all operations where time > x and sends them back to client
  • client: applies received operations in chronological order
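
a minimal sketch of both sides, reusing the Operation shape above; the Map stands in for the client's data store and is purely illustrative:

// server side: everything after time x, oldest first
function operationsSince(log: Operation[], x: number): Operation[] {
    return log.filter((op) => op.time > x).sort((a, b) => a.time - b.time);
}

// client side: replay the operations in chronological order
function applyOperations(local: Map<string, object>, ops: Operation[]): void {
    for (const op of ops) {
        const key = `${op.target}/${op.rowId}`;
        if (op.type === "delete") {
            local.delete(key);
        } else {
            // "add" inserts, "update" merges the changed columns in
            local.set(key, { ...(local.get(key) ?? {}), ...op.change });
        }
    }
}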

pros:

  • you now have a log of everything that happened to the data, so you know who's to blame when something goes wrong.

cons:

  • you have to maintain storage for every change operation since the beginning of time; any gap in the chain of history will force an out-of-date client to obtain a fresh snapshot of the data before it can start syncing, which decreases the usability of the system.
  • data redundancy: the current values of an entity exist in 2 copies, one in the original table/data store, another in the operation history table.

2nd method: the timestamps trio method (space: mid, time: small)

each model definition gets 3 extra timestamp attributes (createdAt, updatedAt and a nullable deletedAt), each one updated when the corresponding change type happens. deletedAt starts out as null and is only set when the entity is deleted, i.e. a soft delete.
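
a sketch of what a synced model could look like; the payload column is made up for illustration:

interface MyTableRow {
    id: string;
    name: string;             // hypothetical payload column
    createdAt: number;        // set once, on insert
    updatedAt: number;        // bumped on every update
    deletedAt: number | null; // null while alive, set on (soft) delete
}

// deleting becomes marking: the row stays in the table
function softDelete(row: MyTableRow): void {
    row.deletedAt = Date.now();
}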

what happens is (again, a sketch follows the list):

  • client: would you please hand me all changes that happened after time x ?
  • server: retrieves all entities with createdAt, updatedAt or deletedAt > x
  • server: sorts the entities chronologically, derives an operation type for each one (deletedAt set → delete, createdAt > x → add, otherwise update), sends them back to client
  • client: applies the operations
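
a minimal sketch of the server side, reusing the MyTableRow shape above:

// pick every row touched after x and derive what happened to it
function changesSince(rows: MyTableRow[], x: number) {
    return rows
        .filter((r) => r.createdAt > x || r.updatedAt > x || (r.deletedAt ?? 0) > x)
        .map((r) => ({
            time: Math.max(r.createdAt, r.updatedAt, r.deletedAt ?? 0),
            type: r.deletedAt !== null ? "delete" : r.createdAt > x ? "add" : "update",
            entity: r,
        }))
        .sort((a, b) => a.time - b.time); // chronological order for the client
}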

pros:

  • you have a bit of historical information: when something goes wrong, you know when the latest change happened.
  • you have access to deleted entities since they are soft-deleted.
  • no extra model/table.

cons:

  • soft deletes add extra complexity: any operation that targets all/filtered entities should now only target entities whose deletedAt is null, which may require adding wrapper code around the data store access logic (see the sketch after this list).
  • adding 3 attributes to every model participating in the sync can still be a big overhead.
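
the wrapper idea, sketched: route every read through one helper that hides soft-deleted rows, instead of repeating the filter everywhere.

declare const allRows: MyTableRow[]; // stands in for a real query result

// only the rows that are still alive
function liveRows(rows: MyTableRow[]): MyTableRow[] {
    return rows.filter((r) => r.deletedAt === null);
}

// every "all/filtered entities" query now starts from liveRows
const visible = liveRows(allRows).filter((r) => r.name.startsWith("a"));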

3rd method: the checksumming method (space: small, time: big)

as the name implies, here is what happens (a sketch of the comparison follows the list):

  • client: checksums every single entity and puts the result in a form like [{id, data_checksum}, ..]
  • client: would you please hand me a list of changes that happened to the data represented by these checksums ?
  • server: compares the client checksums with local checksums and figures out which entities are present in the local store but absent from the client store (added), which entities have different checksums (updated) and which entities are present in the client store but absent from the local store (deleted)
  • server: replies with changes in a form like [{operationType, entity}, ..]
  • client: applies changes, might poll server until checksums match
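
a minimal sketch of the server-side comparison, using node's crypto for the checksum; the Map stands in for the server's data store, and note that JSON.stringify is key-order sensitive, so a real implementation would want a canonical serialization:

import { createHash } from "crypto";

function checksum(entity: object): string {
    return createHash("sha256").update(JSON.stringify(entity)).digest("hex");
}

type ClientDigest = { id: string; data_checksum: string };
type Change = { operationType: "add" | "update" | "delete"; entity: object };

function diff(local: Map<string, object>, digests: ClientDigest[]): Change[] {
    const changes: Change[] = [];
    const seen = new Set<string>();
    for (const { id, data_checksum } of digests) {
        seen.add(id);
        const entity = local.get(id);
        if (entity === undefined) {
            changes.push({ operationType: "delete", entity: { id } }); // client has it, server no longer does
        } else if (checksum(entity) !== data_checksum) {
            changes.push({ operationType: "update", entity });         // present on both sides, but different
        }
    }
    for (const [id, entity] of local) {
        if (!seen.has(id)) changes.push({ operationType: "add", entity }); // server has it, client doesn't
    }
    return changes;
}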

pros:

  • no extra data models or model attributes.
  • data integrity gets the strongest guarantee: no unhandled data transfer scenario is left, and even a change applied directly to the database, skipping the history log table or the entity timestamps, will still be picked up.

cons:

  • requires more cpu cycles (every entity gets checksummed on both sides) and more data to be transferred over the network.

that's it
