The Guide To Building Sync-friendly APIs

Apr 19, 2022

How to build a performant endpoint for pulling data

Pulling objects from an API for the purpose of data synchronization adds some unique requirements to how an endpoint should work: the API should allow the clients to continuously fetch only the new changes while making sure that performance stays consistent regardless of the amount of data.

The simplest approach would be to use page numbers backed by some variant of OFFSET ... LIMIT query. While this is a popular choice, we can do much better than that.

Throughout this guide we’ll build an endpoint for fetching customer data (contacts) from an imaginary CRM system.

Note: the guide describes how to build a JSON-based RESTful API with a relational database as the underlying storage, but the ideas should translate to other technologies.

The client’s perspective

Let’s see how an ideal API would work from the client’s perspective. The clients will start by building the initial URL to request the first page of results:

GET https://api.acme-crm.com/contacts

We also want to let them start from any point in time:

GET https://api.acme-crm.com/contacts?after=2022-01-01T00:00:00Z

The response will contain the actual data with the information about the paging:

{
  "contacts": [
    ...
  ],
  "page": {
    "has_more": true,
    "next": "/contacts?after=..."
  }
}

The clients will then use the has_more attribute to decide if they want to ask for more data immediately or delay the next request to preserve the API quota. Make sure to limit the amount of requests per minute and respond with HTTP 429 Too Many Requests as needed.

From this point on, the clients will use the next attribute to build the URL to the next page of data:

GET https://api.acme-crm.com/contacts?after=...

This process will continue indefinitely.

In this approach the client-side implementation is straightforward and allows the clients to be as efficient as possible - a contact is returned by the API more than once only if it ever gets updated.

Server-side implementation

Before jumping into the implementation, we need to address the following case. Imagine a situation where 15 contacts have the same modification timestamp and we want to fetch a page of 10 items. Which contacts should we pick?

We’ll solve this by using the contact’s unique ID as a tie-breaker when the timestamps are equal: we’ll sort the contacts by their modification timestamp first and then by their unique ID. This way, we retain the modification order while making sure that we always know which contact comes before. This also means that we need to add another parameter to the query:

GET https://api.acme-crm.com/contacts?after=2022-01-10T12:00:00Z&after_id=3712

We’ll see how it’s used when we build our query. Now we’re ready to follow the steps listed below.

Step 1: Track modification time

We’ll ensure that we track the last modification time of a contact. The timestamp should include the timezone information:

{
  "id": 3712,
  "updated_at": "2020-01-05T10:30:15Z",
  ...
}

Every time the contact is updated, the "updated_at" field needs to be modified as well.

We’re using the extended ISO 8601 format which supports higher precision (like milliseconds) if needed. The timestamp could also be expressed as a Unix time, e.g. 1578220215. Consider using the precision of at least seconds for either type.

Step 2: Enable performant queries

We’ll add an index on the contacts table to make the querying faster:

CREATE INDEX contacts_paging ON contacts (updated_at, id);

Step 3: Fetch the querying params

As we’ve seen earlier, we need to support the following URLs:

https://api.acme-crm.com/contacts
https://api.acme-crm.com/contacts?after=2020-01-05T10:30:15Z
https://api.acme-crm.com/contacts?after=2020-01-05T10:30:15Z&after_id=3712

Let's fetch the query parameters and provide default values for them:

after = params["after"] || "1970-01-01T00:00:00Z"
after_id = parse_int(params["after_id"]) || 0

We’ll also allow the clients to set the page size within some limits:

limit = ensure_between(parse_int(params["limit"]) || 100, 1, 1000)

Note: for brevity we’re not showing the full validation of the incoming parameters, but you should make sure to gracefully handle missing, blank and malformed values.

Step 4: Fetch the subset of the data

With the parameters in place, we’re ready to build the query. The query consists of three steps: filtering, sorting and limiting:

SELECT   *
FROM     contacts
WHERE    (updated_at = $after AND id > $after_id) OR (updated_at > $after)
ORDER BY updated_at, id
LIMIT    $limit;

Note how we’re both filtering and sorting the contacts by the two fields (timestamp and ID) before limiting the result set - all done in the same database query. This ensures that we only return contacts that were not returned by the previous fetch.

If you need to move parts of that logic into your application code, you need to make sure that the filtering, sorting and limiting are applied in the correct order, otherwise you’ll risk skipping some data.

Important: make sure to use the same precision when comparing timestamps. You can either expose the timestamp with the full precision in the API or truncate the column value when filtering:

WHERE (DATE_TRUNC('second', updated_at) = $after AND id > $after_id) OR (DATE_TRUNC('second', updated_at) > $after)

Extra step: improve performance

We can improve the query performance by avoiding using an OR condition. We can use the so-called row values:

SELECT   *
FROM     contacts
WHERE    (updated_at, id) > ($after, $after_id)
ORDER BY updated_at, id
LIMIT    $limit;

This has the same semantics as the query we used before: first compare by the updated_at, then by the id. For example, both of these conditions evaluate to TRUE:

SELECT
  (2022, 5) > (2020, 10),
  (2022, 5) > (2022, 1);

If your database does not support row values, you can still avoid an OR condition with a more complicated, but equivalent filter:

SELECT   *
FROM     contacts
WHERE    updated_at >= $after AND NOT (updated_at = $after AND id <= $after_id)
ORDER BY updated_at, id
LIMIT    $limit;

Step 5: Return the results

Now we can provide the query with the parameters and run it:

contacts = run_query("SELECT ...", after: after, after_id: after_id, limit: limit)

Since the results come ordered from the database, we can just grab the last contact to get the next paging params:

has_more = array_length(contacts) == limit

next_after = after
next_after_id = after_id

if array_length(contacts) > 0 do
  last = array_last(contacts)
  next_after = last.updated_at
  next_after_id = last.id
end

next_page = "/contacts?after=#{next_after}&after_id=#{next_after_id}"

return {
  "contacts": contacts,
  "page": {
    "has_more": has_more,
    "next": next_page
  }
}

Note how we initialize next_after and next_after_id: if there are no updated contacts, the clients should make the exact same request.

Extra step: avoid querying for empty data

If there are 10 contacts and the page size is 10, the first call will return "has_more": true even though there are no more contacts. We can fix this by asking for an extra contact and doing more work when fetching:

extra_contacts = run_query("SELECT ...", after: after, after_id: after_id, limit: limit + 1)
contacts = array_take(extra_contacts, limit)
has_more = array_length(extra_contacts) == limit + 1

The rest of the code stays the same.

Step 6: Handle aggregates

If our contact aggregates other objects we need to do some extra work. Consider the following example:

{
  "id": 3712,
  "updated_at": "2020-01-05T10:30:15Z",
  "addresses": [
   {
       "id": 6311,
       "street": "South Avenue",
       "primary": true
    },
    {
       "id": 6312,
       "street": "Central Avenue",
       "primary": false
    },
    ...
  ]
}

In this model a contact has multiple addresses. Imagine there’s an endpoint that allows modifying a specific address, like so:

PUT https://api.acme-crm.com/contacts/3712/addresses/6311
{"address": {... }}

We need to let the clients know that a change has been made to the contact. We can handle this in two ways:

  • Make sure to transactionally update the updated_at field on the contact whenever an address is added, updated or deleted; or
  • Provide a sync-friendly endpoint for the addresses; the endpoint must not be scoped by the contact ID and should instead include that ID in the payload:
  GET https://api.acme-crm.com/addresses

  {
    "addresses": [
    {
        "id": 6311,
        "contact_id": 3712,
        "updated_at": "2022-01-10T12:00:00Z",
        "street": "South Avenue",
        "primary": true
    },
      ...
    ],
    "page": {
      ...
    }
  }

Depending on the code organization and the tech stack used, one option might be easier to implement than the other.

Wrap-up

By following this guide we’ve built an API that allows clients to continuously pull data without compromising the application performance. Now we’re ready to build some great integrations!

Building CRM integrations?

Xkit delivers deep integrations for every CRM with one build.

Get access ▶