Provenance hoisting - HealthKite

Every HealthKit sample carries device and sourceRevision properties. At the Swift / in-memory layer those are references — one HKDevice instance for “My Apple Watch” can be shared by 5,000 samples. At the JSON layer there’s no concept of references, so a naive encoding inlines those structs per sample. HealthKite MCP hoists provenance into a top-level sources array and replaces inline provenance with integer src indices.

Shape

Every response that contains sample-level data uses the same sources block:

"sources": [
  {
    "id": 0,
    "device": {
      "name": "Apple Watch",
      "model": "Watch",
      "manufacturer": "Apple Inc.",
      "hardwareVersion": "Watch6,11",
      "softwareVersion": "26.4"
    },
    "sourceRevision": {
      "source": {
        "name": "My Apple Watch",
        "bundleIdentifier": "com.apple.health.76CB32B3-FB98-41D2-BE82-46A492C2A369"
      },
      "version": "26.4",
      "productType": "Watch6,11",
      "operatingSystemVersion": "26.4.0"
    }
  },
  {
    "id": 1,
    "device": null,
    "sourceRevision": {
      "source": { "name": "My iPhone", "bundleIdentifier": "com.eightsleep.Eight" },
      "version": "2",
      "productType": "iPhone15,4",
      "operatingSystemVersion": "26.4.2"
    }
  }
]

Each top-level container (workout detail, quantity series, sleep session list, day snapshot) carries its own src: Int referencing one of these entries. Sub-streams (e.g., samples inside a workout) can also carry a per-sample src array in the rare case where a stream’s samples come from multiple sources.
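
As a rough consumer-side illustration, resolving a src value is a lookup into the sources array of the same response. The sketch below is in Python and is not part of HealthKite itself. Only the sources and src fields come from the shape above. The "workout" key and the helper names index_sources and resolve_source are hypothetical.

```python
def index_sources(response):
    """Map each entry's id to the entry itself."""
    return {entry["id"]: entry for entry in response["sources"]}


def resolve_source(response, src):
    """Return the sources entry that a src value refers to."""
    return index_sources(response)[src]


# Hypothetical response: "sources" and "src" follow the shape above,
# the "workout" container key is made up for illustration.
response = {
    "sources": [
        {"id": 0, "device": {"name": "Apple Watch"},
         "sourceRevision": {"source": {"name": "My Apple Watch"}}},
        {"id": 1, "device": None,
         "sourceRevision": {"source": {"name": "My iPhone"}}},
    ],
    "workout": {"src": 0},
}

entry = resolve_source(response, response["workout"]["src"])
print(entry["sourceRevision"]["source"]["name"])
```

Looking entries up by their id field, rather than by list position, keeps the sketch correct even if a consumer reorders or filters the array locally.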

Rules

id: 0 is the primary source for the top-level object. For a WorkoutDetail, it’s the source that recorded the workout itself. For a SleepSession, it’s the source for the bulk of the session’s samples. For a quantity series, it’s the source of the first sample.
Tuple identity: a SourceEntry is uniquely identified by (device, sourceRevision). device == nil is distinct from a present device, even when the sourceRevision matches.
Append order: entries appear in the order their tuples were first encountered while walking samples. The workout/session/series’ own source is always first.
Omitted when uniform: per-sample src arrays inside columnar streams are omitted entirely if every sample in that stream has src == 0. This is the common case and keeps payloads small.
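
The rules above can be sketched as a small hoisting pass. This is an illustrative Python version, not the HealthKite implementation (which operates on HealthKit objects in Swift). The function names hoist and provenance_key, and the assumption that each input sample carries inline "device" and "sourceRevision" keys, are hypothetical.

```python
import json


def provenance_key(item):
    """Identity of a source entry: the (device, sourceRevision) pair.

    Serialising with sorted keys gives a hashable key. A missing device
    serialises as "null", so it never equals a present device.
    """
    return (json.dumps(item.get("device"), sort_keys=True),
            json.dumps(item["sourceRevision"], sort_keys=True))


def hoist(primary, samples):
    """Return (sources, src) for one stream.

    primary is the container's own provenance and always becomes id 0.
    src is None when every sample resolves to id 0 (the uniform case).
    """
    sources, ids = [], {}

    def entry_id(item):
        key = provenance_key(item)
        if key not in ids:
            # Append order: first encounter decides the id.
            ids[key] = len(sources)
            sources.append({"id": ids[key],
                            "device": item.get("device"),
                            "sourceRevision": item["sourceRevision"]})
        return ids[key]

    entry_id(primary)
    src = [entry_id(s) for s in samples]
    return sources, (None if all(i == 0 for i in src) else src)


watch = {"device": {"name": "Apple Watch"},
         "sourceRevision": {"source": {"name": "My Apple Watch"}, "version": "26.4"}}
phone = {"device": None,
         "sourceRevision": {"source": {"name": "My iPhone"}, "version": "2"}}
# Same sourceRevision as the watch, but no device: a distinct entry.
watch_no_device = {"device": None, "sourceRevision": watch["sourceRevision"]}

uniform_sources, uniform_src = hoist(watch, [watch, watch, watch])
mixed_sources, mixed_src = hoist(watch, [watch, phone, watch_no_device, phone])
print(uniform_src, mixed_src)
```

In the uniform case the per-sample array is dropped entirely, while the mixed case yields three entries because the device-less variant of the watch's sourceRevision counts as its own tuple.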

Why hoist

For a 33-minute run with 5,000 samples coming from a single Apple Watch, naive inlining emits 5,000 copies of the same device and sourceRevision blocks. Hoisting drops payload size by ~70% in real measurements. The structure is also more honest: there’s one watch involved, so the JSON should reflect one watch entry.
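
To get a feel for where the saving comes from, the following Python sketch compares the serialised size of a naive and a hoisted encoding of a synthetic single-source stream. The data and the "samples" and "v" keys are made up for illustration. The resulting ratio depends on how large each sample's own fields are, so it is not expected to reproduce the ~70% figure quoted above.

```python
import json

# Abbreviated provenance block, modelled on the sources entry shown earlier.
provenance = {
    "device": {"name": "Apple Watch", "model": "Watch", "manufacturer": "Apple Inc."},
    "sourceRevision": {"source": {"name": "My Apple Watch"}, "version": "26.4"},
}

# Synthetic stream: 5,000 heart-rate-like values, all from one source.
values = [120 + (i % 40) for i in range(5000)]

# Naive: every sample repeats the full provenance block.
naive = {"samples": [{"v": v, **provenance} for v in values]}

# Hoisted: one sources entry, one container-level src, no per-sample src.
hoisted = {"sources": [{"id": 0, **provenance}], "src": 0, "samples": values}

naive_bytes = len(json.dumps(naive))
hoisted_bytes = len(json.dumps(hoisted))
print(naive_bytes, hoisted_bytes)
```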

Multi-source case

Many users get health data from more than one source — for example, Apple Watch for workouts and HR, and Eight Sleep for sleep tracking. A weekly snapshot response will have multiple entries in sources. Each sample references the appropriate one by src. Consumers wanting to filter by source can do so with one integer comparison instead of struct equality.
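
A minimal Python sketch of that integer filter follows. The columnar layout with parallel "t", "v", and "src" arrays is an assumption, since the exact stream fields are not specified here, and samples_from is a hypothetical helper. A stream without a src array is treated as all zeros, per the "omitted when uniform" rule.

```python
def samples_from(stream, wanted):
    """Return (t, v) pairs whose source id equals wanted."""
    n = len(stream["v"])
    # An omitted src array means every sample has src == 0.
    src = stream.get("src") or [0] * n
    return [(t, v)
            for t, v, s in zip(stream["t"], stream["v"], src)
            if s == wanted]


# Hypothetical streams: one mixed, one uniform (no src array).
mixed = {"t": [0, 5, 10, 15], "v": [61, 64, 58, 60], "src": [0, 0, 1, 0]}
uniform = {"t": [0, 5], "v": [61, 64]}

print(samples_from(mixed, 1))
print(samples_from(uniform, 0))
```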

Stability

Source id values are only stable within a single response. The same physical Apple Watch may be id: 0 in one response and id: 1 in another, depending on which sample types were queried and in what order. Always resolve src against the sources array in the same response — do not cache id mappings across requests.
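
If a consumer needs to recognise the same source across responses, one option is to key on the entry's content rather than its id. The Python sketch below is only an illustration. Treating bundleIdentifier plus device hardwareVersion as a stable key is an assumption of this example, not something HealthKite guarantees, and the helpers make_entry, stable_key, and ids_by_key are hypothetical.

```python
def make_entry(id_, bundle, hardware=None):
    """Build a trimmed-down sources entry for the example."""
    device = {"hardwareVersion": hardware} if hardware else None
    return {"id": id_, "device": device,
            "sourceRevision": {"source": {"bundleIdentifier": bundle}}}


def stable_key(entry):
    """Content-based key (assumed stable for this sketch)."""
    device = entry.get("device") or {}
    source = entry["sourceRevision"]["source"]
    return (source["bundleIdentifier"], device.get("hardwareVersion"))


def ids_by_key(response):
    """Map each content key to the id it has in this response only."""
    return {stable_key(e): e["id"] for e in response["sources"]}


# The same two sources, encountered in a different order per response.
response_a = {"sources": [make_entry(0, "com.example.watch", "Watch6,11"),
                          make_entry(1, "com.eightsleep.Eight")]}
response_b = {"sources": [make_entry(0, "com.eightsleep.Eight"),
                          make_entry(1, "com.example.watch", "Watch6,11")]}

watch_key = ("com.example.watch", "Watch6,11")
print(ids_by_key(response_a)[watch_key], ids_by_key(response_b)[watch_key])
```

The id for the watch differs between the two responses, so the mapping has to be rebuilt from each response's own sources array.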
