Columnar encoding - HealthKite

Sample-rich workouts are dominated by repetition: thousands of samples per workout, each with the same shape and ~70% structural overhead. HealthKite MCP encodes those streams column-major instead of row-major.

The problem

A typical 30-minute run produces ~5,000 quantity samples across HR, distance, energy, step count, running power, stride length, ground contact, vertical oscillation, and speed streams. Encoded as one dict per sample, each carrying device, sourceRevision, uuid, startDate, endDate, value, unit, and metadata:

{
  "uuid": "5A8C4551-C7C0-4BDA-B154-3FCB0704B560",
  "device": { ... full device struct, 120 bytes ... },
  "sourceRevision": { ... full source revision, 200 bytes ... },
  "startDate": "2026-05-09T16:11:55Z",
  "endDate":   "2026-05-09T16:12:00Z",
  "value": 142,
  "unit": "count/min"
}

That’s ~500 bytes per sample. 5,000 samples × 500 = 2.5 MB. 70% of which is the same device and sourceRevision repeated 5,000 times.

The columnar shape

Each per-stream block becomes:

{
  "unit": "count/min",
  "t":   [0, 5, 10, 15, 20, ...],         // integer seconds offset from the workout's startDate
  "dur": [3, 3, 3, 3, 3, ...],            // integer seconds (per-sample duration), omitted if all instant
  "v":   [142, 144, 145, 147, 148, ...]   // sample values in `unit`
}

Optional columns:

dur — included only if any sample in the stream has non-zero duration. HR samples are typically instant; step-count samples are typically 3-second windows.
src — array of source-index integers, one per sample. Omitted entirely if every sample uses source 0 (the common case for single-device users). When present, references entries in the top-level sources array.
metadata — a sparse object keyed by sample-index-as-string, only present for samples that have metadata. Most samples have none.

What’s gone, and what’s preserved

Information	Status
Per-sample `value` (the actual number)	Preserved verbatim
Per-sample `startDate`	Preserved as integer seconds offset from workout `startDate`
Per-sample `endDate`	Preserved when distinct from `startDate`, as `dur`
Per-sample `unit`	Hoisted once per stream (always the same per stream anyway)
Per-sample `device` / `sourceRevision`	Hoisted to top-level `sources`, referenced by `src`
Per-sample `uuid`	Dropped. UUIDs are HealthKit-internal handles; no consumer of an export uses them.
Per-sample `metadata`	Preserved via sparse map when present, omitted otherwise

The only true deletion is per-sample UUIDs. Every value, timestamp, duration, unit, and provenance link is reconstructible.

Why integer-second offsets

Apple’s quantity samples are second-aligned in practice — the Apple Watch writes samples at second boundaries. Storing them as offsets from the workout’s startDate:

Drops ~22 bytes per timestamp ("2026-05-09T16:11:55Z") to typically 2-4 bytes (5, 120, 1800).
Makes the data trivially plottable: for (i, t) in enumerate(stream.t): print(workout_start + t, stream.v[i]).
Negative offsets are allowed (a sample whose startDate is just before the workout window — rare, but valid).

Example: a real 33-min run

Encoding	File size
Original per-sample-dict JSON	2.58 MB
After provenance hoist (sources/src)	760 KB
After columnar (this format)	94 KB

sampleEncoding: "columnar-v1" is the discriminator. Future versions of the encoding will bump the suffix; consumers should check it before parsing.

Per-stream rules summary

For each entry in samples.*:

unit is always present and constant for the stream
t is required (integer seconds, may be empty if no samples)
v is required (same length as t)
dur is included iff any sample has non-zero duration; same length as t
src is included iff any sample’s source differs from 0; same length as t
metadata is included iff any sample has metadata; keyed by string-form sample index

Streams with zero samples are omitted from the parent samples dict entirely.

​The problem

​The columnar shape

​What’s gone, and what’s preserved

​Why integer-second offsets

​Example: a real 33-min run