Data Model¶
MyKrok stores activities in a Hive-partitioned directory structure optimized for both human readability and efficient querying with tools like DuckDB.
Directory Structure¶
data/
├── athletes.tsv # Summary of all athletes
├── mykrok.html # Generated map browser
└── athl={username}/ # Per-athlete directory
├── athlete.json # Profile data
├── avatar.jpg # Profile photo
├── sessions.tsv # Activity index
├── gear.json # Equipment catalog
└── ses={datetime}/ # Per-activity directory
├── info.json # Activity metadata
├── tracking.parquet # GPS + sensor data
├── tracking.json # Data manifest
└── photos/
└── {timestamp}.jpg # Activity photos
Entities¶
Athlete¶
Profile data for the authenticated user.
Storage: athl={username}/athlete.json
| Field | Type | Description |
|---|---|---|
id |
integer | Strava athlete ID |
username |
string | Strava username |
firstname |
string | First name |
lastname |
string | Last name |
city |
string | City from profile |
country |
string | Country from profile |
profile_url |
string | Profile image URL |
Activity¶
Athletic activity with metadata.
Storage: ses={datetime}/info.json
The datetime format is ISO 8601 basic (e.g., 20251218T143022).
| Field | Type | Description |
|---|---|---|
id |
integer | Strava activity ID |
name |
string | Activity title |
type |
string | Activity type (Run, Ride, etc.) |
sport_type |
string | Specific sport type |
start_date |
datetime | Start time (UTC) |
distance |
float | Distance in meters |
moving_time |
integer | Moving time in seconds |
elapsed_time |
integer | Total time in seconds |
total_elevation_gain |
float | Elevation gain in meters |
average_heartrate |
float | Average heart rate |
max_heartrate |
integer | Maximum heart rate |
average_watts |
float | Average power |
kudos_count |
integer | Number of kudos |
comment_count |
integer | Number of comments |
Nested data in info.json:
comments[]- Array of comment objectskudos[]- Array of kudo objectslaps[]- Array of lap dataphotos[]- Array of photo metadata
Track (GPS/Sensor Data)¶
Time-series data stored as Parquet for efficient querying.
Storage: ses={datetime}/tracking.parquet
| Column | Type | Description |
|---|---|---|
time |
float64 | Seconds from start |
lat |
float64 | Latitude |
lng |
float64 | Longitude |
altitude |
float32 | Elevation in meters |
distance |
float32 | Cumulative distance |
heartrate |
int16 | Heart rate (BPM) |
cadence |
int16 | Cadence |
watts |
int16 | Power |
temp |
float32 | Temperature (°C) |
A manifest file (tracking.json) describes available columns:
{
"columns": ["time", "lat", "lng", "altitude", "heartrate"],
"row_count": 3600,
"has_gps": true,
"has_hr": true,
"has_power": false
}
Sessions Summary¶
Denormalized index for quick filtering.
Storage: athl={username}/sessions.tsv
| Column | Type | Description |
|---|---|---|
datetime |
string | ISO 8601 basic format (UTC) |
datetime_local |
string | ISO 8601 basic format (local time) |
type |
string | Activity type |
sport |
string | Sport type |
name |
string | Activity name |
distance_m |
float | Distance in meters |
moving_time_s |
integer | Moving time (seconds) |
elevation_gain_m |
float | Elevation gain |
avg_hr |
float | Average heart rate |
has_gps |
boolean | Has GPS data |
has_photos |
boolean | Has photos |
start_lat |
float | Starting latitude |
start_lng |
float | Starting longitude |
Athletes Summary¶
Top-level index of all athletes.
Storage: data/athletes.tsv
| Column | Type | Description |
|---|---|---|
username |
string | Athlete username |
firstname |
string | First name |
lastname |
string | Last name |
session_count |
integer | Number of activities |
total_distance_km |
float | Total distance |
activity_types |
string | Comma-separated types |
Querying with DuckDB¶
-- Total distance by sport
SELECT sport, SUM(distance_m)/1000 as km, COUNT(*) as activities
FROM read_csv_auto('data/athl=*/sessions.tsv')
GROUP BY sport
ORDER BY km DESC;
-- Average heart rate by activity
SELECT datetime, name, avg_hr, max_hr
FROM read_csv_auto('data/athl=*/sessions.tsv')
WHERE avg_hr > 0
ORDER BY datetime DESC
LIMIT 10;
-- Query GPS tracks
SELECT ses, AVG(heartrate) as avg_hr, MAX(altitude) as max_alt
FROM read_parquet('data/**/tracking.parquet', hive_partitioning=true)
WHERE heartrate > 0
GROUP BY ses;
File Format Details¶
Why Parquet?¶
GPS and sensor data is stored as Parquet because:
- Columnar format - Efficient for analytics queries
- Compression - 5-10x smaller than CSV
- Browser support - hyparquet enables direct browser reading
- Type safety - Schema preserved
Why TSV?¶
Summary files use TSV (Tab-Separated Values) because:
- Human readable - Easy to inspect and edit
- Git-friendly - Clean diffs
- Universal - Works with any tool
- DuckDB compatible - Direct querying