Changes

Summary

  1. Add support for exporting non-anonymized data for studies (#598) (commit: d049748b62bb283c693b6bdd55150db8b4513d73)
Commit d049748b62bb283c693b6bdd55150db8b4513d73 by noreply
Add support for exporting non-anonymized data for studies (#598)
Our consent document states that studies led by other researchers can
also use
governed by their own
anonymized dump, including the
Until now, we did not have a project that wanted to do this, so we did
not have support for dumping the email <-> uuid mapping.
This commit adds that support.
Main changes:
associated with a "channel"
extraction script to be objects
instead of a string list of uuids
a `-m` option that will load the emails from
a map file instead of creating dummy ones.
Bonus fix:
- changed the existing script that extracts uuids for creating open datasets to match the new format, so that uuids generated by both scripts can be handled by the extraction code. Its output contains only uuids, though, so it cannot be used as a map file for the load. That makes sense, since you never want to expose emails in a public dataset.
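
The difference between the two kinds of file can be illustrated with a minimal sketch. It assumes each file is a JSON list of objects shaped like the script output quoted in this message, where open-dataset entries carry only `uuid` and channel entries also carry `user_email`. The function name `build_email_map` is hypothetical and is not the actual loader code.

```python
import json

def build_email_map(entries):
    """Return {uuid_hex: email} from a list of user objects.

    Assumption: each entry looks like the JSON printed by the scripts,
    e.g. {"uuid": {"$uuid": "<32 hex chars>"}, "user_email": "..."}.
    """
    email_map = {}
    for entry in entries:
        if "user_email" not in entry:
            # A uuid-only file (public dataset) cannot serve as a map file.
            raise ValueError("entry has no user_email; not usable as a map file")
        email_map[entry["uuid"]["$uuid"]] = entry["user_email"]
    return email_map

# Entry as produced for a channel (uuid plus email).
channel_entries = json.loads(
    '[{"uuid": {"$uuid": "74f5bc929d2e45b889bafc6f18000ab6"},'
    ' "user_email": "test_ucdb_load"}]')
# Entry as produced for an open dataset (uuid only).
public_entries = json.loads(
    '[{"uuid": {"$uuid": "74f5bc929d2e45b889bafc6f18000ab6"}}]')

print(build_email_map(channel_entries))
```

Calling `build_email_map(public_entries)` raises `ValueError`, which mirrors the point that a uuid-only list cannot be used to restore emails.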
Testing done:
Extracting uuids from email list continues to work, although the list
now
New interface with arg parsing outputs to stdout by default

```
$ ./e-mission-py.bash bin/public/extract_uuids_from_email_list.py /tmp/email_list
configuration
email test_ucdb_load to uuid 74f5bc92-9d2e-45b8-89ba-fc6f18000ab6
[{"uuid": {"$uuid": "74f5bc929d2e45b889bafc6f18000ab6"}}]
```
And to a file otherwise

```
$ ./e-mission-py.bash bin/public/extract_uuids_from_email_list.py /tmp/email_list -o /tmp/uuid_list
to sample, default configuration
DEBUG:root:Mapped email test_ucdb_load to uuid 74f5bc92-9d2e-45b8-89ba-fc6f18000ab6
```
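
The stdout-by-default behaviour with an optional `-o` file can be sketched with `argparse` as follows. This is only an illustration of the interface shown in the two runs above; the argument names (`email_list_file`, `--outfile`) and the helper `write_entries` are assumptions, not the script's actual code.

```python
import argparse
import json
import sys

def parse_args(argv):
    parser = argparse.ArgumentParser(description="hypothetical sketch")
    parser.add_argument("email_list_file", help="file listing the emails")
    # Assumption: -o names the output file; when omitted we fall back to stdout.
    parser.add_argument("-o", "--outfile", default=None)
    return parser.parse_args(argv)

def write_entries(entries, outfile=None):
    """Write entries as JSON to outfile, or to stdout when outfile is None."""
    if outfile is None:
        json.dump(entries, sys.stdout)
        sys.stdout.write("\n")
    else:
        with open(outfile, "w") as fp:
            json.dump(entries, fp)

args = parse_args(["/tmp/email_list"])
write_entries([{"uuid": {"$uuid": "74f5bc929d2e45b889bafc6f18000ab6"}}],
              args.outfile)
```

Passing `["/tmp/email_list", "-o", "/tmp/uuid_list"]` instead would set `args.outfile` and send the same JSON to that file.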
Extracting the timeline for this list still works after changing the
parsing for the `-f` option

```
$ ./e-mission-py.bash bin/debug/extract_timeline_for_day_range_and_user.py -f /tmp/uuid_list 2000-01-01 2020-01-01 /tmp/test_file
back to sample, default configuration
localhost
INFO:root:Extracting timeline for user 74f5bc92-9d2e-45b8-89ba-fc6f18000ab6 day 2000-01-01 -> 2020-01-01 and saving to file /tmp/test_file
(2000-01-01T00:00:00+00:00), end_day_ts = 1577836800 (2020-01-01T00:00:00+00:00)
...
entries = 90 total entries
{'stats/server_api_time'}
```
New get_users_for_channel script also works for stdout...

```
$ ./e-mission-py.bash bin/debug/get_users_for_channel.py foo
not configured, falling back to sample, default configuration
Connecting to database URL localhost
channel foo
[{"_id": {"$oid": "5b84e9df91da20c31c8ce47e"}, "user_email": "test_ucdb_load", "uuid": {"$uuid": "74f5bc929d2e45b889bafc6f18000ab6"}, "update_ts": {"$date": 1535412606366}}]
```
and to file

```
$ ./e-mission-py.bash bin/debug/get_users_for_channel.py foo -o /tmp/uuid_list_2
default configuration
DEBUG:root:Mapped 1 entries for channel foo
```
Extraction continues to work since the objects have a `uuid` field.

```
$ ./e-mission-py.bash bin/debug/extract_timeline_for_day_range_and_user.py -f /tmp/uuid_list_2 2000-01-01 2020-01-01 /tmp/test_file_2
back to sample, default configuration
localhost
INFO:root:Extracting timeline for user 74f5bc92-9d2e-45b8-89ba-fc6f18000ab6 day 2000-01-01 -> 2020-01-01 and saving to file /tmp/test_file_2
(2000-01-01T00:00:00+00:00), end_day_ts = 1577836800 (2020-01-01T00:00:00+00:00)
...
entries = 90 total entries
{'stats/server_api_time'}
```
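
A minimal sketch of why both file kinds stay usable for extraction: if the reader looks only at the `uuid` field of each object, extra fields such as `user_email` are simply ignored. The helper name `uuids_from_entries` is hypothetical, and the sketch unwraps the `{"$uuid": ...}` form by hand with the standard library, which may differ from the JSON helper the real scripts use.

```python
import json
import uuid

def uuids_from_entries(entries):
    """Return uuid.UUID objects from a list of user objects.

    Only the `uuid` field is read, so uuid-only entries and full
    user records are handled identically.
    """
    return [uuid.UUID(hex=entry["uuid"]["$uuid"]) for entry in entries]

# Shape of the list written by extract_uuids_from_email_list.py.
uuid_only = json.loads(
    '[{"uuid": {"$uuid": "74f5bc929d2e45b889bafc6f18000ab6"}}]')
# Shape of the list written by get_users_for_channel.py (subset of fields).
with_email = json.loads(
    '[{"user_email": "test_ucdb_load",'
    ' "uuid": {"$uuid": "74f5bc929d2e45b889bafc6f18000ab6"}}]')

print(uuids_from_entries(uuid_only) == uuids_from_entries(with_email))
```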
Loading the extracted data with the new `-m` mapping option works.

```
$ ./e-mission-py.bash bin/debug/load_multi_timeline_for_range.py -v 1 -m /tmp/uuid_list_2 /tmp/test_file_2
to sample, default configuration
INFO:root:Loading file or prefix /tmp/test_file_2
matching files for prefix /tmp/test_file_2
['/tmp/test_file_2_pipelinestate_74f5bc92-9d2e-45b8-89ba-fc6f18000ab6.gz',
'/tmp/test_file_2_74f5bc92-9d2e-45b8-89ba-fc6f18000ab6.gz'] ...
['/tmp/test_file_2_pipelinestate_74f5bc92-9d2e-45b8-89ba-fc6f18000ab6.gz']
INFO:root:==================================================
INFO:root:Loading data from file /tmp/test_file_2_74f5bc92-9d2e-45b8-89ba-fc6f18000ab6.gz
INFO:root:Analyzing timeline...
INFO:root:timeline has data from 1 users
following unique keys {'stats/server_api_time'}
user 74f5bc92-9d2e-45b8-89ba-fc6f18000ab6 contains only raw data
INFO:root:About to load range 0 -> 90
74f5bc92-9d2e-45b8-89ba-fc6f18000ab6 from
/tmp/test_file_2_pipelinestate_74f5bc92-9d2e-45b8-89ba-fc6f18000ab6.gz
args.prefix = user, args.mapfile = /tmp/uuid_list_2
user entries for 1 users from map of length 16
insert mapping test_ucdb_load -> 74f5bc92-9d2e-45b8-89ba-fc6f18000ab6
INFO:root:For 1 users, loaded 90 raw entries, 0 processed entries and 0 pipeline states
data, need to run the intake pipeline
```
And the list of users is unchanged because we loaded the user with email `test_ucdb_load`, which was always there.

```
[{'_id': ObjectId('5b7c5cce58b6cc2f0e261660'),
'update_ts': datetime.datetime(2018, 8, 21, 11, 41, 18, 297000),
'user_email': 'indy_day',
'uuid': UUID('c3613fb1-8975-439f-9a11-06a9d18ba01f')},
{'_id': ObjectId('5b7c5cf258b6cc2f0e261e5e'),
'update_ts': datetime.datetime(2018, 8, 21, 11, 42, 1, 792000),
'user_email': 'iphone_2016',
'uuid': UUID('31252f6e-47b1-43ce-9d76-76972d0f2f33')},
{'_id': ObjectId('5b84e95191da20c31c8ce16d'),
'update_ts': datetime.datetime(2018, 8, 27, 23, 18, 57, 800000),
'user_email': 'test_tsdb_load',
'uuid': UUID('b1e80dcd-eb4e-4926-bace-d046a284b5aa')},
{'_id': ObjectId('5b84e9df91da20c31c8ce47e'),
'update_ts': datetime.datetime(2018, 8, 28, 13, 38, 16, 323000),
'user_email': 'test_ucdb_load',
'uuid': UUID('74f5bc92-9d2e-45b8-89ba-fc6f18000ab6')}]
```
After resetting...

```
UUID('74f5bc92-9d2e-45b8-89ba-f
   ...: c6f18000ab6')})
0x105c050c8>
```
Reload without the `-m` option, which creates dummy email addresses for additional privacy protection.

```
$ ./e-mission-py.bash bin/debug/load_multi_timeline_for_range.py -v 1 /tmp/test_file_2
default configuration
INFO:root:Loading file or prefix /tmp/test_file_2
matching files for prefix /tmp/test_file_2
['/tmp/test_file_2_pipelinestate_74f5bc92-9d2e-45b8-89ba-fc6f18000ab6.gz',
'/tmp/test_file_2_74f5bc92-9d2e-45b8-89ba-fc6f18000ab6.gz'] ...
['/tmp/test_file_2_pipelinestate_74f5bc92-9d2e-45b8-89ba-fc6f18000ab6.gz']
INFO:root:==================================================
INFO:root:Loading data from file /tmp/test_file_2_74f5bc92-9d2e-45b8-89ba-fc6f18000ab6.gz
INFO:root:Analyzing timeline...
INFO:root:timeline has data from 1 users
following unique keys {'stats/server_api_time'}
user 74f5bc92-9d2e-45b8-89ba-fc6f18000ab6 contains only raw data
INFO:root:About to load range 0 -> 90
74f5bc92-9d2e-45b8-89ba-fc6f18000ab6 from
/tmp/test_file_2_pipelinestate_74f5bc92-9d2e-45b8-89ba-fc6f18000ab6.gz
INFO:root:No pipeline states found, skipping load
args.mapfile = None
INFO:root:pattern = user-%01d
-> 74f5bc92-9d2e-45b8-89ba-fc6f18000ab6
90 raw entries, 0 processed entries and 0 pipeline states
entries in the timeline contain only raw data, need to run the intake
pipeline
```
And we see another entry with the same uuid

```
[{'_id': ObjectId('5b7c5cce58b6cc2f0e261660'),
'update_ts': datetime.datetime(2018, 8, 21, 11, 41, 18, 297000),
'user_email': 'indy_day',
'uuid': UUID('c3613fb1-8975-439f-9a11-06a9d18ba01f')},
{'_id': ObjectId('5b7c5cf258b6cc2f0e261e5e'),
'update_ts': datetime.datetime(2018, 8, 21, 11, 42, 1, 792000),
'user_email': 'iphone_2016',
'uuid': UUID('31252f6e-47b1-43ce-9d76-76972d0f2f33')},
{'_id': ObjectId('5b84e95191da20c31c8ce16d'),
'update_ts': datetime.datetime(2018, 8, 27, 23, 18, 57, 800000),
'user_email': 'test_tsdb_load',
'uuid': UUID('b1e80dcd-eb4e-4926-bace-d046a284b5aa')},
{'_id': ObjectId('5b84e9df91da20c31c8ce47e'),
'update_ts': datetime.datetime(2018, 8, 28, 13, 38, 16, 323000),
'user_email': 'test_ucdb_load',
'uuid': UUID('74f5bc92-9d2e-45b8-89ba-fc6f18000ab6')},
{'_id': ObjectId('5b85b39c91da20c31c8cf796'),
'update_ts': datetime.datetime(2018, 8, 28, 13, 42, 4, 615000),
'user_email': 'user-0',
'uuid': UUID('74f5bc92-9d2e-45b8-89ba-fc6f18000ab6')}]
```
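
The dummy address `user-0` is consistent with the logged `pattern = user-%01d` and `args.prefix = user`. A rough sketch of how such a pattern could be built and applied is given below. The width rule (number of digits in the user count) and the function names are assumptions made for illustration; they are not taken from the loader's source.

```python
def dummy_email_pattern(prefix, n_users):
    """Build a %-style pattern wide enough to number n_users users.

    Assumption: the zero-padded width is the number of digits in n_users,
    which yields "user-%01d" for prefix "user" and a single user.
    """
    width = len(str(n_users))
    return "%s-%%0%dd" % (prefix, width)

def dummy_emails(prefix, user_ids):
    """Map generated dummy emails to the given user ids, in order."""
    pattern = dummy_email_pattern(prefix, len(user_ids))
    # Indices start at 0, consistent with the `user-0` entry in the listing.
    return {pattern % i: uid for i, uid in enumerate(user_ids)}

print(dummy_email_pattern("user", 1))
print(dummy_emails("user", ["74f5bc92-9d2e-45b8-89ba-fc6f18000ab6"]))
```

Because the dummy email differs from `test_ucdb_load`, a load of this kind adds a new user record rather than updating the existing one, which is why the same uuid appears twice.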

The file was added: bin/debug/get_users_for_channel.py
The file was modified: bin/debug/extract_timeline_for_day_range_and_user.py
The file was modified: bin/public/extract_uuids_from_email_list.py
The file was modified: bin/debug/load_multi_timeline_for_range.py