MediaWiki API result

{
    "compare": {
        "fromid": 1,
        "fromrevid": 1,
        "fromns": 0,
        "fromtitle": "Main Page",
        "toid": 2,
        "torevid": 2,
        "tons": 0,
        "totitle": "Data Platform/Data Lake/Traffic/Pageview hourly/Sanitization",
        "*": "<tr><td colspan=\"2\" class=\"diff-lineno\" id=\"mw-diff-left-l1\">Line 1:</td>\n<td colspan=\"2\" class=\"diff-lineno\">Line 1:</td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div><del class=\"diffchange diffchange-inline\">&lt;strong&gt;MediaWiki has been installed</del>.<del class=\"diffchange diffchange-inline\">&lt;</del>/strong<del class=\"diffchange diffchange-inline\">&gt;</del></div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In an effort toward privacy with regards to reader pageview data, we aim to sanitize the aggregate logs that we store long-term.\u00a0 There are [[#Problems:_Reconstruction_of_browsing_patterns_.26_safe_data_publication|two reasons]] for sanitizing the dataset. The first is to protect our users from having their browsing pattern reconstructed if somebody hacks our cluster. The second is to publicly release aggregated datasets on interesting dimensions (user agent and geography, to be precise) without risk for our users. The [[#Solution:_Sanitizing_using_K-Anonymity_over_multiple_fields|approach chosen]] to sanitize the dataset is to anonymize (set value to unknown) certain values on rows, when the row is [[#Strategies|subject to identification]]</ins>. 
<ins class=\"diffchange diffchange-inline\">This page summarizes and links to details about the [[Analytics|Analytics Team]] approach, research and results on sanitizing the [[Analytics/Data</ins>/<ins class=\"diffchange diffchange-inline\">Pageview hourly|pageview_hourly]] dataset.\u00a0 Our [[#Information_loss_analysis|analysis]] shows that the strategy we chose provides a </ins>strong <ins class=\"diffchange diffchange-inline\">level of resistance to attacks while still keeping a lot of value in our dataset.</ins></div></td></tr>\n<tr><td class=\"diff-marker\"></td><td class=\"diff-context diff-side-deleted\"><br></td><td class=\"diff-marker\"></td><td class=\"diff-context diff-side-added\"><br></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div><del class=\"diffchange diffchange-inline\">Consult the [https</del>:<del class=\"diffchange diffchange-inline\">//www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents User's Guide] for information on using the wiki software.</del></div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Problems</ins>: <ins class=\"diffchange diffchange-inline\">Reconstruction of browsing patterns &amp; safe data publication ==</ins></div></td></tr>\n<tr><td class=\"diff-marker\"></td><td class=\"diff-context diff-side-deleted\"><br></td><td class=\"diff-marker\"></td><td class=\"diff-context diff-side-added\"><br></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div>== <del class=\"diffchange diffchange-inline\">Getting started </del>==</div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>==<ins class=\"diffchange diffchange-inline\">== Browsing patterns reconstruction ==</ins>==</div></td></tr>\n<tr><td class=\"diff-marker\" 
data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div><del class=\"diffchange diffchange-inline\">* </del>[<del class=\"diffchange diffchange-inline\">https:</del>//<del class=\"diffchange diffchange-inline\">www</del>.<del class=\"diffchange diffchange-inline\">mediawiki</del>.<del class=\"diffchange diffchange-inline\">org</del>/<del class=\"diffchange diffchange-inline\">wiki</del>/<del class=\"diffchange diffchange-inline\">Special:MyLanguage</del>/<del class=\"diffchange diffchange-inline\">Manual</del>:<del class=\"diffchange diffchange-inline\">Configuration_settings Configuration settings list]</del></div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">As we found in our [</ins>[<ins class=\"diffchange diffchange-inline\">Analytics/Data</ins>/<ins class=\"diffchange diffchange-inline\">Pageview hourly</ins>/<ins class=\"diffchange diffchange-inline\">Identity reconstruction analysis|Identity reconstruction analysis]], an attacker with access to our cluster could follow user browsing patterns by combining two datasets: pageview_hourly and webrequest. More precisely, users with a rare combination of values in various fields, especially user-agent and geographical location, are at risk of first being identified in the more raw webrequest dataset, and then followed in pageview_hourly. We only keep data in the webrequest dataset for a short period of time, but we would like to keep pageview_hourly indefinitely, and so we need to make it safe against this type of attack</ins>. 
.</div></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div>* <del class=\"diffchange diffchange-inline\">[https</del>://<del class=\"diffchange diffchange-inline\">www</del>.<del class=\"diffchange diffchange-inline\">mediawiki</del>.<del class=\"diffchange diffchange-inline\">org</del>/<del class=\"diffchange diffchange-inline\">wiki</del>/<del class=\"diffchange diffchange-inline\">Special:MyLanguage</del>/<del class=\"diffchange diffchange-inline\">Manual:FAQ MediaWiki FAQ</del>]</div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div><del class=\"diffchange diffchange-inline\">* [https:</del>//<del class=\"diffchange diffchange-inline\">lists</del>.<del class=\"diffchange diffchange-inline\">wikimedia</del>.<del class=\"diffchange diffchange-inline\">org</del>/<del class=\"diffchange diffchange-inline\">postorius</del>/<del class=\"diffchange diffchange-inline\">lists</del>/<del class=\"diffchange diffchange-inline\">mediawiki</del>-<del class=\"diffchange diffchange-inline\">announce</del>.<del class=\"diffchange diffchange-inline\">lists</del>.<del class=\"diffchange diffchange-inline\">wikimedia</del>.<del class=\"diffchange diffchange-inline\">org/ MediaWiki release mailing list]</del></div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">An analysis of the potential decay of fingerprinting data (see [[Analytics</ins>/<ins class=\"diffchange diffchange-inline\">Data</ins>/<ins class=\"diffchange diffchange-inline\">Pageview hourly</ins>/<ins class=\"diffchange diffchange-inline\">Fingerprinting Over Time|this page]] for more details) shows that data getting old doesn't imply enough change among user 
information to prevent us from sanitizing.</ins></div></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div>* [<del class=\"diffchange diffchange-inline\">https</del>://<del class=\"diffchange diffchange-inline\">www.mediawiki.org</del>/<del class=\"diffchange diffchange-inline\">wiki</del>/<del class=\"diffchange diffchange-inline\">Special</del>:<del class=\"diffchange diffchange-inline\">MyLanguage</del>/<del class=\"diffchange diffchange-inline\">Localisation#Translation_resources Localise MediaWiki </del>for <del class=\"diffchange diffchange-inline\">your language]</del></div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div>* <del class=\"diffchange diffchange-inline\">[https</del>://<del class=\"diffchange diffchange-inline\">www</del>.<del class=\"diffchange diffchange-inline\">mediawiki</del>.<del class=\"diffchange diffchange-inline\">org/wiki/Special</del>:<del class=\"diffchange diffchange-inline\">MyLanguage/Manual</del>:<del class=\"diffchange diffchange-inline\">Combating_spam Learn how to combat spam on your wiki</del>]</div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Safe pageview data publication ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Our team core mission is to try to release publicly as much data as we can. The pageview_hourly dataset is no exception, and we'd like to use it to provide more data to our users. 
However the sensitivity of the data it contains needs us to be very careful about how we publish its content. This means</ins>:</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>* <ins class=\"diffchange diffchange-inline\">Separate page_title (or page_id) from other non-global dimensions (particularly geo data and user agent data) - Example of attack</ins>: <ins class=\"diffchange diffchange-inline\">A user modifies a page, therefore making a hit for that page_title in the pageview_hourly dataset&lt;ref&gt;pageview_hourly doesn't contain edits, but in order to make an edit, the probability that you'll have a pageview for the given page (either before or after editing) is very high.&lt;</ins>/<ins class=\"diffchange diffchange-inline\">ref&gt;, and nobody else access that page for the given hour --&gt; If we keep geo data and user agent data associated with page_title, the user geo and user agent becomes easily known. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Ensure published aggregated data is not easily reconcilable with page_title level traffic -\u00a0 We have already released pageview data at page_title level of granularity (see [[Data Platform</ins>/<ins class=\"diffchange diffchange-inline\">AQS|pageview API in AQS]])</ins>. 
<ins class=\"diffchange diffchange-inline\">We want to be sure that newly published geo and user agent data aggregated at project / access / agent_type level will not be linkable to page_title traffic (or to be more precise, could only be linked to anonymized traffic were geo and user_agent data are not present anymore)</ins>.</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Solution: Sanitizing using K-Anonymity over multiple fields ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">See [[Analytics</ins>/<ins class=\"diffchange diffchange-inline\">Data</ins>/<ins class=\"diffchange diffchange-inline\">Pageview hourly</ins>/<ins class=\"diffchange diffchange-inline\">Sanitization algorithm proposal|this page]</ins>] <ins class=\"diffchange diffchange-inline\">for a detail version of the algorithm we propose.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Very briefly, the idea is to group pageviews into buckets by sensitive fields, such as user agent and location.\u00a0 When these buckets have less than &lt;code&gt;K&lt;sub&gt;ip&lt;/sub&gt;&lt;/code&gt; disctinct IPs or less than Kpv distinct pages viewed, we 
anonymize one of the sensitive fields and repeat so that all possible buckets have more than &lt;code&gt;K&lt;sub&gt;ip&lt;/sub&gt;&lt;/code&gt; distinct IPs and more than &lt;code&gt;K&lt;sub&gt;pv&lt;</ins>/<ins class=\"diffchange diffchange-inline\">sub&gt;&lt;</ins>/<ins class=\"diffchange diffchange-inline\">code&gt; distinct pages viewed</ins>. <ins class=\"diffchange diffchange-inline\"> Fields with values that are unlikely to show up often are anonymized first</ins>.</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Strategies ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== The good Ks ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">We (the Analytics-Team) did a manual/qualitative review of browsing patterns over an hour with various distinct IPs, distinct pages and settings. 
Detailed data on exercise can be found on [[Analytics</ins>/<ins class=\"diffchange diffchange-inline\">Data</ins>/<ins class=\"diffchange diffchange-inline\">Pageview hourly</ins>/<ins class=\"diffchange diffchange-inline\">K Anonymity Threshold Analysis|this dedicated page]] along with Hive code.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== We found that: ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* When looking at groups of pages viewed by multiple people, it is sometimes easy to guess which sub</ins>-<ins class=\"diffchange diffchange-inline\">groups of pages could have been viewed together based on topics</ins>.</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** It is however not possible to re-attach sub-groups to the underlying people with certainty</ins>.</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** It could be feasible to reattach subgroups to the underlying people with some probability of being right using prior knowledge of browsing habits of those people.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td 
class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* When looking at groups of pages having a small number of distinct pages, even with a very small number of distinct pages (2, 3, 4, 5), we have almost never identified those set sets as single sessions, and somewhat regularly we can identify them as two sessions.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== It means that: ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* The minimum anonymization we could go for would make sure that at least 2 distinct IPs and 2 distinct pages occur per bucket.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* We prefer to go on the safer side and add more variability to our buckets, ensuring that at least 3 distinct IPs and 5 distinct pages occur per bucket. 
It involves us anonymizing 91.28% of buckets making 35.11% of requests.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Choosing hourly or longer term data to establish the \"uniqueness\" of values in sensitive fields ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">We want to anonymize the most rare values first, because they are the most identifying. We can establish the \"rareness\" of each value by looking at either hourly statistics or longer term, such as monthly statistics:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Using hourly statistics would establish a Local probability. This should reduce the processing time and the number of steps needed to terminate the algorithm (because locally rare values are anonymized first, leading to faster progress towards buckets of size greater than K)</ins>.</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>* <ins class=\"diffchange diffchange-inline\">Using monthly statistics would establish a more Global probability. This normalizes any temporal patterns (such as hourly or weekly seasonality) and accounts for differences across time zones. 
This approach gives more value to global data quality but would run slower.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">We decided as a team to use hourly statistics, for two reasons. One is technical computation resource, as statistics computation over a month would be very big, the other is data preservation, since using monthly statistics would mean anonymizing more data. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Information loss analysis ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Entropy analysis - Definition ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange 
diffchange-inline\">==== Per dimension ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">We used [</ins>[:<ins class=\"diffchange diffchange-inline\">en:Entropy_(information_theory)|Shannon Entropy]]\u00a0 definition trying to measure how much information was lost in the process of anonymizing the pageview dataset.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">For each fingerprinting dimension we computed entropy using the probability of a value (&lt;code&gt;P&lt;sub&gt;val&lt;/sub&gt;&lt;/code&gt;) in the dimension as:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;math&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">P_{val} = \\sum_{rows} \\textrm{view\\_count}_{val} </ins>/ <ins class=\"diffchange diffchange-inline\">\\sum_{rows} \\textrm{view\\_count}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline 
diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;</ins>/<ins class=\"diffchange diffchange-inline\">math&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Entropy for the dimension was then computed using the formula:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;math&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">H_{dim} = \\sum_{val \\in dim} P_{val} * \\textrm{log2}(1 / P_{val})</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;</ins>/<ins class=\"diffchange diffchange-inline\">math&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">An interesting point 
to notice is how to count &lt;code&gt;unknown&lt;</ins>/<ins class=\"diffchange diffchange-inline\">code&gt; values in this definition. We have tried 3 methods</ins>:</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &lt;code&gt;unknown&lt;/code&gt; as a regular value</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &lt;code&gt;unknown&lt;</ins>/<ins class=\"diffchange diffchange-inline\">code&gt; only in adds to the total sum of view_counts </ins>for <ins class=\"diffchange diffchange-inline\">the dataset, but not as a value having a probability</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>* <ins class=\"diffchange diffchange-inline\">not counting &lt;code&gt;unknown&lt;/code&gt; at all</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Results have shown that the last method gives better result: more coherent in term of entropy definition, and providing a better view of how much data has been lost.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Global 
====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Finally, we have computed a global value of entropy for the dataset, using the formula</ins>:</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;math&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">H_{dataset} = \\sum_{row \\in dataset} ( \\sum_{val \\in row} P_{val} * \\textrm{log2}(1 </ins>/ <ins class=\"diffchange diffchange-inline\">P_{val}) )</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;</ins>/<ins class=\"diffchange diffchange-inline\">math&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Entropy analysis - Results ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline 
diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Per dimension ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[File:Anonymization data loss per dimension</ins>.<ins class=\"diffchange diffchange-inline\">png|none|thumb|599x599px]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Global ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Tests realized over 3 hours at different time of the day (see next section for circadian patterns details):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| class=\"wikitable\"</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">!Hour (UTC)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline 
diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">!Entropy Default Dataset</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">!Entropy Anonymized Dataset</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">!Data loss</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|65525642</ins>.<ins class=\"diffchange diffchange-inline\">24</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|64143222.07</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|2.11%</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange 
diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|8</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|65024942.53</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|63503727.41</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|2.34%</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|17</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|87047745.81</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|85519908.15</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td 
class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|1.76%</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Circadian Rhythm ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Continents ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[File:Anonymization circadian continent.png|none|thumb|591x591px]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Countries ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" 
data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[File:Anonymization circadian country anim.gif|none|thumb|591x591px]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Category</ins>:<ins class=\"diffchange diffchange-inline\">Pageviews]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Category</ins>:<ins class=\"diffchange diffchange-inline\">Data platform]</ins>]</div></td></tr>\n"
    }
}