Published research on privacy, transparency and machine learning
Dissemination @ Cliqz
A space to share studies, talks, and findings by the Cliqz team.
Online tracking poses a serious privacy challenge that has drawn significant attention in both academia and industry. Existing approaches for preventing user tracking, based on curated blocklists, suffer from limited coverage and coarse-grained resolution for classification, rely on exceptions that impact sites’ functionality and appearance, and require significant manual maintenance.
In this paper we propose a novel approach, based on concepts borrowed from k-Anonymity, in which users collectively identify unsafe data elements, i.e. values with the potential to uniquely identify an individual user, and remove them from requests.
We deployed our system to 200,000 German users running the Cliqz Browser or the Cliqz Firefox extension to evaluate its efficiency and feasibility. Results indicate that our approach achieves better privacy protection than blocklists, as provided by Disconnect, while keeping site breakage to a minimum, even lower than the community-optimized Adblock Plus. We also provide evidence of the prevalence and reach of trackers across more than 21 million pages on 350,000 unique sites: 95% of the pages visited contain third-party requests to potential trackers and 78% attempt to transfer unsafe data. Tracker organizations are also ranked, showing that a single organization can reach up to 42% of all page visits in Germany.
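The core idea can be sketched in a few lines: a data element is only allowed to leave the browser once enough distinct users have been observed sending the same value, so that no single value can single out an individual. The sketch below is a minimal illustration of such a k-Anonymity-style check, not the deployed system; the names `SafeValueOracle`, `sanitize` and the threshold `K` are hypothetical.

```python
from collections import defaultdict

K = 10  # anonymity threshold (illustrative value, not the paper's)

class SafeValueOracle:
    """Counts how many distinct users reported each (key, value) token;
    a token is considered safe once at least k users share it."""
    def __init__(self, k=K):
        self.k = k
        self.seen = defaultdict(set)  # token -> set of reporting user ids

    def report(self, user_id, token):
        self.seen[token].add(user_id)

    def is_safe(self, token):
        return len(self.seen[token]) >= self.k

def sanitize(params, oracle):
    """Drop request parameters whose values could uniquely identify a user."""
    return {key: value for key, value in params.items()
            if oracle.is_safe((key, value))}
```

A value shared by many users (e.g. a language setting) survives, while a one-off identifier is stripped before the request is sent.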
The standard approach to collect users’ activity data on the Web relies on server-side processing. This approach requires the presence of user-identifiers in order to aggregate data in sessions, which leads to tracking. Server-side aggregation is bound to produce side-effects because the scope of sessions cannot be safely limited to a particular use-case. We provide several examples of such side-effects.
To preserve privacy we propose an alternative approach based on client-side aggregation, where user-identifiers are not needed because sessions only exist on the client-side (i.e. the user’s browser). We demonstrate the feasibility of this approach by providing an implementation of a tracking agent – green-tracker – able to gather the data needed to power a service functionally equivalent to Google Analytics.
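The contrast with server-side aggregation can be illustrated with a small sketch: the session object lives only in the browser, and the only message ever emitted is an identifier-free aggregate. This is an illustrative sketch of the principle, not the actual green-tracker implementation; the class name and message fields are hypothetical.

```python
import time

class ClientSideSession:
    """Keeps the browsing session entirely on the client; only an
    identifier-free aggregate ever leaves the device."""
    def __init__(self, site):
        self.site = site
        self.pages = 0
        self.start = time.time()

    def page_view(self):
        self.pages += 1

    def close(self):
        # The emitted aggregate carries no user id: since the session is
        # assembled client-side, the server never needs to stitch events
        # together and therefore never needs an identifier.
        return {
            "site": self.site,
            "pages": self.pages,
            "duration_s": round(time.time() - self.start),
        }
```

Because aggregation happens before anything is sent, the scope of the session is limited by construction to this one use case, avoiding the side effects of server-side sessions.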
The goal of this study is to shed light on the impact of trackers from a performance perspective, rather than the more frequently studied privacy standpoint. Previous research on the topic has looked at the ubiquitous nature of online tracking and its various business models; the pervasiveness of tracking, especially among news websites; and the privacy implications of tracking in the wild, where a few companies have extensive reach over web traffic.
Beyond privacy concerns, we are left with one question: do trackers cost us time? More specifically, what is the relationship between the number of trackers and the time a page takes to load? We call this impact of trackers on website page load times, also referred to as page latency, the Tracker Tax.
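As a first approximation, this relationship can be read off by comparing load times across pages with different tracker counts, for example via the median latency per tracker count. The sketch below is purely illustrative and is not the study's methodology; the function name and input format are hypothetical.

```python
from collections import defaultdict
from statistics import median

def tracker_tax(page_loads):
    """Given (tracker_count, load_time_s) pairs, return the median load
    time observed for each tracker count -- a first look at how latency
    grows with the number of trackers on a page."""
    by_count = defaultdict(list)
    for n_trackers, load_time in page_loads:
        by_count[n_trackers].append(load_time)
    return {n: median(times) for n, times in sorted(by_count.items())}
```

Medians are used rather than means so that a few very slow outlier loads do not dominate the comparison.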
Online tracking has become of increasing concern in recent years; however, our understanding of its extent has so far been limited to snapshots from web crawls. Previous attempts to measure the tracking ecosystem have been made using instrumented measurement platforms, which are not able to accurately capture how people interact with the web.
In this work we present a method for measuring tracking on the web through a browser extension, as well as a method for the aggregation and collection of this information which protects the privacy of participants. We deployed this extension to more than 5 million users, enabling measurement across multiple countries, ISPs and browser configurations, to give an accurate picture of real-world tracking. The result is the largest and longest measurement of online tracking to date based on real users, covering 1.5 billion page loads gathered over 12 months. The data, detailing tracking behaviour over a year, is made publicly available to help drive transparency around online tracking practices.
Anonymous data collection systems allow users to contribute the data necessary to build services and applications while preserving their privacy. Anonymity, however, can be abused by malicious agents aiming to subvert or to sabotage the data collection, for instance by injecting fabricated data.
In this paper we propose an efficient mechanism to rate-limit an attacker without compromising the privacy and anonymity of the users contributing data. The proposed system builds on top of Direct Anonymous Attestation (DAA), a proven cryptographic primitive. We describe how a set of rate-limiting rules can be formalized to define a normative space in which messages sent by an attacker can be linked and, consequently, dropped. We present all components needed to build and deploy such protection on existing data collection systems with little overhead. Empirical evaluation yields throughput of up to 125 and 140 messages per second for senders and the collector, respectively, on nominal hardware. Communication latency is bounded by 4 seconds at the 95th percentile when using Tor as the network layer.
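The collector-side behaviour can be sketched as follows. Each message carries a pseudonymous tag which, by construction of the DAA basename, is identical only for messages from the same sender under the same rule and epoch; the collector can therefore enforce a quota without ever learning who the sender is. The class below is a simplified illustration of that idea only: the real system derives tags cryptographically, and `RateLimiter` is a hypothetical name.

```python
from collections import Counter

class RateLimiter:
    """Drops messages once their pseudonymous tag exceeds its quota.
    Tags link only messages from one sender under one rule and epoch,
    so enforcing the quota reveals nothing about sender identity."""
    def __init__(self, quota):
        self.quota = quota
        self.counts = Counter()

    def accept(self, tag, message):
        if self.counts[tag] >= self.quota:
            return None  # over quota: drop; the sender stays anonymous
        self.counts[tag] += 1
        return message
```

An attacker flooding the collector exhausts the quota for their own tags, while honest users, each with their own tags, are unaffected.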
In collaboration with Valentin Hartmann and Dr. Robert West from EPFL, Switzerland.
Today, large amounts of valuable data are distributed among millions of user-held devices, such as personal computers, phones, or Internet-of-things devices. Many companies collect such data with the goal of using it for training machine learning models allowing them to improve their services.
However, user-held data is often sensitive, and collecting it is problematic in terms of privacy. We address this issue by proposing a novel way of training a supervised classifier in a distributed setting akin to the recently proposed federated learning paradigm (McMahan et al. 2017), but under the stricter privacy requirement that the server that trains the model is assumed to be untrusted and potentially malicious; we thus preserve user privacy by design, rather than by trust. In particular, our framework, called secret vector machine (SecVM), provides an algorithm for training linear support vector machines (SVM) in a setting in which data-holding clients communicate with an untrusted server by exchanging messages designed to not reveal any personally identifiable information.
We evaluate our model in two ways. First, in an offline evaluation, we train SecVM to predict user gender from tweets, showing that we can preserve user privacy without sacrificing classification performance. Second, we implement SecVM’s distributed framework for the Cliqz web browser and deploy it for predicting user gender in a large-scale online evaluation with thousands of clients, outperforming baselines by a large margin and thus showcasing that SecVM is practicable in production environments. Overall, this work demonstrates the feasibility of machine learning on data from thousands of users without collecting any personal data. We believe this is an innovative approach that will help reconcile machine learning with data privacy.
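The training loop behind this idea can be sketched in a few lines: each client computes a hinge-loss subgradient on its local shard, and the server only ever sees the clients' update vectors, never raw features or labels. This is a plain federated-SVM sketch of the underlying principle, not SecVM's exact protocol, which additionally protects the updates themselves; all names here are hypothetical.

```python
def client_update(w, X, y):
    """Hinge-loss subgradient computed on one client's local (X, y) shard.
    Only this aggregate vector would leave the device."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        if margin < 1:  # margin violation contributes to the subgradient
            for j, xj in enumerate(xi):
                grad[j] -= yi * xj
    return grad

def server_step(w, client_grads, n, lam, lr):
    """The (untrusted) server sums client updates, adds L2 regularization,
    and takes one gradient step on the shared weight vector."""
    total = [sum(g[j] for g in client_grads) for j in range(len(w))]
    return [wj - lr * (total[j] / n + lam * wj)
            for j, wj in enumerate(w)]
```

Iterating these two steps trains a linear SVM without any client ever uploading its data.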
To assess how third-party services are being used in online-banking portals, we present a survey of German banks, analysing where third parties are included in online-banking pages, what is being loaded, and who these third parties are. We then assess the specific security and privacy implications of these practices.
Blocking all ads faster than the blink of an eye. In this study we present a detailed analysis of the performance of some of the most popular content-blocker engines: uBlock Origin, Adblock Plus, Brave, DuckDuckGo and Cliqz/Ghostery’s advanced adblocker.
In collaboration with Malin Eiband and Prof. Dr. Heinrich Hußmann from LMU, Munich.
The collection of personal data and the creation of user profiles to monetize products is a common business model in the online world. Comprehending how and why their data is processed is often impossible for users, which frequently results in a feeling of privacy violation.
The presented master thesis investigates MyOffrz, a privacy-preserving advertisement system developed by Cliqz. MyOffrz is based on client-side data processing and thus does not track users and does not create personal profiles. The challenge arises when it comes to mental models: users who have already encountered many advertisement systems online transfer this prior knowledge to the new system as well.
In this thesis we define a framework for educating users about MyOffrz, which consists of three stages: pre-interaction, interface properties and informational content. We focus on informational content and evaluate a prototype explaining the value, business goal, data flows and underlying algorithm of MyOffrz. Our results indicate that the tested design was effective in changing users' mental models of the system. Moreover, we found that gamification elements in explanations are well received by users, that users like to be in control, and that they want explanations to be as concise as possible. We could also observe a connection between technology proficiency and privacy attitudes: users who have more knowledge about technology tend to be more privacy-concerned, while those who are less proficient tend to also be less concerned. We derive several product-related findings that were later implemented.
In collaboration with Prof. Dr. Florian Alt from LMU, Munich.
Internet users regularly need to re-find information or content that they looked at in the past. In some cases, these revisitations take place weeks after the initial visit. Long-term revisitations, also called rediscoveries, are often time-consuming, prone to failure and require high mental effort. Existing research showed that current browsers poorly support this activity, requiring users to rely on less efficient strategies, such as re-creating queries or re-tracing previous browsing paths, to find the desired information.
In two formative studies, I confirmed the existing findings and showed that, on average, rediscoveries take about the same time as the initial search for the information, that users often fail because of trouble identifying pages, and that users are unable to make use of contextual memories.
These insights led me to the development of the Cliqz Browsing History, which acts as a replacement for the browser’s history list. Common user behaviors and memories are directly supported by grouping the history into sessions, by showing context and by providing a searchable query history. Additionally, users are able to explore previous browsing paths and recognize pages using mouseover previews.
To evaluate the developed tool, I conducted a user study, which confirmed the benefits of the underlying concepts, with a promising performance increase after continued usage and users needing significantly fewer page visits for successful rediscoveries.
The Human Face of Big Data - Organisation and Scalability of Manual Testing for Big Data Applications
November edition - Two Faces of Big Data
High-Dimensional Nearest Neighbor Search - Algorithm Itself, Applications, Difficulties and a Few Existing Solutions
November edition - Two Faces of Big Data
How Adblockers Work
Hacktoberfest Meetup 2019
Blocking Ads at Scale
Adblocker Summit 2019
Watching Them Watching Us
Decentralized Internet and Privacy devroom FOSDEM 2018
This talk was given in collaboration with Trackula.org.
CVE-2019-17004: Semi-Universal XSS in Firefox iOS
Trouble With OCSP: Side Channel Leaks in OnionBrowser
This post describes a side-channel information leak present in OnionBrowser's OCSP requests. Many details of the OCSP protocol are omitted.
Tracking Bitwarden Firefox Add-on Users
Unlike Chrome extensions, Firefox web extensions are assigned a UUID that is specific to each user. Whenever a user installs Bitwarden on Firefox, the browser therefore generates a different extension ID for that user.
The problem occurs when Bitwarden prompts the user with the message: “Should Bitwarden remember this password for you?”.
Because this prompt loads a local resource from the extension, the extension's UUID is exposed to the page. Since this UUID is unique to each user, it is a potential user ID that can be used for tracking.
Brave Browser: Tor Bypass by Different URI Schemes
Sites can return malicious URLs in Location headers, such as xmpp://<domain>. On Mac, when a Location header with such a value is received, Brave does not ask for permission to open another app; instead, it opens it in the background, leaking the domain name over the clearnet.
Brave Browser: Tor Bypass While Loading Favicons
When visiting websites in a Brave Tor tab, some calls to fetch favicons are made over the normal network, leaking the domains the user is visiting over the clearnet.
Chromium Browsers: Domains and URLs Persisted on Disk Forever
When a website uses any of LevelDB, Service Workers or push notifications, some information is stored on the disk. Because of the way LevelDB compacts data, the data remains on the disk even in the event of explicit history clearing.
CVE-2018-12400: Favicons Are Cached in Private Browsing Mode on Firefox for Android
In private browsing mode on Firefox for Android, favicons are cached in the cache/icons folder just as they are in non-private mode. This enables information leakage of sites visited during private browsing sessions.
Note: this issue only affects Firefox for Android. Desktop versions of Firefox are unaffected.
Privacy Issue on bing.com and Other Microsoft Sites
Popular sites like bing.com, microsoft.com and office365.com leaked an identifier that could be used to deanonymize users. Microsoft acknowledged and fixed this issue.
CVE-2017-7843: Web Worker in Private Browsing Mode Can Write IndexedDB Data
When Private Browsing mode is used, it is possible for a Web Worker to write persistent data to IndexedDB and uniquely fingerprint a user. IndexedDB should not be available in Private Browsing mode, and this stored data persists across multiple private browsing sessions because it is not cleared when exiting.
CVE-2016-5288: Web Content Can Read Cache Entries
A security issue in Firefox Desktop allows web content to access information in the HTTP cache if e10s is disabled. This can reveal some visited URLs and the contents of those pages. The issue affects Firefox 48 and 49.