What are third-party cookies? Why privacy on the web is an illusion
In this post, Konstantin, Chief Software Engineer II, and Vladimir, Senior Data Scientist, discuss the state of privacy on the web today, the coming deprecation of third-party cookies, the mechanisms behind user tracking, and fingerprinting.
Introduction
What are third-party cookies? How do they aid in obtaining user data? Our experts, Konstantin Perikov, Chief Software Engineer II, and Vladimir Sergeev, Senior Data Scientist, shed some light on modern privacy on the web, Google's planned deprecation of third-party cookies, the mechanisms of user tracking, and fingerprinting.
Mechanisms of user tracking
1x1 tracking pixel. This is a tiny pixel-sized image that can be hidden anywhere, from a web banner to an email. Tracking pixels allow the back end to track user behavior, site conversions, web traffic, and other metrics.
Beacon API. The Beacon API lets a page send small amounts of data to a server without waiting for a response, typically to report a user's activity as the user leaves the page. Here is an example of JS code that can send data from your web beacon:
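A minimal sketch of such a beacon, assuming a hypothetical `/collect` endpoint on the analytics server; `buildBeaconPayload` is an illustrative helper, not part of the Beacon API:

```javascript
// Build a compact analytics payload for a web beacon.
// (buildBeaconPayload is a hypothetical helper for illustration.)
function buildBeaconPayload(page, timeOnPageMs) {
  return JSON.stringify({ page, timeOnPage: timeOnPageMs });
}

// In the browser, flush the payload when the page becomes hidden.
// navigator.sendBeacon() queues the request and returns immediately,
// so it is delivered even while the page is unloading.
if (typeof window !== "undefined" && navigator.sendBeacon) {
  window.addEventListener("visibilitychange", () => {
    if (document.visibilityState === "hidden") {
      navigator.sendBeacon(
        "/collect", // hypothetical analytics endpoint
        buildBeaconPayload(location.pathname, performance.now())
      );
    }
  });
}
```

Because `sendBeacon()` doesn't wait for a response, it is well suited to fire-and-forget analytics that must survive page navigation.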
Account tracking. Many websites track user activity through user accounts. For example, once you're logged into Facebook, Facebook knows which pages you visit and what you like.
Use cases of user tracking
User tracking has its pros and cons. On the bright side, it lets users receive relevant ads, content, and eCommerce products that match their activities and interests. It also helps companies measure revenue streams, monitor site usability, and gain insight into user behavior.
The mechanism of user tracking, however, remains unclear to most users: the majority don't realize they're being tracked, or for what purposes. Non-tech-savvy users are unable to turn tracking off, which can degrade their experience.
User data collected by tracking can be shared with third parties or sold for profit without user consent, and can contribute to a variety of cybersecurity threats.
Fingerprinting
This technique originated in the late nineties and has gained momentum in the wake of third-party cookies' departure. It draws on all the information that can be gathered from a user's interaction with a web browser. Fingerprinting is entirely legal in most jurisdictions; even strict rules like the GDPR are, in practice, applied mostly to consent for cookie tracking rather than fingerprinting.
A fingerprint is a unique identifier derived from the configuration of the user's web browser and operating system. It is built from information about the software and hardware of the user's device for the purpose of identification (think MAC address, IP address, and many others). Companies and network providers also use browser fingerprints to prevent fraud and identity theft.
Here are typical fingerprint types and methods of tracking:
- Browser version
- Browser plugins/extensions
- Hardware properties
- Font metrics
- Canvas and WebGL fingerprinting
- Audio fingerprinting
- Benchmarks
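To make one of these concrete, canvas fingerprinting can be sketched as follows: render fixed text and shapes to an off-screen canvas, then hash the resulting pixels. The drawing commands and the hash function below are illustrative choices, not a standard; real trackers use more elaborate scenes and stronger hashes.

```javascript
// Simple 32-bit FNV-1a hash, for illustration only.
function hashString(s) {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = (h * 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Sketch of canvas fingerprinting (browser-only): tiny rendering
// differences across GPU, driver, and font stacks make the hash
// fairly stable per device.
function canvasFingerprint() {
  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d");
  ctx.textBaseline = "top";
  ctx.font = "14px Arial";
  ctx.fillStyle = "#f60";
  ctx.fillRect(0, 0, 100, 30);
  ctx.fillStyle = "#069";
  ctx.fillText("fingerprint test", 2, 2);
  // toDataURL() serializes the rendered pixels to a string we can hash.
  return hashString(canvas.toDataURL());
}
```

Because the hash depends on how this exact machine rasterizes text, it identifies the device without storing anything on it — which is why privacy-focused browsers randomize or block canvas reads.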
As you can see, even when cookies are not being used, users can still be tracked and targeted. Methods like switching to incognito mode, enabling a VPN or an ad blocker, and clearing cookies or search history don't prevent fingerprinting. A privacy-focused browser like Brave or Tor can distort some of the stronger fingerprinting techniques, such as canvas or audio fingerprinting.
In the next section, we'll review a data science-based use case we implemented for one of the EPAM clients.
Data science approaches behind fingerprinting
Now that we've addressed the fingerprinting concept, let's review data science approaches to solving fingerprinting problems, starting with the problem itself from a data science perspective. Once we've gathered the user-agent string and any available information about fonts, canvas rendering, attached devices, media codecs, and so on, we want to answer one question: who is the user behind this set of features?
Hashing approach
One of the simplest solutions is to calculate a hash over the collected feature values. If two hashes match, we've found our user: a row in our history with the same hash suggests the two records belong to the same person, letting us accumulate more information about that user.
However, this approach has limitations. First, consider the "small changes problem": it occurs when features can change without any meaningful change in the target. Say we're analyzing a user-agent string, which contains information about the browser, the operating system, and the device. Different users will most likely have different devices or browsers, but what about browser versions or minor OS patches? A browser version can bump by an insignificant amount without any action on the user's side. Keeping this in mind, we (or our model) should be prepared for unexpected changes.
At the modeling stage, the simplest fix could be to drop that feature, hoping the model will "understand" that the exact minor version of a web browser is not descriptive. At the data preparation step, though, we'd keep that information, because we may use it later for other purposes, such as calculating distances between users. To address the small-changes issue, we can use hashing techniques that keep similar inputs close in the hash space (SimHash, for example). The trade-off is that the hashed values aren't human-readable and are hard to analyze.
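A minimal SimHash sketch (32-bit for brevity; real implementations typically use 64 bits and weighted tokens). Each token votes on every bit of the output, so inputs that share most tokens end up with hashes that differ in only a few bit positions:

```javascript
// 32-bit FNV-1a token hash, for illustration only.
function fnv1a(s) {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = (h * 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// tokens: array of feature strings, e.g. parts of a user-agent.
function simhash(tokens) {
  const votes = new Array(32).fill(0);
  for (const t of tokens) {
    const h = fnv1a(t);
    // Each token votes +1/-1 on every bit position.
    for (let b = 0; b < 32; b++) {
      votes[b] += (h >>> b) & 1 ? 1 : -1;
    }
  }
  let out = 0;
  for (let b = 0; b < 32; b++) if (votes[b] > 0) out |= 1 << b;
  return out >>> 0;
}

// Hamming distance between two simhashes: how different the inputs are.
function hamming(a, b) {
  let x = (a ^ b) >>> 0, d = 0;
  while (x) { d += x & 1; x >>>= 1; }
  return d;
}
```

Two user-agents that differ only in a patch version share most tokens, so their simhashes stay within a small Hamming distance of each other — unlike a cryptographic hash, where any change flips the output entirely.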
Classification approach
We'll review this option using our original problem statement: we gathered some features and try to predict which user they belong to. A classification approach may seem simple to anyone with at least a little data science experience. With millions of users, however, classification becomes a problem with millions of classes, and the task turns into a real challenge. After the user classification step, we may want to identify user preferences based on the other data we gathered; the most common business case here is detecting user interests in order to serve more relevant advertisements.
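At its simplest, identification-as-classification can be sketched as a nearest-centroid classifier: each known user is a class represented by the average of their past feature vectors. Numeric features are assumed here for illustration; a real pipeline would first encode categorical fingerprint features.

```javascript
function squaredDistance(a, b) {
  return a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0);
}

// centroids: { userId: featureVector }; x: a new observation.
// Returns the userId whose centroid is closest to x.
function classifyUser(centroids, x) {
  let bestUser = null;
  let bestDist = Infinity;
  for (const [user, c] of Object.entries(centroids)) {
    const d = squaredDistance(c, x);
    if (d < bestDist) {
      bestDist = d;
      bestUser = user;
    }
  }
  return bestUser;
}
```

With millions of users this linear scan is infeasible, which is one reason to reach for approximate nearest-neighbor indexes or the clustering approach instead.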
Clustering approach
The clustering approach helps with the enormous number of users. Say we have a huge number of users: we split them into groups with common interests, relying on the assumption that similar users have similar features. That's how we cluster our users. This approach works well when we need to model user preferences: for example, we may find that a user is close to the cluster of males aged 25-34 with an interest in car advertisements.
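A minimal k-means sketch over numeric feature vectors (naive seeding from the first k points and a fixed iteration count, for illustration only; library implementations use k-means++ seeding and convergence checks):

```javascript
// Index of the centroid closest to point p.
function nearest(centroids, p) {
  let best = 0, bestD = Infinity;
  centroids.forEach((c, i) => {
    const d = c.reduce((s, v, j) => s + (v - p[j]) ** 2, 0);
    if (d < bestD) { bestD = d; best = i; }
  });
  return best;
}

function kmeans(points, k, iters = 10) {
  // Naive seeding: copy the first k points as initial centroids.
  let centroids = points.slice(0, k).map((p) => p.slice());
  let labels = new Array(points.length).fill(0);
  for (let it = 0; it < iters; it++) {
    // Assignment step: attach each point to its nearest centroid.
    labels = points.map((p) => nearest(centroids, p));
    // Update step: move each centroid to the mean of its members.
    for (let c = 0; c < k; c++) {
      const members = points.filter((_, i) => labels[i] === c);
      if (members.length === 0) continue; // keep empty clusters as-is
      centroids[c] = members[0].map((_, d) =>
        members.reduce((s, m) => s + m[d], 0) / members.length
      );
    }
  }
  return { centroids, labels };
}
```

Instead of predicting one user out of millions, we now only need to place a new fingerprint into one of a few hundred clusters and serve the ads associated with that cluster.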
Recommendation approach
After the approaches described above, the recommendation approach may seem a bit complicated. However, it is still appropriate in certain cases.
The recommendation approach only works with information about the interest categories we want to predict, which is another challenge in itself. In the simplest setup, as we collect fingerprinting features, we also collect information about visited pages (predicting a page's topic is a separate challenge, and a topic for another discussion). Once we've gathered the data, we can build a user-category matrix, apply collaborative filtering (or another suitable model), and try to find similarities between users or predict which categories might interest a specific user.
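The user-category matrix idea can be sketched with plain cosine similarity between users' category-interaction rows. The categories and counts below are invented for illustration:

```javascript
// Cosine similarity between two interaction vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rows: users; columns: interest categories, e.g. cars, sports, travel.
// Each cell counts visits to pages in that category (toy numbers).
const userCategory = [
  [5, 0, 1], // user A: mostly cars
  [4, 1, 0], // user B: similar to A
  [0, 3, 5], // user C: different interests
];
```

With this matrix in hand, a user's missing category scores can be filled in from the rows of their most similar neighbors — the heart of user-based collaborative filtering.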
This solution already sounds complicated, and there are hidden difficulties as well. One of them is the time and memory consumption of the recommender model: if we want to predict the interests of a user who just arrived at our site and show them relevant advertisements, we need to make predictions on the front end while the page is loading.
In the two previous approaches, we could use simple models that are easy to implement in JavaScript. A recommender system, by contrast, usually has to either store big matrices inside our script or make recommendations with neural networks, and neural networks may not be the best solution in terms of recommendation quality. There are more advanced approaches, such as factorization machines, but these usually need more data preparation, bigger matrices, a neural network, or all of the above.
To recap
We've discussed how users are tracked on the web and the data science approaches behind fingerprinting. Each approach has its own pros, cons, and challenges.
As you can see, no one can assume that the web is a safe place, but there are protection mechanisms and safety measures that can protect our data without degrading the user experience. Stay tuned for more content on this topic from our team.
Contributed by Konstantin Perikov, Chief Software Engineer II at EPAM in collaboration with Vladimir Sergeev, Senior Data Scientist at EPAM.