
SOCKS: The Digital Data Cleaner by Dr. Prasanta Pal, Brown University






Figure 1: Far side of the Moon as seen by the Luna-3 mission in 1959 (A) and by the NASA LRO mission in 2009 (C). (B) is curated from (A) to reveal and compare pristine feature points against the LRO gold-standard image (C).


What is truth? Hmm... 🤔 If you hear the sound 🕪 of a frog 🐸 in the living room, the sound may come from a National Geographic documentary on the TV 📺, from the garden 🏡🏡🏡, or, on one of those rare occasions, from Mr. 🐸 himself, who managed to make it into the living room! It is clear from the 🐸 story that merely noticing some evidence may not lead us to the original truth without additional information or constraints. But for a given level of evidence, there is almost always a best possible way to recover the most pristine version of the truth! Truth with versions? However outrageous it may seem, new insights from data science make it clear that truth, at least the way it is often revealed, is, as a matter of fact, versioned!

Truth is derived from data. In the age of ubiquitous digital technology, data is the king (🦁)! While more data implies more information, an overabundance of it gets your head spinning! Given lots of data, clarity about the system that underlies the data is really what we are looking for. Information is only as good as the clarity around it! On the left side of Fig. 1 we see the grainy image of the far side of the Moon taken by the Russian Luna-3 mission in 1959. It was a great achievement in those days, but the image itself is anything but clear. The later NASA LRO mission brought that clarity with a better-resolution camera. Now, how can we reveal the cleaner part of the Luna-3 image to make it comparable with the one taken later by the NASA LRO mission?

This is a typical scenario where SOCKS can be enormously useful. Have a look at the comparison at this link.


In recent years, there has been a lot of buzz about digital data because it is relevant in almost every aspect of life. The scope of digital data ranges from your favorite Facebook post, brain MRI scans, and EKGs to financial transactions or even cellular call records. However, one thread that binds them all is the existence of a ghost-like (👻) feature deeply embedded in almost all forms of measured data. The existence of this 👻, often ignored unknowingly (or knowingly), is of enormous importance for understanding the core features of modern scientific measurement apparatus. Of particular importance is the truest possible understanding (often hypothetical) of the behavior of the underlying system under equilibrium (and often ergodic) or steady-state conditions. This mysterious 👻 is the presence of noise and outliers inherently infused with the measured data, which more often than not contaminates the pristine, steady signature of the underlying system.


Figure 2: A sudden jaw-clench (red circle) outlier on a steady-state EEG data train

In real life we encounter outliers on a regular basis. A sudden call drop during an otherwise pleasantly smooth phone conversation, a honk 🚗 from the car behind us during a wonderfully ecstatic countryside joy ride, a sudden hailstorm amidst a hot summer day, a white spot-like lesion on a routine MRI brain scan, a sudden adrenaline rush racing our heartbeat at the climax of a thriller movie, or a casual jaw clench during a regular EEG brain scan registered as a spike in the data, as shown in Fig. 2, are only some of the many examples of data contamination by noise and outliers.

Very often, it is just not enough to detect an outlier. The bigger question is how to get rid of it most efficiently with reasonable but limited knowledge about the underlying system. Alternatively, we can simply ask: what would life have been like if the outliers were not there in the first place? The problem is even more interesting when outlier-ridden events happen in a real-time data-acquisition system at varied time or length scales, such that we do not have much prior knowledge about the system.
Interestingly, outliers, although by definition out-of-place citizens, may not always be trivial to recognize with our senses in various commonplace situations. For example, when we look at an image, we tend to pay attention (as our visual perception is stimulus driven) only to the relatively brighter sections of the image (or the opposite), which define its characteristic feature points in our common perception. This limitation of our sensory perception leads us to implicitly discard image or data features with subtler intensity profiles, even though they possibly carry more intriguing underlying features than the higher-intensity points.

Figure 3: Panoramic view of the Milky Way galaxy

As a concrete example, let's have a look at the panoramic view of the Milky Way galaxy in Fig. 3, where only the bright mid-region, as measured by numerous telescopes and cameras, is clearly visible. Merely looking at the image gives us the false impression that the most important region is the middle horizontal patch. Due to our stimulus-driven visual perception mechanisms, the bright spots skew our perception of our favorite galaxy, and we possibly do not even register the darker quantiles of the image. Even if we invert the colors of the image, we have exactly the opposite problem. So looking at an image as-is is, in my opinion, not a "balanced view" of the world around us, and it is particularly misleading in the context of scientific discoveries. A more "appropriate view" would be to perceive the panorama after multiple scales of outlier removal as well as curation; a small sketch of this idea follows.
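As a rough illustration of what curating the darker quantiles might look like, here is a minimal sketch (not SOCKS itself; the quantile cut-offs, the synthetic "sky", and the function name quantile_view are assumptions for demonstration) that treats the extreme intensity quantiles as outliers and re-stretches the remaining range:

```python
import numpy as np

def quantile_view(image, lo=0.05, hi=0.95):
    """Re-map a grayscale image so that mid-quantile structure,
    normally drowned out by the brightest pixels, becomes visible.
    `lo` and `hi` are illustrative cut-offs, not SOCKS parameters."""
    low, high = np.quantile(image, [lo, hi])
    clipped = np.clip(image, low, high)            # treat extreme quantiles as outliers
    return (clipped - low) / (high - low + 1e-12)  # re-stretch the remaining range

# Example: a synthetic "sky" with a saturating bright patch.
rng = np.random.default_rng(0)
sky = rng.normal(0.2, 0.05, (256, 256))            # faint background structure
sky[100:110, 100:110] = 10.0                       # bright outlier region skews perception
balanced = quantile_view(sky)                      # subtler structure is now visible
```

The point is not the particular cut-offs but the shift in viewpoint: the brightest pixels are demoted from "the signal" to one quantile band among many.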

Figure 4: S&P 500 market data and the onset of the Covid-19 pandemic

If the stock market is an indicator of economic activity, it also captures a lot of surprises on top of the underlying economic realities. While upward rallies are measures of positive outcomes, downward swirls are indicative of economic gloom. Over a "reasonable" time scale, a fair measure of the equilibrium of the market data can only be obtained by first removing the time-scale-sensitive outliers and calculating the equilibrium value afterwards. Great, theoretically sound financial models are often applied to real-world data like that in Fig. 4, where, more often than not, the underlying assumptions behind those theories are grounded in the equilibrium nature of the data. So applying the right equilibrium theory to the wrong data is completely preposterous! The sketch below illustrates the idea.
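To make "remove the time-scale-sensitive outliers first, compute the equilibrium value afterwards" concrete, here is a minimal sketch with pandas; the synthetic price series, the quarterly window, and the 5 × MAD rule are illustrative assumptions, not SOCKS's actual protocol:

```python
import numpy as np
import pandas as pd

# Hypothetical daily index values with a pandemic-style shock.
rng = np.random.default_rng(1)
prices = pd.Series(3000 + np.cumsum(rng.normal(0, 5, 500)))
prices.iloc[300:320] -= 800                      # transient crash (outlier regime)

window = 63                                      # ~one quarter of trading days (illustrative)
rolling_med = prices.rolling(window, center=True).median()
deviation = (prices - rolling_med).abs()
mad = deviation.rolling(window, center=True).median()

is_outlier = deviation > 5 * mad                 # time-scale-sensitive flagging
equilibrium = prices.mask(is_outlier).mean()     # equilibrium estimate after removal
naive = prices.mean()                            # contaminated estimate, for comparison
```

Changing the window changes which swings count as outliers, which is exactly why the flagging is called time-scale sensitive.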

For the above-mentioned problems, there are no open-source tools yet available, although ad-hoc solutions exist here and there to address discrete situations, such as removing or ignoring bad segments of data. SOCKS is designed to address these issues through a simple two-step process: 1) flagging of outliers based on a predetermined threshold, and 2) curation of the flagged data points based on a predetermined protocol (e.g., interpolation, the mean or median of neighborhood points, or even a local region-specific mesh model). A sketch of the two steps follows. The goal is to turn SOCKS into a full-fledged outlier detection and curation software package with a vibrant community around it. SOCKS will be made available online and offline to solve a wide range of scientific and real-world data-science problems.
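Since the SOCKS code itself is not reproduced here, the following is a minimal sketch of the two-step process under stated assumptions: a robust (MAD-based) z-score threshold for step 1, and interpolation or a neighborhood median for step 2. The function names flag_outliers and curate are hypothetical:

```python
import numpy as np
import pandas as pd

def flag_outliers(x: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Step 1: flag points whose robust z-score exceeds a preset threshold."""
    med = x.median()
    mad = (x - med).abs().median()
    robust_z = 0.6745 * (x - med) / (mad + 1e-12)
    return robust_z.abs() > threshold

def curate(x: pd.Series, flags: pd.Series, protocol: str = "interpolate") -> pd.Series:
    """Step 2: repair the flagged points with a predetermined protocol."""
    cleaned = x.mask(flags)                       # blank out the flagged samples
    if protocol == "interpolate":
        return cleaned.interpolate(limit_direction="both")
    if protocol == "median":                      # neighborhood-median fill
        return cleaned.fillna(cleaned.rolling(5, center=True, min_periods=1).median())
    raise ValueError(f"unknown protocol: {protocol}")

# Usage: an EEG-like trace with a jaw-clench-style spike.
signal = pd.Series(np.sin(np.linspace(0, 20, 1000)))
signal.iloc[500:505] += 8.0                       # the outlier event
flags = flag_outliers(signal)
pristine = curate(signal, flags)                  # the curated data
noise_part = signal[flags]                        # the separated noisy elements
```

Note that the flagged points are not discarded: the same pipeline yields both the curated series and the noisy elements, kept aside for separate study.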

Some of the data-driven disciplines where the SOCKS technology is suitable for immediate integration are EEG, EKG, MRI, ECHO and various other bio-imaging modalities; financial and other time series; astrophotography; microscopy images; internet network-traffic monitoring; fraud detection systems; etc.

Is there a "philosophy" behind SOCKS?


While SOCKS has been conceived and implemented as a data-sanitizer tool, the deeper philosophy behind its design is that real-world data can be made more useful by systematically curating it through sanitization pipelines. There should be minimal preliminary assumptions about the data: a discovery process is needed first to identify (or flag) noisy feature points (or outliers), followed by appropriate curation of those flagged points. From a design standpoint, a stream of data is no different from the water-supply pipelines of a city. Should we drink water directly from the source, or clean it for better potability? A radical shift in our traditional thought processes is necessary to make many existing data-science tools more reliable, with better predictive power, by curating the data before feeding it into those tools.

It should be noted that noise and outliers often contain information no less important than the pristine data behind them; they simply capture a different flavor of information. Also, in the real world there is no such thing as an ad-hoc, all-or-nothing notion of outlier or noise in the data. SOCKS democratizes this space by adding notions of depth and degree to the problem, as defined by the curation parameters. So we should speak in terms such as soft noise, deep noise, etc.

Another important concept is that of equilibrium in real-world data. Most signal processing tools, like the Fast Fourier Transform (FFT), are presumed to be relevant only to data collected under equilibrium conditions. However, due to the widespread availability of FFT tools, we use them in situations where the data is far from equilibrium. We often overlook the fact that almost any non-trivial time series is FFT-able, since the FFT is simply a mathematical input-output transform. But applying it is not a fair thing to do in many (if not most) practical situations, not only because of the Gibbs ringing phenomenon, but also because, when consecutive data frames are completely out of sync with each other, the notions of equilibrium and ergodicity are completely broken and the FFT becomes problematic, as the sketch below illustrates.
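As a small demonstration of that last point, this sketch glues together two frames of an assumed 50 Hz tone that are out of phase with each other (the phase jump stands in for out-of-sync data frames) and applies NumPy's FFT to both the clean and the glued signals:

```python
import numpy as np

fs = 1000                                         # sampling rate, Hz (illustrative)
t = np.arange(0, 1, 1 / fs)
tone = np.sin(2 * np.pi * 50 * t)                 # a clean 50 Hz tone

# Two consecutive frames recorded out of sync: the second starts mid-cycle.
frame_a = tone[:500]
frame_b = np.sin(2 * np.pi * 50 * t[:500] + 2.0)  # phase-shifted continuation
glued = np.concatenate([frame_a, frame_b])        # equilibrium assumption broken

spectrum_clean = np.abs(np.fft.rfft(tone))        # one sharp line at 50 Hz
spectrum_glued = np.abs(np.fft.rfft(glued))       # energy leaks across many bins
# The glued spectrum smears the single 50 Hz line into broadband leakage:
# the FFT "works" mechanically, but its output no longer reflects the system.
```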

Is SOCKS relevant to "Machine Learning Algorithms"?

While SOCKS itself has not been designed as an ML algorithm in the traditional sense, any ML training problem would immensely benefit from its core philosophy of curation. Since ML outcomes are no better than the data that trains the models, it makes logical sense to train on pristine data and to do the same for the prediction data sets. For the sake of argument, if raw data sets make ML-based predictions possible in certain situations, curated data would enhance the validity of the core ML algorithms many-fold. A lot of computational cost can also be saved due to the more rapid convergence (of SGD, for instance) with pristine data sources, as the sketch below suggests.
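As a toy illustration (the synthetic data, the 5 × MAD flagging rule, and the plain SGD fit are all assumptions for demonstration, not a SOCKS pipeline), curating gross label outliers before training pulls a least-squares SGD fit back toward the true slope:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 500)       # true slope is 3.0
y[X[:, 0] > 2.0] -= 40.0                          # gross outliers at the tail

def sgd_fit(X, y, lr=0.01, epochs=20):
    """Plain least-squares SGD on y ~ w * x; sensitive to label outliers."""
    w = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X[:, 0], y):
            w += lr * (yi - w * xi) * xi          # gradient step on squared loss
    return w

# Curate first (the same flag-and-curate idea as above), then train.
med = np.median(y)
mad = np.median(np.abs(y - med))
keep = np.abs(y - med) < 5 * mad                  # illustrative flagging rule
w_raw = sgd_fit(X, y)                             # dragged well below 3.0 by the outliers
w_curated = sgd_fit(X[keep], y[keep])             # close to the true slope again
```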

Mission Statement

SOCKS will clean your messy data, revealing the pristine nature of the underlying truth, and will separately provide you with the noisy elements.
🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹🧹
😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁

Use cases (the following Wikipedia edits were enabled by SOCKS):



© 2021 Dr. Prasanta Pal
