New algorithm easily shrinks enormous data sets, while preserving their mathematical properties

Solution makes big data analysis computationally practical

If you work in the tech sector, chances are, you’ve been exposed to spreadsheets at one point or another. 

Spreadsheet
The spreadsheet is a super convenient way to organize data into tables for quick reference and easy updates. The problem of late, though, is the increased demand for data, which has led to enormous tables consisting of – in some cases – hundreds of millions of rows. 

Big data
In order to make big data analysis computationally practical, the size of these data tables, or matrices as they’re also known, needs to be reduced; this is often done by leaving out a bunch of rows. The trick is that the remaining rows have to be, in some way, shape, or form, representative of the ones that were omitted, so that computations performed on them yield approximately the right results. 

Figuring out which rows to eliminate and which to keep is an incredibly complicated process, which makes the simpler solution proposed by MIT researchers all the more impressive. Their algorithm finds the smallest possible approximation of the original matrix that still guarantees reliable computations. 

Big data shrink
The way it works is that the algorithm figures out how well a given row of the condensed matrix represents the original matrix, and it does this by measuring the “distance” between them. 

Now, there are different ways to determine “distance”. One is the “Euclidean distance”, in which the differences between the entries at corresponding positions in the two rows are squared and added together, and the distance between the rows is the square root of the resulting sum. The foundation of this approach is the Pythagorean theorem: the square root of the sum of the squares of the lengths of a right triangle’s legs gives the length of the hypotenuse.

Another measure of distance is the lesser-known “Manhattan distance”, which turns out to be more useful for many machine-learning problems. It is simply the sum of the absolute differences between the corresponding entries in the two rows. 

Both distances are instances of what statisticians call “norms”. The Manhattan distance, or 1-norm, is the first root of the sum of the differences raised to the first power, and the Euclidean distance, or 2-norm, is the square root of the sum of the differences raised to the second power. For those curious, the 3-norm is the cube root of the sum of the differences raised to the third power, and so on, and so on. 
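To make that concrete, here is a minimal sketch of those calculations in Python; the two rows are made-up values used purely for illustration:

```python
import numpy as np

# Two hypothetical matrix rows, used only for illustration
row_a = np.array([3.0, 7.0, 1.0, 4.0])
row_b = np.array([1.0, 5.0, 2.0, 8.0])

diff = row_a - row_b

# Manhattan distance (1-norm): the sum of the absolute differences
manhattan = np.sum(np.abs(diff))

# Euclidean distance (2-norm): the square root of the sum of squared differences
euclidean = np.sqrt(np.sum(diff ** 2))

# General p-norm: the p-th root of the sum of absolute differences raised to the p-th power
def p_norm(x, p):
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(manhattan, euclidean, p_norm(diff, 3))  # 1-norm, 2-norm, and 3-norm of the same difference
```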

The aforementioned MIT researchers, who include Richard Peng, a postdoc in applied mathematics, and Michael Cohen, a graduate student in electrical engineering and computer science, demonstrated that their algorithm is optimal for condensing matrices under any norm, though, according to Peng, the only one they really cared about was the 1-norm. 

When performing a matrix condensation, regardless of the norm, the first step is to assign each row of the original matrix a “weight”. A row’s weight represents the number of other rows it’s similar to, and it determines the likelihood that the row will be included in the condensed matrix. If it is included, its values will be scaled up according to its weight. Case in point: if 10 rows are good stand-ins for one another, but not for any other rows within the matrix, each will have a 10% chance of getting into the condensed matrix. If one of them does, its entries will all be multiplied by 10, so that it reflects the contribution of the other nine rows it’s standing in for. 
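A toy sketch of that sampling step might look like the following. This is not the researchers’ actual procedure; it just illustrates the keep-with-some-probability-then-rescale idea described above, and it assumes the weights have already been computed somehow:

```python
import numpy as np

def condense(matrix, weights, rng=None):
    """Toy row sampler: a row with weight w is treated as standing in for
    roughly w similar rows, so it is kept with probability 1/w and, if kept,
    its entries are multiplied by w to make up for the rows it represents."""
    rng = rng or np.random.default_rng(0)
    kept = []
    for row, w in zip(matrix, weights):
        if rng.random() < 1.0 / w:   # e.g. weight 10 -> a 10% chance of being kept
            kept.append(row * w)     # rescale to reflect the other similar rows
    return np.array(kept)

# Ten nearly identical rows, each with weight 10, as in the example above
matrix = np.tile([1.0, 2.0, 3.0], (10, 1))
weights = np.full(10, 10.0)
print(condense(matrix, weights))
```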

Now, while the Manhattan distance is by and large simpler than the Euclidean distance, it makes calculating rows’ weights more difficult. In the past, the best algorithm for condensing matrices under the 1-norm yielded a matrix whose number of rows was proportional to the number of columns of the original matrix raised to the power of 2.5. Under the 2-norm, though, the best algorithm yields a matrix whose number of rows is proportional to the number of columns of the original matrix multiplied by its own logarithm. 

To give a practical example: under the 1-norm, a matrix with 100 columns would be condensed to a matrix with hundreds of thousands of rows. Under the 2-norm, it would be condensed to a matrix with a couple of hundred rows. As the number of columns increases, so does this discrepancy. 
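As a back-of-the-envelope check of those figures (ignoring constant factors and the base of the logarithm, which the bounds above don’t pin down):

```python
import math

columns = 100

# Old 1-norm bound: rows proportional to (number of columns) ** 2.5
rows_1norm_old = columns ** 2.5           # 100,000 rows

# 2-norm bound: rows proportional to columns * log(columns)
rows_2norm = columns * math.log(columns)  # roughly 460 rows

print(rows_1norm_old, rows_2norm)
```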

Peng and Cohen’s algorithm condenses matrices under the 1-norm as well as it does under the 2-norm, and under the 2-norm it condenses them as well as its predecessors do. That’s because, for the 2-norm, it uses the best existing algorithm; for the 1-norm, it simply runs that same algorithm five or six times. 

The real contribution of the researchers’ paper is the mathematical proof that the 2-norm algorithm will yield reliable results under the 1-norm. According to Peng, an equation for calculating 1-norm weights has long been known, but “the funny thing with that definition is that it's recursive,” he explains. “So the correct set of weights appears on both the left-hand side and the right-hand side.” In other words, the weight for a given matrix row, call it “w”, is set equal to a mathematical expression that itself includes w.

“This definition was known to exist, but people in stats didn't know what to do with it,” Peng says. “They look at it and think, 'How do I ever compute anything with this?'”

What Peng and Cohen have proven is that you can start by setting the “w” on the right side of the equation equal to 1, evaluate the expression, and plug the answer back into the right-hand “w”. If you keep repeating this, you’ll very quickly converge on a good approximation of the correct value of “w”.
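In code, that kind of fixed-point iteration might look like the sketch below. It assumes one standard form of the recursive 1-norm weight equation, w_i = sqrt(a_iᵀ (Aᵀ W⁻¹ A)⁻¹ a_i), where W is the diagonal matrix of the weights themselves; the paper’s actual algorithm and its guarantees are more involved, so treat this purely as an illustration of the “plug the answer back in” idea:

```python
import numpy as np

def one_norm_weights(A, iterations=20):
    """Illustrative fixed-point iteration for 1-norm row weights.

    Assumes the recursive equation w_i = sqrt(a_i^T (A^T W^{-1} A)^{-1} a_i),
    where W = diag(w): start with every weight equal to 1, evaluate the
    right-hand side, plug the result back in, and repeat."""
    n, d = A.shape
    w = np.ones(n)                      # the "w" on the right-hand side starts at 1
    for _ in range(iterations):
        M = A.T @ (A / w[:, None])      # A^T W^{-1} A, built from the current guess
        M_inv = np.linalg.inv(M)
        # New weight for each row i: sqrt(a_i^T M^{-1} a_i)
        w = np.sqrt(np.einsum('ij,jk,ik->i', A, M_inv, A))
    return w

# A small random matrix; the weights settle down after a handful of iterations
A = np.random.default_rng(0).normal(size=(50, 5))
print(one_norm_weights(A))
```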

“It's highly elegant mathematics, and it gives a significant advance over previous results,” says Richard Karp, a professor of computer science at the University of California at Berkeley and a winner of the National Medal of Science and of the Turing Award, the highest honor in computer science. “It boils the original problem down to a very simple-to-understand one. I admire the mathematical development that went into it.”

Peng and Cohen will present their algorithm at the ACM Symposium on Theory of Computing in June. 

Download their study, “Row Sampling by Lewis Weights” via the Cornell University Library

Via MIT
