From: orourke@grendel.csc.smith.edu (Joseph O'Rourke)
Newsgroups: comp.graphics.algorithms
Subject: Re: Fast 2D neighbor point algorithm wanted
Date: 19 Feb 1996 19:15:25 GMT

In article <4fvkq5$j4g@news.csie.nctu.edu.tw>, Shih-Hsu Chang wrote:
>Is there any fast (less than O(n^2)) algorithm for finding the N closest
>points out of a set of M 2D points?

Here is one recent paper on the topic:

@article{dds-saeid-92
, author =   "M. T. Dickerson and R. L. Drysdale and J. R. Sack"
, title =    "Simple algorithms for enumerating interpoint distances and
              finding $k$ nearest neighbors"
, journal =  "Internat. J. Comput. Geom. Appl."
, volume =   2
, number =   3
, year =     1992
, pages =    "221--239"
, keywords = "enumeration, selection, Delaunay triangulation, nearest neighbors"
}

==============================================================================

From: g.sande@worldnet.att.net (Gordon Sande)
Newsgroups: sci.math.num-analysis
Subject: Re: Multi-D Clustering: quick find closest vectors
Date: 15 Nov 1998 22:17:15 GMT

In article <72n8lf$cjq$1@apakabar.cc.columbia.edu>,
shenkin@still2.chem.columbia.edu (Peter Shenkin) wrote:

>Subject: Re: Multi-D Clustering: quick find closest vectors
>From: Peter Shenkin
>Organization: MacroModel Development Group, Chemistry, Columbia U., NY, NY
>Date: 15 Nov 1998 19:05:19 GMT
>Newsgroups: sci.math.num-analysis
>
>In article <72ljtf$cr0$1@fddinewz.oit.unc.edu>,
>Greg Newby wrote:
>>I have a fairly large ( > 500,000 ) number of points, or vectors,
>>in a multidimensional space ( > 1000 dimensions).  It's
>>a real Euclidean space.
>>
>>Question: is there a way to identify the closest points to a
>>given point, withOUT needing to measure the distance to all
>>other points?
>>
>>The specific distance measure (for closest) hopefully
>>won't matter much.  I typically use an Euclidean distance,
>>but could use the dot product, cosine, or other measures.
>
>I'm partly thinking out loud here.  You need to say more
>specifically what you mean by "identifying the closest".
>For instance, do you need to find exactly the N closest,
>where you specify N in advance, or do you have an a-priori
>notion of "how close is close enough", and just need
>to find enough points within this neighborhood, or
>what?  Can you tolerate error?  For instance, is it
>OK if N=1000, but 100 of the ones you found are actually
>farther than another 100 that you missed?  Is it OK if
>you ask for 1000, but you only get 900, but these 900
>are guaranteed to be the 900 closest?
>
>Second, you state that "any" distance measure should do,
>but there are some tricks that work for some measures
>but not for others.  Here's a trick that would work for
>Euclidean distances or Manhattan distances, but certainly
>not for all distance measures.
>
>Sort the x, y and z coords of the M=500000 points into 3 lists.
>Suppose you want to find the N points closest to the point
>with coords x0,y0,z0.  Find x0, y0, z0 on the three lists.
>By exploring each list in both directions, find the N
>points from each list with the smallest absolute delta-x,
>delta-y, delta-z values from x0, y0, z0, respectively.  Sorting
>is O(M log M) -- where M is 500000 in your case --, so you're
>ahead of an N^2 computation so far.
>
>Suppose this gives you n points; we know n<=3N.  Then
>compute the n(n-1)/2 pairwise distances among these n
>points and pick the N smallest.  We know that your N
>points must be among the n selected in the preprocessing
>step, because of the definition of Euclidean and Manhattan
>distance.
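A rough sketch of that filter-then-refine idea, for anyone who wants to
try it directly.  Python/NumPy is an arbitrary choice here, and the names
(points, sorted_idx, query, N) are placeholders of mine, not anything from
the original post.

    import numpy as np

    def build_sorted_lists(points):
        # One index array per coordinate, sorted by that coordinate.
        # Built once for the whole data set, O(d * M log M).
        return [np.argsort(points[:, j]) for j in range(points.shape[1])]

    def nearest_by_filter(points, sorted_idx, query, N):
        M, d = points.shape
        candidates = set()
        for j in range(d):
            order = sorted_idx[j]              # point indices sorted by coord j
            col = points[order, j]
            hi = int(np.searchsorted(col, query[j]))   # binary search for x0
            lo = hi - 1
            # Walk outward from the query coordinate, collecting the N
            # points with the smallest |delta| on this axis.
            for _ in range(min(N, M)):
                if lo < 0:
                    pick, hi = hi, hi + 1
                elif hi >= M:
                    pick, lo = lo, lo - 1
                elif query[j] - col[lo] <= col[hi] - query[j]:
                    pick, lo = lo, lo - 1
                else:
                    pick, hi = hi, hi + 1
                candidates.add(int(order[pick]))
        # Refine: exact squared distances for the (at most d*N) candidates only.
        cand = np.fromiter(candidates, dtype=int)
        d2 = np.sum((points[cand] - query) ** 2, axis=1)
        return cand[np.argsort(d2)[:N]]

With three coordinate lists, as in the example above, this leaves at most
3N candidate points to score per query.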
>
>If N=1000, this requires computing about (3000^2)/2,
>=~ 5*10^6 distances for each query.
>
>As a heuristic optimization, you might start by looking not
>at the nearest N points on each list, but, say, the nearest
>0.5 N points, giving n<1.5N; if this gives less than N points,
>go out further.  This minimizes the number of distances you'll
>have to compute.
>
>Of course, if you use Euclidean distances, it suffices
>to compute d^2, which is faster.
>
>Depending on the nature of the database and how common
>this query form is, you could imagine storing the sorted
>list and updating it by insertion each time a new entry is added.
>Then up-to-date sorted lists would always be available for
>every new query.
>
>Aside from the above, which won't work for every distance
>measure, the following optimizations are of the nature storage vs.
>time, and should work for any measure.
>
>For each query, you could cache the N closest points, together with
>the date the list was stored.  If the same query comes in later,
>you can update it quickly by just examining database entries that
>arrived after the caching of this particular neighborhood.
>
>If you cached some finite number of these query results, it
>might save you considerable time if in fact some queries
>are considerably more common than others.  For example, if
>it is a literature database, as you imply, it is likely
>that certain "key" articles will be used as the basis for
>queries more often than others.  Naturally, you purge
>the cache on a "least-recently-used" basis.
>
>Finally, you could update cached neighbor lists in background.
>If there are M entries in the database, and the whole matrix is
>stored, the M+1st entry requires only the computation of M
>distances to fill out the new matrix.  If you do this only for
>the C cached queries, the new entry requires only C distances
>to be computed.
>
>   -P.

K-D trees solve this problem.  Setup is N log N and queries are log N.
They work in any metric with minor change in efficiency.  They were
first described in the mid-1970s in CACM and TOMS.  There are other
variants on range searching, some of which use more storage.

There have been various levels of interest in this problem over time.
The recent Computing Reviews has a review of an article in IEEE PAMI
with some extra references.

Gordon Sande

>
>--
>*** "Freedom's just another word for nothing left to lose." (B. Yeltsin)***
>*Peter Shenkin; Chemistry, Columbia U.;  shenkin@columbia.edu  (212)854-5143*
>*MacroModel WWW page: http://www.columbia.edu/cu/chemistry/mmod/mmod.html *

==============================================================================

Newsgroups: sci.math.num-analysis
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Multi-D Clustering: quick find closest vectors
Date: Sun, 15 Nov 1998 22:11:22 GMT

In article <72ljtf$cr0$1@fddinewz.oit.unc.edu>,
Greg Newby writes:
|> I have a fairly large ( > 500,000 ) number of points, or vectors,
|> in a multidimensional space ( > 1000 dimensions).  It's
|> a real Euclidean space.
|>
|> Question: is there a way to identify the closest points to a
|> given point, withOUT needing to measure the distance to all
|> other points?

The classic method is the k-d tree:

J.L. Bentley.  Multidimensional Binary Search Trees Used for Associative
Searching.  Communications of the ACM, 18(9), September 1975.

J. H. Friedman, J. L. Bentley, and R. A. Finkel.  An algorithm for finding
best matches in logarithmic expected time.  ACM Transactions on
Mathematical Software, 3(3):209--226, Sept. 1977.
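For a quick experiment with the same idea, an off-the-shelf implementation
will do; here is a minimal sketch using SciPy's cKDTree (SciPy is my own
choice, not something cited in this thread, and the array sizes are just
illustrative):

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(500_000, 3)     # toy stand-in for the real data set
    query = np.random.rand(3)

    tree = cKDTree(points)                   # one-time build, roughly O(M log M)
    dist, idx = tree.query(query, k=1000)    # the 1000 nearest points to query
    # For Manhattan distance pass p=1 to query(); p=2 (Euclidean) is the default.

Note that the logarithmic query time quoted earlier is an expected-case
figure for low-dimensional data; in spaces with hundreds or thousands of
dimensions, as in the original question, k-d tree searches tend to degrade
toward examining most of the points.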
Some other references:

Dasarathy, B.V., ed. (1991), _Nearest Neighbor (NN) Norms: NN Pattern
Classification Techniques_, IEEE Computer Society Press: Los Alamitos, CA.

M. Dickerson, R. L. Drysdale and J.-R. Sack, "Simple algorithms for
enumerating interpoint distances and finding k nearest neighbors",
Internat. J. Computational Geometry & Applications 2:3 (1992) 221--239.

P. B. Callahan and S. R. Kosaraju, "A decomposition of multi-dimensional
point-sets with applications to k-nearest-neighbors and n-body potential
fields", 24th ACM Symp. Theory of Computing, 1992, 546--556.

M. T. Dickerson and D. Eppstein, "Algorithms for proximity problems in
higher dimensions", Comp. Geom. Theory & Applications, to appear.

P. M. Vaidya, "An $O(n \log n)$ algorithm for the all-nearest-neighbors
problem", Discrete and Computational Geometry 4 (1989) 101--115.

You could also try http://wagga.cs.umass.edu/~dhirvone/research.html
after their server gets repaired.

--

Warren S. Sarle       SAS Institute Inc.     The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive       are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA    those of SAS Institute.

==============================================================================

From: Greg Heath
Newsgroups: comp.ai.neural-nets,sci.math.num-analysis
Subject: Re: Multi-D Clustering: quick find closest vectors
Date: Sun, 15 Nov 1998 20:39:54 -0500

On 15 Nov 1998, Greg Newby wrote:

> Question: is there a way to identify the closest points to a
> given point, withOUT needing to measure the distance to all
> other points?

One obvious technique is to abort the accumulation

   ||x - c_i||^2  =  D_{i,n}^2  =  SUM(j=1..n) (x_j - c_{ij})^2

after m terms, whenever the partial sum D_{i,m}^2 already exceeds
min(k=1..i-1) ||x - c_k||^2, the smallest squared distance found so far.

> The goal is to prevent needing to do all 500,000 comparisons
> in order to find the closest points.  Note that I'm typically
> interested in the closest 1000 or so, not just the single closest.

Replace "min" by "rank1000".

> Pre-computing the [ (N * N-1) / 2 ] or so inter-point distances
> is viable, if it would help, but the problem is to find the
> distance for a NEW point (one that was previously unknown).

There are other methods using this or similar information.  One is an
algorithm by Fukunaga.  Another has a name like k-d-tree (k points in
d dimensions).  Both can probably be found, among other acceleration
algorithms, in Belur Dasarathy's nearest-neighbor classification book.
I don't have the citation handy but think it might be an IEEE Press
publication.  If your library supports computerized search you
shouldn't have any problem finding it.

Greg

Hope this helps.

P.S.   Note Area-Code change from 617 to 781.
P.P.S. Note Zip-Code change from 02173 to 02420.

Gregory E. Heath    heath@ll.mit.edu      The views expressed here are
M.I.T. Lincoln Lab  (781) 981-2815        not necessarily shared by
Lexington, MA       (781) 981-0908 (FAX)  M.I.T./LL or its sponsors
02420-9185, USA

==============================================================================

Date: Tue, 17 Nov 1998 09:42:59 +0100
From: Dirk Schwammkrug
Newsgroups: comp.ai.neural-nets,sci.math.num-analysis
Subject: Re: Multi-D Clustering: quick find closest vectors

Greg Heath wrote:

> There are other methods using this or similar information.  One is an
> algorithm by Fukunaga.  Another has a name like k-d-tree (k points in
> d dimensions).

The k-d-tree can be found in

J.L. Bentley, Multidimensional binary search trees in database
applications, IEEE Trans. on Software Engineering 5(4):333-340, 1979

and in
J.L. Bentley, Multidimensional binary search trees used for associative
searching, Communications of the ACM 18(9):505-517, 1975

(quite old, but maybe the references can be useful)

regards,
Dirk

------------------------------------------------------------------------
Dirk Schwammkrug
Research Institute for Applied Knowledge Processing (FAW), Ulm, Germany
faw-addr: Helmholtzstr. 16, 89081 Ulm
email:    schwammk_at_faw.uni-ulm.de
phone:    ++49 731 501 8990
fax:      ++49 731 501 999
www:      http://www.faw.uni-ulm.de

==============================================================================

From: Greg Heath
Newsgroups: comp.ai.neural-nets,sci.math.num-analysis
Subject: Re: Multi-D Clustering: quick find closest vectors
Date: Wed, 18 Nov 1998 00:38:32 -0500

On Sun, 15 Nov 1998, Greg Heath wrote:

> On 15 Nov 1998, Greg Newby wrote:
>
> > Question: is there a way to identify the closest points to a
> > given point, withOUT needing to measure the distance to all
> > other points?

--SNIP

> > Pre-computing the [ (N * N-1) / 2 ] or so inter-point distances
> > is viable, if it would help, but the problem is to find the
> > distance for a NEW point (one that was previously unknown).
>
> There are other methods using this or similar information.  One is an
> algorithm by Fukunaga.

K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed.,
Academic Press, 1990, pp. 360-362 (branch and bound procedure).  See
also the last three references on page 366.

> Another has a name like k-d-tree (k points in
> d dimensions).  Both can probably be found, among other acceleration
> algorithms, in Belur Dasarathy's nearest-neighbor classification book.

B. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification
Techniques, IEEE Computer Society Press, 1991, Ch. 7.

Greg

Hope this helps.

Gregory E. Heath    heath@ll.mit.edu      The views expressed here are
M.I.T. Lincoln Lab  (781) 981-2815        not necessarily shared by
Lexington, MA       (781) 981-0908 (FAX)  M.I.T./LL or its sponsors
02420-9185, USA
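To make the partial-distance abort Greg Heath describes above concrete,
here is a minimal pure-Python sketch of the "rank1000" variant.  The
function and variable names are placeholders of my own; a max-heap of
the k best squared distances seen so far supplies the abort threshold.

    import heapq

    def k_nearest_partial(points, query, k):
        # points: sequence of coordinate sequences; query: one such sequence.
        best = []                        # max-heap via negated squared distances
        for idx, p in enumerate(points):
            bound = -best[0][0] if len(best) == k else float("inf")
            d2 = 0.0
            for xj, pj in zip(query, p):
                d2 += (xj - pj) ** 2
                if d2 > bound:           # partial sum already exceeds the
                    break                # k-th best distance, so give up early
            else:
                item = (-d2, idx)
                if len(best) < k:
                    heapq.heappush(best, item)
                else:
                    heapq.heappushpop(best, item)   # evict the current worst
        return sorted((-neg_d2, idx) for neg_d2, idx in best)

Each candidate's accumulation stops as soon as its partial sum can no
longer beat the current k-th smallest distance, which is the early exit
described in the post; with k=1 the heap reduces to the single running
minimum.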