From: orourke@grendel.csc.smith.edu (Joseph O'Rourke)
Newsgroups: comp.graphics.algorithms
Subject: Re: Fast 2D neighbor point algorithm wanted
Date: 19 Feb 1996 19:15:25 GMT

In article <4fvkq5$j4g@news.csie.nctu.edu.tw>, Shih-Hsu Chang wrote:
>Is there any fast (less than O(n^2)) algorithm for finding the N closest
>points out of a set of M 2D points?

Here is one recent paper on the topic:

@article{dds-saeid-92
, author =   "M. T. Dickerson and R. L. Drysdale and J. R. Sack"
, title =    "Simple algorithms for enumerating interpoint distances and
              finding $k$ nearest neighbors"
, journal =  "Internat. J. Comput. Geom. Appl."
, volume =   2
, number =   3
, year =     1992
, pages =    "221--239"
, keywords = "enumeration, selection, Delaunay triangulation, nearest neighbors"
}

==============================================================================

From: g.sande@worldnet.att.net (Gordon Sande)
Newsgroups: sci.math.num-analysis
Subject: Re: Multi-D Clustering: quick find closest vectors
Date: 15 Nov 1998 22:17:15 GMT

In article <72n8lf$cjq$1@apakabar.cc.columbia.edu>,
shenkin@still2.chem.columbia.edu (Peter Shenkin) wrote:

>Subject: Re: Multi-D Clustering: quick find closest vectors
>From: Peter Shenkin
>Organization: MacroModel Development Group, Chemistry, Columbia U., NY, NY
>Date: 15 Nov 1998 19:05:19 GMT
>Newsgroups: sci.math.num-analysis
>
>In article <72ljtf$cr0$1@fddinewz.oit.unc.edu>,
>Greg Newby wrote:
>>I have a fairly large ( > 500,000 ) number of points, or vectors,
>>in a multidimensional space ( > 1000 dimensions).  It's
>>a real Euclidean space.
>>
>>Question: is there a way to identify the closest points to a
>>given point, withOUT needing to measure the distance to all
>>other points?
>>
>>The specific distance measure (for closest) hopefully
>>won't matter much.  I typically use an Euclidean distance,
>>but could use the dot product, cosine, or other measures.
>
>I'm partly thinking out loud here.  You need to say more
>specifically what you mean by "identifying the closest".
>For instance, do you need to find exactly the N closest,
>where you specify N in advance, or do you have an a-priori
>notion of "how close is close enough", and just need
>to find enough points within this neighborhood, or
>what?  Can you tolerate error?  For instance, is it
>OK if N=1000, but 100 of the ones you found are actually
>farther than another 100 that you missed?  Is it OK if
>you ask for 1000, but you only get 900, but these 900
>are guaranteed to be the 900 closest?
>
>Second, you state that "any" distance measure should do,
>but there are some tricks that work for some measures
>but not for others.  Here's a trick that would work for
>Euclidean distances or Manhattan distances, but certainly
>not for all distance measures.
>
>Sort the x, y and z coords of the M=500000 points into 3 lists.
>Suppose you want to find the N points closest to the point
>with coords x0,y0,z0.  Find x0, y0, z0 on the three lists.
>By exploring each list in both directions, find the N
>points from each list with the smallest absolute delta-x,
>delta-y, delta-z values from x0, y0, z0, respectively.  Sorting
>is O(M log M) -- where M is 500000 in your case --, so you're
>ahead of an N^2 computation so far.
>
>Suppose this gives you n points; we know n<=3N.  Then
>compute the n(n-1)/2 pairwise distances among these n
>points and pick the N smallest.  We know that your N
>points must be among the n selected in the preprocessing
>step, because of the definition of Euclidean and Manhattan
>distance.
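A rough sketch of that filter-then-refine idea, for anyone who wants to
try it directly.  Python/NumPy is an arbitrary choice here, and the names
(points, sorted_idx, query, N) are placeholders of mine, not anything from
the original post.

    import numpy as np

    def build_sorted_lists(points):
        # One index array per coordinate, sorted by that coordinate.
        # Built once for the whole data set, O(d * M log M).
        return [np.argsort(points[:, j]) for j in range(points.shape[1])]

    def nearest_by_filter(points, sorted_idx, query, N):
        M, d = points.shape
        candidates = set()
        for j in range(d):
            order = sorted_idx[j]              # point indices sorted by coord j
            col = points[order, j]
            hi = int(np.searchsorted(col, query[j]))   # binary search for x0
            lo = hi - 1
            # Walk outward from the query coordinate, collecting the N
            # points with the smallest |delta| on this axis.
            for _ in range(min(N, M)):
                if lo < 0:
                    pick, hi = hi, hi + 1
                elif hi >= M:
                    pick, lo = lo, lo - 1
                elif query[j] - col[lo] <= col[hi] - query[j]:
                    pick, lo = lo, lo - 1
                else:
                    pick, hi = hi, hi + 1
                candidates.add(int(order[pick]))
        # Refine: exact squared distances for the (at most d*N) candidates only.
        cand = np.fromiter(candidates, dtype=int)
        d2 = np.sum((points[cand] - query) ** 2, axis=1)
        return cand[np.argsort(d2)[:N]]

With three coordinate lists, as in the example above, this leaves at most
3N candidate points to score per query.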
>
>If N=1000, this requires computing about (3000^2)/2,
>=~ 5*10^6 distances for each query.
>
>As a heuristic optimization, you might start by looking not
>at the nearest N points on each list, but, say, the nearest
>0.5 N points, giving n<1.5N; if this gives less than N points,
>go out further.  This minimizes the number of distances you'll
>have to compute.
>
>Of course, if you use Euclidean distances, it suffices
>to compute d^2, which is faster.
>
>Depending on the nature of the database and how common
>this query form is, you could imagine storing the sorted
>list and updating it by insertion each time a new entry is added.
>Then up-to-date sorted lists would always be available for
>every new query.
>
>Aside from the above, which won't work for every distance
>measure, the following optimizations are of the nature storage vs.
>time, and should work for any measure.
>
>For each query, you could cache the N closest points, together with
>the date the list was stored.  If the same query comes in later,
>you can update it quickly by just examining database entries that
>arrived after the caching of this particular neighborhood.
>
>If you cached some finite number of these query results, it
>might save you considerable time if in fact some queries
>are considerably more common than others.  For example, if
>it is a literature database, as you imply, it is likely
>that certain "key" articles will be used as the basis for
>queries more often than others.  Naturally, you purge
>the cache on a "least-recently-used" basis.
>
>Finally, you could update cached neighbor lists in background.
>If there are M entries in the database, and the whole matrix is
>stored, the M+1st entry requires only the computation of M
>distances to fill out the new matrix.  If you do this only for
>the C cached queries, the new entry requires only C distances
>to be computed.
>
>   -P.

K-D trees solve this problem.  Setup is N log N and queries are log N.
They work in any metric with minor change in efficiency.  They were
first described in the mid-1970s in CACM and TOMS.  There are other
variants on range searching, some of which use more storage.

There have been various levels of interest in this problem over time.
The recent Computing Reviews has a review of an article in IEEE PAMI
with some extra references.

Gordon Sande

>
>--
>*** "Freedom's just another word for nothing left to lose." (B. Yeltsin)***
>*Peter Shenkin; Chemistry, Columbia U.;  shenkin@columbia.edu  (212)854-5143*
>*MacroModel WWW page: http://www.columbia.edu/cu/chemistry/mmod/mmod.html *

==============================================================================

Newsgroups: sci.math.num-analysis
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Multi-D Clustering: quick find closest vectors
Date: Sun, 15 Nov 1998 22:11:22 GMT

In article <72ljtf$cr0$1@fddinewz.oit.unc.edu>,
Greg Newby writes:
|> I have a fairly large ( > 500,000 ) number of points, or vectors,
|> in a multidimensional space ( > 1000 dimensions).  It's
|> a real Euclidean space.
|>
|> Question: is there a way to identify the closest points to a
|> given point, withOUT needing to measure the distance to all
|> other points?

The classic method is the k-d tree:

J.L. Bentley.  Multidimensional Binary Search Trees Used for Associative
Searching.  Communications of the ACM, 18(9), September 1975.

J. H. Friedman, J. L. Bentley, and R. A. Finkel.  An algorithm for finding
best matches in logarithmic expected time.  ACM Transactions on
Mathematical Software, 3(3):209--226, Sept. 1977.
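For a quick experiment with the same idea, an off-the-shelf implementation
will do; here is a minimal sketch using SciPy's cKDTree (SciPy is my own
choice, not something cited in this thread, and the array sizes are just
illustrative):

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(500_000, 3)     # toy stand-in for the real data set
    query = np.random.rand(3)

    tree = cKDTree(points)                   # one-time build, roughly O(M log M)
    dist, idx = tree.query(query, k=1000)    # the 1000 nearest points to query
    # For Manhattan distance pass p=1 to query(); p=2 (Euclidean) is the default.

Note that the logarithmic query time quoted earlier is an expected-case
figure for low-dimensional data; in spaces with hundreds or thousands of
dimensions, as in the original question, k-d tree searches tend to degrade
toward examining most of the points.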
Some other references:

Dasarathy, B.V., ed. (1991), _Nearest Neighbor (NN) Norms: NN Pattern
Classification Techniques_, IEEE Computer Society Press: Los Alamitos, CA.

M. Dickerson, R. L. Drysdale and J.-R. Sack, "Simple algorithms for
enumerating interpoint distances and finding k nearest neighbors",
Internat. J. Computational Geometry & Applications 2:3 (1992) 221--239.

P. B. Callahan and S. R. Kosaraju, "A decomposition of multi-dimensional
point-sets with applications to k-nearest-neighbors and n-body potential
fields", 24th ACM Symp. Theory of Computing, 1992, 546--556.

M. T. Dickerson and D. Eppstein, "Algorithms for proximity problems in
higher dimensions", Comp. Geom. Theory & Applications, to appear.

P. M. Vaidya, "An $O(n \log n)$ algorithm for the all-nearest-neighbors
problem", Discrete and Computational Geometry 4 (1989) 101--115.

You could also try http://wagga.cs.umass.edu/~dhirvone/research.html
after their server gets repaired.

--

Warren S. Sarle       SAS Institute Inc.     The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive       are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA    those of SAS Institute.

==============================================================================

From: Greg Heath
Newsgroups: comp.ai.neural-nets,sci.math.num-analysis
Subject: Re: Multi-D Clustering: quick find closest vectors
Date: Sun, 15 Nov 1998 20:39:54 -0500

On 15 Nov 1998, Greg Newby wrote:

> Question: is there a way to identify the closest points to a
> given point, withOUT needing to measure the distance to all
> other points?

One obvious technique is to abort the accumulation

   ||x - c_i||^2  =  D_{i,n}^2  =  SUM(j=1..n) (x_j - c_{ij})^2

after m terms, whenever the partial sum D_{i,m}^2 already exceeds
min(k=1..i-1) ||x - c_k||^2, the smallest squared distance found so far.

> The goal is to prevent needing to do all 500,000 comparisons
> in order to find the closest points.  Note that I'm typically
> interested in the closest 1000 or so, not just the single closest.

Replace "min" by "rank1000".

> Pre-computing the [ (N * N-1) / 2 ] or so inter-point distances
> is viable, if it would help, but the problem is to find the
> distance for a NEW point (one that was previously unknown).

There are other methods using this or similar information.  One is an
algorithm by Fukunaga.  Another has a name like k-d-tree (k points in
d dimensions).  Both can probably be found, among other acceleration
algorithms, in Belur Dasarathy's nearest-neighbor classification book.
I don't have the citation handy but think it might be an IEEE Press
publication.  If your library supports computerized search you
shouldn't have any problem finding it.

Greg

Hope this helps.

P.S.   Note Area-Code change from 617 to 781.
P.P.S. Note Zip-Code change from 02173 to 02420.

Gregory E. Heath    heath@ll.mit.edu      The views expressed here are
M.I.T. Lincoln Lab  (781) 981-2815        not necessarily shared by
Lexington, MA       (781) 981-0908 (FAX)  M.I.T./LL or its sponsors
02420-9185, USA

==============================================================================

Date: Tue, 17 Nov 1998 09:42:59 +0100
From: Dirk Schwammkrug
Newsgroups: comp.ai.neural-nets,sci.math.num-analysis
Subject: Re: Multi-D Clustering: quick find closest vectors

Greg Heath wrote:

> There are other methods using this or similar information.  One is an
> algorithm by Fukunaga.  Another has a name like k-d-tree (k points in
> d dimensions).

The k-d-tree can be found in

J.L. Bentley, Multidimensional binary search trees in database
applications, IEEE Trans. on Software Engineering 5(4):333-340, 1979

and in
J.L. Bentley, Multidimensional binary search trees used for associative
searching, Communications of the ACM 18(9):505-517, 1975

(quite old, but maybe the references can be useful)

regards,
Dirk

------------------------------------------------------------------------
Dirk Schwammkrug
Research Institute for Applied Knowledge Processing (FAW), Ulm, Germany
faw-addr: Helmholtzstr. 16, 89081 Ulm
email:    schwammk_at_faw.uni-ulm.de
phone:    ++49 731 501 8990
fax:      ++49 731 501 999
www:      http://www.faw.uni-ulm.de

==============================================================================

From: Greg Heath
Newsgroups: comp.ai.neural-nets,sci.math.num-analysis
Subject: Re: Multi-D Clustering: quick find closest vectors
Date: Wed, 18 Nov 1998 00:38:32 -0500

On Sun, 15 Nov 1998, Greg Heath wrote:

> On 15 Nov 1998, Greg Newby wrote:
>
> > Question: is there a way to identify the closest points to a
> > given point, withOUT needing to measure the distance to all
> > other points?

--SNIP

> > Pre-computing the [ (N * N-1) / 2 ] or so inter-point distances
> > is viable, if it would help, but the problem is to find the
> > distance for a NEW point (one that was previously unknown).
>
> There are other methods using this or similar information.  One is an
> algorithm by Fukunaga.

K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed.,
Academic Press, 1990, pp. 360-362 (branch and bound procedure).  See
also the last three references on page 366.

> Another has a name like k-d-tree (k points in
> d dimensions).  Both can probably be found, among other acceleration
> algorithms, in Belur Dasarathy's nearest-neighbor classification book.

B. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification
Techniques, IEEE Computer Society Press, 1991, Ch. 7.

Greg

Hope this helps.

Gregory E. Heath    heath@ll.mit.edu      The views expressed here are
M.I.T. Lincoln Lab  (781) 981-2815        not necessarily shared by
Lexington, MA       (781) 981-0908 (FAX)  M.I.T./LL or its sponsors
02420-9185, USA
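To make the partial-distance abort Greg Heath describes above concrete,
here is a minimal pure-Python sketch of the "rank1000" variant.  The
function and variable names are placeholders of my own; a max-heap of
the k best squared distances seen so far supplies the abort threshold.

    import heapq

    def k_nearest_partial(points, query, k):
        # points: sequence of coordinate sequences; query: one such sequence.
        best = []                        # max-heap via negated squared distances
        for idx, p in enumerate(points):
            bound = -best[0][0] if len(best) == k else float("inf")
            d2 = 0.0
            for xj, pj in zip(query, p):
                d2 += (xj - pj) ** 2
                if d2 > bound:           # partial sum already exceeds the
                    break                # k-th best distance, so give up early
            else:
                item = (-d2, idx)
                if len(best) < k:
                    heapq.heappush(best, item)
                else:
                    heapq.heappushpop(best, item)   # evict the current worst
        return sorted((-neg_d2, idx) for neg_d2, idx in best)

Each candidate's accumulation stops as soon as its partial sum can no
longer beat the current k-th smallest distance, which is the early exit
described in the post; with k=1 the heap reduces to the single running
minimum.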