This article is part of the series Advanced Image Processing for Defense and Security Applications.

Open Access Research Article

Investigating the Bag-of-Words Method for 3D Shape Retrieval

Xiaolan Li1* and Afzal Godil2

Author Affiliations

1 College of Computer Science & Information Engineering, Zhejiang Gongshang University, Hangzhou, Zhejiang 310018, China

2 Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA

For all author emails, please log on.

EURASIP Journal on Advances in Signal Processing 2010, 2010:108130  doi:10.1155/2010/108130

The electronic version of this article is the complete one and can be found online at:

Received:1 December 2009
Revisions received:27 February 2010
Accepted:3 March 2010
Published:15 April 2010

© 2010 The Author(s)

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper investigates the capabilities of the Bag-of-Words (BWs) method in the 3D shape retrieval field. The contributions of this paper are (1) the 3D shape retrieval task is categorized from different points of view: specific versus generic, partial-to-global retrieval (PGR) versus global-to-global retrieval (GGR), and articulated versus nonarticulated (2) the spatial information, represented as concentric spheres, is integrated into the framework to improve the discriminative ability (3) the analysis of the experimental results on Purdue Engineering Benchmark (PEB) reveals that some properties of the BW approach make it perform better on the PGR task than the GGR task (4) the BW approach is evaluated on nonarticulated database PEB and articulated database McGill Shape Benchmark (MSB) and compared to other methods.

1. Introduction

With recent advances in scanning and modeling technologies, large number of 3D models are created and stored in databases. For these databases be used effectively require methods for indexing, retrieval, and clustering. Therefore, retrieval and classification of 3D objects are becoming an increasingly important task in modern applications such as computer vision, computer aided design/computer aided manufacturing, multimedia, molecular biology, biometric, security, and robotics.

Because of its simplicity, flexibility, and effectiveness, the Bag-of-Words (BWs) method, which originated from the document retrieval field, has recently attracted large amount of interests in the computer vision fields. It has been applied in the applications such as image/video classification [1], 3D shape analysis, and retrieval [25]. We will explore its performance especially for the 3D shape retrieval task in this paper.

A typical 3D shape retrieval task can be defined as: giving a query 3D shape, to obtain a list of 3D shapes ordered by the similarity between the query object and the one on the list. Several methods are proposed to solve the problem, such as Light Field descriptors [6], spherical harmonics descriptor [7], D2 shape distribution [8], Reeb Graph-based descriptors [9], Local Feature-based methods [4, 5]. The performance of the methods varies mainly according to the specific tasks. In fact, from different points of view, the 3D shape retrieval task can be further refined as follows.

() Differentiated from the object category extent, the task can be discussed in "specific" and "generic" domain, which depends on the purpose and interest of the specialists. The representative benchmarks of the latter one include Princeton Shape Benchmark [10], NIST 3D Benchmark [11], while CAD [12], Protein [13], and Biometrics [14] analysis are several important "specific" domains, which have their own properties. For example, CAD models have more complicated structure with holes and other local features. Only using global information, these subtle details could be neglected and lead to less ideal retrieval results.

() Based on the completeness of the query shape, the task can be divided into two subtasks as "Partial-to-Global Retrieval (PGR)" and "Global-to-Global Retrieval (GGR)". For the former one, every query shape is regarded as an incomplete object, which is used to obtain similar complete objects from the database. This happens in many cases. For example, when using the 3D range scanners to capture 3D data in real time, because of the limitation of the view angle, the occlusion in the scene, and the real time requirement, only parts of the object can be captured during scanning. Then this incomplete point clouds may be used as the query shape to retrieve the corresponding complete model from an existing database. Solving this problem will also benefit several other applications, such as data registration [15] and model fixing [16]. Most of the global-based shape retrieval methods [6, 8], which require the complete geometry of a 3D object, cannot be applied directly to PGR. To our knowledge there are only a few literature [3, 17] contributions that solve the PGR problem.

() Based on the deformability of the shape, there exist "Articulated Shape Retrieval (ASR)" and "Nonarticulated Shape Retrieval (NASR)". Lots of the natural and man-made objects are deformable. For instance, in CAESAR [14], each person is scanned in three different postures: standing, sitting with arms open, and sitting with arms down. When performing shape retrieval using a sitting model of person A as the query model, the preferred result is to obtain the other two different gestured model of person A than to retrieve the sitting models of other persons. According to the results in [4], Light Field method [6], which performs greatly when dealing with NASR problem, produces poor results for ASR task.

To some extent, in [3, 4, 17], the above three different tasks are discussed within the BWs framework, but there still lacks thorough investigation. Several open problems remain unsolved, such as how to integrate spatial information into the BWs framework to improve the performance. In this paper, we investigate deeply into these three different cases within the framework of the BWs method with spin images [18] as the low-level features, and provide profound experimental results to support the discussion.

The organization of the paper is as follows. Several related works are summarized in Section 2. The performance measure is discussed in Section 3. In Section 4, we first introduce the ordinary procedure for BWs framework in 3D domain. Then take the CAD database [12] as an example, a Concentric Bag-of-Words (CBWs) approach is proposed to enhance the discriminative ability of the original BWs method. Several interesting phenomena are studied for PGR problem in Section 5. As for ASR task, McGill articulated shape benchmark [19] is adopted to test the effectiveness of our approach in Section 6. Finally, we conclude the paper in Section 7.

2. Related Work

Many efforts have been taken to perform 3D shape retrieval recently. Among them, the BWs method, which represents a 3D shape as an orderless collection of local features, has demonstrated impressive level of performance.

In [2, 3], BWs method is explored to accomplish PGR task, in which a visual feature dictionary is constituted by clustering spin images [18]. Then, Kullback-Leibler divergence is proposed as a similarity measurement in [3], while a probabilistic framework is introduced in [2].

For the ASR task, Ohbuchi et al. [4] apply the SIFT algorithm to depth buffer images of the model captured from uniformly sampled locations on a view sphere to collect visual words. After vector quantization, Kullbak-Leibler divergence measures the similarities of the models. It also demonstrates that a) given enough samples, the BWs method can reach a comparable retrieval result as a vision based method like Light Field [6], when dealing with NASR task; b) the BWs method performs better than Light Field [6] when dealing with ASR task. In this paper, spin images are used as local features, which can be directly extracted as many as you want in 3D domain. On the other hand, according to [1], dense features, such as spin images, perform better than sparse features, such as SIFT.

Although the BWs method has many advantages, it suffers from its lack of spatial information. Some methods focus on integrating the spatial layout information into the BWs method. Lazebnik et al. [20] proposes a spatially enriched Bags-of-Words approach. It works by partitioning the image into increasingly fine subregions and computing histograms of local features found inside each subregion. Implicitly geometric correspondences of the subregions are built in the pyramid matching scheme [21]. In [22], the object is an ensemble of canonical parts linked together by an explicit homographic relationship. Through an optimization procedure, the model, corresponding to the lowest residual error, gives the class label to the query object along with the localization and pose estimation. Yuan and Wu [23] describes a context aware clustering method, which captures the contextual information between data. For the BWs method, it means the visual dictionary is constructed based on both the primitive visual features and spatial contexts. Li et al. [5] propose to treat the model in two different domains, named the feature domain and the spatial domain. The visual word dictionary is built in the feature domain as in the ordinary BWs method. On the other side, the whole model is partitioned into several pieces in the spatial domain. Thereafter, each piece of the model is represented as a word histogram. The whole model is recorded as several word histograms along with a geometry matrix which stores the relative distances between every pairs of the pieces. The weighted sums of dissimilarity measurements from these two domains are used to measure the differences between models.

3. Performance Measure

The performance measure used in this study is the precision-recall curve. Precision-recall curve is the most common metric to evaluate 3D shape retrieval system. Precision is the ratio of retrieved objects that are relevant to all retrieved objects in the ranked list. Recall is the ratio of relevant objects retrieved in the ranked list to all relevant objects.

Let be the set of all relevant objects, and be the set of all retrieved objects, then


Basically, recall evaluates how well a retrieval algorithm finds what we want and precision evaluates how well it weeds out what we do not want. There is a tradeoff between recall and precision, one can increase recall by retrieving more, while can decrease precision.

4. Bag-of-Words and Concentric Bag-of-Words Methods

We first describe the original formulation of BWs representation [1, 3], and then introduce the whole procedure of Concentric Bags-of-Words (CBW) method. Their difference is demonstrated in Figure 1. Its effectiveness is shown by the experiment performed on "specific" shape database PEB [12] to reveal its effectiveness.

thumbnailFigure 1. Comparing BWs and CBW representations. (a) Representing shapes with one global Bag-of-Words model. The left and the right shapes are both composed with 5 different words: a, b, c, d, e. Both feature vectors are , which count the occurrences of each word. That means using BWs representation, the left and the right shapes are regarded as the same. (b) Representing shapes with Concentric Bags-of-Words model. Even there are the same two shapes as shown in (a), because of the concentric sphere partitioning, the left and the right shapes are different. Along the arrow's direction, counting from the outer sphere to the inner one, their feature vectors are [2 3 1 3 3; 1 1 5 1 1; 0 1 1 0 1] and [0 3 3 2 3; 3 1 2 2 2; 0 1 2 0 0], respectively.

4.1. Bag-of-Words Descriptor

Let us use the ordinary 3D shape retrieval as an example to give an explanation of the BWs framework. Denote N be the total number of labels ("visual words") in the learned visual dictionary. The 3D shape can be represented as a vector with length N, in which the elements count the occurrences of the corresponding label. The procedure can be completed in three steps.

(1)Local feature descriptors, such as spin image [18], are applied to the 3D model to acquire low-level features.

(2)Visual  words, denoted as the discrete set , are formed by clustering the features into N clusters, so that each local feature is assigned to a discrete label.

(3)The shape of the 3D model is summarized with a global histogram ("Bag-of-Words"), denoted as a vector , by counting the occurrences of each visual word.

4.2. Concentric Bag-of-Words Method

Rather than using only a global histogram, this paper advocates using more than one histogram along with its related spatial information to reveal the 3D shape in more detail. Specifically, the model is partitioned with several concentric spheres, and all the parts between two neighboring spheres are recorded with original BWs descriptor, which leads to the name Concentric Bag-of-Words. A schematic description of the approach is given in Figure 2.

thumbnailFigure 2. A schematic description of CBW method.

The first block in Figure 2 represents low-level feature extraction. Although several local features, such as depth buffer image, can be adopted to extract low-level features from 3D models, spin image is the one adopted here. Compared to using depth buffer image, adopting spin image has at least two advantages. First, it is quicker to compute. Second, it can capture the details from the concave area and the self-hidden area. As shown in Figure 3, it characterizes the local properties around its basis point p within the support range r. It is a two-dimensional histogram accumulating the number of points located at the coordinate (, ), where and are the lengths of the two orthogonal edges of the triangle formed by the oriented-basis point p, whose orientation is defined by the normal n, and support point q. The final size of the spin images is defined by the width w and the height h of the spin plane. We uniformly sample N b oriented-basis points and N s support points on the surface of the model, which satisfies insensitivity to the tessellation and resolution of the mesh.

thumbnailFigure 3. The demonstration of spin image.

After extracting a set of spin images for each model, we construct a shape dictionary as shown in the second block, whose size is predetermined as N, by clustering all spin images acquired from the whole training dataset with -means method.

Instead of representing one model with a histogram of the words from the dictionary, it is partitioned into M regions by grouping the oriented-basis points with Mconcentric spheres as demonstrated in the third block. Thereafter, the model is recorded as a set of histograms.

Because all the models are scaled into unified scale and the partitioning is also unified, the correspondence between the regions of two models is obvious. It can be constructed from outer sphere to inner sphere, as shown in Figure 1(b), or reverse. Thus, the CBW feature vector is recorded as


When performing 3D shape retrieval, the CBW representation of the query shape is constructed on line, and compared with those in the database. An ordered retrieval list is obtained according to the dissimilarity metric, which is


where , are two objects A and B, respectively; measures the dissimilarity between two feature vectors, which can be KL divergence [3], cosine distance, and distance. Thus, for every query object, the objects in the database are all assigned a metric value based on (3), which results a sorted retrieval list.

4.3. Experimental Results

According to the discussions in [3, 17], several parameters related to the CBW approach are defined as follows.

(1)The support range r: , where is the radius of the model.

(2)The width w and the height of the spin plane: .

(3)The number of oriented-basis points for one model : .

(4)The number of oriented-basis points for one model : .

(5)The size of the dictionary : .

(6)The number of the concentric spheres : .

The CBW approach can be applied both in "specific" and "generic" domains. Here we demonstrate it on Purdue Engineering Benchmark [12], which contains 866 3D CAD models and is classified into 42 classes such as, "Discs", "T-shaped parts", and "Bracket-like parts". In Figure 4 we compare the Precision-Recall curves obtained with CBW and BWs using as dissimilarity measurement to those methods defined in [12], such as Light Field Descriptor, 2.5D Spherical harmonics, 2D Shape Histogram, 3D Spherical Harmonics, Solid Angle Histogram, 3D Shape Distribution, Surface Area and Volume. Here, for CBW, . Obviously, the concentric sphere partition improves the PR rate, and makes the local Feature-based method comparable to the global Feature-based method, such as 2.5D Spherical Harmonics listed as the second best method in [12].

thumbnailFigure 4. Precision-recall (PR) plots of CBW, BWs, and other methods listed in [12].

5. Partial-to-Global Retrieval

In CAD domain, the PGR task is specifically important. Suppose that the query partial model is a screw, the target complete model we want to obtain is the same size screw with screw cap on. Here, we design an experiment on PEB to simulate the case described above: () Represent the models with BWs method and save the descriptors for the following usage as block 1 and 2 in Figure 2. () When performing the Partial-to-Global Retrieval, the sampled oriented-basis points are grouped into regions according to their geometric positions first. Then one of the groups is chosen as the partial query shape. The BWs representation is constructed on line and used to compare with the saved BWs descriptors of the complete models. The requirements for dissimilarity measure for the partial-to-global retrieval task are quite different from the global-to-global retrieval problem. As described in [3], the dissimilarity between the query data and the target model is not equal to that between the target model and the query data. It means that the dissimilarity metric should be asymmetry. An ordinary symmetric distance measurement, such as , , is not a suitable choice. KL divergence is chosen here to satisfy the asymmetric property. When using one sixth of the model to be the query shape, the two PR curves in Figure 5 demonstrate the improvement introduced by KL comparing to .

thumbnailFigure 5. Precision-recall (PR) plots obtained with KL and dissimilarity measurement when doing PGR.

Figure 6 provides two examples comparing the retrieval results of Global-to-Global Retrieval and Partial-to-Global Retrieval, in which one sixth of a gear is used as the query shape. It shows that the PGR is better than the GGR, since PGR lists more gears on the top of the list than GGR does. Why does PGR perform better? Recalling the definition of the feature vector will provide some clues to the answer. The feature vector describes the frequency of the visual words appearing in the shape. When using the entire gear model to be the query data, the plane-kind of visual word overwhelm the other features. However, using partial of the object to be the query data, the gear teeth shape dominates the whole shape. So more gears are picked out and listed on the top of the list.

thumbnailFigure 6. The example to show the difference between GGR and PGR. (a) First example to show the difference between Global-to-Global retrieval (GGR) and Partial-to-Global retrieval (PGR). The left group shows the GGR result using a complete model (the top-left image) as the query. The right group shows the PGR result using 1/6 part of the complete model (the small image shown in the bottom-left corner of the first image) as the query. The top 20 models are listed orderly according to the dissimilarity measurement. (b) The second example to show the difference between GGR and PGR. The layout of the images is the same as that of (a).

6. Articulated Shape Retrieval

The Articulated Shape Retrieval requires that the shape descriptor should be deformation invariant, which is not satisfied by several previous methods [68]. They perform well when dealing with rigid objects, but manifest poor performance when dealing with deformable ones [4]. BWs method can still be used effectively for ASR task. The descriptors for the models are constructed by following the procedure shown as block 1 and 2 in Figure 2.

We applied the BWs method for ASR task on McGill Shape Benchmark (MSB) [19]. The configuration of the parameters is almost the same as those listed in Section 4.3, except that (a) the width and the height h of the spin plane: , (b) the number of oriented-basis points for one model : . Since all of the models in MSB are regarded as complete, distance is chosen to measure the dissimilarity.

In Figure 7, the BW-based retrieval result is compared with several methods described in [25]. BWs method is comparable to the best method EVD. However, except BWs, all the other methods are based on geodesic distance computation, which is computational expensive. On the contrary, our method is constrained on local area and can be applied for on-line retrieval.

thumbnailFigure 7. Precision-recall (PR) plots for various descriptors when applied to the McGill database of articulated shapes [19]. Except BWs, all of the other results can be found in [25].

Figure 8 shows three visual results of articulated shape retrieval. Only the top 18 results are listed here, in which the green bold framed shape is the query shape, and the orange bold framed shapes are the false recall. Figure 8(a) shows the results of retrieving a spider shape from the database. Among these 18 retrieval shapes, only two shapes do not belong to the spider class but the ant class. Figures 8(b) and 8(c) are the results using a spectacles and a human shape as query model, respectively. Even though there are quite large amount of bending in the shapes, the performance is quite good.

thumbnailFigure 8. Some retrieval results from the McGill database [19]: green bold frame defines the query shape, and orange bold frame defines the false pick up. Three groups show three different shapes, in which from top to bottom they are spider, spectacles, and human.

7. Conclusion and Discussion

In this paper, we explore the BWs framework to solve several different tasks in 3D shape retrieval field, which are classified as specific versus generic, partial-to-global versus global-to-global retrieval, and articulated versus Nonarticulated. For each type, the effectiveness of BWs method is discussed in detail. First, CBW method is introduced to improve the discrimination ability of original BWs representation. Second, BWs is applied on PEB to perform partial-to-global retrieval task. And several results revealed, for some shape (gear-like shape), that PGR performs better than GGR. Finally, we compared the results of BWs to several other methods on McGill articulated shape database. Our results are comparable to the best results in [25]. More experiments need to be done to verify the influence of the parameters listed in Section 4.3.


The authors would like to thank the SIMA program and the IDUS program for supporting this work. This work has also been partially supported by NSF Grants of China (60873218) and NSF Grants of Zhejiang (Z1080232).


  1. L Fei-Fei, P Perona, A bayesian hierarchical model for learning natural scene categories. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005 2, 524–531

  2. Y Shan, HS Sawhney, B Matei, R Kumar, Shapeme histrogram projection and matching for partial object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), 568–577 (2006). PubMed Abstract | Publisher Full Text OpenURL

  3. Y Liu, H Zha, H Qin, Shape topics: a compact representation and new algorithms for 3D partial shape retrieval. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), 2006 2, 2025–2032

  4. R Ohbuchi, K Osada, T Furuya, T Banno, Salient local visual features for shape-based 3D model retrieval. Proceedings of the IEEE International Conference on Shape Modeling and Applications (SMI '08), 2008, Stony Brook, NY, USA, 93–102

  5. X Li, A Godil, A Wagan, Spatially enhanced bags of words for 3D shape retrieval. Proceedings of the 4th International Symposium on Advances in Visual Computing (ISVC '08), 2008, Las Vegas, Nev, USA, Lecture Notes in Computer Science 5358, 349–358

  6. D-Y Chen, M Ouhyoung, X-P Tian, Y-T Shen, On visual similarity based 3D model retrieval. Computer Graphics Forum 22(3), 223–232 (2003). Publisher Full Text OpenURL

  7. M Kazhdan, T Funkhouser, S Rusinkiewicz, Rotation invariant spherical harmonic representation of 3D shape descriptors. Proceedings of the ACM International Conference Symposium on Geometry Processing, June 2003, Aachen, Germany 43, 156–164

  8. R Osada, T Funkhouser, B Chazelle, D Dobkin, Shape distributions. ACM Transactions on Graphics 21(4), 807–832 (2002). Publisher Full Text OpenURL

  9. S Biasotti, Reeb graph representation of surfaces with boundary. Proceedings of the Shape Modeling International (SMI '04), 2004, 371–374

  10. P Shilane, P Min, M Kazhdan, T Funkhouser, The princeton shape benchmark. Proceedings of the International Conference on Shape Modeling and Applications (SMI'04), June 2004, Genova, Italy, 167–178

  11. R Fang, A Godil, X Li, A Wagan, A new shape benchmark for 3D object retrieval. Proceedings of the 4th International Symposium on Visual Computing, 2008, Las Vegas, Nev, USA, Lecture Notes in Computer Science 5358, 381–392

  12. S Jayanti, Y Kalyanaraman, N Iyer, K Ramani, Developing an engineering shape benchmark for CAD models. Computer Aided Design 38(9), 939–953 Shape Similarity Detection and Search for CAD/CAE Applications Publisher Full Text OpenURL

  13. HM Berman, J Westbrook, Z Feng, et al. The protein data bank. Nucleic Acids Research 28(1), 235–242 (2000). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. CAESAR Anthopometric Database, (http://store), . webcite

  15. NJ Mitra, L Guibas, J Giesen, M Pauly, Probabilistic fingerprints for shapes. Proceedings of the 4th Eurographics Symposium on Geometry Processing, 2006, Sardinia, Italy 256, 121–130

  16. T Funkhouser, M Kazhdan, P Shilane, et al. Modeling by example. Proceedings of the 31st International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '04), 2004, 652–663

  17. X Li, A Godil, A Wagan, 3D part identification based on local shape descriptors. Proceedings of the Performance Metrics for Intelligent Systems Workshop (PerMIS '08), August 2008, Gaithersburg, Md, USA

  18. AE Johnson, M Hebert, Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(5), 433–449 (1999). Publisher Full Text OpenURL

  19. J Winn, A Criminisi, T Minka, Object categorization by learned universal visual dictionary. Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), 2005 2, 1800–1807

  20. S Lazebnik, C Schmid, J Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. Proceedings of the Computer Vision and Pattern Recognition (CVPR '06), 2006 2, 2169–2178

  21. K Grauman, T Darrell, The pyramid match kernel: discriminative classification with sets of image features. Proceedings of the IEEE International Conference on Computer Vision (ICCV '05), October 2005 2, 1458–1465

  22. S Savarese, L Fei-Fei, 3D generic object categorization, localization and pose estimation. Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), October 2007, Rio de Janeiro, Brazil, 1–8

  23. J Yuan, Y Wu, Context aware clustering. Proceedings of the Computer Vision and Pattern Recognition (CVPR '08), June 2008, Anchorage, Alaska, USA, 1–8

  24. R Gal, D Cohen-Or, Salient geometric features for partial shape matching and similarity. ACM Transactions on Graphics 25(1), 130–150 (2006). Publisher Full Text OpenURL

  25. V Jain, H Zhang, A spectral approach to shape-based retrieval of articulated 3D models. Computer Aided Design 39(5), 398–407 (2007). Publisher Full Text OpenURL