So far we have described methods for reasoning about a robot's spatial reference frames, but have not yet considered the geometric content of those frames, such as the shape and size of the physical structure of each link. Certainly, it would be wise to consider these geometries when developing motions and choosing postures, since a robot should avoid unintentional collision with the environment, and should make contact when desired. Geometry is also important in robot design, since the shape of a robot dictates how well it can squeeze into tight locations, and how much of the surrounding workspace it can reach without self-colliding.
Mathematical and computational representations of geometry have long been studied, and a rich set of techniques are in active use today. They come from a variety of fields, including computer-aided design (CAD), computational geometry, computer graphics, and scientific visualization — and hence this survey only scratches the surface of the vast variety of geometric representations and calculations.
Geometry is used pervasively throughout robotics, including:
Mechanism design
Physics simulation
Collision avoidance
Proximity detection
Motion planning
Calibration
3D mapping
Object recognition
Visualization and performance evaluation
These applications require various geometric operations to be performed on one or more objects, such as:
Visualization
Collision detection
Distance or closest-point computation
Ray casting
Contact surface computation
Penetration depth computation
Simplification
Shape fitting
Storage and transmission
Due largely to decades of work in high-performance, realistic 3D visualization, many of these operations are extremely fast, thanks to advances in computing hardware (graphics processing units, or GPUs) as well as sophisticated geometric algorithms and data structures. Nevertheless, it is important for robotics practitioners to understand the performance tradeoffs in choosing representations appropriate for the desired operations.
As an example, consider the problem of computing a collision-free inverse kinematics solution. Assuming we are given 1) an analytical IK solver, 2) geometric models of each robot link, and 3) a geometric model of the environment, this problem can be addressed as follows:
Compute all IK solutions.
For each IK solution, run forward kinematics to determine the robot's link transforms.
Determine whether the robot's link geometries, placed at these transforms, would collide with themselves or the environment.
If any configuration passes the collision detection test, it is a collision-free solution.
Otherwise, no solution exists.
The key step is the collision detection in Step 3. If the models are geometrically complex, collision detection may be a computational bottleneck. But if they are too simple to faithfully represent the robot's geometry, then the computed solution may either miss a collision (when the geometric model is optimistically small) or fail to find valid solutions (when the geometric model is conservatively large).
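To make this concrete, here is a minimal sketch of the procedure in Python. The callables ik_solutions, forward_kinematics, and in_collision are hypothetical placeholders for the solver, kinematics, and collision checker; they are not names from any particular library.

#Sketch of collision-free inverse kinematics (hypothetical helper functions)
def collision_free_ik(target, ik_solutions, forward_kinematics, in_collision):
    """Return a collision-free IK solution for target, or None if none exists."""
    for q in ik_solutions(target):           #Step 1: enumerate all IK solutions
        transforms = forward_kinematics(q)   #Step 2: compute the link transforms
        if not in_collision(transforms):     #Step 3: self- and environment-collision check
            return q                         #Step 4: q passes the test
    return None                              #Step 5: no solution exists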
A geometry $G$ is considered to be some region of space $G \subset \mathbb{R}^2$ or $\mathbb{R}^3$ depending on whether we are considering planar or 3D robots. Geometry representations can be categorized by which part of $G$ is actually represented:
Geometric primitives like points, lines, spheres, and triangles represent a shape $G$ in terms of a fixed number of parameters.
Surface representations approximate only the boundary of $G$, denoted $\partial G$.
Volumetric representations approximate the entirety of $G$: interior, boundary, and exterior.
Point-based representations store a sampling of individual points from $\partial G$.
There are a number of tradeoffs involved in choosing representations. In general, surface representations are most widely used for visualization and CAD, while volumetric representations are most useful for representing 3D maps and determining whether points are inside or outside. Point-based representations are widely used to represent raw data from 3D sensors like lidar and depth cameras. Geometric primitives are preferred when speed of geometric operations is prioritized over fidelity.
Primitives are the simplest form of geometric representation, and are compact, mathematically convenient, and lend themselves to fast geometric operations: exact calculations can be performed in constant ($O(1)$) time. (For more information about Big-O notation, consult Appendix C.1.) However, they are the least flexible representation and do not represent most real geometries with high fidelity.
Common primitives include points, line segments, triangles, spheres, cylinders, and boxes. For each primitive type, a primitive's shape is represented by a fixed number of parameters.
A point is specified by its coordinates $(x,y)$ or $(x,y,z)$.
A line segment is specified by the coordinates $\mathbf{a}$, $\mathbf{b}$ of its end points.
A circle (in 2D) or sphere (in 3D) is specified by its center $\mathbf{c}$ and radius $r > 0$.
A triangle is specified by its vertices $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$.
A cylinder is specified by its base $\mathbf{b}$, primary axis $\mathbf{a}$, height $h > 0$, and radius $r > 0$.
A box is represented by a coordinate frame $T$ with origin at its corner and axes aligned to the box's edges, and its dimensions on each axis, width/height $(w,h)$ in 2D and width/height/depth $(w,h,d)$ in 3D.
An axis-aligned box with lower coordinate $\mathbf{l}$ and upper coordinate $\mathbf{u}$ contains all points $\mathbf{x} = (x_1,x_2,x_3)$ such that $x_i \in [l_i,u_i]$ for $i=1,2,3$.
These representations are not unique, and are chosen by convention. For example, the origin of a box may be chosen to be its center point rather than a corner.
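To illustrate the constant-time nature of primitive operations, here is a short sketch of inside / outside tests for a sphere and an axis-aligned box (these helper functions are our own illustrations, not from a particular library):

#Constant-time point containment tests for two primitives
import numpy as np

def point_in_sphere(x, c, r):
    """True if the point x lies inside or on the sphere with center c and radius r."""
    return np.linalg.norm(np.asarray(x) - np.asarray(c)) <= r

def point_in_aabb(x, l, u):
    """True if the point x lies inside the axis-aligned box with corners l <= u."""
    x = np.asarray(x)
    return bool(np.all((np.asarray(l) <= x) & (x <= np.asarray(u))))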
Surface representations (also known as boundary representations, or b-reps) store the boundary of a geometry $\partial G$ without explicitly representing its interior / exterior.
By far, the most common 3D surface representation is a triangle mesh: a collection of triangles $(\mathbf{a}_1,\mathbf{b}_1,\mathbf{c}_1),\ldots,(\mathbf{a}_N,\mathbf{b}_N,\mathbf{c}_N)$ that mesh together to approximate the geometry's surface. Triangle meshes have two major advantages over other representations: 1) with a sufficient number of triangles, they can approximate arbitrary surfaces, and 2) graphics hardware is highly optimized for visualization of large triangle meshes (as of writing, top-end commodity graphics cards can render over a billion triangles per second). The 2D equivalent is a polygonal chain.
It is important to note that in order to represent a solid surface, the edges of neighboring triangles in the mesh should be coincident, so that most vertices will be represented in several triangles. To save on storage space / representational complexity, triangle meshes are often represented as a vertex list $\mathbf{v}_1,\ldots,\mathbf{v}_M$ and a triangle index list $(i_{a1},i_{b1},i_{c1}),\ldots,(i_{aN},i_{bN},i_{cN})$, where the index triple $(i_{ak},i_{bk},i_{ck})$ indicates that the $k$'th triangle is composed of vertices $\mathbf{v}_{i_{ak}},\mathbf{v}_{i_{bk}},\mathbf{v}_{i_{ck}}$. For example, the cube in Fig. 2 represents a solid, but only the vertices and triangles of the surface are represented.
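For instance, a unit cube can be stored as 8 vertices and 12 triangles. Here is one possible encoding in numpy (written out by hand for illustration; the winding of each triangle is chosen so that normals face outward):

#A unit cube as a vertex list and triangle index list
import numpy as np

vertices = np.array([[x,y,z] for z in (0,1) for y in (0,1) for x in (0,1)])  #the 8 corners
triangles = np.array([
    [0,2,1],[1,2,3],   #bottom (z=0)
    [4,5,6],[5,7,6],   #top (z=1)
    [0,1,4],[1,5,4],   #front (y=0)
    [2,6,3],[3,6,7],   #back (y=1)
    [0,4,2],[2,4,6],   #left (x=0)
    [1,3,5],[3,7,5],   #right (x=1)
])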
The visualization below loads a triangle mesh from disk, and visualizes it using WebGL. Try spinning it by dragging on it with the left mouse button, and translating it by dragging with the right mouse button.
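The code cell producing this visualization might look like the following minimal sketch using the Klamp't vis module (the chair mesh path is an assumption, and the vis calls mirror the point cloud example later in this section):

#Load a chair mesh from disk and visualize it
from klampt import *

g = Geometry3D()
g.loadFile("data/objects/chair.off")   #hypothetical mesh file path
vis.init('HTML')                       #render to an embedded WebGL window
vis.add('chair', g)
vis.show()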
Other common surface representations include:
Convex polygons (in 2D) / Convex polytopes (in 3D): represented by a list of connected faces, edges, and vertices. Also has a volumetric representation in terms of halfplanes (halfspaces in 3D) $(\mathbf{a}_i,b_i)$, $i=1,\ldots,N$, such that the geometry is the set $G = \{ \mathbf{x} \quad |\quad \mathbf{a}_i \cdot \mathbf{x} \leq b_i, i=1,\ldots,N \}$ (see the sketch after this list).
Parametric curves (in 2D) / Parametric surfaces (in 3D): In the 2D case, a function $f(u) : D \rightarrow \mathbb{R}^2$ sweeps out a curve as the parameter varies over a domain $D \subseteq \mathbb{R}$. In 3D, a function $f(u,v): D \rightarrow \mathbb{R}^3$ sweeps out the surface as the parameters vary over a domain $D \subseteq \mathbb{R}^2$.
Subdivision surfaces: Defines the surface as a limit of smoothing / refinement operations on a polygonal mesh. Used primarily in computer graphics.
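As an illustration of the halfspace form, the following sketch (our own helper) tests whether a point lies inside a convex polytope given by $A \mathbf{x} \leq \mathbf{b}$:

#Point-in-convex-polytope test using the halfspace representation
import numpy as np

def convex_polytope_contains(A, b, x, tol=1e-9):
    """True if A x <= b holds componentwise, i.e., x lies inside the polytope."""
    return bool(np.all(A @ np.asarray(x) <= np.asarray(b) + tol))

#example: the unit square [0,1]^2 as four halfplanes
A = np.array([[1,0],[-1,0],[0,1],[0,-1]])
b = np.array([1,0,1,0])
assert convex_polytope_contains(A, b, [0.5, 0.5])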
An important caveat of surface representations is that they may fail to properly enclose a volume. For example, a triangle floating in space is a valid triangle mesh, as is a pair of intersecting triangles, but neither represents the surface of an enclosed volume. A triangle mesh that does truly bound a volume is known as watertight. Non-watertight meshes are also known as polygon soup.
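A simple necessary (though not sufficient) condition for a mesh to be watertight is that every undirected edge is shared by exactly two triangles. A sketch of this check over a triangle index list:

#Check that every edge of an indexed triangle mesh is shared by exactly two triangles
from collections import Counter

def edges_closed(triangles):
    """Necessary condition for watertightness: each undirected edge appears exactly twice."""
    counts = Counter()
    for (a, b, c) in triangles:
        for e in ((a, b), (b, c), (c, a)):
            counts[tuple(sorted(e))] += 1
    return all(n == 2 for n in counts.values())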
Volumetric representations have the main advantages of supporting fast inside / outside tests and fast updates of 2D / 3D maps with new information. All of these representations store some function $f$ over a ($d=2$ or 3)-D space that indicates whether a point is inside or outside the geometry.
The function values can be interpreted in several ways, including:
Occupancy map: a binary function $f : \mathbb{R}^d \rightarrow \{0,1\}$ where 0 indicates empty space, and 1 indicates occupied space.
Probabilistic occupancy map: a function $f : \mathbb{R}^d \rightarrow [0,1]$ where 0 indicates certain empty space, 1 indicates certain occupied space, and intermediate values reflect the level of certainty that the space is occupied.
Signed distance field (SDF): an implicit surface where $f$ denotes the Euclidean distance to the surface if outside of the geometry, or the negative penetration depth if inside the geometry (see the sketch after this list).
Truncated signed distance field (TSDF): A common variant in 3D mapping, which accepts a truncation distance $\tau$. The values of the SDF with absolute value $> \tau$ are truncated.
Attenuation / reflectance map: a function $f : \mathbb{R}^d \rightarrow [0,1]$ indicates how much a point in space interferes with light, sound, or other signal.
Height field / elevation map: something of a cross between a surface and a volume representation, a height field stores a 2D function $f(x,y)$ that indicates the height of the surface at the point $(x,y)$. This defines a 3D volume over $(x,y,z)$ such that $z > f(x,y)$ indicates empty space, $z=f(x,y)$ indicates the surface, and $z < f(x,y)$ indicates the interior. Frequently used by mobile robots to represent terrain.
Implicit surface: a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$, defined such that $f(\mathbf{x}) < 0$ denotes the interior of the object, $f(\mathbf{x}) = 0$ is the surface, and $f(\mathbf{x}) > 0$ is the exterior.
Neural Radiance Field (NeRF): a function $f : \mathbb{R}^d \rightarrow \mathbb{R}^k$ that models the visual properties of each point (opacity, color) in the volume using a neural network. The appearance of a NeRF at each pixel from a given viewpoint is calculated by casting a ray through the volume and integrating the opacity and color values encountered along it.
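As a concrete example of the signed distance and implicit surface interpretations, a sphere has a closed-form SDF (this helper is our own illustration):

#Exact signed distance field of a sphere: negative inside, zero on the surface, positive outside
import numpy as np

def sphere_sdf(x, c, r):
    """Signed Euclidean distance from the point x to the sphere with center c and radius r."""
    return np.linalg.norm(np.asarray(x) - np.asarray(c)) - r

print(sphere_sdf([0,0,0], [0,0,1], 2.0))   #prints -1.0: the point is 1 unit inside the sphere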
The computational representation of the function itself can also vary, including but not limited to:
Combination of basis functions: $f(x)$ is a linear combination of basis functions $f(x) = \sum_{i=1}^m c_i b_i(x)$, where $b_i$ are the (fixed) basis functions. Evaluation is $O(m)$ complexity.
Dense pixel / voxel grid: $f$ is stored as a 2D or 3D grid, and evaluating $f(x)$ performs a table lookup. For implicit surfaces, $f(x)$ is typically evaluated using bilinear / trilinear interpolation of the values at the grid vertices (see the sketch after this list). Evaluation is $O(1)$.
Hierarchical grid (quadtree / octree): a hierarchical 2D grid (quadtree) or 3D grid (octree) that stores the function at progressively finer levels, up to a given resolution. Evaluation is $O(\log n)$, with $n$ the number of divisions on each axis.
Sparse grid / hash grid: Stores only important cells or blocks of volumetric data, e.g., the occupied cells of an occupancy map, or non-truncated cells of a TSDF. Evaluation is $O(1)$.
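To illustrate evaluation on a dense grid, here is a sketch of trilinear interpolation of vertex values (our own helper; grid holds samples of $f$ at the vertices of a regular grid over the box $[\mathbf{l},\mathbf{u}]$):

#Evaluate a function stored on a dense 3D grid by trilinear interpolation
import numpy as np

def grid_eval_trilinear(grid, l, u, x):
    """Trilinearly interpolate grid vertex values over the box [l,u] at the point x."""
    l, u = np.asarray(l, float), np.asarray(u, float)
    dims = np.array(grid.shape) - 1                      #number of cells per axis
    t = (np.asarray(x, float) - l) / (u - l) * dims      #continuous grid coordinates
    i0 = np.clip(np.floor(t).astype(int), 0, dims - 1)   #lower corner of the containing cell
    f = t - i0                                           #fractional position within the cell
    value = 0.0
    for corner in range(8):                              #blend the 8 cell corners
        d = np.array([(corner >> axis) & 1 for axis in range(3)])
        weight = np.prod(np.where(d == 1, f, 1.0 - f))
        value += weight * grid[tuple(i0 + d)]
    return value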
Some 2D examples are illustrated in Fig. 3.
Various combinations of value types and computational representations are useful for different tasks.
All grid-based representations are defined over a volume, usually an axis-aligned bounding box with corners $\mathbf{l}$, $\mathbf{u}$. The grid will have size $m \times n \times p$, and converting between world space and grid space requires a shift and a scale. In particular, the cell indices $(i,j,k)$ of a point $(x_1,x_2,x_3)$ are determined by a transform \begin{equation} \begin{split} i &= \lfloor m (x_1 - l_1) / (u_1 - l_1) \rfloor \\ j &= \lfloor n (x_2 - l_2) / (u_2 - l_2) \rfloor \\ k &= \lfloor p (x_3 - l_3) / (u_3 - l_3) \rfloor \end{split} \label{eq:PointToGridIndices} \end{equation} where $\lfloor \cdot \rfloor$ is the floor operation that returns the largest integer less than or equal to its argument. (Here we assume zero-based indices.)
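A direct transcription of this transform, with clamping so that points on the upper boundary map into the last cell, might look like:

#Convert a point to zero-based grid cell indices, following the equation above
import numpy as np

def point_to_grid_indices(x, l, u, dims):
    """Map a point x in the box [l,u] to cell indices (i,j,k) of an (m,n,p) cell grid."""
    x, l, u = np.asarray(x, float), np.asarray(l, float), np.asarray(u, float)
    dims = np.asarray(dims)
    ijk = np.floor(dims * (x - l) / (u - l)).astype(int)
    return np.clip(ijk, 0, dims - 1)   #clamp boundary points into the valid index range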
Point-based representations are widely used in robotics to represent environment geometries because they are the type of data directly produced by depth sensors. These representations store a sampling of points $\mathbf{p}_1,\ldots,\mathbf{p}_N$ from the geometry surface $\partial G$. Individual points may be associated with colors, normals, and other information. Usually, this sampling is highly incomplete, exhibiting irregular density or large gaps caused by occlusion.
There are two major types of point-based representations: point clouds and depth images (aka range images). A point cloud is simply an unstructured list of points with no assumption of regularity. Points are not a priori associated with objects, nor with each other. This type of representation results from moving lidar sensors and from naively merged depth scans. Variants of point clouds include surfels, which add normal information to represent the geometry as many little disks, and Gaussian splats, in which each point is represented by an ellipsoidal distribution around it.
A depth image is a 2D grayscale image indicating the distance from the camera sensor to the surface observed at each pixel. (Some invalid depth value, such as 0 or $\infty$, indicates that no data is available at that pixel.) This is the raw form of data returned from depth cameras (Fig. 4). The advantage of this representation over point clouds is that nearness in the image implies some notion of spatial proximity in 3D space; furthermore, the line segment between the camera's focal point and the observed point must be free of other geometry. Algorithms can exploit this knowledge for faster geometric operations, as we shall see when we return to the topic of 3D mapping. Knowledge of the camera's position and orientation in space is required to correlate the 2D image to 3D points in a global reference frame.
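For reference, recovering camera-frame 3D points from a depth image uses the pinhole camera model. The sketch below assumes intrinsic parameters $f_x, f_y$ (focal lengths) and $c_x, c_y$ (principal point) in pixels; the helper itself is our own, not part of a library:

#Back-project a depth image (in meters) to camera-frame 3D points via the pinhole model
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Return an N x 3 array of points; pixels with invalid (0 or non-finite) depth are dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   #pixel coordinates
    z = depth
    x = (u - cx) * z / fx    #inverse of the projection u = fx*x/z + cx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    valid = np.isfinite(z).ravel() & (z.ravel() > 0)
    return pts[valid]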
#Code for showing an RGB-D sensor's color and depth image (axis units in pixels, depth units in m)
%matplotlib inline
from klampt import *
from klampt.io import numpy_convert
from matplotlib import pyplot as plt
#load a scene captured by an RGB-D camera as a 640 x 480 organized point cloud
g = Geometry3D()
g.loadFile("data/objects/point_cloud_scene.pcd")
width = 640
height = 480
pc = g.getPointCloud()
pts = numpy_convert.to_numpy(pc)   #rows are points; the first three columns are coordinates
assert pts.shape[0] == width*height
#Uncomment this block to show an interactive 3D view of the point cloud
"""
vis.init('HTML')
vis.add('g',g,size=0.03)
vis.setViewport({'up':{'x':0,'y':0,'z':1},'position':{'x':-2,'y':0,'z':0},'target':{'x':3,'y':0,'z':0}})
vis.nativeWindow().kw.width = 500
vis.nativeWindow().kw.height = 500
print("Point cloud (click and drag to view)")
vis.show()
"""
from klampt.model import geometry
import numpy as np
#extract per-point colors and reshape them into a height x width color image
rgbs = geometry.point_cloud_colors(pc,('r','g','b'))
rgbs = np.array(rgbs).reshape((height,width,3))
#in this scene the viewing direction is along +x, so the first coordinate serves as depth
zs = pts[:,0].reshape((height,width))
#show the color image and the depth image side by side
fig,axs = plt.subplots(1,2,figsize=(16,5))
axs[0].imshow(rgbs)
depth = axs[1].imshow(zs)
fig.colorbar(depth)
plt.show()