SpatialLM-Llama-1B

Maintained by manycore-research

  • Model Type: 3D Language Model
  • Base Architecture: Llama-3.2-1B-Instruct
  • License: Llama 3.2 License
  • Author: ManyCore Research
  • Framework: PyTorch

What is SpatialLM-Llama-1B?

SpatialLM-Llama-1B is a 3D large language model designed to bridge the gap between unstructured 3D geometric data and structured scene understanding. Built on the Llama-3.2 architecture, the model processes point cloud data from a range of sources, including monocular video sequences, RGBD images, and LiDAR sensors.

Implementation Details

The model operates on axis-aligned point clouds with the z-axis as the up axis. The point cloud is encoded and fed to the language backbone, which generates structured scene-understanding outputs as text. The implementation requires Python 3.11, PyTorch 2.4.1, and CUDA 12.4, and uses the TorchSparse framework for efficient sparse point cloud processing. A rough sketch of what an inference call might look like follows the list below.

  • Processes point clouds from multiple input sources
  • Generates structured 3D layout predictions
  • Achieves 78.62% mean IoU for wall detection
  • Supports real-time visualization through Rerun framework
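As a rough illustration, the snippet below sketches loading the checkpoint and generating a scene description, assuming the model is published on the Hugging Face Hub as `manycore-research/SpatialLM-Llama-1B` and loads through the standard `transformers` causal-LM interface with `trust_remote_code=True`. The point-cloud conditioning itself is handled by the project's own preprocessing and inference scripts and is only stubbed out in comments here; this is not the repository's actual API.

```python
# Hypothetical inference sketch -- the official repository ships its own
# inference script, and the point-cloud conditioning below is only stubbed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "manycore-research/SpatialLM-Llama-1B"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # the 1B backbone fits on a single consumer GPU
    trust_remote_code=True,       # custom point-cloud handling lives in the repo code
).to("cuda")

# In the real pipeline, the axis-aligned (z-up) point cloud from video, RGBD,
# or LiDAR is encoded with TorchSparse and injected alongside the text prompt;
# that step is omitted here because it is model-specific.
prompt = "Detect walls, doors, windows, and objects with oriented bounding boxes."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# The model emits a structured, text-based description of the scene layout.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For actual usage, follow the preprocessing and inference steps documented in the SpatialLM repository.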

Core Capabilities

  • Architectural element recognition (walls, doors, windows)
  • Object detection and classification with oriented bounding boxes
  • High performance on challenging scenarios (95.24% F1 score for bed detection)
  • Support for both 3D and 2D thin object detection
  • Integration with popular 3D reconstruction tools like MASt3R-SLAM
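To make the structured output concrete, the sketch below shows one way the predicted layout elements and oriented bounding boxes could be represented and streamed to the Rerun viewer mentioned above. The dataclasses, field names, and example values are illustrative assumptions, not SpatialLM's actual output schema; the `rerun` calls (`rr.init`, `rr.log`, `rr.Boxes3D`) are standard SDK entry points.

```python
# Illustrative output representation and Rerun visualization -- the schema and
# values below are assumptions, not SpatialLM's actual output format.
from dataclasses import dataclass

import rerun as rr


@dataclass
class Wall:
    start: tuple[float, float, float]   # wall endpoints on the floor plane (z up)
    end: tuple[float, float, float]
    height: float
    thickness: float


@dataclass
class ObjectBox:
    label: str                           # e.g. "bed", "sofa", "table"
    center: tuple[float, float, float]   # box center in scene coordinates
    size: tuple[float, float, float]     # full extents along the box axes
    yaw: float                           # rotation about the z (up) axis, radians


# Hypothetical predictions, as if parsed from the model's text output.
boxes = [
    ObjectBox("bed", center=(1.2, 2.0, 0.3), size=(2.0, 1.6, 0.6), yaw=0.0),
    ObjectBox("nightstand", center=(2.4, 2.8, 0.25), size=(0.5, 0.4, 0.5), yaw=0.0),
]

rr.init("spatiallm_layout", spawn=True)  # open the Rerun viewer
rr.log(
    "scene/objects",
    rr.Boxes3D(
        centers=[b.center for b in boxes],
        half_sizes=[tuple(s / 2 for s in b.size) for b in boxes],
        labels=[b.label for b in boxes],
        # Yaw orientation can also be passed via Rerun's rotation arguments;
        # it is omitted here to keep the sketch minimal. Walls would be logged
        # similarly, e.g. as thin boxes or line strips along their footprint.
    ),
)
```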

Frequently Asked Questions

Q: What makes this model unique?

SpatialLM-Llama-1B stands out for its ability to process various types of 3D input data without requiring specialized equipment, making it more accessible and versatile than traditional 3D understanding systems. Its multimodal architecture effectively handles both geometric and semantic understanding tasks.

Q: What are the recommended use cases?

The model is ideal for applications in embodied robotics, autonomous navigation, architectural analysis, and complex 3D scene understanding. It's particularly effective for processing indoor environments where accurate object and structural element detection is crucial.
