Quickstart - LanceDB

As described on the landing page, LanceDB provides one data layer for curation, feature engineering, search and retrieval, and model training. Whether you are preparing training data, building a RAG or agentic retrieval system, reviewing examples, or adding model-generated features, you’ll work with the same underlying table and search primitives. Let’s get started in just a few steps!

1. Install LanceDB

Install the LanceDB package for your client SDK. For Python:

pip install lancedb

Python pre-release builds

To pick up the latest features and bug fixes before the next stable release, install a pre-release from LanceDB’s Fury index.

pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ lancedb

Pre-release builds receive the same level of testing as stable releases, but their availability is not guaranteed for more than 6 months after release. For real-world workloads, we recommend using the latest stable release whenever possible.

2. Connect to a LanceDB database

LanceDB accepts several URI patterns when connecting to a database:

A local filesystem path (when using it as an embedded library)
A db://... URI (when using LanceDB Enterprise)
An object storage URI: s3://..., gs://..., or az://... (when connecting directly from the client SDK)

Connect via local directory path

The simplest way to begin is to use LanceDB as an embedded library. Import LanceDB in your client SDK of choice and point to a local directory path.

Connect via object storage URIs

You can also connect directly to object storage from the client SDK by passing an s3://, gs://, or az:// URI. For credentials, endpoints, and provider-specific options, see Configuring storage.

Connect to LanceDB Enterprise

If you’re using LanceDB Enterprise, you can connect to the remote database using the db:// URI along with the API key, region, and cluster endpoint you received from the LanceDB team. Pass the cluster endpoint via host_override so the client routes requests to your deployment.

host_override is the full URL of your cluster endpoint, including the scheme (https://) and a port if your deployment listens on a non-default one (e.g. https://your-enterprise-endpoint.com:443). If you don’t have the endpoint, contact the LanceDB team.
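As a rough sketch, the connection arguments described above can be assembled and sanity-checked in Python before they are passed to lancedb.connect. The helper name enterprise_connect_kwargs, the database name, the region value, and the LANCEDB_API_KEY environment variable are all hypothetical placeholders; substitute the values you received from the LanceDB team.

```python
import os
from urllib.parse import urlparse


def enterprise_connect_kwargs(database, api_key, region, host_override):
    """Build keyword arguments for lancedb.connect() (hypothetical helper)."""
    parsed = urlparse(host_override)
    # host_override must be a full URL: scheme, host, and optional port.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            "host_override must include the scheme, e.g. https://host:443"
        )
    return {
        "uri": f"db://{database}",
        "api_key": api_key,
        "region": region,
        "host_override": host_override,
    }


kwargs = enterprise_connect_kwargs(
    database="your-database",  # placeholder database name
    api_key=os.environ.get("LANCEDB_API_KEY", "sk_placeholder"),  # placeholder
    region="us-east-1",  # placeholder region
    host_override="https://your-enterprise-endpoint.com:443",
)
print(kwargs["uri"], kwargs["host_override"])

# With real credentials, the connection itself would be:
#   import lancedb
#   db = lancedb.connect(**kwargs)
```

Validating the endpoint up front catches the most common mistake, a host_override without its https:// scheme, before any network request is made.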

To learn more about RemoteTable semantics and how Enterprise differs operationally from embedded LanceDB, see the Enterprise overview.

3. Create a new table

Let’s create a small table of characters from the kingdom of Camelot. Each row stores source text, metadata, structured fields, and a vector embedding in the same LanceDB table.

The embeddings we use in this example are synthetic and for demonstration purposes only. In a real AI data workflow, you would generate them from text, images, audio, or video using an embedding model of choice.

For example, the record for Merlin looks like this:

{
  "id": "2",
  "name": "Merlin",
  "role": "Wizard",
  "description": "Advisor and prophet with deep magical knowledge.",
  "stats": { "strength": 2, "magic": 5, "leadership": 4, "wisdom": 5 },
  "vector": [0.2, 0.9, 0.4, 0.9]
}

The full raw records are available in the quickstart code (see the Code section). You can now create a LanceDB table from those records. Creating the table from the records sets up the appropriate schema and ingests the data in one step.

4. Semantic search

Search is useful in all kinds of AI data pipelines. It requires random access and is a ubiquitous access pattern across workloads, whether you’re building a RAG or recommendation system, serving agent memory, or curating a training dataset.

In this step, we run a vector similarity search for samples similar to a “wise magical advisor” (after transforming the natural language query into an embedding) and project only the columns needed by the next step.

In Python, query results can be converted to a Polars DataFrame. Depending on your language, you can collect query results as a list/array of objects or as DataFrames for use downstream in your application.

Pandas users in Python can get results as a Pandas DataFrame

Use the to_pandas() method to convert query results into a Pandas DataFrame.

5. Curation

Searching for relevant results can be more useful when combined with metadata filters. In this tiny example, we filter to examples with high magic stats. When working with large datasets, it’s common to use the same pattern to filter on quality labels, train/eval splits, numeric fields, categorical values, timestamp windows, or generated tags and labels.

6. Add a derived feature

Feature engineering is the process of cleaning up your data and creating new signals that help your model learn and make better predictions, or help your agent retrieve more useful information. In this example, we add a power_score column derived from the structured stats fields. Lance supports data evolution, so you can add new columns without rewriting the entire table.

Next, you can query a compact view of the new feature:

name	role	power_score
King Arthur	King	3.5
Merlin	Wizard	4.0
Sir Lancelot	Knight	3.0

The same workflow is used for data preparation tasks when adding derived features, cached model signals, review scores, or dataset quality indicators.

7. Store multimodal data

Multimodal data is a first-class citizen in LanceDB. Binary data (image, audio, video, etc.) is stored as blobs or inline Arrow binary types in a LanceDB column, and it benefits from the same table operations and data versioning semantics as other data types. All the data is governed in the same table, so you can search, filter, and retrieve multimodal records together with structured fields, metadata, and embeddings.

In this example, the lancedb/magical_kingdom dataset stores character images, descriptions, structured stats, image embeddings, and text embeddings together. Say we downloaded the image for Sir Lancelot from that dataset locally. You can read the image bytes in your client SDK and store them in a LanceDB column. The image bytes can then be used for downstream tasks like retrieval, evaluation, or training.

Sir Lancelot from the lancedb/magical_kingdom dataset

The snippets for each client SDK load the local image file and store the bytes in an image column. For more examples, see the multimodal data section.

Code

See the full code for these examples (including helper functions) in the repo's quickstart file for your client language.

What’s next?

You’ve learned how to install LanceDB, connect, create one table for AI data, retrieve related examples, curate with metadata, add a derived feature, and represent multimodal records. These same primitives apply across the AI data lifecycle, from data preparation and feature engineering to retrieval, evaluation, and training. Continue to the table and search guides to build on this example with schema options, appends, updates, versioning, indexing, full-text search, hybrid search, and reranking.

Basic table operations

Build on this quickstart with table creation, updates, and schema tips.

Build a RAG App

Learn how to build Retrieval-Augmented Generation (RAG) applications using LanceDB.

Indexing

Create vector, full-text, and scalar indexes to speed up queries on larger datasets.

Data loading and shuffles

Use LanceDB for projected, shuffled, random-access reads in training workflows.