
Lakebase Integration – PostgreSQL-Compatible OLTP Database for Databricks Apps

This document describes how the AI Slide Generator integrates with Databricks Lakebase for persistent data storage when deployed to Databricks Apps.


Stack / Entry Points

| Component | Path | Purpose |
|---|---|---|
| Lakebase Module | src/core/lakebase.py | Instance management, credentials, schema setup |
| Database Module | src/core/database.py | SQLAlchemy engine, session management, auto-detection |
| Deployment Library | packages/databricks-tellr/databricks_tellr/deploy.py | Orchestrates instance + app + schema creation |
| Local Deploy Script | scripts/deploy_local.py | Local development deployment using locally-built wheels |
| App Entrypoint | packages/databricks-tellr-app/databricks_tellr_app/run.py | init_db() (via database.py), main() |
| App Config Template | packages/databricks-tellr/databricks_tellr/_templates/app.yaml.template | Generated app.yaml with env var substitution |
| Environment Config | config/deployment.yaml | Per-environment Lakebase settings |

Key Dependencies:

  • databricks-sdk – Instance management and credential generation
  • psycopg2-binary – PostgreSQL connections
  • sqlalchemy – ORM and connection pooling

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│ Databricks Apps Runtime │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ FastAPI App │───▶│ database.py │ │
│ │ (src/api/*) │ │ (SQLAlchemy) │ │
│ └─────────────────┘ └────────┬────────┘ │
│ │ │
│ Auto-injected: │ Builds connection URL │
│ PGHOST, PGUSER │ using OAuth token │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Lakebase OLTP │ │
│ │ Instance │ │
│ │ (PostgreSQL) │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ Deployment Time │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ deploy.py │───▶│ lakebase.py │ │
│ │ (CLI) │ │ (SDK calls) │ │
│ └─────────────────┘ └────────┬────────┘ │
│ │ │
│ 1. Create instance │ │
│ 2. Create app │ │
│ 3. Setup schema │ │
│ 4. Grant permissions │ │
│ 5. Initialize tables ▼ │
└─────────────────────────────────────────────────────────────────┘

Key Concepts

Lakebase Hierarchy

Lakebase Instance (e.g., "ai-slide-generator-dev-db")
└── Database: "databricks_postgres" (default, always exists)
    └── Schema: "app_data" (created by deployment)
        └── Tables: user_sessions, config_profiles, etc.

  • Instance: The OLTP server (capacity units: CU_1, CU_2, CU_4, CU_8)
  • Database: Always databricks_postgres (Lakebase default)
  • Schema: Application namespace (default: app_data)

Authentication Flow

When a database resource is attached to a Databricks App:

  1. Databricks auto-injects PGHOST and PGUSER environment variables (provisioned), or the deployment sets LAKEBASE_TYPE, LAKEBASE_PG_HOST, and LAKEBASE_ENDPOINT_NAME (autoscaling)
  2. A Postgres role is created for the app's service principal
  3. The app generates short-lived OAuth tokens via _generate_lakebase_token()

Token generation routes to the correct SDK API based on LAKEBASE_TYPE:

# Token generation (database.py – _generate_lakebase_token)
lakebase_type = _get_lakebase_type()  # reads LAKEBASE_TYPE env var

if lakebase_type == "autoscaling":
    # Autoscaling Lakebase – uses ws.postgres API
    endpoint_name = os.getenv("LAKEBASE_ENDPOINT_NAME")
    cred = ws.postgres.generate_database_credential(endpoint=endpoint_name)
else:
    # Provisioned Lakebase – uses ws.database API
    instance_name = os.getenv("LAKEBASE_INSTANCE")
    cred = ws.database.generate_database_credential(
        request_id=str(uuid.uuid4()),
        instance_names=[instance_name],
    )

password = cred.token  # Use as PostgreSQL password

Automatic Token Refresh

OAuth tokens expire after 1 hour. To maintain continuous database connectivity, the application implements automatic token refresh:

  1. Background Task: An asyncio task refreshes tokens every 50 minutes (10-minute buffer before expiry)
  2. Event-Based Injection: SQLAlchemy's do_connect event injects fresh tokens for each new connection
  3. Lifecycle Integration: Token refresh starts/stops with FastAPI's lifespan
┌─────────────────────────────────────────────────────────────────┐
│ Token Refresh Flow │
│ │
│ FastAPI Startup │
│ │ │
│ ▼ │
│ start_token_refresh() │
│ │ │
│ ├──► Generate initial OAuth token │
│ │ │
│ └──► Start background task ─────────────────────┐ │
│ │ │
│ ┌──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Every 50 minutes: │ │
│ │ 1. Generate fresh token │ │
│ │ 2. Update global _postgres_token │ │
│ └─────────────────────────────────────────┘ │
│ │
│ Connection Request │
│ │ │
│ ▼ │
│ do_connect event ──► Inject current token ──► Connect │
│ │
└─────────────────────────────────────────────────────────────────┘

Key Functions (in src/core/database.py):

  • start_token_refresh() - Start background refresh task (called in lifespan startup)
  • stop_token_refresh() - Stop background task (called in lifespan shutdown)
  • is_lakebase_environment() - Check if running in Lakebase environment (checks LAKEBASE_TYPE first, falls back to PGHOST and PGUSER)
  • _refresh_token_background() - Async task that refreshes every 50 minutes
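
The refresh task and the do_connect hook cooperate through a shared token value. A minimal sketch of that pattern, independent of the actual database.py internals (TokenHolder and inject_token are illustrative names, not functions from the codebase):

```python
import asyncio

class TokenHolder:
    """Holds the current OAuth token; a background task keeps it fresh."""

    def __init__(self, generate):
        self._generate = generate  # callable returning a fresh token string
        self.token = generate()    # initial token at startup

    async def refresh_loop(self, interval_s: float = 50 * 60):
        # Refresh every 50 minutes, leaving a 10-minute buffer
        # before the 1-hour token expiry.
        while True:
            await asyncio.sleep(interval_s)
            self.token = self._generate()

def inject_token(holder: TokenHolder, cparams: dict) -> None:
    # Body of a SQLAlchemy "do_connect" listener:
    #   @event.listens_for(engine, "do_connect")
    #   def provide_token(dialect, conn_rec, cargs, cparams):
    #       inject_token(holder, cparams)
    # Each new connection picks up whatever token is current,
    # so expired tokens are replaced transparently.
    cparams["password"] = holder.token
```

Because the listener runs per connection attempt, pooled connections opened after a refresh automatically use the new token; no engine recreation is needed.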

Reference: Databricks Apps Cookbook - Lakebase Connection

Provisioned vs. Autoscaling Lakebase

The codebase supports two Lakebase deployment modes controlled by the LAKEBASE_TYPE environment variable:

| Aspect | Provisioned | Autoscaling |
|---|---|---|
| LAKEBASE_TYPE | (empty or provisioned) | autoscaling |
| Host source | PGHOST (auto-injected) | LAKEBASE_PG_HOST (set in app.yaml) |
| User source | PGUSER (auto-injected) | PGUSER or resolved via SDK |
| Token API | ws.database.generate_database_credential(instance_names=[...]) | ws.postgres.generate_database_credential(endpoint=...) |
| Required env vars | PGHOST, PGUSER, LAKEBASE_INSTANCE | LAKEBASE_TYPE, LAKEBASE_PG_HOST, LAKEBASE_ENDPOINT_NAME |
| SDK import | databricks.sdk.service.database | databricks.sdk.service.postgres (optional; guarded by try/except) |

All lakebase.py helper functions — generate_lakebase_credential(), get_lakebase_connection_info(), get_lakebase_connection_url(), setup_lakebase_schema(), and initialize_lakebase_tables() — accept endpoint_name, host, and lakebase_type parameters so callers can route to the autoscaling path.


Component Responsibilities

| Function | Responsibility | SDK/API Used |
|---|---|---|
| get_or_create_lakebase_instance() | Create/get OLTP instance | ws.database.create_database_instance_and_wait() |
| generate_lakebase_credential() | Generate OAuth token | ws.database (provisioned) or ws.postgres (autoscaling) |
| get_lakebase_connection_info() | Build connection params | ws.database.get_database_instance() or pre-resolved host |
| get_lakebase_connection_url() | Build SQLAlchemy URL | Combines the above functions |
| setup_lakebase_schema() | Create schema + grant perms | Direct SQL via psycopg2 |
| initialize_lakebase_tables() | Create SQLAlchemy tables | Base.metadata.create_all() |
| _get_database_url() (database.py) | Auto-detect environment | Checks LAKEBASE_TYPE, PGHOST env vars |
| _generate_lakebase_token() (database.py) | Generate fresh OAuth token | Routes between ws.postgres and ws.database |
| start_token_refresh() (database.py) | Start background token refresh | asyncio task |
| stop_token_refresh() (database.py) | Stop background token refresh | asyncio task cancellation |
| is_lakebase_environment() (database.py) | Check for Lakebase env | LAKEBASE_TYPE first, then PGHOST + PGUSER |

Deployment Flow

When deploying with scripts/deploy_local.sh create --env development --profile <profile>:

1. Load config from deployment.yaml
   └── LakebaseConfig(database_name, schema, capacity)

2. Create Lakebase instance (if not exists)
   └── get_or_create_lakebase_instance()
       └── Returns: instance name, DNS hostname, state

3. Build and stage artifacts
   └── Inject LAKEBASE_INSTANCE into app.yaml

4. Upload to Databricks workspace

5. Create app with database resource
   └── AppResourceDatabase(
           instance_name="ai-slide-generator-dev-db",
           database_name="databricks_postgres",
           permission=CAN_CONNECT_AND_CREATE,
       )

6. Deploy source code
   └── apps.deploy_and_wait()

7. Setup schema and permissions
   └── setup_lakebase_schema(instance, schema, client_id)
       └── SQL: CREATE SCHEMA, GRANT permissions

8. Initialize tables
   └── initialize_lakebase_tables()
       └── SQLAlchemy: Base.metadata.create_all()

Runtime Connection Detection

database.py auto-detects the environment. The URL is built without a password; the password is injected dynamically via SQLAlchemy's do_connect event so that expired tokens are replaced transparently (see Automatic Token Refresh above).

def _get_database_url() -> str:
    # Priority 1: Explicit DATABASE_URL
    if os.getenv("DATABASE_URL"):
        return os.getenv("DATABASE_URL")

    # Priority 2a: Autoscaling Lakebase (LAKEBASE_TYPE=autoscaling)
    if _get_lakebase_type() == "autoscaling":
        pg_host = os.getenv("LAKEBASE_PG_HOST")  # required
        pg_user = os.getenv("PGUSER") or ws.current_user.me().user_name
        return f"postgresql://{pg_user}@{pg_host}:5432/databricks_postgres?sslmode=require"

    # Priority 2b: Provisioned Lakebase (PGHOST auto-set by Databricks)
    if os.getenv("PGHOST") and os.getenv("PGUSER"):
        pg_host = os.getenv("PGHOST")
        pg_user = os.getenv("PGUSER")
        return f"postgresql://{pg_user}@{pg_host}:5432/databricks_postgres?sslmode=require"

    # Priority 3: Local development
    return "postgresql://localhost/ai_slide_generator"

Note: both Lakebase branches also append &options=-csearch_path%3D{schema} when LAKEBASE_SCHEMA is set.
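
The percent-encoding in that options suffix can be reproduced with the standard library. A small sketch (append_search_path is an illustrative helper, not the actual database.py code):

```python
from urllib.parse import quote

def append_search_path(url: str, schema: str) -> str:
    # libpq's "options" parameter passes command-line-style settings to the
    # server; -csearch_path=<schema> sets the schema search path on connect.
    # The '=' inside the value must be percent-encoded ('%3D') so it is not
    # parsed as part of the URL query string.
    return url + "&options=" + quote(f"-csearch_path={schema}")
```

For example, appending schema app_data to a connection URL yields a suffix of &options=-csearch_path%3Dapp_data.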


Configuration

deployment.yaml

environments:
  development:
    lakebase:
      database_name: "ai-slide-generator-dev-db"  # Instance name
      schema: "app_data"                          # PostgreSQL schema
      capacity: "CU_1"                            # CU_1, CU_2, CU_4, CU_8

Environment Variables (Runtime)

| Variable | Source | Description |
|---|---|---|
| LAKEBASE_TYPE | app.yaml / deploy.py | autoscaling or provisioned (empty = auto-detect via PGHOST) |
| PGHOST | Auto-injected (provisioned) | Lakebase DNS hostname |
| PGUSER | Auto-injected (provisioned) | Service principal as Postgres role |
| LAKEBASE_INSTANCE | app.yaml / deploy.py | Instance name for credential generation (provisioned) |
| LAKEBASE_PG_HOST | app.yaml / deploy.py | PostgreSQL host (autoscaling only) |
| LAKEBASE_ENDPOINT_NAME | app.yaml / deploy.py | Full endpoint resource name (autoscaling only) |
| LAKEBASE_SCHEMA | app.yaml | Schema name (default: app_data) |

Permissions Granted

The deployment grants these permissions to the app's service principal:

-- Schema access
GRANT USAGE ON SCHEMA "app_data" TO "<client_id>";
GRANT CREATE ON SCHEMA "app_data" TO "<client_id>";

-- Table access (existing)
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA "app_data" TO "<client_id>";

-- Sequence access (for auto-increment)
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA "app_data" TO "<client_id>";

-- Default privileges (future objects)
ALTER DEFAULT PRIVILEGES IN SCHEMA "app_data"
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO "<client_id>";
ALTER DEFAULT PRIVILEGES IN SCHEMA "app_data"
GRANT USAGE, SELECT ON SEQUENCES TO "<client_id>";
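
These statements are executed over a psycopg2 connection during setup_lakebase_schema(). A hedged sketch of building them with properly quoted identifiers (grant_statements and quote_ident are illustrative names, not the codebase's actual helpers):

```python
def quote_ident(name: str) -> str:
    # Double-quote a PostgreSQL identifier, escaping embedded quotes.
    return '"' + name.replace('"', '""') + '"'

def grant_statements(schema: str, client_id: str) -> list:
    s, role = quote_ident(schema), quote_ident(client_id)
    return [
        # Schema access
        f"GRANT USAGE ON SCHEMA {s} TO {role}",
        f"GRANT CREATE ON SCHEMA {s} TO {role}",
        # Table and sequence access (existing objects)
        f"GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA {s} TO {role}",
        f"GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA {s} TO {role}",
        # Default privileges (future objects)
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {s} "
        f"GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO {role}",
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {s} "
        f"GRANT USAGE, SELECT ON SEQUENCES TO {role}",
    ]
```

Quoting matters here because service-principal client IDs contain hyphens, which are invalid in unquoted PostgreSQL identifiers.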

Error Handling

| Error | Cause | Resolution |
|---|---|---|
| connection to localhost refused | PGHOST not set | Ensure database resource attached to app |
| Failed to decode token for role | Missing LAKEBASE_INSTANCE | Check app.yaml has instance injected |
| permission denied for sequence | Missing sequence grants | Redeploy to apply sequence permissions |
| schema does not exist | Schema not created | Run deployment with --create |
| Background token refresh failed | Token generation error | Check service principal credentials and network |
| password authentication failed | Expired or invalid token | Verify token refresh task is running (check logs) |

Local Development

For local development without Lakebase:

# Use local PostgreSQL
export DATABASE_URL="postgresql://localhost/ai_slide_generator"

# Or start local PostgreSQL
brew services start postgresql
createdb ai_slide_generator
python scripts/init_database.py
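
When switching between local and Lakebase setups, it helps to log which database the resolved URL actually points at. A small sketch (describe_db_url is an illustrative helper, not part of the codebase):

```python
from urllib.parse import urlsplit

def describe_db_url(url: str) -> str:
    # Summarize a PostgreSQL URL as host/database for a startup log line,
    # without printing credentials.
    parts = urlsplit(url)
    host = parts.hostname or "localhost"
    database = parts.path.lstrip("/") or "(default)"
    return f"{host}/{database}"

# e.g. at startup:
# print("Using database:", describe_db_url(os.environ["DATABASE_URL"]))
```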

Extension Guidance

  • Adding new tables: Define in src/database/models/; they'll be created on the next deployment via init_db() → Base.metadata.create_all()
  • Adding columns to existing tables: create_all() does not alter existing tables. Add an idempotent ALTER TABLE ... ADD COLUMN IF NOT EXISTS statement to _run_migrations() in src/core/database.py. This is called by init_db() on every app startup (via run.py).
  • Changing schema name: Update deployment.yaml and redeploy (creates new schema)
  • Scaling capacity: Update capacity in deployment.yaml (CU_1 → CU_8)
  • Multiple environments: Each env gets its own instance (dev, staging, prod)
  • App startup command: Generated from app.yaml.template (not the local app.yaml). Update the template in packages/databricks-tellr/databricks_tellr/_templates/ for deployment changes.
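
The migration pattern in the second bullet can be sketched as follows (the table and column names are hypothetical; the real statement list lives in _run_migrations() in src/core/database.py):

```python
# Statements must be idempotent: they run on every app startup via init_db().
MIGRATIONS = [
    "ALTER TABLE user_sessions ADD COLUMN IF NOT EXISTS notes TEXT",
]

def run_migrations(conn) -> None:
    # conn: any DB-API connection (e.g. from engine.raw_connection())
    cur = conn.cursor()
    for stmt in MIGRATIONS:
        cur.execute(stmt)
    conn.commit()
```

ADD COLUMN IF NOT EXISTS makes each statement safe to re-run, which matters because the migration list executes on every deployment and restart.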
