Configuring Databricks for Panther

Overview

This page describes how to configure Databricks for use as your Panther data storage backend. As you complete the steps below, you will collect and store various configuration values, then provide them to Panther.

This process will:

  • Create a Databricks workspace for Panther (along with associated Databricks infrastructure in AWS).

  • Create an IAM role in AWS to allow Databricks to read from the Panther S3 staging bucket.

  • Create a storage credential in Databricks backed by that IAM role.

  • Create an external location so Databricks can read data from S3 for loading.

  • Create service principals—one for loading (read/write) and one for querying (read-only).

  • Create secrets with KMS keys in AWS to hold OAuth credentials for the service principals.

  • Create a catalog in Databricks for Panther tables, with permissions for the service principals.

  • Create load, optimize, query, and scheduled query warehouses.

How to configure Databricks for Panther

Prerequisites

  • You have a Databricks account.

  • You have completed the instructions on Setting Up a Cloud Connected Panther Instance and can log in to the Panther Console.

  • You are logged into the AWS console in the AWS account you'd like to use for Panther compute. This is needed because Databricks will load a CloudFormation template to create a workspace.

Step 1: Make a copy of the configuration table

Throughout the configuration process, you'll collect values that you'll send to Panther at the end. To organize these values, make a copy of the table below.

Parameter
Value

databricks_load_role_arn

databricks_load_secret_kms_key_arn

databricks_query_secret_kms_key_arn

databricks_load_secret_arn

databricks_query_secret_arn

databricks_load_warehouse_id

databricks_optimize_warehouse_id

databricks_query_warehouse_id

databricks_scheduled_query_warehouse_id

Step 2: Create a Databricks workspace

For additional support while creating a workspace, see the Databricks Create a workspace using the AWS Quickstart (Recommended) documentation.

  1. Log in to the Databricks console.

  2. In the left-hand navigation menu, click Workspaces.

  3. Click Create workspace.

  4. Fill out the Create Workspace modal:

    • Workspace name: enter a memorable name.

    • Region: select the region that matches your AWS deployment of Panther.

    • Storage and compute: select Use your existing cloud account.

    • How would you like to deploy the workspace?: select Automatically with Quickstart.

  5. Click Continue.

    • A new browser tab will open in AWS, on a Quick create stack screen with the CloudFormation template pre-loaded.

  6. Without making any changes, deploy the CloudFormation template by clicking Create stack.

  7. Return to your Databricks browser tab, and wait a few minutes for the new workspace to appear in the Workspaces list. When it appears, click Open to enter the workspace environment.

Step 3: Enable variant shredding in your workspace

For additional support while enabling variant shredding, see the Databricks Enable shredding documentation.

  1. In your newly created Databricks workspace, in the upper-right corner, click your profile icon, then Previews.

  2. To the right of Variant Shredding for Optimized Read Performance on Semi-Structured Data, set the toggle to On.

Step 4: Create a Panther role for the storage credential

For additional support while creating an IAM role, see the Databricks Step 1: Create an IAM role documentation.

  1. In the AWS account where you created the Databricks workspace infrastructure, create an IAM role named panther-databricks-s3-reader-role-<region>, accepting all defaults.

  2. In your Panther Console, retrieve the Processed Data Bucket value:

    1. Click the gear icon (Settings) > General.

    2. Click Data Lake.

    3. Under Databricks Configuration, copy the Processed Data Bucket value.

  3. Update the role's trust relationship:

    1. In the AWS console, in the Roles list, click the newly created role to view its details page.

    2. Click Trust relationships.

    3. Click Edit trust policy.

    4. Replace the JSON in the code editor with the JSON below:

The following trust policy sets "sts:ExternalId": "TBD" as a placeholder; you will replace it with the actual External ID in Step 6. Before saving, replace <your account for this role> with the ID of the AWS account containing this role and <region> with your region. The second Principal ARN is the role's own ARN, which makes the role self-assuming as Databricks requires.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL",
                    "arn:aws:iam::<your account for this role>:role/panther-databricks-s3-reader-role-<region>"
                ]
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "TBD"
                }
            }
        }
    ]
}

    5. Click Update policy.

  4. Update the role's permissions:

    1. On the role's details page, click Permissions.

    2. Click Add permissions > Create inline policy.

    3. In the Policy editor section, click JSON.

    4. Replace the JSON in the code editor with the JSON below:

In the policy below, replace <Processed Data Bucket from Panther settings> with the Processed Data Bucket value you retrieved above.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::<Processed Data Bucket from Panther settings>"
        },
        {
            "Action": "s3:GetObject",
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::<Processed Data Bucket from Panther settings>/*"
        }
    ]
}

    5. Click Next.

    6. Under Policy details, enter a Policy name.

    7. Click Create policy.

  5. On the role's details page, copy the ARN, and add it as the databricks_load_role_arn value in your configuration table.

    • Leave the browser window with the role details page open, as you will return to it in Step 5.
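
If you prefer to script the AWS side of this step, the following boto3 sketch creates the same IAM role, trust policy, and inline read policy. It is a minimal sketch, not Panther's official tooling: the region value, the policy name panther-processed-data-read, and the bucket placeholder are illustrative and should be adjusted to your environment.

# Optional scripted alternative to Step 4, using boto3 (assumes AWS credentials
# for the Databricks infrastructure account are configured locally).
import json
import boto3

REGION = "us-west-2"  # replace with your region
PROCESSED_DATA_BUCKET = "<Processed Data Bucket from Panther settings>"
ROLE_NAME = f"panther-databricks-s3-reader-role-{REGION}"

iam = boto3.client("iam")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Trust policy with the "TBD" External ID placeholder (updated in Step 6).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": [
            "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL",
            f"arn:aws:iam::{account_id}:role/{ROLE_NAME}",
        ]},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "TBD"}},
    }],
}

role = iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Inline policy granting read access to the Panther processed data bucket.
s3_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Effect": "Allow",
            "Resource": f"arn:aws:s3:::{PROCESSED_DATA_BUCKET}",
        },
        {
            "Action": "s3:GetObject",
            "Effect": "Allow",
            "Resource": f"arn:aws:s3:::{PROCESSED_DATA_BUCKET}/*",
        },
    ],
}
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="panther-processed-data-read",  # illustrative policy name
    PolicyDocument=json.dumps(s3_read_policy),
)

# databricks_load_role_arn for your configuration table:
print(role["Role"]["Arn"])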

Step 5: Create a storage credential

For additional support while creating a storage credential, see the Databricks Step 2: Give Databricks the IAM role details documentation.

Create a Databricks storage credential to represent the AWS IAM role you just created:

  1. In the Databricks workspace you created above, click Catalog, then External Data.

  2. Click Credentials.

  3. Click Create credential.

  4. Fill in the Create a new credential form:

    1. Credential Type: select AWS IAM Role.

    2. Credential name: enter panther-storage-credential.

    3. IAM role (ARN): enter the ARN of the IAM role you created above (which is databricks_load_role_arn in the configuration table).

  5. Click Create.

    • On the Credential created page, copy the External ID value, and store it in a secure location, as you will need it in the next step.
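
If you manage Unity Catalog objects from code, a minimal sketch of the same step using the databricks-sdk Python package is shown below. It assumes workspace-admin authentication via the SDK's standard environment variables and that your SDK version exposes the storage_credentials API and the AwsIamRoleRequest class; verify these names against your installed version.

# Optional scripted alternative to Step 5, using the databricks-sdk Python package.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import AwsIamRoleRequest

# Authenticates using DATABRICKS_HOST / DATABRICKS_TOKEN (or a configured profile).
w = WorkspaceClient()

cred = w.storage_credentials.create(
    name="panther-storage-credential",
    aws_iam_role=AwsIamRoleRequest(role_arn="<databricks_load_role_arn from your table>"),
)

# The External ID needed in Step 6 is returned on the credential's IAM role details.
print(cred.aws_iam_role.external_id)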

Step 6: Update the IAM role trust relationship policy

For additional support while updating the IAM role, see the Databricks Step 3: Update the IAM role trust relationship policy documentation.

  1. Return to the AWS console, to the details page for the panther-databricks-s3-reader-role-<region> IAM role you created above.

  2. Click Trust relationships.

  3. Click Edit trust policy.

  4. In the "sts:ExternalId": "TBD" line, replace TBD with the External ID value you copied in Databricks above.

  5. Click Update policy.
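
The trust-policy update can also be scripted. The boto3 sketch below reads the role's current trust policy, swaps the TBD placeholder for the real External ID, and writes the policy back; the role name and External ID placeholders are the values you collected in Steps 4 and 5.

# Optional scripted alternative to Step 6, using boto3.
import json
import boto3

ROLE_NAME = "panther-databricks-s3-reader-role-<region>"  # role created in Step 4
EXTERNAL_ID = "<External ID copied from the storage credential in Step 5>"

iam = boto3.client("iam")

# boto3 returns the trust policy already decoded into a dict.
trust = iam.get_role(RoleName=ROLE_NAME)["Role"]["AssumeRolePolicyDocument"]

# Replace the TBD placeholder with the real External ID.
for stmt in trust["Statement"]:
    condition = stmt.get("Condition", {}).get("StringEquals", {})
    if condition.get("sts:ExternalId") == "TBD":
        condition["sts:ExternalId"] = EXTERNAL_ID

iam.update_assume_role_policy(RoleName=ROLE_NAME, PolicyDocument=json.dumps(trust))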

Step 7: Create an external storage location

For additional support while creating an external location, see the Databricks Create an external location for an AWS S3 bucket documentation.

  1. In the Databricks workspace you created above, click Catalog, then External Data.

  2. Click Create external location.

  3. Click Manual, then Next.

  4. Fill in the Create a new external location manually form:

    • External location name: enter panther-processed-data.

    • Storage type: select S3.

    • URL: enter the Processed Data Bucket value you retrieved from the Settings page in the Panther Console in Step 4.

    • Storage credential: select panther-storage-credential.

  5. Click Create.

  6. You will be routed to a page with a Permission Denied warning box—click Force create.
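
A minimal databricks-sdk sketch of the same step follows, assuming the bucket value is supplied as an s3:// URL. If validation fails because the role is read-only (the Console's Permission Denied warning), check your SDK version for a flag that skips validation, which is what Force create does in the UI.

# Optional scripted alternative to Step 7, using the databricks-sdk Python package.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

loc = w.external_locations.create(
    name="panther-processed-data",
    url="s3://<Processed Data Bucket from Panther settings>",
    credential_name="panther-storage-credential",
)
print(loc.name, loc.url)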

Step 8: Create a load service principal in Databricks

  1. Access your Databricks workspace settings:

    1. In the upper-right corner, click your initial.

    2. Click Settings.

  2. In the Settings navigation bar, under Workspace admin, click Identity and access.

  3. To the right of Service principals, click Manage.

  4. Click Add service principal.

  5. In the Add service principal modal, click Add new.

  6. In the Service principal name field, enter panther-load.

  7. Click Add.

  8. In the table, click panther-load to view its details page.

  9. Click Secrets.

  10. Click Generate secret.

  11. Under Lifetime (days), enter 730 (the maximum).

  12. Click Generate.

  13. Copy the Secret and Client ID values and store them in a secure location, as you'll need them in a later step (as an alternative to copying these values, you can leave this browser tab open).
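
Creating the service principal itself can also be scripted with the databricks-sdk Python package, as in the sketch below. Generating the OAuth secret is an account-level operation, so complete that part in the UI as described above (or in the Databricks account console).

# Optional scripted alternative to creating the service principal in Step 8.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Creates the workspace-level service principal; the secret is generated separately.
sp = w.service_principals.create(display_name="panther-load")
print(sp.application_id)  # this is the Client ID referenced in Step 10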

Step 9: Create a load secret KMS key in AWS

  1. In your AWS console, ensure you are in the correct region. Navigate to Key Management Service.

  2. In the left-hand navigation menu, click Customer managed keys.

  3. Click Create Key.

  4. Under Key type, select Symmetric. Under Key usage, select Encrypt and decrypt.

  5. Click Next.

  6. Enter an Alias value, then click Next.

  7. Under Key administrators, optionally select users and/or roles, then click Next.

  8. On the Define key usage permissions - optional page, under Other AWS accounts, click Add another AWS account.

    1. In the field that appears, enter the AWS account ID for the account your Panther deployment is in. You can find this value in the Panther Console, in the general settings footer.

    2. Click Next.

  9. Switch to a browser tab with the Panther Console open, and retrieve the Delta Controller Role ARN and Delta Admin Role ARN values:

    1. Click the gear icon (Settings) > General.

    2. Click Data Lake.

    3. Under Databricks Configuration, note the Delta Controller Role ARN and Delta Admin Role ARN values.

  10. In the AWS console, under Key policy, click Edit, then replace the JSON in the code editor with the JSON below:

In the policy below, replace:

  • <Delta Controller Role ARN from Panther settings> with the Delta Controller Role ARN value you retrieved above

  • <Delta Admin Role ARN from Panther settings> with the Delta Admin Role ARN value you retrieved above

  • <AWS Account ID you are working in> with the Account ID of the account you are working in

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "Panther",
			"Effect": "Allow",
			"Principal": {
				"AWS": [
					"<Delta Controller Role ARN from Panther settings>",
					"<Delta Admin Role ARN from Panther settings>"
				]
			},
			"Action": "kms:Decrypt",
			"Resource": "*"
		},
		{
			"Sid": "root",
			"Effect": "Allow",
			"Action": [
				"kms:*"
			],
			"Resource": "*",
			"Principal": {
				"AWS": "arn:aws:iam::<AWS Account ID you are working in>:root"
			}
		}
	]
}

  11. Click Next.

  12. On the Review page, review the configuration, then click Finish.

  13. In the Customer managed keys list, click the alias of the key you just created to view its details page.

  14. Copy the key ARN and add it as the databricks_load_secret_kms_key_arn value in your configuration table.
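
If you script the AWS side instead, the boto3 sketch below creates the KMS key with the key policy already attached and returns its ARN. The key description and alias are illustrative; the role ARN and account ID placeholders are the same values called out above.

# Optional scripted alternative to Step 9, using boto3.
import json
import boto3

ACCOUNT_ID = "<AWS Account ID you are working in>"
DELTA_CONTROLLER_ROLE_ARN = "<Delta Controller Role ARN from Panther settings>"
DELTA_ADMIN_ROLE_ARN = "<Delta Admin Role ARN from Panther settings>"

key_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Panther",
            "Effect": "Allow",
            "Principal": {"AWS": [DELTA_CONTROLLER_ROLE_ARN, DELTA_ADMIN_ROLE_ARN]},
            "Action": "kms:Decrypt",
            "Resource": "*",
        },
        {
            "Sid": "root",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{ACCOUNT_ID}:root"},
            "Action": ["kms:*"],
            "Resource": "*",
        },
    ],
}

kms = boto3.client("kms")
key = kms.create_key(
    Description="Panther Databricks load secret key",  # illustrative description
    KeyUsage="ENCRYPT_DECRYPT",
    KeySpec="SYMMETRIC_DEFAULT",
    Policy=json.dumps(key_policy),
)
kms.create_alias(
    AliasName="alias/panther-databricks-load-secret",  # illustrative alias
    TargetKeyId=key["KeyMetadata"]["KeyId"],
)

# databricks_load_secret_kms_key_arn for your configuration table:
print(key["KeyMetadata"]["Arn"])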

Step 10: Create a load secret in AWS

  1. In your AWS console, ensure you are in the correct region. Navigate to Secrets Manager.

  2. Click Store a new secret.

  3. Under Secret type, select Other type of secret.

  4. Under Key/value pairs, in the Key/value tab, enter the following key/value pairs:

    Key: secret
    Value: <the Secret value you generated in Databricks in Step 8>

    Key: client-id
    Value: <the Client ID value you generated in Databricks in Step 8>

    Key: databricks-host
    Value: <the URL of your Databricks workspace>. While viewing the workspace you created above in your Databricks console, copy the URL of the page. For example, https://dbc-023ca860-3666.cloud.databricks.com

  5. Under Encryption key, select the databricks_load_secret_kms_key_arn KMS key you created in the previous step.

  6. Click Next.

  7. In the Secret name field, enter panther-databricks-admin-access, then click Next.

  8. Without making any changes on the Configure rotation - optional page, click Next.

  9. Review the secret settings, then click Store.

  10. In the Secrets list, click panther-databricks-admin-access to view its details page.

  11. In the Resource permissions tile, click Edit permissions.

  12. Under Resource permissions, replace the JSON in the code editor with the JSON below:

In the policy below, replace:

  • <Delta Controller Role ARN from Panther settings> with the Delta Controller Role ARN value you retrieved above

  • <Delta Admin Role ARN from Panther settings> with the Delta Admin Role ARN value you retrieved above

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Panther",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
            "<Delta Controller Role ARN from Panther settings>",
            "<Delta Admin Role ARN from Panther settings>"
        ]
      },
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*"
    }
  ]
}

  13. Click Save.

  14. Copy the ARN of the newly created secret and add it as the databricks_load_secret_arn value in your configuration table.

  15. In the Databricks console, return to the External Data page (click Catalog > External Data).

  16. Under External Locations, click the panther-processed-data location you created above.

  17. Click Permissions.

  18. Click Grant.

  19. Under Principals, search for and select panther-load.

  20. Under Privileges, check the boxes for BROWSE and READ FILES.

  21. Click Confirm.
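
The AWS portion of this step (creating the secret, encrypting it with the KMS key, and attaching the resource policy) can be scripted as in the boto3 sketch below; the Databricks external-location grants at the end of the step are still applied in the UI. The databricks-host value shown is illustrative.

# Optional scripted alternative to the AWS half of Step 10, using boto3.
import json
import boto3

secret_value = {
    "secret": "<the Secret value you generated in Databricks in Step 8>",
    "client-id": "<the Client ID value you generated in Databricks in Step 8>",
    "databricks-host": "https://<your workspace>.cloud.databricks.com",
}

resource_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Panther",
        "Effect": "Allow",
        "Principal": {"AWS": [
            "<Delta Controller Role ARN from Panther settings>",
            "<Delta Admin Role ARN from Panther settings>",
        ]},
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "*",
    }],
}

sm = boto3.client("secretsmanager")
secret = sm.create_secret(
    Name="panther-databricks-admin-access",
    SecretString=json.dumps(secret_value),
    KmsKeyId="<databricks_load_secret_kms_key_arn from your table>",
)
sm.put_resource_policy(
    SecretId=secret["ARN"],
    ResourcePolicy=json.dumps(resource_policy),
)

# databricks_load_secret_arn for your configuration table:
print(secret["ARN"])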

Step 11: Create a query service principal in Databricks

  1. Access your Databricks workspace settings:

    1. In the upper-right corner, click your initial.

    2. Click Settings.

  2. In the Settings navigation bar, under Workspace admin, click Identity and access.

  3. To the right of Service principals, click Manage.

  4. Click Add service principal.

  5. In the Add service principal modal, click Add new.

  6. In the Service principal name field, enter panther-query.

  7. Click Add.

  8. In the table, click panther-query to view its details page.

  9. Click Secrets.

  10. Click Generate secret.

  11. Under Lifetime (days), enter 730 (the maximum).

  12. Click Generate.

  13. Copy the Secret and Client ID values and store them in a secure location, as you'll need them in a later step (as an alternative to copying these values, you can leave this browser tab open).

Step 12 (Optional): Create a query secret KMS key

In the next step, you'll create an additional secret in AWS. You can either create a new KMS key to associate with this secret, or reuse the KMS key you created in Step 9 (added to your configuration table as databricks_load_secret_kms_key_arn).

  • If you'd like to reuse the KMS key you created above, copy the value of databricks_load_secret_kms_key_arn to databricks_query_secret_kms_key_arn in the configuration table above.

  • If you'd like to create a new KMS key, repeat Step 9: Create a load secret KMS key in AWS, then add the ARN for the key as databricks_query_secret_kms_key_arn in the configuration table above.

Step 13: Create a query secret in AWS

  1. In your AWS console, ensure you are in the correct region. Navigate to Secrets Manager.

  2. Click Store a new secret.

  3. Under Secret type, select Other type of secret.

  4. Under Key/value pairs, in the Key/value tab, enter the following key/value pairs:

    Key: secret
    Value: <the Secret value you generated in Databricks in Step 11>

    Key: client-id
    Value: <the Client ID value you generated in Databricks in Step 11>

    Key: databricks-host
    Value: <the URL of your Databricks workspace>. While viewing the workspace you created above in your Databricks console, copy the URL of the page. For example, https://dbc-023ca860-3666.cloud.databricks.com

  5. Under Encryption key, select the databricks_query_secret_kms_key_arn KMS key you created in the previous step (or the databricks_load_secret_kms_key_arn KMS key, if you are reusing that one).

  6. Click Next.

  7. In the Secret name field, enter panther-databricks-query-access, then click Next.

  8. Without making any changes on the Configure rotation - optional page, click Next.

  9. Review the settings, then click Store.

  10. In the Secrets list, click panther-databricks-query-access to view its details page.

  11. In the Resource permissions tile, click Edit permissions.

  12. Under Resource permissions, replace the JSON in the code editor with the JSON below:

In the policy below, replace:

  • <Delta Controller Role ARN from Panther settings> with the Delta Controller Role ARN value you retrieved above

  • <Delta Admin Role ARN from Panther settings> with the Delta Admin Role ARN value you retrieved above

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Panther",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
            "<Delta Controller Role ARN from Panther settings>",
            "<Delta Admin Role ARN from Panther settings>"
        ]
      },
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*"
    }
  ]
}

  13. Click Save.

  14. Copy the ARN of the newly created secret and add it as the databricks_query_secret_arn value in your configuration table.

Step 14: Create an S3 bucket and external location

  1. In your AWS console, ensure you are in the correct region. Navigate to S3.

  2. Click Create bucket.

  3. Enter a Bucket name.

  4. Click Create bucket.

  5. In the Databricks workspace you created above, click Catalog, then External Data.

  6. Click Create external location.

  7. Click AWS Quickstart (Recommended), then Next.

  8. In the Bucket Name field, enter the name of the bucket you just created.

  9. Under Personal Access Token, click Generate new token.

    • Copy this value, as you'll need it in the following steps. Alternatively, you can leave this page open.

  10. Click Launch in Quickstart.

    • A new browser tab will open in AWS, on a Quick create stack screen with the CloudFormation template pre-loaded.

  11. In the Parameters section, in the Databricks Personal Access Token field, enter the Personal Access Token you generated above in Databricks.

  12. Click Create stack.

  13. After the stack has completed deploying, return to your Databricks console browser tab. On the Create external location with Quickstart screen, click Ok.

    • Verify that the External Locations list contains the one you just created.
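
If you prefer to create the bucket from code, a minimal boto3 sketch is below; the bucket name is illustrative and must be globally unique. The Databricks Quickstart portion of this step still runs in the Console.

# Optional scripted alternative to the bucket creation in Step 14, using boto3.
import boto3

REGION = "us-west-2"  # replace with your region
BUCKET = "panther-databricks-catalog-storage"  # illustrative name; must be globally unique

s3 = boto3.client("s3", region_name=REGION)
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET)  # us-east-1 rejects a LocationConstraint
else:
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )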

Step 15: Create a Databricks catalog

  1. In the Databricks workspace you created above, click Catalog.

  2. Click Add data > Create a catalog.

  3. Fill in the Create a new catalog form:

    • Catalog name: enter panther.

    • Type: select Standard.

    • Select external location: choose the external location you created in Step 14.

  4. Click Create.

  5. On the Catalog created! modal, click View catalog.

  6. Click Permissions.

  7. Click Grant.

  8. In the Grant on panther modal, fill in the form:

    • Principals: type and select panther-load.

    • Select the following permissions:

      • USE CATALOG

      • USE SCHEMA

      • BROWSE

      • SELECT

      • MODIFY

      • CREATE SCHEMA

      • CREATE TABLE

  9. Click Confirm.

  10. Click Grant.

  11. In the Grant on panther modal, fill in the form:

    • Principals: type and select panther-query.

    • Select the following permissions:

      • USE CATALOG

      • USE SCHEMA

      • BROWSE

      • SELECT

  12. Click Confirm.
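
For teams managing Unity Catalog from code, the sketch below shows the same catalog creation and grants using the databricks-sdk Python package. It assumes your SDK version exposes catalogs.create, grants.update, and the Privilege, PermissionsChange, and SecurableType classes; verify these names before running, and substitute the s3:// URL of the bucket you created in Step 14.

# Optional scripted alternative to Step 15, using the databricks-sdk Python package.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import PermissionsChange, Privilege, SecurableType

w = WorkspaceClient()

# Create the catalog backed by the external location from Step 14.
w.catalogs.create(
    name="panther",
    storage_root="s3://<bucket you created in Step 14>",
)

# Grant the load principal read/write privileges and the query principal
# read-only privileges on the catalog.
load_privileges = [
    Privilege.USE_CATALOG, Privilege.USE_SCHEMA, Privilege.BROWSE,
    Privilege.SELECT, Privilege.MODIFY, Privilege.CREATE_SCHEMA, Privilege.CREATE_TABLE,
]
query_privileges = [
    Privilege.USE_CATALOG, Privilege.USE_SCHEMA, Privilege.BROWSE, Privilege.SELECT,
]

w.grants.update(
    securable_type=SecurableType.CATALOG,
    full_name="panther",
    changes=[
        PermissionsChange(principal="panther-load", add=load_privileges),
        PermissionsChange(principal="panther-query", add=query_privileges),
    ],
)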

Step 16: Create a panther-load SQL warehouse

For additional SQL warehouse creation support, see the Databricks Create a SQL warehouse documentation.

  1. In the Databricks workspace you created above, click Compute.

  2. Click SQL warehouses.

  3. Click Create SQL warehouse.

  4. Fill out the New SQL warehouse form:

    • Name: enter panther-load.

    • Cluster size: select 2X-Small.

    • Scaling: set the Max value to 40 (the maximum allowed).

    • Type: select Pro.

  5. Click Create.

  6. In the Manage permissions modal, add the panther-load service principal, then select the Can use permission.

  7. Click Add.

  8. Click the X in the upper-right corner to close the Manage permissions modal.

  9. On the panther-load warehouse details page, copy the ID (next to the name) and add it as the databricks_load_warehouse_id value in your configuration table.

Step 17: Create a panther-optimize SQL warehouse

This warehouse runs nightly table maintenance jobs.

For additional SQL warehouse creation support, see the Databricks Create a SQL warehouse documentation.

  1. In the Databricks workspace you created above, click Compute.

  2. Click SQL warehouses.

  3. Click Create SQL warehouse.

  4. Fill out the New SQL warehouse form:

    • Name: enter panther-optimize.

    • Cluster size: select 2X-Small.

    • Scaling: set the Max value to 40 (the maximum allowed).

    • Type: select Serverless.

  5. Click Create.

  6. In the Manage permissions modal, add the panther-load service principal, then select the Can use permission.

  7. Click Add.

  8. Click the X in the upper-right corner to close the Manage permissions modal.

  9. On the panther-optimize warehouse details page, copy the ID (next to the name) and add it as the databricks_optimize_warehouse_id value in your configuration table.

Step 18: Create a panther-query SQL warehouse

For additional SQL warehouse creation support, see the Databricks Create a SQL warehouse documentation.

  1. In the Databricks workspace you created above, click Compute.

  2. Click SQL warehouses.

  3. Click Create SQL warehouse.

  4. Fill out the New SQL warehouse form:

    • Name: enter panther-query.

    • Cluster size: select Medium.

    • Scaling: set the Max value to 40 (the maximum allowed).

    • Type: select Serverless or Pro.

This SQL warehouse can be Serverless or Pro, but Serverless is recommended. Pro warehouses start up slowly.

  5. Click Create.

  6. In the Manage permissions modal, add the panther-query service principal, then select the Can use permission.

  7. Click Add.

  8. Click the X in the upper-right corner to close the Manage permissions modal.

  9. On the panther-query warehouse details page, copy the ID (next to the name) and add it as the databricks_query_warehouse_id value in your configuration table.

Step 19: Create a panther-scheduled-query SQL warehouse

For additional SQL warehouse creation support, see the Databricks Create a SQL warehouse documentation.

  1. In the Databricks workspace you created above, click Compute.

  2. Click SQL warehouses.

  3. Click Create SQL warehouse.

  4. Fill out the New SQL warehouse form:

    • Name: enter panther-scheduled-query.

    • Cluster size: select 3X-Large.

    • Scaling: set the Max value to 40 (the maximum allowed).

    • Type: select Serverless or Pro.

This SQL warehouse can be Serverless or Pro, but Serverless is recommended. Pro warehouses start up slowly.

  5. Click Create.

  6. In the Manage permissions modal, add the panther-query service principal, then select the Can use permission.

  7. Click Add.

  8. Click the X in the upper-right corner to close the Manage permissions modal.

  9. On the panther-scheduled-query warehouse details page, copy the ID (next to the name) and add it as the databricks_scheduled_query_warehouse_id value in your configuration table.
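
Steps 16 through 19 differ only in warehouse name, size, and type, so they can be scripted in one pass with the databricks-sdk Python package, as sketched below. The auto-stop value is illustrative, serverless availability depends on your account, and the Can use grants for panther-load and panther-query still need to be applied as described in each step.

# Optional scripted alternative to Steps 16-19, using the databricks-sdk Python package.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import CreateWarehouseRequestWarehouseType

w = WorkspaceClient()

warehouses = [
    # (name, cluster size, serverless?)
    ("panther-load", "2X-Small", False),            # Pro, per Step 16
    ("panther-optimize", "2X-Small", True),         # Serverless, per Step 17
    ("panther-query", "Medium", True),              # Serverless recommended, per Step 18
    ("panther-scheduled-query", "3X-Large", True),  # Serverless recommended, per Step 19
]

for name, size, serverless in warehouses:
    created = w.warehouses.create(
        name=name,
        cluster_size=size,
        min_num_clusters=1,
        max_num_clusters=40,
        warehouse_type=CreateWarehouseRequestWarehouseType.PRO,
        enable_serverless_compute=serverless,
        auto_stop_mins=10,  # illustrative auto-stop setting
    ).result()  # blocks until the warehouse is running
    # Record each ID in the matching *_warehouse_id row of your configuration table.
    print(name, created.id)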

Step 20: Send configuration values to Panther

Step 21: Return to the post-setup recommendations
