IBM VEST Workshops
45 min
Last updated 10/14/2023

Federated Queries

Heads Up! Quiz material will be flagged like this!

Federated Queries

Unlike traditional database systems, Presto doesn’t have its own native database storage. Instead, Presto supports separation of compute and storage, with dozens of connectors that let Presto access data where it lives – which could be in relational databases, NoSQL databases, data warehouses, data lakes, data lakehouses, and more.

Although Presto supports a wide range of connectors, watsonx.data officially only supports a subset of them. This is because IBM wants to ensure quality, performance, and security of connectors before adding support (which may require updating connector source code to do this). More connectors will be added over time.

The most current list of connectors and the types of SQL statements supported can be found here. The list of supported connectors will grow over time.

In this section you will combine data from watsonx.data’s object storage with data in Db2 and PostgreSQL databases. To avoid you having to provision these databases yourself, they’ve been installed in the same VM as watsonx.data and are pre-populated with data.

Note: : The password of the embedded Db2 server expired in October 2023. Before proceeding with the instructions in this section, please open a terminal command window and run the following commands to fix the password issue.

ssh -p <port#> watsonx@<hostname>
bash

When prompted for the password, enter watsonx.data (you will not see the password as you type it).

sudo su -
bash
docker exec db2server chage -I -1 -m 0 -M 99999 -E -1 db2inst1
bash

With the password issue fixed, please continue with the steps below.

  1. Select the Infrastructure manager icon from the left-side menu.

  2. Click the Add component dropdown menu at the right-side of the screen. Select Add Database.

Quiz material: pay attention to the Add components menu

  1. In the Add database pop-up window, select/enter the following pieces of information in the 1. Database definition section:
  • Database type: IBM Db2
  • Database name: GOSALES
  • Display name: Db2DW
  • Hostname: 192.168.252.2
  • Port: 50000
  • Username: db2inst1
  • Password: db2inst1
  • SSL connection: Uncheck

When a new bucket or database is added to watsonx.data, a new catalog is created as well.

  1. Scroll down to section 2. Associated catalog definition. Enter db2catalog for the Catalog name. Then, click Add.

The Db2DW database and db2catalog catalog have been added to watsonx.data and are now reflected in the topology view of the infrastructure components.

The db2catalog catalog is automatically associated with the Db2DW database, but to be able to query data from this database, the db2catalog catalog must also be associated with the presto-01 engine.

  1. Hover your mouse pointer over the db2catalog catalog tile and the Manage associations icon will appear. Click the Manage associations icon.
  1. In the Manage associations pop-up window, select the checkbox for the presto-01 engine and then click Save and restart engine.

A line now connects presto-01 with db2catalog, indicating that they’re associated.

Next you will add the PostgreSQL database. The password for the PostgreSQL admin account is specific to your environment and the following steps will extract it for you.

  1. Open a terminal command window to the watsonx.data server as the root user.

  2. The administrative user for the PostgreSQL database is admin. Run the following command to extract and display the password. Copy the value shown to a location you can refer to later.

docker exec ibm-lh-postgres printenv | grep POSTGRES_PASSWORD | sed 's/.*=//'
bash
  1. Repeat Steps 2-6 from above to add the PostgreSQL database to watsonx.data. Use the following information:
  • Database type: PostgreSQL(it can be found under Other Databases)
  • Database name: gosales
  • Display name: PostgreSQLDB
  • Hostname: 192.168.252.2
  • Port: 5432
  • Username: admin
  • Password: The password generated in the previous step
  • Catalog name: pgcatalog

With both the Db2 and PostgreSQL databases added, the topology should look like the image below.

  1. Select the Data manager icon from the left-side menu.

  2. Expand both the db2catalog and pgcatalog catalogs.

NOTE: If you get a 'db2inst1 password expired' error when trying to toggle the 'db2catalog' and the schema and table information do not show up, the following steps will help you remedy it:

  1. Go back to the Infrastructure Manager page and disconnect Presto from db2catalog
  2. Hover over the db2catalog tile and click the trashcan icon to remove it.
  3. SSH into your watsonx instance on the command line, and change to root user. (sudo su -)
  4. Run the following Docker command to extend the password expiry: docker exec db2server chage -I -1 -m 0 -M 99999 -E -1 db2inst1
  5. Go back and follow steps 2-6 to re-add the db2 database.

Schema and table information are shown for both of these databases. Both include an identical gosalesdw schema with tables containing sales data for the fictional Great Outdoors company. Remember that there is also a copy of this same data in object storage, managed by the hive_data catalog.

If you do not see any tables associated with the gosalesdw schema in either of the catalogs, you can refresh the schema.

Note: If for some reason the list of tables does not refresh, or if you see a message appear in the upper-right corner saying “An error occurred while loading the tables in gosalesdw”, similar to the image below, you can still proceed. The table information is in the catalog and queries that reference these tables will still work. Please proceed with the rest of the lab.

You now have copies of the Great Outdoors company sales tables in object storage (hive_data), Db2 (db2catalog), and PostgreSQL (pgcatalog). This isn’t something you’d have in a real-world scenario (after all, the benefit of being able to federate access to multiple data sources is to avoid data duplication). However, it’s being done here for the purposes of highlighting Presto’s federation capabilities.

You will now see how you can run a federated query that combines data from all three of these data sources.

  1. Select the Query workspace icon from the left-side menu.

  2. Copy and paste the query below into the SQL worksheet. Click Run on presto-01.

This sample query could be used by the fictional business to determine which purchasing method is associated with the largest orders. The query accesses five tables, one of which is in PostgreSQL, two are in Db2, and two are in watsonx.data’s object storage.

select pll.product_line_en as product, md.order_method_en as order_method, sum(sf.quantity) as total from pgcatalog.gosalesdw.sls_order_method_dim as md, db2catalog.GOSALESDW.SLS_PRODUCT_DIM as pd, db2catalog.GOSALESDW.SLS_PRODUCT_LINE_LOOKUP as pll, hive_data.gosalesdw.sls_product_brand_lookup as pbl, hive_data.gosalesdw.sls_sales_fact as sf where pd.product_key = sf.product_key and md.order_method_key = sf.order_method_key and pll.product_line_code = pd.product_line_code and pbl.product_brand_code = pd.product_brand_code group by pll.product_line_en, md.order_method_en order by product, order_method;
bash

Note: If the SQL statement pastes into the worksheet as a single line, you can nicely format it by clicking the Format worksheet icon.

The Result set for the query is found at the bottom of the screen.

  1. Click the Explain button in the menu bar of the worksheet.

The visual explain output for this query looks a lot more interesting than the one shown earlier. If you scroll through the visual explain output, you’ll see five ScanProject leaf nodes in the tree. These correspond to the five tables being read.

  1. Click the X in the upper-right corner of the Explain window to close it.

Optional step : Here are two other queries you can try running as well, which combine data from the same three data sources.

  1. The following query displays all Canadian and Mexican employees, along with their region and country. This is the kind of query that a reporting tool might generate, based on input from the user.
select distinct branch_region_dim.region_en region, branch_region_dim.country_en country, emp_employee_dim.employee_name employee from hive_data.gosalesdw.go_region_dim branch_region_dim, pgcatalog.gosalesdw.emp_employee_dim emp_employee_dim, db2catalog.GOSALESDW.GO_BRANCH_DIM go_branch_dim where branch_region_dim.country_en in ('Canada', 'Mexico') and branch_region_dim.country_code = go_branch_dim.country_code and emp_employee_dim.branch_code = go_branch_dim.branch_code order by region, country, employee;
bash
  1. In many businesses, departments (or organizations, as they’re called in this dataset) are hierarchical, in that a department falls under another department, which in turn falls under another one. This query displays the two parent departments for a given set of departments. As in the previous query, this is the kind of query that a reporting tool might generate.
select gosalesdw.go_org_dim.organization_key, go_org_dim_1.organization_parent as org_level1_code, go_org_name_lookup_1.organization_name_en as org_level1_name, gosalesdw.go_org_dim.organization_parent as org_level2_code, go_org_name_lookup_2.organization_name_en as org_level2_name, gosalesdw.go_org_dim.organization_code as org_code, gosalesdw.go_org_name_lookup.organization_name_en as org_name from pgcatalog.gosalesdw.go_org_name_lookup go_org_name_lookup_2 inner join hive_data.gosalesdw.go_org_dim inner join pgcatalog.gosalesdw.go_org_name_lookup on hive_data.gosalesdw.go_org_dim.organization_code = pgcatalog.gosalesdw.go_org_name_lookup.organization_code on go_org_name_lookup_2.organization_code = hive_data.gosalesdw.go_org_dim.organization_parent inner join pgcatalog.gosalesdw.go_org_name_lookup go_org_name_lookup_1 inner join hive_data.gosalesdw.go_org_dim go_org_dim_1 on go_org_name_lookup_1.organization_code = go_org_dim_1.organization_parent on hive_data.gosalesdw.go_org_dim.organization_parent = go_org_dim_1.organization_code where (hive_data.gosalesdw.go_org_dim.organization_code between '1700' and '5730') order by org_code;
bash

Note: Try running the two queries above from within the Presto CLI as well. You should get the same results in the Presto CLI as you did in the Query workspace.

Watsonx.data L3 Course Quiz

Congratulations! You have completed the necessary labs needed to complete the watsonx.data L3 course quiz. The quiz can be found here. Note: You do not need to watch the videos as the labs have covered all course content except for the quiz. You can click through the lessons and take the quiz.

Congratulations, you've reached the end of lab 106.

Click, lab 107 to start next lab.