In Watson Query, you can virtualize and use data that is stored as
files on object storage.
You can virtualize data in files in the following object
storages data sources:
- IBM® Cloud Object Storage
- Amazon S3
- Ceph®
- Generic S3
Segment or combine data from one or more files to create a virtual table.
Before you begin
To access data in cloud object storage, you must create a connection to the data source where the
files are located. For more information about object storage data sources, see Data sources in object storage in Watson Query.
About this task
Watch the following video for an overview of
virtualized files in Cloud Object Storage in Watson Query.
This video provides a visual method
as an alternative to following the written steps in this documentation.
Procedure
- On the navigation menu, click .
The service menu opens to the Data
sources page by default.
- On the service menu, click
and then click the Files tab.
The list of available data sources appears. You can narrow down the displayed assets by
using the available filters.
If you specified a bucket name when you set up the data source
connection, click to expand the object storage connection details to see the
Service type and Bucket information. If you didn't
specify a bucket name when you set up the data source connection, you can use the
Bucket input field to find a specific bucket in the endpoint.
If the
list of data sources does not appear, click Refresh.
- Select the object storage endpoint where you want to browse files and file
paths.
For Cloud Object Storage, the endpoint
is the URL for the object storage.
A list of file paths or buckets in the
endpoint appears. You can navigate through the file path structure or you can click
to see
details of the contents of the first file in the file path.
You cannot add the bucket to your
cart. To add a file path to your cart, you must select the file path to preview the files in the
path, and then click Add to cart. You cannot select a file at the bucket
level, you must add the file to a file path in the bucket.
- Select the file or file path that you want to virtualize and click Add to
cart.
Important:
- You cannot virtualize a single file
in a file path that contains multiple files. The URL resolves to the parent path where the file is
located and the entire file path is virtualized. If you want to virtualize a single file, you can
move it to a separate file path. The separate file path must not be a sub-file path of any other
file path that is, or will be, virtualized.
- Files that you want to virtualize must be within a file path and not at the same level as the
bucket. For example, you cannot virtualize a file
s3a://mynewbigsqlbucket/mydata.csv; you must put
mydata.csv into a file path and virtualize
s3a://mynewbigsqlbucket/fi1epath1/mydata.csv because the virtualization process
cannot create an external table by using only a bucket name without a path.
-
Click View cart to preview your file data selections as a virtual file.
From this window, you can edit schema names, preview files that participate in a merged table,
or remove a selection from your cart.
If you have IBM Knowledge
Catalog installed, you can publish your virtual table
to a catalog. For more information, see Publishing virtual data to the catalog in Watson Query.
- Recommended: Update the type of partitioned columns from STRING to something more
appropriate. Manually inspect and specify proper types for partitioning columns for best
performance.
- Optional: Click and
select Edit columns.
You can edit any column name that is not tagged as a
Partitioning column and change column types by using the drop-down menu. When
you are happy with your edits, click Apply. Updated column names are shown
after you virtualize the table.Note: When you virtualize JSON files with Japanese data on
IBM Cloud Object Storage and the Japanese column names are not
displayed correctly, you can use the
allownonalphanumeric
option to view the
virtualized Japanese column headers properly. This option is disabled by default and you must enable
it. For more information, see
Japanese column names are not displayed correctly in virtualized data.
- Select the appropriate option to assign the virtual table to be created from file data:
Assign to |
When to use this option |
Project |
Select Project if you created the virtual table to use in a specific
project. Then, choose the appropriate project. The table also appears in Virtualized
data. |
Virtualized data |
Select Virtualized data if the table was not created to use in a
specific project. This setting is the default if no projects exist. |
- Select Publish to catalog if you also
want to publish to a selected catalog.
A list of available catalogs is shown in the
drop-down menu. Each catalog is tagged as
Governed or
Not
governed.
Note: You must have at least one catalog in
IBM Knowledge
Catalog.
You must have permission to publish to a
catalog. An administrator can enable whether all virtual objects are published to a selected
governed catalog, which prevents a user from publishing to a specified
catalog.
- Specify a schema in the Schema
field.
You can also create a schema by following these
steps.
- If you have the Watson Query
Engineer or User role, leave the
Schema field as default to create a schema with your user ID.
- If you have the Watson Query
Manager role, leave the
Schema field as default to create a schema with your user ID or enter the new
schema name in the Schema field.
-
Click Virtualize to complete the process.
When the status window appears, you can select to view your virtualized data
or virtualize more data.
What to do next
- View the table structure and metadata.
- Manage access to the table.
- Edit column names and the types of your object storage assets so that you can prepare accurate
data for virtualization.
- Collect statistics for your virtualized table to optimize query performance. For more
information, see Collecting statistics in Watson Query.
-
Optionally, on the Virtualized data page, publish
your virtual object to the catalog. For more information, see Publishing virtual data to the catalog in Watson Query.