Limitations for collecting statistics in Watson Query
Data sources differ in their support of local statistics collection and support different column statistics. They also support different methods for collecting statistics.
- remote-catalog
- This type of statistics collection is only supported for virtualized tables in remote data sources that support a local method of statistics collection. Statistics that are stored in the catalog tables at the remote data source are retrieved and then stored in the Watson Query statistics catalog. It is critical to ensure that accurate statistics are available in the remote data source. The remote-catalog type of statistics collection is not supported for grouped tables.
The following table identifies the data sources that support local statistics collection and shows which column statistics are available.
Data source | Basic statistics available | Supports remote-catalog
|
Supports remote-query
|
---|---|---|---|
Amazon Redshift | CARD, COLCARD, NUMNULLS | ✓ | ✓ |
Amazon S3 | See Collecting statistics for data in object storage. | No | ✓ |
Ceph® | See Collecting statistics for data in object storage. | No | ✓ |
Cloudera Impala | CARD, COLCARD, NUMNULLS. Recent versions of Impala (1.4 and later) do not collect NUMNULLS. | ✓ | ✓ |
Databases for PostgreSQL | CARD, COLCARD, NUMNULLS | ✓ | ✓ |
Db2® | CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS | ✓ | ✓ |
Db2 Big SQL | CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS | ✓ | ✓ |
Db2 for z/OS | CARD, COLCARD, HIGH2KEY, LOW2KEY | ✓ | ✓ |
Db2 on Cloud | CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS | ✓ | ✓ |
Generic S3 | See Collecting statistics for data in object storage. | No | ✓ |
Greenplum | CARD, COLCARD, NUMNULLS | ✓ | ✓ |
Hive | CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS | ✓ (Not recommended)
|
✓ |
IBM® Cloud Object Storage | See Collecting statistics for data in object storage. | No | ✓ |
Informix | CARD, COLCARD | ✓ | ✓ |
MariaDB | CARD, HIGH2KEY, LOW2KEY, NUMNULLS | ✓ | ✓ |
Microsoft SQL Server | CARD, COLCARD | ✓ | ✓ |
MySQL | CARD, COLCARD | ✓ | ✓ |
Netezza Performance Server | CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS | ✓ | ✓ |
Oracle | CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS | ✓ | ✓ |
SAP ASE | CARD | ✓ | ✓ |
SAP HANA | CARD, COLCARD | ✓ | ✓ |
Snowflake | CARD | ✓ | ✓ |
Teradata | CARD, COLCARD, NUMNULLS | ✓ | ✓ |
Amazon Redshift
To collect local table statistics, use the ANALYZE command, as described in the Amazon Redshift product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
Amazon Redshift does not support the HIGH2KEY or LOW2KEY statistic. Also, the ANALYZE command does not collect statistics for CHAR and NCHAR column data if the length of the column data exceeds 1020 bytes. If a column contains only NULL values, column statistics are not collected.
If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
Cloudera Impala
To collect local table statistics, use the COMPUTE STATS statement, as described in the Cloudera Impala product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
Cloudera Impala does not support the HIGH2KEY or LOW2KEY statistic, and more recent versions of Impala do not support the NUMNULLS statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
Databases for PostgreSQL
To collect local table statistics, use the ANALYZE command, as described in the PostgreSQL product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
PostgreSQL does not support the HIGH2KEY or LOW2KEY statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
Db2 and Db2 on Cloud
To collect local table statistics for these data sources, use the RUNSTATS command. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. With this type, the HIGH2KEY and LOW2KEY values (representing the second highest and second lowest values in the table) are copied from the Db2 catalog. With the remote-query type, the HIGH2KEY and LOW2KEY values represent the highest and lowest values in the table.
Db2 Big SQL
To collect local table statistics, use the ANALYZE command. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. With both types, the HIGH2KEY and LOW2KEY values represent the highest and lowest values in the table.
Db2 for z/OS
To collect local table statistics, use the RUNSTATS command. Statistics are not collected for LOB types. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. With this method, the HIGH2KEY and LOW2KEY values (representing the second highest and second lowest values in the table) are copied from the Db2 catalog. With the remote-query collection type, the HIGH2KEY and LOW2KEY values represent the highest and lowest values in the table.
Db2 for z/OS® does not support the NUMNULLS statistic. If your queries would benefit from this statistic, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
Greenplum
To collect local table statistics, use the ANALYZE command, as described in the Greenplum product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
Greenplum does not support the HIGH2KEY or LOW2KEY statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
Hive
To collect local table statistics, use the ANALYZE command, as described in the Hive product documentation. The remote-catalog statistics collection type in Hive is not recommended because of known issues with statistics collection in Hive. Use the remote-query statistics collection type instead. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
Informix
To collect local table statistics, use the UPDATE STATISTICS statement, as described in the Informix® product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
Informix does not support the HIGH2KEY, LOW2KEY, or NUMNULLS statistic. It collects only CARD (table cardinality) and COLCARD (column cardinality). If your queries would benefit from these statistics, use the remote-query statistics collection type.. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
MariaDB
To collect local table statistics, use the ANALYZE command, as described in the MariaDB product documentation. Statistics for TEXT type or BLOB type columns are not collected. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. You must hold administrative privileges to access catalog statistics, which is required when you use the remote-catalog statistics collection type.
MariaDB does not support the COLCARD statistic. If your queries would benefit from this statistic, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
Microsoft SQL Server
To collect local table statistics, use the UPDATE STATISTICS statement, as described in the Microsoft SQL Server product documentation. Statistics for XML type columns are not collected. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
Microsoft SQL Server collects only CARD (table cardinality) and COLCARD (column cardinality). It does not support the HIGH2KEY, LOW2KEY, or NUMNULLS statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
MySQL
To collect local table statistics, use the ANALYZE TABLE statement, as described in the MySQL product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
MySQL collects only CARD (table cardinality) and COLCARD (column cardinality). It does not support the HIGH2KEY, LOW2KEY, or NUMNULLS statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
Netezza Performance Server
To collect local table statistics, use the GENERATE STATISTICS command, as described in the Netezza® product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. With both types, the HIGH2KEY and LOW2KEY values represent the highest and lowest values in the table.
Oracle
To collect local table statistics, use the ANALYZE command, as described in the Oracle product documentation. Alternatively, you can collect statistics by using the DBMS_STATS.GATHER_TABLE_STATS procedure. You can also configure automatic statistics collection in the Oracle remote data source. Statistics are not collected for LOB types. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. With both types, the HIGH2KEY and LOW2KEY values represent the highest and lowest values in the table.
SAP ASE
To collect local table statistics, use the UPDATE STATISTICS statement, as described in the SAP ASE product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
SAP ASE collects only CARD (table cardinality). It does not support the COLCARD, HIGH2KEY, LOW2KEY, or NUMNULLS statistic. Use the remote-query statistics collection type for key columns that are used in predicates. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
SAP HANA
To collect local table statistics, use the CREATE STATISTICS ON statement, as described in the SAP HANA product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
SAP HANA collects only CARD (table cardinality) and COLCARD (column cardinality). It does not support the HIGH2KEY, LOW2KEY, or NUMNULLS statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
Snowflake
Snowflake does not support an explicit method for collecting statistics. However, the ROW_COUNT column in the INFORMATION_SCHEMA.TABLES catalog table stores the table cardinality (CARD) value as part of Watson Query statistics collection. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
Snowflake collects only CARD (table cardinality). It does not support the COLCARD, HIGH2KEY, LOW2KEY, or NUMNULLS statistic. Use the remote-query statistics collection type for key columns that are used in predicates. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
Teradata
To collect local table statistics, use the COLLECT STATISTICS command, as described in the Teradata product documentation. Specify the THRESHOLD option to avoid redundant statistics collection. With this option, if the amount of data that changed since the last statistics collection is under a specified threshold, or the statistics are newer than a specified age, Teradata does not recollect statistics. Statistics are not collected for the PERIOD type (except for the BEGIN and END expressions), the XML, BLOB, or CLOB type, or any user-defined type (other than a geospatial type). Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.
Teradata does not support the HIGH2KEY or LOW2KEY statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
IBM Cloud Object Storage, Amazon S3, Ceph, or Generic S3 data sources
To collect statistics on virtualized tables in IBM Cloud Object Storage, Amazon S3, Ceph, or Generic S3 data sources, use the ANALYZE command.
If the table has a large amount of data, specify the TABLESAMPLE option to collect statistics on a subset of the data so that the analyze operation can complete sooner.