Google BigQuery connector (DataStage)
Use the Google BigQuery connector in DataStage® to read data from and write data to a BigQuery data warehouse.
Prerequisite
Create the connection. For instructions, see Connecting to a data source in DataStage and the Google BigQuery connection.
Configure the Google BigQuery connector for write mode
Specify how the Google BigQuery connector writes data to a BigQuery table. Select the Write mode on the target connector under General target properties. The following table shows the ways that you can configure the connector for the Write mode.
Method | Description |
---|---|
Insert | The Insert mode inserts data into the table. The data is immediately available in the table, and you can perform any other operations on it right away. |
Streaming insert | The Streaming insert mode inserts data into the specified table by using the BigQuery streaming API. Select this method if you want to query data in the table for real-time analysis. It can take up to 90 minutes for the data in the table to be available for subsequent operations like Update, Replace, or Truncate. If you do any of these operations in the meantime, you might receive error messages because these operations can cause inconsistencies in table data. |
Merge | The Merge mode first updates the table based on the key columns that you specify, and then inserts the data into the table. |
Update | The Update mode updates only the table data based on the key columns. |
Delete | The Delete mode deletes only the table data based on the key columns. |
Delete then Insert | The Delete then Insert mode first deletes the table data based on the key columns and then inserts the data into the table. |
Update statement | The Update statement mode writes rows to the table when you provide an SQL statement that contains the temporary staging table TEMP_EXTERNAL_TABLE. The data from the input link is written to the temporary table. If TEMP_EXTERNAL_TABLE is missing from the statement, the job fails. For an example, see the sketch after this table. |
Call procedure statement | The Call procedure statement mode runs your own procedure statement on the database. Enter the procedure statement in the text box. |
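For example, the following is a minimal sketch of an Update statement that uses the TEMP_EXTERNAL_TABLE staging table. The dataset, table, and column names (testds.target, id, name) are hypothetical; only the TEMP_EXTERNAL_TABLE identifier is required by the connector.

```sql
-- TEMP_EXTERNAL_TABLE holds the rows from the input link.
-- testds.target, id, and name are hypothetical names.
UPDATE testds.target AS t
SET t.name = s.name
FROM TEMP_EXTERNAL_TABLE AS s
WHERE t.id = s.id
```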
Configure the Google BigQuery connector table action
Specify the action to take on the target table to handle the new data set. The following table shows the ways that you can configure the Table action.
Method | Description |
---|---|
Append | Append creates the table if the table doesn’t exist. If the table already exists, no action is performed. |
Replace | Replace drops the existing table and creates a new table with the existing job design schema. An existing table is dropped only when there is no existing streaming buffer attached to the table. |
Truncate | Truncate deletes all the records from a table, but it does not remove the table itself. |
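As a rough mental model, the table actions correspond to the following BigQuery SQL, assuming a hypothetical table mydataset.mytable with a simple schema. The connector performs the equivalent operations through the BigQuery API; it does not necessarily run these exact statements.

```sql
-- Append: create the table only if it does not exist; otherwise leave it as is.
CREATE TABLE IF NOT EXISTS mydataset.mytable (id INT64, name STRING);

-- Replace: drop the existing table and re-create it from the job design schema.
DROP TABLE IF EXISTS mydataset.mytable;
CREATE TABLE mydataset.mytable (id INT64, name STRING);

-- Truncate: delete all records but keep the table itself.
TRUNCATE TABLE mydataset.mytable;
```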
Partitioned reads
You can configure the Google BigQuery connector to run on multiple processing nodes and read data from the data source. Each of the processing nodes for the stage retrieves a set (partition) of records from the data source. The partitions are then combined to produce the entire result set for the output link of the stage.
To enable partitioned reads, in the Source stage under General source properties, for Read method, select Select statement. Then select Enable partitioned reads.
When the connector is configured for partitioned reads, it runs the statement that is specified in the Select statement property on each processing node. You can use special placeholders in the statements to ensure that each of the processing nodes retrieves a distinct partition of records from the data source.
You can use the following placeholders in the statements to specify distinct partitions of records to retrieve in individual processing nodes:
[[node-number]]
: This placeholder is replaced at run time with the index of the current processing node. The indexes are zero-based: the placeholder is replaced with the value 0 on the first processing node, 1 on the second processing node, 2 on the third processing node, and so forth.

[[node-count]]
: This placeholder is replaced at run time with the total number of processing nodes for the stage. By default, this number is the number of nodes in the parallel configuration file. The location of the parallel configuration file is specified in the APT_CONFIG_FILE environment variable.
Example SQL statements:

```sql
select * from testds.testPartitionRead where MOD(C1, [[node-count]]) = [[node-number]]

select * from testds.testPartitionRead where MOD(ABS(FARM_FINGERPRINT(C2)), [[node-count]]) = [[node-number]]
```

The first statement distributes rows by a numeric column (C1); the second hashes a string column (C2) with FARM_FINGERPRINT so that non-numeric keys can also be distributed evenly across the nodes.
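For example, on a two-node configuration ([[node-count]] = 2), the first statement resolves as follows on each node:

```sql
-- Node 0
select * from testds.testPartitionRead where MOD(C1, 2) = 0
-- Node 1
select * from testds.testPartitionRead where MOD(C1, 2) = 1
```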
TIME data types
The Google BigQuery connector supports these TIME data types:
- TIMESTAMP
- Extended options:
- Microseconds
- Timezone
- Microseconds + Timezone
For information about how Google uses the TIMESTAMP data type, see Timestamp type.
- TIME
- Extended option:
- Microseconds
For information about how Google uses the TIME data type, see Time type.
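As a quick illustration of what the extended options map to in BigQuery, the following literals carry microsecond precision, and the TIMESTAMP literal also carries a time zone offset (the values are hypothetical):

```sql
-- TIMESTAMP with microseconds and a time zone offset
SELECT TIMESTAMP '2024-01-15 10:30:00.123456+02:00' AS ts_value,
       -- TIME with microseconds (TIME itself has no time zone in BigQuery)
       TIME '10:30:00.123456' AS time_value;
```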
Lookup stage support (table)
On the Stage tab, under General, you can find the Lookup type property, where you can choose between the Sparse and Normal lookup methods.
Lookup method | Description |
---|---|
Sparse Lookup | In a sparse lookup, the connector runs the SELECT statement against the BigQuery table once for each record that arrives on the primary input link. Use this method when the reference table is too large to fit in memory or when you need the most current data. See the sketch after this table. |
Normal Lookup | In a normal lookup, the connector reads the reference data into memory once, and the lookup is then performed in memory. Use this method when the reference table is small enough to fit in memory. |
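The following is a minimal sketch of a sparse-lookup SELECT statement, assuming the DataStage convention of ORCHESTRATE.<column> placeholders for key columns and hypothetical table and column names:

```sql
-- testds.reference, id, and name are hypothetical names.
-- ORCHESTRATE.id is replaced with the key value from each record
-- on the primary input link.
select id, name from testds.reference where id = ORCHESTRATE.id
```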
For more information about the Lookup stage, see https://www.ibm.com/docs/en/iis/11.7?topic=data-lookup-stage.