For this entry I assume you already know how to configure SOLR’s Data Import Handler, since that is what we’ll use to connect SOLR to BigQuery: https://wiki.apache.org/solr/DataImportHandler
Google’s Service Account File
Download the service account file as described here: https://cloud.google.com/docs/authentication/getting-started. I used the JSON version of the file. For the sake of this entry I’ll call this file service_account.json.
Download the Simba JDBC driver as described here: https://cloud.google.com/bigquery/partners/simba-drivers/. I used version 188.8.131.524 of their JDBC 4.2 drivers. The zip file I downloaded contains several JAR files.
Copy the JAR files to SOLR’s server/lib/ext directory and restart SOLR.
SOLR Configuration Files
Create a schema.xml file that contains the BigQuery fields that you’ll be importing.
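As a starting point, a minimal schema.xml might look like the sketch below. The field names here are just assumptions matching the id column in the example query later in this entry; replace them with the BigQuery columns you actually import.

```xml
<!-- Minimal schema.xml sketch; "id" and "name" are example fields,
     not required names. Adjust types and fields to your table. -->
<schema name="bq-example" version="1.6">
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="name" type="string" indexed="true" stored="true"/>
  <uniqueKey>id</uniqueKey>
</schema>
```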
The solr-data-config.xml will look something like the following (adjust your query appropriately).
<dataConfig>
  <dataSource autoCommit="true"
              driver="com.simba.googlebigquery.jdbc42.Driver"
              name="bq"
              type="JdbcDataSource"
              url="jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<YOUR PROJECT ID>;OAuthType=0;OAuthServiceAcctEmail=<YOUR PROJECT'S EMAIL ADDRESS>;OAuthPvtKeyPath=/path/to/service_account.json;LogLevel=6;LogPath=/tmp/bq-log"/>
  <document name="bq-doc">
    <entity dataSource="bq"
            name="what-ever"
            onError="continue"
            query="select id from `your-dataset-name.your-table`">
    </entity>
  </document>
</dataConfig>
The Simba driver is picky about long queries. I had roughly a 60-line query with many leading spaces on each line so that it lined up nicely, and the import would not work. There was no error about the query being too long or having too many spaces; it simply failed. I just happened upon the solution of removing the extra spaces. The BigQuery quotas page (https://cloud.google.com/bigquery/quotas) says the maximum unresolved query length is 256 KB, and my query, even with the spaces, was nowhere near that, so I have to conclude there’s some limit in the Simba driver.
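As a sketch of the workaround, you can flatten an indented, multi-line query into single-spaced text before pasting it into solr-data-config.xml. The query and table name below are just the example from the config above.

```shell
# Collapse newlines and runs of spaces into single spaces.
# The query text is illustrative -- use your own.
query='select id,
       name
from   `your-dataset-name.your-table`'
flat=$(printf '%s' "$query" | tr -s ' \n' ' ')
echo "$flat"
```

This produces a one-line, single-spaced query suitable for the `query` attribute of the entity element.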
One other thing to note is that the SOLR and Simba logs will show another exception, but it will not stop the indexing process. You’ll see this when executing the data import:
Jan 16 19:54:19.731 ERROR 62 com.simba.googlebigquery.exceptions.ExceptionConverter.toSQLException: [Simba][JDBC](10040) Cannot use commit while Connection is in auto-commit mode.
java.sql.SQLException: [Simba][JDBC](10040) Cannot use commit while Connection is in auto-commit mode.
	at com.simba.googlebigquery.exceptions.ExceptionConverter.toSQLException(Unknown Source)
	at com.simba.googlebigquery.jdbc.common.SConnection.commit(Unknown Source)
	at org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:571)
	at org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:560)
	...
In JdbcDataSource.java at line 571 there is a comment, “//SOLR-2045”. Evidently, because of DB2, the SOLR developers added a commit so that connections are released. The problem is that the Simba driver requires autoCommit to be set to “true”, hence the above error. Luckily the commit is in a try/catch block where the catch is ignored, so the SOLR code just continues with closing the connection.
The autoCommit="true" is needed. The Simba JDBC driver will give you issues if you don’t include it and set it to "true":
java.sql.SQLFeatureNotSupportedException: [Simba][JDBC](10220) Driver does not support this optional feature.
	at com.simba.googlebigquery.exceptions.ExceptionConverter.toSQLException(Unknown Source)
	at com.simba.googlebigquery.jdbc.common.SConnection.setAutoCommit(Unknown Source)
	at org.apache.solr.handler.dataimport.JdbcDataSource$1.initializeConnection(JdbcDataSource.java:223)
	...
In other words, the driver doesn’t support the feature, but you still have to set it.
Details about the connection URL start here: https://www.simba.com/products/BigQuery/doc/JDBC_InstallGuide/content/jdbc/bq/authenticating/intro.htm. I’ll provide some info below.
To turn logging off, set LogLevel in the connection URL to 0. The LogPath points to a directory under which a couple of log files will be written: BigQueryJDBC_driver.log and BigQuery_connection_0.log. Note that nothing will be written when the LogLevel is 0. The directory won’t even be created. The example above sets the level to 6, the highest level. I figure that’s a good setting for getting started so you can see everything that’s logged.
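Once things are working, a quieter dataSource might look like this. It is the same element as in the example config above, with LogLevel set to 0 and LogPath dropped since nothing will be written; the placeholders are the same assumptions as before.

```xml
<!-- Same dataSource as the example config, with driver logging off. -->
<dataSource autoCommit="true"
            driver="com.simba.googlebigquery.jdbc42.Driver"
            name="bq"
            type="JdbcDataSource"
            url="jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<YOUR PROJECT ID>;OAuthType=0;OAuthServiceAcctEmail=<YOUR PROJECT'S EMAIL ADDRESS>;OAuthPvtKeyPath=/path/to/service_account.json;LogLevel=0"/>
```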