Indexing SOLR Using Data from Google’s BigQuery

For this entry I assume you already know how to configure SOLR's Data Import Handler (DIH), since that is how we'll connect SOLR to BigQuery: https://wiki.apache.org/solr/DataImportHandler

Steps

Google’s Service Account File

Download the service account file as described here: https://cloud.google.com/docs/authentication/getting-started  I used the JSON version of the file. For the sake of this entry I’ll call this file: service_account.json

JDBC Driver

Download the Simba JDBC driver as described here: https://cloud.google.com/bigquery/partners/simba-drivers/ I used version 1.1.4.1004 of their JDBC 4.2 drivers. The zip file I downloaded contains these JAR files:

GoogleBigQueryJDBC42.jar
google-api-client-1.22.0.jar
google-api-services-bigquery-v2-rev355-1.22.0.jar
google-http-client-1.22.0.jar
google-http-client-jackson2-1.22.0.jar
google-oauth-client-1.22.0.jar
jackson-core-2.1.3.jar

Copy them to SOLR’s server/lib/ext directory and restart SOLR.
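
A rough sketch of that step, assuming the zip was extracted into the current directory and SOLR_DIR points at your SOLR installation (both names are placeholders):

  # copy the Simba JARs so SOLR can load the BigQuery driver
  cp *.jar "$SOLR_DIR"/server/lib/ext/
  # restart the default SOLR instance so the new JARs are picked up
  "$SOLR_DIR"/bin/solr restart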

SOLR Configuration Files

Create a schema.xml file that contains the BigQuery fields that you’ll be importing.

The solr-data-config.xml will look something like the following (adjust your query appropriately).

<dataConfig>
   <dataSource 
     autoCommit="true"
     driver="com.simba.googlebigquery.jdbc42.Driver"
     name="bq"
     type="JdbcDataSource"
     url="jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<YOUR PROJECT ID>;OAuthType=0;OAuthServiceAcctEmail=<YOUR PROJECT'S EMAIL ADDRESS>;OAuthPvtKeyPath=/path/to/service_account.json;LogLevel=6;LogPath=/tmp/bq-log" />
   <document name="bq-doc">
     <entity
       dataSource="bq"
       name="what-ever"
       onError="continue"
       query="select id from `your-dataset-name.your-table`">
     </entity>
   </document>
</dataConfig>
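
With the data source and entity defined, you can trigger the import through the Data Import Handler's endpoint. A sketch, assuming a core named my-core running on the default port:

  # kick off a full import
  curl 'http://localhost:8983/solr/my-core/dataimport?command=full-import'
  # check how the import is progressing
  curl 'http://localhost:8983/solr/my-core/dataimport?command=status'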

Notes

Long Queries

The driver is picky about long queries. I had a roughly 60-line query with many spaces before each line so that everything lined up nicely, and the import would not work. There was no error about the query being too long or having too many spaces; it simply would not work. I happened upon the solution of removing the extra spaces. The BigQuery quotas page (https://cloud.google.com/bigquery/quotas) says that the maximum unresolved query length is 256 KB. My query, even with the spaces, was nowhere near that long, so I have to conclude the limitation is somewhere in the Simba driver.
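
If you suspect the same problem, one workaround is to strip the leading indentation and squeeze repeated spaces out of the query before pasting it into solr-data-config.xml. A sketch, where query.sql is a hypothetical file holding the query:

  # remove leading whitespace and squeeze runs of spaces (query.sql is hypothetical)
  sed 's/^[[:space:]]*//' query.sql | tr -s ' ' > query-compact.sql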

Commit Exception

One other thing to note is that the SOLR and Simba logs will show another exception but it will not stop the indexing process. You’ll see this when executing the data import.

Jan 16 19:54:19.731 ERROR 62 com.simba.googlebigquery.exceptions.ExceptionConverter.toSQLException: [Simba][JDBC](10040) Cannot use commit while Connection is in auto-commit mode.
java.sql.SQLException: [Simba][JDBC](10040) Cannot use commit while Connection is in auto-commit mode.
   at com.simba.googlebigquery.exceptions.ExceptionConverter.toSQLException(Unknown Source)
   at com.simba.googlebigquery.jdbc.common.SConnection.commit(Unknown Source)
   at org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:571)
   at org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:560)
   ...

In JdbcDataSource.java:571 there is a comment “//SOLR-2045“. Evidently, because of DB2, the SOLR developers added a commit before closing so that connections are released. The problem is that this driver requires autoCommit to be set to “true” (see below), and calling commit on an auto-commit connection produces the error above. Luckily the commit is in a try/catch block where the catch is ignored, so the SOLR code simply continues with closing the connection.

autoCommit

The autoCommit=”true” is needed. The Simba JDBC drivers will give you issues if you don’t include it and set it to “true”:

java.sql.SQLFeatureNotSupportedException: [Simba][JDBC](10220) Driver does not support this optional feature.
   at com.simba.googlebigquery.exceptions.ExceptionConverter.toSQLException(Unknown Source)
   at com.simba.googlebigquery.jdbc.common.SConnection.setAutoCommit(Unknown Source)
   at org.apache.solr.handler.dataimport.JdbcDataSource$1.initializeConnection(JdbcDataSource.java:223)
   ...

The driver doesn’t support turning auto-commit off, so the attribute has to be present and set to “true”.

Connection URL

Details about the connection URL can be found here: https://www.simba.com/products/BigQuery/doc/JDBC_InstallGuide/content/jdbc/bq/authenticating/intro.htm. I’ll provide some of the highlights below.

Logging

To turn logging off, set LogLevel in the connection URL to 0. The LogPath points to a directory under which a couple of log files will be written: BigQueryJDBC_driver.log and BigQuery_connection_0.log. Note that nothing will be written when the LogLevel is 0. The directory won’t even be created. The example above sets the level to 6, the highest level. I figure that’s a good setting for getting started so you can see everything that’s logged.

 

Adding SOLR to WordPress

There are a couple of fully developed plugins that bring the magic of SOLR into WordPress. However, if you just want to get something up and running without installing a plugin, this article may be for you. Here I describe how to bring a better search experience to your users by editing your theme.

The first thing to do is to create a function that calls SOLR and parses the results. For example:

    function do_solr_search($query)
    {
      if (is_null($query) || strlen(trim($query)) === 0)
      {
        error_log('query is empty', 4);
        return array();
      }
      $query = urlencode($query);

      // substitute your own SOLR host, port, and core name
      $url = "http://solr_host:port/solr/my-core/select/?&wt=json&json.nl=map&q={$query}";
      $results = file_get_contents($url);
      $json_data = json_decode($results, true);
      if (!is_array($json_data) || count($json_data) === 0)
      {
        error_log("no results for: $url", 4);
        $json_data = array();
      }

      $docs = array();
      if (array_key_exists('response', $json_data))
      {
        $response = $json_data['response'];
        if (array_key_exists('docs', $response))
        {
          $docs = $response['docs'];
        }
      }

      return $docs;
    }

Then modify the search.php script to use this new function:

    <ul>
    <?php
      // get_search_query() returns the user's search terms
      $docs = do_solr_search(get_search_query());
      if (count($docs) > 0)
      {
        foreach($docs as $doc)
        {
          // passes the doc along to the template
          set_query_var( 'doc', $doc );
          get_template_part( 'template-parts/content', 'search' );
        }
      }
      else 
      {
        get_template_part( 'template-parts/content', 'none' );
      }
    ?>
    </ul>
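
Before wiring this into the theme, it can help to confirm that the select URL returns what you expect. For example, from the command line (substitute your own host, port, and core):

  curl 'http://solr_host:port/solr/my-core/select?q=test&wt=json&json.nl=map'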

Bash Programming

Redirection

Put these at the top of your scripts to tell bash to send all standard output and standard error produced after the statement to $log_file. Each of the following statements sends everything produced by the current script, sub-shells, etc. to $log_file, so your code will be cleaner and easier to maintain. A complete example follows the list.

Overwrite the contents of $log_file:
  exec > $log_file 2>&1
Append to $log_file:
  exec >> $log_file 2>&1
Overwrite the contents of $log_file and send output to the screen using tee:
  exec > >(tee $log_file) 2>&1
Append to the contents of $log_file and send output to the screen using tee:
  exec > >(tee -a $log_file) 2>&1
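
A minimal sketch showing one of these in a script (the log file path is just an example):

  #!/bin/bash
  log_file=/tmp/myscript.log

  # everything below this line goes to the screen and is appended to $log_file
  exec > >(tee -a "$log_file") 2>&1

  echo "starting"     # captured
  ls /no/such/dir     # stderr is captured too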

Number Format (see: http://stackoverflow.com/questions/9374868/number-formatting-in-bash-with-thousand-separator)

Print a value with commas as the thousands separator:
$ printf "%'.3f\n" 12345678.901
12,345,678.901

Process IDs

Process ID of the current script:
  $$
Process ID of the most recently backgrounded process (e.g., a script started with &); see the example below:
  $!
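
A small example showing both:

  #!/bin/bash
  sleep 30 &                          # put a command in the background
  echo "this script's pid: $$"
  echo "background process's pid: $!"
  wait                                # wait for the background job to finish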

Variables

Use variables for ranges:
  for x in $(eval echo "{$min..$max}");
A better way to use variables for ranges is seq (example below):
  for i in $(seq $min $max)
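
For example, with min and max set ahead of time:

  min=1
  max=5
  for i in $(seq $min $max); do
    echo "item $i"
  done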

Return Status (see http://www.unix.com/shell-programming-scripting/92163-command-does-not-return-exit-status-due-tee.html)

When using something like

script.sh 2>&1 | tee -a $FILE

Because the output is piped to tee, checking $? will give you the exit status of tee, not of the script. To get the script's exit status use the PIPESTATUS array. Example:

ret=${PIPESTATUS[0]}
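
Putting it together (script.sh and $FILE stand in for whatever you are running and logging to):

  ./script.sh 2>&1 | tee -a "$FILE"
  ret=${PIPESTATUS[0]}                # exit status of script.sh, not tee
  if [ "$ret" -ne 0 ]; then
    echo "script.sh failed with status $ret"
  fi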

Functions

When writing functions, getting a return value back to the caller may not be as simple as using echo (see: http://www.linuxjournal.com/content/return-values-bash-functions).

Write your function so that it accepts a variable name as part of its command line and then set that variable to the result of the function:

  function myfunc()
  {
    local  __resultvar=$1
    local  myresult='some value'
    eval $__resultvar="'$myresult'"
  }

  myfunc result
  echo $result

For more flexibility, you may want to write your functions so that they combine both result variables and command substitution:

  function myfunc()
  {
      local  __resultvar=$1
      local  myresult='some value'
      if [[ "$__resultvar" ]]; then
          eval $__resultvar="'$myresult'"
      else
          echo "$myresult"
      fi
  }

  myfunc result
  echo $result
  result2=$(myfunc)
  echo $result2

Here, if no variable name is passed to the function, the value is output to the standard output.

History

To access the last argument of the previous command line, use !$. Example:

$ ls -l /tmp/data.txt
-rw-r--r--  1 dan.brown  wheel  95 Nov 26 22:39 /tmp/data.txt
$ rm !$
rm /tmp/data.txt
$ ls -l /tmp/data.txt
ls: /tmp/data.txt: No such file or directory

To access the entire previous line, use !! Example:

$ which ls
/bin/ls
$ ls -l `!!`
ls -l `which ls`
-rwxr-xr-x  1 root  wheel  34736 Oct 31  2013 /bin/ls

To access other elements of the previous line, use “word designators” separated by a colon (zero-based).

$ echo a b c d e
a b c d e
$ echo !!:2
echo b
b

Ranges can also be accessed:

  $ echo a b c d e
  a b c d e
  $ echo !!:3-4
  c d

There are various shortcuts, such as ‘:$’ to refer to the last argument, ‘:^’ to refer to the first argument, and ‘:*’ to refer to all the arguments (a synonym for ‘:1-$’), among others. See the cheat sheet for a complete list.

Modifiers can be used to modify the behavior of a word designator (see: http://www.catonmat.net/blog/the-definitive-guide-to-bash-command-line-history/). For example:

  $ tar -xvzf software-1.0.tgz
  software-1.0/file
  ...
  $ cd !!:$:r
  software-1.0$

Here the “r” modifier was applied to a word designator which picked the last argument from the previous command line. The ‘r’ modifier removed the trailing suffix ‘.tgz’.

The “h” modifier removes the trailing pathname component, leaving the head:

  $ echo /usr/local/apache
  /usr/local/apache
  $ echo !!:$:h
  /usr/local

The “e” modifier removes all but the trailing suffix:

  $ ls -la /usr/src/software-4.2.messy-Extension
  ...
  $ echo /usr/src/*!!:$:e
  /usr/src/*.messy-Extension    # ls could have been used instead of echo

Another interesting modifier is the substitute ‘:s/old/new/’ modifier which substitutes new for old. It can be used in conjunction with ‘g’ modifier to do global substitution. For example,

  $ ls /urs/local/software-4.2 /urs/local/software-4.3
  /usr/bin/ls: /urs/local/software-4.2: No such file or directory
  /usr/bin/ls: /urs/local/software-4.3: No such file or directory
  $ !!:gs/urs/usr/
  ls /usr/local/software-4.2 /usr/local/software-4.3
  ls: /usr/local/software-4.2: No such file or directory
  ls: /usr/local/software-4.3: No such file or directory

This example replaces all occurrences of ‘urs’ with ‘usr’ and makes the command correct.

There are a few other modifiers, such as the ‘p’ modifier, which prints the resulting command after history expansion but does not execute it. See the cheat sheet for all of the modifiers.

Remote Copying

Copy from HDFS to a local laptop:

hadoop fs -cat /path/dir/file.txt | ssh user@laptop 'cat > /tmp/file.txt'

Copy from HDFS to HDFS on another machine:

hadoop fs -cat /path/dir/file.txt | ssh user@other-host 'hadoop fs -put - /data/file.txt'

AWK Scripts

I need to process text every day. At times it is useful to compute counts and sums, and I have found that AWK is the easiest way to do so. Say I have a data file that looks like this:

row col1 col2 col3
0 3500.352 1011.1 2356.4
1 292.2 3100.0 1997.99
2 1.354 2.3001 3354.2342523
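
To follow along, you can drop that sample into /tmp/data.txt, the path the examples below read from:

cat > /tmp/data.txt <<'EOF'
row col1 col2 col3
0 3500.352 1011.1 2356.4
1 292.2 3100.0 1997.99
2 1.354 2.3001 3354.2342523
EOF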

This AWK statement will print the sum, average, and number of samples for col1:

$ awk ' NR == 1  {next} { s += $2 } END {  print "sum: ", s, " average: ", s/(NR-1), " samples: ", NR-1 }' /tmp/data.txt

sum:  3793.91  average:  1264.64  samples:  3

To do the same thing but print out comma-delimited numbers:

$ awk ' NR == 1 {next} { s += $2 } END { printf("sum: %'\''d average: %'\''.2f samples: %'\''d\n", s, s/(NR-1), NR-1)}' /tmp/data.txt

sum: 3,793 average: 1,264.64 samples: 3

To print out the line that has the greatest number of characters:

awk '{ if (length($0) > max) {max = length($0); maxline = $0} } END { print maxline }' /tmp/data.txt

2 1.354 2.3001 3354.2342523

To print out the number of characters that appear in the line with the greatest number of characters:

awk '{ if (length($0) > max) max = length($0) } END { print max }' /tmp/data.txt

27

Installing A Single Instance of SolrCloud

After quite a bit of searching I can’t seem to find a simple example of setting up a single instance of SolrCloud on a local machine. I work with Solr every day. I do most of my development on my laptop and when everything is good I commit it and deploy. We recently hired someone who will be working with Solr and I wanted to get his laptop set up to run SolrCloud locally, too. However, I found it difficult to locate a document that I could just point him to. This is that document.

Note that I understand the installation described below is likely to be useful only when doing development. What’s the point of having a distributed SolrCloud when it’s only running on one machine?

See https://cwiki.apache.org/confluence/display/solr/SolrCloud+with+Legacy+Configuration+Files for a list of required SolrCloud configurations.

ZooKeeper

ZooKeeper (https://zookeeper.apache.org/) is a centralized service used to coordinate configuration information. You will be uploading SolrCloud's configuration files to ZooKeeper so that Solr knows where to find them.

SolrCloud comes with an embedded ZooKeeper. However, our production configuration uses ZooKeeper as a stand-alone system and I want to mimic production.

Installation

  • Download ZooKeeper from Apache’s site https://zookeeper.apache.org/
  • Extract the downloaded file.
  • Follow the steps outlined in the getting started guide https://zookeeper.apache.org/doc/r3.3.4/zookeeperStarted.html. Here are the basics (a command-line sketch follows this list). Be aware that this may change with future versions of ZooKeeper.
    • Copy ZOOKEEPER_DIR/conf/zoo_sample.cfg to ZOOKEEPER_DIR/conf/zoo.cfg
    • I changed the value of dataDir in ZOOKEEPER_DIR/conf/zoo.cfg to an existing empty directory
    • Start zookeeper: ZOOKEEPER_DIR/bin/zkServer.sh start
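
Roughly, the commands look like this (ZOOKEEPER_DIR is wherever you extracted the download; the dataDir location is just an example):

  cd "$ZOOKEEPER_DIR"
  cp conf/zoo_sample.cfg conf/zoo.cfg
  # edit conf/zoo.cfg and set dataDir to an existing empty directory, e.g.:
  mkdir -p /tmp/zookeeper-data
  bin/zkServer.sh start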

Verification

Verify that ZooKeeper is running:

ZOOKEEPER_DIR/bin/zkCli.sh -server 127.0.0.1:2181

You should see a command prompt that looks something like this:

[zk: 127.0.0.1:2181(CONNECTED) 0]

Enter quit to exit the client

[zk: 127.0.0.1:2181(CONNECTED) 0] quit

If you get a java.net.ConnectException: Connection refused error you know the server is not running.

ZooKeeper and Solr’s Configuration Files

Using the SOLR_DIR/example/scripts/cloud-scripts/zkcli.sh script, upload Solr's configuration files to ZooKeeper:

SOLR_DIR/example/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir SOLR_DIR/example/solr/my-collection/conf -confname my-collection-config

From this you should see a bunch of output including a list of the configuration files found in the directory pointed to by the “-confdir” flag.

Start SolrCloud

Start up an instance of SolrCloud:

SOLR_DIR/bin/solr start -z localhost:2181 -cloud

The “-cloud” flag tells Solr to start in a cloud configuration.  The “-z localhost:2181” flag tells Solr how to connect to ZooKeeper where it will find configuration information.

You may now look at the SolrCloud admin page found here: http://localhost:8983/solr/#/

Create New Solr Collection

So far we’ve uploaded a set of Solr configuration files to ZooKeeper and started an instance of SolrCloud.  Next we need to create a new Solr collection telling Solr how to find its configuration in ZooKeeper.

curl 'localhost:8983/solr/admin/collections?action=CREATE&numShards=1&name=my-collection&collection.configName=my-collection-config'

Notice how the value of “collection.configName” is the same as the name used in the “upconfig” command sent to ZooKeeper: my-collection-config. This tells Solr to use that name when asking ZooKeeper for this new collection's configuration.

The “numShards” parameter is required.  The documentation (https://cwiki.apache.org/confluence/display/solr/Collections+API) is a little confusing.  The table says it is not required but the description says otherwise.  I found that if I do not provide the “numShards” parameter the response from Solr is

org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: numShards is a required param

As such, I just set the value to 1 and everything works as expected.
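
To confirm the new collection is reachable you can query it; it will return zero documents until something is indexed:

  curl 'http://localhost:8983/solr/my-collection/select?q=*:*&wt=json'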

To make this process easier I created a few scripts which can be found here on github: https://github.com/likethecolor/solr-scripts

Using Hadoop to Create SOLR Indexes

One of the most challenging projects I faced at work recently was to create an Apache SOLR index consisting of approximately 15 million records. This index had been created once in the history of the company using a MySQL database and SOLR’s Data Import Handler (DIH). It had not been attempted since then because the original indexing process was time consuming (12-14 hours), required human supervision, and on failure had to be restarted from the very beginning.

Android apps on my phone

I was the first one at work to get an Android phone. As word got around I became the go-to guy when it comes to Android. I get asked questions about Android before people buy, and I also get questions after they've bought their new phone. One thing that people seem to appreciate is when I provide them with a list of apps to get them started.

Without further ado, here is a list of most of the apps I have on my phone. I’ve not included some of the apps (e.g., OEM pre-installed crap).