Advanced Query Operators |
Example Query | Description |
gbfieldmatch:strings.vendor:"My Vendor Inc." | Matches all the meta tag, JSON, or XML fields that have the name "strings.vendor" and contain exactly the provided value, in this case My Vendor Inc. The match is CASE SENSITIVE and includes punctuation, so it is an exact match. In general, the termlist should be very short, so the query should be fast. |
gburl:www.feline.com/page2 | Matches the page with that exact url. Uses the first url, not the url it redirects to, if any. |
url:http://www.feline.com/page2 | Same as gburl above. |
gbext:doc | Match documents whose url ends in the .doc file extension. |
ext:doc | Match documents whose url ends in the .doc file extension. |
gblink:www.gigablast.com/help.html | Matches all the documents that have a link to http://www.gigablast.com/help.html |
link:www.gigablast.com/help.html | Same as gblink: operator above. |
gbsitelink:www.gigablast.com | Matches all documents that link to any page on the www.gigablast.com site. |
sitelink:www.ibm.com | Matches all documents that link to any page on the www.ibm.com site. Same as gbsitelink: operator above. |
gbsite:mysite.com | Matches all documents on the mysite.com domain. |
gbsite:abc.mysite.com/dir1/ | Matches all documents whose url starts with abc.mysite.com/dir1/ |
site:abc.mysite.com/dir1/ | Same as gbsite: operator above. |
gbip:1.2.3.4 | Matches all documents whose IP is 1.2.3.4. |
gbip:1.2.3 | Matches all documents whose IP STARTS with 1.2.3. |
ip:1.2.3 | Same as gbip: operator above. |
gbinurl:page1 | Matches all documents that have the word page1 in their url. Only matches individual words in the url; the word must be delineated by punctuation. |
inurl:page1 | Same as gbinurl. |
suburl:page1 | Same as gbinurl. |
gbsubmiturl::www.gigablast.com/submit.php | Matches all documents that have a form submit to the specified url or their submit url contains the specified text as a substring. |
gbtitle:cat | Matches all the documents that have the word cat in their title. |
intitle:cat | Same as gbtitle operator above. |
intitle:"cat food" | Matches all the documents that have the phrase "cat food" in their title. |
gbinrss:1 | Matches all documents that are in RSS feeds. Likewise, use gbinrss:0 to match all documents that are NOT in RSS feeds. |
gbtype:json | Matches all documents that are in JSON format. Other possible types include html, text, xml, pdf, doc, xls, ppt, ps, css, json, status. status matches special documents that are stored every time a url is spidered so you can see all the spider attempts and when they occurred as well as the outcome. |
filetype:json | Same as gbtype: above. |
gbisadult:1 | Matches all documents that have been detected as adult documents and may be unsuitable for children. Likewise, use gbisadult:0 to match all documents that were NOT detected as adult documents. |
gbisdead:1 | Matches all documents that are dead but have other pages still linking to them, so that we were able to index some data for them, and provide a link to archive.org. Such pages can be excluded from the search using -gbisdead:1 which is probably more efficient than requiring gbisdead:0. |
gbimageurl:http://www.site.com/image.jpg | Matches all documents that contain the specified image. |
gbimageterm:1 | Used for doing image search. GB indexes one of these per good thumbnailed image, using proximity info. |
gbsitenuminlinks:0 | Matches all documents whose tag named * have the specified value in the tagdb entry for the url. Example: gbtagsitenuminlinks:2 matches all documents that have 2 qualified inlinks pointing to their site based on the tagdb record. You can also provide your own tags in addition to the tags already present. See the tagdb menu for more information. WARNING: the first urls indexed for a site will be missing this info because they only index the tags already in tagdb. |
gbcountryboost:uk | Give results whose primary country is the UK a boost in the search results. |
gbsortbyint:gbpagenuminlinks, gbpagenuminlinks:4 | The number of inlinks to the page. |
gbfacetboost:gbpagenuminlinks | Boost by this term but don't require it. |
gbcharset:windows-1252 | Matches all documents originally in the Windows-1252 charset. Available character sets are listed in the iana_charset.cpp file. There are a lot. Some more popular ones are: us, latin1, iso-8859-1, csascii, ascii, latin2, latin3, latin4, greek, utf-8, shift_jis. |
gblang:de | Matches all documents in German. The supported language abbreviations are at the bottom of the url filters page. Some more common ones are gblang:en, gblang:es, gblang:fr, gblang:"zh_cn" (note the quotes for zh_cn!). |
gbpathdepth:2 | Matches all documents whose url has 2 path components to it like http://somedomain.com/dir1/dir2/foo.html |
gbhopcount:2 | Matches all documents that are a minimum of two link hops away from the root url from which they were spidered. Injected documents always have a hopcount of 0, unless overridden with the &hopcount= parameter. |
gbhasfilename:1 | Matches all documents whose url ends in a filename like http://somedomain.com/dir1/myfile and not http://somedomain.com/dir1/dir2/. Likewise, use gbhasfilename:0 to match all the documents that do not have a filename in their url. |
gbiscgi:1 | Matches all documents that have a question mark in their url. Likewise gbiscgi:0 matches all documents that do not. |
gbhasext:1 | Matches all documents that have a file extension in their url. Likewise, gbhasext:0 matches all documents that do not have a file extension in their url. |
gbhasdate:1 | Matches all documents that have a date in the content somewhere. |
gbhaspubdate:1 | Matches all documents that have a pub date. |
gbminint:gbpubdate:1513836788 | Matches all documents that have a pub date after the specified UTC timestamp. |
gbminint:gbpubdate:1513836788 | Matches all documents that have a pub date after the specified UTC timestamp AND whose pub date included a time of day, at least the hour of the day, the minute is not required. |
gbhasemailaddress:1 | Matches all documents that have an email address in the content. |
gbparenturl:www.xyz.com/abc.html | Diffbot only. Matches the JSON urls that were extracted from this parent url. Example: gbparenturl:www.gigablast.com/addurl.htm |
gbcountry:us | Matches documents determined by Gigablast to be from the United States. Some more popular examples include: de, fr, uk, ca, cn. |
gbpermalink:1 | Matches documents that are permalinks. Use gbpermalink:0 to match documents that are NOT permalinks. |
gbcontentlen:0 | The length of the content after being encoded in UTF-8. |
gbdocid:243768044086 | Matches the document with the docid 243768044086 |
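The URL-shape operators above (gbpathdepth, gbhasfilename, gbiscgi, gbhasext) can be approximated offline to predict which documents a query would match. The Python sketch below is an illustration only: the function name url_properties is hypothetical, and the rules are inferred from the descriptions in this table, not taken from Gigablast's source.

```python
from urllib.parse import urlsplit


def url_properties(url: str) -> dict:
    """Approximate the values the URL-shape operators would test for.

    Assumption: path depth counts directory components only, so
    http://somedomain.com/dir1/dir2/foo.html has depth 2, as in the
    gbpathdepth example above.
    """
    path = urlsplit(url).path or "/"
    # Split "/dir1/dir2/foo.html" into "/dir1/dir2" and "foo.html".
    directory, _, filename = path.rpartition("/")
    dirs = [segment for segment in directory.split("/") if segment]
    return {
        "gbpathdepth": len(dirs),
        "gbhasfilename": int(bool(filename)),  # 0 when the url ends in "/"
        "gbiscgi": int("?" in url),            # question mark anywhere in url
        "gbhasext": int("." in filename),      # crude file-extension check
    }


if __name__ == "__main__":
    for example in (
        "http://somedomain.com/dir1/dir2/foo.html",
        "http://somedomain.com/dir1/dir2/",
        "http://somedomain.com/dir1/myfile",
    ):
        print(example, url_properties(example))
```

A url whose computed gbpathdepth is 2 would be expected to match the query gbpathdepth:2, and likewise for the 0/1 operators, subject to the assumptions stated above.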
Numeric Field Query Operators |
Example Query | Description |
cameras gbsortbyfloat:price | Sort all documents that contain 'camera' by price. price can be a root JSON field or in a meta tag, or in an xml <price> tag. |
cameras gbsortbyfloat:product.price | Sort all documents that contain 'camera' by price. price can be in a JSON document like { "product":{"price":1500.00}} or, alternatively, an XML document like <product><price>1500.00</price></product> |
cameras gbrevsortbyfloat:product.price | Like above example but sorted with highest prices on top. |
pilots gbsortbyint:employees | Sort all documents that contain 'pilots' by employees. employees can be a root JSON field or in a meta tag, or in an xml <employees> tag. The value it contains is interpreted as a 32-bit integer. |
gbsortbyint:gbdocspiderdate | Sort all documents by the date they were spidered/downloaded. |
gbsortbyint:company.employees | Sort all documents by employees. Documents can contain employees in a JSON document like { "company":{"employees":13}} or, alternatively, an XML document like <company><employees>13</employees></company> |
gbsortbyint:gbsitenuminlinks | Sort all documents by the number of distinct inlinks the document's site has. |
gbrevsortbyint:gbdocspiderdate | Sort all documents by the date they were spidered/downloaded but with the oldest on top. |
cameras gbminfloat:price:109.99 | Matches all documents that contain 'camera' or 'cameras' and have a price of at least 109.99. price can be a root JSON field or in a meta tag name price, or in an xml <price> tag. |
cameras gbminfloat:product.price:109.99 | Matches all documents that contain 'camera' or 'cameras' and have a price of at least 109.99 in a JSON document like { "product":{"price":1500.00}} or, alternatively, an XML document like <product><price>1500.00</price></product> |
cameras gbmaxfloat:price:109.99 | Like the gbminfloat examples above, but is an upper bound. |
gbequalfloat:product.price:1.23 | Similar to gbminfloat and gbmaxfloat but is an equality constraint. |
gbminint:gbspiderdate:1391749680 | Matches all documents with a spider timestamp of at least 1391749680. Use this as opposed to gbminfloat when you need 32 bits of integer precision. |
gbmaxint:company.employees:20 | Matches all companies with 20 or fewer employees in a JSON document like { "company":{"employees":13}} or, alternatively, an XML document like <company><employees>13</employees></company> |
gbequalint:company.employees:13 | Similar to gbminint and gbmaxint but is an equality constraint. |
gbhas:company.employees | Matches all documents that have the company.employees field. |
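The dotted field names used by the numeric operators (product.price, company.employees) address nested fields. As a rough illustration of the matching rules described in this table, the Python sketch below applies the lower-bound, upper-bound, and presence checks to parsed JSON documents. The helper names get_field, matches_min, matches_max, and has_field are hypothetical, and the code does not model meta tags, XML documents, or sorting.

```python
from typing import Any, Optional


def get_field(doc: dict, dotted_name: str) -> Optional[Any]:
    """Walk a nested dict using a dotted name such as "product.price"."""
    current: Any = doc
    for key in dotted_name.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def has_field(doc: dict, dotted_name: str) -> bool:
    """Rough analogue of gbhas:<field>."""
    return get_field(doc, dotted_name) is not None


def matches_min(doc: dict, dotted_name: str, bound: float) -> bool:
    """Rough analogue of gbminfloat/gbminint: value is at least the bound."""
    value = get_field(doc, dotted_name)
    return value is not None and float(value) >= bound


def matches_max(doc: dict, dotted_name: str, bound: float) -> bool:
    """Rough analogue of gbmaxfloat/gbmaxint: value is at most the bound."""
    value = get_field(doc, dotted_name)
    return value is not None and float(value) <= bound


if __name__ == "__main__":
    camera = {"product": {"price": 1500.00}}
    company = {"company": {"employees": 13}}
    print(matches_min(camera, "product.price", 109.99))   # lower bound check
    print(matches_max(company, "company.employees", 20))  # upper bound check
    print(has_field(camera, "company.employees"))         # presence check
```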
Facet Related Query Operators |
Example Query | Description |
gbfacetstr:color | Returns facets in the search results by their color field. color is case INsensitive. |
gbfacetstr:product.color | Returns facets in the color field in a JSON document like { "product":{"color":"red"}} or, alternatively, an XML document like <product><color>red</color></product>. product.color is case INsensitive. |
gbfacetstr:gbtagsite cat | Returns facets from the site names of all pages that contain the word 'cat' or 'cats', etc. gbtagsite is case insensitive. |
gbfacetint:product.cores | Returns facets of the cores field in a JSON document like { "product":{"cores":10}} or, alternatively, an XML document like <product><cores>10</cores></product>. product.cores is case INsensitive. |
gbfacetint:gbhopcount | Returns facets of the gbhopcount field over the documents so you can see the distribution of hopcounts over the index. gbhopcount is case INsensitive. |
gbfacetint:gbtagsitenuminlinks | Returns facets of the sitenuminlinks field for the tag sitenuminlinks in the tag for each site. Any numeric tag in tagdb can be faceted in this manner, so you can add your own facets on a per site or per url basis by making tagdb entries. Case INsensitive. |
gbfacetint:size,0-10,10-20,30-100 | Returns facets of the size field (either in a JSON field or a meta tag) and clusters the results into the specified ranges. size is case INsensitive. |
gbfacetint:gbsitenuminlinks | Returns facets based on # of site inlinks the site of each result has. gbsitenuminlinks is case INsensitive. |
gbfacetfloat:product.weight | Returns facets of the weight field in a JSON document like { "product":{"weight":1.45}} or, alternatively, an XML document like <product><weight>1.45</weight></product>. product.weight is case INsensitive. |
gbfacetfloat:product.price,0-1.5,1.5-5 | Similar to above but clusters the prices into the specified ranges. product.price is case INsensitive. |
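The range form of the facet operators (for example gbfacetint:size,0-10,10-20,30-100) clusters field values into the listed ranges. The Python sketch below shows one way to parse such an operator and count values per range. It is an illustration under assumptions: the function names are hypothetical, ranges are treated as half-open (lower bound included, upper bound excluded) because the table does not say how boundaries are handled, and negative bounds are not supported by this simple parser.

```python
from typing import Dict, List, Tuple

Range = Tuple[str, float, float]  # (label, low, high)


def parse_facet_operator(operator: str) -> Tuple[str, List[Range]]:
    """Split "gbfacetint:size,0-10,10-20" into a field name and its ranges."""
    _, _, spec = operator.partition(":")
    field, *range_specs = spec.split(",")
    ranges: List[Range] = []
    for range_spec in range_specs:
        low, high = range_spec.split("-")
        ranges.append((range_spec, float(low), float(high)))
    return field, ranges


def bucket_counts(values: List[float], ranges: List[Range]) -> Dict[str, int]:
    """Count how many values fall into each range.

    Assumption: low <= value < high. Values outside every range are ignored.
    """
    counts = {label: 0 for label, _, _ in ranges}
    for value in values:
        for label, low, high in ranges:
            if low <= value < high:
                counts[label] += 1
                break
    return counts


if __name__ == "__main__":
    field, ranges = parse_facet_operator("gbfacetint:size,0-10,10-20,30-100")
    print(field, bucket_counts([5, 10, 25, 30, 99, 100], ranges))
```

Note that the ranges need not be contiguous: in the example operator, values from 20 up to 30 fall into no range.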
Spider Status Documents |
Example Query | Description |
gbssUrl:com | Query the url of a spider status document. |
gbssFinalRedirectUrl:abc.com/page2.html | Query on the last url redirected to, if any. |
gbssStatusCode:0 | Query on the status code of the index attempt. 0 means no error. |
gbssStatusMsg:"Tcp timed" | Like gbssStatusCode but a textual representation. |
gbssHttpStatus:200 | Query on the HTTP status returned from the web server. |
gbssWasIndexed:0 | Was the document in the index before attempting to index? Use 0 or 1 to find all documents that were not or were, respectively. |
gbssIsDiffbotObject:1 | This field is only present if the document was an object from a diffbot reply. Use gbssIsDiffbotObject:0 to find the non-diffbot objects. |
gbsortby:gbssAgeInIndex | If the document was in the index at the time we attempted to reindex it, how long has it been since it was last indexed? |
gbssDomain:yahoo.com | Query on the domain of the url. |
gbssSubdomain:www.yahoo.com | Query on the subdomain of the url. |
gbfacetint:gbssNumRedirects | Query on the number of times the url redirected when attempting to index it. |
gbssDocId:1234567 | Show all the spider status docs for the document with this docId. |
gbfacetint:gbssHopCount | Query on the hop count of the document. |
gbfacetint:gbssCrawlRound | Query on the crawl round number. |
gbssDupOfDocId:123456 | Show all the documents that were considered dups of this docId. |
gbssPrevTotalNumIndexAttempts:1 | Before this index attempt, how many attempts were there? |
gbssPrevTotalNumIndexSuccesses:1 | Before this index attempt, how many successful attempts were there? |
gbssPrevTotalNumIndexFailures:1 | Before this index attempt, how many failed attempts were there? |
gbrevsortbyint:gbssFirstIndexed | The date in UTC that the document was first indexed. |
gbfacetint:gbssContentHash32 | The hash of the document content, excluding dates and times. Used internally for deduping. |
gbsortbyint:gbssDownloadDurationMS | How long it took in milliseconds to download the document. |
gbsortbyint:gbssDownloadStartTime | When the download started, in seconds since the epoch, UTC. |
gbsortbyint:gbssDownloadEndTime | When the download ended, in seconds since the epoch, UTC. |
gbfacetint:gbssUsedRobotsTxt | This is 0 or 1 depending on if robots.txt was not obeyed or obeyed, respectively. |
gbfacetint:gbssConsecutiveErrors | For the last set of indexing attempts how many were errors? |
gbssIp:1.2.3.4 | The IP address of the document being indexed. Is 0.0.0.0 if unknown. |
gbsortby:gbssIpLookupTimeMS | How long it took to look up the IP of the document. Might have been in the cache. |
gbsortby:gbssSiteNumInlinks | How many good inlinks the document's site had. |
gbsortby:gbssSiteRank | The site rank of the document. Based directly on the number of inlinks the site had. |
gbfacetint:gbssContentInjected | This is 0 or 1 if the content was not injected or injected, respectively. |
gbfacetfloat:gbssPercentContentChanged | A float between 0 and 100, inclusive. Represents how much the document has changed since the last time we indexed it. This is only valid if the document was successfully indexed this time. |
gbfacetint:gbssSpiderPriority | The spider priority, from 0 to 127, inclusive, of the document according to the url filters table. |
gbfacetstr:gbssMatchingUrlFilter | The url filter expression the document matched. |
gbfacetstr:gbssLanguage | The language of the document. If document was empty or not downloaded then this will not be present. Uses xx to mean unknown language. Uses the language abbreviations found at the bottom of the url filters page. |
gbfacetstr:gbssContentType | The content type of the document. Like html, xml, json, pdf, etc. This field is not present if unknown. |
gbsortbyint:gbssContentLen | The content length of the document. 0 if empty or not downloaded. |
gbfacetint:gbssCrawlDelay | The crawl delay according to the robots.txt of the document. This is -1 if not specified in the robots.txt or not found. |
gbssSentToDiffbotThisTime:1 | Was the document's url sent to diffbot for processing this time of spidering the url? |
gbssSentToDiffbotAtSomeTime:1 | Was the document's url sent to diffbot for processing, either this time or some time before? |
gbssDiffbotReplyCode:0 | The reply received from diffbot. 0 means success, otherwise, it indicates an error code. |
gbfacetstr:gbssDiffbotReplyMsg:0 | The reply received from diffbot represented in text. |
gbsortbyint:gbssDiffbotReplyLen | The length of the reply received from diffbot. |
gbsortbyint:gbssDiffbotReplyResponseTimeMS | The time in milliseconds it took to get a reply from diffbot. |
gbfacetint:gbssDiffbotReplyRetries | The number of times we had to resend the request to diffbot because diffbot returned a 504 gateway timed out error. |
gbfacetint:gbssDiffbotReplyNumObjects | The number of JSON objects diffbot excavated from the provided url. |
Boolean Queries |
Example Query | Description |
Note: boolean operators must be in UPPER CASE. |
cat AND dog | Search results have the word cat AND the word dog in them. |
cat OR dog | Search results have the word cat OR the word dog in them, but preference is given to results that have both words. |
cat dog OR pig | Search results have the two words cat and dog OR search results have the word pig, but preference is given to results that have all three words. This illustrates how the individual words of one operand are all required for that operand to be true. |
"cat dog" OR pig | Search results have the phrase "cat dog" in them OR they have the word pig, but preference is given to results that have both. |
intitle:"cat dog" OR pig | Search results have the phrase "cat dog" in their title OR they have the word pig, but preference is given to results that have both. |
cat OR dog OR pig | Search results need only have one word, cat or dog or pig, but preference is given to results that have the most of the words. |
cat OR dog AND pig | Search results have dog and pig, but they may or may not have cat. Preference is given to results that have all three. To evaluate expressions with more than two operands, as in this case where we have three, you can divide the expression up into sub-expressions that consist of only one operator each. In this case we would have the following two sub-expressions: cat OR dog and dog AND pig. Then, for the original expression to be true, at least one of the sub-expressions that have an OR operator must be true, and, in addition, all of the sub-expressions that have AND operators must be true. Using this logic you can evaluate expressions with more than one boolean operator. |
cat AND NOT dog | Search results have cat but do not have dog. |
cat AND NOT (dog OR pig) | Search results have cat but do not have dog and do not have pig. When evaluating a boolean expression that contains ()'s you can evaluate the sub-expression in the ()'s first. So if a document has dog or it has pig or it has both, then the expression, (dog OR pig) would be true. So you could, in this case, substitute true for that expression to get the following: cat AND NOT (true) = cat AND false = false. Does anyone actually read this far? |
(cat OR dog) AND NOT (cat AND dog) | Search results have cat or dog but not both. |
left-operand OPERATOR right-operand | This is the general format of a boolean expression. The possible operators are: OR and AND. The operands can themselves be boolean expressions and can be optionally enclosed in parentheses. A NOT operator can optionally precede the left or the right operand. |
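The evaluation rules described in the table above can be checked by hand with a few lines of Python. The sketch below hard-codes three of the example expressions as functions over the set of words in a document. The function names are hypothetical, and the code models only whether a document matches, not the ranking preference given to documents that contain more of the words. The third function follows the sub-expression rule stated for cat OR dog AND pig: at least one OR sub-expression must be true and every AND sub-expression must be true.

```python
from typing import Set


def cat_and_not_dog_or_pig(words: Set[str]) -> bool:
    """cat AND NOT (dog OR pig): evaluate the parenthesized part first."""
    inner = "dog" in words or "pig" in words
    return "cat" in words and not inner


def cat_or_dog_but_not_both(words: Set[str]) -> bool:
    """(cat OR dog) AND NOT (cat AND dog)."""
    either = "cat" in words or "dog" in words
    both = "cat" in words and "dog" in words
    return either and not both


def cat_or_dog_and_pig(words: Set[str]) -> bool:
    """cat OR dog AND pig, split into "cat OR dog" and "dog AND pig"."""
    or_subexpressions = ["cat" in words or "dog" in words]
    and_subexpressions = ["dog" in words and "pig" in words]
    return any(or_subexpressions) and all(and_subexpressions)


if __name__ == "__main__":
    for doc in ({"cat"}, {"cat", "pig"}, {"dog", "pig"}, {"cat", "dog"}):
        print(
            sorted(doc),
            cat_and_not_dog_or_pig(doc),
            cat_or_dog_but_not_both(doc),
            cat_or_dog_and_pig(doc),
        )
```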