Gigablast - API







by Matt Wells

NOTE: All APIs support both the GET and POST methods. If your request is larger than 2K, you should use POST.
NOTE: All APIs support both http and https protocols.
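
The two notes above suggest a simple client-side rule: send small requests with GET and switch to POST once the encoded parameters grow past roughly 2K. The following is a minimal Python sketch of that rule, not part of the API itself. The host `http://localhost:8000`, the collection name `main`, the helper name `build_request`, and reading "2K" as 2048 bytes of url-encoded parameters are all assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: "2K" is read as 2048 bytes of url-encoded parameters.
MAX_GET_BYTES = 2048


def build_request(base_url, path, params):
    """Return a GET Request for small payloads and a POST Request for large ones."""
    encoded = urlencode(params)
    url = base_url.rstrip("/") + path
    if len(encoded.encode("utf-8")) > MAX_GET_BYTES:
        # Large request: send the parameters in the body instead of the URL.
        return Request(
            url,
            data=encoded.encode("utf-8"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )
    return Request(url + "?" + encoded, method="GET")


# Hypothetical host and collection name, used only to show the call shape.
small = build_request("http://localhost:8000", "/search",
                      {"q": "test", "c": "main", "format": "json"})
large = build_request("http://localhost:8000", "/search",
                      {"q": "x " * 2000, "c": "main", "format": "json"})
print(small.get_method(), small.full_url)
print(large.get_method(), large.full_url)
```

Nothing is sent over the network here; the returned `Request` objects could be passed to `urllib.request.urlopen` against a real server, over either http or https.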
API by pages
/search

Input
| # | Parm | Type | Title | Default Value | Description |
|---|------|------|-------|---------------|-------------|
| 1 | format | STRING | output format | html | Display output in this format. Can be html, json or xml. |
| 2 | q | STRING | query | | The query to perform. See help. See the query operators below for more info. REQUIRED |
| 3 | c | STRING | collection | | Search this collection. Use multiple collection names separated by whitespace to search multiple collections at once. REQUIRED |
| 4 | n | INT32 | number of results per query | 10 | The number of results returned. If you want more than 1000 results you must use &stream=1 so Gigablast does not run out of memory. Search feed customers are typically limited to 10 results per query, so additional queries must be conducted to receive more results. |
| 5 | searchtype | STRING | searchtype | | Set to news or images to search for those respective entities. |
| 6 | s | INT32 | first result num | 0 | Start displaying at search result #X. Starts at 0. If you want more than 1000 results in total, you must use &stream=1 so Gigablast does not run out of memory. |
| 7 | uip | STRING | user ip | | The ip address of the searcher. It can be passed back for use by the autoban technology, which bans abusive IPs. Required for maintaining statistics for ads if search results are in a JSON or XML feed where the searcher IP is not directly provided by the connecting socket. |
| 8 | showerrors | BOOL (0 or 1) | show errors | 0 | Show errors from generating search result summaries rather than just hide the docid. Useful for debugging. |
| 9 | showanomalies | BOOL (0 or 1) | show anomalies | 0 | Show search results that only contain the query terms in some anomalous link texts. |
| 10 | showurlasname | BOOL (0 or 1) | show url as name | 0 | Show the website name instead of the url itself in the search results. Used by news search. |
| 11 | showimages | BOOL (0 or 1) | show images | 0 | Should we return or show the thumbnail images in the search results? |
| 12 | showgoodimages | BOOL (0 or 1) | show images if good match | 1 | Should we return or show the thumbnail images in the search results if they are close to all the search terms? |
| 13 | sc | BOOL (0 or 1) | site clustering | 0 | Should search results be site clustered? This limits each site to appearing at most twice in the search results. Sites are subdomains for the most part, like abc.xyz.com. |
| 14 | tc | CHAR | topic clustering | 0 | Should search results be clustered by topic? Used by news search. |
| 15 | ppt | CHAR | promote popular topics | 0 | If topic clustering is enabled, should we promote highly clustered results to the top? A form of re-ranking. |
| 16 | hacr | BOOL (0 or 1) | hide all clustered results | 0 | Only display at most one result per site. |
| 17 | dr | BOOL (0 or 1) | dedup results | 0 | Should similar search results be removed? |
| 18 | qe | INT32 | query expansion level | variable | 0 means none. 1 means to expand terms to basic word endings and 2 means to do basic word endings plus synonym expansion. If you are doing 0 result queries for spell checking using &n=0&spell=1 then you will need to specify &qe=1 or &qe=2 in order to get synonyms. |
| 19 | spell | BOOL (0 or 1) | do spell checking | variable | If enabled, when Gigablast finds a spelling recommendation it will be included in the XML <spellingSuggestion> tag or JSON "spellingSuggestion" field. Default is 0 if using an XML or JSON feed, 1 otherwise. |
| 20 | autospell | BOOL (0 or 1) | auto correct spelling | variable | If enabled, when Gigablast is CONFIDENT of a spelling recommendation it will automatically re-perform the query with the recommended spelling. Spell checking must be enabled for this to work. This is a default value and can be overridden directly with the autospell parm in each individual http request. |
| 21 | relqueries | BOOL (0 or 1) | show related queries | 0 | Offer related queries at bottom of search results. |
| 22 | dmoz | BOOL (0 or 1) | display dmoz categories in results | 0 | If enabled, results in dmoz will display their categories on the results page. The url itself must be explicitly in the DMOZ category. |
| 23 | idmoz | BOOL (0 or 1) | display indirect dmoz categories in results | 0 | If enabled, results in dmoz will display their indirect categories on the results page. That is, the categories of which their root url is a member. |
| 24 | stream | CHAR | stream search results | 0 | Stream search results back on socket as they arrive. Useful when thousands/millions of search results are requested. Required in such cases, otherwise Gigablast could run out of memory. Only supported for JSON and XML formats, not HTML. You must use this if you want more than 1000 results. |
| 25 | secsback | INT32 | seconds back | 0 | Limit to results with pub dates from this many seconds ago. Use 0 to disable. |
| 26 | usetime | INT64 | use time | 0 | Use this provided UTC timestamp rather than the current time for secsback or for news search. Helps with debugging. 0 means to ignore it. |
| 27 | filetype | STRING | filetype | | Restrict results to this filetype. Supported filetypes are pdf, doc, html, xml, json, xls. |
| 28 | facet | STRING | facet query term | | A query term that is prepended to the query. For example, &facet=gbfacetint%3Atype for document type facets. |
| 29 | fast | BOOL (0 or 1) | fast results | 0 | Sacrifice some quality and result filtering for the sake of speed. |
| 30 | nf | INT32 | max number of facets to return | 50 | Max number of facets to return. |
| 31 | qlangcountry | STRING | country lang preference | | Use the specified country and language. Example: en-us or en-uk or de-de, etc. |
| 32 | qcountry | STRING | sort country preference | us | Default country to use for ranking results. Value should be any country code abbreviation, for example "us" for United States. |
| 33 | qlang | STRING | sort language preference | en | Default language to use for ranking results. Value should be any language abbreviation, for example "en" for English. Use xx to give ranking boosts to no language in particular. See the language abbreviations at the bottom of the url filters page. |
| 34 | langw | FLOAT32 | language weight | variable | Use this to override the default language weight for this collection. The default language weight can be set in the search controls and is usually something like 40.0, which means that we multiply a result's score by 40.0 if it is from the same language as the query or its language is unknown. |
| 35 | onlylang | STRING | language restrictions | | All documents returned will be in this language. |
| 36 | ns | INT32 | number of summary excerpts | variable | How many summary excerpts to display per search result? |
| 37 | link | STRING | restrict search to pages that link to this url | | The url which the pages must link to. From the advanced search page. All returned results will link to the specified url. Example: &link=http://www.foo.com/ |
| 38 | sites | STRING | restrict results to these sites | | Returned results will have URLs from this space-separated list of sites. Can have up to 200 sites. A site can include sub folders. This allows you to build a Custom Topic Search Engine. Example: ?q=test&sites=foo.com+bar.com+baz.com |
| 39 | docids | STRING | restrict results to these docids | | Returned results will be from this space-separated list of docIds. Can have up to 200. This is used for the View Full Coverage link for news search, among other things. Example: ?q=test&docids=12345678+9877665432 |
| 40 | ff | BOOL (0 or 1) | family filter | 0 | Remove objectionable results if this is enabled. |
| 41 | qh | BOOL (0 or 1) | highlight query terms in summaries | 1 | Use to disable or enable highlighting of the query terms in the summaries. |
| 42 | hq | STRING | cached page highlight query | | Highlight the terms in this query instead. |
| 43 | showcached | BOOL (0 or 1) | show cached links | 1 | Show cached links next to each result? For HTML output only, of course. |
| 44 | bq | INT32 | boolean status | 2 | Can be 0 or 1 or 2. 0 means the query is NOT boolean, 1 means the query is boolean and 2 means to auto-detect. |
| 45 | dt | STRING | meta tags to display | | A space-separated string of meta tag names. Do not forget to url-encode the spaces to +'s or %20's. Gigablast will extract the contents of these specified meta tags out of the pages listed in the search results and display that content after each summary. For example, &dt=description will display the meta description of each search result. &dt=description:32+keywords:64 will display the meta description and meta keywords of each search result and limit the fields to 32 and 64 characters respectively. When used in an XML feed the <display name="meta_tag_name">meta_tag_content</> XML tag will be used to convey each requested meta tag's content. |
| 46 | rdc | BOOL (0 or 1) | return number of docs per topic | 1 | Use 1 if you want Gigablast to return the number of documents in the search results that contained each topic (gigabit). |
| 47 | rd | BOOL (0 or 1) | return docids per topic | 0 | Use 1 if you want Gigablast to return the list of docIds from the search results that contained each topic (gigabit). |
| 48 | dio | BOOL (0 or 1) | return docids only | 0 | Set to 1 to return only docids as query results. |
| 49 | prepend | STRING | prepend | | Prepend this to the supplied query, followed by a vertical bar (pipe) character. |
| 50 | sb | BOOL (0 or 1) | show banned pages | 0 | Show banned pages. |
| 51 | icc | INT32 | include cached copy of page | 0 | Return a cached copy of the content instead of the summary. |
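
As an illustration of how the parms above combine, here is a minimal Python sketch that builds a /search URL from the required parms (q, c) plus the paging parms (s, n). The helper name `build_search_url`, the host `http://localhost:8000` and the collection name `main` are hypothetical. Treating "more than 1000 results in total" as `s + n > 1000` is an assumption about how the limit is counted, not something the table states precisely.

```python
from urllib.parse import urlencode

# Per the table: more than 1000 results in total requires &stream=1.
# Assumption: "in total" is interpreted here as start + count.
STREAM_THRESHOLD = 1000


def build_search_url(base_url, query, collection, start=0, count=10,
                     fmt="json", **extra):
    """Build a /search URL from q, c, format, s and n, plus any extra parms."""
    params = {"q": query, "c": collection, "format": fmt,
              "s": start, "n": count}
    if start + count > STREAM_THRESHOLD:
        if fmt == "html":
            # Streaming is documented as JSON/XML only.
            raise ValueError("stream=1 is only supported for json and xml")
        params["stream"] = 1
    params.update(extra)
    # urlencode turns spaces into '+', which matches the sites/docids examples.
    return base_url.rstrip("/") + "/search?" + urlencode(params)


# Hypothetical host and collection, used only to show the resulting URLs.
page2 = build_search_url("http://localhost:8000", "test", "main",
                         start=10, count=10)
custom = build_search_url("http://localhost:8000", "test", "main",
                          sites="foo.com bar.com baz.com")
big = build_search_url("http://localhost:8000", "test", "main", count=5000)
print(page2)
print(custom)
print(big)
```

Because search feed customers are typically limited to 10 results per query, fetching further pages means calling the helper again with `start` advanced by `count`.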

Example XML Output (&format=xml)

<response>
	<statusCode>0</statusCode>
	<statusMsg>Success</statusMsg>
	<currentTimeUTC>1404513734</currentTimeUTC>
	<responseTimeMS>284</responseTimeMS>
	<docsInCollection>226</docsInCollection>
	<hits>193</hits>
	<moreResultsFollow>1</moreResultsFollow>
	<result>
		<imageBase64>/9j/4AAQSkZJRgABAQAAAQABA...</imageBase64>
		<imageHeight>350</imageHeight>
		<imageWidth>223</imageWidth>
		<origImageHeight>470</origImageHeight>
		<origImageWidth>300</origImageWidth>
		<title><![CDATA[U.S....]]></title>
		<sum>Department of the Interior protects America's natural resources and</sum>
		<url><![CDATA[www.doi.gov]]></url>
		<size>  64k</size>
		<docId>34111603247</docId>
		<site>www.doi.gov</site>
		<spidered>1404512549</spidered>
		<firstIndexedDateUTC>1404512549</firstIndexedDateUTC>
		<contentHash32>2680492249</contentHash32>
		<language>English</language>
	</result>
</response>
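
A client would typically check statusCode first and then walk the result elements. The sketch below does this with Python's standard `xml.etree.ElementTree`, using a trimmed copy of the example above as input (the thumbnail fields are left out). The function name `parse_search_xml` and the shape of the returned dictionary are illustrative choices, not part of the API.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example XML output above, without the image fields.
SAMPLE_XML = """<response>
  <statusCode>0</statusCode>
  <statusMsg>Success</statusMsg>
  <hits>193</hits>
  <moreResultsFollow>1</moreResultsFollow>
  <result>
    <title><![CDATA[U.S....]]></title>
    <url><![CDATA[www.doi.gov]]></url>
    <size>  64k</size>
    <docId>34111603247</docId>
    <site>www.doi.gov</site>
    <language>English</language>
  </result>
</response>"""


def parse_search_xml(text):
    """Parse a /search XML response into a plain dictionary."""
    root = ET.fromstring(text)
    status = int(root.findtext("statusCode"))
    if status != 0:
        # A non-zero statusCode indicates an error; statusMsg describes it.
        raise RuntimeError(f"search failed ({status}): {root.findtext('statusMsg')}")
    results = []
    for node in root.findall("result"):
        results.append({
            "title": node.findtext("title", default=""),
            "url": node.findtext("url", default=""),
            "size": node.findtext("size", default="").strip(),
            "docId": int(node.findtext("docId")),
        })
    return {
        "hits": int(root.findtext("hits")),
        "more": root.findtext("moreResultsFollow") == "1",
        "results": results,
    }


parsed = parse_search_xml(SAMPLE_XML)
print(parsed["hits"], parsed["results"][0]["url"])
```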

Example JSON Output (&format=json)

{

	# This is zero on a successful query. 
	# Otherwise it will be a non-zero number 
	# indicating the error code.
	"statusCode":0,

	# Similar to above, this is "Success" 
	# on a successful query. Otherwise it 
	# will indicate an error message 
	# corresponding to the statusCode above.
	"statusMsg":"Success",

	# This is the current time in UTC in 
	# unix timestamp format (seconds since 
	# the epoch) that the server has when 
	# generating this JSON response.
	"currentTimeUTC":1404588231,

	# This is how long it took in 
	# milliseconds to generate the JSON 
	# response from reception of the request.
	"responseTimeMS":312,

	# This is how many matches were 
	# excluded from the search results 
	# because they were considered 
	# duplicates, banned, had errors 
	# generating the summary, or were from an 
	# over-represented site. To show them use 
	# the &sc &dr &pss &sb and &showerrors 
	# input parameters described above.
	"numResultsOmitted":3,

	# This is how many shards failed to 
	# return results. Gigablast gets results 
	# from multiple shards (computers) and 
	# merges them to get the final result 
	# set. Sometimes a shard is down or 
	# malfunctioning so it will not 
	# contribute to the results. If this 
	# number is non-zero then you had such a 
	# shard.
	"numShardsSkipped":0,

	# This is how many shards are ideally 
	# in use by Gigablast to generate search 
	# results.
	"totalShards":159,

	# This is how many total documents are 
	# in the collection being searched.
	"docsInCollection":226,

	# This is how many of those documents 
	# matched the query.
	"hits":193,

	# This is 1 if more search results are 
	# available, otherwise it is 0.
	"moreResultsFollow":1,

	# Start of query-based information.
	"queryInfo":{

		# The entire query that was received, 
		# represented as a single string.
		"fullQuery":"test",

		# The original query before it was 
		# autocorrected by the spell checker.
		"origQuery":"teest",

		# Did gigablast perform query 
		# expansion on the query? i.e. inserting 
		# synonyms. This can be 0 for no or 1 for 
		# yes.
		"didQueryExpansion":1,

		# Did gigablast perform spell 
		# checking on the query? When doing spell 
		# checking it takes slightly more time. 
		# You can force it off with &spell=0 or 
		# force it on with &spell=1.
		"checkedQuerySpelling":1,

		# The query as it should be according 
		# to the spell checker.
		"spellingSuggestion":"test",

		# Like above but the words that were 
		# replaced are in bold.
		"spellingSuggestionBolded":"test",

		# Confident of spelling suggestion? 
		# Is Gigablast confident that its 
		# spelling suggestion is really what the 
		# searcher meant?
		"confidentOfSpellingSuggestion":1,

		# Did Gigablast automatically fix the 
		# misspelled query and re-execute the 
		# search with the new query?
		"queryWasRespelled":1,

		# The language of the query. This is 
		# the 'preferred' language of the search 
		# results. It is reflecting the &qlang 
		# input parameter described above. Search 
		# results in this language (or an unknown 
		# language) will receive a large boost. 
		# The boost is multiplicative. The 
		# default boost size can be overridden 
		# using the &langw input parameter 
		# described above. This language 
		# abbreviation here is usually 2 letter, 
		# but can be more, like in the case of 
		# zh-cn, for example.
		"queryLanguageAbbr":"en",

		# The language of the query. Just 
		# like above but the language is spelled 
		# out. It may be multiple words.
		"queryLanguage":"English",

		# List of space separated words in 
		# the query that were ignored for the 
		# most part. Because they were common 
		# words for the query language they are 
		# in.
		"ignoredWords":"to the",

		# There is a maximum limit placed on 
		# the number of query terms we search on 
		# to keep things fast. This can be 
		# changed in the search controls.
		"queryNumCandidateTermsTotal":52,
		"queryNumTermsUsed":20,
		"queryWasTruncated":1,

		# The start of the terms array. Each 
		# query is broken down into a list of 
		# terms. Each term is described here.
		"terms":[

			# The first query term in the JSON 
			# terms array.
			{

			# The term number, starting at 0.
			"termNum":0,

			# The term as a string.
			"termStr":"test",

			# Is this query term ignored when 
			# computing the search results? Popular 
			# words like the, and, to, etc. are 
			# typically ignored to make the query 
			# faster.
			"isIgnored":0,

			# Is this query term required when 
			# computing the search results? Synonyms 
			# of original query terms are not 
			# considered required, per se.
			"isRequired":1,

			# Is this query term a base term? 
			# Basically, synonyms of query terms are 
			# not the base term.
			"isBaseTerm":1,

			# The term frequency. An estimate of 
			# how many pages in the collection 
			# contain the term. Helps us weight terms 
			# by popularity when scoring the results.
			"termFreq":425239458,

			# A 48-bit hash of the term. Used to 
			# represent the term in the index.
			"termHash48":67259736306430,

			# A 64-bit hash of the term.
			"termHash64":9448336835959712000,

			# If the term has a field, like the 
			# term title:cat, then what is the hash 
			# of the field. In this example it would 
			# be the hash of 'title'. But for the 
			# query 'test' there is no field so it is 
			# 0.
			"prefixHash64":0

			},

			# The second query term in the JSON 
			# terms array.
			{

			"termNum":1,
			"termStr":"tested",

			# The language the term is from, in 
			# the case of query expansion on the 
			# original query term. Gigablast tries to 
			# find multiple forms of the word that 
			# have the same essential meaning. It 
			# uses the specified query language 
			# (&qlang), however, if a query term is 
			# from a different language, then that 
			# language will be implied for query 
			# expansion.
			"termLang":"en",

			# The query term that this term is a 
			# form of.
			"synonymOf":"test",

			# If a synonym, what kind of synonym 
			# is it?
			"synonymType":"form",

			# If a synonym, what language is it 
			# in?
			"synonymLang":"en",

			"termFreq":73338909,
			"termHash48":66292713121321,
			"termHash64":9448336835959712000,
			"prefixHash64":0
			},

			...

		# End of the JSON terms array.
		]

	# End of the queryInfo JSON structure.
	},

	# The start of the gigabits array. 
	# Each gigabit is mined from the content 
	# of the search results. The top N 
	# results are mined, and you can control 
	# N with the &dsrt input parameter 
	# described above.
	"gigabits":[

		# The first gigabit in the array.
		{

		# The gigabit as a string in utf8.
		"term":"Membership",

		# The numeric score of the gigabit.
		"score":240,

		# The popularity ranking of the 
		# gigabit. Out of 10000 random documents, 
		# how many documents contain it?
		"minPop":480,

		# The gigabit in the context of a 
		# document.
		"instance":{

			# A sentence, if it exists, from one 
			# of the search results which also 
			# contains the gigabit and as many 
			# significant query terms as possible. In 
			# UTF-8.
			"sentence":"Get a free Tested Premium Membership here!",

			# The url that contained that 
			# sentence. Always starts with http.
			"url":"http://www.tested.com/",

			# The domain of that url.
			"domain":"tested.com"
		}

		# End of the first gigabit
		},

		...

	# End of the JSON gigabits array.
	],

	# Start of the facets array, if any.
	"facets":[

		# The first facet in the array.
		{
			# The field you are faceting over
			"field":"Company",

			# How many documents in the 
			# collection had this particular field? 
			# 64-bit integer.
			"totalDocsWithField":148553,

			# How many documents in the 
			# collection had this particular field 
			# with the same value as the value line 
			# directly below? This should always be 
			# less than or equal to the 
			# totalDocsWithField count. 64-bit 
			# integer.
			"totalDocsWithFieldAndValue":44184,

			# The value of the field in the case 
			# of this facet. Can be a string or an 
			# integer or a float, depending on the 
			# type described in the gbfacet query 
			# term. i.e. gbfacetstr, gbfacetint or 
			# gbfacetfloat.
			"value":"Widgets, Inc.",

			# Should be the same as 
			# totalDocsWithFieldAndValue, above. 
			# 64-bit integer.
			"docCount":44184

		# End of the first facet in the array.
		}

	# End of the facets array.
	],

	# If a search ad matches the query it 
	# will be here.
	"ad": {
		# The unique 32-bit id of the ad.
		"id32": 0,

		# The title of the ad.
		"title": "My Ad Title",

		# The text of the ad.
		"text": "My Ad Text",

		# The link of the ad.
		"link": "http://www.foo.com",

		# Fetch this link when the user 
		# clicks an ad so we count the click. 
		# Append &uip=<userIpAddress> to 
		# this link to store the searcher's IP 
		# address. It will return an empty 
		# document as the response.
		"registerClickLink":"/adclick?redir=0&adid=0&q=test+query"

	# End of the ad.
	},

	# Start of the JSON array of 
	# individual search results.
	"results":[

		# The first result in the array.
		{

		# The title of the result. In UTF-8.
		"title":"This is the title.",

		# A DMOZ entry. One result can have 
		# multiple DMOZ entries.
		"dmozEntry":{

			# The DMOZ category ID.
			"dmozCatId":374449,

			# The DMOZ direct category ID.
			"directCatId":1,

			# The DMOZ category as a UTF-8 
			# string.
			"dmozCatStr":"Top: Computers: Security: Malicious Software: Viruses: Detection and Removal Tools: Reviews",

			# What title some DMOZ editor gave 
			# to this url.
			"dmozTitle":"The DMOZ Title",

			# What summary some DMOZ editor gave 
			# to this url.
			"dmozSum":"A great web page.",

			# The DMOZ anchor text, if any.
			"dmozAnchor":""

		# End DMOZ entry.
		},

		# The content type of the url. Can be 
		# html, pdf, text, xml, json, doc, xls or 
		# ps.
		"contentType":"html",

		# The summary excerpt of the result. 
		# In UTF-8.
		"sum":"Department of the Interior protects America's natural resources.",

		# The url of the result. If it starts 
		# with http:// then that is omitted. Also 
		# omits the trailing / if the url is 
		# just a domain or subdomain on the root 
		# path.
		"url":"www.doi.gov",

		# The hopcount of the url. The 
		# minimum number of links we would have 
		# to click to get to it from a root url. 
		# If this is 0 that means the url is a 
		# root url, like http://www.root.com/.
		"hopCount":0,

		# The size of the result's content. 
		# Always in kilobytes. k stands for 
		# kilobytes. Could be a floating point 
		# number or an integer.
		"size":"  64k",

		# The exact size of the result's 
		# content in bytes.
		"sizeInBytes":64560,

		# The unique document identifier of 
		# the result. Used for getting the cached 
		# content of the url.
		"docId":34111603247,

		# The site the result comes from. 
		# Usually a subdomain, but can also 
		# include part of the URL path, like, 
		# abc.com/users/brad/. A site is a set of 
		# web pages controlled by the same 
		# entity.
		"site":"www.doi.gov",

		# The time the url was last INDEXED. 
		# If there was an error or the url's 
		# content was unchanged since last 
		# download, then this time will remain 
		# unchanged because the document is not 
		# reindexed in those cases. Time is in 
		# unix timestamp format and is in UTC.
		"spidered":1404512549,

		# The first time the url was 
		# successfully INDEXED. Time is in unix 
		# timestamp format and is in UTC.
		"firstIndexedDateUTC":1404512549,

		# A 32-bit hash of the url's content. 
		# It is used to determine if the content 
		# changes the next time we download it.
		"contentHash32":2680492249,

		# Does the result contain adult 
		# words? 0 for no, 1 for yes.
		"isAdult":0,

		# The dominant language that the 
		# url's content is in. The language name 
		# is spelled out in its entirety.
		"language":"English",

		# A convenient abbreviation of the 
		# above language. Most are two 
		# characters, but some, like zh-cn, are 
		# more.
		"langAbbr":"en",

		# If the result has an associated 
		# image then the image thumbnail is 
		# encoded in base64 format here. It is a 
		# jpg image.
		"imageBase64":"/9j/4AAQSkZJR...",

		# If the result has an associated 
		# image then what is its height and width 
		# of the above jpg thumbnail image in 
		# pixels?
		"imageHeight":223,
		"imageWidth":350,

		# If the result has an associated 
		# image then what are the dimensions of 
		# the original image in pixels?
		"origImageHeight":300,
		"origImageWidth":470

		# End of the first result.
		},

		...

	# End of the JSON results array.
	]

# End of the response.
}
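
The fields described above lend themselves to a few routine checks on the client side: whether the query succeeded, whether any shard was skipped, and whether the query was silently respelled. The following Python sketch shows one way to read them. It assumes the live feed is plain JSON without the explanatory # comments shown in the annotated example. The sample payload is a shortened stand-in built from the example values, and the function name `read_search_response` is hypothetical.

```python
import json

# Shortened stand-in for a real response, using values from the example above.
# Assumption: the actual feed carries no '#' comment lines.
SAMPLE_JSON = """{
  "statusCode": 0,
  "statusMsg": "Success",
  "numShardsSkipped": 0,
  "totalShards": 159,
  "hits": 193,
  "moreResultsFollow": 1,
  "queryInfo": {
    "fullQuery": "test",
    "origQuery": "teest",
    "queryWasRespelled": 1
  },
  "results": [
    {"title": "This is the title.", "url": "www.doi.gov",
     "docId": 34111603247, "langAbbr": "en"}
  ]
}"""


def read_search_response(text):
    """Check status fields and pull out the parts a client usually needs."""
    data = json.loads(text)
    if data["statusCode"] != 0:
        raise RuntimeError(
            f"search failed ({data['statusCode']}): {data.get('statusMsg', '')}")
    info = data.get("queryInfo", {})
    return {
        # False when at least one shard did not contribute results.
        "complete": data.get("numShardsSkipped", 0) == 0,
        # The query as typed, if Gigablast re-ran it with a corrected spelling.
        "respelled_from": info.get("origQuery") if info.get("queryWasRespelled") else None,
        "more": bool(data.get("moreResultsFollow", 0)),
        "urls": [r["url"] for r in data.get("results", [])],
    }


summary = read_search_response(SAMPLE_JSON)
print(summary)
```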




/get

Input
| # | Parm | Type | Title | Default Value | Description |
|---|------|------|-------|---------------|-------------|
| 1 | format | STRING | output format | html | Display output in this format. Can be html, json or xml. |
| 2 | d | INT64 | docId | 0 | The docid of the cached page to view. REQUIRED |
| 3 | url | STRING | url | | Instead of specifying a docid, you can get the cached webpage by url as well. REQUIRED |
| 4 | c | STRING | collection | | Get the cached page from this collection. REQUIRED |
| 5 | strip | INT32 | strip | 0 | Set to 1 or 2 to strip various tags from the cached content. |
| 6 | ih | BOOL (0 or 1) | include header | 1 | Set to 1 to include the Gigablast header at the top of the cached page, 0 to exclude the header. |
| 7 | q | STRING | query | | Highlight this query in the page. |
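
A cached page can be requested either by docid (d) or by url, together with the collection (c). The Python sketch below builds such a /get URL. The helper name `build_get_url`, the host `http://localhost:8000` and the collection name `main` are hypothetical. Rejecting a call that passes both d and url is a choice made for this sketch, since the table does not say how the server treats that case.

```python
from urllib.parse import urlencode


def build_get_url(base_url, collection, docid=None, url=None, fmt="json",
                  strip=0, include_header=True, query=None):
    """Build a /get URL for a cached page, addressed by docid or by url."""
    # Assumption: exactly one of docid/url is supplied by the caller.
    if (docid is None) == (url is None):
        raise ValueError("pass exactly one of docid or url")
    params = {"format": fmt, "c": collection}
    if docid is not None:
        params["d"] = docid
    else:
        params["url"] = url
    params["strip"] = strip
    params["ih"] = 1 if include_header else 0
    if query:
        # Terms of this query are highlighted in the returned page.
        params["q"] = query
    return base_url.rstrip("/") + "/get?" + urlencode(params)


# Hypothetical host and collection, used only to show the resulting URLs.
by_docid = build_get_url("http://localhost:8000", "main", docid=34111603247)
by_url = build_get_url("http://localhost:8000", "main",
                       url="http://www.doi.gov/", include_header=False)
print(by_docid)
print(by_url)
```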

Example XML Output (&format=xml)
<response>
	<statusCode>0</statusCode>
	<statusMsg>Success</statusMsg>
	<url><![CDATA[http://www.doi.gov/]]></url>
	<docId>34111603247</docId>
	<cachedTimeUTC>1404512549</cachedTimeUTC>
	<cachedTimeStr>Jul 04, 2014 UTC</cachedTimeStr>
	<content><![CDATA[<html><title>Some web page title</title><head>My first web page</head></html>]]></content>
</response>

Example JSON Output (&format=json)
{ "response":{
	"statusCode":0,
	"statusMsg":"Success",
	"url":"http://www.doi.gov/",
	"docId":34111603247,
	"cachedTimeUTC":1404512549,
	"cachedTimeStr":"Jul 04, 2014 UTC",
	"content":"<html><title>Some web page title</title><head>My first web page</head></html>"
}
}
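
Unlike the /search example, the /get JSON example wraps its fields in a top-level "response" object. A minimal Python sketch for unwrapping it and converting cachedTimeUTC into a timezone-aware datetime follows. The function name `read_cached_page` is hypothetical, and the sample payload is copied from the example above.

```python
import json
from datetime import datetime, timezone

# Copy of the example JSON output above.
SAMPLE_GET_JSON = """{ "response":{
  "statusCode":0,
  "statusMsg":"Success",
  "url":"http://www.doi.gov/",
  "docId":34111603247,
  "cachedTimeUTC":1404512549,
  "cachedTimeStr":"Jul 04, 2014 UTC",
  "content":"<html><title>Some web page title</title><head>My first web page</head></html>"
}
}"""


def read_cached_page(text):
    """Unwrap a /get JSON response and return (url, cached_at, content)."""
    body = json.loads(text)["response"]
    if body["statusCode"] != 0:
        raise RuntimeError(
            f"get failed ({body['statusCode']}): {body.get('statusMsg', '')}")
    # cachedTimeUTC is a unix timestamp (seconds since the epoch) in UTC.
    cached_at = datetime.fromtimestamp(body["cachedTimeUTC"], tz=timezone.utc)
    return body["url"], cached_at, body["content"]


page_url, cached_at, content = read_cached_page(SAMPLE_GET_JSON)
print(page_url, cached_at.isoformat(), len(content))
```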



/admin/status
Only available to admin



/admin/collectionpasswords
Only available to admin



/admin/master
Only available to admin



/admin/search
Only available to admin



/admin/spider
Only available to admin



/admin/proxies
Only available to admin



/admin/log
Only available to admin



/admin/masterpasswords
Only available to admin



/admin/addcoll
Only available to admin



/admin/delcoll
Only available to admin



/admin/clonecoll
Only available to admin



/admin/hosts
Only available to admin



/admin/stats
Only available to admin



/account/adduser
Only available to admin



/account/edituser
Only available to admin



/account/addad
Only available to admin



/account/editad
Only available to admin



/account/showad
Only available to admin



/account/pausead
Only available to admin



/account/resumead
Only available to admin



/account/deletead
Only available to admin



/account/deposit
Only available to admin



/account/refund
Only available to admin



/account/showuser
Only available to admin



/account/showusers
Only available to admin



/admin/rebuild
Only available to admin



/admin/reindex
Only available to admin



/admin/inject
Only available to admin



/admin/addurl
Only available to admin