Team looks to extend Apache Nutch to improve capabilities of search engines.
As part of the Defense Advanced Research Projects Agency (DARPA) Memex program, Kitware is developing software extensions that aim to address complex search problems common in fields such as security and defense.
A key use case of the technology is the ability of law enforcement to discover and address human trafficking through the Web. The prominence of websites that facilitate trafficking is increasing, necessitating new software platforms that address the scale and scope of the expanding Web. Current software and search approaches do not effectively integrate interactive and social media, text, images, and video, while taking into account the degree of importance of each piece of media, which is required for performing “deep searches.”
“The leading search providers are singularly focused on displaying ads, which has led to closed systems and stagnant interfaces,” Jeff Baumes, a Co-Principal Investigator for the project and Technical Lead at Kitware, said.
Kitware is collaborating with NASA’s Jet Propulsion Laboratory at the California Institute of Technology and Continuum Analytics LLC to create specific extensions to Apache Nutch, with the goal of strengthening its capabilities as a search technology. The project, which was featured on 60 Minutes earlier this year, is funded by DARPA. The work seeks to integrate state-of-the-art algorithms developed by Kitware with multimedia and video analytics. The aim is to provide image and video understanding support that allows for automatic detection of objects and massive deployment via Nutch.
“The objective of our project is to create rich, customizable search experiences, using open-source tools and integrating diverse content such as video and images,” Baumes said.
The project is designed to create an interactive and visual interface for initiating Nutch crawls. The plan for the interface is to use the Blaze platform to expose Nutch data and to provide a domain-specific language for the crawls. The interface is also expected to use the Bokeh visualization library to deliver simple interactive visualization and plotting techniques for exploring crawled information. Moreover, the team intends to improve media-oriented search, the user interface (UI), and the domain-specific language for search. These improvements are meant to enable “deep search” activities that law enforcement and analysts can easily carry out for quick turnaround in time-critical situations.
“Our motivation is to enable rapid search and visualization in ways that were not possible in the past,” Baumes said.
Several communities can benefit from this work. The general intelligence community can benefit from the Nutch-based crawl interface, while the Apache and Python communities can benefit from the combination of Apache, Blaze, and Bokeh capabilities. Furthermore, the project can help more academic and open-source researchers explore and improve software for detecting human trafficking in the field, as it proposes to improve the power of Nutch as a search engine framework.
“Through DARPA partnerships, our tools can be utilized by organizations that desperately need better tools to make progress against real societal problems such as human trafficking,” Baumes said.
To learn more about the work discussed in this material, please see “DARPA: Nobody’s Safe on the Internet.”
Approved for Public Release, Distribution Unlimited