In the digital age, almost everyone has an online presence. Most people will look online before stepping foot in a store because everything is available online—even if it’s just information on where to get the best products. We even look up cinema times online!
As such, staying ahead of the competition regarding visibility is no longer merely a matter of having a good marketing strategy. Newspaper and magazine articles, television and radio advertising, and even billboards (for those who can afford them) are no longer enough, even though they’re still arguably necessary.
Now, you also have to ensure that your site is better than your competitors’, from layout to content, and beyond. If you don’t, you’ll slip away into obscurity, like a well-kept secret among the locals—which doesn’t bode well for any business.
This notion is where search engine optimization (SEO) comes in. There is a host of SEO tools and tricks available to help put you ahead and increase your search engine page ranking—your online visibility. These range from your use of keywords, backlinks, and imagery, to your layout and categorization (usability and customer experience). One of these tools is the website crawler.
What is a Website Crawler?
A website crawler is a software program used to scan sites, reading the content (and other information) so as to generate entries for the search engine index. All search engines use website crawlers (also known as a spider or bot). They typically work on submissions made by site owners and “crawl” new or recently modified sites and pages, to update the search engine index.
The crawler earned its moniker based on the way it works: by crawling through each page one at a time, following internal links until the entire site has been read, as well as following backlinks to determine the full scope of a site’s content. Crawlers can also be set to read the entire site or only specific pages that are then selectively crawled and indexed. By doing so, the website crawler can update the search engine index on a regular basis.
Website crawlers don’t have free reign, however. The Standard for Robot Exclusion (SRE) dictates the so-called “rules of politeness” for crawlers. Because of these specifications, a crawler will source information from the respective server to discover which files it may and may not read, and which files it must exclude from its submission to the search engine index. Crawlers that abide by the SRE are also unable to bypass firewalls, a further implementation designed to protect site owner’s’ privacy rights.
Lastly, the SRE also requires that website crawlers use a specialized algorithm. This algorithm allows the crawler to create search strings of operators and keywords, in order built onto the database (search engine index) of websites and pages for future search results. The algorithm also stipulates that the crawler waits between successive server requests, to prevent it from negatively impact the site’s response time for real (human) users visiting the site.
What Are the Benefits of Using a Website Crawler?
The search engine index is a list where the search engine’s data is stored, allowing it to produce the search engine results page (SERP). Without this index, search engines would take considerably longer to generate results. Each time one makes a query, the search engine would have to go through every single website and page (or other data) relating to the keyword(s) used in your search. Not only that, but it would also have to follow up on any other information each page has access to—including backlinks, internal site links, and the like—and then make sure the results are structured in a way to present the most relevant information first.
This finding means that without a website crawler, each time you type a query into your search bar tool, the search engine would take minutes (if not hours) to produce any results. While this is an obvious benefit for users, what is the advantage for site owners and managers?
Using the algorithm as mentioned above, the website crawler reviews sites for the above information and develops a database of search strings. These strings include keywords and operators, which are the search commands used (and which are usually archived per IP address). This database is then uploaded to the search engine index to update its information, accommodating new sites and recently updated site pages to ensure fair (but relevant) opportunity.
Crawlers, therefore, allow for businesses to submit their sites for review and be included in the SERP based on the relevancy of their content. Without overriding current search engine ranking based on popularity and keyword strength, the website crawler offers new and updated sites (and pages) the opportunity to be found online. Not only that, but it allows you to see where your site’s SEO ranking can be improved.
How to Choose a Website Crawler?
Site crawlers have been around since the early 90s. Since then, hundreds of options have become available, each varying in usability and functionality. New website crawlers seem to pop up every day, making it an ever-expanding market. But, developing an efficient website crawler isn’t easy—and finding the right option can be overwhelming, not to mention costly if you happen to pick the wrong one.
Here are seven things to look out for in a website crawler:
1. Scalability – As your business and your site grow bigger, so do your requirements for the crawler to perform. A good site crawler should be able to keep up with this expansion, without slowing you down.
2. Transparency – You want to know exactly how much you’re paying for your website crawler, not run into hidden costs that can potentially blow your budget. If you can understand the pricing plan easily, it’s a safe bet: compact packages often have those unwanted hidden costs.
3. Reliability – A static site is a dead site. You’ll be making changes to your site on a fairly regular basis, whether it’s regarding adding (or updating) content or redesigning your layout. A good website crawler will monitor these changes, and update its database accordingly.
4. Anti-crawler mechanisms – Some sites have anti-crawling filters, preventing most website crawlers from accessing their data. As long as it remains within limits defined in the SRE (which a good website crawler should do anyway), the software should be able to bypass these mechanisms to gather relevant information accurately.
5. Data delivery – You may have a particular format you want to view the website crawler’s collected information. While you do get some programs that focus on specific data formats, you won’t go wrong finding one capable of multiple formats.
6. Support – No matter how advanced you are, chances are you’re going to need some help optimizing your website crawler’s performance, or even making sense of the output when starting out. Website crawlers with a good support system relieve a lot of unnecessary stress, especially when things go wrong once in awhile.
7. Data quality – Because the information gathered by website crawlers is initially as unstructured as the web would be without them, it’s imperative that the software you ultimately decide on is capable of cleaning it up and presenting it in a readable manner.
Now that you know what to look for in a website crawler, it’s time we made things easier for you by narrowing your search down from (literally) thousands to the best 60 options.
With a focus on sitemap building (which the website crawler feature uses to determine which pages it’s allowed to read), DYNO Mapper is an impressive and functional software option.
DYNO Mapper’s website crawler lets you enter the URL (Uniform Resource Locator—the website address, such as www.example.com) of any site and instantly discover its site map, and build your own automatically.
There are three packages to choose from, each allowing a different number of projects (sites) and crawl limitations regarding the number of pages scanned. If you’re only interested in your site and a few competitors, the Regular package (at $480 a year paid annually) is a good fit. However, their Freelancer ($696 per year) and Most Popular ($1296 a year) packages are better options for more advanced users, especially those who want to be able to crawl numerous sites and up to 50 000 pages.
With a 14-day free trial (and two months off if you do opt for annual billing), you can’t go wrong.
Screaming Frog offers a host of search engine optimization tools, and their SEO Spider is one of the best website crawlers available. You’ll instantly find where your site needs improvement, discovering broken links and differentiating between temporary and permanent redirects.
While their free version is somewhat competent, to get the most out of the Screaming Frog SEO Spider tool, you’ll want to opt for the paid version. Priced at about $197 (paid on an annual basis), it allows for unlimited pages (memory dependent) as well as a host of functions missing from the free version. These include crawl configuration, Google Analytics integration, customized data extraction, and free technical support.
Screaming Frog claim that some of the biggest sites use their services, including Apple, Disney, and even Google themselves. The fact that they’re regularly featured in some of the top SEO blogs goes a long way to promote their SEO Spider.
DeepCrawl is something of a specialized website crawler, admitting on their homepage that they’re not a “one size fits all tool.” They offer a host of solutions, however, which you can integrate or leave out as you choose, depending on your needs. These include regular crawls for your site (which can be automated), recovery from Panda and (or) Penguin penalties, and comparison to your competitors.
There are five packages to choose from, ranging from $864 annually (you get one month free by opting for an annual billing cycle) to as high as $10 992 a year. Their corporate package, which offers the most features, is individually priced, and you’ll need to contact their support team to work out a cost.
Overall, the Agency package ($5484 a year) is their most affordable option for anyone wanting telephonic support and three training sessions. However, the Consultant plan ($2184 annually) is quite capable of meeting most site owners’ needs and does include email support.
Designed to extract the site map and data from websites, Apifier processes information in a readable format for you surprisingly quickly (they claim to do so in a matter of seconds, which is impressive, to say the least).
Developers do have the option of signing up for free, but the package does not entail all the basics. To get the best out of Apifier, you’ll want to opt for the Medium Business plan at $1548 annually ($129 a month), but the Extra Small option at $228 annually is also quite competent.
Since Google understands only a portion of your site, OnCrawl offers you the ability to read all of it with semantic data algorithms and analysis with daily monitoring.
The features available include SEO audits, which can help you improve your site’s search engine optimization and identify what works and what doesn’t. You’ll be able to see exactly how your SEO and usability is affecting your traffic (number of visitors). OnCrawl even monitors how well Google can read your site with their crawler and will help you to improve and control what does and doesn’t get read.
With OnCrawl’s Starter package ($136 a year) affords you a 30-day money back guarantee, but it’s so limited you’ll likely be upgrading to one of the bigger packages that don’t offer the same money-back guarantee. Pro will set you back $261 a year—you get two months free with the annual plan—but will also cover almost every requirement.
We now start moving away from the paid website crawlers to the free options available, starting with the SEO Chat Website Crawler and XML Site Map Builder. Also referred to as SEO Chat’s Ninja Website Crawler Tool, the online software mimics the Google sitemap generator to scan your site. It also offers spell checking and identifies page errors, such as broken links.
It’s incredibly easy to use integrate with any number of SEO Chat’s other free online SEO tools. After entering the site URL—either typing it out or using copy/paste—you can choose whether you want to scan up to 100, 500, or 1000 pages from the site.
Of course, there are some limitations in place. You’ll have to register (albeit for free) if you want the tool to crawl more than 100 pages, and you can only run five scans a day.
The Webmaster World Website Crawler Tool and Google Sitemap Builder is another free scanner available online. Designed and developed in a very similar manner to the SEO Chat Ninja Website Crawler Tool above, it also allows you to punch in (or copy/paste) a site URL and opt to crawl up to 100, 500, or 1000 of its pages. Because the two tools have been built using almost the same code, it comes as no surprise that you’ll need to register for a free account if you want it to scan more than 100 pages.
Another similarity is that it can take up to half an hour to complete a website crawl, but allows you to receive the results via email. Unfortunately, you’re still limited to five scans per day.
However, where the Webmaster World tool does outshine the SEO Chat Ninja is in its site builder capabilities. Instead of being limited to XML, you’ll be able to use HTML too. The data provided is also interactive.
Rob Hammond offers a host of architectural and on-page search engine optimization tools, one of which is a highly efficient free SEO Crawler. The online tool allows you to scan website URLs on the move, being compatible with a limited range of devices that seem to favor Apple products. There are also some advanced features that allow you to include, ignore, or even remove regular expressions (the search strings we mentioned earlier) from your crawl.
Results from the website crawl are in a TSV file, which can be downloaded and used with Excel. The report includes any SEO issues that are automatically discovered, as well as a list of the total external links, meta keywords, and much more besides.
The only catch is that you can only search up to 300 URLs for free. It isn’t made clear on Hammond’s site whether this is tracked according to your IP address, or if you’ll have to pay to make additional crawls—which is a disappointing omission.
WebCrawler.com is easily the most obviously titled tool on our list, and the site itself seems a little overly simplistic, but it’s quite functional. The search function on the site’s homepage is a little deceptive, acting as a search engine would and bringing up results of the highest ranking pages containing the URL you enter. At the same time, you can see the genius of this though—you can immediately see which pages are ranking better than others, which allows you to quickly determine which SEO methods are working the best for your sites.
One of the great features of WebCrawler.com is that you can integrate it into your site, allowing your users to benefit from the tool. By adding a bit of HTML code to your site (which they provide for you free of charge as well), you can have the WebCrawler.com tool appear on your site as a banner, sidebar, or text link.
Another rather simply named online scanner, the Web Crawler by Diffbot is a free version of the API Crawlbot included in their paid packages. It extracts information on a range of features of pages. The data contained are titles, text, HTML coding, comments, date of publication, entity tags, author, images, videos, and a few more.
While the site claims to crawl pages within seconds, it can take a few minutes if there’s a lot of internal links on your site. There’s an ill-structured web results page that can be viewed online, but you can also download the report in one of two formats: CSV or JSON.
You’re also limited in the number of searches, but it isn’t stipulated as to exactly what that limitation is—although you can share the tool on social media to gain 300 more crawls before being prompted to sign up for a 14-day free trial for any of Diffbot’s paid packages.
The Internet Archive’s Heritrix is the first open source website crawler we’ll be mentioning. Because it (and, in fact, the rest of the crawlers that follow it on our list) require some knowledge of coding and programming languages. Hence, it’s not for everyone, but still well worth the mention.
Everyone is free to download and use Heritrix, for redistribution and (or) modification (allowing you to build your website crawler using Heritrix as a foundation), within the limitations stipulated in the Apache License.
Nutch 2.x, on the other hand, stems from Nutch 1.x but is still being processed (it’s still usable, however, and one can use it as a foundation for developing your website crawler). The key difference is that Nutch 2.x uses Apache Gora, allowing for the implementation of a more flexible model/stack storage solution.
Both versions of Apache Nutch are modular and provide interface extensions like parsing, indexation, and a scoring filter. While it’s capable of running off a single workstation, Apache does recommend that users run it on a Hadoop cluster for maximum effect.
Scrapy is a collaborative open source website crawler framework, designed with Python for cross-platform use. Developed to provide the basis for a high-level web crawler tool, Scrapy is capable of performing data mining as well as monitoring, with automated testing. Because the coding allows for requests to be submitted and processed asynchronously, you can run multiple crawl types—for quotes, for keywords, for links, et cetera—at the same time. If one request fails or an error occurs, it also won’t interfere with the other crawls running at the same time.
This flexibility allows for very fast crawls, but Scrapy is also designed to be SRE compliant. Using the actual coding and tutorials, you can quickly set up waiting times, limits on the number of searches an IP range can do in a given period, or even restrict the number of crawls done on each domain.
Developed using C++ and compatible on several platforms, DataparkSearch Engine is designed to organize search results in a website, group of websites, local systems, and intranets. Some of the key features include HTTP, https, FTP, NNTP, and news URL scheme support, as well as an htdb URL for SQL database indexation. DataparkSearch Engine is also able to index text/plain, text/XML, text/HTML, audio/MPEG, and image/gif types natively, as well as multilingual websites and pages with content negotiation.
Using the vector calculation, results can be sorted by relevancy. Popularity ranking reports are classified as “Goo,” which adds weight to incoming links, as well as “Neo,” based on the neutral network model. You can also view your results according to the last time a site or page has been modified, or by a combination of relevancy and popularity rank to determine its importance. DataparkSearch Engine also allows for a significant reduction in search times by incorporating active caching mechanisms.
Formed as a free software package, GNU Wget leans toward retrieving information on the most common internet protocols, namely HTTP, HTTPS, and FTP. Not only that, but you’ll also be able to mirror a site (if you so wish) using some of GNU Wget’s many features.
If a download of information and files is interrupted or aborted for any reason, using the REST and RANGE commands, allow you to resume the process with ease quickly. GNU Wget uses NSL-based message files, making it suitable for a wide array of languages, and can utilize wildcard file names.
Downloaded documents will be able to interconnect locally, as GNU Wget’s programming allows you to convert absolute links to relative links.
GNU Wget was developed with the C programming languages and is for use on Linux servers (but compatible with other UNIX operating systems, such as Windows).
Designed as a website crawling software for clients and servers, Grub Next Generation assists in creating and updating search engine indexes. It makes it a viable option for anyone developing their search engine platform, as well as those looking to discover how well existing search engines can crawl and index their site.
It’s also operating system independent, making it a cross-platform program, and can be implemented in coding schemes using Perl, Python, C, and C# alike. The program also translates into several languages, namely Dutch, Galician, German, French, Spanish, Polish, and Finnish.
The most recent update included two new features, allowing users to alter admin upload server settings as well as adding more control over client usage. Admittedly, this update was as far back as mid-June 2011, and Freecode (the underlying source of Grub Next Generation platform) stopped providing updates three years later. However, it’s still a reliable web crawling tool worth the mention.
The HTTrack Website Copier is a free, easy-to-use offline website crawler developed with C and C++. Available as WinHTTrack for Windows 2000 and up, as well as WebHTTrack for Linux, UNIX, and BSD, HTTrack is one of the most flexible cross-platform software programs on the market.
Allowing you to download websites to your local directory, HTTrack allows you to rebuild all the directories recursively, as well as sourcing HTML, images, and other files. By arranging the site’s link structure relatively, you’ll have the freedom of opening the mirrored version in your browser and navigate the site offline.
Furthermore, if the original site is updated, HTTrack will pick up on the modifications and update your offline copy. If the download is interrupted at any point for any reason, the program is also able to resume the process automatically.
HTTrack has an impressive help system integrated as well, allowing you to mirror and crawl sites without having to worry if anything goes wrong.
Available as an HTTP Collector and a Filesystem Collector, the Norconex Collectors are probably the best open source website crawling solutions available for download.
Although designed for developers, the programs are often extended by integrators and (while still being easily modifiable) can be used comfortably by anyone with limited developing experience too. Using one of their readily available Committers, or building your own, Norconex Collectors allow you to make submissions to any search engine you please. And if there’s a server crash, the Collector will resume its processes where it left off.
The HTTP Collector is designed for crawling website content for building your search engine index (which can also help you to determine how well your site is performing), while the Filesystem Collector is geared toward collecting, parsing, and modifying information on local hard drives and network locations.
While OpenSearchServer also offers cloud-based hosting solutions (starting at $228 annually on a monthly basis and ranging up to $1428 for the Pro package), they also provide enterprise-class open source search engine software, including search functions and indexation.
You can opt for one of six downloadable scripts. The Search code, made for building your search engine, allows for full text, Boolean, and phonetic queries, as well as filtered searches and relevance optimization. The index includes seventeen languages, distinct analysis, various filters, and automatic classification. The Integration script allows for index replication, periodic task scheduling, and both REST API and SOAP web services. Parsing focuses on content file types such as Microsoft Office Documents, web pages, and PDF, while the Crawler code includes filters, indexation, and database scanning.
The sixth option is Unlimited, which includes all of the above scripts in one fitting space. You can test all of the OpenSearchServer code packages online before downloading. Written in C, C++, and Java PHP, OpenSearchServer is available cross-platform.
A free search engine program designed with Java and compatible with many operating systems, YaCy was developed for anyone and everyone to use, whether you want to build your search engine platform for public or intranet queries.
YaCy’s aim was to provide a decentralized search engine network (which naturally includes website crawling) so that all users can act as their administrator. Period means that search queries are not stored, and there is no censoring of the shared index’s content either.
Contributing to a worldwide network of peers, YaCy’s scale is only limited by its number of active users. Nevertheless, it is capable of indexation billions of websites and pages.
Installation is incredibly easy, taking only about three minutes to complete—from download, extraction, and running the start script. While the Linux and Debian versions do require the free OpenJDK7 runtime environment, you won’t need to install a web server or any databases—all of that is included in the YaCy download.
Written with C++ for the UNIX operating system, ht://Dig is somewhat outdated (their last patch released in 2004), but is still a convenient open source search and website crawling solution.
With the ability to act as a www browser, ht://Dig will search servers across the web with ease. You can also customize results pages for the ht://Dig search engine platform using HTML templates, running Boolean and “fuzzy” search types. It’s also completely compliant with the rules and limitations set out for website crawlers in the Standard for Robot Exclusion.
Using (or at least setting up) ht://Dig does require a UNIX machine and both a C and C++ compiler. If you use Linux, however, you can also make use of the open source tool by also installing libstdc++ and using GCC and (or) g++ instead.
You’ll also have to ensure you have a lot of free space for the databases. While there are no means of calculating exactly how much disk space you’ll need, the databases tend to take about 150MB per 13 000 documents.
mnoGoSearch isn’t very well documented, but it’s a welcome inclusion to our list (despite having seen no update since December 2015). Built with the C programming language, and originally designed for Windows only, mnoGoSearch has since expanded to include UNIX as well and offers a PHP front-end. It includes a site mirroring function, built-in parsers for HTML, XML, text, RTF, Docx, eml, mht, and MP3 file types, and support for HTTP, HTTPS, FTP, news, and nntp (as well as proxy support for both HTTP and HTTPS).
A whole range of database types, ranging from the usual MySQL and MSSQL to PostgreSQL and SQLite, can be used for storage purposes. With HTBD (the virtual URL scheme support), you can build a search engine index and use mnoGoSearch as an external full-text search solution in database applications for scanning large text fields.
mnoGoSearch also complies with the regulations set for website crawlers in the Standard for Robot Exclusion.
An object oriented library by Uwe Hunfeld, PHP Crawl can be used for website and website page crawling under several different platform parameters, including the traditional Windows and Linux operating systems.
By overriding PHP Crawl’s base class to implement customized functionality for the handleDocumentInfo and handleHeaderInfo features, you’ll be able to create your website crawler using the program as a foundation. In this way, you’ll not only be able to scan each website page but control the crawl process and include manipulation functions to the software. A good example of crawling code that can be implemented in PHP Crawl to do so is available at Dev Dungeon, who also provide open source coding to add a PHP Simple HTML DOM one-file library. This option allows you to extract links, headings, and other elements for parsing.
PHP Crawl is for developers, but if you follow the tutorials provided by Dev Dungeon a basic understanding of PHP coding will suffice.
Using the Crawler Workbench allows you to design and control a customized website crawler of your own. It allows you to visualize groups of pages as a graph, save website pages to your PC for offline viewing, connect pages together to read and (or) print them as one document and extract elements such as text patterns.
Without the WebSPHINX Class Library, however, none of it would be possible, as it’s your source for support in developing your website crawler. It offers a simple application framework for website page retrieval, tolerant HTML parsing, pattern matching, and simple HTML transformations for linking pages, renaming links, and saving website pages to your disk.
The standard for Robot Exclusion-compliant, WebSPHINX is one of the better open source website crawlers available.
While in pre-Alpha mode back in 2002, Tom Hey made the basic crawling code for WebLech available online once it was functional, inviting interested parties to become involved in its development.
Now a fully featured Java based tool for downloading and mirroring websites, WebLech can emulate the standard web-browser behavior in offline mode by translating absolute links into relative links. Its website crawling abilities allow you to build a general search index file for the site before downloading all its pages recursively.
If it’s your site, or you’ve been hired to edit someone else’s site for them, you can re-publish changes to the web.
With a host of configuration features, you can set URL priorities based on the website crawl results, allowing you to download the more interesting/relevant pages first and leaving the less desirable one for last—or leave them out of the download altogether.
Written in 2001 by an anonymous developer who wanted to familiarize himself/herself with the java.net package, Arale is no longer actively managed. However, the website crawler does work very well, as testified by some users, although one unresolved issue seems to be an OutofMemory Exception error.
On a more positive note, however, Arale is capable of downloading and crawling more than one user-defined file at a time without using all of your bandwidth. You’ll also have the ability to rename dynamic resources and code file names with query strings, as well as set your minimum and maximum file size.
While there isn’t any real support systems, user manuals, or official tutorials available for using Arale, the community has put together some helpful tips—including alternative coding to get the program up and running on your machine.
As it is command-prompt driven and requires the Java Runtime Environment to work, Arale isn’t really for the casual user.
Hosted by Source Forge, JSpider was developed with Java under the LGPL Open Source license as a customizable open source website crawler engine. You can run JSpider to check sites for internal server errors, look up outgoing and internal links, create a sitemap to analyze your website’s layout and categorization structure, and download entire websites.
The developers have also posted an open calling for anyone who uses JSpider to submit feature requests and bug reports, as well as any developers willing to provide patches that resolve issues and implement new features.
Because it’s such a highly configurable platform, you have the option of adding any number of functions by writing JSpider plugins, which the developers (who seem to have last updated the program themselves in 2004) encourage users to make available for other community members. Of course, this doesn’t include breaking the rules—JSpider is designed to be compliant with the Standard for Robot Exclusion.
Another functional (albeit last updated in 2003) open source website crawling solution hosted by Source Forge, HyperSpider offers a simple yet serviceable program. Like most website crawlers, HyperSpider was written in Java and designed for use on more than one operating system. The software gathers website link structures by following existing hyperlinks, and both imports and exports data to and from the databases using CSV files. You can also opt to export your gathered information into other formats, such as Graphviz DOT, XML Topic Maps (XTM), Prolog, HTML, and Resource Description Framework (RDF and (or) DC).
Data is formulated into a visualized hierarchy and map, using minimal click paths to define its form out of the collection of website pages—something which, at the time at least, was a cutting-edge solution. It’s a pity that the project was never continued, as the innovation of HyperSpider’s initial development showed great promise. As is, it’s still a worthy addition to our list.
One thing you won’t have to add yourself is an HTML parser for running an input stream of HTML content. However, Arachnid is not intuitively SRE compliant, and users are warned not to use the program on any site they don’t own. To use the website crawler without infringing on another site’s loading time, you’ll need to add extra coding.
BitAcuity was initially founded in 2000 as a technical consulting group, based in Washington DC’s metropolitan area. Using their experience in providing and operating software for both local and international clients, they released an open source, Java-based website crawler that is operational on various operating systems.
It’s a top quality, enterprise class website crawling solution designed for use as a foundation for developing your crawler program. Their aim was (and is) to save clients both time and effort in the development process, which ultimately translates to reduced costs short-term as well as long-term.
BitAcuity also hosts an open source community, allowing established users and developers to get together in customizing the core design for your specific needs and providing resources for upgrades and support. This community basis also ensures that before your website crawler becomes active, it is reviewed by peers and experts to guarantee that your customized program is on par with the best practices in use.
As of 2003, when the developers last updated their page, LARM was set up with some basic specifications gleaned from its predecessor, another experimental Jakarta project called LARM Web Crawler (as you can see, the newer version also took over the name). The more modern project started with a group of developers who got together to brainstorm how best to take the LARM Web Crawler to the next level as a foundation framework, and hosting of the website crawler was ultimately moved away from Jakarta to Source Forge.
The basic coding is there to implement file indexation, database table creation, and maintenance, and web site crawling, but it remains largely up to the user to develop the software further and customize the program.
Metis was first established in 2002 for the IdeaHamster Group with the intent of ascertaining the competitive data intelligence strength of their web server. Designed with Java for cross-platform usage, the website crawler also meets requirements set out in the Open Source Security Testing Methodology Manual’s section on CI Scouting. This flexibility also makes it compliant with the Standard for Robot Exclusion.
Composed of two packages, the faust.sacha.web and org.ideahamster.metis Java packages, Metic acts as a website crawler, collecting and storing gathered data. The second package allows Metis to read the information obtained by the crawler and generate a report for user analysis.
The developer, identified only as Sacha, has also stipulated an intention to integrate better Java support, as well as a shift to BSD crawling code licensing (Metis is currently made available under the GNU public license). A distributed engine is also in the works for future patches.
The structure is set up to allow for querying and extracting both full-text content and metadata from an array of systems, including websites, file systems, and mailboxes, as well as their file formats (such as documents and images). It’s designed to be easy to use, whether you’re learning the program, adding code, or deploying it for industrial projects. The architecture’s flexibility allows for extensions to be added for customized file formats and data sources, among others.
Data is exchanged based on the Semantic Web Standards, including the Standard for Robot Exclusion, and unlike many of the other open-source website crawler software options available you also benefit from built-in support for deploying on OSGi platforms.
Web Harvest uses a traditional methodology for XSLT, XQuery, and Regular Expressions (among others) text to XML extraction and manipulation. While it focuses mainly on HTML and XML websites in crawling for data—and these websites do still form the vast majority of online content—it’s also quite easy to supplement the existing code with customized Java libraries to expand Web Harvest’s scope.
A host of functional processors is supported to allow for conditional branching, file operations, HTML and XML processing, variable manipulation, looping, file operations, and exception handling.
The Web Harvest project remains one of the best frameworks available online, and our list would not be complete without it.
FindBestOpenSource.com are passionate about gathering open source projects together and helping promote them. It comes as no surprise that they’ve opted to host ASPseek, a Linux-oriented C++ search engine software by SVsoft.
Offering a search daemon and a CGI search frontend, ASPseek’s impressive indexation robot is capable of crawling through and recording data from millions of URLs, using words, phrases, wildcards, and performing Boolean searches. You can also limit the searches to a specified period (complying with the Standard for Robot Exclusion), website, or even to a set of sites, known as a web space. The results are sorted by your choice of date or relevance, the latter of which bases order on PageRank.
Thanks to ASPseek’s Unicode storage mode, you’ll also be able to perform multiple encodings and work with multiple languages at once. HTML templates, query word highlighting, excerpts, a charset, and iSpell support are also included.
Written with Java as an open source, cross-platform website crawler released under the Apache License, the Bixo Web Mining Toolkit runs on Hadoop with a series of cascading pipes. This capability allows users to easily create a customized crawling tool optimized for your specific needs by offering the ability to assemble your pipe groupings.
The cascading operations and subassemblies can be combined, creating a workflow module for the tool to follow. Typically, this will begin with the URL set that needs to be crawled and end with a set of results that are parsed from HTML pages.
Two of the subassemblies are Fetch and Parse. The former handles the heavy lifting, sourcing URLs from the URL Datum tuple wrappers, before emitting Status Datums and Fetched Datums via two tailpipes. The latter (the Parse Subassembly) processes the content gathered, extracting data with Tika.
Their hosting site provides step by step coding instructions for setting Crawler4j up, whether you’re using Maven or not in the installation process. From there, you need to create the crawler class that differentiates between which URLs and URL types the crawler should scan. This class will also handle the downloaded page, and Crawler4j provides a quality example that includes manipulations for the shouldVisit and visit functions.
Secondly, you’ll want to add a controller class to specify the crawl’s seeding, the number of concurrent threads, and a folder for immediate scan data to be stored in. Once again, Crawler4j provides an example code.
While it does require some coding experience, by following the list of examples almost anyone can use Crawler4j.
Also hosted by GitHub, Matteo Radaelli’s Ebot is a highly scalable and customizable website crawler. Written in Erlang for use on the Linux operating system, the open-source framework is designed with a noSQL database (Riak and Apache CouchDB), webmachine, mochiweb, and AMQP database (RabbitMQ).
Because of the NoSQL database structure (as opposed to the more standard Relational Database scheme), Ebot is easy to expand and customize—without having to spend too much extra money on a developer.
Although built on and primarily for Linux Debian, Matteo Radaelli released a patch that allowed for other operating systems that support Erlang coding to run and host the Ebot website crawling tool.
There are also some plugins available to help you customize Ebot, but not very many—you’ll end up looking for someone experienced in Erlang to help you flesh it out to your satisfaction.
Designed to run as is, but allowing for customization, Hounder also includes a wiz4j installation wizard and a clusterfest website application to monitor and manage the engine’s many components. This capacity makes it one of the better open source website scanners available, and it’s fully integrated with a more than a satisfactory crawler, document indexes, and search function.
Hounder is also capable of running several queries concurrently and has the flexibility for users to distribute the tool over many servers that run search and index functions, thus increasing the performance of your queries as well as the number of documents indexed.
Designed and developed by Mikio Hirabayashi and Tokuhirom, the Hyper Estraier website crawler is an open source cross-platform program written in C and C++ and hosted, of course, on Source Forge.
Based on architecture made through peer community collaborations, the Hyper Estraier essentially mimics the website crawler program used by Google. However, it is a much-simplified version, designed to act as a framework structure on which to build your software. It’s even possible to develop your search engine platform using the Hyper Estraier work form, whether you have a high-end or low-end computer do so on.
As such, most users ought to be able to customize the coding themselves, but as both C and C++ can be somewhat complicated to learn on the go, you’d benefit from having very little experience with either language or hiring someone who does
Open Web Spider was designed and developed independently but encourages community members to get involved. First released in 2008, Open Web Spider has enjoyed several updates but appears to have remained much the same as it did in 2015. Whether the original developers continue to work on the project or community peers have largely taken over, is unknown at present.
Nevertheless, as an open source website crawler framework it certainly packs a punch to this day. Compatible with the C# and Python coding languages, Open Web Spider is fully functional on a range of operating systems.
You’ll be surprisingly happy with the Open Web Spider Software, with its quick set-up, high-performance charts, and fast operation (their site boasts of the program’s ability to source up to 10 million hits in real time).
The Open Web Spider developers have relied on community members not only to assist in keeping the project alive but also to spread its reach by translating the code.
A Gopher, HTTP, FTP, HTTP over SSL, and FTP over SSL recursive data retrieval website crawler written in the C coding language for Linux users, Pavuk is known for using the string used to query servers to form the document titles, converting URL to file names. It is possible to edit this if it creates issues when you want to review the data, however (some punctuation in string forms are known to do so, especially if browsing manually through the index).
Pavuk also includes a detailed built-in support system, accessed by executing code from commands (which Linux favors), and has several configuration options for notifications, logs, and interface appearance. Besides these, there are a wide array of other customization options available, including proxy and directory settings.
Of course, Pavuk has been designed with the Standard for Robot Exclusion. Our list of website crawlers would certainly not be complete without this open source software.
As you’ve probably noticed by now, most open source website crawlers are primarily marketed as a search engine solution, whether on the scale of rivaling (or attempting to rival) Google or as an internal search function for individual sites. The Sphider PHP Search Engine software is indeed one of these.
As the name itself implies, Sphider was written in PHP and has been designed as a cross-platform solution. The back end database is programmed for MySQL, the most common database format in the world. All this makes the Sphider PHP Search Engine flexible as well as functional as a website crawler.
Sphider is fully compliant with the Standard for Robot Exclusion and other robots.txt protocols, and also respects the no-follow and no-index META tags that some sites incorporate to distinguish pages for exclusion in website crawls and the development of search engine indexes.
Licensed under the GPL as a free open source search engine library, the Xapian Project is kept very well up to date. In fact, it was initially available in C++, but bindings have since been included to allow for Perl, PHP, Python, Tcl, C#, Ruby, Jaca, Erlang, Lua, R, and Node.js. And the list is expected to grow, especially with the developers set to participate in the 2017 Google Summer of Code.
The toolkit’s code is incredibly adaptive, allowing it to run on several operating systems, and affording developers the opportunity to supplement their applications with the advanced search and indexation website crawler facilities provided. Probabilistic Information Retrieval and a wide range of Boolean search query operators are some of the other models supported.
And for those users looking for something closer to a finished product, the developers have used the Xapian Project to build another open source tool: Omega, a more refined version that retains the same versatility the Xapian Project is known for.
Arachnode.net was written with the C# coding language best suited to Windows. In fact, it is indeed (as the name itself implies) a program designed to fit the .NET architecture, and quite an expensive one at that.
Arachnode.net is a complete package, suitably used for crawling, downloading, indexation, and storing website content (the latter is done using SQL 2005 and 2008). The content isn’t limited to text only, of course: Arachnode.net scans and indexes whole website pages, including the files, images, hyperlinks, and even email addresses found.
The search engine indexation need not be restricted to storage on the SQL Server 2008 model (which also runs with SSIS in the coding), however, as data can also be saved as full-text records in .DOC, .PDF, .PPT, and .XLS formats. As can be expected from a .NET application, it includes Lucene integration capabilities and is completely SRE compliant.
The Open Source Large-Scale Website Crawwwler, also hosted by FindBestOpenSource.com, is still in its infancy phase, but it set to be a truly large scale website crawler. A purposefully thin manager, designed to act as an emergency shutdown, occasional pump, and ignition switch, controls the (currently very basic) plugin architecture, all of which is written for the Java platform in C++ (no MFC inclusion/conversion is available at present, and doesn’t seem to be in the works either).
The manager is also designed to ensure plugins don’t need to transfer data to all of their peers—only those that effectively “subscribe” to the type of data in question, so that plugins only receive relevant information rather than slowing down the manager class.
A fair warning though, from the developers themselves: a stable release of Crawwwler is still in the works, so it’s best not to use the software online yet.
Not much is known regarding the Distributed Website Crawler, and it’s had some mixed reviews but is overall a satisfactory data extraction and indexation solution. It’s primarily an implementation program, sourcing its code structure from other open source website crawlers (hence the name). This capability has given it some advantage in certain regards and is relatively stable thanks to its Hadoop and Map Reduce integration.
Released under the GNU GPL v3 license, the Distributed Website Crawler uses svn-based control methods for sourcing and is also featured on the Google Code Archive. While it doesn’t explicitly state as much, you can expect the crawler to meet with and abide by the regulations set out in the Standard for Robot Exclusion. After all, Google is a trustworthy and authoritative name in the industry, and can certainly be relied on to ensure such compliance in any crawler they promote.
It’s entirely web-based, and despite being very nearly a complete package as is allows for any number of compatible features to be added to and supported by the existing architecture, making it a somewhat customizable and extensible website crawler. Information, crawled and sourced with svn-based controls, is stored using MS SQL databases for use in creating search engine indexes.
iCrawler also operated under two licenses—the GNU GPL v3 license that many open source data extraction programs use, as well as the Creative Commons 3.0 BY-SA content license.
As you’ve probably noticed, the two largest competitors in the hosting of open source website crawler and search engine solutions are Source Forge and (increasingly) the somewhat obviously named FindBestOpenSource.com. The latter has the benefit of giving those looking for Google approved options the ability to immediately determine whether an offering is featured on the Google Code Archive.
Psycreep is also quite extensible and uses regular expression search query keywords and phrases to match with URLs when crawling websites and their pages. Implementing the common svn-based controls for regulating its sourcing process, Psycreep is fully observant of the Standard for Robot Exclusion (although they don’t explicitly advertise the fact, which is an odd exclusion). Psycreep is also licensed under GNU GPL v3.
A general open source Chinese search engine, Opese OpenSE consists of four essential components written for Linux servers in C++. These modules allow for the software to act as a query server (search engine platform), query CGI, website crawler, and data indexer.
Users are given the option of specifying query strings but also allows for keyword-driven search results. These results consist mainly of element lists, with each item containing a title, extract, URL link, and a snapshot link of website pages that meet include the query words provided and searched for by front end users.
Opese OpenSE also allows the user to use the picture link for viewing the corresponding website page’s snapshot in the software’s database driven search engine index list. It’s capable of supporting a large number of searches and sites in its index and is Google Code Archive approved—just like most open source solutions found hosted by FindBestOpenSource.com.
Still, in pre-alpha stage, the Andjing Web Crawler 0.01 originates in India and has been featured on the Google Code Archive. As development has not progressed very far yet, Andjing is still an incredibly basic website crawler. Written in PHP and running in a CLI environment, the program does require some extensive knowledge of the PHP coding language, and a machine that is capable of running MySQL.
Interestingly, one of the recommendations made to users by the developers themselves is to alter the coding to allow for Andjing to use SQLite rather than MySQL to save on your CPU resources. Whether a future patch negating the user’s need to do so will be released or not is unknown at present.
Because the software is not stable, and usability requires a lot of customization at this point, Andjing isn’t quite ready to be used reliably yet, but it does show a lot of potentials.
Hosted by FindBestOpenSource.com, the Ccrawler Web Crawler Engine operates under three licenses: a public Artistic License, the GNU GPL v3 license, and the Creative Commons 3.0 BY-SA for content.
Despite finding itself well-supported, with inclusion on the Google Code Archive for open source programs, there isn’t very much that can be found on the web regarding Ccrawler. It is, however, known to be svn-based for managing its sourcing, and abides by the regulations set out in the Standard for Robot Exclusion.
Built with the 3.5 version of C# and designed exclusively for Windows, the Ccrawler Web Crawler Engine provides a basic framework and an extension for web content categorization. While this doesn’t make it the most powerful open source resource available, it does mean you won’t have to add any code specifically for Ccrawler to be able to separate website content by content type when downloading data.
Most sites don’t deal purely with HTML though, as often use a pre-processor language as well. PHP is the most common of these, and WebEater—despite its lightweight frame—was designed to accommodate this occurrence.
Licensed under the GPL and LGPL certificates, WebEater enjoyed its last official patch in 2003, when GUI updates were introduced. Nevertheless, it remains a functional website crawling framework and deserves its place on our list.
Because the branches dealing with the retrieval and handling of documents are kept separated, integrating your modules will be a natural process. JoBo is also expected to release patches with new modules shortly, but a release date and further details have not yet been made public.
While the acronym LAW doesn’t quite add up to the word order in its full name, the Laboratory for Web Algorithmics is nevertheless a respected name in technology. UbiCrawler was their first website crawler program, and is a tried and tested platform that was first developed circa 2002. In fact, at the Tenth World Wide Web Conference, their first report on UbiCrawler’s design won the Best Poster Award.
With a scalable architecture, the fully distributed website crawler is also surprisingly fault-tolerant. It’s also incredibly fast, capable of crawling upwards of a hundred pages per second, putting it ahead of many other open source website crawling solutions available online.
Composed of several autonomous agents that are coordinated to crawl different sections of the web, with built-in inhibitors to prevent UbiCrawler from scanning more than one page of any given site at a time (thus ensuring compliance with the Standard for Robot Exclusion).
A very new entrant in the realm of website crawlers, BUbiNG was recently released as the Laboratory for Web Algortihmics’ follow-up to UbiCrawler after ten years of additional research. In fact, in the course of developing BUbiNG as a working website crawler, the development team managed to break a server worth nearly $46,000. They also needed to reboot their Linux operating system after incurring bug #862758, but the experience they gained through the process has enabled them to design a code structure, so sound BUbiNG is reportedly capable of opening 5000 random-access files in a short space of time.
At present, the website crawler is still dependent on external plugins for URL prioritization, but as the team at the Laboratory for Web Algorithmics have proven, they’re hell-bent on eventually releasing a fully stand-alone product in the future.
Flax is a little-known but much-respected company that provides an array of open source web application tools, all of which are hosted on GitHub. Marple is their Lucene based website crawling framework program, designed with a focus on indexation.
Marple has two main components, namely a REST API and the React UI. The former is implemented in Java and Dropwizard and focuses on translating Lucene index data into JSON structure. The latter runs in the browser itself and serves to source the crawled data from the API. For this reason, Marple isn’t a true website crawler at this stage and instead piggybacks on other, established search engine indexes to build its own.
We weren’t quite sure whether or not to add Mechanize onto our list at first, but the more we looked into the website crawler, the more we realized it certainly deserves its place here. Developed in Perl, based on Andy Lester’s Python, and capable of opening (and crawling) HTTP, HTTPS, FTP, news, HTTP over SLL, and FTP over SSL, among others, it caught our eye more than once.
The framework’s coding structure allows for easy and convenient parsing and following functions to be executed, and also supports the dynamic configuration of user-agent features, including redirection, cookies, and protocol while negating the need to open a new command line (specifically build_opener) each time.
A start-up Ruby project by Charles H Martin, Ph.D., Cloud Crawler Version 0.1 is a surprisingly good website crawler framework considering it doesn’t appear to have been touched much by the developer since he released it in alpha phase back in April 2013.
A Sinatra application, cloud monitor, is used for supervising the queue and includes coding for spooling nodes onto the Amazon cloud.
Last (but not least) on our list is Storm Crawler, an open source framework designed for helping the average coder develop their own distributed website crawlers (although limiting them somewhat to Apache Storm), written primarily in Java.
It is in fact not a complete website crawling solution itself, but rather a library of resources gathered with the intention of being a single source point for Apache developers interesting in expanding the website crawler market. To get the full benefit of the package, you’ll need to create an original Topology class, but everything else is pretty much made available. Which isn’t to say you can’t write your custom components too, of course.