Searching for an Answer
Head of Engineering
Charles is the up-and-coming Head of Engineering, charged with leading the engineering team into FY22 (and beyond)!
When you open up your favorite web browser, the first page to pop up is likely a search engine like Google. Search has become such an integral part of using the internet that most people take it for granted. But while engineering a recent project for a logistics company here at DEV, my team and I discovered the many barriers and challenges to quick, efficient, accurate search.
So how does a search engine like Google return millions of (often scarily relevant) results in just a few seconds or less? It’s a mind-bogglingly complicated process whose full intricacies no one outside Google will ever know, but it goes something like this: Google is constantly running web crawlers (also known as “spiders”) which virtually traverse the internet, following hyperlinks from one website to another and recording information, including the content of each site, as they pass through. One such piece of information is the number of outside sites linking to a certain website, a metric which can be used as a measure of a website’s popularity. All this data is stored in a database called an index.
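To make the crawl-and-index idea concrete, here is a minimal sketch in Python. The three-page “web,” the page contents, and the structure of the index are all invented for illustration; a real crawler would fetch pages over the network and handle vastly more data.

```python
from collections import deque

# Hypothetical in-memory "web": page -> (content, outgoing links).
pages = {
    "a.com": ("shipping rates calculator", ["b.com", "c.com"]),
    "b.com": ("freight shipping news", ["c.com"]),
    "c.com": ("logistics directory", []),
}

def crawl(start):
    """Breadth-first traversal that builds a simple inverted index
    (keyword -> pages containing it) and counts inbound links."""
    index = {}
    inbound = {p: 0 for p in pages}
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        content, links = pages[page]
        for word in content.split():
            index.setdefault(word, set()).add(page)
        for link in links:
            inbound[link] += 1          # one more outside site links here
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index, inbound

index, inbound = crawl("a.com")
```

Afterward, `index["shipping"]` lists every page mentioning “shipping,” and `inbound` holds the link counts that popularity metrics build on.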
When you enter a search term and hit “enter,” a list of websites which match your search keywords is retrieved from the index. However, this list might be billions of websites long, so the search engine’s even more essential task is ordering the results in terms of relevance. Each search engine has its own ranking algorithms, which take into account hundreds of factors, such as whether your search term appears in the title, or whether the keywords appear close together on the page.
Google’s first and most famous page ranking algorithm was called PageRank, named after Google co-founder Larry Page. This algorithm used the inbound-link counts mentioned above, plus some simple linear algebra (read up on Markov chains for more!), to estimate how popular and reputable a website is. Today, the PageRank algorithm is supplemented by machine learning models which can better predict a user’s desires and intentions, and which can fend off advertisers and spammers attempting to game the search engine.
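The Markov-chain intuition behind PageRank can be sketched in a few lines of Python. The three-page link graph is made up for illustration, and the 0.85 damping factor is the value commonly cited from the original PageRank paper; real implementations handle billions of pages with sparse matrix methods.

```python
# Minimal PageRank via power iteration on a tiny hypothetical link graph.
# links[i] lists the pages that page i links out to.
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)
damping = 0.85          # standard damping factor
ranks = [1.0 / n] * n   # start from a uniform distribution

for _ in range(50):     # iterate the Markov chain until it settles
    new = [(1 - damping) / n] * n
    for page, outgoing in links.items():
        share = damping * ranks[page] / len(outgoing)
        for target in outgoing:
            new[target] += share    # each link passes along some rank
    ranks = new
```

After iteration, `ranks` sums to 1 and page 2 scores highest: it is the only page linked to by two others, which is exactly the “more inbound links means more reputable” idea.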
All this happens in the blink of an eye. But Google has a huge advantage, apart from its massive computing power, over the search functionality we implemented in our recent project at DEV: because of Google’s ability to web crawl and index pages in advance, a Google search is not actually running in real time.
For my team’s recent project, we were given a huge database of over 1.6 million logistics companies that users would want to search and filter through to identify the company that would serve them best. At first, our searches simply matched on location information like zip code or identification numbers, and even with such a large database, SQL handled such queries quickly. However, our client justifiably wanted users to be able to filter and sort their results to find the most relevant options. Every new filter required additional SQL SELECT statements, and soon searches, although highly accurate and relevant to a user’s goals, went from taking a few seconds to up to a minute.
How can we improve the efficiency of this search in future versions? From Google, we can learn that indexing companies in our database in advance, perhaps by region or zip code, could greatly speed up search times. As for sorting, we currently calculate a company’s rating score at runtime, which, despite involving only simple addition and multiplication, adds up when there are millions of database entries. We could, like Google, cache this rating score in an index, so that the calculation does not have to be re-run every time.
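Both ideas, pre-computing the rating and indexing by zip code, can be sketched with Python’s built-in `sqlite3`. The schema, the column names, and the rating formula here are all hypothetical stand-ins, not our actual production schema; the point is that the score is computed once at write time and the filter/sort columns are indexed, so the search itself does no arithmetic and no full table scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE companies (
        id INTEGER PRIMARY KEY,
        zip TEXT,
        on_time REAL,     -- hypothetical rating inputs
        volume REAL,
        rating REAL       -- cached score, computed once at write time
    )""")

rows = [(1, "10001", 0.9, 100.0), (2, "10001", 0.7, 250.0), (3, "94105", 0.8, 50.0)]
for cid, zip_code, on_time, volume in rows:
    # Assumed weighting for illustration -- computed on INSERT, not per search.
    rating = 0.8 * on_time + 0.2 * (volume / 500.0)
    con.execute("INSERT INTO companies VALUES (?, ?, ?, ?, ?)",
                (cid, zip_code, on_time, volume, rating))

# Index the columns we filter and sort on, so lookups avoid a full scan.
con.execute("CREATE INDEX idx_zip_rating ON companies (zip, rating)")

top = con.execute(
    "SELECT id FROM companies WHERE zip = '10001' ORDER BY rating DESC"
).fetchall()
```

The trade-off is the same one Google makes: cached ratings can go slightly stale between updates, but every search gets dramatically cheaper.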
Finally, the database itself can be made more efficient. Currently, dozens of columns hold boolean (true/false) values but store them as strings: the letter “X” denotes “true” and an empty string denotes “false.” This forces our search algorithm to perform millions of string comparisons on every search, rather than much quicker boolean checks.
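A one-time migration fixes this. The sketch below again uses `sqlite3` with an invented column name (`hazmat_certified` is a placeholder, not one of our real columns): it adds a proper 0/1 column, backfills it from the legacy string flag, and then filters on the integer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Legacy schema: "X" means true, empty string means false.
con.execute("CREATE TABLE companies (id INTEGER, hazmat_certified TEXT)")
con.executemany("INSERT INTO companies VALUES (?, ?)",
                [(1, "X"), (2, ""), (3, "X")])

# One-time migration: a real 0/1 column, which the engine can compare
# (and index) far more cheaply than a string.
con.execute("ALTER TABLE companies ADD COLUMN hazmat INTEGER")
con.execute("UPDATE companies SET hazmat = (hazmat_certified = 'X')")

certified = con.execute(
    "SELECT id FROM companies WHERE hazmat = 1 ORDER BY id").fetchall()
```

After the backfill is verified, the old string column can be dropped so future writes can’t drift out of sync.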
Search is an incredibly complicated operation, but when it’s done well, it provides fast, relevant information to users, without requiring them to conduct the wearying work of searching themselves. This is DEV’s first time writing a search algorithm this large, and the lessons we’ve learned from iterating and from other companies’ solutions, like Google’s, will guide our future projects and improvements.