Choices for effective search engines have slowly been running out. It seems inevitable that an open source search engine project, independent from big tech, will be needed.

Some of my own tricks are:

  • Use a blacklist plugin to block low-quality sites from your search results.
  • Search for forum sites and communities rather than typing specific queries. (Wikipedia has a list of Internet forums that might be useful.)
  • For technical questions, favor Q&A websites like Stack Exchange.
  • YouTube videos often contain better information than search engine results. (Use a general search engine to find videos rather than YouTube's own search.)
  • Look for blogs and journals that specialize in the topic you’re searching for.
  • Use Boolean operators (AND, OR, exclusion, exact-phrase quotes) when the engine supports them.
  • Self-host and customize your own metadata search engine. Create a graph network linking websites by subject/topic. You may not be able to query specific questions, but you can discover sites that you otherwise can’t find in traditional search. This is a great way to discover hidden gems! (Example: https://internet-map.net/) A minimal sketch of such a site graph follows this list.
  • (Difficult) Self-host and scrape sites across the web to create your own queryable database. This would be the most effective way to search the internet and would be completely independent from potential enshittification and censorship. The cost, however, is quite high in terms of both hardware and time. Kiwix offers a way to download websites for offline use (e.g., Wikipedia, Stack Exchange), which is a good starting point for building your own custom search engine. A small scrape-and-index sketch also appears below.
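
Here's the minimal site-graph sketch mentioned above, assuming Python with the networkx package; the sites, topics, and weights are made up for illustration, not taken from any real dataset.

```python
# Minimal sketch of a topic graph of websites (hypothetical sites/weights for illustration).
# Requires: pip install networkx
import networkx as nx

G = nx.Graph()

# Nodes are sites tagged with a rough subject; edges mean "related by topic/links",
# with a weight expressing how strongly the two sites are connected.
sites = {
    "wikipedia.org": "reference",
    "stackexchange.com": "q&a",
    "lemmy.world": "forum",
    "arxiv.org": "stem",
}
for url, topic in sites.items():
    G.add_node(url, topic=topic)

G.add_edge("lemmy.world", "wikipedia.org", weight=0.8)
G.add_edge("stackexchange.com", "wikipedia.org", weight=0.9)
G.add_edge("arxiv.org", "wikipedia.org", weight=0.7)
G.add_edge("stackexchange.com", "arxiv.org", weight=0.5)

def related_sites(start, min_weight=0.6):
    """Discover neighbouring sites connected above a weight threshold."""
    return [
        (nbr, attrs["weight"])
        for nbr, attrs in G[start].items()
        if attrs["weight"] >= min_weight
    ]

print(related_sites("wikipedia.org"))
# e.g. [('lemmy.world', 0.8), ('stackexchange.com', 0.9), ('arxiv.org', 0.7)]
```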
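
And a rough sketch of the scrape-and-index idea, assuming the requests and beautifulsoup4 packages plus SQLite's FTS5 extension (bundled with most Python builds); the example URL is a placeholder, and a real crawler would also need robots.txt handling and rate limiting, which are omitted here.

```python
# Rough sketch: fetch pages and build a local full-text index with SQLite FTS5.
# Requires: pip install requests beautifulsoup4
import sqlite3
import requests
from bs4 import BeautifulSoup

db = sqlite3.connect("myindex.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(url, title, body)")

def index_page(url):
    """Download one page, strip the HTML, and store the plain text in the index."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else url
    body = soup.get_text(" ", strip=True)
    db.execute("INSERT INTO docs (url, title, body) VALUES (?, ?, ?)", (url, title, body))
    db.commit()

def search(query, limit=10):
    """Query the local index; FTS5's default rank is BM25, so ORDER BY rank puts the best matches first."""
    cur = db.execute(
        "SELECT url, title FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cur.fetchall()

# index_page("https://en.wikipedia.org/wiki/Network_theory")  # placeholder example
# print(search("network theory"))
```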

I would love to hear the tips and tricks you use! I hope this post helps others find information on the internet more efficiently!

  • ieatpwns@lemmy.world · 4 days ago

    I know you came here for answers, but how would one start making their own metadata search engine? Got any guides to point me towards? I hate Google so much that I’m willing to learn to make my own search engine.

    • cll7793@lemmy.world (OP) · 1 day ago

      It’s quite sad that we are now at a point where we are forced to make our own search engines from scratch. Search engines are hard! Google’s original search algorithm (about two decades ago) was quite amazing. You could give vague search terms and still find the answer you wanted. The secret sauce was ranking based on relevance to the search query. I’m not aware of any guides or projects on building search engines. I wish there were a good way I could search for this. (The irony!) But a great starting resource is Wikipedia’s article on network theory (https://en.wikipedia.org/wiki/Network_theory).

      Some random tips:

      • The main goal of any search engine should be to minimize the number of times a user returns to the ranking page to click on a new link. Big tech should be doing this anyway, but they have other goals.
      • The main metadata database needs to topologically connect you to any part of the internet (https://en.wikipedia.org/wiki/Graph_theory). Think of it as a hub/portal that gives you general directions but doesn’t tell you exactly where to head. The ideal solution would be to download everything on the internet and evaluate each result’s relevance to the search query individually, but that is intractable. Instead you have to group the internet into graphs and subgraphs: STEM, social, forums, e-commerce, etc. Hyperlinks offer an objective way to calculate connections between websites, for example Lemmy.world <-> Wikipedia.org. The weight of these connections gives you a way to guide a traversal algorithm during search (a rough sketch follows this list), and some form of semantic analysis lets you draw better connections, making the search more efficient.
      • The most powerful way to score relevance to a search term is with transformers and their attention mechanism. For example, if the search query is “Open source search engine”, the attention heatmap would land on groups of website subjects like forums, Q&A, programming, network science, etc., with a negative heatmap for topics like cooking, sports, entertainment, etc. From there you recursively load metadata for websites; for Lemmy that would be the titles of all posts (and maybe their top comments). If it fits, load as much of this as you can into a transformer and calculate the heatmap relative to the search query. Again, you are not using the transformer to generate answers (that is a bad idea); you are using it to rank search results by relevance/attention, which is what the mechanism is fundamentally designed for. (See the ranking sketch after this list.)
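
      To make the traversal idea concrete, here is one possible sketch using personalized PageRank over a weighted hyperlink graph (a standard algorithm I’m substituting for whatever traversal you’d actually pick); it assumes Python with networkx, and every site name and weight is invented for illustration.

```python
# Sketch: use hyperlink weights to decide which parts of the web graph a query should explore.
# Personalized PageRank spreads "relevance mass" from seed sites matched to the query's topic.
# Requires: pip install networkx
import networkx as nx

# Hypothetical hyperlink graph; edge weights ~ how strongly the sites link to each other.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("lemmy.world", "wikipedia.org", 0.9),
    ("stackexchange.com", "wikipedia.org", 0.8),
    ("wikipedia.org", "arxiv.org", 0.6),
    ("cooking-blog.example", "wikipedia.org", 0.2),
])

# Seed sites judged relevant to the query "open source search engine"
# (in a real system this would come from the topic/subgraph classification).
seeds = {"stackexchange.com": 0.6, "lemmy.world": 0.4}

scores = nx.pagerank(G, personalization=seeds, weight="weight")
for site, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {site}")
```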
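
      And one way to prototype the transformer-based ranking is with an off-the-shelf cross-encoder from the sentence-transformers library, which attends jointly over the query and each candidate’s metadata and outputs a relevance score; the model name and the metadata snippets below are just examples I’m assuming, not a specific recommendation.

```python
# Sketch: rank candidate sites' metadata against a query with a pretrained cross-encoder.
# The transformer is used only for scoring relevance, never for generating answers.
# Requires: pip install sentence-transformers
from sentence_transformers import CrossEncoder

query = "Open source search engine"

# Hypothetical metadata snippets gathered per site (e.g. post titles for a Lemmy instance).
candidates = {
    "lemmy.world/c/opensource": "Open source alternatives to Google search; self-hosting tips",
    "stackexchange.com": "How do web crawlers rank pages? Building an inverted index",
    "cooking-blog.example": "Ten quick pasta recipes for weeknights",
}

# Any cross-encoder reranking model works; this MS MARCO one is a common choice.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, text) for text in candidates.values()])

ranked = sorted(zip(candidates, scores), key=lambda kv: -kv[1])
for site, score in ranked:
    print(f"{score:+.2f}  {site}")
```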

      As a side note, you can tune your model to your own search preferences with little data, and you can trade computation time for search quality. This is amazing. If computation is a concern, traditional traversal algorithms and basic relevance/ranking algorithms work too, at the cost of more engineering (a cheap baseline is sketched below).
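
      For the cheaper traditional route, a TF-IDF baseline is one plausible sketch; it assumes scikit-learn, and the per-site metadata strings are placeholders.

```python
# Sketch of a cheap, non-transformer ranker: TF-IDF vectors + cosine similarity.
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {  # placeholder metadata per site
    "lemmy.world": "open source search engine self hosted federation",
    "wikipedia.org": "network theory graph theory encyclopedia articles",
    "cooking-blog.example": "recipes baking dinner ideas",
}

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs.values())

def rank(query):
    """Return sites ordered by cosine similarity between the query and site metadata."""
    q_vec = vectorizer.transform([query])
    sims = cosine_similarity(q_vec, doc_matrix)[0]
    return sorted(zip(docs, sims), key=lambda kv: -kv[1])

print(rank("open source search engine"))
```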

      I hope this sorta helps; if you have any other questions, feel free to ask! The future of search will likely be self-hosted, as conflicts of interest within current search engine providers degrade the quality to the point where they are unusable.