Choices for effective search engines have slowly been running out. It seems inevitable that an open source search engine project, independent from big tech, will be needed.
Some of my own tricks are:
- Use a blacklist plugin (like uBlacklist) to block sites from search results.
- Search for forum sites and communities instead of specific queries. (Wikipedia has a list of forums that might be useful)
- For technical questions, favor Q&A websites like Stack Exchange.
- YouTube videos often offer better information than results from search engines. (Use a general search engine rather than YouTube's own search to find them.)
- Look for blogs and journals that specialize in the topic you’re searching for.
- Use boolean search when possible.
- Self-host and customize your own metadata search engine. Create a graph network linking websites based on subject/topic. You may not be able to query specific questions, but you can discover sites that you otherwise can't in traditional search. This is a great way to discover hidden gems! (Example: https://internet-map.net/; a toy sketch of the idea follows this list.)
- (Difficult) Self-host and scrape sites across the web in order to create your own queryable database. This would be the most effective way to search the internet and would be completely independent from potential enshittification and censorship. The cost, however, is quite high both in terms of hardware and time. Kiwix offers a way to download websites for offline use (e.g. Wikipedia, Stack Exchange). This is a good starting point to build your own custom search engine.
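To make the graph idea above concrete, here's a toy sketch using networkx. The sites, topics, and edge weights are all made up for illustration; a real version would derive edges from scraped hyperlinks:

```python
# Minimal sketch of a metadata graph of websites (invented data).
import networkx as nx

G = nx.Graph()
# Nodes are sites tagged with a rough subject; an edge means
# "these sites link to each other", weighted by how often.
G.add_node("lemmy.world", topic="forums")
G.add_node("en.wikipedia.org", topic="reference")
G.add_node("stackexchange.com", topic="qa")
G.add_edge("lemmy.world", "en.wikipedia.org", weight=5)
G.add_edge("en.wikipedia.org", "stackexchange.com", weight=3)

# "Hidden gem" discovery: walk outward from a site you already
# trust and list its neighbors, strongest links first.
start = "en.wikipedia.org"
neighbors = sorted(G[start].items(), key=lambda kv: kv[1]["weight"], reverse=True)
for site, attrs in neighbors:
    print(site, attrs["weight"], G.nodes[site]["topic"])
```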
I would love to hear the tips and tricks you use! I hope this post helps others in more efficiently finding information on the internet!
A fun weekend project was to set up a local model to tool-call OpenWeather and Wolfram Alpha through their APIs for factual data retrieval and local weather info.
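The weather tool boils down to very little code. A minimal sketch, assuming an OPENWEATHER_API_KEY environment variable (my own naming choice); the model-side tool-call plumbing is left out:

```python
import os
import requests

def get_weather(city: str) -> str:
    """Tool the local model can call: current weather for a city."""
    resp = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        # API key read from an env var; the var name is an assumption.
        params={"q": city, "appid": os.environ["OPENWEATHER_API_KEY"], "units": "metric"},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    return f"{city}: {data['weather'][0]['description']}, {data['main']['temp']} °C"
```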
Someone in our community showed off tool-calling articles on a local instance of Wikipedia through a Kiwix server and a ZIM file, and that seems like a really cool project too.
I would like to scrape preprints from arXiv and do basic RAG with them. I'd also like to find a way to have a local version of OEIS, or see if there's an API to scrape.
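For the arXiv part there's no need to scrape: the export API serves Atom feeds that feedparser can read directly. A minimal sketch (the query is just an example):

```python
import feedparser

query = "cat:cs.IR AND all:search engines"  # example query, adjust to taste
url = ("http://export.arxiv.org/api/query"
       f"?search_query={query.replace(' ', '+')}&start=0&max_results=5")
feed = feedparser.parse(url)
for entry in feed.entries:
    print(entry.title, entry.link)
    # entry.summary holds the abstract; a natural chunk to embed for RAG.
```

And for OEIS, I believe the search endpoint can return JSON via a fmt=json parameter (e.g. https://oeis.org/search?q=A000045&fmt=json), which would beat scraping.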
So I guess my solution is to use automation tools to automate data retrieval from wikis and databases directly: RSS, direct APIs, scrapers, and tool calling.
Use resources available through your library’s website
Ask Lemmy.
Help, I’m in a loop - between this ask Lemmy and this comment
You know how unhinged our grasp on reality is, right?
Wat. Naow…
looks left - Someone installing Guix
looks right - Someone browsing their Immich library on Librewolf
looks behind - Someone commenting about how we shouldn’t support fascists
looks across the street - Google, Apple, Facebook, Walmart, MAGA, OBEY, SLEEP, MARRY AND REPRODUCE
Well… fuck.
Well, OP just said “efficiently” … nothing about the quality. So you are technically correct.
for research/STEM, look for sites like ResearchGate and others for peer-reviewed papers. articles, magazines, and blogs are not good sources unless they cite said research paper and link you to the proper site, and it's important not to take it out of context, which might lull people into pseudoscience beliefs. some people jump the gun on these sites, which are basically articles, often using dumbed-down wording. universities/colleges often have access to most if not the full library of papers that are usually behind publishers' paywalls; if you somehow can get access to those, go for it.
If you’re interested in building a new general purpose search engine, it probably makes the most sense to start with Common Crawl’s data set and augment it rather than starting from scratch.
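You can poke Common Crawl's CDX index over HTTP before committing to any WARC downloads. A quick sketch; the collection name below is just an example, the current ones are listed at https://index.commoncrawl.org/:

```python
import requests

resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2024-10-index",  # example collection
    params={"url": "lemmy.world/*", "output": "json", "limit": "5"},
    timeout=30,
)
resp.raise_for_status()
for line in resp.text.splitlines():
    # One JSON record per capture: URL, plus the WARC file, offset,
    # and length you'd need to fetch the actual page content.
    print(line)
```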
Thanks! That’s a good idea!
SearXNG with Brave, DuckDuckGo, Google, Mullvad Leta, Mullvad Leta Brave, and Qwant as the search engines. The law of large numbers makes it quite useful.
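It's scriptable too: SearXNG can return JSON, assuming the instance has the json format enabled in its settings.yml. A sketch; the instance URL is a placeholder for your own:

```python
import requests

resp = requests.get(
    "https://searx.example.org/search",  # placeholder instance URL
    params={"q": "open source search engine", "format": "json"},
    timeout=15,
)
resp.raise_for_status()
# Each result aggregates whatever the enabled engines returned.
for result in resp.json().get("results", [])[:10]:
    print(result["title"], "->", result["url"])
```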
I noticed some of the best resources from the past are unfindable on any search engine. For example, some science YouTube channels that offer amazing quality content seem to be unfindable; they have been replaced with other channels that clickbait their way to the top. The same can be said of websites that SEO as much as they can. The highest quality resources are also often the scarcest. Quantity over quality is favored and amplified, and sometimes the best resources are even censored. (Anna’s Archive)
Utilizing books from a shadow library like Anna’s Archive (you can use Wikipedia to find the right domains), you can read prior written material for academic subjects, relevant books on various subjects from the pre-internet era, and so forth. Some users from newer fields (such as 3D printing/CAD) are going as far as to upload their PDF works onto Anna’s for distributed access.
There are some paid options that are pretty good (I’m thinking Kagi).
Easy, but one obvious downside.
Does Kagi let you add a domain to a denylist (like a well-SEOed genAI site with inaccuracies you’ve noticed), or positively bias search results (like saying you want Wikipedia entries high in the list)?
It does. You can outright block domains, rank them higher or lower and I think even pin them to the top.
It’s one of their best features. No ads is the best one, since that also means you get real results and no “sponsored” bullshit. They also have AI slop filters.
For whatever reason, Wikipedia seems to have been really pushed down the page on search engines specifically for medical information. It’s a shame, because I can acquire the surface level of information (which is all I really ever need) way faster from Wikipedia than from the other sites that come to the top of the list (Mayo Clinic, Johns Hopkins, Cleveland Clinic, government sites).
I really shouldn’t complain about it too much, because they could be pushing pseudoscience blogs.
Go back to 2022 and run your search then
Someday this will be possible when an open source search engine comes around.
Uh… There are open source search engines. They’re about on par with ddg or qwant, nowhere near even Bing from a few years ago
FYI, duckduckgo literally is bing search results, just anonymized. Saying Bing is better just makes you a corporate stooge.
FYI, ddg initially used Google, and FYI, you’ll get slightly different results because it’s more complicated than that
But none of that was the point I was trying to make. I’m referencing the quality of Bing from a few years ago
I know you came here for answers, but how would one start making their own metadata search engine? Do you have any guides to point me towards? I hate Google so much I’m willing to learn to make my own search engine.
It’s quite sad that we are now at a point where we are forced to make our own search engines from scratch. Search engines are hard! Google’s original search algorithm (about two decades ago) was quite amazing: you were able to give vague search terms and still find the answer you wanted. The secret sauce was ranking based on relevance to the search query. I’m not aware of any guides/projects on search engines. I wish there was a good way I could search for this. (The irony!) But a great starting resource is this article on network theory from Wikipedia. (https://en.wikipedia.org/wiki/Network_theory)
Some random tips:
- The main goal of any search engine should be to minimize the number of times a user returns to the ranking page to click on a new link. Big tech should be doing this anyways but they have other goals.
- The main metadata database needs to topologically connect you to any part of the internet. (https://en.wikipedia.org/wiki/Graph_theory) Think of it as a hub/portal that gives you general directions but doesn’t tell you exactly where you should be heading. The ideal solution is to download everything from the internet and query each result individually for relevance to a search query, but this is intractable. Instead you have to group the internet into graphs and subgraphs: STEM, Social, Forums, E-commerce, etc. Hyperlinks offer an objective way to calculate connections between websites, for example Lemmy.world <-> Wikipedia.org. The weight of these connections gives you a way to guide a traversal algorithm during search. Semantic analysis of some form lets you find more efficient ways to draw connections, making your search more efficient.
- The most powerful way to find connection/relevance to a search term is with transformers and their attention mechanism. For example, if the search query is “Open source search engine”, the attention heatmap would be on groups of website subjects like Forums, Q&A, Programming, Network Science, etc. There would also be a negative heatmap for topics like Cooking, Sports, Entertainment, etc. From there you recursively load up metadata for websites; for Lemmy that would be the title of all posts (and maybe their top comments). If it fits, load as much of this as you can into a transformer and calculate the heatmap relative to the search query. Again, you are not using the transformer to generate answers (this is a bad idea); instead you are using it to rank search results in terms of relevance/attention, which is what the transformer is fundamentally designed for. (A rough sketch follows after the side note below.)
As a side note, you are able to tune your model to your own search preferences with little data. You are also able to exchange computation time for search quality! This is amazing. If computation is a concern, traditional traversal algorithms and basic relevance/ranking algorithms work too but at the cost of more engineering.
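Here is that rough sketch of the ranking half, with one big simplification: instead of reading attention heatmaps directly, it uses embedding cosine similarity from sentence-transformers as a cheap stand-in. The site metadata is invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "open source search engine"
# Invented metadata snippets standing in for scraped titles/summaries.
candidates = {
    "lemmy.world": "Forum threads on self-hosting, search, and federation",
    "en.wikipedia.org/wiki/Network_theory": "Network theory, graphs, centrality",
    "allrecipes.example": "Quick weeknight dinner recipes",
}

q_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(list(candidates.values()), convert_to_tensor=True)
scores = util.cos_sim(q_emb, doc_embs)[0]

# Highest-scoring metadata first; cooking should land at the bottom,
# which is the "negative heatmap" effect described above.
for score, site in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{score:.3f}  {site}")
```

Fine-tuning this model on your own clicks is where the “tune with little data” part would come in.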
I hope this sorta helps; if you have any other questions feel free to ask! The future of search will likely be self-hosted, as conflicts of interest within current search engine providers degrade the quality to the point where they are unusable.
- StartPage, Mojeek, SearXNG, YaCy
- hyperlink surfing “extranets”, as you would Wikimedia, Wikipedia, Internet Archive, Fediverse posts, etc.
- web scrapers like Monolith etc. for offline PIR and, just as you say, the convenience of having it all there
i look forward to reading what you come up with, because i am still kinda at the theoretical stage with keeping such a knowledgebase.
edit: i keep thinking a plaintext document of information is way simpler to deal with than webpages. at what point is information posted online preserved in its “original” form? just dumping this FediThread into a plaintext file, or a folder of plaintext files with names being ‘hierarchy•postID•username’ or something, so it is presented self-organized.
OP is ¤, 1st rank comments are ¤a ¤b ¤c, 2nd rank comments attached to comment ¤a are ¤a-a ¤a-b ¤a-c, and 3rd rank comments attached to ¤a-c are ¤a-c-a ¤a-c-b ¤a-c-c, and so on. this then lists itself in a self-organized way, given all ASCII & Unicode characters are provided in order, not just a-z… because that would limit the number of posts it could take on. (a tiny sketch of this follows below.)
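a tiny sketch of that numbering, assuming a-z siblings only (which is exactly the size limit mentioned above); the thread data is made up:

```python
import string

def assign_ids(comments, prefix=""):
    """comments: list of (text, replies) pairs; yields (id, text) in thread order."""
    for letter, (text, replies) in zip(string.ascii_lowercase, comments):
        cid = f"{prefix}-{letter}" if prefix else f"¤{letter}"
        yield cid, text
        yield from assign_ids(replies, cid)  # recurse into replies

thread = [("first reply", [("nested reply", [])]), ("second reply", [])]
for cid, text in assign_ids(thread):
    print(f"{cid}.txt  # {text}")  # filenames that sort into thread order
```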
of course, more difficult and complicated solutions like selfhosting webservers and managing ports and databases exist… not that i grasp the necessity for so many services.
Finding the balance of what to keep in the index is hard! The attention mechanism in transformers should be pretty good at ranking results. The idea is to feed titles, top answers, etc. into context in bulk along with a search query; the attention heatmap relative to the search gives you a general rank for how good each result is. Ironically enough, this is probably the most powerful indexer, yet no big tech uses it this way, instead having the model generate answers rather than rank them. The best part is, this system is tunable and can be adjusted to user preference with little data. The overall goal should be to minimize the number of results a user checks. (This should be what other engines are doing in the first place.)
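As a sketch of that idea, a cross-encoder reranker scores each (query, title) pair jointly, i.e. with attention over both at once. The model name is one common public reranker (not an endorsement) and the titles are invented:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how to self host a search engine"
titles = [
    "Setting up SearXNG behind nginx",          # invented examples
    "Best pasta recipes for busy weeknights",
    "YaCy peer-to-peer search: first impressions",
]
# Score each (query, title) pair; higher means more relevant.
scores = model.predict([(query, t) for t in titles])
for score, title in sorted(zip(scores.tolist(), titles), reverse=True):
    print(f"{score:.2f}  {title}")
```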
Deny list plugins!?? I’d been looking for a search engine with that built in. It seems so obvious. I didn’t even think to look up a plugin. I had been writing keyword searches for browsers that manually added the query params for particularly frustrating results.
Super useful plugin! You can also subscribe to lists that block SEO/AI-generated websites. Now if only there were a whitelist plugin that placed forums higher up.
Just found uBlacklist.
Now to find something for whitelist searches (basically I only ever want recipes or medical information from a small list of sites).
Edit: duckduckgo has the capability built in, too
I’d highly recommend using ZIM to download the websites you want! (https://wiki.openzim.org/wiki/Build_your_ZIM_file)
Once downloaded, you honestly can probably get better results from a basic notepad search than from Google/DuckDuckGo/Bing.
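Better than notepad search, actually: ZIM files ship a full-text index you can query with python-libzim. A sketch, assuming a downloaded Wikipedia ZIM (the filename is a placeholder):

```python
from libzim.reader import Archive
from libzim.search import Query, Searcher

zim = Archive("wikipedia_en_all_nopic.zim")  # placeholder filename
searcher = Searcher(zim)
search = searcher.search(Query().set_query("network theory"))
# getResults(offset, limit) yields entry paths inside the archive.
for path in search.getResults(0, 10):
    entry = zim.get_entry_by_path(path)
    print(entry.title, "->", entry.path)
```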
Kagi has that feature built in though it is a paid search engine.