I have an anecdote for you.
"But We Wanted Her"
I have a friend who is genuinely leftist. He doesn't believe in God. He's very set in his ways. He is also pro-choice.
One day, the subject came up while we were waiting for a flight. Of course he spouted all of the usual narratives:
* A woman can do whatever she wants to her own body.
* It's not a life until it's born.
* It's just a clump of cells.
I'm going to set up a site for you all to test what I have so far.
It'll be a few days before I open it up since I have to do some user interface work (not my forte).
Plus it'll take a few days for the Web Crawler to visit the three sites I have permission to crawl:
* Daily Wire
* Hot Air
* Twitchy (I want this because it has a ton of embedded tweets and related images)
I spent the day writing the necessary code for a prototype search engine.
* Web Crawler - functional
* Data storage - functional
* Indexer - well, it can index, sort of
I won't actually be writing my own indexer. There are dozens of ready-made indexers. I also didn't write the data storage engine; I simply designed the data objects.
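If you want to picture what "designing the data objects" means, here's a minimal Python sketch. The names and fields are my own illustration, not the prototype's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical data objects for the crawl store. The fields mirror the
# pipeline: the crawler fills in url/html, the indexer fills in
# links/keywords later. Names are illustrative, not the real schema.
@dataclass
class CrawledPage:
    url: str
    html: str                                          # raw page as saved by the crawler
    links: list[str] = field(default_factory=list)     # outbound links found on the page
    keywords: list[str] = field(default_factory=list)  # filled in by the indexer

page = CrawledPage(url="https://example.com/post/1", html="<html>...</html>")
page.links.append("https://example.com/post/2")
```

The storage engine only needs to know how to persist and retrieve objects shaped like this; the design work is deciding what the shape is.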
Tomorrow I'll work on search algorithms. They'll have to be open source so I can tune them.
Ultimately, they're not searching entire documents.
They're searching context extracted from the content, weighted by how many people visited a site for various sets of keywords.
I've way oversimplified all of this because I'd have to write an encyclopedic volume of toots to explain it.
But, that's how search engines work in a nutshell.
The algorithm also decides, based on the context it derives from the content (which was "marked" by the indexer), what ads to show - or whether to show ads at all.
It then decides what other media types are relevant (the "images", "videos" and "news" tabs).
All of this is stored in a document-based library (kind of like a database). The document storage algorithms are what make it fast.
Many hundreds of billions of documents can be searched in under a second.
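A toy illustration of why keyword-keyed document storage is fast: lookup is by key, not by scanning documents, so the cost doesn't grow with the corpus. (The real systems shard this structure across thousands of machines; this is just the core idea.)

```python
# Toy document store: a dict keyed on keyword, holding lists of
# document IDs. Looking up a keyword costs roughly the same whether
# the store holds ten documents or ten billion - you never scan the
# documents themselves. Real engines shard this across many machines.
store: dict[str, list[int]] = {
    "election": [101, 205, 311],
    "border": [101, 412],
}

def lookup(keyword: str) -> list[int]:
    # Key lookup, not a scan - this is what makes the store fast.
    return store.get(keyword, [])

print(lookup("border"))  # [101, 412]
```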
This is where the "algorithms" are applied, and where Google can filter the content for various reasons (including what we consider nefarious ones).
But, primarily, the algorithm keeps track of how many times a particular page is visited (based on those tracking cookies you've heard about). Sites that are visited more frequently for a particular set of keywords will appear at the top of the results for that set of keywords.
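That visit-count ranking can be sketched in a few lines of Python. This is a deliberate simplification (real ranking blends hundreds of signals), and the log data here is made up:

```python
from collections import Counter

# Hypothetical click log: (keyword set, url visited) pairs, the kind
# of signal those tracking cookies feed back to the engine.
log = [
    (("tax", "reform"), "https://a.example"),
    (("tax", "reform"), "https://a.example"),
    (("tax", "reform"), "https://b.example"),
]

visits = Counter()
for keywords, url in log:
    visits[(frozenset(keywords), url)] += 1

def rank(keywords: set[str]) -> list[str]:
    # Most-visited sites for this exact keyword set come first.
    key = frozenset(keywords)
    scored = [(n, url) for (kw, url), n in visits.items() if kw == key]
    return [url for n, url in sorted(scored, reverse=True)]

print(rank({"tax", "reform"}))  # most-visited site first
```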
Once the indexers have run, they store the data in a heavily broken-down format that is all about keywords and context.
Think of it as millions of manila folders, each labeled with a set of keywords.
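In code, those manila folders are what's usually called an inverted index: a map from keyword to the pages filed under it. A minimal sketch (the page names and keywords are invented for illustration):

```python
# "Manila folders": an inverted index mapping each keyword to the set
# of pages filed under it. A page appears in every folder whose label
# it matched - exactly the many-combinations behavior described above.
index: dict[str, set[str]] = {}

def file_page(url: str, keywords: list[str]) -> None:
    for kw in keywords:
        index.setdefault(kw, set()).add(url)

file_page("pageA.html", ["border", "policy"])
file_page("pageB.html", ["border", "economy"])

# Pages filed under *both* "border" and "policy":
print(index["border"] & index["policy"])  # {'pageA.html'}
```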
Pages can appear for many different combinations of keywords. This is where the "search engine" takes over.
Something then has to make sense of the keywords people use: what language they're in, and which words are relevant (e.g., articles like "a" and "the" aren't).
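Filtering out those irrelevant words (usually called "stop words") is the simplest part of that step. A sketch, with a toy stop-word list — production engines keep per-language lists with hundreds of entries:

```python
import re

# A tiny stop-word list for illustration; real engines maintain
# per-language lists with hundreds of entries.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in"}

def relevant_terms(query: str) -> list[str]:
    # Lowercase, split into words, and drop anything in the stop list.
    words = re.findall(r"[a-z']+", query.lower())
    return [w for w in words if w not in STOP_WORDS]

print(relevant_terms("The state of the border"))  # ['state', 'border']
```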
This next program - which is really dozens of smaller processes - indexes these pages. It scans them for text, images, and links to other pages. It then compares all of this to determine what, if anything, the pages have in common. If there are things in common, it marks them as "related."
These "marks" become the keywords that these pages will appear for. Obviously, the language parsing is more complex than this, but that's the gist of it.
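One toy way to do that "related" marking is plain keyword overlap. This is my own simplification — real pipelines also compare link structure and use heavier language models, as noted above:

```python
# Toy "related" marking: two pages are related if their extracted
# term sets share enough words. The pages and terms are invented.
pages = {
    "a.html": {"border", "policy", "senate"},
    "b.html": {"border", "policy", "vote"},
    "c.html": {"recipe", "pie"},
}

def related(p1: str, p2: str, min_shared: int = 2) -> bool:
    # Mark as related when the pages share at least min_shared terms.
    shared = pages[p1] & pages[p2]
    return len(shared) >= min_shared

print(related("a.html", "b.html"))  # True  (share 'border' and 'policy')
print(related("a.html", "c.html"))  # False
```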
That's pretty tedious.
But that's what Google does. They have tens of thousands of "browsers" running constantly and pressing "CTRL+S" after visiting every link in every page (the "web" of links).
Only these browsers are small single-minded little programs that only save pages to disk. They're called "Web Crawlers" and they've been around since before Google.
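The crawler's loop is exactly the CTRL+S-in-every-tab routine from earlier, just automated. Here's a sketch that runs against a fake in-memory "web" so it needs no network; a real crawler would fetch over HTTP, respect robots.txt, and persist pages to disk instead of a dict:

```python
from html.parser import HTMLParser

# Simulated "web": URL -> HTML, so the sketch runs without a network.
FAKE_WEB = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/index.html">home</a>',
    "/b.html": 'no links here',
}

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]

def crawl(start: str) -> dict[str, str]:
    saved, queue = {}, [start]
    while queue:
        url = queue.pop()
        if url in saved:
            continue              # already "CTRL+S"-ed this page
        html = FAKE_WEB[url]      # the "visit" (a real crawler fetches HTTP)
        saved[url] = html         # the "save to disk"
        parser = LinkExtractor()
        parser.feed(html)
        queue += parser.links     # follow every link on the page
    return saved

print(sorted(crawl("/index.html")))  # ['/a.html', '/b.html', '/index.html']
```

The `saved` check is what keeps the crawler from looping forever on pages that link back to each other.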
The next piece is how these web pages are stored. The Web Crawlers save the full web page.
And another program takes over.
For those interested, and who may not have ever thought about how a search engine works, I'll give a brief high-level technical overview.
I'll start with what we're all familiar with: A web browser.
Go visit a page, any page. Then press "CTRL+S" and you'll be able to save that web page to your hard drive.
Depending on how you saved it, the entire page along with images may have been saved.
Now imagine clicking every link on that first page to open each one in a new tab.
Then CTRL+S in each tab.
I think I want to write a conservative search engine. Not a full web search. More for indexing conservative sites and weighting them higher in search results - or maybe some crafty algorithm to make it easier to see the contrast between "MSM" and conservative media (including the shitty ones like conservativebrief.com).
It would be good practice as my software engineering mind rots away in upper management.
You know where to find me.