Under the hood of Newshound's news aggregation platform

In the coming weeks, we will be talking about how our news aggregator works under the hood. Please comment on the articles below to help us tackle our technical challenges better.

#1: Gathering and ranking news stories
#2: Clustering news stories at scale
#3: Discovering the biggest newsmakers of the day
#4: Searching news archives
#5: Analyzing news sentiment for fun and profit

#1: Gathering and ranking news stories

A news aggregator collects many news stories from many publishers, which raises the question: how do we surface the most important stories of the day? We use natural language processing techniques such as the bag-of-words model and approximate string matching to come up with the answer.
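To make the two techniques concrete, here is a minimal sketch in Python. It is an illustration under our own assumptions, not the production pipeline: the function names and sample headlines are hypothetical, cosine similarity is just one common way to compare bag-of-words vectors, and the standard library's `difflib.SequenceMatcher` stands in for whichever approximate string matching algorithm is actually used.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def bag_of_words(text):
    """Lowercase the text and count each word, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def headline_similarity(a, b):
    """Approximate string match on the raw headlines (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical sample headlines, for illustration only.
headlines = [
    "Central bank raises interest rates again",
    "Interest rates raised again by central bank",
    "Local team wins championship final",
]
bags = [bag_of_words(h) for h in headlines]
```

The first two headlines share most of their words, so their bag-of-words similarity is high even though the word order differs, while the third shares none. A story that many publishers cover in similar words is one plausible signal of importance.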

#2: Clustering news stories at scale

A news aggregator consumes thousands of news stories a day. Hashing, where a large number of words is mapped to a small number of integers using a hash function, dramatically improves the scalability of the bag-of-words model. We use data mining algorithms such as locality-sensitive hashing (LSH) with MinHash to cluster similar news stories at scale.
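The ideas above can be sketched in a few lines of Python. Treat this as a simplified illustration rather than the real system: the constants `NUM_HASHES` and `BANDS`, the two-word shingles, and the use of seeded BLAKE2b as a family of hash functions are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32  # signature length (assumed value)
BANDS = 8        # LSH bands, so 4 signature rows per band (assumed value)


def stable_hash(token, seed=0):
    """Deterministic 64-bit hash; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def hashed_bag(text, num_buckets=16):
    """Hashing trick: count words in a fixed number of integer buckets."""
    counts = [0] * num_buckets
    for word in text.lower().split():
        counts[stable_hash(word) % num_buckets] += 1
    return counts


def shingles(text, k=2):
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(tokens):
    """For each hash function, keep the minimum hash value over all tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return [min(stable_hash(t, seed) for t in tokens) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_clusters(signatures):
    """Group story ids whose signatures agree on every row of at least one band."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for story_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(story_id)
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}
```

Only stories that land in the same bucket need to be compared directly, which avoids comparing every story against every other story. More bands with fewer rows each catch less similar pairs at the cost of more false candidates.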

#3: Discovering the biggest newsmakers of the day

To find out who is dominating the airwaves that day, we make use of a natural language processing concept known as n-grams. We look for n-grams that represent proper names or places, count their frequency across all stories in a certain time period, and rank the top ten.
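A rough sketch of that counting step follows. The capitalization test is a deliberately crude stand-in for real proper-name detection and is our assumption, as are the function names; it will, for example, also pick up ordinary words that start a sentence.

```python
import re
from collections import Counter


def capitalized_ngrams(text, max_n=3):
    """Yield every n-gram (n = 1..max_n) whose words all start with a capital."""
    # Split on punctuation first so an n-gram never spans a clause boundary.
    for segment in re.split(r"[.,;:!?()\"]+", text):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", segment)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if all(w[0].isupper() for w in gram):
                    yield " ".join(gram)


def top_newsmakers(stories, top=10):
    """Count candidate names across all stories and return the most frequent."""
    counts = Counter()
    for story in stories:
        counts.update(capitalized_ngrams(story))
    return counts.most_common(top)
```

Calling `top_newsmakers(stories)` on the stories from a given time window returns up to ten `(name, count)` pairs, most frequent first.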

#4: Searching news archives

We can search our news archives by simple keywords or with natural language processing techniques such as stemming, synonyms, and Soundex. The search results are news stories that are ranked by relevance and displayed in a reverse chronological timeline.
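As an illustration of two of these techniques, the sketch below matches query terms by stem or by Soundex code. Several things here are assumptions for the example: the suffix-stripping `naive_stem` is far cruder than a real stemmer, the relevance score is simply the number of matching query terms, and newer stories win ties. Synonyms are left out.

```python
import re

SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(word):
    """American Soundex: first letter plus three digits, e.g. Robert -> R163."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def naive_stem(word):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def search(query, stories):
    """stories is a list of (iso_date, text); best score first, newest on ties."""
    terms = query.lower().split()
    results = []
    for date, text in stories:
        words = re.findall(r"[a-z]+", text.lower())
        stems = {naive_stem(w) for w in words}
        sounds = {soundex(w) for w in words}
        score = sum(1 for t in terms
                    if naive_stem(t) in stems or soundex(t) in sounds)
        if score:
            results.append((score, date, text))
    results.sort(key=lambda r: (r[0], r[1]), reverse=True)
    return [(date, text) for _, date, text in results]
```

With this sketch, a query for "elections" finds a story containing "Election" through stemming, and a query for "Smyth" finds "Smith" because both names share the same Soundex code.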

#5: Analyzing news sentiment for fun and profit

Is it good news or bad news? Did you know that human readers agree with each other only about 80% of the time? We use a natural language processing concept known as sentiment analysis to determine the polarity of a news story (positive, negative or neutral) with an estimated accuracy of at most 72%.
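One of the simplest ways to assign polarity is to count words from a sentiment lexicon. The sketch below shows that approach only as an illustration: the tiny `POSITIVE` and `NEGATIVE` word lists are made up for the example, and a lexicon count is not necessarily the method behind the accuracy figure quoted above.

```python
import re

# Hypothetical miniature lexicon; real lexicons contain thousands of entries.
POSITIVE = {"gain", "gains", "win", "wins", "growth", "recovery", "peace", "rescued"}
NEGATIVE = {"loss", "losses", "crash", "crisis", "war", "layoffs", "fraud", "killed"}


def polarity(text):
    """Classify text as positive, negative or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A word count like this ignores negation, sarcasm and context, which is part of why automated sentiment analysis struggles to match human judgment.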

Subscribe to our RSS feed to keep up with the latest from Newshound Engineering.
