Our Methodology

How we collect, classify, and analyze news from multiple perspectives

Processing Pipeline

Every article passes through a six-stage pipeline before it reaches you. Each stage runs automatically and can process thousands of articles per day.

  1. Collection — RSS feeds are polled and new articles are deduplicated via SHA-256 content hashing
  2. Extraction — Full article text is retrieved using Mozilla Readability, stripping ads and boilerplate
  3. Classification — Language, political bias, entities, and sentiment are determined
  4. Embedding — Articles are converted to 768-dimensional vectors for semantic search
  5. Analysis — Keywords, summaries (short, medium, long), and sentiment scores are generated
  6. Clustering — Similar articles are grouped into multi-perspective "stories"
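The deduplication in stage 1 can be pictured as hashing normalized article text and skipping any hash already seen. A minimal TypeScript sketch (the normalization rules and the in-memory set are illustrative assumptions; the production pipeline presumably checks hashes against its database instead):

```typescript
import { createHash } from "node:crypto";

// Illustrative normalization: collapse whitespace and case so trivially
// reformatted copies of the same article hash identically.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// SHA-256 content hash of the normalized article text.
function contentHash(text: string): string {
  return createHash("sha256").update(normalize(text), "utf8").digest("hex");
}

// Keep only the first article seen for each content hash.
function dedupe(articles: string[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const article of articles) {
    const hash = contentHash(article);
    if (!seen.has(hash)) {
      seen.add(hash);
      unique.push(article);
    }
  }
  return unique;
}
```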

Data Collection

Newsar aggregates news from RSS feeds across the political spectrum. We carefully select sources to ensure balanced coverage from left-leaning, centrist, and right-leaning outlets.

Feed Selection Criteria

  • Established news organizations with consistent publishing schedules
  • Diverse political perspectives to avoid echo chambers
  • Regional variety to capture different geographic viewpoints
  • Multiple languages to provide international coverage

Classification System

Every article is automatically classified using a hybrid approach that combines rule-based methods with AI-powered analysis.

Political Bias Scale

We use a continuous scale from -1.0 (far left) to +1.0 (far right), with 0 representing center/neutral coverage:

  • Far Left: -1.0 to -0.6
  • Center-Left: -0.6 to -0.2
  • Center: -0.2 to +0.2
  • Center-Right: +0.2 to +0.6
  • Far Right: +0.6 to +1.0
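Mapping a continuous score onto these categories is a simple bucketing step. The sketch below assumes half-open intervals, since the published scale does not specify which side a boundary value such as -0.6 falls on:

```typescript
type BiasLabel = "Far Left" | "Center-Left" | "Center" | "Center-Right" | "Far Right";

// Bucket a bias score in [-1, 1] into the five published categories.
// Assumption: each interval is half-open [lo, hi), so a boundary value
// falls into the more rightward bucket.
function biasLabel(score: number): BiasLabel {
  if (score < -0.6) return "Far Left";
  if (score < -0.2) return "Center-Left";
  if (score < 0.2) return "Center";
  if (score < 0.6) return "Center-Right";
  return "Far Right";
}
```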

Classification Methods

1. Rule-Based Classification

For sources with established bias ratings (e.g., from Media Bias/Fact Check), we apply known bias scores directly. This provides fast, consistent classification for major outlets.

2. AI-Powered Analysis

For unknown sources, or to verify existing ratings, we use a 14-billion-parameter language model (qwen2.5:14b) running on dedicated GPU infrastructure to analyze:

  • Language Detection: Automatic identification of article language
  • Political Bias: Content-based bias analysis using LLM reasoning
  • Geographic POV: Identifying regional perspective in coverage
  • Named Entities: Extracting people, organizations, locations, and events
  • Sentiment: Overall tone and emotional content analysis
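One way to picture the output of this analysis is a single structured record per article, validated before it enters the database. The schema below is an illustrative assumption, not Newsar's actual data model:

```typescript
// Illustrative shape of one article's AI classification result.
interface Classification {
  language: string;       // e.g. "en"
  biasScore: number;      // -1.0 (far left) … +1.0 (far right)
  geographicPov?: string; // regional perspective, e.g. "US"
  entities: { name: string; type: "person" | "organization" | "location" | "event" }[];
  sentiment: number;      // -1.0 (very negative) … +1.0 (very positive)
  confidence: number;     // 0–1 certainty of the assessment
}

// Range-check an LLM response before trusting it downstream.
function isValidClassification(c: Classification): boolean {
  return (
    c.language.length > 0 &&
    c.biasScore >= -1 && c.biasScore <= 1 &&
    c.sentiment >= -1 && c.sentiment <= 1 &&
    c.confidence >= 0 && c.confidence <= 1
  );
}
```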

Entity Intelligence

Newsar maintains a knowledge base of over 56,000 named entities extracted from articles. Entities are categorized into four types:

  • People — Politicians, business leaders, public figures
  • Organizations — Companies, governments, NGOs, political parties
  • Locations — Countries, cities, regions involved in the news
  • Events — Elections, conflicts, summits, disasters

Entity Summaries

Top trending entities receive AI-generated summaries that are periodically refreshed. Each summary includes a short description, a detailed overview, and trending scores based on recent mention velocity.

Entity Network Graph

Co-occurrence analysis reveals relationships between entities. When two entities are frequently mentioned together across different articles, a connection is established. These relationships are visualized as interactive network graphs on entity detail pages, helping you discover non-obvious connections between people, organizations, and events.
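The co-occurrence counting behind these graphs can be sketched as follows; representing each article as a list of entity names is an assumption for illustration:

```typescript
// Count how often each pair of entities appears in the same article.
// Each edge key is the sorted pair "A|B"; the value is the number of
// articles mentioning both.
function coOccurrence(articles: string[][]): Map<string, number> {
  const edges = new Map<string, number>();
  for (const entities of articles) {
    const unique = [...new Set(entities)].sort();
    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        const key = `${unique[i]}|${unique[j]}`;
        edges.set(key, (edges.get(key) ?? 0) + 1);
      }
    }
  }
  return edges;
}
```

A graph view would then keep only edges above some minimum count, so that one-off coincidences don't appear as connections.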

Story Clustering

Articles covering the same news event are automatically grouped into "stories" using semantic similarity analysis.

How It Works

  1. Embedding Generation: Each article is converted into a 768-dimensional vector using the nomic-embed-text model
  2. Similarity Search: We use pgvector's cosine similarity to find related articles
  3. Clustering: A DBSCAN-based algorithm groups articles that exceed a similarity threshold (typically 0.75) into clusters
  4. Story Creation: Each cluster becomes a "story" with articles from different perspectives
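The similarity and grouping steps can be shown in miniature. The greedy single-link grouping below is a deliberate simplification of the DBSCAN-based algorithm described above; only the cosine-similarity measure and the 0.75 default threshold come from the text:

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Assign each article vector a cluster label: any unlabeled article
// sufficiently similar to a cluster seed joins that cluster.
function clusterArticles(vectors: number[][], threshold = 0.75): number[] {
  const labels = new Array<number>(vectors.length).fill(-1);
  let next = 0;
  for (let i = 0; i < vectors.length; i++) {
    if (labels[i] !== -1) continue;
    labels[i] = next;
    for (let j = i + 1; j < vectors.length; j++) {
      if (labels[j] === -1 && cosineSimilarity(vectors[i], vectors[j]) >= threshold) {
        labels[j] = labels[i];
      }
    }
    next++;
  }
  return labels;
}
```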

Story Trending

Stories receive dynamic trending scores that reflect how actively they are being covered:

  • Article velocity: How many new articles per hour are being published
  • Recency: Scores decay over 48 hours unless new coverage arrives
  • Source diversity: Stories covered by more unique sources rank higher
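A score combining these three factors might look like the sketch below. The weights, saturation points, and linear decay curve are illustrative assumptions; only the factors themselves come from the methodology:

```typescript
// Illustrative trending score in [0, 1] built from the three factors above.
function trendingScore(
  articlesLastHour: number,
  hoursSinceLastArticle: number,
  uniqueSources: number
): number {
  const velocity = Math.min(articlesLastHour / 10, 1);          // saturates at 10/hr (assumed)
  const recency = Math.max(0, 1 - hoursSinceLastArticle / 48);  // decays to 0 over 48h
  const diversity = Math.min(uniqueSources / 20, 1);            // saturates at 20 sources (assumed)
  return 0.4 * velocity + 0.4 * recency + 0.2 * diversity;      // weights are assumed
}
```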

Coverage Diversity Score

Stories receive a diversity score (0–1) based on:

  • Number of unique sources covering the story
  • Distribution across political bias categories (left, center, right)
  • Geographic diversity of sources
  • Time span of coverage
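These four factors could be combined as sketched below. The weights, saturation points, and the use of normalized entropy for bias balance are illustrative assumptions:

```typescript
// Illustrative diversity score in [0, 1].
// Bias balance uses normalized entropy: 1 when left/center/right coverage
// is evenly split, 0 when all coverage comes from one category.
function diversityScore(
  uniqueSources: number,
  biasCounts: [number, number, number], // [left, center, right] article counts
  uniqueCountries: number,
  coverageHours: number
): number {
  const total = biasCounts[0] + biasCounts[1] + biasCounts[2];
  let entropy = 0;
  for (const n of biasCounts) {
    if (n > 0) {
      const p = n / total;
      entropy -= p * Math.log(p);
    }
  }
  const biasBalance = total > 0 ? entropy / Math.log(3) : 0;
  const sourceFactor = Math.min(uniqueSources / 10, 1);  // saturation points assumed
  const geoFactor = Math.min(uniqueCountries / 5, 1);
  const timeFactor = Math.min(coverageHours / 24, 1);
  return 0.35 * sourceFactor + 0.35 * biasBalance + 0.2 * geoFactor + 0.1 * timeFactor;
}
```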

Content Analysis

Beyond classification, we perform additional analysis to help you understand each article:

Keyword Extraction

AI-powered identification of the most important terms and concepts in each article, with relevance scoring and category labels.

Summary Generation

Each article receives three levels of summary—short, medium, and long—created by AI to help you quickly understand the article's main points at whatever depth you need.

Sentiment Analysis

Overall tone assessment on a scale from very negative (-1.0) to very positive (+1.0), helping you understand the emotional framing of the coverage.

Quality & Transparency

Confidence Scores

Every classification includes a confidence score (0–1) indicating how certain the AI is about its assessment. Lower confidence suggests the article may have mixed signals or be difficult to classify.

Method Indicators

Each article shows whether it was classified using:

  • Rule-based: Known source with established rating
  • AI-analyzed: Content-based analysis for unknown sources
  • Hybrid: Combination of both methods

Continuous Improvement

Our classification system is constantly being refined. We:

  • Monitor classification accuracy across sources
  • Update source ratings as outlets evolve
  • Improve AI prompts based on performance
  • Add new sources to maintain balanced coverage

Infrastructure

All AI inference runs on dedicated GPU infrastructure managed via an on-demand cloud system. GPU pods are created automatically when processing jobs are queued and terminated after idle periods to optimize costs. No user data or reading habits are sent to third-party AI services—only raw article content is processed.

The full technology stack includes:

  • Nuxt 4 for the web application
  • PostgreSQL + pgvector for storage and semantic search
  • Ollama serving qwen2.5:14b (chat) and nomic-embed-text (embeddings)
  • BullMQ + Redis for job queuing and worker orchestration

Limitations & Biases

While we strive for accuracy and fairness, it's important to acknowledge limitations:

  • AI models can make mistakes or misinterpret nuanced content
  • The left-right political spectrum is simplistic and doesn't capture all viewpoints
  • Source selection inherently involves editorial decisions
  • Breaking news may not be immediately classified or clustered
  • Entity extraction can produce duplicates or miss contextual nuance

We encourage critical thinking and using Newsar as one of many tools for staying informed.

Have questions about our methodology or suggestions for improvement?

Learn More About Newsar