Evaluation Methodology

This page describes the methodology huby uses to evaluate AI products and produce its Product Transparency & Curation Reports. It is published in full to allow readers, product owners, and enterprise buyers to independently assess the validity and limitations of our findings.

1. Purpose and Scope

This methodology applies to all AI products evaluated on the huby platform. Product-type-specific adaptations are documented in companion framework specifications where applicable.

2. Product Classification

Every evaluation begins with product classification. We first assign the product to a product type (e.g., conversational AI assistant, code generation tool, AI-powered search engine) and then identify a product subtype by mapping it against the competitive landscape. Classification determines which subcategories and atomic factors are activated within the evaluation framework.

Classification decisions are documented and disclosed in the opening section of each report so readers understand the basis of comparison.

3. Evaluation Framework

3.1 Structure

The evaluation framework is organized in three tiers:

Category — Six broad dimensions common across all product types (e.g., Security)
Subcategory — Measurable dimensions specific to the product subtype (e.g., Access Control Mechanisms)
Atomic factor — The specific, verifiable data point or capability being assessed (e.g., availability of native multi-factor authentication)

The number of subcategories within each category, and the number of atomic factors within each subcategory, are not uniform. They are determined during framework establishment based on the complexity and relevance of each dimension for the specific product subtype being evaluated. This variability is intentional — it ensures the framework reflects the actual shape of the evaluation problem rather than imposing artificial symmetry.
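As an illustration only, the three tiers can be pictured as nested records. The sketch below is not huby's internal data model: the class and field names are hypothetical, and the single example entry reuses the Security, Access Control Mechanisms, and multi-factor authentication examples given above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AtomicFactor:
    """Hypothetical record for one verifiable data point or capability."""
    name: str
    score: Optional[float] = None  # None means no score has been assigned yet


@dataclass
class Subcategory:
    """Hypothetical record; holds however many factors the subtype needs."""
    name: str
    factors: List[AtomicFactor] = field(default_factory=list)


@dataclass
class Category:
    """Hypothetical record for one of the six categories."""
    name: str
    subcategories: List[Subcategory] = field(default_factory=list)


# Lists are deliberately variable in length: nothing forces every category
# to have the same number of subcategories or factors.
security = Category(
    name="Security",
    subcategories=[
        Subcategory(
            name="Access Control Mechanisms",
            factors=[
                AtomicFactor("Availability of native multi-factor authentication"),
            ],
        )
    ],
)
```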

3.2 The Six Evaluation Categories

Quality

The degree to which the product delivers accurate, consistent, and reliable outputs relative to its stated purpose. Indicators include output accuracy and consistency, user experience design, product documentation and developer support, independent benchmarks, and reported defect rates and resolution times.

Use Cases & Pricing

The breadth and depth of supported use cases, the presence of novel or differentiated capabilities, and integration with relevant ecosystems. Pricing assessment covers tier structure, free access provisions, licensing model transparency, and value for money relative to competitive alternatives. Adoption signals such as user growth and retention are considered where data is available.

Security

The controls in place to protect the product, its infrastructure, and user data from unauthorized access and exploitation. Evidence considered includes third-party security audit reports, published CVEs and remediation records, penetration testing outcomes, static and dynamic code analysis results, API security posture, data encryption in transit and at rest, and access control mechanisms including multi-factor authentication and role-based access where applicable.

Privacy

The transparency, consistency, and user-centricity of the product’s data handling practices. Assessment covers privacy policy clarity and enforceability, relevant compliance certifications (SOC 2, GDPR, CCPA/CPRA, HIPAA where applicable), data sharing practices with third parties, data retention and deletion controls, and user rights regarding access, portability, and consent.

Sustainability & Reliability

Evaluated across two dimensions. Company sustainability assesses organizational longevity indicators including funding status and runway, revenue trajectory, team depth, and market position. Service reliability assesses uptime commitments, published SLAs, historical incident records, and performance benchmarks. Environmental and social sustainability practices are noted where publicly documented.

Impact, Ethics & Safety

The degree to which the product and its developer act responsibly toward users and society. Assessment covers the existence and substance of a published ethics policy, demonstrated practices around bias mitigation and model transparency, controls to prevent misuse and harm to users, societal applications and misuse risk, and the company’s track record of responding to ethical failures or safety incidents.

4. Scoring

4.1 Scale

All scores are reported on a 1.0–5.0 scale at atomic factor, subcategory, and category levels.

4.2 Score Definitions

5.0 — Exemplary. Strong, independently verified evidence of best-in-class practice. Formal audits, third-party certifications, or peer-reviewed assessments confirm performance.
4.0–4.9 — Strong. Multiple credible, corroborated sources substantiate the claim. Minor gaps exist but do not materially undermine the finding.
3.0–3.9 — Adequate. Evidence is present but limited, mixed, or partially corroborated. The product meets a basic standard but with notable gaps or inconsistencies.
2.0–2.9 — Weak. Evidence suggests meaningful shortfalls. Claims made by the product owner are not substantiated by independent sources, or documented failures are present.
1.0–1.9 — Deficient. Material failures or absence of the capability or control. Significant risk to users or enterprise buyers exists.

Scores between integers reflect graduated evidence quality within a band (e.g., 3.7 indicates the upper range of “Adequate” but falls below the “Strong” threshold). Scores are not assigned where evidence is insufficient. In such cases the factor is recorded as Not Assessable (N/A) and disclosed in the report. N/A factors are excluded from score calculations rather than treated as zero.
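The band definitions above can be read as a simple lookup. The following is a minimal sketch, assuming scores are reported to one decimal place and that an unscored factor is represented as None; the function name band_label is hypothetical and is not part of any huby tooling.

```python
from typing import Optional


def band_label(score: Optional[float]) -> str:
    """Map a 1.0-5.0 score to its band name; None stands for N/A."""
    if score is None:
        return "Not Assessable (N/A)"
    if not 1.0 <= score <= 5.0:
        raise ValueError("score must lie between 1.0 and 5.0")
    if score == 5.0:
        return "Exemplary"
    if score >= 4.0:
        return "Strong"
    if score >= 3.0:
        return "Adequate"
    if score >= 2.0:
        return "Weak"
    return "Deficient"


print(band_label(3.7))   # Adequate
print(band_label(None))  # Not Assessable (N/A)
```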

4.3 Score Aggregation

Scores are aggregated upward through the three-tier hierarchy using a weighted model established during framework design, not a simple average. Weights are assigned at each level before any data collection begins, ensuring they reflect considered analytical judgment about relative importance rather than being adjusted post-hoc.

Atomic factor to subcategory. Each atomic factor carries a defined weight reflecting its relative criticality. Factor weights within each subcategory sum to 100%.

Subcategory to category. Each subcategory carries a defined weight reflecting its centrality to the category’s overall assessment. Subcategory weights within each category sum to 100%.

Weights are defined in the framework specification for each product subtype and published alongside the corresponding evaluation report. They are not adjusted after data collection begins. N/A factors are removed from the weight distribution and remaining weights are normalized to sum to 100% before aggregation.
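To make the renormalization step concrete, here is a minimal sketch of one aggregation level. It assumes each item is a (weight, score) pair and that an N/A factor is represented as None; the function name and the example weights and scores are invented for illustration and do not come from any published framework specification.

```python
from typing import List, Optional, Tuple


def aggregate(items: List[Tuple[float, Optional[float]]]) -> Optional[float]:
    """Weighted mean of (weight, score) pairs, skipping N/A (None) scores.

    Dividing by the sum of the remaining weights is equivalent to
    renormalizing those weights to 100% before aggregation.
    """
    assessed = [(w, s) for w, s in items if s is not None]
    total_weight = sum(w for w, _ in assessed)
    if total_weight == 0:
        return None  # nothing assessable at this level
    return sum(w * s for w, s in assessed) / total_weight


# Hypothetical subcategory with three factors weighted 50%, 30%, 20%.
# The third factor is N/A, so the remaining weights become 62.5% and 37.5%.
factors = [(50, 4.0), (30, 3.0), (20, None)]
print(aggregate(factors))  # 3.625
```

The same function can be applied again to (subcategory weight, subcategory score) pairs to produce a category score, since both levels follow the same sum-to-100% rule.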

Category scores are further aggregated into a single composite score. Users can see assigned scores at all levels.

5. Data Collection

5.1 Source Hierarchy

huby uses a tiered source framework. Higher-tier sources take precedence when sources conflict. All sources cited in a report must be independently accessible via a verifiable URL.

Tier 1 — Independent third-party audits and certifications. SOC 2 reports, CVE databases, named security firm assessments, peer-reviewed research.
Tier 2 — Established independent journalism and analysis. Major technology publications, analyst firms, academic institutions.
Tier 3 — Product owner documentation. Official privacy policies, terms of service, technical documentation, changelogs.
Tier 4 — Structured user and practitioner evidence. Enterprise case studies from named customers, verified professional community posts.
Tier 5 — Anecdotal community evidence. Reddit threads, social media posts, informal reviews.
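The precedence rule can be sketched as picking the lowest-numbered tier among conflicting sources. This is an illustrative reading, not huby's actual tooling: the Evidence record, the preferred function, and the example.com URLs are all hypothetical, and the sketch drops any source without a URL to reflect the requirement that cited sources be independently accessible.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    """Hypothetical record for one cited source."""
    tier: int   # 1 (independent audits) through 5 (anecdotal community evidence)
    claim: str
    url: str


def preferred(sources: List[Evidence]) -> Evidence:
    """Return the highest-tier (lowest tier number) source that has a URL."""
    citable = [s for s in sources if s.url]
    if not citable:
        raise ValueError("no source with a verifiable URL")
    return min(citable, key=lambda s: s.tier)


# Two sources that disagree about the same capability (invented data).
conflict = [
    Evidence(3, "Vendor documentation states MFA is supported", "https://example.com/docs"),
    Evidence(1, "Audit report notes MFA is not enforced", "https://example.com/audit"),
]
print(preferred(conflict).claim)  # Audit report notes MFA is not enforced
```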

5.2 Product Owner Data

Because many AI products do not have a long history, we require lesser-known product owners to submit their product documentation, technical specifications, and supporting materials directly to huby for consideration. Product owner submissions do not receive preferential weighting and are identified as such in citations.

huby does not accept payment from product owners in exchange for favorable scores. Commercial relationships with product owners, where they exist, are disclosed in the relevant report.

5.3 Competitive Context

Where sufficient public data exists, scores are contextualized against the assessed product’s competitive set. This does not change the score itself but provides readers with relative positioning. Where competitive benchmarking is not possible, this limitation is disclosed.

5.4 Data Normalization

Raw evidence is normalized across four dimensions before scoring: recency (evidence older than 24 months is discounted unless no newer data exists), source independence (per the hierarchy above), specificity (general claims are weighted below specific, verifiable assertions), and corroboration (number of independent sources confirming a finding).
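The recency rule can be sketched as follows. The methodology does not state how large the discount is, so this example only flags which items would be discounted and applies no numeric factor. It also assumes one particular reading of "unless no newer data exists": if every item for a finding is older than 24 months, none is discounted. The function names are hypothetical.

```python
from datetime import date
from typing import List


def older_than(published: date, as_of: date, months: int = 24) -> bool:
    """True if `published` lies more than `months` calendar months before `as_of`."""
    shifted = (published.year * 12 + published.month + months, published.day)
    return shifted < (as_of.year * 12 + as_of.month, as_of.day)


def recency_discount_flags(dates: List[date], as_of: date) -> List[bool]:
    """Flag evidence to discount for age, per the assumed reading above."""
    stale = [older_than(d, as_of) for d in dates]
    if all(stale):
        # No newer data exists for this finding, so nothing is discounted.
        return [False] * len(dates)
    return stale


evaluation_date = date(2025, 6, 1)  # invented evaluation date
print(recency_discount_flags([date(2022, 1, 1), date(2024, 12, 1)], evaluation_date))
# [True, False]
print(recency_discount_flags([date(2022, 1, 1)], evaluation_date))
# [False]
```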

6. Analytical Process

6.1 Independence and Conflict Management

Our analysis is not influenced by product owners, and we fully disclose any work we do with them. Score assignments at subcategory level and above go through multiple reviews before they are published.

6.2 Evidence Gaps

Where a product capability or control cannot be assessed due to insufficient public evidence, this is explicitly recorded as an evidence gap rather than inferred from absence. Evidence gaps are disclosed in the report and may result in an N/A score rather than a low score.

6.3 Report Versioning and Updates

Each report carries a production date. Reports are reviewed for material updates on a regular basis, and sooner when we are notified of a significant product change, security incident, or regulatory development that warrants earlier revision. We plan to track score changes between versions and disclose them with explanatory notes.

7. Report Structure

huby produces two report types for each evaluated product:

Public Transparency Report. Covers all six evaluation categories with subcategory-level scores and written assessments. Evidence is cited with verifiable URLs. Available to all users of the huby platform at no cost.
Detailed Owner Report. Produced for the product owner. Includes atomic factor-level scores and analysis, specific improvement recommendations mapped to each factor, and competitive gap analysis. This report is confidential to the product owner.

8. Limitations

Reports are based on publicly available evidence and product owner submissions at the time of evaluation. They do not constitute a formal security audit, legal compliance certification, or financial advisory opinion.
Scores reflect the state of the product at the report production date and may not reflect subsequent changes.
The absence of evidence for a capability or control does not confirm its absence — it reflects the limits of publicly available information at evaluation time.
huby staff are not lawyers, certified security auditors, or financial advisors. Where legal, security, or financial interpretation is required, readers should seek qualified professional advice.

9. Terminology

Atomic factor — The smallest unit of evaluation; a specific, verifiable capability or control.
Corroboration — Confirmation of a claim by an independent source separate from the originating source.
Evidence gap — A topic within the framework where insufficient public evidence exists to assign a score.
N/A (Not Assessable) — A factor excluded from score calculations due to an evidence gap.
Product subtype — A refined classification of a product within a product type, used to determine applicable subcategories.
Rollup — The aggregation of atomic factor scores into subcategory scores, and subcategory scores into category scores.

This methodology is published in full at huby.ai/methodology. For questions or feedback, contact support@huby.ai.