AI has changed the shape of innovation. It doesn’t make it effortless, but it definitely makes it faster to express. What I saw in our recent internal AI hackathon was the opposite of “easy output”: teams were making thoughtful choices under tight constraints, shaping ideas into real workflows, and proving value through working experiences – many of which will be productised within our digital experience platform, Xperience by Kentico.
Here I’ll briefly explore how AI innovation is commonly evaluated today in product companies and in the broader market, and outline how our Kentico evaluation compared, where we did especially well, and what I’d suggest tightening for next time (because there will definitely be a next time!).
From tackling really hard problems to focusing on user delight, the breadth of our recent product-focused hackathon submissions encouraged me to explore a more philosophical question: how should we assess the value of AI innovation now that the pace of building is changing so dramatically?
How AI innovation is evaluated (by those who love frameworks)
Good innovation evaluation looks at both immediate impact and the future opportunities an idea can unlock. It enables product teams to balance quick wins with platform bets that expand their options over time. In practice, it means resisting two common traps: over-valuing what’s ‘shiny’ today, and under-valuing what builds durable advantage for tomorrow. A useful starting point is to assess innovation with a ‘product thinking’ or portfolio mindset that incorporates a time horizon:
1) Product thinking: not all innovation is meant to “ship next quarter”
While not AI-specific, the Three Horizons model (often associated with McKinsey’s strategy work) remains useful because it gives teams a language for evaluating ideas according to a time horizon:
- Horizon 1: improvements that strengthen today’s core (near-term impact),
- Horizon 2: adjacent bets that extend the business (mid-term),
- Horizon 3: longer-term options that could reshape the future.
Why should we consider this? Because hackathons can sometimes be biased toward Horizon 1 demos, or things that feel immediately useful. That’s not wrong, but it can inadvertently disadvantage ideas that are less “demo-ready” yet strategically important (e.g., foundations like orchestration, evaluation tooling, or reusable MCP capabilities). If we want fair evaluation, our criteria should explicitly ask: which horizon is this in, and are we judging it with the right expectations?
2) Lifecycle strength: AI is a system, not a feature
The OECD’s framing of the AI system lifecycle is a great reminder that “AI value” isn’t just the moment the model outputs something impressive. It spans planning and design, data work, model building/validation, and then deployment, operation, and monitoring. For evaluation, this is the difference between “Does it work in a demo?” and “Could we verify it, deploy it, and operate it safely and reliably over time?”
In a DXP context, I’d add one more practical layer to this lifecycle view: orchestration and connectivity. Many AI ideas are only truly valuable when they work seamlessly with the rest of the platform, such as content structures, permissions, workflows, channels, analytics, and (increasingly) multiple MCP servers. So part of evaluation becomes: is this a feature, or is it a capability that strengthens the platform?
If we apply this lens to hackathon judging, it supports a more mature discussion about readiness: not just how polished the UI looks, but whether the solution has the “operational bones” to be trustworthy and maintainable.
3) Scaling value is the hard part
A third lens is execution at scale. In the AI era, many organisations get stuck in “pilot purgatory” (I love this term – sorry, I can’t recall where I first heard it, but I know it was used in this HBR article!). Essentially, this means they are running lots of experiments but failing to convert them into scalable, supportable features. Given the speed of AI development, this is hardly surprising to me. But recent HBR work argues that a scattershot approach doesn’t create transformative impact. Instead, real value comes from focusing deeply on a particular area, redesigning workflows, and building an operating model and user experience that encourage user delight and stickiness. (Harvard Business Review)
This matters because, while AI makes prototypes cheaper, scaling is still hard. So the evaluation criteria need to ask questions like:
- How repeatable is this?
- Is it still valuable at mass-adoption scale, not just for individual use?
- Does it really reduce work end-to-end, or just move effort somewhere else?
That doesn’t mean every hackathon project must prove scale. But it does mean we should score (or at least discuss) what it would take to scale, because that’s where “innovation” turns into “impact.”
4) Trustworthiness & control: “agentic” AI needs reliability-by-design
If the first three lenses help us judge value, this one helps us judge whether we can safely trust and scale an AI capability, especially when it’s agentic and able to take actions, call tools, and operate across multiple systems.
A widely referenced approach here is the NIST AI Risk Management Framework (NIST, AI RMF Playbook), which organises AI risk work into four practical functions: Govern, Map, Measure, Manage. Even without going deep into risk terminology, it gives product teams a clean way to evaluate what makes an AI innovation operationally real:
- Govern: Are roles, permissions, and accountability clear (who approves, who owns outcomes)?
- Map: Do we understand the context, such as users, workflows, data sources, and where the model/agent can fail?
- Measure: Are we testing quality and reliability (accuracy, drift, error modes, cost-to-run, consistency)?
- Manage: Do we have guardrails and recovery (human-in-the-loop gates, logging, rollback, safe fallbacks)? (NIST, AI RMF Playbook)
“Why does this matter for a hackathon?” I hear you ask. AI can make a demo look finished, but agentic features only become valuable when they’re repeatable, observable, and controllable. They can’t just be impressive once and never again. Using a lightweight NIST-inspired check can encourage teams to design for trust early, which makes the path from prototype → product far clearer, and faster, by reducing productisation effort.
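To make that a little more concrete, here is a minimal sketch of what such a lightweight check could look like as a judging aid. It’s purely illustrative: the type names, questions, and helper below are my own assumptions, not part of the NIST Playbook or any Kentico tooling.

```typescript
// Illustrative only: a lightweight, NIST AI RMF-inspired readiness check for an
// agentic hackathon submission. Everything here is a hypothetical sketch.

type RmfFunction = "Govern" | "Map" | "Measure" | "Manage";

interface ChecklistItem {
  rmfFunction: RmfFunction;
  question: string;
  addressed: boolean; // did the team show evidence for this in their submission?
}

const agenticReadinessCheck: ChecklistItem[] = [
  { rmfFunction: "Govern", question: "Are roles, permissions and accountability for the agent clear?", addressed: true },
  { rmfFunction: "Map", question: "Are users, workflows, data sources and likely failure points identified?", addressed: true },
  { rmfFunction: "Measure", question: "Is quality tracked (accuracy, drift, error modes, cost-to-run, consistency)?", addressed: false },
  { rmfFunction: "Manage", question: "Are guardrails in place (human-in-the-loop gates, logging, rollback, safe fallbacks)?", addressed: false },
];

// List the RMF functions a submission has not yet addressed, so judges can see
// at a glance where the "operational bones" are still missing.
function unaddressedFunctions(checklist: ChecklistItem[]): RmfFunction[] {
  return checklist
    .filter((item) => !item.addressed)
    .map((item) => item.rmfFunction);
}

console.log(unaddressedFunctions(agenticReadinessCheck)); // -> ["Measure", "Manage"]
```

Even something this simple makes it visible, per submission, which of the four functions a team hasn’t thought about yet.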
Taken together, these frameworks suggest a simple shift: in AI-era innovation, we shouldn’t only judge what a team built in 2.5 days, but also which horizon it serves, whether it strengthens the platform across its lifecycle, and what it would take to productise it.
What we did at Kentico: a practical, multi-lens evaluation
What may come as no surprise to any of you is that we didn’t treat the hackathon as a science-fair novelty or demo competition – even though this was our first AI hackathon. We evaluated it like a product company. After all, we’re interested in solving our customers’ challenges, providing benefits, and removing pain points!
The judges were selected from across the company – Marketing, Sales, UX, Product Management and Engineering – which I believe resulted in a more considered, balanced approach to evaluation. This is something I feel was a real strength of the event. In many hackathons, judges can easily be swayed by “the coolness factor” or be impressed by someone’s humorous presentation skills.
Taking the diversity of the judging panel into account, the judges were asked to assess the following criteria across project submissions:
- Speed impact / productivity gain (time saved, steps removed)
- User impact (who benefits, how often, and how meaningful)
- Technical quality and polish (does it feel native, is it usable)
- AI tooling adoption (how effectively did the team use AI to build the solution – not just using AI within the solution)
- Overall impression & commercial potential (marketability, differentiation)
By drawing on the experience and insights of all areas of our business, this approach was designed to build a more objective picture of each idea’s value from different perspectives.
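For illustration only, here is one way a judging panel could roll those criteria up into a single comparable number per submission. The weights, the 1 to 5 scale, and the weighted-average approach are assumptions I’ve made for the sketch; they aren’t how our judges were formally asked to score.

```typescript
// Illustrative only: a hypothetical weighted roll-up of the hackathon criteria.
// The criterion names mirror the list above; the weights and the 1-5 ratings
// are invented for this sketch.

interface CriterionScore {
  criterion: string;
  weight: number; // relative importance; weights sum to 1
  rating: number; // a judge's rating on a 1-5 scale
}

const exampleScores: CriterionScore[] = [
  { criterion: "Speed impact / productivity gain", weight: 0.25, rating: 4 },
  { criterion: "User impact", weight: 0.25, rating: 5 },
  { criterion: "Technical quality and polish", weight: 0.2, rating: 3 },
  { criterion: "AI tooling adoption", weight: 0.15, rating: 4 },
  { criterion: "Overall impression & commercial potential", weight: 0.15, rating: 4 },
];

// A weighted average stops a single "wow" criterion from dominating the result.
function weightedScore(scores: CriterionScore[]): number {
  return scores.reduce((total, s) => total + s.weight * s.rating, 0);
}

console.log(weightedScore(exampleScores).toFixed(2)); // "4.05" for this example
```

Aggregating these roll-ups across a cross-functional panel is what turns individual impressions into the more objective picture described above.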
Where I think Kentico rated well
Using the earlier frameworks as a reference point, Kentico’s hackathon performed strongly where AI-era evaluation matters most: cross-functional realism, workflow fit, and native integration. In terms of the OECD’s lifecycle lens, we also did well at judging beyond “does it work?”, but unsurprisingly, most demos skewed towards Horizon 1 – immediate productivity and customer value:
- We rewarded ‘native-like’ rather than ‘bolted on’ approaches
Judges praised solutions that felt more natively Kentico, using Xperience by Kentico patterns and workflows instead of disconnected prototypes. That’s a meaningful marker of AI maturity because it correlates strongly with adoption and long-term maintainability. However, we mustn't forget that ‘disconnected’ or technical demos can also be extremely valuable for stretching the platform and discovering entirely new things.
- We valued time-to-value (not just technical ambition)
Tangible friction reduction was also highly rated: fewer steps, less context switching, less repetitive work, and productivity gains – be they major or micro.
Another thing that I believe shone throughout the judging process was the cross-functional lens that we used to reduce biases and blind spots. Marketing considered “would my team actually use this?”, Product questioned “is this an important problem to solve?”, UX thought “does it fit the workflow?”, Engineering explored “how good is the technical solution and code quality?”, and Sales asked themselves “could I sell and demo this to my prospects?”. This is closer to how mature organisations evaluate AI initiatives for business outcomes, and it’s especially important when AI can make demos look more complete than they really are (hello, vibe-coded prototypes!).
But even perfection can be improved upon, right?
Hindsight being 20/20, I can see that we were really focused on Horizon 1 innovation. So for the next product-focused AI hackathon we run, depending upon its focus, I think there are a number of things we could consider improving or incorporating in how we evaluate team submissions:
- Ask teams to consider and express the time horizon for the problem they are solving
While the majority of submission demos and judges’ comments focused on use cases with immediate user impact, there were a few standouts that tackled larger, longer-term challenges which, with more time investment, would significantly reduce time to market for campaigns and product launches. Highlighting the time horizon could really help these teams to better showcase the future value of their submissions.
- Add a clearer “scale test” dimension
Several judges’ comments revealed a natural tension: “this works” vs. “this would really wow if it scaled or was more deeply integrated into the product’s workflows.” Next time, without requiring it to be shown explicitly in the demo, we could ask teams to reflect upon and suggest: what could happen at 10x and 100x usage? What would need to be true for this to work across many customers, content models, or implementations?
- Separate “polish” from “product readiness”
Some projects earned high marks for their polish and UI completeness, but feedback also hinted at the odd less convincing user experience flow, or at connectivity between product capabilities not being clearly shown during the demo. Therefore, if we were to include ‘product readiness’ as a key theme for the evaluation next time, we may wish to incorporate some level of user experience evaluation criteria and ensure judges are clearly assessing along these lines. Otherwise, it’s great to keep the door open for more creative solutioning and focus on evaluating ideas instead of polish.
- Measure reliability-by-design for agentic workflows
We didn’t specify whether the AI approach should be fully or semi-autonomous agents initiated via a prompt window, or triggered as an action from elsewhere within the UI. As a result, we had a mixture of the two in the final submissions, which made it harder to define agent-specific evaluation criteria. Next time, we could consider scoring how clearly a team articulates the boundaries set for its agent(s). For the purposes of creative solutioning, verification steps and repeatability might not be something we would want to include as specific criteria, unless of course these were areas the teams actively chose to explore.
- Require a “value hypothesis” in one sentence + one metric
Something that most teams did, and something I’d like to encourage, was the inclusion of a value hypothesis at the beginning of their demo presentation: what did they set out to achieve for the user, and how close did they come within that constrained timeframe? Having this built into our presentation guidance would make our judging sharper, and it would make follow-through after the hackathon much easier.
Where we surprised ourselves (or what I think we should keep)
- AI has changed the “prototype ceiling”
The judges weren’t judging “does it work?” as much as “how close is this to something we’d ship?” That’s an important cultural shift, and it’s aligned with what the market is seeing: AI compresses build cycles dramatically, but it increases the importance of evaluation maturity.
- The best ideas were both ambitious and grounded
The strongest projects targeted real user pain (marketing throughput, content quality, multi-channel execution) while also demonstrating modern AI practices (agent interaction, orchestration patterns, workflow fit). That combination is exactly what BCG’s research suggests separates companies that get value from AI from those that don’t: it’s not so much ‘experimentation’ but rather embedding AI into real workflows and measuring the outcomes.
In summary: We’re raising the bar!
This hackathon showed that we naturally gravitated toward the kinds of evaluation that the world’s leading innovation teams rely on. We judged ideas:
- Through multiple lenses such as product, engineering, UX, marketing and sales, to give every idea a fair and realistic reading.
- With an eye for integration, rewarding solutions that felt native to Xperience, not just clever demos.
- With a focus on impact by assessing how speed, clarity, and reduced effort could provide a better experience for real users, and
- With simple, visible goals and signals that showed what success could look like beyond the hackathon moment.
As a result, this event was full of ideas that were both imaginative and immediately meaningful. The teams showcased demos that made you think, “Yes, I could see this in the product.” And to me, that’s the real win: a culture that supports creative exploration and practical value at the same time. We’re not choosing between fun and focus. Instead, we’re working towards having both, which I believe is a huge advantage in the age of AI. Clearer vocabulary, stronger shared instincts, and more confidence about what “good AI innovation” looks like won’t constrain our creativity; they give us more room to be brilliant.
This hackathon showed what our teams can do in 2.5 days when they’re energised, curious, and unafraid to experiment. With the momentum we have now, I'm positive that the next one will be bolder, richer in creativity and even more fun!
Want to learn more about what we did in our AI Hackathon? Check out this article from our AI Enablement Officer, Alexandros Koukovistas.
Sources:
McKinsey & Company. (2009, December 1). Enduring Ideas: The three horizons of growth. McKinsey Quarterly.
OECD.AI. (n.d.). Advancing accountability in AI (AI system lifecycle description). Organisation for Economic Co-operation and Development.
Challagalla, G., Khan, M., & Beaulieu, F. (2025, November–December). Stop running so many AI pilots. Harvard Business Review.
National Institute of Standards and Technology. (n.d.). NIST AI RMF Playbook.
National Institute of Standards and Technology. (2024, July 26). Artificial Intelligence Risk Management Framework: Generative AI Profile (NIST AI 600-1).
Debbie Tucek
VP Product at Kentico. Product team leader and former Marketing Executive with more than 10 years’ experience across the product management and marketing spectrum – from startups to large organizations – I’m passionate about designing cloud software products and end-to-end services that customers love to use. My superpowers include my unending passion and enthusiasm for what I do and the problems my products solve, and my ability to project this passion to motivate others, be that in grooming sessions or on the international stage.