DALL-E 2 and the Importance of Artificial Intelligence System Evaluation

Olivier Blais
Aug 17, 2022


In June 2022, OpenAI released a new AI system called DALL-E 2, which can create realistic images and art from a description in natural language. Since then, this system has generated a LOT of keen interest amongst the data science, AI and artistic communities worldwide.

Interest over time, created by the author using Google Trends data

The reason for this interest is quite understandable. DALL-E 2 can autonomously create truly remarkable and unique images, far better than its predecessor, DALL-E 1. According to OpenAI, evaluators preferred DALL-E 2 over DALL-E 1 71.7% of the time for caption matching and 88.8% of the time for photorealism.

Here are three outstanding examples of DALL-E 2 outputs of the kind OpenAI and the community usually showcase.

3 selected examples from a production version of DALL-E 2, created by the OpenAI research team

This system sparked many discussions about how much room is left for humans in creative and design work, which created fear amongst artists and design professionals. Here are the kinds of articles that were published at the time.

However, my experience using and testing DALL-E 2 has been, quite honestly… a little disappointing overall. Many of my queries returned images like the ones below: fuzzy, misshapen, deformed images with several spelling errors.

3 random examples from a production version of DALL-E 2, created by the author

Don’t get me wrong, I still believe DALL-E 2 is a significant step forward in creative capabilities, and the future is bright. However, I question why the articles and the news convey such a distorted view of reality.

I think this example is perfect for discussing key components of model evaluation.

Cherry-picking

Humans tend to cherry-pick. Sometimes this favours AI systems, as in this case, and sometimes it works against them. This is often observed when building enterprise AI or software systems that might trigger significant organizational changes.

Cherry-picking, or confirmation bias, is detrimental to the evaluation of your AI system, as it skews the conclusions one way or the other. From a change management perspective, this is also problematic, as end users might not all feel the same way about the system. This emotional gap can generate cognitive dissonance.

To mitigate confirmation bias, a neutral and rigorous methodology should be implemented. In this case, OpenAI’s evaluation process consisted of evaluators comparing 1,000 images generated by DALL-E 1 and DALL-E 2 for photorealism, caption similarity and diversity. However, very few details on the full methodology are available on the DALL-E 2 website or in the associated research paper.
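To make the aggregation concrete, here is a minimal sketch of how such pairwise judgments can be reduced to a preference rate per criterion. The data structure and records below are hypothetical, purely for illustration; this is not OpenAI’s actual tooling or data format.

```python
from collections import defaultdict

# Hypothetical pairwise judgments: one record per (image pair, evaluator, criterion),
# indicating which model's image the evaluator preferred.
judgments = [
    {"criterion": "photorealism", "preferred": "DALL-E 2"},
    {"criterion": "photorealism", "preferred": "DALL-E 1"},
    {"criterion": "caption_similarity", "preferred": "DALL-E 2"},
]

def preference_rate(judgments, model="DALL-E 2"):
    """Share of pairwise comparisons in which `model` was preferred, per criterion."""
    wins, totals = defaultdict(int), defaultdict(int)
    for j in judgments:
        totals[j["criterion"]] += 1
        if j["preferred"] == model:
            wins[j["criterion"]] += 1
    return {criterion: wins[criterion] / totals[criterion] for criterion in totals}

print(preference_rate(judgments))
# {'photorealism': 0.5, 'caption_similarity': 1.0} for the toy records above;
# OpenAI reports 88.8% and 71.7% respectively on its full evaluation set.
```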

Baselines and objectives

The OpenAI team set out to develop a system that beats its predecessor, DALL-E 1. Based on the communicated metrics, it met that objective.

Concluding that DALL-E 2 can replace human designers is a stretch, as the evaluation process does not compare DALL-E 2 with human-generated images. The baseline should not be DALL-E 1, but a vast library of existing photos, designs, and art pieces.

The complexity of evaluating large AI systems

The vaster a system is, the more complex its quality evaluation becomes. For example, are 1,000 random examples enough when users can ask the system about virtually anything, on any topic? Probably not. The broader a system’s scope, the greater the risk of sampling error, as it becomes difficult for a sample to cover all the salient characteristics of the population.

Sampling Error, created by Investopedia
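To put a rough number on this, here is a back-of-the-envelope sketch (my own, using the standard normal approximation for a proportion): 1,000 random comparisons give a tight margin of error on the overall preference rate, but as soon as the results are sliced by topic, each slice carries far fewer examples and much wider uncertainty.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p observed on n samples
    (normal approximation to the binomial)."""
    return z * math.sqrt(p * (1 - p) / n)

# Overall: 1,000 comparisons with an observed preference rate around 72%.
print(f"n=1000: +/- {margin_of_error(0.717, 1000):.1%}")  # ~ +/- 2.8%

# Sliced across, say, 50 topics, each topic gets only ~20 comparisons.
print(f"n=20:   +/- {margin_of_error(0.717, 20):.1%}")    # ~ +/- 19.7%
```

In other words, an overall score computed from 1,000 random prompts can look precise while saying very little about any specific type of request.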

One of the best ways to improve our evaluation is to use a “user acceptance testing” methodology.

User Acceptance Testing

User Acceptance Testing consists of having actual users test the software to determine whether it does what it was designed to do in real-world situations, validating the changes made and assessing adherence to the organization’s business requirements.

I always suggest that organizations build their user acceptance tests by interviewing users and asking them to curate the scenarios based on their experience in the field.

In the context of DALL-E 2, this could mean hiring multiple artists and designers as evaluators and curating test scenarios using the evaluators’ knowledge.

Why are we doing this? Because 1,000 random examples is a very small sample for such a broad system. However, 1,000 curated examples could be enough, since they would reflect the system’s actual usage.

Then, let’s have the evaluators run through the test scenarios. If a predetermined proportion of test scenarios is good enough, according to predefined metrics, user acceptance testing is passed, as the system meets the user requirements.
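As a minimal sketch of what such an acceptance gate could look like, under the assumption that each curated scenario is rated by an evaluator and that a predefined pass rate must be reached (the scenario structure, rating scale and thresholds below are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class TestScenario:
    prompt: str   # curated by a domain expert, e.g. an artist or designer
    rating: int   # evaluator's score for the generated image, on a 1-5 scale

# Hypothetical curated scenarios already rated by evaluators.
scenarios = [
    TestScenario("movie poster for a sci-fi western, photorealistic", 4),
    TestScenario("bakery logo with the readable text 'Sweet Oven'", 2),
    TestScenario("oil painting of a foggy harbour at dawn", 5),
]

ACCEPTANCE_RATING = 4     # a scenario passes if rated at least 4 out of 5
REQUIRED_PASS_RATE = 0.8  # predefined proportion of scenarios that must pass

pass_rate = sum(s.rating >= ACCEPTANCE_RATING for s in scenarios) / len(scenarios)
verdict = "passed" if pass_rate >= REQUIRED_PASS_RATE else "failed"
print(f"pass rate: {pass_rate:.0%} -> UAT {verdict}")
# pass rate: 67% -> UAT failed
```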

Performance monitoring

Another essential evaluation technique is monitoring user behaviour once the platform is in use. Monitoring capabilities are imperative because most complex AI systems cannot realistically be tested exhaustively (i.e. with 100% coverage), so actual usage may uncover unpleasant surprises. AI systems can also degrade over time if production data drifts.

Model degradation, created by David Sklenar

OpenAI could develop monitoring mechanisms by explicitly asking users whether they are satisfied with the images generated for their requests, or by implicitly calculating the number of shares or downloads associated with user requests. However, as you can suspect, there are drawbacks to both techniques.
An explicit online performance metric might require extra development and can be impossible to collect in some cases (for example, for an automated trading algorithm). However, this approach enables more precise monitoring. On the other hand, implicit metrics are easier to calculate but might generate false or noisy signals.
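Here is a minimal sketch of what those two signals could look like once aggregated over a monitoring window. The event structure, the alert threshold and the metrics themselves are my own assumptions for illustration, not an existing OpenAI mechanism.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestEvent:
    thumbs_up: Optional[bool]  # explicit signal: user feedback prompt (None if unanswered)
    downloaded: bool           # implicit signal: did the user download a generated image?

# Hypothetical events collected over the last monitoring window.
events = [
    RequestEvent(thumbs_up=True, downloaded=True),
    RequestEvent(thumbs_up=None, downloaded=False),
    RequestEvent(thumbs_up=False, downloaded=True),
    RequestEvent(thumbs_up=None, downloaded=True),
]

answered = [e for e in events if e.thumbs_up is not None]
explicit_satisfaction = sum(e.thumbs_up for e in answered) / len(answered)
implicit_download_rate = sum(e.downloaded for e in events) / len(events)

ALERT_THRESHOLD = 0.6  # alert if either metric drops below this over the window
for name, value in [("explicit", explicit_satisfaction), ("implicit", implicit_download_rate)]:
    flag = " (ALERT)" if value < ALERT_THRESHOLD else ""
    print(f"{name}: {value:.0%}{flag}")
# explicit: 50% (ALERT)
# implicit: 75%
```

Tracking such metrics over time is also what makes data drift visible: a slow decline in both signals is often the first hint that production data no longer matches what the system was evaluated on.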

In conclusion

In conclusion, although DALL-E 2 demonstrates awe-inspiring capabilities, it is far from perfect. This AI system is an excellent example of why quality evaluation is difficult to achieve, even though it is critical for the impacted stakeholders, and potentially for the world, when an imperfect AI system is released without proper evaluation.


Written by Olivier Blais

Cofounder & VP Decision Science at Moov AI and Editor of ISO/IEC TS 25058, the technical specification providing guidance for the quality evaluation of AI systems.
