GPT's next frontier: Multi-modality as a path to Artificial General Intelligence

Aamir Mirza, AI researcher and data scientist, draws on more than a decade of experience in artificial intelligence to discuss the future of GPT-type models in their quest for Artificial General Intelligence (AGI).


Aamir Mirza

Data Scientist and Data Engineer

Humans, in all our glory, consider our intelligence the pinnacle of cognitive achievement, given how our ability to reason, adapt, and create shapes the world around us.


But as we explore artificial general intelligence (AGI), the question arises: is human intelligence truly the upper limit?


The rapid evolution of GPT models suggests that machines are on track to exceed our own cognitive capabilities, unlocking new dimensions of thought, problem-solving, and creativity that go beyond the confines of human understanding.


In this article, we’ll explore how advancements in generative AI are not only replicating aspects of human cognition but potentially transcending it, pushing the boundaries of what we thought possible with intelligence and innovation.


Bridging AI and human cognition


GPT models can evolve beyond their current capabilities to encompass a broader understanding of context, reasoning, and adaptability, ultimately aiming to bridge the gap between narrow AI and true general intelligence.


The human brain streams information from our senses 24/7, including sight, hearing, smell, taste, and touch, along with internal feedback such as pain or loss.


This continuous influx of sensory data is processed and integrated into complex neural networks, allowing us to perceive the world around us, make decisions, and interact with our environment in real-time.


Our brains have the remarkable ability to integrate this information into a single, unified model of the world.


This unified perception enables us to navigate our surroundings, recognise patterns, make predictions, and form (mostly, we hope) coherent interpretations of our experiences.


It is this holistic processing that underlies our ability to understand context, anticipate events, and adapt to changing circumstances. These are precisely the characteristics that pose significant challenges for artificial intelligence systems striving to achieve similar levels of comprehension and adaptability.


When it comes to AGI, we keep coming back to human intelligence because it serves as the yardstick by which we measure progress. We consider something truly intelligent when it can mimic most, if not all, aspects of our cognitive abilities.


What is AGI?


Human intelligence is understood to represent the pinnacle of evolution’s design, exhibiting a remarkable blend of creativity, problem-solving, emotional intelligence, and adaptability.


It encompasses the ability to process information efficiently and to understand context, learn from experience, and interact meaningfully with the world.

As such, achieving AGI entails replicating these diverse facets of human intelligence within an artificial system.


By striving to emulate human cognition, we aim to create machines capable of understanding and navigating the complexities of our reality, ultimately advancing technology to unprecedented heights and reshaping the future of humanity.

 

From the Multiverse to Multimodalities


Despite how impressive GPT models are, they still have a long way to go: the worldview of the current generation of models is built on a single modality, text.


To further simplify the matter, let's do a thought experiment:


  1. Imagine that you are stuck in a dark room, and your only view of the world is a series of symbolic tokens that appear out of nowhere, without any prior context.
  2. These tokens convey information about the world, but they lack the richness and depth of real sensory experience and background awareness. To build a realistic model of this world, all you can do is process the token stream and generate responses based solely on the patterns and correlations within it.
  3. This is precisely the position of GPT-type models: while they demonstrate impressive language understanding and generation capabilities, they cannot perceive and interact with the world as humans do, through multiple sensory modalities (see the sketch after this list).
  4. This limitation underscores the need for further research and development into AI systems that can integrate information from various modalities and achieve a more comprehensive understanding of the world, bringing us closer to the goal of AGI.
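

To make the "dark room" concrete, here is a minimal Python sketch using OpenAI's tiktoken library (my choice for illustration; any BPE tokenizer makes the same point). The model's entire view of a sentence is a sequence of integer token IDs, stripped of all the sensory richness behind the words.

```python
# pip install tiktoken
import tiktoken

# Load the BPE tokenizer used by recent GPT models.
enc = tiktoken.get_encoding("cl100k_base")

text = "A breathtaking sunset over the ocean."
token_ids = enc.encode(text)

# The model's whole "view" of this sentence is just this list of integers:
# no colours, no light, no prior context; only symbolic tokens.
print(token_ids)              # a list of integer IDs
print(enc.decode(token_ids))  # round-trips back to the original text
```

Everything such a model "knows" about sunsets has to be inferred from statistical patterns across billions of sequences like this one.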


AGI and the mind’s eye


There are certain situations where words fall short of what we are trying to convey.


Whether it's the intricate details of a complex concept, the beauty of a breathtaking landscape, or the subtle nuances of human emotions, sometimes words alone cannot capture the full depth and richness of an experience.


In such cases, visual imagery can be incredibly powerful, offering a direct and immediate way to communicate ideas, evoke emotions, and convey information.


By incorporating visual elements into our communication strategies, whether through photographs, diagrams, or illustrations, we can enhance understanding, engage our audience on a deeper level, and create more impactful messages.


In the journey towards AGI, exploring the integration of visual perception with language understanding will be a crucial next step, allowing AI systems to comprehend and communicate with the richness and complexity of the human experience.


The next step for the machines


Alphabet (Google’s parent) and OpenAI, two heavyweights of the tech industry, understand where these models need to go to fulfil the promise of AGI and create agents that truly perceive this world as humans do.


With their vast resources, expertise, and commitment to advancing artificial intelligence, Google and OpenAI are at the forefront of research and development in this field.


They recognise the importance of not only improving the capabilities of existing AI models but also pushing the boundaries of innovation to achieve a deeper understanding of human cognition and perception.


By collaborating with researchers, investing in cutting-edge technologies, and fostering a culture of experimentation and exploration, these organisations are driving forward the quest for AGI and paving the way for a future where intelligent machines can truly comprehend and interact with the world in a manner akin to human beings.


While OpenAI's GPT-4 offers somewhat limited multi-modal capabilities, Google's Gemini Ultra is, according to Google, designed from the ground up to be fully multi-modal (MM).
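

For a sense of what multi-modality looks like in practice, here is a hedged sketch using the OpenAI Python SDK's chat endpoint with a vision-capable model (the model name and image URL are placeholders, and exact parameters may differ between SDK versions): a single request that mixes text and an image.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One request combining two modalities: text and an image URL.
response = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder; any vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this picture."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```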


Multi-modality of this kind is a significant step on the path to AGI, but it leaves me with the feeling that we are not quite there yet.


Multi-modal models

 

Multi-modal models are increasingly recognised for their ability to address complex business cases across industries, integrating multiple types of data, such as text, images, and audio, to enhance decision-making and operational efficiency.
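

To ground the idea, here is a toy "late fusion" sketch in PyTorch (purely illustrative, not any vendor's architecture): each modality gets its own encoder, and the resulting embeddings are concatenated into one representation before a shared decision head.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy multi-modal model: separate encoders per modality, fused by concatenation."""

    def __init__(self, text_dim=768, image_dim=512, hidden=256, n_classes=3):
        super().__init__()
        self.text_encoder = nn.Linear(text_dim, hidden)    # stand-in for a language encoder
        self.image_encoder = nn.Linear(image_dim, hidden)  # stand-in for a vision encoder
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, n_classes),              # decision over the fused view
        )

    def forward(self, text_features, image_features):
        t = self.text_encoder(text_features)
        i = self.image_encoder(image_features)
        fused = torch.cat([t, i], dim=-1)  # one unified representation of both modalities
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512))  # batch of 4 paired inputs
print(logits.shape)  # torch.Size([4, 3])
```

Production systems use far richer encoders (transformers for text, vision transformers for images) and more sophisticated fusion strategies, but the principle of mapping every modality into a shared representation is the same.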


With the hindsight of over twelve years in the field of ML and AI, my two cents would be to invest in and create MMs designed for a specific sector, with the narrow scope of solving key business problems with great efficiency and accuracy.


In human terms, someone might be an expert in one domain and ineffective in another. So, if we wish to create an MM that is a great coder or developer, we do not much care how good it is at accounting.


The journey towards a new era of intelligence


Machines will never be able to interpret and experience the world the way humans do, but they have the potential to surpass our cognitive abilities.


Humans and AI already coexist. The next step is to evolve together, enabling AI to reach new heights, engage with the world in more meaningful ways, and unlock even more advanced approaches to creativity, problem-solving, and innovation.


Here at CUBE, we use enhanced AI and machine learning capabilities to ensure financial services firms are fully compliant with the rules, laws, and regulations relevant to them.


Find out how CUBE taps into the world of language models and AI technologies to streamline compliance so it’s far less of a headache for the compliance function.