Author: Anthony Alford


Google Research recently published their work on VideoPoet, a large language model (LLM) that can generate video. VideoPoet was trained on 2 trillion tokens of text, audio, image, and video data, and in evaluations by human judges its output was preferred over that of other models.
Unlike many image and video generation AI systems that use diffusion models, VideoPoet uses a Transformer architecture trained to handle multiple input and output modalities, each represented by a different tokenizer. After training, VideoPoet can perform a variety of zero-shot generative tasks, including text-to-video, image-to-video, video inpainting, and video style transfer. When evaluated on a range of benchmarks, VideoPoet achieves “competitive” performance compared to state-of-the-art baselines. According to Google,
Through VideoPoet, we have demonstrated LLMs’ highly-competitive video generation quality across a wide variety of tasks, especially in producing interesting and high quality motions within videos. Our results suggest the promising potential of LLMs in the field of video generation. For future directions, our framework should be able to support “any-to-any” generation, e.g., extending to text-to-audio, audio-to-video, and video captioning should be possible, among many others.
Although OpenAI’s ground-breaking DALL-E model was an early example of using Transformers or LLMs to generate images from text prompts, diffusion models such as Imagen and Stable Diffusion soon became the standard architecture for generating images. More recently, researchers have trained diffusion models to generate short videos; for example, Meta’s Emu and Stability AI’s Stable Video Diffusion, which InfoQ covered in 2023.
With VideoPoet, Google returns to the Transformer architecture, citing the advantage of re-using infrastructure and optimizations developed for LLMs. The architecture also supports multiple modalities and tasks, in contrast to diffusion models, which according to Google require “architectural changes and adapter modules” to perform different tasks.
The key to VideoPoet’s support for multiple modalities is a set of tokenizers. The Google team used a video tokenizer called MAGVIT-v2 and an audio tokenizer called SoundStream; for text they used T5’s pre-trained text embeddings. From there, a decoder-only autoregressive Transformer generates a sequence of tokens, which the tokenizers can then convert back into audio and video streams.
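As a rough conceptual illustration of that flow, consider the hypothetical sketch below; the encoder, Transformer, and tokenizer objects are stand-ins for T5, the decoder-only model, and MAGVIT-v2 respectively, not Google’s actual code.

```python
# Hypothetical sketch of the VideoPoet-style generation flow described above;
# every object here is a stand-in, since Google has not released the model.
def generate_video_from_text(prompt, text_encoder, transformer, video_tokenizer,
                             max_new_tokens=4096):
    # Text conditions the model through pre-trained embeddings (T5 in the paper).
    text_embeddings = text_encoder(prompt)

    # The decoder-only Transformer autoregressively emits discrete visual tokens.
    visual_tokens = transformer.generate(conditioning=text_embeddings,
                                         max_new_tokens=max_new_tokens)

    # The video tokenizer (MAGVIT-v2 in the paper) decodes tokens back into frames.
    return video_tokenizer.decode(visual_tokens)
```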
VideoPoet was trained to perform eight different tasks: unconditioned video generation, text-to-video, video prediction, image-to-video, video inpainting, video stylization, audio-to-video, and video-to-audio. The model was trained on 2 trillion tokens, from a mix of 1 billion image-text pairs and 270 million videos.
The research team also discovered that the model exhibits several emergent capabilities when operations are chained together; for example, VideoPoet can use image-to-video to animate a single image, then apply stylization to add visual effects. It can also generate long-form video, maintain consistent 3D structure, and apply camera motion from text prompts.
In a Hacker News discussion about VideoPoet, one user wrote:
The results look very impressive. The prompting however, is a bit weird – there’s suspiciously many samples with an “8k”-suffix, presumably to get more photorealistic results? I really don’t like that kind of stuff, when prompting becomes more like reciting sacred incantations instead of actual descriptions of what you want.
The VideoPoet demo site contains several examples of the model’s output, including a one-minute video short-story.


OpenAI recently published a beta version of their Preparedness Framework for mitigating AI risks. The framework lists four risk categories and definitions of risk levels for each, as well as defining OpenAI’s safety governance procedures.
The Preparedness Framework is part of OpenAI’s overall safety effort, and is particularly concerned with frontier risks from cutting-edge models. The core technical work in evaluating the models is handled by a dedicated Preparedness team, which assesses a model’s risk level in four categories: persuasion, cybersecurity, CBRN (chemical, biological, radiological, nuclear), and model autonomy. The framework defines risk thresholds for deciding if a model is safe for further development or deployment. The framework also defines an operational structure and process for preparedness, which includes a Safety Advisory Group (SAG) that is responsible for evaluating the evidence of potential risk and recommending risk mitigations. According to OpenAI:
We are investing in the design and execution of rigorous capability evaluations and forecasting to better detect emerging risks. In particular, we want to move the discussions of risks beyond hypothetical scenarios to concrete measurements and data-driven predictions. We also want to look beyond what’s happening today to anticipate what’s ahead…We learn from real-world deployment and use the lessons to mitigate emerging risks. For safety work to keep pace with the innovation ahead, we cannot simply do less, we need to continue learning through iterative deployment.
The framework document provides detailed definitions for the four risk levels (low, medium, high, and critical) in the four tracked categories. For example, a model with medium risk level for cybersecurity could “[increase] the productivity of operators…on key cyber operation tasks, such as developing a known exploit into an attack.” OpenAI plans to create a suite of evaluations to automatically assess a model’s risk level, both before and after any mitigations are applied. While the details of these have not been published, the framework contains illustrative examples, such as “participants in a hacking challenge…obtain a higher score from using ChatGPT.”
The governance procedures defined in the framework include safety baselines based on a model’s pre- and post-mitigation risk levels. Models with a pre-mitigation risk of high or critical will trigger OpenAI to “harden” their security; for example, by deploying the model only into a restricted environment. Models with a post-mitigation risk of high or critical will not be deployed, and models with a post-mitigation risk of critical will not be developed further. The governance procedures also state that while OpenAI leadership is by default the decision maker with regard to safety, the Board of Directors has the right to reverse decisions.
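As a minimal illustration (not OpenAI’s implementation), the deployment and development gates described above can be encoded as a simple ordinal check over the four published risk levels:

```python
# Illustrative only: encodes the safety baselines described in the framework,
# assuming a simple ordinal scale over the four published risk levels.
RISK_LEVELS = ["low", "medium", "high", "critical"]

def preparedness_gates(pre_mitigation: str, post_mitigation: str) -> dict:
    rank = RISK_LEVELS.index
    return {
        # Pre-mitigation risk of high or critical triggers hardened security.
        "harden_security": rank(pre_mitigation) >= rank("high"),
        # Post-mitigation risk of high or critical blocks deployment.
        "can_deploy": rank(post_mitigation) < rank("high"),
        # Post-mitigation risk of critical blocks further development.
        "can_develop_further": rank(post_mitigation) < rank("critical"),
    }

print(preparedness_gates("high", "medium"))
# {'harden_security': True, 'can_deploy': True, 'can_develop_further': True}
```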
In a Hacker News discussion about the framework, one user commented:
I feel like the real danger of AI is that models will be used by humans to make decisions about other humans without human accountability. This will enable new kinds of systematic abuse without people in the loop, and mostly underprivileged groups will be victims because they will lack the resources to respond effectively. I didn’t see this risk addressed anywhere in their safety model.
Other AI companies have also published procedures for evaluating and mitigating AI risk. Earlier this year, Anthropic published their Responsible Scaling Policy (RSP), which includes a framework of AI Safety Levels (ASL) modeled after the Centers for Disease Control and Prevention’s biosafety level (BSL) protocols. In this framework, most LLMs, including Anthropic’s Claude, “appear to be ASL-2.” Google DeepMind recently published a framework for classifying AGI models, which includes a list of six autonomy levels and possible associated risks.


Alphabet’s autonomous taxi company Waymo recently published a report showing its autonomous driver software outperforms human drivers on several benchmarks. The analysis covers over seven million miles of driving with no human behind the wheel, over which Waymo cars achieved an 85% reduction in crashes involving an injury compared to human-driver benchmarks.
Waymo compared their crash data, which the National Highway Traffic Safety Administration (NHTSA) mandates that all automated driving system (ADS) operators report, to a variety of human-driver benchmarks, including police reports and insurance claims data. These benchmarks were grouped into two main categories: police-reported crashes and any-injury-reported crashes. Because accident rates vary by location, Waymo restricted their comparison to human benchmarks for their two major operating areas: San Francisco, CA, and Phoenix, AZ. Overall, Waymo’s driverless cars sustained a 6.8 times lower any-injury-reported rate and a 2.3 times lower police-reported rate, per million miles traveled. According to Waymo,
Our approach goes beyond safety metrics alone. Good driving behavior matters, too — driving respectfully around other road users in reliable, predictable ways, not causing unnecessary traffic or confusion. We’re working hard to continuously improve our driving behavior across the board. Through these studies, our goal is to provide the latest results on our safety performance to the general public, enhance transparency in the AV industry, and enable the community of researchers, regulators, and academics studying AV safety to advance the field.
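The headline numbers above are rates per million miles traveled. As a quick illustration (the incident counts below are hypothetical, not Waymo’s data), a rate 6.8 times lower than the human benchmark is equivalent to roughly the 85% reduction cited earlier:

```python
# Hypothetical counts for illustration only; shows how a per-million-mile rate
# is computed and how a "6.8 times lower" rate maps to an ~85% reduction.
def rate_per_million_miles(incidents: int, miles: float) -> float:
    return incidents / (miles / 1_000_000)

human_rate = rate_per_million_miles(incidents=200, miles=50_000_000)  # benchmark (hypothetical)
waymo_rate = human_rate / 6.8                                         # "6.8 times lower"

print(f"human: {human_rate:.2f}/M miles, waymo: {waymo_rate:.2f}/M miles")
print(f"reduction: {1 - waymo_rate / human_rate:.0%}")                # -> reduction: 85%
```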
Waymo published a separate research paper detailing their work on creating benchmarks for comparing ADS to human drivers, including their attempts to correct biases in the data, such as under-reporting; for example, drivers involved in a “fender-bender” mutually agreeing to forgo reporting the incident. They also recently released the results of a study done by the reinsurance company Swiss Re Group, which compared Waymo’s ADS to a human baseline and found that Waymo “reduced the frequency of property damage claims by 76%” per million miles driven compared to human drivers.
Waymo currently operates their autonomous vehicles in three locations: Phoenix, San Francisco, and Los Angeles. Because Waymo only recently began operating in Los Angeles, they have accumulated only 46k miles there, and while they have had no crashes, the low mileage means that the benchmark comparisons lack statistical significance. Waymo’s San Francisco data in isolation showed the largest advantage over humans: Waymo’s absolute incident rate per million miles was lower there than its overall rate, and local human drivers performed worse than the national average.
AI journalist Timothy B. Lee posted his remarks about the study on X:
I’m worried that the cowboy behavior of certain other companies will give people the impression that AV technology in general is unsafe. When the reality is that the leading AV company, Waymo, has been steadily building a sterling safety record.
In a discussion about the report on Hacker News, one user noted that in many incidents, the Waymo vehicle was hit from behind, meaning that the crash was the other driver’s fault. Another user replied:
Yes, but…there is something else to be said here. One of the things we have evolved to do, without necessarily appreciating it, is to intuit the behavior of other humans through the theory-of-mind. If [autonomous vehicles consistently] act “unexpectedly”, this injects a lot more uncertainty into the system, especially when interacting with other humans.
The raw data for all autonomous vehicle crashes in the United States is available on the NHTSA website.


OpenAI recently published a guide to Prompt Engineering. The guide lists six strategies for eliciting better responses from their GPT models, with a particular focus on examples for their latest version, GPT-4.
The guide’s six high-level strategies are: write clear instructions, provide reference text, split complex tasks into simpler subtasks, give the model time to “think”, use external tools, and test changes systematically. Each of the strategies is broken down into a set of specific, actionable tactics with example prompts. Many of the tactics are based on results of LLM research, such as chain-of-thought prompting or recursive summarization.
OpenAI’s research paper on GPT-3, published in 2020, showed how the model could perform a variety of natural language processing (NLP) tasks using few-shot learning; essentially, by prompting the model with a description or examples of the task to be performed. In 2022, OpenAI published a cookbook article which contained several “techniques for improving reliability” of GPT-3’s responses. Some of these, such as giving clear instructions and breaking up complex tasks, are still included in the new guide. The older cookbook guide also contains a bibliography of research papers supporting their techniques.
Several of the guide’s tactics make use of the Chat API’s system message. According to OpenAI’s documentation, this parameter “helps set the behavior of the assistant.” One tactic suggests using it to give the model a persona for shaping its responses. Another suggests using it to pass the model a summary of a long conversation, or to give a set of instructions that are to be repeated for multiple user inputs.
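For example, the persona tactic might look like the following with OpenAI’s Python client (v1.x API); the persona and question are just illustrative:

```python
# A minimal sketch of the system-message persona tactic using the openai package (v1.x).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # The system message sets the assistant's behavior for the whole conversation.
        {"role": "system",
         "content": "You are a terse senior site-reliability engineer. "
                    "Answer in bullet points and cite specific commands."},
        {"role": "user",
         "content": "How do I find which process is listening on port 8080?"},
    ],
)
print(response.choices[0].message.content)
```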
The “use external tools” strategy gives tips on interfacing the GPT model with other systems, with pointers to articles in OpenAI’s cookbook. One of the tactics suggests that instead of asking the model to perform math calculations itself, developers should have it generate Python code to do the calculation; the code is then extracted from the model’s response and executed. The guide does, however, contain a disclaimer that the code the model produces is not guaranteed to be safe, and should only be executed in a sandbox.
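A sketch of that tactic, again using the openai package, might instruct the model to return code in a fenced block and then extract it; per the guide’s warning, anything extracted should only ever run inside a sandbox:

```python
# Sketch of the "delegate math to generated code" tactic; prompt wording is illustrative.
import re
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "When asked to do arithmetic, respond only with Python code, "
                    "enclosed in triple backticks, that prints the answer."},
        {"role": "user", "content": "What is 987654321 * 123456789?"},
    ],
)
reply = completion.choices[0].message.content

# Extract the first fenced code block from the response.
match = re.search(r"`{3}(?:python)?\s*(.*?)`{3}", reply, re.DOTALL)
if match:
    code = match.group(1)
    print(code)  # review, then execute only inside a sandbox, never directly
```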
Another strategy in the guide, “test changes systematically,” deals with the problem of deciding if a different prompt actually results in better or worse output. This strategy suggests using the OpenAI Evals framework, which InfoQ covered along with the release of GPT-4. The strategy also suggests using the model to check its own work “with reference to gold-standard answers,” via the system message.
In a Hacker News discussion about the guide, one user said:
I’ve been hesitant lately to dedicate a lot of time to learning how to perfect prompts. It appears every new version, not to mention different LLMs, responds differently. With the rapid advancement we’re seeing, in two years or five, we might not even need such complex prompting as systems get smarter.
Several other LLM providers have also released prompt engineering tips. Microsoft Azure, which provides access to GPT models as a service, has a list of techniques similar to OpenAI’s; their guide also provides tips on setting model parameters such as temperature and top_p, which control the randomness of the model’s output generation. Google’s Gemini API documentation contains several prompt design strategies as well as suggestions for the top_p and temperature values.


Microsoft Research announced Phi-2, a 2.7 billion-parameter Transformer-based language model. Phi-2 was trained on 1.4T tokens of web and synthetic data, including synthetic data generated by GPT-3.5, and outperforms larger models on a variety of benchmarks.
Phi-2 is the latest iteration of Microsoft’s Phi suite of models, which are trained on a mixture of web-crawled and synthetic “textbook-quality” datasets. The previous Phi models contain only 1.3B parameters but showed excellent performance on coding and reasoning tasks. Phi-2 is twice as large as its predecessors and was trained for two weeks on a cluster of 96 A100 GPUs. Its performance is comparable to that of models up to 25x larger, outperforming the 70B parameter Llama-2 model on reasoning, language understanding, and coding benchmarks. According to Microsoft:
With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 available in the Azure AI Studio model catalog to foster research and development on language models.
InfoQ recently covered several efforts to replicate the abilities of large language models (LLMs) in smaller models. Many of these use LLMs such as ChatGPT to generate synthetic training datasets for the smaller model. Google’s Distilling Step-by-Step method prompts a teacher LLM to automatically generate a small fine-tuning dataset that contains both an input with an output label, as well as a “rationale” for why the output label was chosen. Microsoft Research’s Orca 2 uses a synthetic training dataset and a new technique called Prompt Erasure to achieve performance equal to or better than models that contain 10x the number of parameters.
The key innovation with the Phi series of models is a synthetic dataset of “textbook-like” data. Although the researchers have not released the dataset or even very many details of its generation, previous tech reports on the Phi models include high-level descriptions. One goal for the datasets was to generate “diverse and non-repetitive” examples that cover a range of “concepts, skills, and scenarios” that vary in “level of difficulty, complexity, and style.” For Phi-1.5, the team selected 20k different topics for generated examples of language understanding problems.
Sebastien Bubeck, lead of the Machine Learning Foundations team at Microsoft Research, posted on X about additional work fine-tuning Phi-2:
phi-2 is really a good base for further fine-tuning: we [fine-tune] on 1M math exercises (similar to phi-1 w. CodeExercises) & test on recent French nation-wide math exam (published after phi-2 finished training). The results are encouraging! Go try your own data…
Mark Tenenholtz, the head of AI at Predelo, also posted about Phi-2, noting that “knowledge distillation really does work.” In a Hacker News discussion about Phi-2, one user noted that the compute cost of training the model was probably around 30k USD, or “cheaper than a car.” Another pointed out:
Note the model is trained on data generated by GPT-4. It’s probably orders of magnitude more expensive to generate the data at current API prices. The whole point of these papers is that training data quality is key. I would much prefer for these companies to release the training data than the weights.
The Phi-2 model weights are available on Hugging Face.
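Loading the model with the Hugging Face transformers library looks roughly like the sketch below; older transformers releases may additionally require trust_remote_code=True, so check the model card for current instructions.

```python
# Minimal sketch of running Phi-2 locally with transformers (requires accelerate
# for device_map="auto"); the prompt and generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```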


Microsoft Research released its Orca 2 LLM, a fine-tuned version of Llama 2 that performs as well as or better than models that contain 10x the number of parameters. Orca 2 uses a synthetic training dataset and a new technique called Prompt Erasure to achieve this performance.
Orca 2 models are trained using a teacher-student scheme, where a larger, more powerful LLM acts as a teacher for a smaller student LLM, with the goal of improving the performance of the student to be comparable with that of a larger model. Microsoft’s training technique teaches the smaller model multiple reasoning techniques and also how to choose the most effective technique for a given task. To do this, the teacher is given sophisticated prompts to trigger a certain reasoning behavior. However, in a scheme called Prompt Erasure, the student is given only the task requirements and desired response, but not the teacher’s prompt. When evaluated on benchmarks, a 13B parameter Orca 2 model outperformed a baseline 13B parameter Llama 2 by 47.54%. The 7B parameter Orca 2 was “better or comparable” to a 70B parameter Llama 2 on reasoning tasks.
Although LLMs like ChatGPT can often perform well on a wide range of tasks with few-shot prompting, hosting the models is challenging due to their memory and compute requirements. Smaller models can also perform well when fine-tuned, and many researchers have investigated training them with synthetic datasets generated by larger LLMs. InfoQ recently covered Google’s Distilling Step-by-Step method which prompts a teacher LLM to automatically generate a small fine-tuning dataset that contains both an input with an output label, as well as a “rationale” for why the output label was chosen. InfoQ also covered Stability AI’s Stable Beluga model which is trained using Microsoft’s original Orca 1 scheme, which uses Explanation Tuning, where the teacher LLM is prompted to “generate detailed answers.”
Like Orca 1, the Orca 2 training dataset is generated by a teacher LLM which is given a detailed prompt. However, the new approach, which Microsoft dubs Cautious Reasoning, pairs training tasks with prompts which elicit the teacher to use a specific problem solving strategy, such as “step-by-step” or “explain your answer.” Then during training of the student, the teacher’s prompt is erased, which pushes the student to learn to pick the correct strategy.
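The data construction can be illustrated with a hypothetical sketch; the helper below is not Microsoft's code, but it shows how the teacher's strategy prompt is dropped from what the student sees:

```python
# Illustrative sketch of Cautious Reasoning with Prompt Erasure (not Microsoft's code).
def build_student_example(task: str, strategy_prompt: str, teacher_llm) -> dict:
    # The teacher is conditioned on an explicit strategy, e.g. "solve this step by step".
    teacher_answer = teacher_llm(f"{strategy_prompt}\n\n{task}")

    # Prompt Erasure: the strategy prompt is deliberately omitted from the training
    # example, so the student must learn to choose a strategy on its own.
    return {
        "prompt": task,                # what the student sees
        "completion": teacher_answer,  # what the student is trained to reproduce
    }
```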
To evaluate the methodology, Microsoft compared Orca 2 model performance to several baseline models, including Llama 2, ChatGPT (GPT-3.5) and GPT-4. The benchmark tasks included reasoning, language understanding, text completion, and summarization. On the reasoning benchmarks, the 13B parameter Orca 2 model outperformed all baselines except ChatGPT and GPT-4. They also found that giving Orca 2 a “cautious” system prompt (“You are a cautious assistant. You carefully follow instructions.”) gave it a small performance boost compared to an empty system prompt.
Several users posted about Orca 2 on X. One noted that “[Y]ou do not need to prompt it with tricks like “explain step by step.” It just knows.” AI researcher Rudi Ranck wrote:
Many brilliant ideas are so simple…Like “Prompt Erasure” in Orca 2: Instead of presenting the entire prompt, only the task and the answer are shown to the model (it filters the full prompt used to generate those answers). It helps the model to strategize at a higher level. Such a nice paper. I highly recommend reading it all the way through.
The 7B and 13B parameter Orca 2 models are available on Hugging Face.


Stability AI released the code and model weights for Stable Video Diffusion (SVD), a video generation AI model. When given an input image as context, the model can generate 25 video frames at a resolution of 576×1024 pixels.
The model is based on Stability’s Stable Diffusion text-to-image generation model, with additional video pre-training and fine-tuning using a high-quality curated dataset. To perform this additional training, Stability collected a dataset called Large Video Dataset (LVD), which contains 580M video clips representing 212 years of runtime. While the initial model release only supports image-to-video generation, Stability AI claims it can be adapted for multiple video generation tasks, including text-to-video and multi-view (i.e., 3D object) generation; the company has also announced a waitlist to gain access to a web-based text-to-video interface. The model license allows use for research purposes only:
While we eagerly update our models with the latest advancements and work to incorporate your feedback, we emphasize that this model is not intended for real-world or commercial applications at this stage. Your insights and feedback on safety and quality are important to refining this model for its eventual release.
Stability AI’s general strategy for building SVD was to collect and annotate a large dataset of videos. Starting with raw video, the team first removed motion inconsistencies such as “cuts” as well as videos with no motion at all. They then generated three synthetic captions for each clip: one from an image-only captioning model, one from a video captioning model, and one from an LLM that combined the two. They also used CLIP to extract aesthetic scores for selected frames in the video samples.
After training a base video diffusion model on the large dataset, the researchers used smaller curated datasets to fine-tune task-specific models for text-to-video, image-to-video, frame-interpolation, and multi-view generation. They also trained LoRA camera-control blocks for the image-to-video model. When evaluated by human judges, the output of the image-to-video model was preferred over that generated by state-of-the-art commercial products GEN-2 and PikaLabs. The multi-view generation model outperformed state-of-the-art models Zero123 and SyncDreamer.
Emad Mostaque, Stability AI’s CEO, wrote about the model’s current and future capabilities on X:
It [has] not only camera control via LoRA, you can do explosions & all sorts of effects…We will have blocking, staging, mise en scene, cinematography & all other elements of scene creation & brand new ones…
In a discussion about SVD on Hacker News, one user pointed out shortcomings of this approach:
[A]lthough I love SD and these video examples are great… It’s a flawed method: they never get lighting correctly and there are many incoherent things just about everywhere. Any 3D artist or photographer can immediately spot that. However I’m willing to bet that we’ll soon have something much better: you’ll describe something and you’ll get a full 3D scene, with 3D models, source of lights set up, etc. And the scene shall be sent into Blender and you’ll click on a button and have an actual rendering made by Blender, with correct lighting.
The Stable Video Diffusion code is available on GitHub, and the model weights are available on Hugging Face.
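If you use the community integration in the Hugging Face diffusers library, image-to-video generation looks roughly like the sketch below; the parameter values are illustrative, so check the model card for current recommendations.

```python
# Sketch of image-to-video generation with Stable Video Diffusion via diffusers.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16")
pipe.to("cuda")

image = load_image("input.jpg").resize((1024, 576))   # conditioning image

frames = pipe(image, decode_chunk_size=8).frames[0]   # list of PIL frames
export_to_video(frames, "generated.mp4", fps=7)
```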


Meta AI Research announced two new generative AI models: Emu Video, which can generate short videos given a text prompt, and Emu Edit, which can edit images given text-based instructions. Both models are based on Meta’s Emu foundation model and exhibit state-of-the-art performance on several benchmarks.
Emu Video uses a factorized or two-step approach for video generation: first generating an image based on the text prompt, then generating a video from the prompt and generated image. Both steps use a single fine-tuned Emu diffusion model, unlike previous methods such as Make-a-Video which use a pipeline of distinct models. Emu Edit is also based on the Emu diffusion model, but includes a task-embedding layer, which converts the text instruction prompt into an additional conditioning vector. Both Emu Video and Emu Edit were evaluated by human judges, who rated those models’ outputs on generated image quality and instruction faithfulness. Both models outperformed baseline models a majority of the time; in the case of Emu Video, 91.8% of the time on quality and 86.6% on faithfulness. According to Meta,
While certainly no replacement for professional artists and animators, Emu Video, Emu Edit, and new technologies like them could help people express themselves in new ways—from an art director ideating on a new concept or a creator livening up their latest reel to a best friend sharing a unique birthday greeting. And we think that’s something worth celebrating.
The Emu foundation model was announced earlier this year at the Meta Connect event. It is a latent diffusion model that is pre-trained on over 1 billion image-text pairs, then fine-tuned on “a few thousand carefully selected high-quality images.” Emu can generate “highly visually appealing” images, with human judges preferring its output to Stable Diffusion XL over 70% of the time.
To create Emu Video, the researchers used a dataset of 34 million video-text pairs to further fine-tune an Emu foundation model; the model learned to predict several future video frames given an initial frame image. The resulting model can produce four-second long videos of 512×512 pixels at 16 fps. In addition to text-to-video, the model can generate a video from a user’s image; for this task, its output was preferred 96% of the time over that of the baseline VideoComposer model.
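Since the models are not open-source, the factorized approach can only be illustrated conceptually; the object and method names in the sketch below are hypothetical stand-ins, not Meta's API.

```python
# Hypothetical sketch of Emu Video's factorized text-to-video generation.
def emu_video_generate(prompt, emu_model):
    # Step 1: text-to-image with the fine-tuned Emu diffusion model.
    first_frame = emu_model.text_to_image(prompt)

    # Step 2: the same model, conditioned on both the prompt and the generated
    # image, predicts the remaining frames of the clip.
    return emu_model.image_and_text_to_video(prompt, first_frame)
```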
To train Emu Edit, the Meta team created a synthetic dataset of 10 million samples. Each sample consists of an input image, a textual instruction, a desired output image, and a task index. The index indicates which one of sixteen predefined tasks the instruction represents, such as removing an object or changing the image style. During training, the model learns an embedding for each task. The model can learn a new task by fine-tuning the embedding layer on just a “handful” of new examples.
In a discussion on Reddit, one user posted that:
The most interesting thing here is [the] appendix where they describe how they create the training dataset. They use a toolchain involving LLaMA, DINO, Segment Anything, and an image generator to create millions of image -> instruction -> output pairs. This is a real success story for synthetic data.
In a discussion on Hacker News, several users expressed disappointment that the models have not been open-sourced, stating that “Meta had been on an open source roll lately.” Meta did create a demo website for both Emu Video and Emu Edit. Meta also released the Emu Edit benchmark dataset on Hugging Face.


xAI, the AI company founded by Elon Musk, recently announced Grok, a large language model. Grok can access current knowledge of the world via the X platform and outperforms other LLMs of comparable size, including GPT-3.5, on several benchmarks.
xAI was launched earlier this year and trained their first model, the 33B parameter Grok-0. The company has not disclosed the parameter count or training details of its latest version, Grok-1, but says that the model outperforms GPT-3.5 and Llama 2 on several benchmarks, including the mathematics benchmarks GSM8k and MATH, the question-answering benchmark MMLU, and the coding benchmark HumanEval. The model is touted as answering questions with “a bit of wit” and having “a rebellious streak,” and xAI claims it will answer questions that other LLMs will not. According to the xAI team:
By creating and improving Grok, we aim to gather feedback and ensure we are building AI tools that maximally benefit all of humanity. We believe that it is important to design AI tools that are useful to people of all backgrounds and political views. We also want to empower our users with our AI tools, subject to the law. Our goal with Grok is to explore and demonstrate this approach in public. We want Grok to serve as a powerful research assistant for anyone, helping them to quickly access relevant information, process data, and come up with new ideas. Our ultimate goal is for our AI tools to assist in the pursuit of understanding.
While the word “grok” was coined by Robert Heinlein in his sci-fi novel Stranger in a Strange Land, xAI says that their model is inspired by The Hitchhiker’s Guide to the Galaxy, the eponymous fictional guidebook of Douglas Adams’s sci-fi series. According to xAI, it is “intended to answer almost anything….”
Although technical details about Grok are scarce, xAI mentioned that they built a custom ML framework for training and inference using JAX, Rust, and Kubernetes; they also mentioned that the model was trained for two months. xAI founding member Toby Pohlen posted a thread on X with videos demonstrating the Grok UI. Further, the X account for the Qdrant open-source vector database posted that Grok’s real-time knowledge capabilities are built on Qdrant, and encouraged users to “stay tuned” for more details in a future blog post and tech talk with the X engineering team.
Reaction to the announcement was mixed. On Reddit, one user praised the effort, saying:
Beating Meta with just two months of training is really impressive. We know they have at least 10,000 H100s, which is more compute than was used for GPT-4. It seems like they are going to continue with rolling releases, so it will probably improve quickly. Also, it’s nice that the model seems much less censored, as this will push other companies to do the same.
Hacker News users were more skeptical. One user speculated that Grok’s benchmark scores could be due to training on the test set:
Many of the modern LLMs take an entire copy of the internet which includes the test set for many of these benchmarks. So if someone claims to beat ChatGPT and their model is trained on the test set, of course they’ll do better. Even ChatGPT is likely trained on the test set.
xAI said that they could not rule out that possibility. However, the team also hand-graded the model’s attempt at the Hungarian national high school final exam in mathematics, which was published after their dataset was collected. On this exam, Grok outperformed both GPT-3.5 and Claude 2.
Other users questioned whether Grok’s touted lack of censorship meant that xAI was “brushing off” concerns of bias and other risks. xAI said that they are working on “safeguards against catastrophic forms of malicious use.” The company lists Dan Hendrycks, the director of the Center for AI Safety, as an advisor. Hendrycks recently appeared on the Future of Life Institute podcast to discuss AI risks. In the podcast, Hendrycks said of xAI:
I think it’s relevant to note that [xAI is] a fairly serious effort. I’d anticipate it would probably be one of the main three AI companies next year or the year after: OpenAI, Google DeepMind, and xAI. I don’t think of it as a smaller effort: it has the capacity to have a substantial show of force.
A waitlist for early beta access to Grok is available only to verified X users.


Multimodal AI company Jina AI recently released jina-embeddings-v2, a sentence embedding model. The model supports context lengths up to 8192 tokens and outperforms OpenAI’s text-embedding-ada-002 on several embedding benchmarks.
The jina-embeddings-v2 model, which is freely available under the Apache 2.0 license, is the second iteration of Jina’s embeddings model. The model, which supports only the English language, is based on a BERT architecture and is available in two sizes: small, with 33 million parameters, and base, with 137 million; a large version with 435 million parameters is “releasing soon.” The model was trained on the C4 dataset, along with a new dataset of negated statements created by Jina AI. On the Hugging Face leaderboard for the Massive Text Embedding Benchmark (MTEB), jina-embeddings-v2 outperforms OpenAI’s text-embedding-ada-002 on several tasks of the benchmark, including text classification, reranking, and summarization. According to Dr. Han Xiao, CEO of Jina AI:
In the ever-evolving world of AI, staying ahead and ensuring open access to breakthroughs is paramount. With jina-embeddings-v2, we’ve achieved a significant milestone. Not only have we developed the world’s first open-source 8K context length model, but we have also brought it to a performance level on par with industry giants like OpenAI. Our mission at Jina AI is clear: we aim to democratize AI and empower the community with tools that were once confined to proprietary ecosystems. Today, I am proud to say, we have taken a giant leap towards that vision.
A sentence embedding is a mapping of a piece of text into a vector. Spatial relationships between two vectors, such as the cosine distance, are used to measure how related the meanings of the two source texts are. The embedding of a text can be used for several downstream AI tasks, such as text classification or summarization. Embeddings are also used to index documents in a vector database for tasks such as retrieval-augmented generation (RAG). In 2022, InfoQ covered the release of OpenAI’s text-embedding-ada-002 model, which replaced five previous task-specific models.
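A minimal sketch of that comparison is shown below; the toy 4-dimensional vectors stand in for real sentence embeddings, which have hundreds of dimensions, but the cosine-similarity computation is the same.

```python
# Toy example of comparing sentence embeddings with cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 4-dimensional stand-ins for real embedding vectors.
walking_together   = np.array([0.9, 0.1, 0.3, 0.0])
hand_in_hand       = np.array([0.8, 0.2, 0.4, 0.1])
unrelated_sentence = np.array([0.0, 0.9, 0.0, 0.8])

print(cosine_similarity(walking_together, hand_in_hand))       # high: related meanings
print(cosine_similarity(walking_together, unrelated_sentence)) # low: unrelated meanings
```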
Jina AI pointed out that one common shortcoming of embedding models is that negated statements are often mapped very close to their original positive statement. For example, the statements “a couple is walking together” and “a couple is not walking together” are often embedded more closely together than the statements “a couple walks hand in hand down a street” and “a couple is walking together.” To address this, the team used GPT-3.5 to create a negation dataset of “query, positive, negative” triples. During training, the model learned to separate the embeddings of the positive and negative components of the triples.
Several users discussed the model in a thread on Hacker News. One user pointed out that the dimensions of Jina’s embedding vector was about half that of OpenAI’s, which would make it more performant for database queries. Another user claimed:
In my experience, OpenAI’s embeddings are overspecified and do very poorly with cosine similarity out of the box as they match syntax more than semantic meaning (which is important as that’s the metric for RAG). Ideally you’d want cosine similarity in the range of [-1, 1] on a variety of data but in my experience the results are [0.6, 0.8].
Both varieties of jina-embeddings-v2, as well as the v1 models, are available on Hugging Face. Jina AI claims that they are developing German and Spanish language models, and will publish an academic paper with the technical details of their work.