DeepMind claims early progress in AI-based predictive protein modelling

Google-owned AI specialist DeepMind has claimed a “significant milestone” in demonstrating the usefulness of artificial intelligence for the complex task of predicting the 3D structures of proteins based solely on their genetic sequence.

Understanding protein structures is important in disease diagnosis and treatment, and could improve scientists’ understanding of the human body — as well as potentially helping to support protein design and bioengineering.

Writing in a blog post about the project to use AI to predict how proteins fold — now two years in — it writes: “The 3D models of proteins that AlphaFold [DeepMind’s AI] generates are far more accurate than any that have come before — making significant progress on one of the core challenges in biology.”

There are various scientific methods for predicting the native 3D state of protein molecules (i.e. how the protein chain folds to arrive at the native state) from the sequence of amino acid residues encoded in DNA.

But modelling the 3D structure is a highly complex task, given how many permutations there can be on account of protein folding being dependent on factors such as interactions between amino acids.

There’s even a crowdsourced game (FoldIt) that tries to leverage human intuition to predict workable protein forms.

DeepMind says its approach rests upon years of prior research in using big data to try to predict protein structures.

Specifically it’s applying deep learning approaches to genomic data.

“Fortunately, the field of genomics is quite rich in data thanks to the rapid reduction in the cost of genetic sequencing. As a result, deep learning approaches to the prediction problem that rely on genomic data have become increasingly popular in the last few years. DeepMind’s work on this problem resulted in AlphaFold, which we submitted to CASP [Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction] this year,” it writes in the blog post.

“We’re proud to be part of what the CASP organisers have called “unprecedented progress in the ability of computational methods to predict protein structure,” placing first in rankings among the teams that entered (our entry is A7D).”

“Our team focused specifically on the hard problem of modelling target shapes from scratch, without using previously solved proteins as templates. We achieved a high degree of accuracy when predicting the physical properties of a protein structure, and then used two distinct methods to construct predictions of full protein structures,” it adds.

DeepMind says the two methods it used relied on deep neural networks trained to predict a protein’s properties from its genetic sequence.

“The properties our networks predict are: (a) the distances between pairs of amino acids and (b) the angles between chemical bonds that connect those amino acids. The first development is an advance on commonly used techniques that estimate whether pairs of amino acids are near each other,” it explains.

“We trained a neural network to predict a separate distribution of distances between every pair of residues in a protein. These probabilities were then combined into a score that estimates how accurate a proposed protein structure is. We also trained a separate neural network that uses all distances in aggregate to estimate how close the proposed structure is to the right answer.”
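One way to picture how per-pair distance probabilities might be combined into a single score is a negative log-likelihood: for each residue pair, look up the probability the network assigned to the distance actually seen in the proposed structure, and sum the negative logs. The sketch below is an illustration under assumptions, not DeepMind's implementation: the function name `structure_score`, the binning scheme and the probability floor are all hypothetical.

```python
import math

def structure_score(coords, dist_probs, bin_edges):
    """Score a proposed structure against predicted distance distributions.

    coords: list of (x, y, z) positions, one per residue.
    dist_probs: dict mapping a residue pair (i, j) to a list of
        probabilities, one per distance bin (hypothetical network output).
    bin_edges: ascending upper edges of the bins; the last bin also
        absorbs any larger distance.
    Returns a negative log-likelihood, so lower means better agreement.
    """
    score = 0.0
    for (i, j), probs in dist_probs.items():
        d = math.dist(coords[i], coords[j])
        # Find the first bin whose upper edge covers the observed distance.
        k = next(
            (b for b, edge in enumerate(bin_edges) if d <= edge),
            len(bin_edges) - 1,
        )
        # Floor the probability so a zero never produces log(0).
        score += -math.log(max(probs[k], 1e-9))
    return score
```

A structure whose pairwise distances fall in the bins the network considers likely receives a lower score than one that contradicts the predictions, which is what lets a search procedure rank candidate structures.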

It then used new methods to try to construct predictions of protein structures, searching known structures that matched its predictions.

“Our first method built on techniques commonly used in structural biology, and repeatedly replaced pieces of a protein structure with new protein fragments. We trained a generative neural network to invent new fragments, which were used to continually improve the score of the proposed protein structure,” it writes.

“The second method optimised scores through gradient descent — a mathematical technique commonly used in machine learning for making small, incremental improvements — which resulted in highly accurate structures. This technique was applied to entire protein chains rather than to pieces that must be folded separately before being assembled, reducing the complexity of the prediction process.”
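To make the gradient descent idea concrete, here is a minimal sketch that nudges every point of a small chain at once so that its pairwise distances approach a set of target distances. It is a toy under stated assumptions, not AlphaFold's procedure: the squared-error loss, the 2D coordinates, the learning rate and the names `distance_loss`, `gradient_step` and `fold` are all hypothetical stand-ins for the learned score described in the blog post.

```python
import math

def distance_loss(coords, targets):
    """Sum of squared differences between actual and target pair distances."""
    return sum(
        (math.dist(coords[i], coords[j]) - t) ** 2
        for (i, j), t in targets.items()
    )

def gradient_step(coords, targets, lr=0.05):
    """Move every point one small step down the gradient of distance_loss."""
    grads = [[0.0] * len(p) for p in coords]
    for (i, j), t in targets.items():
        d = math.dist(coords[i], coords[j])
        if d == 0.0:
            continue  # gradient is undefined when two points coincide
        coeff = 2.0 * (d - t) / d
        for k in range(len(coords[i])):
            g = coeff * (coords[i][k] - coords[j][k])
            grads[i][k] += g
            grads[j][k] -= g
    return [
        [c - lr * g for c, g in zip(point, grad)]
        for point, grad in zip(coords, grads)
    ]

def fold(coords, targets, steps=500, lr=0.05):
    """Apply many small gradient steps to the whole chain at once."""
    for _ in range(steps):
        coords = gradient_step(coords, targets, lr)
    return coords
```

Because each step updates all coordinates together, there is no need to fold separate pieces and assemble them afterwards, which mirrors the simplification the quote describes.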

DeepMind describes the results achieved thus far as “early signs of progress in protein folding” using computational methods — claiming they demonstrate “the utility of AI for scientific discovery”.

Though it also emphasizes it’s still early days for the deep learning approach having any kind of “quantifiable impact”.

“Even though there’s a lot more work to do before we’re able to have a quantifiable impact on treating diseases, managing the environment, and more, we know the potential is enormous,” it writes. “With a dedicated team focused on delving into how machine learning can advance the world of science, we’re looking forward to seeing the many ways our technology can make a difference.”

Agtech startup Imago AI is using computer vision to boost crop yields

Presenting onstage today in the 2018 TC Disrupt Berlin Battlefield is Indian agtech startup Imago AI, which is applying AI to help feed the world’s growing population by increasing crop yields and reducing food waste. As startup missions go, it’s an impressively ambitious one.

The team, which is based out of Gurgaon near New Delhi, is using computer vision and machine learning technology to fully automate the laborious task of measuring crop output and quality — speeding up what can be a very manual and time-consuming process to quantify plant traits, often involving tools like calipers and weighing scales, toward the goal of developing higher-yielding, more disease-resistant crop varieties.

Currently they say it can take seed companies between six and eight years to develop a new seed variety. So anything that increases efficiency stands to be a major boon.

And they claim their technology can reduce the time it takes to measure crop traits by up to 75 percent.

In the case of one pilot, they say a client had previously been taking two days to manually measure the grades of their crops using traditional methods like scales. “Now using this image-based AI system they’re able to do it in just 30 to 40 minutes,” says co-founder Abhishek Goyal.

Using AI-based image processing technology, they can also, crucially, capture more data points than the human eye can (or easily can), because their algorithms can measure and assess finer-grained phenotypic differences than a person might pick up on or be able to quantify by eye alone.

“Some of the phenotypic traits they are not possible to identify manually,” says co-founder Shweta Gupta. “Maybe very tedious or for whatever all these laborious reasons. So now with this AI-enabled [process] we are now able to capture more phenotypic traits.

“So more coverage of phenotypic traits… and with this more coverage we are having more scope to select the next cycle of this seed. So this further improves the seed quality in the longer run.”

The wordy phrase they use to describe what their technology delivers is: “High throughput precision phenotyping.”

Or, put another way, they’re using AI to data-mine the quality parameters of crops.

“These quality parameters are very critical to these seed companies,” says Gupta. “Plant breeding is a very costly and very complex process… in terms of human resource and time these seed companies need to deploy.

“The research [on the kind of rice you are eating now] has been done in the previous seven to eight years. It’s a complete cycle… chain of continuous development to finally come up with a variety which is appropriate to launch in the market.”

But there’s more. The overarching vision is not only that AI will help seed companies make key decisions to select for higher-quality seed that can deliver higher-yielding crops, while also speeding up that (slow) process. Ultimately their hope is that the data generated by applying AI to automate phenotypic measurements of crops will also be able to yield highly valuable predictive insights.

Here, if they can establish a correlation between geotagged phenotypic measurements and the plants’ genotypic data (data which the seed giants they’re targeting would already hold), the AI-enabled data-capture method could also steer farmers toward the best crop variety to use in a particular location and climate condition — purely based on insights triangulated and unlocked from the data they’re capturing.

One current approach in agriculture to selecting the best crop for a particular location/environment can involve using genetic engineering. Though the technology has attracted major controversy when applied to foodstuffs.

Imago AI hopes to arrive at a similar outcome via an entirely different technology route, based on data and seed selection. And, well, AI’s uniform eye informing key agriculture decisions.

“Once we are able to establish this sort of relation this is very helpful for these companies and this can further reduce their total seed production time from six to eight years to very less number of years,” says Goyal. “So this sort of correlation we are trying to establish. But for that initially we need to complete very accurate phenotypic data.”

“Once we have enough data we will establish the correlation between phenotypic data and genotypic data and what will happen after establishing this correlation we’ll be able to predict for these companies that, with your genomics data, and with the environmental conditions, and we’ll predict phenotypic data for you,” adds Gupta.

“That will be highly, highly valuable to them because this will help them in reducing their time resources in terms of this breeding and phenotyping process.”

“Maybe then they won’t really have to actually do a field trial,” suggests Goyal. “For some of the traits they don’t really need to do a field trial and then check what is going to be that particular trait if we are able to predict with a very high accuracy if this is the genomics and this is the environment, then this is going to be the phenotype.”

So — in plainer language — the technology could suggest the best seed variety for a particular place and climate, based on a finer-grained understanding of the underlying traits.

In the case of disease-resistant plant strains it could potentially even help reduce the amount of pesticides farmers use, say, if the selected crops are naturally more resilient to disease.

While, on the seed generation front, Gupta suggests their approach could shrink the production time frame — from up to eight years to “maybe three or four.”

“That’s the amount of time-saving we are talking about,” she adds, emphasizing the really big promise of AI-enabled phenotyping is a higher amount of food production in significantly less time.

As well as measuring crop traits, they’re also using computer vision and machine learning algorithms to identify crop diseases and measure with greater precision how extensively a particular plant has been affected.

This is another key data point if your goal is to help select for phenotypic traits associated with better natural resistance to disease, with the founders noting that around 40 percent of the world’s crop load is lost (and so wasted) as a result of disease.

And, again, measuring how diseased a plant is can be a judgement call for the human eye — resulting in data of varying accuracy. So by automating disease capture using AI-based image analysis the recorded data becomes more uniformly consistent, thereby allowing for better quality benchmarking to feed into seed selection decisions, boosting the entire hybrid production cycle.

Sample image processed by Imago AI showing the proportion of a crop affected by disease

In terms of where they are now, the bootstrapping, nearly year-old startup is working off data from a number of trials with seed companies — including a recurring paying client they can name (DuPont Pioneer); and several paid trials with other seed firms they can’t (because they remain under NDA).

Trials have taken place in India and the U.S. so far, they tell TechCrunch.

“We don’t really need to pilot our tech everywhere. And these are global [seed] companies, present in 30, 40 countries,” adds Goyal, arguing their approach naturally scales. “They test our technology at a single country and then it’s very easy to implement it at other locations.”

Their imaging software does not depend on any proprietary camera hardware. Data can be captured with tablets or smartphones, or even from a camera on a drone or using satellite imagery, depending on the intended application.

Although for measuring crop traits like length they do need some reference point to be associated with the image.

“That can be achieved by either fixing the distance of object from the camera or by placing a reference object in the image. We use both the methods, as per convenience of the user,” they note on that.
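The reference-object approach comes down to a simple proportion: if an object of known size spans a known number of pixels, that ratio converts any other pixel measurement in the same plane into real units. The sketch below illustrates the arithmetic only; the function name `pixels_to_length` and the sample numbers are hypothetical, and it assumes the crop and the reference object sit at the same distance from the camera.

```python
def pixels_to_length(object_px, reference_px, reference_length_cm):
    """Convert a pixel measurement to centimetres using a reference object.

    object_px: measured length of the crop feature, in pixels.
    reference_px: measured length of the reference object, in pixels.
    reference_length_cm: known real length of the reference object.
    Assumes both lie in the same plane at the same camera distance.
    """
    if reference_px <= 0:
        raise ValueError("reference_px must be positive")
    return object_px * reference_length_cm / reference_px


# Hypothetical example: a 5 cm marker spans 150 pixels, the leaf spans 450.
leaf_length_cm = pixels_to_length(450, 150, 5.0)
```

Fixing the camera distance instead amounts to measuring that pixels-per-centimetre ratio once and reusing it for every image.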

While some current phenotyping methods are very manual, there are also other image-processing applications in the market targeting the agriculture sector.

But Imago AI’s founders argue these rival software products are only partially automated — “so a lot of manual input is required,” whereas they couch their approach as fully automated, with just one initial manual step of selecting the crop to be quantified by their AI’s eye.

Another advantage they flag up versus other players is that their approach is entirely non-destructive. This means crop samples do not need to be plucked and taken away to be photographed in a lab, for example. Rather, pictures of crops can be snapped in situ in the field, with measurements and assessments still — they claim — accurately extracted by algorithms which intelligently filter out background noise.

“In the pilots that we have done with companies, they compared our results with the manual measuring results and we have achieved more than 99 percent accuracy,” is Goyal’s claim.

While, for quantifying disease spread, he points out it’s just not manually possible to make exact measurements. “In manual measurement, an expert is only able to provide a certain percentage range of disease severity for an image example; (25-40 percent) but using our software they can accurately pin point the exact percentage (e.g. 32.23 percent),” he adds.
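An exact percentage of this kind can come from counting pixels: once an image has been segmented into plant and diseased regions, severity is the diseased share of the plant pixels. The sketch below assumes the segmentation already exists as two binary masks; how Imago AI produces such masks is not described in the article, and the function name `disease_severity_percent` is hypothetical.

```python
def disease_severity_percent(plant_mask, disease_mask):
    """Percentage of plant pixels that are also marked as diseased.

    plant_mask, disease_mask: equally sized 2D lists of 0/1 values,
    assumed to come from an upstream segmentation step.
    Diseased pixels outside the plant region are ignored.
    """
    plant_pixels = 0
    diseased_pixels = 0
    for plant_row, disease_row in zip(plant_mask, disease_mask):
        for is_plant, is_diseased in zip(plant_row, disease_row):
            if is_plant:
                plant_pixels += 1
                if is_diseased:
                    diseased_pixels += 1
    if plant_pixels == 0:
        raise ValueError("plant_mask contains no plant pixels")
    return 100.0 * diseased_pixels / plant_pixels
```

Because the result is a ratio of counts rather than a visual estimate, two runs on the same masks always give the same figure, which is the consistency benefit the founders point to.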

They are also providing additional support for seed researchers — by offering a range of mathematical tools with their software to support analysis of the phenotypic data, with results that can be easily exported as an Excel file.

“Initially we also didn’t have this much knowledge about phenotyping, so we interviewed around 50 researchers from technical universities, from these seed input companies and interacted with farmers — then we understood what exactly is the pain-point and from there these use cases came up,” they add, noting that they used WhatsApp groups to gather intel from local farmers.

While seed companies are the initial target customers, they see applications for their visual approach for optimizing quality assessment in the food industry too — saying they are looking into using computer vision and hyper-spectral imaging data to do things like identify foreign material or adulteration in production line foodstuffs.

“Because in food companies a lot of food is wasted on their production lines,” explains Gupta. “So that is where we see our technology really helps — reducing that sort of wastage.”

“Basically any visual parameter which needs to be measured that can be done through our technology,” adds Goyal.

They plan to explore potential applications in the food industry over the next 12 months, while focusing on building out their trials and implementations with seed giants. Their target is to have between 40 and 50 companies using their AI system globally within a year’s time, they add.

While the business is revenue-generating now — and “fully self-enabled” as they put it — they are also looking to take in some strategic investment.

“Right now we are in touch with a few investors,” confirms Goyal. “We are looking for strategic investors who have access to agriculture industry or maybe food industry… but at present haven’t raised any amount.”

Chances DNA can be used to find your family? Sixty percent and rising

Image of a family tree.

Earlier this year, news broke that police had devised an unexpected new method to crack cold cases. Rather than use a suspect’s DNA to identify them, data from the DNA was used to search public repositories and identify an alleged killer’s family members. From there, a bit of family tree building led to a limited number of suspects and the eventual identification of the person who was charged with the Golden State killings. In the months that followed, more than a dozen other cases were reported to have been solved in the same manner.

The potential for this sort of analysis had been identified by biologists as early as 2014, but they viewed it as a privacy risk—there was potential for personal information from research subjects to leak out to the public via their DNA sequences. Now, a US-Israeli team of researchers has gone through and quantified the chances of someone being identified through public genealogy data. If you live in the US and are of European descent, odds are 60 percent that you can be identified via information that your relatives have made public.

ID, the family plan

Any two humans share identical versions of the vast majority of their DNA. But there are enough differences commonly scattered across the three billion or so bases of our genomes that it’s now cheap and easy to determine which version of up to 700,000 differences people have. This screen forms the basis of personal DNA testing and genealogy services.

Using Medieval DNA to track the barbarian spread into Italy

Two-sided image. At left, grave goods; at right, a skeleton.

The genetics of Europe are a bit strange. Just within historic times, it’s seen waves of migrations, invasions, and the rise and fall of empires—all of which should have mixed its populations up thoroughly. Yet, if you look at the modern populations, there’s little sign of all this upheaval and some indications that many of the populations have been in place since agriculture spread across the continent.

This was rarely more obvious than during the contraction and collapse of the Roman Empire. Various Germanic tribes from north-eastern Europe poured into Roman territory in the west only to be followed by the force they were fleeing, the Huns. Before it was over, one of the groups ended up founding a kingdom in North Africa that extended throughout much of the Mediterranean, while another ended up controlling much of Italy.

It’s that last group, the Longobards (often shortened to “Lombards”), that’s the focus of a new paper. We know very little of them or any of the other barbarian tribes that roared through Western Europe other than roughly contemporary descriptions of where they came from. But a study of the DNA left behind in the cemeteries of the Longobards provides some indication of their origins and how they interacted with the Europeans they encountered.

George Church’s genetics on the blockchain startup just raised $4.3 million from Khosla

Nebula Genomics, the startup that wants to put your whole genome on the blockchain, has announced that it has raised $4.3 million in Series A funding from Khosla Ventures and other leading tech VCs such as Arch Venture Partners, Fenbushi Capital, Mayfield, F-Prime Capital Partners, Great Point Ventures, Windham Venture Partners, Hemi Ventures, Mirae Asset, Hikma Ventures and Heartbeat Labs.

Nebula has also forged a partnership with genome sequencing company Veritas Genetics.

Veritas was one of the first companies to sequence the entire human genome for less than $1,000 in 2015, later making all that info available at the touch of a button on your smartphone. Both Nebula and Veritas were cofounded by MIT professor and “godfather” of the Human Genome Project, George Church.

The partnership between the two companies will allow the Nebula marketplace, the place where those consenting to share their genetic data can earn Nebula’s cryptocurrency, called “Nebula tokens,” to build upon Veritas’ open-source software platform Arvados, which can process and share large amounts of genetic information and other big data. According to the company, this crossover offers privacy and security for the physical storage and management of various data sets according to local rules and regulations.

“As our own database grows to many petabytes, together with the Nebula team we are taking the lead in our industry to protect the privacy of consumers while enabling them to participate in research and benefit from the blockchain-based marketplace Nebula is building,” Veritas CEO Mirza Cifric said in a statement.

The partnership will work with various academic institutions and industry researchers to provide genomic data from individual consumers looking to cash in by sharing their own data, rather than by freely giving it as they might through another genomics company like 23andMe.

“Compared to centralized databases, Nebula’s decentralized and federated architecture will help address privacy concerns and incentivize data sharing,” added Nebula Genomics co-founder Dennis Grishin. “Our goal is to create a data flow that will accelerate medical research and catalyze a transformation of health care.”

DNA shows girl had one Neanderthal, one Denisovan parent

One can be forgiven for thinking that the first modern humans who ventured out of Africa stumbled into a vibrant bar scene. DNA from just a single cave in Siberia revealed that it had been occupied by two archaic human groups that had interbred with the newly arrived modern humans. This included both the Neanderthals, whom we knew about previously, and the Denisovans, who we didn’t even know existed and still know little about other than their DNA sequences. The DNA also revealed that one of the Denisovans had a Neanderthal ancestor a few hundred generations back in his past.

But in almost all of these cases, the ancestry seems to have come from a single exchange of chromosomes many generations prior. There was little indication that the interbreeding was frequent.

Now, the same cave has yielded a bone fragment that indicates the interbreeding may have been common. DNA sequencing revealed that the bone fragment’s original owner had a mom who was Neanderthal and a father who was Denisovan. The fact that we have so few DNA samples from this time and that one is the immediate product of intermating gives us a strong hint that we should expect more examples in the future.

The nightmarishly complex wheat genome finally yields to scientists

Bread, like wine, is pivotal in Judeo-Christian rituals. Both products exemplify the use of human ingenuity to re-create what nature provides, and the fermentation they both require must have seemed nothing less than magical to ancient minds. When toasted, rubbed with garlic and tomato, doused with olive oil and sprinkled with salt like the Catalans do, there are few things more delicious than bread.

Wheat is the most widely cultivated crop on the planet, accounting for about a fifth of all calories consumed by humans and more protein than any other food source. Although we have relied on bread wheat so heavily and for so long (14,000 years-ish), an understanding of its genetics has been a challenge. Its genome has been hard to solve because it is ridiculously complex. The genome is huge, about five times larger than ours. It’s hexaploid, meaning it has six copies of each of its chromosomes. More than 85 percent of the genetic sequences among these three sets of chromosome pairs are repetitive DNA, and they are quite similar to each other, making it difficult to tease out which sequences reside where.

The genomes of rice and corn—two other staple grain crops—were solved in 2002 and 2009, respectively. In 2005, the International Wheat Genome Sequencing Consortium determined to get a reference genome of the bread wheat cultivar Chinese Spring. Thirteen years later, the consortium has finally succeeded.

Gene editing crunches an organism’s genome into single, giant DNA molecule

Complex organisms have complex genomes. While bacteria and archaea keep all of their genes on a single loop of DNA, humans scatter them across 23 large DNA molecules called chromosomes; chromosome counts range from a single chromosome in males of an ant species to more than 400 in a butterfly.

There have been indications that chromosomes matter for an organism’s underlying biology. Specialized structures within them influence the activity of nearby genes. And studies show that areas on different chromosomes will consistently be found next to each other in the cell, suggesting their interactions are significant.

So how do we square these two facts? Chromosome counts vary wildly and sometimes differ between closely related species, suggesting the actual number of chromosomes doesn’t matter much. Yet the chromosomes themselves seem to be critical for an organism’s genome to function as expected. To explore this issue, two different groups tried an audacious experiment: using genome editing, they gradually merged a yeast’s 16 chromosomes down to just one giant molecule. And, unexpectedly, the yeast were mostly fine.
