Advantest Talks Semi
Dive into the world of semiconductors and Automatic Test Equipment with our educational podcast, Advantest Talks Semi, where we explore the power of knowledge in this dynamic field. Hosted by Keith Schaub, Vice President of Technology and Strategy at Advantest, and Don Ong, Director and Head of Innovation for Advantest's Field Service Business Group, this series features insightful conversations with experts and thought leaders in the industry.
In today's fast-paced environment, continuous learning is essential for staying ahead. Join us in these thought-provoking discussions, where you can learn about the latest trends and cutting-edge strategies being used in the semiconductor industry. Explore how innovative technologies are revolutionizing testing processes and shaping the future.
Stay updated on the ever-evolving semiconductor industry with Advantest Talks Semi, and gain exclusive insights into the future of technology.
The views, information, or opinions expressed during the Advantest Talks Semi series are solely those of the individuals interviewed and do not necessarily represent those of Advantest.
Data Management Frontiers: Navigating the Semiconductor Landscape with Joe Addiego of Brave Capital
Join us on "Advantest Talks Semi" as we explore the critical importance of data management in the semiconductor industry with our guest, Joe Addiego.
Joe Addiego is a seasoned operating executive and investor with over 20 years of operating experience in technology. His executive leadership contributed to two successful Initial Public Offerings (IPOs), showcasing his strategic acumen in guiding tech companies through critical growth phases. Joe's operational background spans control systems, real-time operating systems, and software development tools, with expertise in data management, networking, and cybersecurity.
Joe has spent the last 25 years as an investor, nurturing startups and establishing markets for innovative technologies, holding prominent roles at In-Q-Tel, Alsop Louie Partners, and Brave Capital. As the lead investor for the real-time database company Aerospike, he addressed the data management industry's need for guaranteed low latency and low cost of ownership. More recent investments include Crunchy Data, an open-source PostgreSQL company that provides a trusted Postgres implementation for enterprises to use either on premises through Kubernetes or in the cloud as a managed service.
Crunchy Data recently augmented its cloud implementation to include a next-generation Postgres-native data warehouse, enabling everyone to take advantage of high-performance data analytics without leaving Postgres. Throughout Joe’s career he has succeeded at creating new markets for innovative technologies.
In this episode, we dive deep into how data serves as the foundation of cutting-edge innovations across AI, machine learning, adaptive testing, and cybersecurity. This conversation leads us into a comprehensive look at the evolution of data management—from the early days of punch cards and magnetic tapes to today's sophisticated database systems and cloud solutions.
Throughout the episode, Joe shares his insights on the transformation brought about by the internet era, which has revolutionized how we store, manage, and leverage data in real-time across the globe. We'll examine how these advancements have fueled the explosion of data volumes and how modern technologies have adapted to handle this surge with unprecedented efficiency and security.
Tune in to "Advantest Talks Semi" for a thought-provoking discussion on the history, current trends, and the future of data management within the semiconductor industry, and gain valuable perspectives on navigating the complexities of today's data-driven world. This episode is essential for anyone interested in the intersection of technology and business strategy, providing a clear understanding of how data continues to drive innovation and success in the high-tech landscape.
Thanks for tuning in to "Advantest Talks Semi"!
If you enjoyed this episode, we'd love to hear from you! Please take a moment to leave a rating on Apple Podcasts. Your feedback helps us improve and reach new listeners.
Don't forget to subscribe and share with your friends. We appreciate your support!
Don Ong (00:06.424)
Hello and welcome to another exciting episode of Advantest Talks Semi. Today we're diving into a topic that's central to the future of the semiconductor industry: data management. Data isn't just numbers on a screen, it's the backbone of so much innovation happening right now, from AI and machine learning to adaptive testing and cybersecurity. To help us unpack this, I'm thrilled to welcome Joe Addiego to the podcast.
Joe is an investment partner at Silicon Valley venture funds such as Alsop Louie Partners and Brave Capital. He has spent years in the technology world and has a deep understanding of how data impacts every part of the process, from development and production to solving some of the industry's challenges. Joe's career is nothing short of impressive. He has spent over 20 years as an operating executive, contributing to two IPOs and working on everything from control systems and real-time operating systems to cutting-edge data management tools and messaging applications.
Joe's been deeply involved in driving the evolution of database-as-a-service and also integrated data analytics, making him a true innovator in how we think about modern data infrastructure. We'll be diving into his career, his insights on the future of technology, and the role data plays in shaping it all. Joe, welcome to the show.
Joe Addiego: 1:37
Thank you, Don. So, as Don mentioned, I have 20 years of operating experience as an executive and luckily participated in a couple of IPOs. I was at Integrated Systems and at Talarian, and Talarian in particular was a data management company that focused on messaging applications in publish and subscribe. I'll talk a little bit about that later.
And then I've done 25 years as an investor and I'm currently at Brave Capital, but I've been an investor at Alsop Louie Partners and In-Q-Tel, and during that time I was the lead investor for a real-time database, Aerospike. That was used in the ad tech space and now has proliferated. We'll talk a little bit about that. At the time, the need that we saw as an investor was the need to guarantee low latency and low cost of ownership, along with being able to manage millions of transactions a second. I was also the lead investor in an open-source Postgres company called Crunchy Data, and the early users there were in government, finance, and manufacturing, with an early implementation on Kubernetes. The system is used as a system of record for critical transaction systems, and it has now evolved into one that allows for database-as-a-service as well as, in some very recent news, becoming a data warehouse itself with data analytics.
Don Ong (03:28.123)
So, before we talk about data management where it is today, let's rewind a little bit and look at how it all started. Back in the day, managing data wasn't as easy or automated as it is now. We're talking about punch cards, tapes, and early file systems, things that laid the groundwork for everything we use today. So, as we're diving into the history of data management, why was it so important back then?
Joe Addiego: 4:04
Sure, Don, thank you. Well, the history of computing, I always see it as constantly moving upward in terms of the abstractions available. We've moved from assembly language to languages that were near the machine, to things like Fortran, and eventually to natural language, now that we're speaking to machines through AI. Data management was also a function that followed those abstractions in processing. So, you talked about punch cards, for instance, and I'm old enough that I started my career with punch cards. When I was in high school, I worked for the Treasury Department, and we distributed hundreds of thousands of checks, and those checks to social security recipients were on punch cards. Those punch cards actually had data punched into them, and they were early databases. I would say the room those punch cards lived in was probably 2,000 square feet, filled with cabinets of punch cards, and that was their data management. So, we've moved from that to tapes, to files and indexed files, and eventually to databases back in the 80s, and there were three types of early databases: hierarchical, network, and relational. And really what these were trying to model is the relationship of data. So, when the relationships can be modeled as one-to-many, you can model that very easily in that kind of database.
And relational databases were based on a relational model that allowed you to view all of the data through relational algebra at the time, and while that was another abstraction, it led to the rise of a structured query language, known as SQL. You can see it's been 40 years and that language is still profuse throughout the industry, and the actual success and permanence of SQL is due to it being a declarative language rather than an imperative language. A declarative language means that you don't define exactly the steps you need to take in order to implement the command; you just define the result that you want, and the underlying processing takes care of itself. And I think that's one of the threads we want to follow today, and that is again the rise of abstraction. SQL abstracted the data lookup away from the data itself, and that's why it was so popular. We'll follow that vector through time and see where we are today because of that.
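To make the declarative-versus-imperative contrast concrete, here is a minimal sketch using Python's standard-library sqlite3 module; the table and column names are illustrative only, not anything from the episode.

```python
import sqlite3

# Illustrative throwaway table, just to contrast the two styles.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("acme", 120.0), ("acme", 80.0), ("globex", 45.5)])

# Declarative (SQL): state WHAT you want; the engine decides how to get it.
total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("acme",)
).fetchone()[0]

# Imperative: spell out HOW, step by step, scanning the rows yourself.
total_imperative = 0.0
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    if customer == "acme":
        total_imperative += amount

assert total == total_imperative  # same answer, very different division of labor
```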
Another aspect of data back in the 1980s was that these became systems of record. They had to have something called ACID properties: atomicity, consistency, isolation, durability. Those properties in essence gave you a reliable system, a system you could count on, where for any transaction against the database you would get out what you put in. You'd be able to read the data, the data would be consistent and not corrupt, and if there was a problem, all of the changes would be backed out and the physical consistency of the database preserved. So, this is what I consider the prehistory of data management, before the internet.
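As a rough illustration of the atomicity and consistency Joe describes, here is a small sketch using Python's sqlite3 in explicit-transaction mode; the accounts table and the transfer rule are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions explicitly
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 0.0)])

def transfer(conn, src, dst, amount):
    """Move money atomically: either both updates land, or neither does."""
    try:
        conn.execute("BEGIN")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # a consistency rule of this toy schema
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # atomicity: back out every change made so far
        raise

transfer(conn, "alice", "bob", 60.0)       # succeeds and is committed
try:
    transfer(conn, "alice", "bob", 60.0)   # would overdraw, so it is rolled back
except ValueError:
    pass
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 40.0), ('bob', 60.0)]
```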
Don Ong (09:53.518)
So, Joe, now that we've explored the early days of data management, let's move forward to one of the most transformative periods, the internet era. The rise of the internet completely revolutionized how we manage, store, and share data. Suddenly we went from local systems to a world of global connectivity where data could be accessed and shared in real time. This was also the era when the sheer volume of data exploded from megabytes to terabytes and beyond.
That drove the need for faster, more scalable solutions, and it was the time when technologies like distributed systems, SQL databases, and real-time analytics started shaping the future. In this segment, we'll dive into how the internet era changed the game for data management, the role semiconductors played in supporting this shift, and how it laid the groundwork for the data-driven world we live in today. To start, how do you think the internet fundamentally reshaped the way we think about managing data?
Joe Addiego: 9:14
Well, I think before we get to the lessons that would apply today, we should talk about the internet era. The internet era really brought a huge change in data management. I'll start with a quote from Hegel, the German philosopher, who said "a change in quantity creates a change in quality." What you had with the internet era was a very dramatic change in quantity: the speed at which transactions were coming at you, the volume of transactions coming at you, and the variety of the transactions coming at you were all exceedingly larger than prior to the internet era, and that really drove a lot of changes in the data management space. Let's talk about some of those changes.
Joe Addiego: 11:45
So, there was a tremendous need for speed, making databases go faster. One of my earliest jobs was helping to sell hardware, and I recall doing many computer performance reviews where we were talking about 30 reads or writes per second, 30 database transactions per second. The internet era brought that number up tremendously and, as I mentioned, one of my investments was Aerospike. Aerospike now does close to 3 million transactions per second on a cluster. Billions of transactions per day are required, and a huge amount of change was necessary to do that: restructuring the internals of the database, moving from simple B-trees to log-structured merge trees, adding caches on the front end so that data would be served from memory first, and then the rise, as I said, of real-time databases such as Aerospike.
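The cache-in-front-of-the-database pattern Joe mentions can be sketched in a few lines of Python; the backing-store function and the TTL below are hypothetical stand-ins, not any particular product's API.

```python
import time

class ReadThroughCache:
    """Serve hot keys from memory and fall back to the slow store on a miss,
    with a simple time-to-live so entries eventually refresh."""

    def __init__(self, backing_store, ttl_seconds=30.0):
        self._store = backing_store          # e.g. a function that queries the database
        self._ttl = ttl_seconds
        self._cache = {}                     # key -> (value, expires_at)

    def get(self, key):
        hit = self._cache.get(key)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]                    # served from memory
        value = self._store(key)             # cache miss: go to the database
        self._cache[key] = (value, time.monotonic() + self._ttl)
        return value

# Hypothetical stand-in for a real database call.
def slow_db_lookup(key):
    time.sleep(0.05)
    return f"row-for-{key}"

cache = ReadThroughCache(slow_db_lookup, ttl_seconds=5)
print(cache.get("user:42"))   # first call pays the database cost
print(cache.get("user:42"))   # second call comes straight from memory
```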
So, the creation of databases focused on real time and high volumes enabled applications like Twitter. What is needed for applications like Twitter? Scale, for instance, the scale from megabytes to terabytes and beyond. You needed to go to distributed data architectures to enable that: you had to put caches in front of systems and distribute the data. And then there was data variety. Photos, videos, documents, and large objects were now being stored, rather than just transaction-oriented records.
So let me give you an example from an early investment that we made, a company called Cleversafe. As you recall, in the dawn of smartphones the phone became a camera as well as the computer you carried along with you with all your apps, which created the requirement for a lot more storage. All the photos that everybody was taking had to be stored someplace.
So, Cleversafe, an early investment for us, developed a distributed object storage system. They distributed that data across a plethora of servers all across the internet, encrypted it securely, and sharded it in such a way that only pieces of the data would be on any particular server. But you could recover the data with only a subset of the servers that you had posted the data to, because it also included a networking concept called feed-forward information.
So, there was a little bit of extra information on each server, and you didn't need every server to recover the data. That is really the only way that you can securely store billions of photos on the network, as is done now. And IBM eventually acquired Cleversafe for billions of dollars. So that was a successful investment, but it also showed a new paradigm required for data management.
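Cleversafe's actual dispersal scheme is more sophisticated, but a single-parity toy version in Python shows the core idea Joe describes: a little extra information on each server lets you rebuild the data even when one shard is lost. The shard count and sample payload below are made up for illustration.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_with_parity(data: bytes, n_shards: int = 4):
    """Split data into n equal shards plus one XOR parity shard.
    Any single lost shard can be rebuilt from the survivors."""
    shard_len = -(-len(data) // n_shards)                       # ceiling division
    padded = data.ljust(shard_len * n_shards, b"\x00")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(n_shards)]
    parity = reduce(xor_bytes, shards)
    return shards, parity

def recover_shard(remaining_shards, parity):
    """XOR everything that survived to reconstruct the missing piece."""
    return reduce(xor_bytes, remaining_shards, parity)

shards, parity = split_with_parity(b"billions of photos, safely dispersed")
lost = shards.pop(2)                        # pretend server #2 went offline
assert recover_shard(shards, parity) == lost
```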
Joe Addiego: 16:15
NoSQL databases were also key as these changes were happening because of the internet. I mentioned that document storage came out of the need for data variety; key-value storage, that was the Aerospike database; tabular storage, Cassandra; wide column, Google's Bigtable; graph databases; time-series databases.
What you had was a proliferation of databases for the wide variety of data that you needed to store. The architecture of an application changed considerably. From a single transaction database that sat at the core of an application, you now had front ends, either caches or real-time databases.
You still had your system of record, the transaction database, at the core of the application, but you also had a document database like MongoDB connected to store those objects, you know, your JSON files, and that enabled you to deal with this plethora of different kinds of information. So, a single database was no longer sufficient.
Well, given that that was what the application looked like, you now also had to have a messaging system to pass that data between each of the different components that you had. So, as I mentioned, I was at Talarian. That was a publish and subscribe messaging system.
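A publish-and-subscribe system of the kind Joe worked on at Talarian can be reduced to a toy sketch like the one below; the topic names and the in-process, synchronous broker are assumptions for illustration, whereas a real messaging system delivers asynchronously over the network.

```python
from collections import defaultdict
from typing import Callable, Dict, List

class Broker:
    """Toy publish/subscribe broker: publishers don't know who is listening;
    subscribers register interest in a topic and receive every message on it."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(message)     # a real broker would queue and deliver asynchronously

broker = Broker()
broker.subscribe("quotes.options", lambda m: print("risk engine saw", m))
broker.subscribe("quotes.options", lambda m: print("options repricer saw", m))

# One price change fans out to every interested component.
broker.publish("quotes.options", {"symbol": "XYZ", "last": 101.25})
```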
Joe Addiego: 18:13
In the early days the folks that used those things were mostly in the financial community. For instance, take the Philadelphia Stock Exchange, the one that deals with options. Every time you have a movement in the price of a stock, there really isn't that much data in the transaction if the change in price is all you're worried about.
But if you have options, puts and call options, and you have them for many different dates, the amount of data that has to change when there's a change in price is very dramatic. And at Talarian we were very successful, for instance, with the Philadelphia Stock Exchange and the Chicago Stock Exchange and given that we were able to handle these very high transaction volumes, we eventually then became the system for the American Stock Exchange and the NASDAQ as well. So, it showed that there was a need to move data.
Joe Addiego: 19:34
Message queues were another approach to messaging systems, where you queued data in, and then, with applications running continuously, you needed streaming data services, and Kafka was developed. So, overall, what this showed is that an application itself went from one particular database to many, with the need to move all that data around. In addition to that, there was then the need to analyze the data, generally not in real time, to do analytics or business intelligence on that data.
So, at the same time as this data proliferation was happening, you saw large data systems also rising and changing. Early systems in that era were things like Teradata and Greenplum; these data warehouses had very specific architectures that were necessary to manage very, very large amounts of data.
Joe Addiego: 21:04
But with the internet, what happened is people were able to use standard hardware, apply new data management approaches, and distribute the processing. The early instantiation of that was really Hadoop. Hadoop came with the Hadoop file system, a method to process that data, MapReduce, and ways of accessing it such as Hive. So now not only did you have your application, which had several databases, but you also had another system, usually fed with something like ETL, to move the data out of the application in batch, and that system sat behind it and was generally a non-real-time system where you would do data analysis.
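For readers who haven't seen the MapReduce model Joe refers to, here is a minimal single-process Python imitation of it; the input records and the word-count job are invented purely to show the map, shuffle, and reduce phases.

```python
from collections import defaultdict
from itertools import chain

# Toy stand-in for records spread across a distributed file system.
partitions = [
    ["lot A pass", "lot B fail"],
    ["lot A pass", "lot C pass"],
]

def map_phase(record):
    """Emit (key, 1) pairs, here one per word."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    return key, sum(values)

mapped = chain.from_iterable(map_phase(r) for p in partitions for r in p)
results = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(results)   # {'lot': 4, 'A': 2, 'pass': 3, 'B': 1, 'fail': 1, 'C': 1}
```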
Joe Addiego: 22:12
We'll talk about how, because of changes in hardware and the evolution of software, these systems are being drawn back. We'll talk about a specific use case in the semiconductor industry where real-time is advantageous. So, you want to do your analysis much closer to the time when the data is flowing into the system. But there has been an evolution even in the analytics space as well, moving from Hadoop to Parquet, Iceberg and larger independent systems like Snowflake and Databricks based on Apache Spark.
Don Ong (22:57.443)
So how did the rise of distributed systems and real-time databases pave the way for today's AI-driven innovation?
Joe Addiego (23:07.896)
Well, those systems enabled the AI systems, because AI needs very high-quality data and these data pipelines; the LLMs are trained on petabytes of data. So, what you've seen is that the movement toward very large data sets has continued, and it is now a necessity for training LLMs.
Another new pattern in the database world that's happening today because of AI is the addition of a vector database. In order to do a similarity search, you need to convert the raw data that you have, text, images, audio, into a vector representation. And if you do that vector representation using the same embedding model that the application you care about uses, you can then do a similarity search based on that vectorized data.
That is used quite often for systems that do what's known as retrieval-augmented generation, or RAG, applications. So that you don't have to go back to the LLM for the local data you're mostly concerned with, you would vectorize your own data, put it into a vector database, and then have the application search that directly prior to going back to an LLM.
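A stripped-down version of that retrieval step might look like the Python below; the bag-of-words "embedding" and the tiny corpus are stand-ins for the real embedding model and vector database a production RAG system would use.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words vector. A real system would call the same
    embedding model the rest of the application uses."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A miniature "vector database": documents stored alongside their vectors.
corpus = [
    "wafer sort yield dropped on lot 17",
    "final test retest rate improved after a recipe change",
    "new probe card qualified for high volume production",
]
index = [(doc, embed(doc)) for doc in corpus]

def retrieve(query: str, k: int = 1):
    """Similarity search: rank stored vectors against the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The retrieved passage would then be prepended to the LLM prompt (the RAG step).
print(retrieve("why did yield drop on lot 17?"))
```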
So, what you're seeing is the integration of yet another database into the data model. As I mentioned, the fact that data needed to scale to such a large extent with much faster performance, and that there was such a variety of data (people often talk about it as the three Vs: variety, velocity, and volume), forced changes in the way you had to manage the data.
Don Ong (26:46.504)
I agree. And now we're leaning into data today and also Gen AI. So, we've talked about how data management evolved from the early days to the internet era, and now we're into a whole new phase where data isn't just stored and managed but actively drives some of the most advanced technology we've ever seen. Generative AI, along with other AI and machine learning technologies, relies on massive amounts of high-quality, well-organized data to function effectively.
So, we'll explore why data is considered the lifeblood of AI, the challenge of managing data at such massive scales, and how industries like semiconductors are already applying generative AI to solve real-world problems. Let's look into why data is so essential for technologies like generative AI and other AI/ML models.
Joe Addiego: 27:47
As we mentioned, it's really the fact that AI relies on data. Data is used to train these AI systems, and they need to rely on a large amount of high-quality data, we're talking about petabytes of data to train the LLMs that exist right now.
Don Ong (28:14.808)
So we talked a lot, we mentioned a lot, about data quality. Can you explain a little about what role data quality and organization play in successfully training the large language models, the LLMs?
Joe Addiego (28:31.48)
Well, basically, there's been a rule in the data management space for many, many years. It was called GIGO, and that is garbage in, garbage out. That still applies 50 years later, so when you're training LLMs, you really need to know what you want to train them on.
It's actually not as important for the AI systems, because the neural networks can train just based on the data itself. In machine learning, it's especially important: you've got to label the data correctly, and there's a lot of data engineering required. You have to do feature engineering to understand what the machine learning algorithms are going to provide for you.
Don Ong (29:38.946)
Yes, and we read a lot about ChatGPT and OpenAI spending a lot of effort and resources on curating the data and making sure that data is clean, accurate, and free of bias. So how, in fact, do companies ensure data quality, traceability, and compliance as their data sets grow in complexity?
Joe Addiego (30:09.326)
There are new governance frameworks that are focused on building that. There is an entire focus on data management now that is working with frameworks that work with the metadata in order to provide that quality, traceability, and compliance.
Joe Addiego: 31:04
With the fact that there is a lot of data that's now used, the security model has changed. So, the security model now has to focus on your data.
I mean, it always has been important, but a way to change the algorithm, or the training weights for AI, would be to pollute the initial data. So, the security models now really have to take into account what is happening with the training data. Let's talk about, for instance, what security models currently exist.
You've got, right now, data in transit secured. You also have data at rest secured, so while the data is on your disk, it's secured. What you're looking at in the future is that you want to secure the data in memory and compute. That is right now an opening, and you know, in addition to that, you would like to be able to share data between AI and ML systems and with other members of a team. But you want to maintain data privacy. There are models for that that are just emerging right now, and I can introduce those to you. One of those is confidential computing.
Joe Addiego: 33:05
Confidential computing provides a trusted execution environment, a secure enclave. It's not just a hardware security module for the cryptographic keys that you're used to; it's an enclave that the encrypted data comes into.
The keys would be managed within that enclave. The data is decrypted only while it's in the enclave; it's not available outside, because it stays encrypted while it's in memory. It is then computed on in unencrypted form, and before it is put back to memory, and before any of the results are put back, those are also encrypted. That would enable the training data to be maintained in an encrypted form.
It would enable the weights to be maintained in an encrypted form, and if you used it in an inference system, the weights would be encrypted until they're used, and the inference data would be handled in an encrypted fashion as well.
So, it gives you a full cycle of encryption and I think that that is what we're going to be looking at in the future.
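The "full cycle" Joe describes depends on trusted-execution hardware, which a short script cannot reproduce, but the at-rest and in-transit layers he contrasts it with can be sketched with the third-party cryptography package (an assumption of this example); the record format and key handling below are hypothetical.

```python
# pip install cryptography   (third-party package, assumed available)
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice the key would live in a KMS or HSM;
cipher = Fernet(key)             # inside a TEE it would never leave the enclave

training_record = b'{"lot": "A17", "bin": 3, "vdd_mv": 748}'   # made-up sample record

# Data at rest / in transit: what is commonly secured today.
stored_blob = cipher.encrypt(training_record)   # safe to write to disk or send over the wire

# Data in use: today it must be decrypted to compute on it. In confidential
# computing, this decrypt-and-compute step happens only inside the enclave.
in_enclave = cipher.decrypt(stored_blob)
result = len(in_enclave)                        # placeholder for the real computation

# Results are re-encrypted before leaving the (hypothetical) enclave boundary.
protected_result = cipher.encrypt(str(result).encode())
```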
Joe Addiego: 35:14
Some of the benefits: it prevents unauthorized access, even by administrators; it helps with regulatory compliance; and it enables multi-party data sharing while maintaining data privacy.
You can have people share different data sets, and the system can compute over them while keeping the data private across all of the different parties. So, I think that's the future of data management: it adds the extra aspect of data being managed securely throughout its lifecycle.
Don Ong (35:59.042)
That's fantastic, because as we talk about generative AI, in the semiconductor world we keep saying there's a need to share data across different companies, across different players, especially as we're approaching 2nm and angstrom-era nodes. There are a lot of challenges, and there's a need for us to share data. But the key concern for a lot of players is losing our IP. We want to share data with each other, but we don't want to lose our IP.
Joe Addiego (36:37.736)
Absolutely. Let's take the semiconductor test equipment space; that could be a capstone that shows how data management has evolved and how it can use confidential computing in the future.
So, in the semiconductor testing space there's been a data format called Standard Test Data Format, STDF. It's a standard file format for storing tester information, lot information, and the semiconductor test results. The latest version is version four, from 2007. As you can see, that would be equivalent to one of the prehistories we talked about.
Advantest and Synopsys did a strategic development a couple of years ago, and you introduced something called ACS, Advantest Cloud Solutions, Nexus, and that system connects through the cloud on a real-time basis. Designed for outsourced semiconductor assembly and test, it connects various outsourced assembly and test vendors throughout the world and enables automated data streaming via an API in real time. You can then apply real-time analytics to that data, and over time, what you'll find is that people will want to connect those analytics and do machine learning and AI on that data, but they will want that data to be private to their particular systems.
That would be the use case for confidential computing. So, you've seen Advantest move from these file-based formats to the cloud, to real-time streaming, to AI and, in the future, potentially confidential computing.
All of this takes the view of improving yield, saving cost by reducing retests, and providing better quality. So, this is streamlining the data management processes for the various companies that are actually doing the testing. It's an example of how data management improvements over time are being applied everywhere, including the semiconductor test equipment space.
Joe Addiego: 40:23
I think it's always good to understand how these changes through industry are being applied. We now have many different systems available and, as often happens, you know you go through a proliferation and then there's a time for everything to kind of be pulled back in together.
So, what you're seeing is a kind of reintegration of all the different pieces that have been scattered about: the data managed across them and all of the messaging that happens between them. With that scattering, you allow the data to become inconsistent throughout the system, and there is a view that, to the extent that you can remove the potential for inconsistency, you're much better off. So, you're looking at a much tighter reintegration of data in the future, I believe.
Don Ong (42:33.998)
So, Joe, throughout this conversation, we've explored the history of data management, its evolution through the internet era and how it powers cutting edge technologies like generative AI today. As we wrap up, I want to shift our focus to the present and future.
Data management continues to evolve rapidly, with trends like cloud adoption, edge computing, and confidential computing shaping how organizations operate. The big question now is: how do they turn their data into a competitive advantage while staying secure and scalable?
So, we'd like to dive into the latest trends, cover opportunities for innovation, and get your advice for companies navigating challenges like data growth and cybersecurity. Let's start with this: what are some of the biggest trends you're seeing in data management today, and how can businesses prepare for what's coming?
Joe Addiego: 43:35
Thank you. Well, I think we've mentioned some of them, but again I'll go back to the idea that the silos of data people have are being brought together. Data analytics is being moved closer to, and not allowed to get out of sync with, the data that exists in the transaction systems.
That's, I think, the largest focus, and then the next focus is basically security, which is going to be applied across the data realms. One of the unique aspects of confidential computing is that it is not a lot of heavy lifting to adopt. It's more a function of whether the hardware supports it and whether you have the people in place who can put it together. Many of these systems don't require recoding, luckily.
Don Ong (45:08.763)
So, looking ahead, what do you think will be the next major breakthrough in data management and how should companies prepare?
Joe Addiego (45:19.646)
Well, that's a very good question.
Other than what we have talked about in terms of the integration, the confidential computing, I'm not really sure what the next breakthrough will be.
Don Ong (45:39.224)
So, let's move into the wrap-up phase. Joe, thank you so much for joining us today and sharing your incredible insights, from the early days of data management to the transformative impact of the internet and now the cutting-edge role of AI and generative AI. Your perspective has been inspiring. For our listeners, if you are as fascinated by today's discussion as I am, don't forget to subscribe, leave a review, and share this episode with anyone who's interested.
So, Joe, before we wrap up, any final thoughts or advice for our audience as they navigate the future of data-driven innovation?
Joe Addiego (46:20.844)
Thank you. No, I think we've discussed many different aspects, and I think the future is one where, you know, the data will continue to proliferate, and you'll need to secure your data much more than you do today.
Don Ong (46:44.542)
I'm excited about that and also for what's coming with confidential computing.
Thank you so much Joe!
Joe Addiego (47:00.844)
Thank you, Don.