The aim of this article is to provide an overview of Digital Preservation, including its definition, key terminology, needs, benefits, drawbacks, challenges, and various methods and techniques. The role of Artificial Intelligence in digital preservation is also explored, along with some common applications and hints on innovative infrastructure approaches. Finally, the article proposes a strategy that companies, organizations, and governments can use to get started with their own digital preservation efforts.
- Digital Preservation
- Needs
- Archiving Terms, Methods, and Techniques
- Artificial Intelligence in Digital Preservation
- Benefits
- Drawbacks and Challenges
- Digital Preservation Strategy
- Infrastructure Hints
1. What is Digital Preservation
Digital Preservation (DP) is the process of maintaining, managing, and storing digital content so that it remains accessible and usable for future generations. It is a proactive approach to managing digital content that includes taking steps to ensure it is not lost or corrupted over time.
To do this, preserving institutions must carefully select which materials to keep, manage them using best practices, and store them in an appropriate environment.
Digital Preservation encompasses everything from backing up data to preserving websites and digitized collections. It also allows us to keep pace with the ever-increasing volume of born-digital content being created every day, and it has the potential to help us solve problems in the present and future by giving us access to data that would otherwise be lost.
There are many ways to approach Digital Preservation, but one common thread is the need to create redundant copies of digital files and store them in multiple formats. This helps protect against data loss due to hardware failure or software obsolescence. Another key element is metadata: descriptive information about a file that can help identify it, describe its contents, and provide context for its use.
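The metadata idea above can be made concrete with a small sketch. It builds a descriptive record for one file and stores it next to the file as a JSON sidecar. The field names, the `.metadata.json` suffix, and the function names are assumptions chosen for illustration, not part of any metadata standard.

```python
import hashlib
import json
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path: Path, creator: str, description: str) -> dict:
    """Build a small descriptive-metadata record for one file."""
    data = path.read_bytes()
    mime_type, _ = mimetypes.guess_type(path.name)
    modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return {
        "filename": path.name,
        "size_bytes": len(data),
        # Guessed from the file extension only; None becomes a generic type.
        "mime_type": mime_type or "application/octet-stream",
        # The digest identifies the exact bit sequence that was described.
        "sha256": hashlib.sha256(data).hexdigest(),
        "modified_utc": modified.isoformat(),
        "creator": creator,
        "description": description,
    }


def write_sidecar(path: Path, record: dict) -> Path:
    """Store the record next to the file as <name>.metadata.json (assumed layout)."""
    sidecar = path.with_name(path.name + ".metadata.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Keeping the record in a plain-text sidecar is only one option; the same dictionary could just as well be written to a database, as long as the link between file and record is preserved.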
Many organizations are now turning to cloud-based storage solutions as a cost-effective way to preserve large volumes of digital data. The cloud provides an off-site backup that can be accessed from anywhere in the world, making it an ideal solution for disaster recovery planning.
Digital Preservation is a complex and multi-faceted endeavor, but it is essential to the future of our digital world.
2. Needs
In a rapidly digitizing world, Digital Preservation is essential for companies, organizations, and governments to preserve their digital assets. It is also important because it allows us to keep track of our history and culture more accurately and completely than ever before, and it ensures that our digital heritage is not lost as technology changes and file formats become obsolete.
There are many reasons why Digital Preservation is important.
- Data Loss: DP ensures that important information is not lost if a device fails or is damaged. As more and more information is stored electronically, the risk of data loss increases, and once data is lost it often cannot be recovered. Digital data is volatile and can easily be lost or corrupted. By preserving it, we can make sure that it will be available for future use.
- Asset Value: Digital Preservation helps to preserve the value of digital assets. In many cases, the value of a digital asset lies in its uniqueness – such as a rare photo or video footage of an event. If this asset is not preserved, it may be lost forever, or its value may diminish over time as similar assets are created.
- Access over time: DP ensures that information remains accessible over time. As technology changes, it can become difficult to access older electronic files. By preserving these files, companies and organizations can ensure that they will be able to access them in the future – even if the technology used to create them is no longer supported.
- Cost Saving: Digital Preservation helps us save money in the long run. Although it may cost money upfront to preserve digitized data, it can ultimately save money because a properly maintained digital copy does not wear out with use the way a physical original does. In addition, once data is preserved in a particular format, it can be used over and over again at little additional cost. Another reason for preserving digital information is that it can be very costly and time-consuming to recreate it. For example, if an organization digitizes its archival records but does not have a good plan for preserving them, it may have to do the work all over again in a few years when the records become unreadable due to format obsolescence or data rot.
- Enhanced Access: DP enhances access to information. When data is properly preserved, it becomes much easier and faster to search for and retrieve specific information when needed. This is especially useful for researchers who often need quick access to large amounts of information.
- Historical Record: Digital Preservation contributes to the overall goal of building an accurate historical record of our society and culture. By preserving digitally created content such as websites, social media posts, and email messages, we can create a rich store of knowledge that will help us better understand our past and present.
- Lessons Learned: Preserving digital information helps to ensure that history does not repeat itself. If we do not preserve our collective memory in digital form, we run the risk of forgetting lessons learned from past events, both good and bad.
- Legal and Regulatory: DP can help organizations to meet legal and regulatory requirements. In some cases, companies or governments are required to keep certain records for a specific period of time. By preserving these records electronically, they can ensure that they will be able to meet these requirements.
Digital Preservation is therefore essential for companies, organizations and governments that want to protect their information assets and ensure that they remain accessible and usable into the future.
3. Archiving Terms, Methods, and Techniques
3.1 Archiving Terms
The life cycle of a document is the process that it goes through from its creation to its eventual disposal. The length of time that a document spends in each stage of its life cycle varies depending on the type of document and how important it is.
There are four main types of archiving: short-term, medium-term, long-term, and permanent.
- Short-term archiving is typically used for records that are not needed on a regular basis but may need to be accessed occasionally. These records are usually kept for a period of 1-5 years.
- Medium-term archiving is used for records that are needed on a more regular basis, but not necessarily every day. These records are usually kept for a period of 5-10 years.
- Long-term archiving is used for records that need to be preserved indefinitely. These records are usually kept in an archive that is separate from the organization’s main premises.
- Permanent archiving is used for records that have been identified as being of permanent value. These records are usually kept in a completely separate and fully restricted archive.
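As a rough illustration, the four archiving types above can be expressed as a small retention schedule. This is a minimal sketch: the enum, the table name, and the rule that a record "expires" at the upper bound of its range (5 and 10 years) are assumptions for illustration; real retention periods come from an organization's own policies and legal obligations.

```python
from datetime import date
from enum import Enum
from typing import Optional


class ArchiveTier(Enum):
    SHORT_TERM = "short-term"
    MEDIUM_TERM = "medium-term"
    LONG_TERM = "long-term"
    PERMANENT = "permanent"


# Upper bound of each retention window in years, taken from the ranges above.
# None means the tier has no planned end date.
MAX_RETENTION_YEARS: dict[ArchiveTier, Optional[int]] = {
    ArchiveTier.SHORT_TERM: 5,
    ArchiveTier.MEDIUM_TERM: 10,
    ArchiveTier.LONG_TERM: None,
    ArchiveTier.PERMANENT: None,
}


def retention_expired(tier: ArchiveTier, archived_on: date, today: date) -> bool:
    """Return True once a record has passed the upper bound of its tier."""
    limit = MAX_RETENTION_YEARS[tier]
    if limit is None:
        return False
    # Whole years elapsed, counting a year only once its anniversary is reached.
    before_anniversary = (today.month, today.day) < (archived_on.month, archived_on.day)
    elapsed = today.year - archived_on.year - int(before_anniversary)
    return elapsed >= limit
```

A record flagged by `retention_expired` would typically go to a review step rather than being deleted automatically, since some records are promoted to a longer tier.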
3.2 Methods and Techniques Categories
There are a variety of methods and techniques used for Digital Preservation, which can be broadly divided into three categories: technical measures, organizational measures, and user engagement.
- Technical measures involve taking steps to ensure the long-term viability of digital files. This can include format migration, which involves converting files from one format to another as technology changes; data normalization, which ensures that files are consistent and compliant with standards; and file fixing, which repairs or replaces damaged or corrupt files. Technical measures also encompass storage strategies like bitstream copy, in which an exact replica of a file is stored; emulation, in which software is emulated to allow access to older formats; and virtualization, in which a file is stored in a simulated environment.
- Organizational measures are designed to ensure that digital information is properly managed and maintained over time. This can include creating policies and procedures for curation and preservation; developing workflows for managing digital assets; establishing roles and responsibilities for staff involved in preservation activities; and training staff on best practices for preserving digital information. Organizational measures also encompass building partnerships with other institutions or organizations engaged in similar activities, as well as conducting outreach to raise awareness about the importance of preserving digital information.
- User engagement refers to efforts to promote the use of preserved digital content and increase its value over time. This can include creating metadata describing preserved content so that it can be more easily discovered and used by researchers; providing access to preserved content through online portals or repositories; developing tools or applications that make it easier to use preserved content; organizing events or workshops focused on using preserved content; and writing blog posts or articles about interesting ways that preserved content has been used. User engagement also encompasses efforts to solicit feedback from users about their needs and experiences using preserved content so that improvements can be made over time.
3.3 Technical Measures
Digital Preservation technical measures can be broadly divided into seven categories: refreshing, migration, replication, emulation, encapsulation, persistent archives, and metadata attachment.
- Refreshing is a Digital Preservation technique that involves periodically replacing outdated media with new media in order to avoid data loss due to media degradation. This helps to ensure that the content remains accessible and usable over time.
- Migration is another Digital Preservation technique that involves moving content from one format to another as well as from one storage medium to another. As technology evolves, older file formats can become obsolete and unreadable by newer software. Migration helps to ensure that the content remains accessible and usable as technology changes. To support it, many organizations are working on standards for long-term storage of digital data. Some common formats used for Digital Preservation include PDF/A, TIFF, JPEG 2000, and XML.
- Replication is a Digital Preservation technique that involves creating copies of digital content. This helps to ensure that the content remains accessible even if the original is lost or damaged in the event of hardware or software failure.
- Emulation is a Digital Preservation technique that involves emulating the original environment in which the content was created. This helps to ensure that the content can be accessed and used in future environments.
- Encapsulation is a Digital Preservation technique that involves packaging digital content in such a way that it can be easily transported and used in different environments. This helps to ensure long-term accessibility of the content.
- Persistent Archives are long-term storage solutions designed for archival purposes. These help to ensure that digital content can be stored safely for extended periods of time without risk of loss or damage.
- Metadata attachment is the practice of adding machine-readable metadata to digital content. Metadata is vital for Digital Preservation projects. It can describe the content of a file, when it was created, who created it, and any other relevant information. Metadata can be stored along with the files themselves or in a separate database. This helps to improve the discoverability and usability of the content over time.
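Refreshing, the first technique in the list above, can be sketched as a copy followed by a verification step. The example below is a minimal illustration that treats two directories as the old and new media; the function names and the choice of SHA-256 are assumptions, and a production workflow would add logging and retries.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_file(source: Path, new_media_dir: Path) -> Path:
    """Copy source onto new media and confirm the copy is bit-identical."""
    new_media_dir.mkdir(parents=True, exist_ok=True)
    target = new_media_dir / source.name
    shutil.copy2(source, target)  # copy2 also carries over timestamps
    if sha256_of(source) != sha256_of(target):
        # Remove the bad copy; the copy on the old media stays untouched.
        target.unlink()
        raise IOError(f"Verification failed for {source.name}; old copy kept")
    return target
```

The important design point is the order of operations: the old copy is only retired after the new one has been verified, so a failed transfer never leaves the collection with zero good copies.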
4. Artificial Intelligence in Digital Preservation
Artificial Intelligence, Machine Learning, Natural Language Processing, and Robotic Process Automation are some of the cutting-edge technologies applied to Digital Preservation.
- Artificial Intelligence (AI) is a field of computer science that studies the ability of machines to perform tasks which are traditionally associated with human intelligence.
- Machine Learning (ML) is an AI discipline that focuses on teaching computers to learn from data and make predictions based on patterns in the data.
- Natural Language Processing (NLP) is an AI discipline that focuses on the ability to understand human language. This includes both written and spoken language.
- Robotic Process Automation (RPA) is a technology that enables organizations to automate repetitive, rules-based tasks.
AI, ML, NLP, and RPA can play a role in Digital Preservation through a number of different applications and technologies.
- Metadata Identification and Extraction: AI, ML, and NLP can be used in Digital Preservation through the development of algorithms that automatically identify, extract, and generate metadata from digitized content. NLP can be used to analyze unstructured text, such as PDF documents and scanned images of handwritten text, and to extract relevant information for preservation purposes. This information can then be used to create structured data that can be more easily searched and analyzed.
- Predictive Models: AI can be used to develop predictive models that can help assess the risk of digital information becoming lost or corrupted over time. These models can be used to prioritize preservation efforts and make decisions about which content is most at risk and needs more frequent monitoring. ML can be used to develop predictive models of user behavior that can help guide decisions about when and how to migrate digital content to new formats or platforms.
- Developing new preservation methods: Digital Preservation also includes developing new methods to preserve born-digital materials, and here too AI has potential applications. For instance, Machine Learning could be used to develop new file format identification schemes that are more accurate than existing ones.
- Hardware and software emulation: AI could be used to create emulations of obsolete hardware and software platforms so that older digital content remains accessible even when the original platform is no longer available.
- Ingestion: AI intelligent document capture can automate the ingestion of documents into a repository. This can help reduce processing time and improve accuracy by reducing the need for manual data entry.
- Location: AI-based search engines can be used to more effectively locate documents within a repository. This is particularly useful when dealing with large collections of documents. By using Natural Language Processing, these search engines can interpret queries in order to provide more relevant results.
- Pattern Identification: Machine Learning algorithms can be used to identify patterns in data that might otherwise be missed by humans. This type of analysis can be used to detect issues with digital objects or to predict future storage needs.
- Identify at-risk documents: Machine Learning can be used to identify documents that are at risk of degradation or obsolescence. By identifying these documents early, they can be given priority for digitization or migration to new formats.
- File Format Identification: ML is being used in Digital Preservation in the identification of file formats. File format obsolescence is a major challenge faced by those responsible for preserving digital information, as outdated formats can become unreadable over time. Machine Learning can be used to automatically identify new and unknown file formats, as well as predict when existing formats are likely to become obsolete.
- Detection of duplicate or near-duplicate content: ML can detect duplicate or near-duplicate content. This can be particularly useful in large collections of documents, where manually checking for duplicates would be impractical.
- Group similar contents: Machine Learning can be used to group together similar items, which can then be reviewed by a human curator.
- Predict Future access patterns: Machine Learning can be used to predict future access patterns and optimize Digital Preservation strategies accordingly.
- Preservation plans: Machine Learning can be used to automatically generate preservation plans for individual digital objects based on their characteristics and past behavior.
- Media damage detection: Machine Learning algorithms can be trained to detect features in images that are indicative of damage or deterioration, such as cracks, stains, or discoloration. Audio files can be analyzed to identify degradation and potential problems with playback quality over time. Video files can be analyzed for signs of degradation, such as pixelation, blockiness, or artifacts introduced by compression algorithms.
- Textual Data Analysis: Digital documents can be analyzed using Machine Learning algorithms for a variety of purposes, including identification of similar documents, topic modeling, and named entity recognition.
- Manipulated Content Identification: Machine Learning is being applied increasingly often to the problem of detecting fake or manipulated digital content.
- Route Documents to Archive: RPA can be used to scan documents for specific keywords or metadata values and then route them to the appropriate archive. RPA can also be used to extract data from documents and populate databases or other storage systems.
- Task Automation: RPA can reduce the human effort to perform time-consuming and labor-intensive tasks. RPA can also improve accuracy and consistency by eliminating human error from manual tasks. Additionally, RPA can free up staff time so they can focus on more value-added activities.
- Process optimization: Digital Preservation can use complex processes, involving a wide range of stakeholders and often large volumes of data. RPA can help to streamline these processes, making them more efficient and effective. In addition, RPA can provide an audit trail of all actions taken, which can be valuable in ensuring compliance with internal policies or external regulations.
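To give a feel for the duplicate and near-duplicate detection item in the list above, here is a deliberately simple baseline. It is not a Machine Learning model: it compares documents by the overlap of their three-word shingles (Jaccard similarity), which is the kind of rule-based starting point an ML approach would be expected to improve on. The function names and the 0.5 threshold are assumptions for illustration.

```python
from itertools import combinations


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Split text into overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles the two sets have in common (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(
    docs: dict[str, str], threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """Return every pair of documents whose similarity reaches the threshold."""
    sets = {name: shingles(text) for name, text in docs.items()}
    pairs = []
    for left, right in combinations(sorted(sets), 2):
        score = jaccard(sets[left], sets[right])
        if score >= threshold:
            pairs.append((left, right, score))
    return pairs
```

Comparing every pair is fine for a few thousand documents. For larger collections, techniques such as MinHash are commonly used to avoid the pairwise comparison, and flagged pairs would still go to a human curator for review.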
Digital Preservation is an ongoing process that requires active management and the use of AI and emerging technologies to ensure the long-term accessibility of digital information.
5. Benefits
The benefits of Digital Preservation include:
- Data Protection: Digital Preservation can help to prevent data loss. This is because digital information can be stored in multiple formats and locations, which makes it less likely that all copies will be lost or destroyed.
- Data Access: Digital Preservation can make it easier to access archived data. This is because digitized information can be searched more easily than paper records.
- Ensuring long-term access: By preserving digital information and data, we can ensure that it will be available for use by future generations.
- Reducing costs: Digital Preservation can save money by reducing the need to migrate or convert data as technology changes over time.
- Improving efficiency: Automated processes can help speed up workflows and improve efficiency in retrieving stored data.
- Enhancing security: Storing data in a central location or distributed locations can help reduce the risk of loss or damage due to natural disasters or other events beyond our control.
- Supporting research: Data preserved for the long term can support new research initiatives and help answer questions that were not yet possible to ask when the data was first created.
- Fostering collaboration: By sharing preserved data, researchers from around the world can collaborate on projects more easily than ever before.
- Increasing transparency: Long-term storage of data makes it possible to track changes over time, providing a record of activities that can promote transparency in organizations.
- Enabling reuse: Data that has been preserved can be reused in new ways, such as for new studies or products.
- Improving user experience: A well-designed system for accessing preserved data can make it easier for users to find what they need.
- Providing peace of mind: Knowing that your organization’s valuable digital information is being safeguarded for the future can provide peace of mind in an uncertain world.
6. Drawbacks and Challenges
Digital Preservation is a complex and ongoing process that requires careful planning and management to ensure the continued availability of digital information. There are some significant challenges associated with preserving digital information.
- Refreshing: There is a risk that data loss will occur during the transfer of digital files from one generation of storage media to another.
- Migration: Digital formats can become obsolete quickly, making long-term access to files difficult. As technology evolves, older file formats become obsolete and cannot be opened by newer software programs. This can make it difficult to access and preserve older digital documents. Migrating data to new formats as technology changes can be complex, time-consuming, and expensive. Additionally, there is no guarantee that all of the information in the original file will be preserved in the migrated file. File format conversions are best dealt with asynchronously in order to not drain system resources and bog down other functionality.
- Replication: When replication is only implemented in the application layer, it can use up system resources while replications are being created. Also, synchronous replication can slow down ingest workflows. If you only rely on bucket replication policies, the copies will not be independent, and corrupted data could overwrite good copies.
- Metadata Attachment: It can be difficult to create and maintain accurate metadata records. Metadata extraction is not often implemented during ingest, even if a microservices approach is implemented. This makes it tricky, if not impossible, to perform later on demand if local practices change or if new and better metadata extraction tools become available. The same can be said for file characterization.
- Data Integrity: Current strategies for fixity checking remain mostly at the file level, which does not scale well. Relying only on cryptographic digest algorithms makes the problem worse. As repositories get bigger, file-level fixity needs to evolve into aggregate practices.
- Managing the vast amount of data: Another challenge is managing the vast amount of data that can be generated by digitization projects. Without adequate management tools and processes in place, it can be difficult to keep track of all the digital assets created or acquired by an organization. This can lead to duplication of effort and wasted resources.
- Cost: Cost can be a major barrier to effective Digital Preservation. The initial costs associated with digitization projects can be significant, and ongoing costs such as storage, staffing, and equipment must also be considered.
- Technical competence: The technical skills required to manage digital assets can be difficult to find and retain.
- Hardware obsolescence: As new computers and other devices are released, older ones become less common and may eventually cease to function altogether. This can make it difficult to access data that is stored on outdated hardware.
- Software obsolescence: There is a risk that software used for Digital Preservation and document archival will become obsolete or unsupported, rendering archived materials unreadable or inaccessible.
- Data retrieval: It can be time-consuming to retrieve data from a long-term storage system, especially if the data is not well organized. This is because the data must be located, extracted, and then converted into a format that can be used by modern computer applications. In some cases, it may be difficult to locate specific pieces of information within large collections of stored data without robust indexing tools.
- Multiple locations: Digital objects are often stored in multiple locations, which can complicate preservation efforts.
- Security: Hackers could target digitized archives in an attempt to alter or delete preserved content.
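The data integrity item above calls for moving from file-level fixity toward aggregate practices. One possible reading of that idea, offered here as an assumption rather than an established method, is to keep a manifest of per-file digests and fold it into a single collection-level digest, so that one comparison reveals whether anything in the collection has changed. All function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of one file (read whole; fine for a sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's relative path to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def aggregate_digest(manifest: dict[str, str]) -> str:
    """Fold a whole manifest into one digest over its sorted entries."""
    combined = hashlib.sha256()
    for rel_path in sorted(manifest):
        combined.update(f"{manifest[rel_path]}  {rel_path}\n".encode("utf-8"))
    return combined.hexdigest()


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List paths that were added, removed, or altered between two manifests."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

In routine audits only the aggregate digests need to be compared. The per-file manifest is consulted only when they differ, which keeps the common case cheap as the repository grows.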
Despite these challenges, there are many ways to overcome them through proper planning and execution. By understanding the risks and challenges associated with Digital Preservation, organizations can develop strategies for mitigating them. This includes investing in digital archiving solutions that can help to ensure long-term accessibility and usability of digital content.
7. Digital Preservation Strategy
There are many factors to consider when developing a Digital Preservation strategy, including the types of content to be preserved, the format(s) in which they will be stored, and the tools and processes needed to ensure their longevity. While there is no one-size-fits-all solution, there are some general principles that all strategies should adhere to.
- Define what Digital Preservation and document archiving mean for your organization. Digital Preservation and document archiving are the processes of ensuring that digital information and documents remain accessible and usable in the future. This includes protecting against data loss, format obsolescence, and other risks.
- Define your goals. Before you can develop a strategy, you need to know what you want to achieve with it. What are your goals for preserving and archiving digital information? Do you want to ensure long-term access to corporate records? Make sure historical data is available for research? Be able to comply with legal requirements? Once you know your goals, you can start developing a plan to achieve them.
- Identify your stakeholders. Who will be affected by or interested in your Digital Preservation and document archiving strategy? Identifying your stakeholders will help you understand their needs and how they can help or hinder your efforts. Stakeholders might include management, IT staff, lawyers, researchers, customers, or the general public.
- Assess your current situation. Take stock of what digital information and documents you have, where they’re stored, and in what format(s). This will give you a better understanding of the scope of work involved in implementing a Digital Preservation strategy. It will also help identify any gaps in your current storage and backup procedures.
- Develop policies and procedures. Establishing policies and procedures for managing digital information will help ensure that it is preserved effectively over time. These should cover everything from deciding which formats to use for storing data to setting up regular backups.
- Select appropriate technologies. There are many different technologies available for preserving digital information. Some common options include cloud storage, content management systems, digitization software, and encryption. Choosing the right technology depends on many factors including budget, scalability, compatibility, and security.
- Implement storage solutions. Once you’ve selected appropriate technologies, it’s time to put them into practice by setting up storage solutions that meet your organization’s needs. This might involve configuring cloud storage, installing digitization software, or setting up an encrypted backup system.
- Train staff on best practices. To ensure that everyone understands how to properly preserve and archive digital documents, provide training on best practices such as file naming conventions, metadata standards, and proper handling of media.
- Regularly review and update your strategy. As technologies and organizational needs change over time, it’s important to regularly review and update your Digital Preservation and document archiving strategy. This will help ensure that it remains effective and relevant.
- Seek external assistance when needed. Don’t hesitate to seek out expert help when needed, whether it’s hiring a consultant to assess your current situation or working with a vendor that specializes in Digital Preservation solutions.
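The "assess your current situation" step above usually starts with an inventory. As a minimal sketch, the function below walks a directory tree and summarizes how many files and bytes exist per file extension. The function name and the `"(none)"` label for files without an extension are assumptions, and an extension is only a hint about the real format.

```python
from collections import Counter
from pathlib import Path


def inventory_by_format(root: Path) -> dict[str, dict[str, int]]:
    """Count files and total bytes per (lower-cased) file extension."""
    counts: Counter = Counter()
    sizes: Counter = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return {ext: {"files": counts[ext], "bytes": sizes[ext]} for ext in sorted(counts)}
```

Even a coarse summary like this helps scope the work: it shows which formats dominate the collection and where obsolete or unexpected formats are hiding.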
With the right planning and execution, Digital Preservation and document archiving can be highly effective tools for protecting an organization’s critical data and records. When implemented correctly, these practices can help to ensure that an organization’s information is accessible and usable for years to come.
8. Infrastructure Hints
The architectures of future Digital Preservation and document archiving systems are expected to be based on scalable, distributed, heterogeneous, and interoperable systems. The trend is towards more use of open standards, commodity hardware, and cloud-based solutions.
Digital Preservation infrastructure is critical for ensuring the long-term sustainability of digital resources. However, current infrastructure is often outdated and inadequate for the task.
Systems will need to be able to ingest large volumes of data at high speeds and from a variety of sources. Data will need to be stored in a variety of formats (including structured, unstructured, and semi-structured data) across a range of media. Data will also need to remain accessible for long periods of time (possibly centuries), even as technology changes. This will require emulation techniques, which make data readable by future generations of software and hardware.
We need to adopt modern tools and practices such as Software-Defined Storage, Containers, and Serverless Computing. These tools are more efficient and scalable than legacy systems, and they also have a smaller carbon footprint. Additionally, they are easier to maintain and develop, which is important for preserving cultural heritage.
In the past 20 years, software-defined storage, containers, and serverless computing have emerged. While some organizations are using them, the sector as a whole and many distributed Digital Preservation systems have not adopted them. These are some of the tools we need to evolve Digital Preservation to handle the scale and size of our collections.
Virtualization gave birth to cloud computing and storage. This resulted in apps being containerized into smaller, more efficient packages. Containers can operate in a dedicated, optimized environment which allows developers to do computationally demanding tasks on demand.
The evolution toward Serverless Computing models, such as functions-as-a-service, enables efficient, on-demand microservices that can be called at any time, which in turn enables asynchronous workflows. Moving these functions to a serverless platform can also reduce the carbon impact.
Software-Defined Storage is a game changer because it enables developers to break free from filesystem limitations. It also natively includes features that support Digital Preservation. While there are other middleware storage abstractions, they have not had the same broad user base.
Software-defined storage allows you to use different types of commodity hardware to create a scalable storage network that supports file, block, and object interfaces. External systems are easier to integrate with because of the flexible and extensible HTTP accessible APIs for object storage.
Most software-defined storage networks support data integrity through erasure coding, data scrubbing, and CRC checks, similar to RAID at a massive scale. Using CRCs in the storage infrastructure instead of cryptographic digests in the application layer reduces the computation required, and with it the environmental impact. The storage network can be optimized for the desired level of protection and failure handling.
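The difference between a CRC and a cryptographic digest can be seen in a few lines. This sketch uses CRC-32 from `zlib` and SHA-256 from `hashlib` as stand-ins; which algorithms a given storage system actually uses is an assumption here. Both detect accidental corruption such as a flipped bit, but a CRC is much shorter and cheaper to compute, and it offers no protection against deliberate tampering.

```python
import hashlib
import zlib


def crc32_hex(data: bytes) -> str:
    """32-bit cyclic redundancy check, formatted as 8 hex characters."""
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"


def sha256_hex(data: bytes) -> str:
    """256-bit cryptographic digest, formatted as 64 hex characters."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate bit rot by inverting a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


original = b"preserve me" * 1000
damaged = flip_bit(original, 12345)

# Both checks notice the single flipped bit.
print(crc32_hex(original) != crc32_hex(damaged))
print(sha256_hex(original) != sha256_hex(damaged))
```

A reasonable division of labor is therefore to let the storage layer use CRCs for frequent scrubbing against accidental damage, and to reserve cryptographic digests for less frequent checks where authenticity matters.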
Bucket Replication Policies can be used to replicate data to georedundant locations within the storage infrastructure layer. However, to retain the independence of the copies, this should be combined with other data integrity measures such as independent object‐level fixity checks comparing the original digest with all replications.
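The independent object-level fixity check described above can be sketched as follows. The digest recorded at ingest is treated as the reference, and every replica is compared against it instead of against another replica, so a corrupted copy cannot silently become the new "truth". Replicas are passed in as in-memory bytes to keep the example self-contained; fetching them from real storage locations, and the location names used below, are assumptions.

```python
import hashlib
from typing import Mapping


def audit_replicas(recorded_sha256: str, replicas: Mapping[str, bytes]) -> dict[str, bool]:
    """Check every replica against the digest recorded at ingest."""
    return {
        location: hashlib.sha256(content).hexdigest() == recorded_sha256
        for location, content in replicas.items()
    }


def corrupted_locations(report: Mapping[str, bool]) -> list[str]:
    """Names of the locations whose copy no longer matches the record."""
    return sorted(location for location, ok in report.items() if not ok)


# Hypothetical example: three locations, one of which has been damaged.
payload = b"object payload"
recorded = hashlib.sha256(payload).hexdigest()
report = audit_replicas(
    recorded,
    {"site-a": payload, "site-b": payload, "site-c": b"object pay1oad"},
)
print(corrupted_locations(report))
```

A damaged location found this way would be repaired from a copy that passed the audit, never the other way around.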
Asynchronous serverless functions can be used to extract metadata from images, videos, and text files. Ingesting large amounts of data can take a long time and performing metadata extraction as a serverless function can speed up the process. Additionally, on demand metadata extraction can be used to extract metadata from files that have already been ingested. This can be useful when new file formats are added or when changes to the extraction process are needed.
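The asynchronous pattern described above can be imitated locally with `asyncio`. This is a sketch under stated assumptions, not a serverless deployment: a queue stands in for the event trigger, worker tasks stand in for function invocations, and `extract_metadata` is a placeholder for a real extraction tool. The point it illustrates is that ingest only enqueues the file and returns, while extraction happens separately and can be re-run later on demand.

```python
import asyncio
import hashlib
import mimetypes
from pathlib import Path


def extract_metadata(path: Path) -> dict:
    """Blocking extraction step; a stand-in for a real extraction tool."""
    data = path.read_bytes()
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "path": str(path),
        "size_bytes": len(data),
        "mime_type": mime_type,
        "sha256": hashlib.sha256(data).hexdigest(),
    }


async def ingest(path: Path, queue: asyncio.Queue) -> None:
    """Accept the file immediately and defer extraction to a worker."""
    await queue.put(path)


async def extraction_worker(queue: asyncio.Queue, results: list) -> None:
    """Pull paths off the queue and run the blocking step in a thread."""
    while True:
        path = await queue.get()
        try:
            results.append(await asyncio.to_thread(extract_metadata, path))
        except Exception as exc:  # record the failure instead of stopping the worker
            results.append({"path": str(path), "error": str(exc)})
        finally:
            queue.task_done()


async def run_pipeline(paths: list[Path], workers: int = 2) -> list[dict]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[dict] = []
    tasks = [asyncio.create_task(extraction_worker(queue, results)) for _ in range(workers)]
    for path in paths:
        await ingest(path, queue)
    await queue.join()  # wait until every queued file has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

Calling `asyncio.run(run_pipeline(paths))` on already-ingested files is the local equivalent of on-demand re-extraction after the extraction logic has changed. Results arrive in completion order, not input order.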
If we continue to use legacy stacks, we will have to put more labor into maintaining and developing them. However, if we switch to modern stacks, the logic will be simpler (and therefore easier to maintain), and the skillset needed is much more available since many Fortune 500 companies use the same infrastructure. In addition, it will be easier to recruit developers to work with these tools than for boutique software.
Modern stacks may seem to use more novel components, but they are lean, optimized, and efficient, which reduces the carbon footprint. This matters because preserving cultural heritage should not come at the expense of the planet or people, and many physical repositories are themselves in at-risk locations given the effects of climate change.
The future of Digital Preservation will be shaped by advances in technology, changes in user expectations and needs, and the continuing evolution of best practices. New technologies will provide more effective ways to preserve digital information and data, while changes in user expectations will drive the need for more user-friendly interfaces and tools. The evolution of best practices will continue to promote the use of standards-based approaches to ensure long-term access to digital content.
This article was written by:
Giovanni Sisinna
Director of Program Management
LinkedIn