Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. When Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks
Add this skill
npx mdskills install sickn33/docx-official
Comprehensive DOCX workflows with excellent redlining guidance and minimal-edit principles for professional document review
---
name: docx
description: "Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. When Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks"
license: Proprietary. LICENSE.txt has complete terms
---

# DOCX creation, editing, and analysis

## Overview

A user may ask you to create, edit, or analyze the contents of a .docx file. A .docx file is essentially a ZIP archive containing XML files and other resources that you can read or edit. You have different tools and workflows available for different tasks.

## Workflow Decision Tree

### Reading/Analyzing Content
Use "Text extraction" or "Raw XML access" sections below

### Creating New Document
Use "Creating a new Word document" workflow

### Editing Existing Document
- **Your own document + simple changes**
  Use "Basic OOXML editing" workflow

- **Someone else's document**
  Use **"Redlining workflow"** (recommended default)

- **Legal, academic, business, or government docs**
  Use **"Redlining workflow"** (required)

## Reading and analyzing content

### Text extraction
If you just need to read the text contents of a document, you should convert the document to markdown using pandoc. Pandoc provides excellent support for preserving document structure and can show tracked changes:

```bash
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
```

### Raw XML access
You need raw XML access for: comments, complex formatting, document structure, embedded media, and metadata. For any of these features, you'll need to unpack a document and read its raw XML contents.

#### Unpacking a file
`python ooxml/scripts/unpack.py <office_file> <output_directory>`

#### Key file structures
* `word/document.xml` - Main document contents
* `word/comments.xml` - Comments referenced in document.xml
* `word/media/` - Embedded images and media files
* Tracked changes use `<w:ins>` (insertions) and `<w:del>` (deletions) tags

## Creating a new Word document

When creating a new Word document from scratch, use **docx-js**, which allows you to create Word documents using JavaScript/TypeScript.

### Workflow
1. **MANDATORY - READ ENTIRE FILE**: Read [`docx-js.md`](docx-js.md) (~500 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for detailed syntax, critical formatting rules, and best practices before proceeding with document creation.
2. Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components (You can assume all dependencies are installed, but if not, refer to the dependencies section below)
3. Export as .docx using Packer.toBuffer()

## Editing an existing Word document

When editing an existing Word document, use the **Document library** (a Python library for OOXML manipulation). The library automatically handles infrastructure setup and provides methods for document manipulation. For complex scenarios, you can access the underlying DOM directly through the library.

### Workflow
1. **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for the Document library API and XML patterns for directly editing document files.
2. Unpack the document: `python ooxml/scripts/unpack.py <office_file> <output_directory>`
3. Create and run a Python script using the Document library (see "Document Library" section in ooxml.md)
4. Pack the final document: `python ooxml/scripts/pack.py <input_directory> <office_file>`

The Document library provides both high-level methods for common operations and direct DOM access for complex scenarios.

## Redlining workflow for document review

This workflow allows you to plan comprehensive tracked changes using markdown before implementing them in OOXML. **CRITICAL**: For complete tracked changes, you must implement ALL changes systematically.

**Batching Strategy**: Group related changes into batches of 3-10 changes. This makes debugging manageable while maintaining efficiency. Test each batch before moving to the next.

**Principle: Minimal, Precise Edits**
When implementing tracked changes, only mark text that actually changes. Repeating unchanged text makes edits harder to review and appears unprofessional. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]. Preserve the original run's RSID for unchanged text by extracting the `<w:r>` element from the original and reusing it.

Example - Changing "30 days" to "60 days" in a sentence:
```python
# BAD - Replaces entire sentence
'<w:del><w:r><w:delText>The term is 30 days.</w:delText></w:r></w:del><w:ins><w:r><w:t>The term is 60 days.</w:t></w:r></w:ins>'

# GOOD - Only marks what changed, preserves original <w:r> for unchanged text
'<w:r w:rsidR="00AB12CD"><w:t>The term is </w:t></w:r><w:del><w:r><w:delText>30</w:delText></w:r></w:del><w:ins><w:r><w:t>60</w:t></w:r></w:ins><w:r w:rsidR="00AB12CD"><w:t> days.</w:t></w:r>'
```

### Tracked changes workflow

1. **Get markdown representation**: Convert document to markdown with tracked changes preserved:
   ```bash
   pandoc --track-changes=all path-to-file.docx -o current.md
   ```

2. **Identify and group changes**: Review the document and identify ALL changes needed, organizing them into logical batches:

   **Location methods** (for finding changes in XML):
   - Section/heading numbers (e.g., "Section 3.2", "Article IV")
   - Paragraph identifiers if numbered
   - Grep patterns with unique surrounding text
   - Document structure (e.g., "first paragraph", "signature block")
   - **DO NOT use markdown line numbers** - they don't map to XML structure

   **Batch organization** (group 3-10 related changes per batch):
   - By section: "Batch 1: Section 2 amendments", "Batch 2: Section 5 updates"
   - By type: "Batch 1: Date corrections", "Batch 2: Party name changes"
   - By complexity: Start with simple text replacements, then tackle complex structural changes
   - Sequential: "Batch 1: Pages 1-3", "Batch 2: Pages 4-6"

3. **Read documentation and unpack**:
   - **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Pay special attention to the "Document Library" and "Tracked Change Patterns" sections.
   - **Unpack the document**: `python ooxml/scripts/unpack.py <file.docx> <dir>`
   - **Note the suggested RSID**: The unpack script will suggest an RSID to use for your tracked changes. Copy this RSID for use in step 4b.

4. **Implement changes in batches**: Group changes logically (by section, by type, or by proximity) and implement them together in a single script. This approach:
   - Makes debugging easier (smaller batch = easier to isolate errors)
   - Allows incremental progress
   - Maintains efficiency (batch size of 3-10 changes works well)

   **Suggested batch groupings:**
   - By document section (e.g., "Section 3 changes", "Definitions", "Termination clause")
   - By change type (e.g., "Date changes", "Party name updates", "Legal term replacements")
   - By proximity (e.g., "Changes on pages 1-3", "Changes in first half of document")

   For each batch of related changes:

   **a. Map text to XML**: Grep for text in `word/document.xml` to verify how text is split across `<w:r>` elements.

   **b. Create and run script**: Use `get_node` to find nodes, implement changes, then `doc.save()`. See **"Document Library"** section in ooxml.md for patterns.

   **Note**: Always grep `word/document.xml` immediately before writing a script to get current line numbers and verify text content. Line numbers change after each script run.

5. **Pack the document**: After all batches are complete, convert the unpacked directory back to .docx:
   ```bash
   python ooxml/scripts/pack.py unpacked reviewed-document.docx
   ```

6. **Final verification**: Do a comprehensive check of the complete document:
   - Convert final document to markdown:
     ```bash
     pandoc --track-changes=all reviewed-document.docx -o verification.md
     ```
   - Verify ALL changes were applied correctly:
     ```bash
     grep "original phrase" verification.md  # Should NOT find it
     grep "replacement phrase" verification.md  # Should find it
     ```
   - Check that no unintended changes were introduced

## Converting Documents to Images

To visually analyze Word documents, convert them to images using a two-step process:

1. **Convert DOCX to PDF**:
   ```bash
   soffice --headless --convert-to pdf document.docx
   ```

2. **Convert PDF pages to JPEG images**:
   ```bash
   pdftoppm -jpeg -r 150 document.pdf page
   ```
   This creates files like `page-1.jpg`, `page-2.jpg`, etc.

Options:
- `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance)
- `-jpeg`: Output JPEG format (use `-png` for PNG if preferred)
- `-f N`: First page to convert (e.g., `-f 2` starts from page 2)
- `-l N`: Last page to convert (e.g., `-l 5` stops at page 5)
- `page`: Prefix for output files

Example for specific range:
```bash
pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page  # Converts only pages 2-5
```

## Code Style Guidelines
**IMPORTANT**: When generating code for DOCX operations:
- Write concise code
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements

## Dependencies

Required dependencies (install if not available):

- **pandoc**: `sudo apt-get install pandoc` (for text extraction)
- **docx**: `npm install -g docx` (for creating new documents)
- **LibreOffice**: `sudo apt-get install libreoffice` (for PDF conversion)
- **Poppler**: `sudo apt-get install poppler-utils` (for pdftoppm to convert PDF to images)
- **defusedxml**: `pip install defusedxml` (for secure XML parsing)
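
Before starting a workflow, it can help to confirm which of the dependencies listed above are actually present. The following is a minimal sketch, not part of the skill's own scripts: the helper name `missing_dependencies` is hypothetical, and it assumes the command-line tools are exposed on `PATH` as `pandoc`, `soffice`, and `pdftoppm` (the names used in the commands above). It does not check the globally installed `docx` npm package.

```python
import shutil
from importlib.util import find_spec

# Executable names assumed from the commands used in this document.
CLI_TOOLS = ["pandoc", "soffice", "pdftoppm"]
# Python modules named in the dependency list.
PY_MODULES = ["defusedxml"]


def missing_dependencies(tools=CLI_TOOLS, modules=PY_MODULES):
    """Return the tools not found on PATH plus the modules that cannot be located."""
    missing = [t for t in tools if shutil.which(t) is None]
    missing += [m for m in modules if find_spec(m) is None]
    return missing
```

Calling `missing_dependencies()` returns an empty list when everything checked is available; any names it returns can be installed with the matching command from the list above.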